train.lgbm

LightGBM multiclass trainer with class-weight support and evaluation.

Overview

Trains a LightGBM gradient-boosted tree model for 8-class task classification. The module handles categorical encoding, feature extraction, class-imbalance weighting, and validation-set evaluation in a single pipeline:

features_df → prepare_xy → train_lgbm → (model, metrics, confusion_df, params, cat_encoders)

Categorical columns are label-encoded to integers so LightGBM can use them as native categoricals. In the current default schema (v3), categoricals are app_id, app_category, and domain_category; user_id remains on persisted rows for joins/evaluation but is not part of the default model feature vector. During training, rare categories (below min_category_freq) and a random fraction (unknown_mask_rate) of known categories are replaced with __unknown__ so the model learns a meaningful embedding for unseen values. At inference time, unseen values map to __unknown__ (or -1 for legacy encoders without it).

Constants

FEATURE_COLUMNS

Ordered list of 34 feature names consumed by the model. Four are categorical (app_id, app_category, domain_category, user_id), three are boolean, and the rest are numeric:

| #  | Feature                    | Type        |
|----|----------------------------|-------------|
| 0  | app_id                     | categorical |
| 1  | app_category               | categorical |
| 2  | is_browser                 | boolean     |
| 3  | is_editor                  | boolean     |
| 4  | is_terminal                | boolean     |
| 5  | app_switch_count_last_5m   | numeric     |
| 6  | app_foreground_time_ratio  | numeric     |
| 7  | app_change_count           | numeric     |
| 8  | keys_per_min               | numeric     |
| 9  | backspace_ratio            | numeric     |
| 10 | shortcut_rate              | numeric     |
| 11 | clicks_per_min             | numeric     |
| 12 | scroll_events_per_min      | numeric     |
| 13 | mouse_distance             | numeric     |
| 14 | active_seconds_keyboard    | numeric     |
| 15 | active_seconds_mouse       | numeric     |
| 16 | active_seconds_any         | numeric     |
| 17 | max_idle_run_seconds       | numeric     |
| 18 | event_density              | numeric     |
| 19 | domain_category            | categorical |
| 20 | window_title_bucket        | numeric     |
| 21 | title_repeat_count_session | numeric     |
| 22 | keys_per_min_rolling_5     | numeric     |
| 23 | keys_per_min_rolling_15    | numeric     |
| 24 | mouse_distance_rolling_5   | numeric     |
| 25 | mouse_distance_rolling_15  | numeric     |
| 26 | keys_per_min_delta         | numeric     |
| 27 | clicks_per_min_delta       | numeric     |
| 28 | mouse_distance_delta       | numeric     |
| 29 | app_switch_count_last_15m  | numeric     |
| 30 | hour_of_day                | numeric     |
| 31 | day_of_week                | numeric     |
| 32 | session_length_so_far      | numeric     |
| 33 | user_id                    | categorical |

CATEGORICAL_COLUMNS

Subset of FEATURE_COLUMNS that are label-encoded to integers for LightGBM native categorical support:

  • app_id
  • app_category
  • domain_category
  • user_id

FEATURE_COLUMNS_V2

Same as FEATURE_COLUMNS with user_id removed (33 features). Used when training schema-v2 models where personalization is handled via calibrators and per-user post-processing instead of a model feature.

FEATURE_COLUMNS_V3

Current default feature set. Starts from FEATURE_COLUMNS_V2 and adds numeric-only keyed title-sketch features plus scalar title statistics. These features increase browser-title learning signal without exporting reversible title vocabularies into the model bundle.

CATEGORICAL_COLUMNS_V2

Same as CATEGORICAL_COLUMNS with user_id removed (its v3 counterpart, CATEGORICAL_COLUMNS_V3, contains the same three columns):

  • app_id
  • app_category
  • domain_category

get_feature_columns / get_categorical_columns

get_feature_columns(schema_version: str) -> list[str]
get_categorical_columns(schema_version: str) -> list[str]

Dispatch helpers that return the appropriate column list for "v1", "v2", or "v3". Raise ValueError for unknown versions.

Default hyperparameters

| Parameter | Default | Description |
|---|---|---|
| objective | multiclass | LightGBM objective function |
| metric | multi_logloss | Evaluation metric |
| num_leaves | 31 | Maximum tree leaves (complexity control) |
| learning_rate | 0.1 | Shrinkage rate applied to each tree's contribution |
| num_boost_round | 100 | Boosting iterations (from core.defaults) |
| verbose | -1 | Suppress LightGBM training logs |

Functions

encode_categoricals

encode_categoricals(
    df: pd.DataFrame,
    cat_encoders: dict[str, LabelEncoder] | None = None,
    *,
    min_category_freq: int = 5,
    unknown_mask_rate: float = 0.05,
    random_state: int | None = None,
    schema_version: str | None = None,
) -> tuple[pd.DataFrame, dict[str, LabelEncoder]]

Label-encodes the selected schema's categorical columns in-place. Operates in two modes:

  • Fit-new (cat_encoders=None): counts value frequencies, replaces values whose count falls below min_category_freq with "__unknown__", randomly masks an unknown_mask_rate fraction of the remaining known values to "__unknown__" (seeded by random_state), then fits a LabelEncoder per column.
  • Reuse (cat_encoders provided): transforms using the existing encoders; values not in an encoder map to "__unknown__" if present, otherwise to -1 (legacy fallback).
| Parameter | Default | Description |
|---|---|---|
| min_category_freq | 5 | Minimum count for a category to keep its own code |
| unknown_mask_rate | 0.05 | Fraction of known-category rows randomly masked to __unknown__ |
| random_state | None | Seed for reproducible masking |
| schema_version | None | Schema version selecting which categorical columns to encode (inferred from df when None) |
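The fit-new path can be sketched in a few lines of pandas and scikit-learn. This is a simplified stand-alone illustration of the rare-value and masking logic for a single column, not the module's implementation (fit_encode and UNKNOWN are illustrative names):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

UNKNOWN = "__unknown__"

def fit_encode(values, min_category_freq=2, unknown_mask_rate=0.0, random_state=None):
    """Replace rare values with UNKNOWN, optionally mask a fraction of
    known values, then fit a LabelEncoder (mirrors the fit-new mode)."""
    vals = pd.Series(values, dtype=str)
    freq = vals.value_counts()
    # Rare values collapse into the shared unknown token.
    vals[vals.isin(freq[freq < min_category_freq].index)] = UNKNOWN
    if unknown_mask_rate > 0:
        rng = np.random.RandomState(random_state)
        known_idx = vals.index[vals != UNKNOWN]
        n_mask = int(round(len(known_idx) * unknown_mask_rate))
        if n_mask > 0:
            # Mask a random sample of known rows for robustness.
            vals[rng.choice(known_idx, size=n_mask, replace=False)] = UNKNOWN
    le = LabelEncoder()
    return le.fit_transform(vals), le

codes, le = fit_encode(["chrome", "chrome", "vim", "chrome"], min_category_freq=2)
# "vim" appears only once, so it collapses to "__unknown__"
```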

prepare_xy

prepare_xy(
    df: pd.DataFrame,
    label_encoder: LabelEncoder | None = None,
    cat_encoders: dict[str, LabelEncoder] | None = None,
    *,
    min_category_freq: int = 5,
    unknown_mask_rate: float = 0.05,
    random_state: int | None = None,
    schema_version: str | None = None,
) -> tuple[np.ndarray, np.ndarray, LabelEncoder, dict[str, LabelEncoder]]

Extracts a (X, y, label_encoder, cat_encoders) tuple from a labeled DataFrame. Encodes categoricals, fills missing numeric values with 0, and encodes labels against LABEL_SET_V1 (8 classes, sorted).
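Fitting the label encoder on the full canonical label set rather than on the labels observed in df keeps the class-to-index mapping stable across splits, even when a split is missing a class. A minimal sketch with a hypothetical three-label stand-in (the real LABEL_SET_V1 has 8 classes and is not reproduced here):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for LABEL_SET_V1.
LABEL_SET = {"coding", "browsing", "writing"}

le = LabelEncoder()
le.fit(sorted(LABEL_SET))  # fit on the full canonical set, not on df["label"]

# Even if a split lacks "writing", indices stay consistent with other splits.
y = le.transform(["coding", "coding", "browsing"])
```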

compute_sample_weights

compute_sample_weights(
    y: np.ndarray,
    method: Literal["balanced", "none"] = "balanced",
) -> np.ndarray | None

Maps encoded labels to per-sample weights. "balanced" uses inverse class frequency:

weight = n_samples / (n_classes * count_per_class)

"none" returns None (no weighting).
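The balanced formula can be reproduced with numpy in a few lines. A self-contained sketch of the same computation (balanced_weights is an illustrative name, not the module's function):

```python
import numpy as np

def balanced_weights(y: np.ndarray) -> np.ndarray:
    """weight = n_samples / (n_classes * count_per_class), indexed per sample."""
    n_samples = len(y)
    n_classes = int(y.max()) + 1
    counts = np.bincount(y, minlength=n_classes).astype(np.float64)
    counts[counts == 0] = 1.0  # guard against absent classes
    class_weights = n_samples / (n_classes * counts)
    return class_weights[y]  # broadcast class weight onto each sample

y = np.array([0, 0, 0, 1])  # 3:1 imbalance over 2 classes
w = balanced_weights(y)
# class 0: 4 / (2 * 3) ≈ 0.667; class 1: 4 / (2 * 1) = 2.0
```

Minority-class samples get proportionally larger weights, so each class contributes equally to the loss in aggregate.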

train_lgbm

train_lgbm(
    train_df: pd.DataFrame,
    val_df: pd.DataFrame,
    *,
    num_boost_round: int = DEFAULT_NUM_BOOST_ROUND,
    extra_params: dict[str, Any] | None = None,
    class_weight: Literal["balanced", "none"] = "balanced",
    min_category_freq: int = 5,
    unknown_mask_rate: float = 0.05,
    random_state: int | None = None,
    schema_version: str | None = None,
) -> tuple[lgb.Booster, dict, pd.DataFrame, dict[str, Any], dict[str, LabelEncoder]]

Trains a LightGBM multiclass model and evaluates on the validation set. Returns a 5-tuple:

| Element | Type | Description |
|---|---|---|
| model | lgb.Booster | Trained model |
| metrics | dict | Macro/weighted F1 and per-class metrics |
| confusion_df | pd.DataFrame | Confusion matrix |
| params | dict | Merged hyperparameters (includes class_weight_method) |
| cat_encoders | dict[str, LabelEncoder] | Fitted categorical encoders |

Usage

from taskclf.train.lgbm import train_lgbm
from taskclf.train.dataset import split_by_time

labeled_df = ...  # DataFrame with the selected schema's feature columns + "label"
splits = split_by_time(labeled_df)
train_df = labeled_df.iloc[splits["train"]].reset_index(drop=True)
val_df = labeled_df.iloc[splits["val"]].reset_index(drop=True)

model, metrics, confusion_df, params, cat_encoders = train_lgbm(
    train_df, val_df,
    num_boost_round=100,
    class_weight="balanced",
)
print(f"Macro F1: {metrics['macro_f1']:.4f}")

After training, pass model and cat_encoders to evaluate_model for full evaluation with acceptance checks, or to fit_calibrator_store for per-user probability calibration.
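At inference time, the fitted encoders are applied with the fallback rule described in the overview: an unseen value maps to the __unknown__ code, or to -1 when a legacy encoder lacks that token. A minimal self-contained sketch of that fallback, assuming scikit-learn (encode_with_fallback is illustrative, not the module's API):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

UNKNOWN = "__unknown__"

# Encoder fitted at training time, with the unknown token in its vocabulary.
le = LabelEncoder().fit(["chrome", "vim", UNKNOWN])

def encode_with_fallback(series: pd.Series, le: LabelEncoder) -> pd.Series:
    """Map unseen values to the __unknown__ code when the encoder has it,
    else -1 (mirrors the reuse mode of encode_categoricals)."""
    known = set(le.classes_)
    has_unknown = UNKNOWN in known
    def one(v: str) -> int:
        if v in known:
            return int(le.transform([v])[0])
        return int(le.transform([UNKNOWN])[0]) if has_unknown else -1
    return series.astype(str).map(one)

codes = encode_with_fallback(pd.Series(["chrome", "emacs"]), le)
# "emacs" was never seen, so it receives the __unknown__ code
```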

taskclf.train.lgbm

LightGBM multiclass trainer with class-weight support and evaluation.

get_feature_columns(schema_version)

Return the feature column list for schema_version.

Raises:

| Type | Description |
|---|---|
| ValueError | If schema_version is not "v1", "v2", or "v3". |

Source code in src/taskclf/train/lgbm.py
def get_feature_columns(schema_version: str) -> list[str]:
    """Return the feature column list for *schema_version*.

    Raises:
        ValueError: If *schema_version* is not ``"v1"``, ``"v2"``, or ``"v3"``.
    """
    if schema_version == "v1":
        return list(FEATURE_COLUMNS)
    if schema_version == "v2":
        return list(FEATURE_COLUMNS_V2)
    if schema_version == "v3":
        return list(FEATURE_COLUMNS_V3)
    raise ValueError(f"Unknown schema version: {schema_version!r}")

get_categorical_columns(schema_version)

Return the categorical column list for schema_version.

Raises:

| Type | Description |
|---|---|
| ValueError | If schema_version is not "v1", "v2", or "v3". |

Source code in src/taskclf/train/lgbm.py
def get_categorical_columns(schema_version: str) -> list[str]:
    """Return the categorical column list for *schema_version*.

    Raises:
        ValueError: If *schema_version* is not ``"v1"``, ``"v2"``, or ``"v3"``.
    """
    if schema_version == "v1":
        return list(CATEGORICAL_COLUMNS)
    if schema_version == "v2":
        return list(CATEGORICAL_COLUMNS_V2)
    if schema_version == "v3":
        return list(CATEGORICAL_COLUMNS_V3)
    raise ValueError(f"Unknown schema version: {schema_version!r}")

encode_categoricals(df, cat_encoders=None, *, min_category_freq=5, unknown_mask_rate=0.05, random_state=None, schema_version=None)

Label-encode categorical columns in-place and return fitted encoders.

During training (cat_encoders is None), rare values (frequency below min_category_freq) are replaced with "__unknown__" and a random fraction (unknown_mask_rate) of known values are also masked to "__unknown__" so the model learns a meaningful embedding for unseen categories.

During inference (cat_encoders provided), values not present in the fitted encoder are mapped to "__unknown__" if it exists in the encoder's vocabulary, otherwise to -1 for backward compatibility with legacy encoders.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | DataFrame with the columns listed in CATEGORICAL_COLUMNS. | required |
| cat_encoders | dict[str, LabelEncoder] \| None | Pre-fitted encoders keyed by column name. When None, new encoders are fitted on the data. | None |
| min_category_freq | int | Minimum count for a value to be kept as its own category during training. Values below this threshold are replaced with "__unknown__". | 5 |
| unknown_mask_rate | float | Fraction of known-category rows to randomly mask to "__unknown__" during training (for robustness). | 0.05 |
| random_state | int \| None | Seed for the random masking (reproducibility). | None |
| schema_version | str \| None | "v1", "v2", or "v3". Selects which categorical columns to encode. | None |

Returns:

| Type | Description |
|---|---|
| tuple[DataFrame, dict[str, LabelEncoder]] | (encoded_df, cat_encoders): the DataFrame with categorical columns replaced by integer codes, and the encoder dict. |

Source code in src/taskclf/train/lgbm.py
def encode_categoricals(
    df: pd.DataFrame,
    cat_encoders: dict[str, LabelEncoder] | None = None,
    *,
    min_category_freq: int = 5,
    unknown_mask_rate: float = 0.05,
    random_state: int | None = None,
    schema_version: str | None = None,
) -> tuple[pd.DataFrame, dict[str, LabelEncoder]]:
    """Label-encode categorical columns in-place and return fitted encoders.

    During training (``cat_encoders is None``), rare values (frequency
    below *min_category_freq*) are replaced with ``"__unknown__"`` and a
    random fraction (*unknown_mask_rate*) of known values are also masked
    to ``"__unknown__"`` so the model learns a meaningful embedding for
    unseen categories.

    During inference (``cat_encoders`` provided), values not present in
    the fitted encoder are mapped to ``"__unknown__"`` if it exists in
    the encoder's vocabulary, otherwise to ``-1`` for backward
    compatibility with legacy encoders.

    Args:
        df: DataFrame with the columns listed in ``CATEGORICAL_COLUMNS``.
        cat_encoders: Pre-fitted encoders keyed by column name.  When
            ``None``, new encoders are fitted on the data.
        min_category_freq: Minimum count for a value to be kept as its
            own category during training.  Values below this threshold
            are replaced with ``"__unknown__"``.
        unknown_mask_rate: Fraction of *known*-category rows to randomly
            mask to ``"__unknown__"`` during training (for robustness).
        random_state: Seed for the random masking (reproducibility).
        schema_version: ``"v1"``, ``"v2"``, or ``"v3"``.  Selects which categorical
            columns to encode.

    Returns:
        ``(encoded_df, cat_encoders)`` -- the DataFrame with categorical
        columns replaced by integer codes, and the encoder dict.
    """
    resolved_schema_version = _resolve_schema_version(df, schema_version)
    cat_cols = get_categorical_columns(resolved_schema_version)
    df = df.copy()
    if cat_encoders is None:
        rng = np.random.RandomState(random_state)
        cat_encoders = {}
        for col in cat_cols:
            vals = df[col].astype(str)
            freq = vals.value_counts()
            rare_mask = vals.isin(freq[freq < min_category_freq].index)
            vals = vals.copy()
            vals[rare_mask] = _UNKNOWN_TOKEN
            if unknown_mask_rate > 0:
                known_mask = vals != _UNKNOWN_TOKEN
                n_known = known_mask.sum()
                n_mask = int(round(n_known * unknown_mask_rate))
                if n_mask > 0:
                    mask_idx = rng.choice(
                        vals.index[known_mask], size=n_mask, replace=False
                    )
                    vals.iloc[vals.index.get_indexer(pd.Index(mask_idx))] = (
                        _UNKNOWN_TOKEN
                    )
            le = LabelEncoder()
            df[col] = le.fit_transform(vals)
            cat_encoders[col] = le
    else:
        for col in cat_cols:
            le = cat_encoders[col]
            known = set(le.classes_)
            has_unknown = _UNKNOWN_TOKEN in known

            def _encode(
                v: str, _k: set = known, _le: LabelEncoder = le, _hu: bool = has_unknown
            ) -> int:
                if v in _k:
                    return int(_le.transform([v])[0])
                if _hu:
                    return int(_le.transform([_UNKNOWN_TOKEN])[0])
                return -1

            df[col] = df[col].astype(str).apply(_encode)
    return df, cat_encoders

prepare_xy(df, label_encoder=None, cat_encoders=None, *, min_category_freq=5, unknown_mask_rate=0.05, random_state=None, schema_version=None)

Extract feature matrix and encoded label vector from df.

Categorical columns are label-encoded to integers so LightGBM can use them as native categoricals. Missing numeric values are filled with 0.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Labeled feature DataFrame (must contain the feature columns for the selected schema_version and a label column). | required |
| label_encoder | LabelEncoder \| None | Pre-fitted encoder to reuse (e.g. the one returned from the training call). If None, a new encoder is fitted on the canonical LABEL_SET_V1. | None |
| cat_encoders | dict[str, LabelEncoder] \| None | Pre-fitted categorical encoders. If None, new ones are fitted from df. | None |
| min_category_freq | int | Forwarded to encode_categoricals. | 5 |
| unknown_mask_rate | float | Forwarded to encode_categoricals. | 0.05 |
| random_state | int \| None | Forwarded to encode_categoricals. | None |
| schema_version | str \| None | "v1", "v2", or "v3". | None |

Returns:

| Type | Description |
|---|---|
| tuple[ndarray, ndarray, LabelEncoder, dict[str, LabelEncoder]] | A (X, y, label_encoder, cat_encoders) tuple. |

Source code in src/taskclf/train/lgbm.py
def prepare_xy(
    df: pd.DataFrame,
    label_encoder: LabelEncoder | None = None,
    cat_encoders: dict[str, LabelEncoder] | None = None,
    *,
    min_category_freq: int = 5,
    unknown_mask_rate: float = 0.05,
    random_state: int | None = None,
    schema_version: str | None = None,
) -> tuple[np.ndarray, np.ndarray, LabelEncoder, dict[str, LabelEncoder]]:
    """Extract feature matrix and encoded label vector from *df*.

    Categorical columns are label-encoded to integers so LightGBM can
    use them as native categoricals.  Missing numeric values are filled
    with 0.

    Args:
        df: Labeled feature DataFrame (must contain the feature columns
            for the selected *schema_version* and a ``label`` column).
        label_encoder: Pre-fitted encoder to reuse (e.g. the one returned
            from the training call).  If ``None``, a new encoder is fitted
            on the canonical ``LABEL_SET_V1``.
        cat_encoders: Pre-fitted categorical encoders.  If ``None``, new
            ones are fitted from *df*.
        min_category_freq: Forwarded to :func:`encode_categoricals`.
        unknown_mask_rate: Forwarded to :func:`encode_categoricals`.
        random_state: Forwarded to :func:`encode_categoricals`.
        schema_version: ``"v1"``, ``"v2"``, or ``"v3"``.

    Returns:
        A ``(X, y, label_encoder, cat_encoders)`` tuple.
    """
    resolved_schema_version = _resolve_schema_version(df, schema_version)
    feat_cols = get_feature_columns(resolved_schema_version)
    feat_df = df[feat_cols].copy()
    feat_df, cat_encoders = encode_categoricals(
        feat_df,
        cat_encoders,
        min_category_freq=min_category_freq,
        unknown_mask_rate=unknown_mask_rate,
        random_state=random_state,
        schema_version=resolved_schema_version,
    )
    x = feat_df.fillna(0).to_numpy(dtype=np.float64)

    if label_encoder is None:
        label_encoder = LabelEncoder()
        label_encoder.fit(sorted(LABEL_SET_V1))

    y = label_encoder.transform(df["label"].values)
    return x, y, label_encoder, cat_encoders

compute_sample_weights(y, method='balanced')

Map encoded labels to per-sample weights using inverse class frequency.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| y | ndarray | Integer-encoded label array (output of LabelEncoder.transform). | required |
| method | Literal["balanced", "none"] | "balanced" computes n_samples / (n_classes * count_per_class) and maps each sample to its class weight. "none" returns None. | "balanced" |

Returns:

| Type | Description |
|---|---|
| ndarray \| None | Per-sample weight array with the same length as y, or None when method is "none". |

Source code in src/taskclf/train/lgbm.py
def compute_sample_weights(
    y: np.ndarray,
    method: Literal["balanced", "none"] = "balanced",
) -> np.ndarray | None:
    """Map encoded labels to per-sample weights using inverse class frequency.

    Args:
        y: Integer-encoded label array (output of ``LabelEncoder.transform``).
        method: ``"balanced"`` computes ``n_samples / (n_classes * count_per_class)``
            and maps each sample to its class weight.  ``"none"`` returns ``None``.

    Returns:
        Per-sample weight array with the same length as *y*, or ``None``
        when *method* is ``"none"``.
    """
    if method == "none":
        return None
    n_samples = len(y)
    n_classes = int(y.max()) + 1
    counts = np.bincount(y, minlength=n_classes).astype(np.float64)
    counts[counts == 0] = 1.0
    class_weights = n_samples / (n_classes * counts)
    return class_weights[y]

train_lgbm(train_df, val_df, *, num_boost_round=DEFAULT_NUM_BOOST_ROUND, extra_params=None, class_weight='balanced', min_category_freq=5, unknown_mask_rate=0.05, random_state=None, schema_version=None)

Train a LightGBM multiclass model and evaluate on the val set.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| train_df | DataFrame | Training DataFrame with feature columns and a label column. | required |
| val_df | DataFrame | Validation DataFrame (same schema as train_df). | required |
| num_boost_round | int | Number of boosting iterations. | DEFAULT_NUM_BOOST_ROUND |
| extra_params | dict[str, Any] \| None | Additional LightGBM parameters merged on top of the built-in defaults. | None |
| class_weight | Literal["balanced", "none"] | Strategy for handling class imbalance. "balanced" uses inverse-frequency sample weights; "none" disables weighting. | "balanced" |
| min_category_freq | int | Minimum count for a category to keep its own code; rarer values become __unknown__. | 5 |
| unknown_mask_rate | float | Fraction of known-category rows randomly masked to __unknown__ during training. | 0.05 |
| random_state | int \| None | Seed for the random unknown masking. | None |
| schema_version | str \| None | "v1", "v2", or "v3". | None |

Returns:

| Type | Description |
|---|---|
| tuple[Booster, dict, DataFrame, dict[str, Any], dict[str, LabelEncoder]] | A (model, metrics, confusion_df, params, cat_encoders) tuple where cat_encoders maps each categorical column name to its fitted LabelEncoder. |

Source code in src/taskclf/train/lgbm.py
def train_lgbm(
    train_df: pd.DataFrame,
    val_df: pd.DataFrame,
    *,
    num_boost_round: int = DEFAULT_NUM_BOOST_ROUND,
    extra_params: dict[str, Any] | None = None,
    class_weight: Literal["balanced", "none"] = "balanced",
    min_category_freq: int = 5,
    unknown_mask_rate: float = 0.05,
    random_state: int | None = None,
    schema_version: str | None = None,
) -> tuple[lgb.Booster, dict, pd.DataFrame, dict[str, Any], dict[str, LabelEncoder]]:
    """Train a LightGBM multiclass model and evaluate on the val set.

    Args:
        train_df: Training DataFrame with feature columns and a ``label``
            column.
        val_df: Validation DataFrame (same schema as *train_df*).
        num_boost_round: Number of boosting iterations.
        extra_params: Additional LightGBM parameters merged on top of the
            built-in defaults.
        class_weight: Strategy for handling class imbalance.
            ``"balanced"`` uses inverse-frequency sample weights;
            ``"none"`` disables weighting.
        min_category_freq: Minimum count for a category to keep its own
            code; rarer values become ``__unknown__``.
        unknown_mask_rate: Fraction of known-category rows randomly
            masked to ``__unknown__`` during training.
        random_state: Seed for the random unknown masking.
        schema_version: ``"v1"``, ``"v2"``, or ``"v3"``.

    Returns:
        A ``(model, metrics, confusion_df, params, cat_encoders)`` tuple
        where *cat_encoders* maps each categorical column name to its
        fitted ``LabelEncoder``.
    """
    resolved_schema_version = _resolve_schema_version(train_df, schema_version)
    feat_cols = get_feature_columns(resolved_schema_version)
    x_train, y_train, le, cat_encoders = prepare_xy(
        train_df,
        min_category_freq=min_category_freq,
        unknown_mask_rate=unknown_mask_rate,
        random_state=random_state,
        schema_version=resolved_schema_version,
    )
    x_val, y_val, _, _ = prepare_xy(
        val_df,
        label_encoder=le,
        cat_encoders=cat_encoders,
        schema_version=resolved_schema_version,
    )

    params = {**_DEFAULT_PARAMS, "num_class": len(le.classes_)}
    if extra_params:
        params.update(extra_params)

    cat_indices = _categorical_feature_indices(resolved_schema_version)
    sample_weights = compute_sample_weights(y_train, method=class_weight)

    train_ds = lgb.Dataset(
        x_train,
        label=y_train,
        weight=sample_weights,
        feature_name=feat_cols,
        categorical_feature=cat_indices,
        free_raw_data=False,
    )
    val_ds = lgb.Dataset(x_val, label=y_val, reference=train_ds, free_raw_data=False)

    model = lgb.train(
        params,
        train_ds,
        num_boost_round=num_boost_round,
        valid_sets=[val_ds],
        valid_names=["val"],
    )

    y_pred_idx = model.predict(x_val).argmax(axis=1)  # type: ignore[union-attr]
    y_pred_labels = le.inverse_transform(y_pred_idx)
    y_true_labels = le.inverse_transform(y_val)

    label_names = list(le.classes_)
    metrics = compute_metrics(y_true_labels, y_pred_labels, label_names)
    cm_df = confusion_matrix_df(y_true_labels, y_pred_labels, label_names)

    params["class_weight_method"] = class_weight
    params["unknown_category_freq_threshold"] = min_category_freq
    params["unknown_category_mask_rate"] = unknown_mask_rate
    return model, metrics, cm_df, params, cat_encoders