train.lgbm

LightGBM multiclass trainer with class-weight support and evaluation.

Overview

Trains a LightGBM gradient-boosted tree model for 8-class task classification. The module handles categorical encoding, feature extraction, class-imbalance weighting, and validation-set evaluation in a single pipeline:

features_df → prepare_xy → train_lgbm → (model, metrics, confusion_df, params, cat_encoders)

Categorical columns are label-encoded to integers so LightGBM can use them as native categoricals. In the current default schema (v3), categoricals are app_id, app_category, and domain_category; user_id remains on persisted rows for joins/evaluation but is not part of the default model feature vector. During training, rare categories (below min_category_freq) and a random fraction (unknown_mask_rate) of known categories are replaced with __unknown__ so the model learns a meaningful embedding for unseen values. At inference time, unseen values map to __unknown__ (or -1 for legacy encoders without it).

Constants

FEATURE_COLUMNS

Ordered list of 34 feature names consumed by the model. Four are categorical (app_id, app_category, domain_category, user_id), three are boolean, and the rest are numeric:

| #  | Feature                    | Type        |
|----|----------------------------|-------------|
| 0  | app_id                     | categorical |
| 1  | app_category               | categorical |
| 2  | is_browser                 | boolean     |
| 3  | is_editor                  | boolean     |
| 4  | is_terminal                | boolean     |
| 5  | app_switch_count_last_5m   | numeric     |
| 6  | app_foreground_time_ratio  | numeric     |
| 7  | app_change_count           | numeric     |
| 8  | keys_per_min               | numeric     |
| 9  | backspace_ratio            | numeric     |
| 10 | shortcut_rate              | numeric     |
| 11 | clicks_per_min             | numeric     |
| 12 | scroll_events_per_min      | numeric     |
| 13 | mouse_distance             | numeric     |
| 14 | active_seconds_keyboard    | numeric     |
| 15 | active_seconds_mouse       | numeric     |
| 16 | active_seconds_any         | numeric     |
| 17 | max_idle_run_seconds       | numeric     |
| 18 | event_density              | numeric     |
| 19 | domain_category            | categorical |
| 20 | window_title_bucket        | numeric     |
| 21 | title_repeat_count_session | numeric     |
| 22 | keys_per_min_rolling_5     | numeric     |
| 23 | keys_per_min_rolling_15    | numeric     |
| 24 | mouse_distance_rolling_5   | numeric     |
| 25 | mouse_distance_rolling_15  | numeric     |
| 26 | keys_per_min_delta         | numeric     |
| 27 | clicks_per_min_delta       | numeric     |
| 28 | mouse_distance_delta       | numeric     |
| 29 | app_switch_count_last_15m  | numeric     |
| 30 | hour_of_day                | numeric     |
| 31 | day_of_week                | numeric     |
| 32 | session_length_so_far      | numeric     |
| 33 | user_id                    | categorical |

CATEGORICAL_COLUMNS

Subset of FEATURE_COLUMNS that are label-encoded to integers for LightGBM native categorical support:

  • app_id
  • app_category
  • domain_category
  • user_id

FEATURE_COLUMNS_V2

Same as FEATURE_COLUMNS with user_id removed (33 features). Used when training schema-v2 models where personalization is handled via calibrators and per-user post-processing instead of a model feature.

FEATURE_COLUMNS_V3

Current default feature set. Starts from FEATURE_COLUMNS_V2 and adds numeric-only keyed title-sketch features plus scalar title statistics. These features increase browser-title learning signal without exporting reversible title vocabularies into the model bundle.

CATEGORICAL_COLUMNS_V2

Same as CATEGORICAL_COLUMNS with user_id removed (its v3 counterpart, CATEGORICAL_COLUMNS_V3, contains the same three columns):

  • app_id
  • app_category
  • domain_category

get_feature_columns / get_categorical_columns

get_feature_columns(schema_version: str) -> list[str]
get_categorical_columns(schema_version: str) -> list[str]

Dispatch helpers that return the appropriate column list for "v1", "v2", or "v3". Raise ValueError for unknown versions.

Default hyperparameters

| Parameter | Default | Description |
|---|---|---|
| objective | multiclass | LightGBM objective function |
| metric | multi_logloss | Evaluation metric |
| num_leaves | 31 | Maximum tree leaves (complexity control) |
| learning_rate | 0.1 | Shrinkage rate applied to each tree's contribution |
| num_boost_round | 100 | Boosting iterations (from core.defaults) |
| verbose | -1 | Suppress LightGBM training logs |

Functions

encode_categoricals

encode_categoricals(
    df: pd.DataFrame,
    cat_encoders: dict[str, LabelEncoder] | None = None,
    *,
    min_category_freq: int = 5,
    unknown_mask_rate: float = 0.05,
    random_state: int | None = None,
    schema_version: str | None = None,
) -> tuple[pd.DataFrame, dict[str, LabelEncoder]]

Label-encodes the selected schema's categorical columns in-place. Operates in two modes:

  • Fit-new (cat_encoders=None): counts value frequencies, replaces values whose count falls below min_category_freq with "__unknown__", randomly masks an unknown_mask_rate fraction of the remaining known values to "__unknown__" (seeded by random_state), then fits a LabelEncoder per column.
  • Reuse (cat_encoders provided): transforms using the existing encoders; values not in an encoder map to "__unknown__" if present, otherwise to -1 (legacy fallback).
| Parameter | Default | Description |
|---|---|---|
| min_category_freq | 5 | Minimum count for a category to keep its own code |
| unknown_mask_rate | 0.05 | Fraction of known-category rows randomly masked to __unknown__ |
| random_state | None | Seed for reproducible masking |
| schema_version | None | Schema version selecting which categorical columns to encode (inferred from df when None) |
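The fit-new path can be sketched in a few lines of pandas and scikit-learn. This is a simplified stand-alone illustration of the rare-value and masking logic for a single column, not the module's implementation (fit_encode and UNKNOWN are illustrative names):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

UNKNOWN = "__unknown__"

def fit_encode(values, min_category_freq=2, unknown_mask_rate=0.0, random_state=None):
    """Replace rare values with UNKNOWN, optionally mask a fraction of
    known values, then fit a LabelEncoder (mirrors the fit-new mode)."""
    vals = pd.Series(values, dtype=str)
    freq = vals.value_counts()
    # Rare values collapse into the shared unknown token.
    vals[vals.isin(freq[freq < min_category_freq].index)] = UNKNOWN
    if unknown_mask_rate > 0:
        rng = np.random.RandomState(random_state)
        known_idx = vals.index[vals != UNKNOWN]
        n_mask = int(round(len(known_idx) * unknown_mask_rate))
        if n_mask > 0:
            # Mask a random sample of known rows for robustness.
            vals[rng.choice(known_idx, size=n_mask, replace=False)] = UNKNOWN
    le = LabelEncoder()
    return le.fit_transform(vals), le

codes, le = fit_encode(["chrome", "chrome", "vim", "chrome"], min_category_freq=2)
# "vim" appears only once, so it collapses to "__unknown__"
```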

prepare_xy

prepare_xy(
    df: pd.DataFrame,
    label_encoder: LabelEncoder | None = None,
    cat_encoders: dict[str, LabelEncoder] | None = None,
    *,
    min_category_freq: int = 5,
    unknown_mask_rate: float = 0.05,
    random_state: int | None = None,
    schema_version: str | None = None,
) -> tuple[np.ndarray, np.ndarray, LabelEncoder, dict[str, LabelEncoder]]

Extracts a (X, y, label_encoder, cat_encoders) tuple from a labeled DataFrame. Encodes categoricals, fills missing numeric values with 0, and encodes labels against LABEL_SET_V1 (8 classes, sorted).
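Fitting the label encoder on the full canonical label set rather than on the labels observed in df keeps the class-to-index mapping stable across splits, even when a split is missing a class. A minimal sketch with a hypothetical three-label stand-in (the real LABEL_SET_V1 has 8 classes and is not reproduced here):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for LABEL_SET_V1.
LABEL_SET = {"coding", "browsing", "writing"}

le = LabelEncoder()
le.fit(sorted(LABEL_SET))  # fit on the full canonical set, not on df["label"]

# Even if a split lacks "writing", indices stay consistent with other splits.
y = le.transform(["coding", "coding", "browsing"])
```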

compute_sample_weights

compute_sample_weights(
    y: np.ndarray,
    method: Literal["balanced", "none"] = "balanced",
) -> np.ndarray | None

Maps encoded labels to per-sample weights. "balanced" uses inverse class frequency:

weight = n_samples / (n_classes * count_per_class)

"none" returns None (no weighting).
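The balanced formula can be reproduced with numpy in a few lines. A self-contained sketch of the same computation (balanced_weights is an illustrative name, not the module's function):

```python
import numpy as np

def balanced_weights(y: np.ndarray) -> np.ndarray:
    """weight = n_samples / (n_classes * count_per_class), indexed per sample."""
    n_samples = len(y)
    n_classes = int(y.max()) + 1
    counts = np.bincount(y, minlength=n_classes).astype(np.float64)
    counts[counts == 0] = 1.0  # guard against absent classes
    class_weights = n_samples / (n_classes * counts)
    return class_weights[y]  # broadcast class weight onto each sample

y = np.array([0, 0, 0, 1])  # 3:1 imbalance over 2 classes
w = balanced_weights(y)
# class 0: 4 / (2 * 3) ≈ 0.667; class 1: 4 / (2 * 1) = 2.0
```

Minority-class samples get proportionally larger weights, so each class contributes equally to the loss in aggregate.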

train_lgbm

train_lgbm(
    train_df: pd.DataFrame,
    val_df: pd.DataFrame,
    *,
    num_boost_round: int = DEFAULT_NUM_BOOST_ROUND,
    extra_params: dict[str, Any] | None = None,
    class_weight: Literal["balanced", "none"] = "balanced",
    min_category_freq: int = 5,
    unknown_mask_rate: float = 0.05,
    random_state: int | None = None,
    schema_version: str | None = None,
) -> tuple[lgb.Booster, dict, pd.DataFrame, dict[str, Any], dict[str, LabelEncoder]]

Trains a LightGBM multiclass model and evaluates on the validation set. Returns a 5-tuple:

| Element | Type | Description |
|---|---|---|
| model | lgb.Booster | Trained model |
| metrics | dict | Macro/weighted F1 and per-class metrics |
| confusion_df | pd.DataFrame | Confusion matrix |
| params | dict | Merged hyperparameters (includes class_weight_method) |
| cat_encoders | dict[str, LabelEncoder] | Fitted categorical encoders |

Usage

from taskclf.train.lgbm import train_lgbm
from taskclf.train.dataset import split_by_time

labeled_df = ...  # DataFrame with the selected schema's feature columns + "label"
splits = split_by_time(labeled_df)
train_df = labeled_df.iloc[splits["train"]].reset_index(drop=True)
val_df = labeled_df.iloc[splits["val"]].reset_index(drop=True)

model, metrics, confusion_df, params, cat_encoders = train_lgbm(
    train_df, val_df,
    num_boost_round=100,
    class_weight="balanced",
)
print(f"Macro F1: {metrics['macro_f1']:.4f}")

After training, pass model and cat_encoders to evaluate_model for full evaluation with acceptance checks, or to fit_calibrator_store for per-user probability calibration.
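At inference time, the fitted encoders are applied with the fallback rule described in the overview: an unseen value maps to the __unknown__ code, or to -1 when a legacy encoder lacks that token. A minimal self-contained sketch of that fallback, assuming scikit-learn (encode_with_fallback is illustrative, not the module's API):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

UNKNOWN = "__unknown__"

# Encoder fitted at training time, with the unknown token in its vocabulary.
le = LabelEncoder().fit(["chrome", "vim", UNKNOWN])

def encode_with_fallback(series: pd.Series, le: LabelEncoder) -> pd.Series:
    """Map unseen values to the __unknown__ code when the encoder has it,
    else -1 (mirrors the reuse mode of encode_categoricals)."""
    known = set(le.classes_)
    has_unknown = UNKNOWN in known
    def one(v: str) -> int:
        if v in known:
            return int(le.transform([v])[0])
        return int(le.transform([UNKNOWN])[0]) if has_unknown else -1
    return series.astype(str).map(one)

codes = encode_with_fallback(pd.Series(["chrome", "emacs"]), le)
# "emacs" was never seen, so it receives the __unknown__ code
```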

taskclf.train.lgbm

LightGBM multiclass trainer with class-weight support and evaluation.

get_feature_columns(schema_version)

Return the feature column list for schema_version.

Raises:

| Type | Description |
|---|---|
| ValueError | If schema_version is not "v1", "v2", or "v3". |

Source code in src/taskclf/train/lgbm.py
def get_feature_columns(schema_version: str) -> list[str]:
    """Return the feature column list for *schema_version*.

    Raises:
        ValueError: If *schema_version* is not ``"v1"``, ``"v2"``, or ``"v3"``.
    """
    if schema_version == "v1":
        return list(FEATURE_COLUMNS)
    if schema_version == "v2":
        return list(FEATURE_COLUMNS_V2)
    if schema_version == "v3":
        return list(FEATURE_COLUMNS_V3)
    raise ValueError(f"Unknown schema version: {schema_version!r}")

get_categorical_columns(schema_version)

Return the categorical column list for schema_version.

Raises:

| Type | Description |
|---|---|
| ValueError | If schema_version is not "v1", "v2", or "v3". |

Source code in src/taskclf/train/lgbm.py
def get_categorical_columns(schema_version: str) -> list[str]:
    """Return the categorical column list for *schema_version*.

    Raises:
        ValueError: If *schema_version* is not ``"v1"``, ``"v2"``, or ``"v3"``.
    """
    if schema_version == "v1":
        return list(CATEGORICAL_COLUMNS)
    if schema_version == "v2":
        return list(CATEGORICAL_COLUMNS_V2)
    if schema_version == "v3":
        return list(CATEGORICAL_COLUMNS_V3)
    raise ValueError(f"Unknown schema version: {schema_version!r}")

encode_categoricals(df, cat_encoders=None, *, min_category_freq=5, unknown_mask_rate=0.05, random_state=None, schema_version=None)

Label-encode categorical columns in-place and return fitted encoders.

During training (cat_encoders is None), rare values (frequency below min_category_freq) are replaced with "__unknown__" and a random fraction (unknown_mask_rate) of known values are also masked to "__unknown__" so the model learns a meaningful embedding for unseen categories.

During inference (cat_encoders provided), values not present in the fitted encoder are mapped to "__unknown__" if it exists in the encoder's vocabulary, otherwise to -1 for backward compatibility with legacy encoders.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | DataFrame with the columns listed in CATEGORICAL_COLUMNS. | required |
| cat_encoders | dict[str, LabelEncoder] \| None | Pre-fitted encoders keyed by column name. When None, new encoders are fitted on the data. | None |
| min_category_freq | int | Minimum count for a value to be kept as its own category during training. Values below this threshold are replaced with "__unknown__". | 5 |
| unknown_mask_rate | float | Fraction of known-category rows to randomly mask to "__unknown__" during training (for robustness). | 0.05 |
| random_state | int \| None | Seed for the random masking (reproducibility). | None |
| schema_version | str \| None | "v1", "v2", or "v3". Selects which categorical columns to encode. | None |

Returns:

| Type | Description |
|---|---|
| tuple[DataFrame, dict[str, LabelEncoder]] | (encoded_df, cat_encoders): the DataFrame with categorical columns replaced by integer codes, and the encoder dict. |

Source code in src/taskclf/train/lgbm.py
def encode_categoricals(
    df: pd.DataFrame,
    cat_encoders: dict[str, LabelEncoder] | None = None,
    *,
    min_category_freq: int = 5,
    unknown_mask_rate: float = 0.05,
    random_state: int | None = None,
    schema_version: str | None = None,
) -> tuple[pd.DataFrame, dict[str, LabelEncoder]]:
    """Label-encode categorical columns in-place and return fitted encoders.

    During training (``cat_encoders is None``), rare values (frequency
    below *min_category_freq*) are replaced with ``"__unknown__"`` and a
    random fraction (*unknown_mask_rate*) of known values are also masked
    to ``"__unknown__"`` so the model learns a meaningful embedding for
    unseen categories.

    During inference (``cat_encoders`` provided), values not present in
    the fitted encoder are mapped to ``"__unknown__"`` if it exists in
    the encoder's vocabulary, otherwise to ``-1`` for backward
    compatibility with legacy encoders.

    Args:
        df: DataFrame with the columns listed in ``CATEGORICAL_COLUMNS``.
        cat_encoders: Pre-fitted encoders keyed by column name.  When
            ``None``, new encoders are fitted on the data.
        min_category_freq: Minimum count for a value to be kept as its
            own category during training.  Values below this threshold
            are replaced with ``"__unknown__"``.
        unknown_mask_rate: Fraction of *known*-category rows to randomly
            mask to ``"__unknown__"`` during training (for robustness).
        random_state: Seed for the random masking (reproducibility).
        schema_version: ``"v1"``, ``"v2"``, or ``"v3"``.  Selects which categorical
            columns to encode.

    Returns:
        ``(encoded_df, cat_encoders)`` -- the DataFrame with categorical
        columns replaced by integer codes, and the encoder dict.
    """
    resolved_schema_version = _resolve_schema_version(df, schema_version)
    cat_cols = get_categorical_columns(resolved_schema_version)
    df = df.copy()
    if cat_encoders is None:
        rng = np.random.RandomState(random_state)
        cat_encoders = {}
        for col in cat_cols:
            vals = df[col].astype(str)
            freq = vals.value_counts()
            rare_mask = vals.isin(freq[freq < min_category_freq].index)
            vals = vals.copy()
            vals[rare_mask] = _UNKNOWN_TOKEN
            if unknown_mask_rate > 0:
                known_mask = vals != _UNKNOWN_TOKEN
                n_known = known_mask.sum()
                n_mask = int(round(n_known * unknown_mask_rate))
                if n_mask > 0:
                    mask_idx = rng.choice(
                        vals.index[known_mask], size=n_mask, replace=False
                    )
                    vals.iloc[vals.index.get_indexer(pd.Index(mask_idx))] = (
                        _UNKNOWN_TOKEN
                    )
            le = LabelEncoder()
            df[col] = le.fit_transform(vals)
            cat_encoders[col] = le
    else:
        for col in cat_cols:
            le = cat_encoders[col]
            known = set(le.classes_)
            has_unknown = _UNKNOWN_TOKEN in known

            def _encode(
                v: str, _k: set = known, _le: LabelEncoder = le, _hu: bool = has_unknown
            ) -> int:
                if v in _k:
                    return int(_le.transform([v])[0])
                if _hu:
                    return int(_le.transform([_UNKNOWN_TOKEN])[0])
                return -1

            df[col] = df[col].astype(str).apply(_encode)
    return df, cat_encoders

prepare_xy(df, label_encoder=None, cat_encoders=None, *, min_category_freq=5, unknown_mask_rate=0.05, random_state=None, schema_version=None)

Extract feature matrix and encoded label vector from df.

Categorical columns are label-encoded to integers so LightGBM can use them as native categoricals. Missing numeric values are filled with 0.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Labeled feature DataFrame (must contain the feature columns for the selected schema_version and a label column). | required |
| label_encoder | LabelEncoder \| None | Pre-fitted encoder to reuse (e.g. the one returned from the training call). If None, a new encoder is fitted on the canonical LABEL_SET_V1. | None |
| cat_encoders | dict[str, LabelEncoder] \| None | Pre-fitted categorical encoders. If None, new ones are fitted from df. | None |
| min_category_freq | int | Forwarded to encode_categoricals. | 5 |
| unknown_mask_rate | float | Forwarded to encode_categoricals. | 0.05 |
| random_state | int \| None | Forwarded to encode_categoricals. | None |
| schema_version | str \| None | "v1", "v2", or "v3". | None |

Returns:

| Type | Description |
|---|---|
| tuple[ndarray, ndarray, LabelEncoder, dict[str, LabelEncoder]] | A (X, y, label_encoder, cat_encoders) tuple. |

Source code in src/taskclf/train/lgbm.py
def prepare_xy(
    df: pd.DataFrame,
    label_encoder: LabelEncoder | None = None,
    cat_encoders: dict[str, LabelEncoder] | None = None,
    *,
    min_category_freq: int = 5,
    unknown_mask_rate: float = 0.05,
    random_state: int | None = None,
    schema_version: str | None = None,
) -> tuple[np.ndarray, np.ndarray, LabelEncoder, dict[str, LabelEncoder]]:
    """Extract feature matrix and encoded label vector from *df*.

    Categorical columns are label-encoded to integers so LightGBM can
    use them as native categoricals.  Missing numeric values are filled
    with 0.

    Args:
        df: Labeled feature DataFrame (must contain the feature columns
            for the selected *schema_version* and a ``label`` column).
        label_encoder: Pre-fitted encoder to reuse (e.g. the one returned
            from the training call).  If ``None``, a new encoder is fitted
            on the canonical ``LABEL_SET_V1``.
        cat_encoders: Pre-fitted categorical encoders.  If ``None``, new
            ones are fitted from *df*.
        min_category_freq: Forwarded to :func:`encode_categoricals`.
        unknown_mask_rate: Forwarded to :func:`encode_categoricals`.
        random_state: Forwarded to :func:`encode_categoricals`.
        schema_version: ``"v1"``, ``"v2"``, or ``"v3"``.

    Returns:
        A ``(X, y, label_encoder, cat_encoders)`` tuple.
    """
    resolved_schema_version = _resolve_schema_version(df, schema_version)
    feat_cols = get_feature_columns(resolved_schema_version)
    feat_df = df[feat_cols].copy()
    feat_df, cat_encoders = encode_categoricals(
        feat_df,
        cat_encoders,
        min_category_freq=min_category_freq,
        unknown_mask_rate=unknown_mask_rate,
        random_state=random_state,
        schema_version=resolved_schema_version,
    )
    x = feat_df.fillna(0).to_numpy(dtype=np.float64)

    if label_encoder is None:
        label_encoder = LabelEncoder()
        label_encoder.fit(sorted(LABEL_SET_V1))

    y = label_encoder.transform(df["label"].values)
    return x, y, label_encoder, cat_encoders

compute_sample_weights(y, method='balanced')

Map encoded labels to per-sample weights using inverse class frequency.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| y | ndarray | Integer-encoded label array (output of LabelEncoder.transform). | required |
| method | Literal["balanced", "none"] | "balanced" computes n_samples / (n_classes * count_per_class) and maps each sample to its class weight. "none" returns None. | "balanced" |

Returns:

| Type | Description |
|---|---|
| ndarray \| None | Per-sample weight array with the same length as y, or None when method is "none". |

Source code in src/taskclf/train/lgbm.py
def compute_sample_weights(
    y: np.ndarray,
    method: Literal["balanced", "none"] = "balanced",
) -> np.ndarray | None:
    """Map encoded labels to per-sample weights using inverse class frequency.

    Args:
        y: Integer-encoded label array (output of ``LabelEncoder.transform``).
        method: ``"balanced"`` computes ``n_samples / (n_classes * count_per_class)``
            and maps each sample to its class weight.  ``"none"`` returns ``None``.

    Returns:
        Per-sample weight array with the same length as *y*, or ``None``
        when *method* is ``"none"``.
    """
    if method == "none":
        return None
    n_samples = len(y)
    n_classes = int(y.max()) + 1
    counts = np.bincount(y, minlength=n_classes).astype(np.float64)
    counts[counts == 0] = 1.0
    class_weights = n_samples / (n_classes * counts)
    return class_weights[y]

train_lgbm(train_df, val_df, *, num_boost_round=DEFAULT_NUM_BOOST_ROUND, extra_params=None, class_weight='balanced', min_category_freq=5, unknown_mask_rate=0.05, random_state=None, schema_version=None)

Train a LightGBM multiclass model and evaluate on the val set.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| train_df | DataFrame | Training DataFrame with feature columns and a label column. | required |
| val_df | DataFrame | Validation DataFrame (same schema as train_df). | required |
| num_boost_round | int | Number of boosting iterations. | DEFAULT_NUM_BOOST_ROUND |
| extra_params | dict[str, Any] \| None | Additional LightGBM parameters merged on top of the built-in defaults. | None |
| class_weight | Literal["balanced", "none"] | Strategy for handling class imbalance. "balanced" uses inverse-frequency sample weights; "none" disables weighting. | "balanced" |
| min_category_freq | int | Minimum count for a category to keep its own code; rarer values become __unknown__. | 5 |
| unknown_mask_rate | float | Fraction of known-category rows randomly masked to __unknown__ during training. | 0.05 |
| random_state | int \| None | Seed for the random unknown masking. | None |
| schema_version | str \| None | "v1", "v2", or "v3". | None |

Returns:

| Type | Description |
|---|---|
| tuple[Booster, dict, DataFrame, dict[str, Any], dict[str, LabelEncoder]] | A (model, metrics, confusion_df, params, cat_encoders) tuple where cat_encoders maps each categorical column name to its fitted LabelEncoder. |

Source code in src/taskclf/train/lgbm.py
def train_lgbm(
    train_df: pd.DataFrame,
    val_df: pd.DataFrame,
    *,
    num_boost_round: int = DEFAULT_NUM_BOOST_ROUND,
    extra_params: dict[str, Any] | None = None,
    class_weight: Literal["balanced", "none"] = "balanced",
    min_category_freq: int = 5,
    unknown_mask_rate: float = 0.05,
    random_state: int | None = None,
    schema_version: str | None = None,
) -> tuple[lgb.Booster, dict, pd.DataFrame, dict[str, Any], dict[str, LabelEncoder]]:
    """Train a LightGBM multiclass model and evaluate on the val set.

    Args:
        train_df: Training DataFrame with feature columns and a ``label``
            column.
        val_df: Validation DataFrame (same schema as *train_df*).
        num_boost_round: Number of boosting iterations.
        extra_params: Additional LightGBM parameters merged on top of the
            built-in defaults.
        class_weight: Strategy for handling class imbalance.
            ``"balanced"`` uses inverse-frequency sample weights;
            ``"none"`` disables weighting.
        min_category_freq: Minimum count for a category to keep its own
            code; rarer values become ``__unknown__``.
        unknown_mask_rate: Fraction of known-category rows randomly
            masked to ``__unknown__`` during training.
        random_state: Seed for the random unknown masking.
        schema_version: ``"v1"``, ``"v2"``, or ``"v3"``.

    Returns:
        A ``(model, metrics, confusion_df, params, cat_encoders)`` tuple
        where *cat_encoders* maps each categorical column name to its
        fitted ``LabelEncoder``.
    """
    resolved_schema_version = _resolve_schema_version(train_df, schema_version)
    feat_cols = get_feature_columns(resolved_schema_version)
    x_train, y_train, le, cat_encoders = prepare_xy(
        train_df,
        min_category_freq=min_category_freq,
        unknown_mask_rate=unknown_mask_rate,
        random_state=random_state,
        schema_version=resolved_schema_version,
    )
    x_val, y_val, _, _ = prepare_xy(
        val_df,
        label_encoder=le,
        cat_encoders=cat_encoders,
        schema_version=resolved_schema_version,
    )

    params = {**_DEFAULT_PARAMS, "num_class": len(le.classes_)}
    if extra_params:
        params.update(extra_params)

    cat_indices = _categorical_feature_indices(resolved_schema_version)
    sample_weights = compute_sample_weights(y_train, method=class_weight)

    train_ds = lgb.Dataset(
        x_train,
        label=y_train,
        weight=sample_weights,
        feature_name=feat_cols,
        categorical_feature=cat_indices,
        free_raw_data=False,
    )
    val_ds = lgb.Dataset(x_val, label=y_val, reference=train_ds, free_raw_data=False)

    model = lgb.train(
        params,
        train_ds,
        num_boost_round=num_boost_round,
        valid_sets=[val_ds],
        valid_names=["val"],
    )

    y_pred_idx = model.predict(x_val).argmax(axis=1)  # type: ignore[union-attr]
    y_pred_labels = le.inverse_transform(y_pred_idx)
    y_true_labels = le.inverse_transform(y_val)

    label_names = list(le.classes_)
    metrics = compute_metrics(y_true_labels, y_pred_labels, label_names)
    cm_df = confusion_matrix_df(y_true_labels, y_pred_labels, label_names)

    params["class_weight_method"] = class_weight
    params["unknown_category_freq_threshold"] = min_category_freq
    params["unknown_category_mask_rate"] = unknown_mask_rate
    return model, metrics, cm_df, params, cat_encoders