train.lgbm¶
LightGBM multiclass trainer with class-weight support and evaluation.
Overview¶
Trains a LightGBM gradient-boosted tree model for 8-class task classification. The module handles categorical encoding, feature extraction, class-imbalance weighting, and validation-set evaluation in a single pipeline:
Categorical columns are label-encoded to integers so LightGBM can use
them as native categoricals. In the current default schema (v3),
categoricals are app_id, app_category, and domain_category;
user_id remains on persisted rows for joins/evaluation but is not part
of the default model feature vector. During training, rare categories (below
min_category_freq) and a random fraction (unknown_mask_rate) of
known categories are replaced with __unknown__ so the model learns
a meaningful embedding for unseen values. At inference time, unseen
values map to __unknown__ (or -1 for legacy encoders without it).
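The masking-and-fallback scheme described above can be sketched in plain Python (fit_category_codes and transform_category_codes are hypothetical helper names; the actual module fits a sklearn LabelEncoder per column):

```python
import random
from collections import Counter

UNKNOWN = "__unknown__"

def fit_category_codes(values, min_category_freq=5, unknown_mask_rate=0.05, seed=0):
    """Fit-time pass: replace rare values and a random fraction of known
    values with UNKNOWN, then assign each surviving value an integer code."""
    rng = random.Random(seed)
    counts = Counter(values)
    masked = [
        v if counts[v] >= min_category_freq and rng.random() >= unknown_mask_rate
        else UNKNOWN
        for v in values
    ]
    codes = {v: i for i, v in enumerate(sorted(set(masked) | {UNKNOWN}))}
    return [codes[v] for v in masked], codes

def transform_category_codes(values, codes):
    """Inference-time pass: unseen values fall back to UNKNOWN's code,
    or to -1 for a legacy code map that lacks it."""
    fallback = codes.get(UNKNOWN, -1)
    return [codes.get(v, fallback) for v in values]
```

Because "__unknown__" is guaranteed a code at fit time, unseen inference values always land on a category the model has actually trained on.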
Constants¶
FEATURE_COLUMNS¶
Ordered list of 34 feature names consumed by the model. Four are categorical (app_id, app_category, domain_category, user_id); the rest are boolean or numeric:
| # | Feature | Type |
|---|---|---|
| 0 | app_id | categorical |
| 1 | app_category | categorical |
| 2 | is_browser | boolean |
| 3 | is_editor | boolean |
| 4 | is_terminal | boolean |
| 5 | app_switch_count_last_5m | numeric |
| 6 | app_foreground_time_ratio | numeric |
| 7 | app_change_count | numeric |
| 8 | keys_per_min | numeric |
| 9 | backspace_ratio | numeric |
| 10 | shortcut_rate | numeric |
| 11 | clicks_per_min | numeric |
| 12 | scroll_events_per_min | numeric |
| 13 | mouse_distance | numeric |
| 14 | active_seconds_keyboard | numeric |
| 15 | active_seconds_mouse | numeric |
| 16 | active_seconds_any | numeric |
| 17 | max_idle_run_seconds | numeric |
| 18 | event_density | numeric |
| 19 | domain_category | categorical |
| 20 | window_title_bucket | numeric |
| 21 | title_repeat_count_session | numeric |
| 22 | keys_per_min_rolling_5 | numeric |
| 23 | keys_per_min_rolling_15 | numeric |
| 24 | mouse_distance_rolling_5 | numeric |
| 25 | mouse_distance_rolling_15 | numeric |
| 26 | keys_per_min_delta | numeric |
| 27 | clicks_per_min_delta | numeric |
| 28 | mouse_distance_delta | numeric |
| 29 | app_switch_count_last_15m | numeric |
| 30 | hour_of_day | numeric |
| 31 | day_of_week | numeric |
| 32 | session_length_so_far | numeric |
| 33 | user_id | categorical |
CATEGORICAL_COLUMNS¶
Subset of FEATURE_COLUMNS that are label-encoded to integers for
LightGBM native categorical support:
- app_id
- app_category
- domain_category
- user_id
FEATURE_COLUMNS_V2¶
Same as FEATURE_COLUMNS with user_id removed (33 features).
Used when training schema-v2 models where personalization is handled
via calibrators and per-user post-processing instead of a model feature.
FEATURE_COLUMNS_V3¶
Current default feature set. Starts from FEATURE_COLUMNS_V2 and adds
numeric-only keyed title-sketch features plus scalar title statistics.
These features increase browser-title learning signal without exporting
reversible title vocabularies into the model bundle.
CATEGORICAL_COLUMNS_V2¶
Same as CATEGORICAL_COLUMNS with user_id removed:
- app_id
- app_category
- domain_category
get_feature_columns / get_categorical_columns¶
get_feature_columns(schema_version: str) -> list[str]
get_categorical_columns(schema_version: str) -> list[str]
Dispatch helpers that return the appropriate column list for
"v1", "v2", or "v3". Raise ValueError for unknown versions.
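These dispatchers amount to a small lookup table. A minimal sketch (the column lists below are abbreviated stand-ins for the real constants, and only two versions are shown):

```python
# Abbreviated stand-ins for the real module-level constants.
FEATURE_COLUMNS = ["app_id", "app_category", "user_id"]
FEATURE_COLUMNS_V2 = [c for c in FEATURE_COLUMNS if c != "user_id"]
_TABLE = {"v1": FEATURE_COLUMNS, "v2": FEATURE_COLUMNS_V2}

def get_feature_columns(schema_version: str) -> list[str]:
    """Return the feature column list for schema_version; raise
    ValueError for anything other than a known version."""
    try:
        return _TABLE[schema_version]
    except KeyError:
        raise ValueError(f"unknown schema_version: {schema_version!r}") from None
```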
Default hyperparameters¶
| Parameter | Default | Description |
|---|---|---|
| objective | multiclass | LightGBM objective function |
| metric | multi_logloss | Evaluation metric |
| num_leaves | 31 | Maximum tree leaves (complexity control) |
| learning_rate | 0.1 | Gradient descent step size |
| num_boost_round | 100 | Boosting iterations (from core.defaults) |
| verbose | -1 | Suppress LightGBM training logs |
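Per train_lgbm's contract, extra_params is merged on top of these defaults, so caller-supplied keys win. A minimal sketch of that merge (merge_params is a hypothetical helper, and deriving num_class from the 8-class label set is an assumption):

```python
# Built-in defaults from the table above.
DEFAULT_PARAMS = {
    "objective": "multiclass",
    "metric": "multi_logloss",
    "num_leaves": 31,
    "learning_rate": 0.1,
    "verbose": -1,
}

def merge_params(extra_params=None, num_class=8):
    """Overlay caller-supplied parameters on the defaults."""
    params = {**DEFAULT_PARAMS, "num_class": num_class}
    if extra_params:
        params.update(extra_params)
    return params
```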
Functions¶
encode_categoricals¶
encode_categoricals(
df: pd.DataFrame,
cat_encoders: dict[str, LabelEncoder] | None = None,
*,
min_category_freq: int = 5,
unknown_mask_rate: float = 0.05,
random_state: int | None = None,
schema_version: str | None = None,
) -> tuple[pd.DataFrame, dict[str, LabelEncoder]]
Label-encodes the schema's categorical columns in-place. Operates in two modes:
- Fit-new (cat_encoders=None): counts value frequencies, replaces values with count below min_category_freq with "__unknown__", randomly masks unknown_mask_rate of the remaining known values to "__unknown__" (seeded by random_state), then fits a LabelEncoder per column.
- Reuse (cat_encoders provided): transforms using the existing encoders; values not in an encoder map to "__unknown__" if present, otherwise to -1 (legacy fallback).
| Parameter | Default | Description |
|---|---|---|
| min_category_freq | 5 | Minimum count for a category to keep its own code |
| unknown_mask_rate | 0.05 | Fraction of known-category rows randomly masked to __unknown__ |
| random_state | None | Seed for reproducible masking |
| schema_version | inferred | Schema version selecting which categorical columns to encode |
prepare_xy¶
prepare_xy(
    df: pd.DataFrame,
    label_encoder: LabelEncoder | None = None,
    cat_encoders: dict[str, LabelEncoder] | None = None,
    *,
    min_category_freq: int = 5,
    unknown_mask_rate: float = 0.05,
    random_state: int | None = None,
    schema_version: str | None = None,
) -> tuple[np.ndarray, np.ndarray, LabelEncoder, dict[str, LabelEncoder]]
Extracts a (X, y, label_encoder, cat_encoders) tuple from a labeled
DataFrame. Encodes categoricals, fills missing numeric values with 0,
and encodes labels against LABEL_SET_V1 (8 classes, sorted).
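The label-encoding and missing-value steps can be illustrated in plain Python (LABEL_SET is an abbreviated stand-in for LABEL_SET_V1, which has 8 classes; the real code encodes with a sklearn LabelEncoder):

```python
# Abbreviated stand-in for LABEL_SET_V1 (the real set has 8 classes).
LABEL_SET = ["coding", "meeting", "writing"]

def encode_labels(labels):
    """Encode labels against the sorted label set, mirroring a
    LabelEncoder fitted on the full set."""
    index = {name: i for i, name in enumerate(sorted(LABEL_SET))}
    return [index[label] for label in labels]

def feature_row(row, feature_names):
    """Build one row of X, filling missing numeric values with 0."""
    return [row.get(name, 0) for name in feature_names]
```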
compute_sample_weights¶
compute_sample_weights(
y: np.ndarray,
method: Literal["balanced", "none"] = "balanced",
) -> np.ndarray | None
Maps encoded labels to per-sample weights. "balanced" weights each sample by the inverse frequency of its class; "none" returns None (no weighting).
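A minimal sketch of "balanced" weighting, assuming the common sklearn-style n_samples / (n_classes * count) scaling (the module's exact normalization may differ):

```python
from collections import Counter

def balanced_sample_weights(y):
    """Weight each sample by the inverse frequency of its class so every
    class contributes roughly equally to the training loss."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return [n_samples / (n_classes * counts[label]) for label in y]
```

With this scaling, the weights of each class sum to the same total (n_samples / n_classes), which is what lets rare classes pull equal weight in the loss.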
train_lgbm¶
train_lgbm(
train_df: pd.DataFrame,
val_df: pd.DataFrame,
*,
num_boost_round: int = DEFAULT_NUM_BOOST_ROUND,
extra_params: dict[str, Any] | None = None,
class_weight: Literal["balanced", "none"] = "balanced",
min_category_freq: int = 5,
unknown_mask_rate: float = 0.05,
random_state: int | None = None,
schema_version: str | None = None,
) -> tuple[lgb.Booster, dict, pd.DataFrame, dict[str, Any], dict[str, LabelEncoder]]
Trains a LightGBM multiclass model and evaluates on the validation set. Returns a 5-tuple:
| Element | Type | Description |
|---|---|---|
| model | lgb.Booster | Trained model |
| metrics | dict | Macro/weighted F1 and per-class metrics |
| confusion_df | pd.DataFrame | Confusion matrix |
| params | dict | Merged hyperparameters (includes class_weight_method) |
| cat_encoders | dict[str, LabelEncoder] | Fitted categorical encoders |
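The metrics and confusion_df outputs can be illustrated with a plain-Python confusion count and macro F1 (a sketch only, not the module's implementation):

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count (true, predicted) pairs: the raw form of a confusion matrix."""
    return Counter(zip(y_true, y_pred))

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    pairs = confusion_counts(y_true, y_pred)
    f1s = []
    for c in classes:
        tp = pairs[(c, c)]
        fp = sum(v for (t, p), v in pairs.items() if p == c and t != c)
        fn = sum(v for (t, p), v in pairs.items() if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)
```

Macro F1 averages classes equally, so it is the headline metric for this imbalanced 8-class task; the weighted variant averages by class support instead.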
Usage¶
from taskclf.train.lgbm import train_lgbm
from taskclf.train.dataset import split_by_time
labeled_df = ... # DataFrame with the selected schema's feature columns + "label"
splits = split_by_time(labeled_df)
train_df = labeled_df.iloc[splits["train"]].reset_index(drop=True)
val_df = labeled_df.iloc[splits["val"]].reset_index(drop=True)
model, metrics, confusion_df, params, cat_encoders = train_lgbm(
train_df, val_df,
num_boost_round=100,
class_weight="balanced",
)
print(f"Macro F1: {metrics['macro_f1']:.4f}")
After training, pass model and cat_encoders to
evaluate_model for full evaluation with acceptance
checks, or to fit_calibrator_store for per-user
probability calibration.
taskclf.train.lgbm¶
get_feature_columns(schema_version)¶
Return the feature column list for schema_version.
Raises:
| Type | Description |
|---|---|
| ValueError | If schema_version is not "v1", "v2", or "v3". |
Source code in src/taskclf/train/lgbm.py
get_categorical_columns(schema_version)¶
Return the categorical column list for schema_version.
Raises:
| Type | Description |
|---|---|
| ValueError | If schema_version is not "v1", "v2", or "v3". |
Source code in src/taskclf/train/lgbm.py
encode_categoricals(df, cat_encoders=None, *, min_category_freq=5, unknown_mask_rate=0.05, random_state=None, schema_version=None)¶
Label-encode categorical columns in-place and return fitted encoders.
During training (cat_encoders is None), rare values (frequency
below min_category_freq) are replaced with "__unknown__" and a
random fraction (unknown_mask_rate) of known values are also masked
to "__unknown__" so the model learns a meaningful embedding for
unseen categories.
During inference (cat_encoders provided), values not present in
the fitted encoder are mapped to "__unknown__" if it exists in
the encoder's vocabulary, otherwise to -1 for backward
compatibility with legacy encoders.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | DataFrame with the columns listed in CATEGORICAL_COLUMNS for the selected schema. | required |
| cat_encoders | dict[str, LabelEncoder] \| None | Pre-fitted encoders keyed by column name. When None, new encoders are fitted. | None |
| min_category_freq | int | Minimum count for a value to be kept as its own category during training. Values below this threshold are replaced with "__unknown__". | 5 |
| unknown_mask_rate | float | Fraction of known-category rows to randomly mask to "__unknown__". | 0.05 |
| random_state | int \| None | Seed for the random masking (reproducibility). | None |
| schema_version | str \| None | Schema version selecting which categorical columns to encode. | None |
Returns:
| Type | Description |
|---|---|
| tuple[DataFrame, dict[str, LabelEncoder]] | The DataFrame with categorical columns replaced by integer codes, and the encoder dict. |
Source code in src/taskclf/train/lgbm.py
prepare_xy(df, label_encoder=None, cat_encoders=None, *, min_category_freq=5, unknown_mask_rate=0.05, random_state=None, schema_version=None)¶
Extract feature matrix and encoded label vector from df.
Categorical columns are label-encoded to integers so LightGBM can use them as native categoricals. Missing numeric values are filled with 0.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Labeled feature DataFrame (must contain the feature columns for the selected schema_version and a "label" column). | required |
| label_encoder | LabelEncoder \| None | Pre-fitted encoder to reuse (e.g. the one returned from the training call). If None, a new one is fitted. | None |
| cat_encoders | dict[str, LabelEncoder] \| None | Pre-fitted categorical encoders. If None, new ones are fitted. | None |
| min_category_freq | int | Forwarded to encode_categoricals. | 5 |
| unknown_mask_rate | float | Forwarded to encode_categoricals. | 0.05 |
| random_state | int \| None | Forwarded to encode_categoricals. | None |
| schema_version | str \| None | Schema version selecting the feature columns. | None |
Returns:
| Type | Description |
|---|---|
| tuple[ndarray, ndarray, LabelEncoder, dict[str, LabelEncoder]] | A (X, y, label_encoder, cat_encoders) tuple. |
Source code in src/taskclf/train/lgbm.py
compute_sample_weights(y, method='balanced')¶
Map encoded labels to per-sample weights using inverse class frequency.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| y | ndarray | Integer-encoded label array (output of prepare_xy). | required |
| method | Literal['balanced', 'none'] | Weighting strategy. | 'balanced' |
Returns:
| Type | Description |
|---|---|
| ndarray \| None | Per-sample weight array with the same length as y, or None when method is 'none'. |
Source code in src/taskclf/train/lgbm.py
train_lgbm(train_df, val_df, *, num_boost_round=DEFAULT_NUM_BOOST_ROUND, extra_params=None, class_weight='balanced', min_category_freq=5, unknown_mask_rate=0.05, random_state=None, schema_version=None)¶
Train a LightGBM multiclass model and evaluate on the val set.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| train_df | DataFrame | Training DataFrame with feature columns and a "label" column. | required |
| val_df | DataFrame | Validation DataFrame (same schema as train_df). | required |
| num_boost_round | int | Number of boosting iterations. | DEFAULT_NUM_BOOST_ROUND |
| extra_params | dict[str, Any] \| None | Additional LightGBM parameters merged on top of the built-in defaults. | None |
| class_weight | Literal['balanced', 'none'] | Strategy for handling class imbalance. | 'balanced' |
| min_category_freq | int | Minimum count for a category to keep its own code; rarer values become "__unknown__". | 5 |
| unknown_mask_rate | float | Fraction of known-category rows randomly masked to "__unknown__". | 0.05 |
| random_state | int \| None | Seed for the random unknown masking. | None |
| schema_version | str \| None | Schema version selecting the feature columns. | None |
Returns:
| Type | Description |
|---|---|
| tuple[Booster, dict, DataFrame, dict, dict[str, LabelEncoder]] | A (model, metrics, confusion_df, params, cat_encoders) tuple, where cat_encoders maps each categorical column name to its fitted LabelEncoder. |
Source code in src/taskclf/train/lgbm.py