train.dataset
Time-based dataset splitting utilities.
Functions
split_by_time
Three-way chronological split (train / val / test) with optional
cross-user holdout. For each non-holdout user, rows are sorted by
bucket_start_ts and split at train_ratio / val_ratio /
remainder boundaries. Holdout users have all data placed in the test
set only.
Returns a dict with "train", "val", "test" (index lists) and
"holdout_users".
taskclf.train.dataset
split_by_time(df, *, train_ratio=0.7, val_ratio=0.15, holdout_user_fraction=0.0)
Three-way chronological split with optional cross-user holdout.
For each non-holdout user the rows are sorted by bucket_start_ts
and split chronologically into train / val / test by the given ratios.
Holdout users (if any) have all their data placed in the test set to
evaluate cold-start generalization.
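The procedure described above can be sketched in plain Python. This is an illustrative re-implementation, not the taskclf source: the name `split_by_time_sketch`, the `seed` argument, the seeded random choice of holdout users, and rounding counts down are all assumptions made for the example. Rows are modeled as dicts with `user_id` and `bucket_start_ts` keys rather than as a DataFrame.

```python
import random
from collections import defaultdict


def split_by_time_sketch(rows, *, train_ratio=0.7, val_ratio=0.15,
                         holdout_user_fraction=0.0, seed=0):
    """Hypothetical sketch of a per-user chronological split."""
    # Group row positions by user.
    by_user = defaultdict(list)
    for idx, row in enumerate(rows):
        by_user[row["user_id"]].append(idx)

    # Assumption: holdout users are sampled with a seeded RNG and the
    # number of holdout users is rounded down.
    users = sorted(by_user)
    n_holdout = int(len(users) * holdout_user_fraction)
    holdout_users = sorted(random.Random(seed).sample(users, n_holdout))

    split = {"train": [], "val": [], "test": [],
             "holdout_users": holdout_users}
    for user in users:
        # Order this user's rows by time before slicing.
        idxs = sorted(by_user[user],
                      key=lambda i: rows[i]["bucket_start_ts"])
        if user in holdout_users:
            # Holdout users contribute to the test set only.
            split["test"].extend(idxs)
            continue
        n = len(idxs)
        train_end = int(n * train_ratio)           # assumption: round down
        val_end = train_end + int(n * val_ratio)   # assumption: round down
        split["train"].extend(idxs[:train_end])
        split["val"].extend(idxs[train_end:val_end])
        split["test"].extend(idxs[val_end:])       # remainder
    return split


# Two users with 20 time buckets each.
rows = [
    {"user_id": user, "bucket_start_ts": ts}
    for user in ("u1", "u2")
    for ts in range(20)
]
result = split_by_time_sketch(rows)
held = split_by_time_sketch(rows, holdout_user_fraction=0.5)
```

In this sketch each non-holdout user is split independently, so a user's training rows always precede that user's own validation and test rows. Rows from different users can still overlap in time across the three sets.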
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Labeled feature DataFrame. Must contain | required |
| train_ratio | float | Fraction of each user's chronological data for training (default 0.70). | 0.7 |
| val_ratio | float | Fraction for validation (default 0.15). The remainder goes to the test set. | 0.15 |
| holdout_user_fraction | float | Fraction of unique users to hold out entirely for the test set (default 0 = no holdout). | 0.0 |
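To see how train_ratio and val_ratio might translate into row counts for a single user, the following sketch computes slice boundaries. The helper name `split_bounds` is hypothetical, and rounding down is an assumption; the actual rounding used by split_by_time is not stated here.

```python
def split_bounds(n_rows, train_ratio=0.7, val_ratio=0.15):
    """Return (train_end, val_end) for one user's time-sorted rows.

    Rows [0, train_end) go to train, [train_end, val_end) to val,
    and [val_end, n_rows) to test.
    """
    train_end = int(n_rows * train_ratio)          # assumption: round down
    val_end = train_end + int(n_rows * val_ratio)  # assumption: round down
    return train_end, val_end


def split_sizes(n_rows, train_ratio=0.7, val_ratio=0.15):
    """Return the (train, val, test) row counts implied by split_bounds."""
    train_end, val_end = split_bounds(n_rows, train_ratio, val_ratio)
    return train_end, val_end - train_end, n_rows - val_end


large_user = split_sizes(100)  # (70, 15, 15)
small_user = split_sizes(5)    # (3, 0, 2)
```

Under this rounding assumption, a user with very few rows can end up with an empty validation slice, which is worth checking for when per-user history is short.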
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | A dict with keys "train", "val", "test" (lists of integer indices into df), and "holdout_users" (list of held-out user_id strings). |
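A quick sanity check on a result of this shape can confirm that the three index lists are disjoint and together cover every row. The helper `check_split` and the hand-written `example` dict below are hypothetical, and the check assumes the indices are positions 0 to n_rows - 1, which the description does not confirm.

```python
def check_split(result, n_rows):
    """Validate a split dict and return the size of each partition."""
    train = set(result["train"])
    val = set(result["val"])
    test = set(result["test"])

    # No row may appear in more than one partition.
    if train & val or train & test or val & test:
        raise ValueError("partitions overlap")

    # Assumption: indices are positional, so the union is range(n_rows).
    if train | val | test != set(range(n_rows)):
        raise ValueError("partitions do not cover every row")

    return {"train": len(train), "val": len(val), "test": len(test)}


# Hand-written stand-in for a returned dict (10 rows, no holdout users).
example = {
    "train": [0, 1, 2, 3, 4, 5, 6],
    "val": [7],
    "test": [8, 9],
    "holdout_users": [],
}
sizes = check_split(example, n_rows=10)
```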
Raises:
| Type | Description |
|---|---|
| ValueError | If |