train.build_dataset
Training dataset builder: join features with labels, apply exclusion
rules, split by time, and write X.parquet, y.parquet, and
splits.json.
Usage
```python
from pathlib import Path

from taskclf.train.build_dataset import build_training_dataset

# features_df: feature DataFrame; label_spans: Sequence[LabelSpan].
# Both are assumed to be loaded before this call.
manifest = build_training_dataset(
    features_df,
    label_spans,
    output_dir=Path("data/processed/training_dataset"),
    train_ratio=0.70,
    val_ratio=0.15,
    holdout_user_fraction=0.1,
)
print(manifest.total_rows, manifest.train_rows)
```
Label projection uses project_blocks_to_windows() with the strict
containment rules of time_spec.md Section 6: a window receives a label
only if it falls entirely inside a single block, and windows with
conflicting multi-block overlaps are dropped.
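The containment rule can be illustrated with a minimal sketch. This is not the code of project_blocks_to_windows(); the `Block` type and `label_for_window` helper are hypothetical, and half-open intervals in integer seconds are an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Block:
    """Hypothetical stand-in for a label block (not the taskclf type)."""
    start: int  # inclusive, seconds
    end: int    # exclusive, seconds
    label: str


def label_for_window(start: int, bucket_seconds: int, blocks: list[Block]) -> Optional[str]:
    """Return the label for [start, start + bucket_seconds), or None to drop it."""
    end = start + bucket_seconds
    # Every block that touches the window at all.
    overlapping = [b for b in blocks if b.start < end and b.end > start]
    # Blocks that contain the whole window.
    containing = [b for b in overlapping if b.start <= start and end <= b.end]
    if not containing:
        return None  # no covering label, or only partial coverage
    if len({b.label for b in overlapping}) > 1:
        return None  # conflicting multi-block overlap
    return containing[0].label
```

Under these assumptions, a window that straddles the boundary between two blocks is dropped, as is a window covered by two blocks that disagree on the label.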
Output artifacts
| File | Contents |
|---|---|
| X.parquet | Feature columns + ID columns (user_id, bucket_start_ts, session_id) + schema_version |
| y.parquet | user_id, bucket_start_ts, label, provenance |
| splits.json | Train/val/test index lists, holdout users, and metadata (schema versions, class distribution, user count) |
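A quick sanity check on the split index lists might look like the sketch below. The key names `train`, `val`, and `test` are assumptions made for illustration; the actual layout of splits.json is not specified here, and `check_splits` is a hypothetical helper.

```python
from itertools import combinations


def check_splits(splits: dict) -> dict[str, int]:
    """Verify the three index lists are pairwise disjoint; return their sizes.

    The key names "train", "val" and "test" are assumed, not confirmed.
    """
    parts = {name: set(splits[name]) for name in ("train", "val", "test")}
    for (a, rows_a), (b, rows_b) in combinations(parts.items(), 2):
        overlap = rows_a & rows_b
        if overlap:
            raise ValueError(f"{a} and {b} share {len(overlap)} row indices")
    return {name: len(rows) for name, rows in parts.items()}
```

With the file on disk, the input would come from `json.loads(Path("splits.json").read_text())`, adjusted to whatever keys the file really uses.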
Exclusion rules
Windows are dropped from the dataset if:
- They overlap multiple label blocks with conflicting labels or have no covering label.
- All numeric features are null (no useful signal).
- They belong to sessions shorter than MIN_BLOCK_DURATION_SECONDS (180 s = 3 buckets).
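The three rules above can be sketched as a row filter. This is an illustration only, not the library's implementation: rows are plain dicts, `keep_window` and `filter_windows` are hypothetical names, a 60-second bucket is inferred from "180 s = 3 buckets", and session length is assumed to be the number of input rows sharing a session_id.

```python
import math
from collections import Counter

MIN_BLOCK_DURATION_SECONDS = 180  # value quoted in the rules above
BUCKET_SECONDS = 60               # assumed from "180 s = 3 buckets"


def _is_null(value) -> bool:
    """Treat None and float NaN as null."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def keep_window(row: dict, numeric_cols: list[str], session_sizes: dict[str, int]) -> bool:
    if row.get("label") is None:
        return False  # no covering label, or dropped for conflicting labels
    if all(_is_null(row.get(col)) for col in numeric_cols):
        return False  # every numeric feature is null
    min_buckets = MIN_BLOCK_DURATION_SECONDS // BUCKET_SECONDS
    if session_sizes[row["session_id"]] < min_buckets:
        return False  # session too short
    return True


def filter_windows(rows: list[dict], numeric_cols: list[str]) -> list[dict]:
    # Assumption: session length is counted over all input rows,
    # before any other exclusion is applied.
    session_sizes = Counter(row["session_id"] for row in rows)
    return [row for row in rows if keep_window(row, numeric_cols, session_sizes)]
```

Whether session length is measured before or after the other exclusions is a design choice the rules do not state; the sketch counts it first.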
taskclf.train.build_dataset
Training dataset builder: join, exclude, split, and write X/y/splits artifacts.
DatasetManifest
Bases: BaseModel
Summary returned by build_training_dataset().
Source code in src/taskclf/train/build_dataset.py
build_training_dataset(features_df, label_spans, *, output_dir, train_ratio=0.7, val_ratio=0.15, holdout_user_fraction=0.0, bucket_seconds=DEFAULT_BUCKET_SECONDS)
Join features with labels, apply exclusions, split, and write artifacts.
Label projection uses strict block-to-window containment rules from
time_spec.md Section 6 (full window must fall inside a single
block; conflicting multi-block overlaps are dropped).
Outputs:
- output_dir/X.parquet: feature matrix with ID columns and schema_version.
- output_dir/y.parquet: labels keyed by user_id and bucket_start_ts.
- output_dir/splits.json: train/val/test index lists and metadata.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| features_df | DataFrame | Feature DataFrame conforming to | required |
| label_spans | Sequence[LabelSpan] | Label spans to project onto feature windows. | required |
| output_dir | Path | Directory to write artifacts into (created if needed). | required |
| train_ratio | float | Fraction of each user's data for training. | 0.7 |
| val_ratio | float | Fraction for validation. | 0.15 |
| holdout_user_fraction | float | Fraction of users held out entirely for the test set (cold-start evaluation). | 0.0 |
| bucket_seconds | int | Window width in seconds. | DEFAULT_BUCKET_SECONDS |
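To show how train_ratio, val_ratio, and holdout_user_fraction could interact, here is a minimal sketch of a per-user chronological split. It is not the library's code: `split_indices` and its `seed` argument are hypothetical, the random choice of held-out users is an assumption, and so is sending each remaining user's leftover rows to the test set.

```python
import random
from collections import defaultdict


def split_indices(rows, train_ratio=0.70, val_ratio=0.15,
                  holdout_user_fraction=0.0, seed=0):
    """rows: list of (user_id, bucket_start_ts) pairs.

    Returns (splits, holdout_users), where splits maps
    "train"/"val"/"test" to lists of row indices.
    """
    by_user = defaultdict(list)
    for idx, (user_id, ts) in enumerate(rows):
        by_user[user_id].append((ts, idx))

    users = sorted(by_user)
    n_holdout = int(len(users) * holdout_user_fraction)
    # Assumption: held-out users are chosen at random with a fixed seed.
    holdout = set(random.Random(seed).sample(users, n_holdout))

    splits = {"train": [], "val": [], "test": []}
    for user in users:
        ordered = [idx for _, idx in sorted(by_user[user])]  # oldest first
        if user in holdout:
            splits["test"].extend(ordered)  # cold-start users: test only
            continue
        # Small epsilon guards against float error such as 2.9999999999.
        n_train = int(len(ordered) * train_ratio + 1e-9)
        n_val = int(len(ordered) * val_ratio + 1e-9)
        splits["train"].extend(ordered[:n_train])
        splits["val"].extend(ordered[n_train:n_train + n_val])
        splits["test"].extend(ordered[n_train + n_val:])
    return splits, sorted(holdout)
```

Splitting each user's rows in time order keeps later windows out of the training set, while held-out users never appear in train or val at all.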
Returns:
| Type | Description |
|---|---|
| DatasetManifest | Summary of the dataset that was built (see DatasetManifest). |
Source code in src/taskclf/train/build_dataset.py