# train.evaluate
Full model evaluation pipeline: metrics, calibration, acceptance checks.
## Overview
Evaluates a trained LightGBM model against labeled test data and produces a comprehensive report with acceptance-gate verdicts. Supports multiple evaluation modes so offline metrics align with deployed inference behavior:
```text
model + test_df → evaluate_model → EvaluationReport
├── overall metrics (macro/weighted F1)
├── per-class precision/recall/F1 (+ support)
├── top confusion pairs (off-diagonal)
├── calibration scalars (ECE, Brier, log loss)
├── slice metrics (default feature columns)
├── unknown-category rates vs training encoders
├── per-user macro-F1
├── calibration curves
├── user stratification
├── reject rate
├── flip rate
├── segment duration distribution
└── acceptance checks (pass/fail)
```
Predictions with max probability below the reject threshold are
classified as Mixed/Unknown (from core.defaults).
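The reject rule can be sketched in a few lines. This is an illustration only, not the library's implementation; the label constant and helper name are hypothetical, while the 0.55 default and the `Mixed/Unknown` label come from this page (`DEFAULT_REJECT_THRESHOLD`, `core.defaults`). The class names other than BreakIdle are invented for the example.

```python
import numpy as np

# Hypothetical sketch of the reject rule: a prediction whose max class
# probability falls below the threshold is routed to Mixed/Unknown.
REJECT_LABEL = "Mixed/Unknown"  # assumed constant name for this sketch

def apply_reject(probs: np.ndarray, labels: list[str], threshold: float = 0.55) -> list[str]:
    """Map each probability row to a label, rejecting low-confidence rows."""
    out = []
    for row in probs:
        idx = int(np.argmax(row))
        out.append(labels[idx] if row[idx] >= threshold else REJECT_LABEL)
    return out

probs = np.array([[0.7, 0.2, 0.1], [0.4, 0.35, 0.25]])
print(apply_reject(probs, ["Coding", "Meeting", "BreakIdle"]))
# → ['Coding', 'Mixed/Unknown']
```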
## Models

### EvaluationReport
Frozen Pydantic model containing all evaluation artifacts.
| Field | Type | Description |
|---|---|---|
| `macro_f1` | `float` | Overall macro-averaged F1 |
| `weighted_f1` | `float` | Overall weighted-averaged F1 |
| `per_class` | `dict[str, dict[str, float \| int]]` | Per-class precision, recall, F1, support |
| `confusion_matrix` | `list[list[int]]` | Confusion matrix as nested lists |
| `label_names` | `list[str]` | Ordered label names (rows/columns of the confusion matrix) |
| `top_confusion_pairs` | `list[dict[str, str \| int]]` | Largest off-diagonal confusion counts |
| `expected_calibration_error` | `float` | Multiclass ECE (OVR, support-weighted) |
| `multiclass_brier_score` | `float` | Multiclass Brier score |
| `multiclass_log_loss` | `float` | Multiclass log loss |
| `slice_metrics` | `dict[str, dict[str, dict[str, Any]]]` | Per-column slice breakdowns (see `core.metrics`) |
| `unknown_category_rates` | `dict[str, Any]` | Per-column unseen-category rate vs bundle encoders |
| `per_user` | `dict[str, dict[str, float]]` | Per-user macro-F1 and row count |
| `calibration` | `dict[str, dict[str, list[float]]]` | Per-class calibration curve data (`fraction_of_positives`, `mean_predicted_value`) |
| `stratification` | `dict[str, Any]` | User stratification report with optional warnings |
| `seen_user_f1` | `float \| None` | Macro-F1 on users seen during training (requires `holdout_users`) |
| `unseen_user_f1` | `float \| None` | Macro-F1 on held-out users (requires `holdout_users`) |
| `reject_rate` | `float` | Fraction of predictions below the reject threshold |
| `acceptance_checks` | `dict[str, bool]` | Named acceptance gates (pass/fail) |
| `acceptance_details` | `dict[str, str]` | Human-readable detail string per check |
| `flip_rate` | `float \| None` | Label-change rate (transitions / total windows) |
| `segment_duration_distribution` | `dict[str, int] \| None` | Histogram of segment durations by bucket (`"60s"`, `"120s"`, `"180s"`, `"300s"`, `"300s+"`) |
| `eval_mode` | `str` | Evaluation pipeline used (`"raw"`, `"calibrated"`, `"calibrated_reject"`, `"smoothed"`, `"interval"`) |
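The calibration scalars in the table follow standard definitions. As an illustration, a minimal multiclass Brier score can be computed as the mean squared error between predicted probabilities and one-hot targets (a common textbook definition; the library's exact computation is not shown on this page and may differ):

```python
import numpy as np

def multiclass_brier(probs: np.ndarray, y_true: np.ndarray, n_classes: int) -> float:
    """Mean squared error between predicted probability rows and one-hot targets."""
    onehot = np.eye(n_classes)[y_true]
    return float(np.mean(np.sum((probs - onehot) ** 2, axis=1)))

probs = np.array([[0.8, 0.1, 0.1], [0.2, 0.5, 0.3]])
y = np.array([0, 1])
print(round(multiclass_brier(probs, y, 3), 4))  # → 0.22
```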
### RejectTuningResult
Result of sweeping reject thresholds on a validation set.
| Field | Type | Description |
|---|---|---|
| `best_threshold` | `float` | Threshold maximizing accuracy on accepted windows within reject-rate bounds |
| `sweep` | `list[dict[str, float]]` | One row per candidate threshold: `threshold`, `accuracy_on_accepted`, `reject_rate`, `coverage`, `macro_f1` |
## Evaluation modes
The eval_mode parameter controls the evaluation pipeline:
| Mode | Calibrator | Reject | Smoothing | Description |
|---|---|---|---|---|
| `"raw"` | No | Yes | No | Default; raw model probabilities with reject threshold |
| `"calibrated"` | Yes | No | No | Calibrated probabilities, no reject |
| `"calibrated_reject"` | Yes | Yes | No | Calibrated probabilities with reject threshold |
| `"smoothed"` | Yes | Yes | Yes | Calibrated + reject + rolling-majority smoothing |
| `"interval"` | Yes | Yes | Yes | Smoothed predictions aggregated into segments; interval-level accuracy |
Non-raw modes require a calibrator implementing the
Calibrator protocol.
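As a concrete illustration, temperature scaling (the technique behind the `TemperatureCalibrator` used in the Usage section) could look like the sketch below. The `Calibrator` protocol's actual method name is not shown on this page; `calibrate` is assumed here purely for illustration:

```python
import numpy as np

class TemperatureScaler:
    """Sketch of temperature scaling; `calibrate` is an assumed method name."""

    def __init__(self, temperature: float = 1.0) -> None:
        self.temperature = temperature

    def calibrate(self, probs: np.ndarray) -> np.ndarray:
        # Rescale log-probabilities by 1/T, then renormalize with a softmax.
        # T > 1 flattens the distribution; T < 1 sharpens it.
        logits = np.log(np.clip(probs, 1e-12, None)) / self.temperature
        exp = np.exp(logits - logits.max(axis=1, keepdims=True))
        return exp / exp.sum(axis=1, keepdims=True)

scaler = TemperatureScaler(temperature=2.0)
out = scaler.calibrate(np.array([[0.9, 0.05, 0.05]]))  # less confident than input
```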
## Acceptance checks
All checks must pass for a model to be promoted. Thresholds are
defined in the module constants and align with
docs/guide/acceptance.md:
| Check | Threshold | Description |
|---|---|---|
| `macro_f1` | >= 0.65 | Overall macro-F1 |
| `weighted_f1` | >= 0.70 | Overall weighted-F1 |
| `breakidle_precision` | >= 0.95 | BreakIdle class precision |
| `breakidle_recall` | >= 0.90 | BreakIdle class recall |
| `no_class_below_50_precision` | >= 0.50 | Per-class precision floor |
| `reject_rate_bounds` | [0.05, 0.30] | Reject rate within window |
| `seen_user_f1` | >= 0.70 | Seen-user macro-F1 (when holdout users provided) |
| `unseen_user_f1` | >= 0.60 | Unseen-user macro-F1 (when holdout users provided) |
## Functions

### evaluate_model
```python
evaluate_model(
    model: lgb.Booster,
    test_df: pd.DataFrame,
    *,
    cat_encoders: dict[str, LabelEncoder] | None = None,
    holdout_users: Sequence[str] = (),
    reject_threshold: float = DEFAULT_REJECT_THRESHOLD,
    eval_mode: Literal["raw", "calibrated", "calibrated_reject", "smoothed", "interval"] = "raw",
    calibrator: Calibrator | None = None,
    smooth_window: int = DEFAULT_SMOOTH_WINDOW,
    schema_version: str = "v1",
) -> EvaluationReport
```
Runs comprehensive evaluation: overall metrics, per-class and per-user
breakdowns, calibration curves, user stratification, slice metrics,
unknown-category rates, probability-based calibration scalars, and
acceptance checks. When holdout_users is non-empty, computes separate
seen/unseen-user F1 scores.
The eval_mode parameter selects the evaluation pipeline (see table
above). Non-raw modes require a calibrator.
schema_version selects which feature columns are treated as
categorical when computing unknown_category_rates (via
get_categorical_columns). Callers loading a model bundle should pass
metadata.schema_version so the column set matches training.
### tune_reject_threshold
```python
tune_reject_threshold(
    model: lgb.Booster,
    val_df: pd.DataFrame,
    *,
    cat_encoders: dict[str, LabelEncoder] | None = None,
    thresholds: Sequence[float] | None = None,
    reject_rate_min: float = 0.05,
    reject_rate_max: float = 0.30,
    calibrator: Calibrator | None = None,
    schema_version: str | None = None,
) -> RejectTuningResult
```
Sweeps candidate thresholds (default np.arange(0.10, 1.00, 0.05))
and picks the one that maximizes accuracy on accepted windows while
keeping the reject rate within [reject_rate_min, reject_rate_max].
Falls back to DEFAULT_REJECT_THRESHOLD (0.55) if no candidate
satisfies the bounds.
When calibrator is provided, raw probabilities are calibrated before
extracting confidences for the threshold sweep. This ensures the
threshold is tuned on the same probability space used at inference time.
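The sweep described above can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: it takes precomputed per-window confidences and correctness flags, and keeps the first threshold (ties broken by order) that maximizes accepted-window accuracy within the reject-rate bounds, falling back to 0.55 otherwise:

```python
import numpy as np

def sweep_thresholds(conf: np.ndarray, correct: np.ndarray,
                     thresholds=None, rate_min=0.05, rate_max=0.30,
                     fallback=0.55) -> float:
    """Pick the threshold maximizing accuracy on accepted windows (sketch)."""
    thresholds = thresholds if thresholds is not None else np.arange(0.10, 1.00, 0.05)
    best, best_acc = None, -1.0
    for t in thresholds:
        accepted = conf >= t                     # windows kept at this threshold
        reject_rate = 1.0 - accepted.mean()
        if not (rate_min <= reject_rate <= rate_max) or not accepted.any():
            continue                             # outside acceptance bounds
        acc = correct[accepted].mean()           # accuracy on accepted windows
        if acc > best_acc:
            best, best_acc = float(t), acc
    return best if best is not None else fallback

best = sweep_thresholds(np.array([0.9, 0.8, 0.6, 0.42]),
                        np.array([True, True, True, False]))  # ≈ 0.45
```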
### write_evaluation_artifacts
Writes evaluation artifacts to disk:
| File | Content |
|---|---|
| `evaluation.json` | Full report as JSON |
| `calibration.json` | Per-class calibration curve data |
| `confusion_matrix.csv` | Labeled confusion matrix |
| `calibration.png` | Per-class calibration plots (optional; requires matplotlib) |
Returns a dict mapping artifact name to its written path.
## Usage
```python
from pathlib import Path

from taskclf.train.evaluate import (
    evaluate_model,
    tune_reject_threshold,
    write_evaluation_artifacts,
)
from taskclf.core.model_io import load_model_bundle
from taskclf.infer.calibration import TemperatureCalibrator

model, metadata, cat_encoders = load_model_bundle(Path("models/run_001"))

# Raw evaluation (default)
raw_report = evaluate_model(
    model, test_df,
    cat_encoders=cat_encoders,
    holdout_users=["user-X"],
)
print(f"Macro F1: {raw_report.macro_f1:.4f}")
print(f"Flip rate: {raw_report.flip_rate:.4f}")

# Calibrated evaluation
cal = TemperatureCalibrator(temperature=1.2)
cal_report = evaluate_model(
    model, test_df,
    cat_encoders=cat_encoders,
    eval_mode="calibrated",
    calibrator=cal,
)
print(f"Calibrated F1: {cal_report.macro_f1:.4f}")

# Tune reject threshold on calibrated scores
result = tune_reject_threshold(
    model, val_df,
    cat_encoders=cat_encoders,
    calibrator=cal,
)
print(f"Best threshold: {result.best_threshold}")

# Write artifacts
paths = write_evaluation_artifacts(raw_report, Path("artifacts/eval"))
```