
train.evaluate

Full model evaluation pipeline: metrics, calibration, acceptance checks.

Overview

Evaluates a trained LightGBM model against labeled test data and produces a comprehensive report with acceptance-gate verdicts. Supports multiple evaluation modes so offline metrics align with deployed inference behavior:

model + test_df → evaluate_model → EvaluationReport
                                       ├── overall metrics (macro/weighted F1)
                                       ├── per-class precision/recall/F1 (+ support)
                                       ├── top confusion pairs (off-diagonal)
                                       ├── calibration scalars (ECE, Brier, log loss)
                                       ├── slice metrics (default feature columns)
                                       ├── unknown-category rates vs training encoders
                                       ├── per-user macro-F1
                                       ├── calibration curves
                                       ├── user stratification
                                       ├── reject rate
                                       ├── flip rate
                                       ├── segment duration distribution
                                       └── acceptance checks (pass/fail)

Predictions with max probability below the reject threshold are classified as Mixed/Unknown (from core.defaults).
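The reject rule above can be sketched in a few lines. This is an illustrative stand-in: the `MIXED_UNKNOWN` label and the 0.55 threshold are assumed here; the real constants live in `core.defaults`.

```python
import numpy as np

# Stand-in constants; the real values come from core.defaults.
MIXED_UNKNOWN = "Mixed/Unknown"
REJECT_THRESHOLD = 0.55

def apply_reject(y_proba: np.ndarray, labels: list[str]) -> list[str]:
    """Replace predictions whose max probability is below the threshold."""
    confidences = y_proba.max(axis=1)
    return [
        MIXED_UNKNOWN if conf < REJECT_THRESHOLD else label
        for label, conf in zip(labels, confidences)
    ]

proba = np.array([[0.9, 0.1], [0.5, 0.5]])
print(apply_reject(proba, ["Coding", "Coding"]))  # second row falls below 0.55
```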

Models

EvaluationReport

Frozen Pydantic model containing all evaluation artifacts.

| Field | Type | Description |
| --- | --- | --- |
| `macro_f1` | `float` | Overall macro-averaged F1 |
| `weighted_f1` | `float` | Overall weighted-averaged F1 |
| `per_class` | `dict[str, dict[str, float \| int]]` | Per-class precision, recall, F1, support |
| `confusion_matrix` | `list[list[int]]` | Confusion matrix as nested lists |
| `label_names` | `list[str]` | Ordered label names (rows/columns of confusion matrix) |
| `top_confusion_pairs` | `list[dict[str, str \| int]]` | Largest off-diagonal confusion counts |
| `expected_calibration_error` | `float` | Multiclass ECE (OVR, support-weighted) |
| `multiclass_brier_score` | `float` | Multiclass Brier score |
| `multiclass_log_loss` | `float` | Multiclass log loss |
| `slice_metrics` | `dict[str, dict[str, dict[str, Any]]]` | Per-column slice breakdowns (see core.metrics) |
| `unknown_category_rates` | `dict[str, Any]` | Per-column unseen-categorical rate vs bundle encoders |
| `per_user` | `dict[str, dict[str, float]]` | Per-user macro-F1 and row count |
| `calibration` | `dict[str, dict[str, list[float]]]` | Per-class calibration curve data (fraction_of_positives, mean_predicted_value) |
| `stratification` | `dict[str, Any]` | User stratification report with optional warnings |
| `seen_user_f1` | `float \| None` | Macro-F1 on users seen during training (requires holdout_users) |
| `unseen_user_f1` | `float \| None` | Macro-F1 on held-out users (requires holdout_users) |
| `reject_rate` | `float` | Fraction of predictions below the reject threshold |
| `acceptance_checks` | `dict[str, bool]` | Named acceptance gates (pass/fail) |
| `acceptance_details` | `dict[str, str]` | Human-readable detail string per check |
| `flip_rate` | `float \| None` | Label-change rate (transitions / total windows) |
| `segment_duration_distribution` | `dict[str, int] \| None` | Histogram of segment durations by bucket ("60s", "120s", "180s", "300s", "300s+") |
| `eval_mode` | `str` | Which evaluation pipeline was used ("raw", "calibrated", "calibrated_reject", "smoothed", "interval") |
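The `flip_rate` field can be illustrated directly from its definition above (transitions divided by total windows). This sketch is a stand-in; the library computes it with its own helper:

```python
def flip_rate(labels: list[str]) -> float:
    """Fraction of adjacent window pairs where the predicted label changes."""
    if len(labels) <= 1:
        return 0.0
    flips = sum(a != b for a, b in zip(labels, labels[1:]))
    return flips / len(labels)

# Two transitions (Coding->Meeting, Meeting->Coding) over five windows.
print(flip_rate(["Coding", "Coding", "Meeting", "Meeting", "Coding"]))  # 0.4
```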

RejectTuningResult

Result of sweeping reject thresholds on a validation set.

| Field | Type | Description |
| --- | --- | --- |
| `best_threshold` | `float` | Threshold maximizing accuracy on accepted windows within reject-rate bounds |
| `sweep` | `list[dict[str, float]]` | Per-threshold row with threshold, accuracy_on_accepted, reject_rate, coverage, macro_f1 |

Evaluation modes

The eval_mode parameter controls the evaluation pipeline:

| Mode | Calibrator | Reject | Smoothing | Description |
| --- | --- | --- | --- | --- |
| `"raw"` | No | Yes | No | Default; raw model probabilities with reject threshold |
| `"calibrated"` | Yes | No | No | Calibrated probabilities, no reject |
| `"calibrated_reject"` | Yes | Yes | No | Calibrated probabilities with reject threshold |
| `"smoothed"` | Yes | Yes | Yes | Calibrated + reject + rolling-majority smoothing |
| `"interval"` | Yes | Yes | Yes | Smoothed predictions aggregated into segments; interval-level accuracy |

Non-raw modes require a calibrator implementing the Calibrator protocol.
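The only method the evaluator calls on a calibrator is `calibrate(y_proba)`. A minimal implementation satisfying that shape might look like the following temperature scaler. This is an illustrative sketch, not the library's `TemperatureCalibrator`:

```python
import numpy as np

class SimpleTemperatureCalibrator:
    """Illustrative temperature scaling over a probability matrix."""

    def __init__(self, temperature: float) -> None:
        self.temperature = temperature

    def calibrate(self, y_proba: np.ndarray) -> np.ndarray:
        # Recover log-probabilities, divide by the temperature, then
        # renormalise with a softmax so rows still sum to 1.
        logits = np.log(np.clip(y_proba, 1e-12, 1.0)) / self.temperature
        exp = np.exp(logits - logits.max(axis=1, keepdims=True))
        return exp / exp.sum(axis=1, keepdims=True)

cal = SimpleTemperatureCalibrator(temperature=2.0)
out = cal.calibrate(np.array([[0.9, 0.1]]))
# Temperatures above 1 flatten the distribution, so the top probability drops.
```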

Acceptance checks

All checks must pass for a model to be promoted. Thresholds are defined in the module constants and align with docs/guide/acceptance.md:

| Check | Threshold | Description |
| --- | --- | --- |
| `macro_f1` | >= 0.65 | Overall macro-F1 |
| `weighted_f1` | >= 0.70 | Overall weighted-F1 |
| `breakidle_precision` | >= 0.95 | BreakIdle class precision |
| `breakidle_recall` | >= 0.90 | BreakIdle class recall |
| `no_class_below_50_precision` | >= 0.50 | Per-class precision floor |
| `reject_rate_bounds` | [0.05, 0.30] | Reject rate within window |
| `seen_user_f1` | >= 0.70 | Seen-user macro-F1 (when holdout users provided) |
| `unseen_user_f1` | >= 0.60 | Unseen-user macro-F1 (when holdout users provided) |
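The gates combine by conjunction: every named check must pass. The sketch below mirrors a subset of the thresholds from the table; the real logic lives in the module's `_check_acceptance` helper and covers all the checks above:

```python
def passes_acceptance(macro_f1: float, weighted_f1: float, reject_rate: float) -> bool:
    """Subset of the acceptance gates; the model is promoted only if all pass."""
    checks = {
        "macro_f1": macro_f1 >= 0.65,
        "weighted_f1": weighted_f1 >= 0.70,
        "reject_rate_bounds": 0.05 <= reject_rate <= 0.30,
    }
    return all(checks.values())

print(passes_acceptance(0.71, 0.76, 0.12))  # True
print(passes_acceptance(0.71, 0.76, 0.02))  # False: reject rate below the window
```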

Functions

evaluate_model

evaluate_model(
    model: lgb.Booster,
    test_df: pd.DataFrame,
    *,
    cat_encoders: dict[str, LabelEncoder] | None = None,
    holdout_users: Sequence[str] = (),
    reject_threshold: float = DEFAULT_REJECT_THRESHOLD,
    eval_mode: Literal["raw", "calibrated", "calibrated_reject", "smoothed", "interval"] = "raw",
    calibrator: Calibrator | None = None,
    smooth_window: int = DEFAULT_SMOOTH_WINDOW,
    schema_version: str | None = None,
) -> EvaluationReport

Runs comprehensive evaluation: overall metrics, per-class and per-user breakdowns, calibration curves, user stratification, slice metrics, unknown-category rates, probability-based calibration scalars, and acceptance checks. When holdout_users is non-empty, computes separate seen/unseen-user F1 scores.

The eval_mode parameter selects the evaluation pipeline (see table above). Non-raw modes require a calibrator.

schema_version selects which feature columns are treated as categorical when computing unknown_category_rates (via get_categorical_columns). When omitted, the version is inferred from the DataFrame. Callers loading a model bundle should pass metadata.schema_version so the column set matches training.

tune_reject_threshold

tune_reject_threshold(
    model: lgb.Booster,
    val_df: pd.DataFrame,
    *,
    cat_encoders: dict[str, LabelEncoder] | None = None,
    thresholds: Sequence[float] | None = None,
    reject_rate_min: float = 0.05,
    reject_rate_max: float = 0.30,
    calibrator: Calibrator | None = None,
    schema_version: str | None = None,
) -> RejectTuningResult

Sweeps candidate thresholds (default np.arange(0.10, 1.00, 0.05)) and picks the one that maximizes accuracy on accepted windows while keeping the reject rate within [reject_rate_min, reject_rate_max]. Falls back to DEFAULT_REJECT_THRESHOLD (0.55) if no candidate satisfies the bounds.

When calibrator is provided, raw probabilities are calibrated before extracting confidences for the threshold sweep. This ensures the threshold is tuned on the same probability space used at inference time.
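The sweep table is plain dicts, so it loads straight into a DataFrame for inspecting the accuracy/coverage trade-off. The rows below are made-up illustrative values with the same keys as `RejectTuningResult.sweep`, and the selection rule mirrors the one described above:

```python
import pandas as pd

# Illustrative sweep rows (not real results).
sweep = [
    {"threshold": 0.50, "accuracy_on_accepted": 0.81, "reject_rate": 0.03, "coverage": 0.97, "macro_f1": 0.66},
    {"threshold": 0.55, "accuracy_on_accepted": 0.84, "reject_rate": 0.12, "coverage": 0.88, "macro_f1": 0.68},
    {"threshold": 0.60, "accuracy_on_accepted": 0.86, "reject_rate": 0.35, "coverage": 0.65, "macro_f1": 0.69},
]
df = pd.DataFrame(sweep)

# Best accepted-window accuracy among thresholds whose reject rate
# stays inside the acceptance window [0.05, 0.30].
in_bounds = df[(df.reject_rate >= 0.05) & (df.reject_rate <= 0.30)]
best = in_bounds.loc[in_bounds.accuracy_on_accepted.idxmax()]
print(best.threshold)  # 0.55: 0.50 rejects too little, 0.60 rejects too much
```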

write_evaluation_artifacts

write_evaluation_artifacts(
    report: EvaluationReport,
    output_dir: Path,
) -> dict[str, Path]

Writes evaluation artifacts to disk:

| File | Content |
| --- | --- |
| `evaluation.json` | Full report as JSON |
| `calibration.json` | Per-class calibration curve data |
| `confusion_matrix.csv` | Labeled confusion matrix |
| `calibration.png` | Per-class calibration plots (optional, requires matplotlib) |

Returns a dict mapping artifact name to its written path.
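Because evaluation.json is a plain JSON dump of the report, it can be reloaded for downstream gating. A round-trip sketch with a stand-in report dict (not a real evaluation):

```python
import json
import tempfile
from pathlib import Path

# Stand-in for the report written by write_evaluation_artifacts.
with tempfile.TemporaryDirectory() as tmp:
    eval_path = Path(tmp) / "evaluation.json"
    eval_path.write_text(
        json.dumps({"macro_f1": 0.71, "acceptance_checks": {"macro_f1": True}})
    )
    report = json.loads(eval_path.read_text())
    print(report["macro_f1"], all(report["acceptance_checks"].values()))
```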

Usage

from pathlib import Path

from taskclf.train.evaluate import (
    evaluate_model,
    tune_reject_threshold,
    write_evaluation_artifacts,
)
from taskclf.core.model_io import load_model_bundle
from taskclf.infer.calibration import TemperatureCalibrator

model, metadata, cat_encoders = load_model_bundle(Path("models/run_001"))

# Raw evaluation (default)
raw_report = evaluate_model(
    model, test_df,
    cat_encoders=cat_encoders,
    holdout_users=["user-X"],
)
print(f"Macro F1: {raw_report.macro_f1:.4f}")
print(f"Flip rate: {raw_report.flip_rate:.4f}")

# Calibrated evaluation
cal = TemperatureCalibrator(temperature=1.2)
cal_report = evaluate_model(
    model, test_df,
    cat_encoders=cat_encoders,
    eval_mode="calibrated",
    calibrator=cal,
)
print(f"Calibrated F1: {cal_report.macro_f1:.4f}")

# Tune reject threshold on calibrated scores
result = tune_reject_threshold(
    model, val_df,
    cat_encoders=cat_encoders,
    calibrator=cal,
)
print(f"Best threshold: {result.best_threshold}")

# Write artifacts
paths = write_evaluation_artifacts(raw_report, Path("artifacts/eval"))

taskclf.train.evaluate

Full model evaluation pipeline: metrics, calibration, acceptance checks.

EvaluationReport

Bases: BaseModel

Comprehensive evaluation output for a trained model on a test set.

Source code in src/taskclf/train/evaluate.py
class EvaluationReport(BaseModel, frozen=True):
    """Comprehensive evaluation output for a trained model on a test set."""

    macro_f1: float
    weighted_f1: float
    per_class: dict[str, dict[str, float | int]]
    confusion_matrix: list[list[int]]
    label_names: list[str]
    per_user: dict[str, dict[str, float]]
    calibration: dict[str, dict[str, list[float]]]
    stratification: dict[str, Any]
    seen_user_f1: float | None = None
    unseen_user_f1: float | None = None
    reject_rate: float
    acceptance_checks: dict[str, bool]
    acceptance_details: dict[str, str]
    flip_rate: float | None = None
    segment_duration_distribution: dict[str, int] | None = None
    eval_mode: str = "raw"
    top_confusion_pairs: list[dict[str, str | int]] = Field(default_factory=list)
    expected_calibration_error: float = 0.0
    multiclass_brier_score: float = 0.0
    multiclass_log_loss: float = 0.0
    slice_metrics: dict[str, dict[str, dict[str, Any]]] = Field(default_factory=dict)
    unknown_category_rates: dict[str, Any] = Field(default_factory=dict)

RejectTuningResult

Bases: BaseModel

Result of sweeping reject thresholds on a validation set.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `best_threshold` | `float` | Threshold that maximises accuracy on accepted windows while keeping reject rate within acceptance bounds. |
| `sweep` | `list[dict[str, float]]` | One dict per candidate threshold, each with `threshold`, `accuracy_on_accepted`, `reject_rate`, `coverage`, and `macro_f1`. |

Source code in src/taskclf/train/evaluate.py
class RejectTuningResult(BaseModel, frozen=True):
    """Result of sweeping reject thresholds on a validation set.

    Attributes:
        best_threshold: Threshold that maximises accuracy on accepted
            windows while keeping reject rate within acceptance bounds.
        sweep: List of dicts, one per candidate threshold, each with
            ``threshold``, ``accuracy_on_accepted``, ``reject_rate``,
            ``coverage``, and ``macro_f1``.
    """

    best_threshold: float
    sweep: list[dict[str, float]]

evaluate_model(model, test_df, *, cat_encoders=None, holdout_users=(), reject_threshold=DEFAULT_REJECT_THRESHOLD, eval_mode='raw', calibrator=None, smooth_window=DEFAULT_SMOOTH_WINDOW, schema_version=None)

Run comprehensive evaluation of a trained model on a test set.

Computes overall metrics (macro-F1, weighted-F1), per-class precision / recall / F1, per-user macro-F1, calibration curves, user-stratification report, and acceptance-gate checks.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model` | `Booster` | Trained LightGBM booster. | required |
| `test_df` | `DataFrame` | Test DataFrame containing `FEATURE_COLUMNS`, `label`, and `user_id` columns. | required |
| `cat_encoders` | `dict[str, LabelEncoder] \| None` | Pre-fitted categorical encoders from the training run. | `None` |
| `holdout_users` | `Sequence[str]` | User IDs that were held out from training, used to split seen-vs-unseen evaluation. | `()` |
| `reject_threshold` | `float` | Max-probability below which a prediction is treated as rejected (Mixed/Unknown). | `DEFAULT_REJECT_THRESHOLD` |
| `eval_mode` | `Literal["raw", "calibrated", "calibrated_reject", "smoothed", "interval"]` | Evaluation pipeline to use. `"raw"` uses model probabilities directly. `"calibrated"` applies a calibrator before metrics (no reject). `"calibrated_reject"` applies calibrator + reject. `"smoothed"` adds rolling-majority smoothing after reject. `"interval"` aggregates smoothed predictions into segments and evaluates per-interval accuracy. | `'raw'` |
| `calibrator` | `Calibrator \| None` | Probability calibrator to apply in non-raw modes. Required when `eval_mode` is not `"raw"`. | `None` |
| `smooth_window` | `int` | Window size for rolling-majority smoothing. | `DEFAULT_SMOOTH_WINDOW` |
| `schema_version` | `str \| None` | `"v1"`, `"v2"`, or `"v3"`. When omitted, inferred from `test_df` before selecting categorical columns for the unknown-category-rate computation (see `get_categorical_columns`). | `None` |

Returns:

| Type | Description |
| --- | --- |
| `EvaluationReport` | A frozen `EvaluationReport` with all evaluation artifacts. |

Source code in src/taskclf/train/evaluate.py
def evaluate_model(
    model: lgb.Booster,
    test_df: pd.DataFrame,
    *,
    cat_encoders: dict[str, LabelEncoder] | None = None,
    holdout_users: Sequence[str] = (),
    reject_threshold: float = DEFAULT_REJECT_THRESHOLD,
    eval_mode: Literal[
        "raw", "calibrated", "calibrated_reject", "smoothed", "interval"
    ] = "raw",
    calibrator: Calibrator | None = None,
    smooth_window: int = DEFAULT_SMOOTH_WINDOW,
    schema_version: str | None = None,
) -> EvaluationReport:
    """Run comprehensive evaluation of a trained model on a test set.

    Computes overall metrics (macro-F1, weighted-F1), per-class precision /
    recall / F1, per-user macro-F1, calibration curves, user-stratification
    report, and acceptance-gate checks.

    Args:
        model: Trained LightGBM booster.
        test_df: Test DataFrame containing ``FEATURE_COLUMNS``, ``label``,
            and ``user_id`` columns.
        cat_encoders: Pre-fitted categorical encoders from the training run.
        holdout_users: User IDs that were held out from training, used to
            split seen-vs-unseen evaluation.
        reject_threshold: Max-probability below which a prediction is
            treated as rejected (``Mixed/Unknown``).
        eval_mode: Evaluation pipeline to use.  ``"raw"`` uses model
            probabilities directly.  ``"calibrated"`` applies a calibrator
            before metrics (no reject).  ``"calibrated_reject"`` applies
            calibrator + reject.  ``"smoothed"`` adds rolling-majority
            smoothing after reject.  ``"interval"`` aggregates smoothed
            predictions into segments and evaluates per-interval accuracy.
        calibrator: Probability calibrator to apply in non-raw modes.
            Required when *eval_mode* is not ``"raw"``.
        smooth_window: Window size for rolling-majority smoothing.
        schema_version: ``"v1"``, ``"v2"``, or ``"v3"``. When omitted, infer from
            ``test_df`` before selecting categorical columns for unknown-category-rate
            (see :func:`~taskclf.train.lgbm.get_categorical_columns`).

    Returns:
        A frozen :class:`EvaluationReport` with all evaluation artifacts.
    """
    le = LabelEncoder()
    le.fit(sorted(LABEL_SET_V1))
    label_names = list(le.classes_)

    resolved_schema_version = resolve_feature_schema_version(test_df, schema_version)
    y_proba = predict_proba(
        model,
        test_df,
        cat_encoders,
        schema_version=resolved_schema_version,
    )

    if eval_mode != "raw" and calibrator is not None:
        y_proba = calibrator.calibrate(y_proba)

    y_pred_indices = y_proba.argmax(axis=1)
    y_pred_labels = list(le.inverse_transform(y_pred_indices))

    apply_reject = eval_mode in ("raw", "calibrated_reject", "smoothed", "interval")
    if apply_reject:
        confidences = y_proba.max(axis=1)
        rejected = confidences < reject_threshold
        labels_for_metrics = [
            MIXED_UNKNOWN if rej else lbl for lbl, rej in zip(y_pred_labels, rejected)
        ]
    else:
        labels_for_metrics = list(y_pred_labels)

    if eval_mode in ("smoothed", "interval"):
        labels_for_metrics = rolling_majority(labels_for_metrics, smooth_window)

    y_true = list(test_df["label"].values)
    user_ids = list(test_df["user_id"].values)
    y_true_indices = le.transform(y_true)

    if eval_mode == "interval":
        bucket_starts = [
            datetime(2000, 1, 1) + timedelta(seconds=i * DEFAULT_BUCKET_SECONDS)
            for i in range(len(labels_for_metrics))
        ]
        pred_segments = segmentize(
            bucket_starts, labels_for_metrics, DEFAULT_BUCKET_SECONDS
        )
        true_segments = segmentize(bucket_starts, y_true, DEFAULT_BUCKET_SECONDS)

        interval_correct = 0
        interval_total = len(pred_segments)
        true_map = {s.start_ts: s.label for s in true_segments}
        for seg in pred_segments:
            gold = true_map.get(seg.start_ts)
            if gold is not None and gold == seg.label:
                interval_correct += 1
        interval_accuracy = (
            interval_correct / interval_total if interval_total > 0 else 0.0
        )

        metrics = {
            "macro_f1": round(interval_accuracy, 4),
            "weighted_f1": round(interval_accuracy, 4),
        }
        pc = per_class_metrics(y_true, labels_for_metrics, label_names)
    else:
        metrics = compute_metrics(y_true, labels_for_metrics, label_names)
        pc = per_class_metrics(y_true, labels_for_metrics, label_names)

    cm_df = confusion_matrix_df(y_true, labels_for_metrics, label_names)
    cm_list = cm_df.values.tolist()
    top_pairs = top_confusion_pairs(cm_list, label_names)
    ece = round(
        expected_calibration_error_multiclass(y_true_indices, y_proba, label_names),
        4,
    )
    brier = round(multiclass_brier_score(y_true_indices, y_proba), 4)
    ll = round(multiclass_log_loss_score(y_true_indices, y_proba), 4)
    slices = slice_metrics_by_columns(
        test_df,
        y_true,
        labels_for_metrics,
        label_names,
    )
    cat_cols = get_categorical_columns(resolved_schema_version)
    unknown_rates = unknown_category_rates(test_df, cat_encoders, cat_cols)

    pu = per_user_metrics(y_true, labels_for_metrics, user_ids, label_names)
    cal = calibration_curve_data(y_true_indices, y_proba, label_names)
    strat = user_stratification_report(user_ids, y_true, label_names)
    rr = reject_rate(labels_for_metrics, MIXED_UNKNOWN)

    fr = round(flap_rate(labels_for_metrics), 4)
    seg_dist = _segment_duration_distribution(labels_for_metrics)

    seen_f1: float | None = None
    unseen_f1: float | None = None
    holdout_set = set(holdout_users)

    if holdout_set:
        seen_mask = [uid not in holdout_set for uid in user_ids]
        unseen_mask = [uid in holdout_set for uid in user_ids]

        if any(seen_mask):
            seen_true = [y for y, m in zip(y_true, seen_mask) if m]
            seen_pred = [y for y, m in zip(labels_for_metrics, seen_mask) if m]
            seen_f1 = round(
                float(
                    f1_score(
                        seen_true,
                        seen_pred,
                        labels=label_names,
                        average="macro",
                        zero_division=0,
                    )
                ),
                4,
            )

        if any(unseen_mask):
            unseen_true = [y for y, m in zip(y_true, unseen_mask) if m]
            unseen_pred = [y for y, m in zip(labels_for_metrics, unseen_mask) if m]
            unseen_f1 = round(
                float(
                    f1_score(
                        unseen_true,
                        unseen_pred,
                        labels=label_names,
                        average="macro",
                        zero_division=0,
                    )
                ),
                4,
            )

    checks, check_details = _check_acceptance(
        metrics["macro_f1"],
        metrics["weighted_f1"],
        pc,
        rr,
        seen_f1,
        unseen_f1,
    )

    return EvaluationReport(
        macro_f1=metrics["macro_f1"],
        weighted_f1=metrics["weighted_f1"],
        per_class=pc,
        confusion_matrix=cm_list,
        label_names=label_names,
        per_user=pu,
        calibration=cal,
        stratification=strat,
        seen_user_f1=seen_f1,
        unseen_user_f1=unseen_f1,
        reject_rate=round(rr, 4),
        acceptance_checks=checks,
        acceptance_details=check_details,
        flip_rate=fr,
        segment_duration_distribution=seg_dist,
        eval_mode=eval_mode,
        top_confusion_pairs=top_pairs,
        expected_calibration_error=ece,
        multiclass_brier_score=brier,
        multiclass_log_loss=ll,
        slice_metrics=slices,
        unknown_category_rates=unknown_rates,
    )

tune_reject_threshold(model, val_df, *, cat_encoders=None, thresholds=None, reject_rate_min=_ACCEPT_REJECT_RATE_MIN, reject_rate_max=_ACCEPT_REJECT_RATE_MAX, calibrator=None, schema_version=None)

Sweep reject thresholds and pick the best one.

For each candidate threshold the function computes accuracy on accepted (non-rejected) windows, the reject rate, coverage (fraction of windows kept), and macro-F1. The best threshold is the one that maximises accuracy on accepted windows while keeping reject rate within [reject_rate_min, reject_rate_max].

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model` | `Booster` | Trained LightGBM booster. | required |
| `val_df` | `DataFrame` | Validation DataFrame with `FEATURE_COLUMNS`, `label`, and `user_id` columns. | required |
| `cat_encoders` | `dict[str, LabelEncoder] \| None` | Pre-fitted categorical encoders. | `None` |
| `thresholds` | `Sequence[float] \| None` | Candidate thresholds to evaluate. Defaults to `np.arange(0.10, 1.00, 0.05)`. | `None` |
| `reject_rate_min` | `float` | Lower bound for acceptable reject rate. | `_ACCEPT_REJECT_RATE_MIN` |
| `reject_rate_max` | `float` | Upper bound for acceptable reject rate. | `_ACCEPT_REJECT_RATE_MAX` |
| `calibrator` | `Calibrator \| None` | When provided, raw probabilities are calibrated before extracting confidences for the threshold sweep, so the threshold is tuned on the same probability space used at inference time. | `None` |
| `schema_version` | `str \| None` | `"v1"`, `"v2"`, or `"v3"`. When omitted, inferred from `val_df`. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `RejectTuningResult` | A `RejectTuningResult` with the optimal threshold and the full sweep table. |

Source code in src/taskclf/train/evaluate.py
def tune_reject_threshold(
    model: lgb.Booster,
    val_df: pd.DataFrame,
    *,
    cat_encoders: dict[str, LabelEncoder] | None = None,
    thresholds: Sequence[float] | None = None,
    reject_rate_min: float = _ACCEPT_REJECT_RATE_MIN,
    reject_rate_max: float = _ACCEPT_REJECT_RATE_MAX,
    calibrator: Calibrator | None = None,
    schema_version: str | None = None,
) -> RejectTuningResult:
    """Sweep reject thresholds and pick the best one.

    For each candidate threshold the function computes accuracy on
    accepted (non-rejected) windows, the reject rate, coverage (fraction
    of windows kept), and macro-F1.  The best threshold is the one that
    maximises accuracy on accepted windows while keeping reject rate
    within *[reject_rate_min, reject_rate_max]*.

    Args:
        model: Trained LightGBM booster.
        val_df: Validation DataFrame with ``FEATURE_COLUMNS``, ``label``,
            and ``user_id`` columns.
        cat_encoders: Pre-fitted categorical encoders.
        thresholds: Candidate thresholds to evaluate.  Defaults to
            ``np.arange(0.10, 1.00, 0.05)``.
        reject_rate_min: Lower bound for acceptable reject rate.
        reject_rate_max: Upper bound for acceptable reject rate.
        calibrator: When provided, raw probabilities are calibrated
            before extracting confidences for the threshold sweep.
            This ensures the threshold is tuned on the same probability
            space used at inference time.
        schema_version: ``"v1"``, ``"v2"``, or ``"v3"``. When omitted, infer from
            ``val_df``.

    Returns:
        A :class:`RejectTuningResult` with the optimal threshold and
        the full sweep table.
    """
    if thresholds is None:
        thresholds = list(np.round(np.arange(0.10, 1.00, 0.05), 2))

    le = LabelEncoder()
    le.fit(sorted(LABEL_SET_V1))
    label_names = list(le.classes_)

    resolved_schema_version = resolve_feature_schema_version(val_df, schema_version)
    y_proba = predict_proba(
        model,
        val_df,
        cat_encoders,
        schema_version=resolved_schema_version,
    )
    if calibrator is not None:
        y_proba = calibrator.calibrate(y_proba)
    y_pred_indices = y_proba.argmax(axis=1)
    y_pred_labels = np.array(le.inverse_transform(y_pred_indices))
    y_true = np.array(val_df["label"].values)
    confidences = y_proba.max(axis=1)

    sweep: list[dict[str, float]] = []
    best_threshold = DEFAULT_REJECT_THRESHOLD
    best_acc = -1.0

    for t in thresholds:
        rejected = confidences < t
        rr = float(rejected.mean())
        coverage = 1.0 - rr

        accepted_mask = ~rejected
        if accepted_mask.any():
            acc = float(
                accuracy_score(y_true[accepted_mask], y_pred_labels[accepted_mask])
            )
            mf1 = float(
                f1_score(
                    y_true[accepted_mask],
                    y_pred_labels[accepted_mask],
                    labels=label_names,
                    average="macro",
                    zero_division=0,
                )
            )
        else:
            acc = 0.0
            mf1 = 0.0

        sweep.append(
            {
                "threshold": round(float(t), 4),
                "accuracy_on_accepted": round(acc, 4),
                "reject_rate": round(rr, 4),
                "coverage": round(coverage, 4),
                "macro_f1": round(mf1, 4),
            }
        )

        if reject_rate_min <= rr <= reject_rate_max and acc > best_acc:
            best_acc = acc
            best_threshold = float(t)

    return RejectTuningResult(
        best_threshold=round(best_threshold, 4),
        sweep=sweep,
    )

write_evaluation_artifacts(report, output_dir)

Write evaluation report artifacts to disk.

Writes evaluation.json (full report), calibration.json (per-class calibration curve data), confusion_matrix.csv, and an optional calibration.png into output_dir.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `report` | `EvaluationReport` | A completed evaluation report. | required |
| `output_dir` | `Path` | Target directory (created if needed). | required |

Returns:

| Type | Description |
| --- | --- |
| `dict[str, Path]` | Dict mapping artifact name to its written path. |

Source code in src/taskclf/train/evaluate.py
def write_evaluation_artifacts(
    report: EvaluationReport,
    output_dir: Path,
) -> dict[str, Path]:
    """Write evaluation report artifacts to disk.

    Writes ``evaluation.json`` (full report) and ``calibration.json``
    (per-class calibration curve data) into *output_dir*.

    Args:
        report: A completed evaluation report.
        output_dir: Target directory (created if needed).

    Returns:
        Dict mapping artifact name to its written path.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    paths: dict[str, Path] = {}

    eval_path = output_dir / "evaluation.json"
    eval_path.write_text(json.dumps(report.model_dump(), indent=2, default=str))
    paths["evaluation"] = eval_path

    cal_path = output_dir / "calibration.json"
    cal_path.write_text(json.dumps(report.calibration, indent=2))
    paths["calibration"] = cal_path

    cm_df = pd.DataFrame(
        report.confusion_matrix,
        index=report.label_names,
        columns=report.label_names,
    )
    cm_path = output_dir / "confusion_matrix.csv"
    cm_df.to_csv(cm_path)
    paths["confusion_matrix"] = cm_path

    try:
        import matplotlib

        matplotlib.use("Agg")
        import matplotlib.pyplot as plt

        fig, axes = plt.subplots(2, 4, figsize=(16, 8))
        axes_flat = axes.flatten()
        for idx, name in enumerate(report.label_names):
            ax = axes_flat[idx]
            cal = report.calibration.get(name, {})
            frac = cal.get("fraction_of_positives", [])
            mean_pred = cal.get("mean_predicted_value", [])
            ax.plot([0, 1], [0, 1], "k--", alpha=0.5)
            if frac and mean_pred:
                ax.plot(mean_pred, frac, "s-")
            ax.set_title(name, fontsize=9)
            ax.set_xlim(0, 1)
            ax.set_ylim(0, 1)
            ax.set_xlabel("Mean predicted", fontsize=7)
            ax.set_ylabel("Fraction positive", fontsize=7)

        for idx in range(len(report.label_names), len(axes_flat)):
            axes_flat[idx].set_visible(False)

        fig.suptitle("Per-Class Calibration Curves")
        fig.tight_layout()
        plot_path = output_dir / "calibration.png"
        fig.savefig(plot_path, dpi=100)
        plt.close(fig)
        paths["calibration_plot"] = plot_path
    except Exception:
        logger.debug("Calibration plot generation failed", exc_info=True)

    return paths