
train.evaluate

Full model evaluation pipeline: metrics, calibration, acceptance checks.

Overview

Evaluates a trained LightGBM model against labeled test data and produces a comprehensive report with acceptance-gate verdicts. Supports multiple evaluation modes so offline metrics align with deployed inference behavior:

model + test_df → evaluate_model → EvaluationReport
                                       ├── overall metrics (macro/weighted F1)
                                       ├── per-class precision/recall/F1 (+ support)
                                       ├── top confusion pairs (off-diagonal)
                                       ├── calibration scalars (ECE, Brier, log loss)
                                       ├── slice metrics (default feature columns)
                                       ├── unknown-category rates vs training encoders
                                       ├── per-user macro-F1
                                       ├── calibration curves
                                       ├── user stratification
                                       ├── reject rate
                                       ├── flip rate
                                       ├── segment duration distribution
                                       └── acceptance checks (pass/fail)

Predictions with max probability below the reject threshold are classified as Mixed/Unknown (from core.defaults).
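The reject rule above can be sketched in a few lines. This is an illustrative stand-in: the `MIXED_UNKNOWN` label and the 0.55 threshold are assumed here; the real constants live in `core.defaults`.

```python
import numpy as np

# Stand-in constants; the real values come from core.defaults.
MIXED_UNKNOWN = "Mixed/Unknown"
REJECT_THRESHOLD = 0.55

def apply_reject(y_proba: np.ndarray, labels: list[str]) -> list[str]:
    """Replace predictions whose max probability is below the threshold."""
    confidences = y_proba.max(axis=1)
    return [
        MIXED_UNKNOWN if conf < REJECT_THRESHOLD else label
        for label, conf in zip(labels, confidences)
    ]

proba = np.array([[0.9, 0.1], [0.5, 0.5]])
print(apply_reject(proba, ["Coding", "Coding"]))  # second row falls below 0.55
```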

Models

EvaluationReport

Frozen Pydantic model containing all evaluation artifacts.

| Field | Type | Description |
| --- | --- | --- |
| `macro_f1` | `float` | Overall macro-averaged F1 |
| `weighted_f1` | `float` | Overall weighted-averaged F1 |
| `per_class` | `dict[str, dict[str, float \| int]]` | Per-class precision, recall, F1, support |
| `confusion_matrix` | `list[list[int]]` | Confusion matrix as nested lists |
| `label_names` | `list[str]` | Ordered label names (rows/columns of confusion matrix) |
| `top_confusion_pairs` | `list[dict[str, str \| int]]` | Largest off-diagonal confusion counts |
| `expected_calibration_error` | `float` | Multiclass ECE (OVR, support-weighted) |
| `multiclass_brier_score` | `float` | Multiclass Brier score |
| `multiclass_log_loss` | `float` | Multiclass log loss |
| `slice_metrics` | `dict[str, dict[str, dict[str, Any]]]` | Per-column slice breakdowns (see core.metrics) |
| `unknown_category_rates` | `dict[str, Any]` | Per-column unseen-categorical rate vs bundle encoders |
| `per_user` | `dict[str, dict[str, float]]` | Per-user macro-F1 and row count |
| `calibration` | `dict[str, dict[str, list[float]]]` | Per-class calibration curve data (fraction_of_positives, mean_predicted_value) |
| `stratification` | `dict[str, Any]` | User stratification report with optional warnings |
| `seen_user_f1` | `float \| None` | Macro-F1 on users seen during training (requires holdout_users) |
| `unseen_user_f1` | `float \| None` | Macro-F1 on held-out users (requires holdout_users) |
| `reject_rate` | `float` | Fraction of predictions below the reject threshold |
| `acceptance_checks` | `dict[str, bool]` | Named acceptance gates (pass/fail) |
| `acceptance_details` | `dict[str, str]` | Human-readable detail string per check |
| `flip_rate` | `float \| None` | Label-change rate (transitions / total windows) |
| `segment_duration_distribution` | `dict[str, int] \| None` | Histogram of segment durations by bucket ("60s", "120s", "180s", "300s", "300s+") |
| `eval_mode` | `str` | Which evaluation pipeline was used ("raw", "calibrated", "calibrated_reject", "smoothed", "interval") |
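The `flip_rate` field can be illustrated directly from its definition above (transitions divided by total windows). This sketch is a stand-in; the library computes it with its own helper:

```python
def flip_rate(labels: list[str]) -> float:
    """Fraction of adjacent window pairs where the predicted label changes."""
    if len(labels) <= 1:
        return 0.0
    flips = sum(a != b for a, b in zip(labels, labels[1:]))
    return flips / len(labels)

# Two transitions (Coding->Meeting, Meeting->Coding) over five windows.
print(flip_rate(["Coding", "Coding", "Meeting", "Meeting", "Coding"]))  # 0.4
```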

RejectTuningResult

Result of sweeping reject thresholds on a validation set.

| Field | Type | Description |
| --- | --- | --- |
| `best_threshold` | `float` | Threshold maximizing accuracy on accepted windows within reject-rate bounds |
| `sweep` | `list[dict[str, float]]` | Per-threshold row with threshold, accuracy_on_accepted, reject_rate, coverage, macro_f1 |

Evaluation modes

The eval_mode parameter controls the evaluation pipeline:

| Mode | Calibrator | Reject | Smoothing | Description |
| --- | --- | --- | --- | --- |
| `"raw"` | No | Yes | No | Default; raw model probabilities with reject threshold |
| `"calibrated"` | Yes | No | No | Calibrated probabilities, no reject |
| `"calibrated_reject"` | Yes | Yes | No | Calibrated probabilities with reject threshold |
| `"smoothed"` | Yes | Yes | Yes | Calibrated + reject + rolling-majority smoothing |
| `"interval"` | Yes | Yes | Yes | Smoothed predictions aggregated into segments; interval-level accuracy |

Non-raw modes require a calibrator implementing the Calibrator protocol.
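The only method the evaluator calls on a calibrator is `calibrate(y_proba)`. A minimal implementation satisfying that shape might look like the following temperature scaler. This is an illustrative sketch, not the library's `TemperatureCalibrator`:

```python
import numpy as np

class SimpleTemperatureCalibrator:
    """Illustrative temperature scaling over a probability matrix."""

    def __init__(self, temperature: float) -> None:
        self.temperature = temperature

    def calibrate(self, y_proba: np.ndarray) -> np.ndarray:
        # Recover log-probabilities, divide by the temperature, then
        # renormalise with a softmax so rows still sum to 1.
        logits = np.log(np.clip(y_proba, 1e-12, 1.0)) / self.temperature
        exp = np.exp(logits - logits.max(axis=1, keepdims=True))
        return exp / exp.sum(axis=1, keepdims=True)

cal = SimpleTemperatureCalibrator(temperature=2.0)
out = cal.calibrate(np.array([[0.9, 0.1]]))
# Temperatures above 1 flatten the distribution, so the top probability drops.
```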

Acceptance checks

All checks must pass for a model to be promoted. Thresholds are defined in the module constants and align with docs/guide/acceptance.md:

| Check | Threshold | Description |
| --- | --- | --- |
| `macro_f1` | >= 0.65 | Overall macro-F1 |
| `weighted_f1` | >= 0.70 | Overall weighted-F1 |
| `breakidle_precision` | >= 0.95 | BreakIdle class precision |
| `breakidle_recall` | >= 0.90 | BreakIdle class recall |
| `no_class_below_50_precision` | >= 0.50 | Per-class precision floor |
| `reject_rate_bounds` | [0.05, 0.30] | Reject rate within window |
| `seen_user_f1` | >= 0.70 | Seen-user macro-F1 (when holdout users provided) |
| `unseen_user_f1` | >= 0.60 | Unseen-user macro-F1 (when holdout users provided) |
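The gates combine by conjunction: every named check must pass. The sketch below mirrors a subset of the thresholds from the table; the real logic lives in the module's `_check_acceptance` helper and covers all the checks above:

```python
def passes_acceptance(macro_f1: float, weighted_f1: float, reject_rate: float) -> bool:
    """Subset of the acceptance gates; the model is promoted only if all pass."""
    checks = {
        "macro_f1": macro_f1 >= 0.65,
        "weighted_f1": weighted_f1 >= 0.70,
        "reject_rate_bounds": 0.05 <= reject_rate <= 0.30,
    }
    return all(checks.values())

print(passes_acceptance(0.71, 0.76, 0.12))  # True
print(passes_acceptance(0.71, 0.76, 0.02))  # False: reject rate below the window
```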

Functions

evaluate_model

evaluate_model(
    model: lgb.Booster,
    test_df: pd.DataFrame,
    *,
    cat_encoders: dict[str, LabelEncoder] | None = None,
    holdout_users: Sequence[str] = (),
    reject_threshold: float = DEFAULT_REJECT_THRESHOLD,
    eval_mode: Literal["raw", "calibrated", "calibrated_reject", "smoothed", "interval"] = "raw",
    calibrator: Calibrator | None = None,
    smooth_window: int = DEFAULT_SMOOTH_WINDOW,
    schema_version: str | None = None,
) -> EvaluationReport

Runs comprehensive evaluation: overall metrics, per-class and per-user breakdowns, calibration curves, user stratification, slice metrics, unknown-category rates, probability-based calibration scalars, and acceptance checks. When holdout_users is non-empty, computes separate seen/unseen-user F1 scores.

The eval_mode parameter selects the evaluation pipeline (see table above). Non-raw modes require a calibrator.

schema_version selects which feature columns are treated as categorical when computing unknown_category_rates (via get_categorical_columns). When omitted, the version is inferred from the DataFrame. Callers loading a model bundle should pass metadata.schema_version so the column set matches training.

tune_reject_threshold

tune_reject_threshold(
    model: lgb.Booster,
    val_df: pd.DataFrame,
    *,
    cat_encoders: dict[str, LabelEncoder] | None = None,
    thresholds: Sequence[float] | None = None,
    reject_rate_min: float = 0.05,
    reject_rate_max: float = 0.30,
    calibrator: Calibrator | None = None,
    schema_version: str | None = None,
) -> RejectTuningResult

Sweeps candidate thresholds (default np.arange(0.10, 1.00, 0.05)) and picks the one that maximizes accuracy on accepted windows while keeping the reject rate within [reject_rate_min, reject_rate_max]. Falls back to DEFAULT_REJECT_THRESHOLD (0.55) if no candidate satisfies the bounds.

When calibrator is provided, raw probabilities are calibrated before extracting confidences for the threshold sweep. This ensures the threshold is tuned on the same probability space used at inference time.
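The sweep table is plain dicts, so it loads straight into a DataFrame for inspecting the accuracy/coverage trade-off. The rows below are made-up illustrative values with the same keys as `RejectTuningResult.sweep`, and the selection rule mirrors the one described above:

```python
import pandas as pd

# Illustrative sweep rows (not real results).
sweep = [
    {"threshold": 0.50, "accuracy_on_accepted": 0.81, "reject_rate": 0.03, "coverage": 0.97, "macro_f1": 0.66},
    {"threshold": 0.55, "accuracy_on_accepted": 0.84, "reject_rate": 0.12, "coverage": 0.88, "macro_f1": 0.68},
    {"threshold": 0.60, "accuracy_on_accepted": 0.86, "reject_rate": 0.35, "coverage": 0.65, "macro_f1": 0.69},
]
df = pd.DataFrame(sweep)

# Best accepted-window accuracy among thresholds whose reject rate
# stays inside the acceptance window [0.05, 0.30].
in_bounds = df[(df.reject_rate >= 0.05) & (df.reject_rate <= 0.30)]
best = in_bounds.loc[in_bounds.accuracy_on_accepted.idxmax()]
print(best.threshold)  # 0.55: 0.50 rejects too little, 0.60 rejects too much
```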

write_evaluation_artifacts

write_evaluation_artifacts(
    report: EvaluationReport,
    output_dir: Path,
) -> dict[str, Path]

Writes evaluation artifacts to disk:

| File | Content |
| --- | --- |
| `evaluation.json` | Full report as JSON |
| `calibration.json` | Per-class calibration curve data |
| `confusion_matrix.csv` | Labeled confusion matrix |
| `calibration.png` | Per-class calibration plots (optional, requires matplotlib) |

Returns a dict mapping artifact name to its written path.
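Because evaluation.json is a plain JSON dump of the report, it can be reloaded for downstream gating. A round-trip sketch with a stand-in report dict (not a real evaluation):

```python
import json
import tempfile
from pathlib import Path

# Stand-in for the report written by write_evaluation_artifacts.
with tempfile.TemporaryDirectory() as tmp:
    eval_path = Path(tmp) / "evaluation.json"
    eval_path.write_text(
        json.dumps({"macro_f1": 0.71, "acceptance_checks": {"macro_f1": True}})
    )
    report = json.loads(eval_path.read_text())
    print(report["macro_f1"], all(report["acceptance_checks"].values()))
```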

Usage

from pathlib import Path

from taskclf.train.evaluate import (
    evaluate_model,
    tune_reject_threshold,
    write_evaluation_artifacts,
)
from taskclf.core.model_io import load_model_bundle
from taskclf.infer.calibration import TemperatureCalibrator

model, metadata, cat_encoders = load_model_bundle(Path("models/run_001"))

# Raw evaluation (default)
raw_report = evaluate_model(
    model, test_df,
    cat_encoders=cat_encoders,
    holdout_users=["user-X"],
)
print(f"Macro F1: {raw_report.macro_f1:.4f}")
print(f"Flip rate: {raw_report.flip_rate:.4f}")

# Calibrated evaluation
cal = TemperatureCalibrator(temperature=1.2)
cal_report = evaluate_model(
    model, test_df,
    cat_encoders=cat_encoders,
    eval_mode="calibrated",
    calibrator=cal,
)
print(f"Calibrated F1: {cal_report.macro_f1:.4f}")

# Tune reject threshold on calibrated scores
result = tune_reject_threshold(
    model, val_df,
    cat_encoders=cat_encoders,
    calibrator=cal,
)
print(f"Best threshold: {result.best_threshold}")

# Write artifacts
paths = write_evaluation_artifacts(raw_report, Path("artifacts/eval"))

taskclf.train.evaluate

Full model evaluation pipeline: metrics, calibration, acceptance checks.

EvaluationReport

Bases: BaseModel

Comprehensive evaluation output for a trained model on a test set.

Source code in src/taskclf/train/evaluate.py
class EvaluationReport(BaseModel, frozen=True):
    """Comprehensive evaluation output for a trained model on a test set."""

    macro_f1: float
    weighted_f1: float
    per_class: dict[str, dict[str, float | int]]
    confusion_matrix: list[list[int]]
    label_names: list[str]
    per_user: dict[str, dict[str, float]]
    calibration: dict[str, dict[str, list[float]]]
    stratification: dict[str, Any]
    seen_user_f1: float | None = None
    unseen_user_f1: float | None = None
    reject_rate: float
    acceptance_checks: dict[str, bool]
    acceptance_details: dict[str, str]
    flip_rate: float | None = None
    segment_duration_distribution: dict[str, int] | None = None
    eval_mode: str = "raw"
    top_confusion_pairs: list[dict[str, str | int]] = Field(default_factory=list)
    expected_calibration_error: float = 0.0
    multiclass_brier_score: float = 0.0
    multiclass_log_loss: float = 0.0
    slice_metrics: dict[str, dict[str, dict[str, Any]]] = Field(default_factory=dict)
    unknown_category_rates: dict[str, Any] = Field(default_factory=dict)

RejectTuningResult

Bases: BaseModel

Result of sweeping reject thresholds on a validation set.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `best_threshold` | `float` | Threshold that maximises accuracy on accepted windows while keeping reject rate within acceptance bounds. |
| `sweep` | `list[dict[str, float]]` | One dict per candidate threshold, each with `threshold`, `accuracy_on_accepted`, `reject_rate`, `coverage`, and `macro_f1`. |

Source code in src/taskclf/train/evaluate.py
class RejectTuningResult(BaseModel, frozen=True):
    """Result of sweeping reject thresholds on a validation set.

    Attributes:
        best_threshold: Threshold that maximises accuracy on accepted
            windows while keeping reject rate within acceptance bounds.
        sweep: List of dicts, one per candidate threshold, each with
            ``threshold``, ``accuracy_on_accepted``, ``reject_rate``,
            ``coverage``, and ``macro_f1``.
    """

    best_threshold: float
    sweep: list[dict[str, float]]

evaluate_model(model, test_df, *, cat_encoders=None, holdout_users=(), reject_threshold=DEFAULT_REJECT_THRESHOLD, eval_mode='raw', calibrator=None, smooth_window=DEFAULT_SMOOTH_WINDOW, schema_version=None)

Run comprehensive evaluation of a trained model on a test set.

Computes overall metrics (macro-F1, weighted-F1), per-class precision / recall / F1, per-user macro-F1, calibration curves, user-stratification report, and acceptance-gate checks.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model` | `Booster` | Trained LightGBM booster. | required |
| `test_df` | `DataFrame` | Test DataFrame containing `FEATURE_COLUMNS`, `label`, and `user_id` columns. | required |
| `cat_encoders` | `dict[str, LabelEncoder] \| None` | Pre-fitted categorical encoders from the training run. | `None` |
| `holdout_users` | `Sequence[str]` | User IDs that were held out from training, used to split seen-vs-unseen evaluation. | `()` |
| `reject_threshold` | `float` | Max-probability below which a prediction is treated as rejected (Mixed/Unknown). | `DEFAULT_REJECT_THRESHOLD` |
| `eval_mode` | `Literal["raw", "calibrated", "calibrated_reject", "smoothed", "interval"]` | Evaluation pipeline to use. `"raw"` uses model probabilities directly. `"calibrated"` applies a calibrator before metrics (no reject). `"calibrated_reject"` applies calibrator + reject. `"smoothed"` adds rolling-majority smoothing after reject. `"interval"` aggregates smoothed predictions into segments and evaluates per-interval accuracy. | `'raw'` |
| `calibrator` | `Calibrator \| None` | Probability calibrator to apply in non-raw modes. Required when `eval_mode` is not `"raw"`. | `None` |
| `smooth_window` | `int` | Window size for rolling-majority smoothing. | `DEFAULT_SMOOTH_WINDOW` |
| `schema_version` | `str \| None` | `"v1"`, `"v2"`, or `"v3"`. When omitted, inferred from `test_df` before selecting categorical columns for the unknown-category-rate computation (see `get_categorical_columns`). | `None` |

Returns:

| Type | Description |
| --- | --- |
| `EvaluationReport` | A frozen `EvaluationReport` with all evaluation artifacts. |

Source code in src/taskclf/train/evaluate.py
def evaluate_model(
    model: lgb.Booster,
    test_df: pd.DataFrame,
    *,
    cat_encoders: dict[str, LabelEncoder] | None = None,
    holdout_users: Sequence[str] = (),
    reject_threshold: float = DEFAULT_REJECT_THRESHOLD,
    eval_mode: Literal[
        "raw", "calibrated", "calibrated_reject", "smoothed", "interval"
    ] = "raw",
    calibrator: Calibrator | None = None,
    smooth_window: int = DEFAULT_SMOOTH_WINDOW,
    schema_version: str | None = None,
) -> EvaluationReport:
    """Run comprehensive evaluation of a trained model on a test set.

    Computes overall metrics (macro-F1, weighted-F1), per-class precision /
    recall / F1, per-user macro-F1, calibration curves, user-stratification
    report, and acceptance-gate checks.

    Args:
        model: Trained LightGBM booster.
        test_df: Test DataFrame containing ``FEATURE_COLUMNS``, ``label``,
            and ``user_id`` columns.
        cat_encoders: Pre-fitted categorical encoders from the training run.
        holdout_users: User IDs that were held out from training, used to
            split seen-vs-unseen evaluation.
        reject_threshold: Max-probability below which a prediction is
            treated as rejected (``Mixed/Unknown``).
        eval_mode: Evaluation pipeline to use.  ``"raw"`` uses model
            probabilities directly.  ``"calibrated"`` applies a calibrator
            before metrics (no reject).  ``"calibrated_reject"`` applies
            calibrator + reject.  ``"smoothed"`` adds rolling-majority
            smoothing after reject.  ``"interval"`` aggregates smoothed
            predictions into segments and evaluates per-interval accuracy.
        calibrator: Probability calibrator to apply in non-raw modes.
            Required when *eval_mode* is not ``"raw"``.
        smooth_window: Window size for rolling-majority smoothing.
        schema_version: ``"v1"``, ``"v2"``, or ``"v3"``. When omitted, infer from
            ``test_df`` before selecting categorical columns for unknown-category-rate
            (see :func:`~taskclf.train.lgbm.get_categorical_columns`).

    Returns:
        A frozen :class:`EvaluationReport` with all evaluation artifacts.
    """
    le = LabelEncoder()
    le.fit(sorted(LABEL_SET_V1))
    label_names = list(le.classes_)

    resolved_schema_version = resolve_feature_schema_version(test_df, schema_version)
    y_proba = predict_proba(
        model,
        test_df,
        cat_encoders,
        schema_version=resolved_schema_version,
    )

    if eval_mode != "raw" and calibrator is not None:
        y_proba = calibrator.calibrate(y_proba)

    y_pred_indices = y_proba.argmax(axis=1)
    y_pred_labels = list(le.inverse_transform(y_pred_indices))

    apply_reject = eval_mode in ("raw", "calibrated_reject", "smoothed", "interval")
    if apply_reject:
        confidences = y_proba.max(axis=1)
        rejected = confidences < reject_threshold
        labels_for_metrics = [
            MIXED_UNKNOWN if rej else lbl for lbl, rej in zip(y_pred_labels, rejected)
        ]
    else:
        labels_for_metrics = list(y_pred_labels)

    if eval_mode in ("smoothed", "interval"):
        labels_for_metrics = rolling_majority(labels_for_metrics, smooth_window)

    y_true = list(test_df["label"].values)
    user_ids = list(test_df["user_id"].values)
    y_true_indices = le.transform(y_true)

    if eval_mode == "interval":
        bucket_starts = [
            datetime(2000, 1, 1) + timedelta(seconds=i * DEFAULT_BUCKET_SECONDS)
            for i in range(len(labels_for_metrics))
        ]
        pred_segments = segmentize(
            bucket_starts, labels_for_metrics, DEFAULT_BUCKET_SECONDS
        )
        true_segments = segmentize(bucket_starts, y_true, DEFAULT_BUCKET_SECONDS)

        interval_correct = 0
        interval_total = len(pred_segments)
        true_map = {s.start_ts: s.label for s in true_segments}
        for seg in pred_segments:
            gold = true_map.get(seg.start_ts)
            if gold is not None and gold == seg.label:
                interval_correct += 1
        interval_accuracy = (
            interval_correct / interval_total if interval_total > 0 else 0.0
        )

        metrics = {
            "macro_f1": round(interval_accuracy, 4),
            "weighted_f1": round(interval_accuracy, 4),
        }
        pc = per_class_metrics(y_true, labels_for_metrics, label_names)
    else:
        metrics = compute_metrics(y_true, labels_for_metrics, label_names)
        pc = per_class_metrics(y_true, labels_for_metrics, label_names)

    cm_df = confusion_matrix_df(y_true, labels_for_metrics, label_names)
    cm_list = cm_df.values.tolist()
    top_pairs = top_confusion_pairs(cm_list, label_names)
    ece = round(
        expected_calibration_error_multiclass(y_true_indices, y_proba, label_names),
        4,
    )
    brier = round(multiclass_brier_score(y_true_indices, y_proba), 4)
    ll = round(multiclass_log_loss_score(y_true_indices, y_proba), 4)
    slices = slice_metrics_by_columns(
        test_df,
        y_true,
        labels_for_metrics,
        label_names,
    )
    cat_cols = get_categorical_columns(resolved_schema_version)
    unknown_rates = unknown_category_rates(test_df, cat_encoders, cat_cols)

    pu = per_user_metrics(y_true, labels_for_metrics, user_ids, label_names)
    cal = calibration_curve_data(y_true_indices, y_proba, label_names)
    strat = user_stratification_report(user_ids, y_true, label_names)
    rr = reject_rate(labels_for_metrics, MIXED_UNKNOWN)

    fr = round(flap_rate(labels_for_metrics), 4)
    seg_dist = _segment_duration_distribution(labels_for_metrics)

    seen_f1: float | None = None
    unseen_f1: float | None = None
    holdout_set = set(holdout_users)

    if holdout_set:
        seen_mask = [uid not in holdout_set for uid in user_ids]
        unseen_mask = [uid in holdout_set for uid in user_ids]

        if any(seen_mask):
            seen_true = [y for y, m in zip(y_true, seen_mask) if m]
            seen_pred = [y for y, m in zip(labels_for_metrics, seen_mask) if m]
            seen_f1 = round(
                float(
                    f1_score(
                        seen_true,
                        seen_pred,
                        labels=label_names,
                        average="macro",
                        zero_division=0,
                    )
                ),
                4,
            )

        if any(unseen_mask):
            unseen_true = [y for y, m in zip(y_true, unseen_mask) if m]
            unseen_pred = [y for y, m in zip(labels_for_metrics, unseen_mask) if m]
            unseen_f1 = round(
                float(
                    f1_score(
                        unseen_true,
                        unseen_pred,
                        labels=label_names,
                        average="macro",
                        zero_division=0,
                    )
                ),
                4,
            )

    checks, check_details = _check_acceptance(
        metrics["macro_f1"],
        metrics["weighted_f1"],
        pc,
        rr,
        seen_f1,
        unseen_f1,
    )

    return EvaluationReport(
        macro_f1=metrics["macro_f1"],
        weighted_f1=metrics["weighted_f1"],
        per_class=pc,
        confusion_matrix=cm_list,
        label_names=label_names,
        per_user=pu,
        calibration=cal,
        stratification=strat,
        seen_user_f1=seen_f1,
        unseen_user_f1=unseen_f1,
        reject_rate=round(rr, 4),
        acceptance_checks=checks,
        acceptance_details=check_details,
        flip_rate=fr,
        segment_duration_distribution=seg_dist,
        eval_mode=eval_mode,
        top_confusion_pairs=top_pairs,
        expected_calibration_error=ece,
        multiclass_brier_score=brier,
        multiclass_log_loss=ll,
        slice_metrics=slices,
        unknown_category_rates=unknown_rates,
    )

tune_reject_threshold(model, val_df, *, cat_encoders=None, thresholds=None, reject_rate_min=_ACCEPT_REJECT_RATE_MIN, reject_rate_max=_ACCEPT_REJECT_RATE_MAX, calibrator=None, schema_version=None)

Sweep reject thresholds and pick the best one.

For each candidate threshold the function computes accuracy on accepted (non-rejected) windows, the reject rate, coverage (fraction of windows kept), and macro-F1. The best threshold is the one that maximises accuracy on accepted windows while keeping reject rate within [reject_rate_min, reject_rate_max].

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `model` | `Booster` | Trained LightGBM booster. | required |
| `val_df` | `DataFrame` | Validation DataFrame with `FEATURE_COLUMNS`, `label`, and `user_id` columns. | required |
| `cat_encoders` | `dict[str, LabelEncoder] \| None` | Pre-fitted categorical encoders. | `None` |
| `thresholds` | `Sequence[float] \| None` | Candidate thresholds to evaluate. Defaults to `np.arange(0.10, 1.00, 0.05)`. | `None` |
| `reject_rate_min` | `float` | Lower bound for acceptable reject rate. | `_ACCEPT_REJECT_RATE_MIN` |
| `reject_rate_max` | `float` | Upper bound for acceptable reject rate. | `_ACCEPT_REJECT_RATE_MAX` |
| `calibrator` | `Calibrator \| None` | When provided, raw probabilities are calibrated before extracting confidences for the threshold sweep, so the threshold is tuned on the same probability space used at inference time. | `None` |
| `schema_version` | `str \| None` | `"v1"`, `"v2"`, or `"v3"`. When omitted, inferred from `val_df`. | `None` |

Returns:

| Type | Description |
| --- | --- |
| `RejectTuningResult` | A `RejectTuningResult` with the optimal threshold and the full sweep table. |

Source code in src/taskclf/train/evaluate.py
def tune_reject_threshold(
    model: lgb.Booster,
    val_df: pd.DataFrame,
    *,
    cat_encoders: dict[str, LabelEncoder] | None = None,
    thresholds: Sequence[float] | None = None,
    reject_rate_min: float = _ACCEPT_REJECT_RATE_MIN,
    reject_rate_max: float = _ACCEPT_REJECT_RATE_MAX,
    calibrator: Calibrator | None = None,
    schema_version: str | None = None,
) -> RejectTuningResult:
    """Sweep reject thresholds and pick the best one.

    For each candidate threshold the function computes accuracy on
    accepted (non-rejected) windows, the reject rate, coverage (fraction
    of windows kept), and macro-F1.  The best threshold is the one that
    maximises accuracy on accepted windows while keeping reject rate
    within *[reject_rate_min, reject_rate_max]*.

    Args:
        model: Trained LightGBM booster.
        val_df: Validation DataFrame with ``FEATURE_COLUMNS``, ``label``,
            and ``user_id`` columns.
        cat_encoders: Pre-fitted categorical encoders.
        thresholds: Candidate thresholds to evaluate.  Defaults to
            ``np.arange(0.10, 1.00, 0.05)``.
        reject_rate_min: Lower bound for acceptable reject rate.
        reject_rate_max: Upper bound for acceptable reject rate.
        calibrator: When provided, raw probabilities are calibrated
            before extracting confidences for the threshold sweep.
            This ensures the threshold is tuned on the same probability
            space used at inference time.
        schema_version: ``"v1"``, ``"v2"``, or ``"v3"``. When omitted, infer from
            ``val_df``.

    Returns:
        A :class:`RejectTuningResult` with the optimal threshold and
        the full sweep table.
    """
    if thresholds is None:
        thresholds = list(np.round(np.arange(0.10, 1.00, 0.05), 2))

    le = LabelEncoder()
    le.fit(sorted(LABEL_SET_V1))
    label_names = list(le.classes_)

    resolved_schema_version = resolve_feature_schema_version(val_df, schema_version)
    y_proba = predict_proba(
        model,
        val_df,
        cat_encoders,
        schema_version=resolved_schema_version,
    )
    if calibrator is not None:
        y_proba = calibrator.calibrate(y_proba)
    y_pred_indices = y_proba.argmax(axis=1)
    y_pred_labels = np.array(le.inverse_transform(y_pred_indices))
    y_true = np.array(val_df["label"].values)
    confidences = y_proba.max(axis=1)

    sweep: list[dict[str, float]] = []
    best_threshold = DEFAULT_REJECT_THRESHOLD
    best_acc = -1.0

    for t in thresholds:
        rejected = confidences < t
        rr = float(rejected.mean())
        coverage = 1.0 - rr

        accepted_mask = ~rejected
        if accepted_mask.any():
            acc = float(
                accuracy_score(y_true[accepted_mask], y_pred_labels[accepted_mask])
            )
            mf1 = float(
                f1_score(
                    y_true[accepted_mask],
                    y_pred_labels[accepted_mask],
                    labels=label_names,
                    average="macro",
                    zero_division=0,
                )
            )
        else:
            acc = 0.0
            mf1 = 0.0

        sweep.append(
            {
                "threshold": round(float(t), 4),
                "accuracy_on_accepted": round(acc, 4),
                "reject_rate": round(rr, 4),
                "coverage": round(coverage, 4),
                "macro_f1": round(mf1, 4),
            }
        )

        if reject_rate_min <= rr <= reject_rate_max and acc > best_acc:
            best_acc = acc
            best_threshold = float(t)

    return RejectTuningResult(
        best_threshold=round(best_threshold, 4),
        sweep=sweep,
    )

write_evaluation_artifacts(report, output_dir)

Write evaluation report artifacts to disk.

Writes evaluation.json (full report), calibration.json (per-class calibration curve data), confusion_matrix.csv, and an optional calibration.png into output_dir.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `report` | `EvaluationReport` | A completed evaluation report. | required |
| `output_dir` | `Path` | Target directory (created if needed). | required |

Returns:

| Type | Description |
| --- | --- |
| `dict[str, Path]` | Dict mapping artifact name to its written path. |

Source code in src/taskclf/train/evaluate.py
def write_evaluation_artifacts(
    report: EvaluationReport,
    output_dir: Path,
) -> dict[str, Path]:
    """Write evaluation report artifacts to disk.

    Writes ``evaluation.json`` (full report) and ``calibration.json``
    (per-class calibration curve data) into *output_dir*.

    Args:
        report: A completed evaluation report.
        output_dir: Target directory (created if needed).

    Returns:
        Dict mapping artifact name to its written path.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    paths: dict[str, Path] = {}

    eval_path = output_dir / "evaluation.json"
    eval_path.write_text(json.dumps(report.model_dump(), indent=2, default=str))
    paths["evaluation"] = eval_path

    cal_path = output_dir / "calibration.json"
    cal_path.write_text(json.dumps(report.calibration, indent=2))
    paths["calibration"] = cal_path

    cm_df = pd.DataFrame(
        report.confusion_matrix,
        index=report.label_names,
        columns=report.label_names,
    )
    cm_path = output_dir / "confusion_matrix.csv"
    cm_df.to_csv(cm_path)
    paths["confusion_matrix"] = cm_path

    try:
        import matplotlib

        matplotlib.use("Agg")
        import matplotlib.pyplot as plt

        fig, axes = plt.subplots(2, 4, figsize=(16, 8))
        axes_flat = axes.flatten()
        for idx, name in enumerate(report.label_names):
            ax = axes_flat[idx]
            cal = report.calibration.get(name, {})
            frac = cal.get("fraction_of_positives", [])
            mean_pred = cal.get("mean_predicted_value", [])
            ax.plot([0, 1], [0, 1], "k--", alpha=0.5)
            if frac and mean_pred:
                ax.plot(mean_pred, frac, "s-")
            ax.set_title(name, fontsize=9)
            ax.set_xlim(0, 1)
            ax.set_ylim(0, 1)
            ax.set_xlabel("Mean predicted", fontsize=7)
            ax.set_ylabel("Fraction positive", fontsize=7)

        for idx in range(len(report.label_names), len(axes_flat)):
            axes_flat[idx].set_visible(False)

        fig.suptitle("Per-Class Calibration Curves")
        fig.tight_layout()
        plot_path = output_dir / "calibration.png"
        fig.savefig(plot_path, dpi=100)
        plt.close(fig)
        paths["calibration_plot"] = plot_path
    except Exception:
        logger.debug("Calibration plot generation failed", exc_info=True)

    return paths