core.metrics¶
Evaluation metrics for model assessment: macro-F1, per-class precision/recall, confusion matrices, calibration curves, reject-rate analysis, and per-user breakdowns. All metric functions accept string-typed labels and the ordered label vocabulary, returning plain dicts suitable for JSON serialisation and artifact storage.
Function overview¶
| Function | Purpose |
|---|---|
| `compute_metrics` | Macro-F1, weighted-F1, and confusion matrix |
| `class_distribution` | Per-class counts and fractions |
| `confusion_matrix_df` | Labelled confusion matrix as a DataFrame |
| `per_class_metrics` | Per-class precision, recall, F1, and optional support |
| `top_confusion_pairs` | Largest off-diagonal confusion counts (ranked) |
| `expected_calibration_error_multiclass` | OVR binary ECE, weighted by class support |
| `multiclass_brier_score` | One-hot vs predicted probability MSE |
| `multiclass_log_loss_score` | Multiclass log loss (clipped probabilities) |
| `slice_metrics_by_columns` | Macro/weighted F1 and per-class metrics per slice |
| `unknown_category_rates` | Share of rows with unseen categorical values vs encoders |
| `reject_rate` | Fraction of predictions equal to the reject label |
| `compare_baselines` | Side-by-side comparison of multiple prediction methods |
| `per_user_metrics` | Macro-F1 and per-class F1 grouped by user |
| `calibration_curve_data` | Per-class reliability diagram data |
| `user_stratification_report` | Training-set imbalance analysis per user |
| `reject_rate_by_group` | Reject rate by (user, date) with drift flags |
compute_metrics¶
Primary evaluation entry point. Returns aggregate scores and the full confusion matrix for a single set of predictions.
| Return key | Type | Description |
|---|---|---|
| `macro_f1` | `float` | Unweighted mean F1 across classes |
| `weighted_f1` | `float` | Support-weighted mean F1 |
| `confusion_matrix` | `list[list[int]]` | Row = true, column = predicted |
| `label_names` | `list[str]` | Label order matching matrix axes |
```python
from taskclf.core.metrics import compute_metrics
from taskclf.core.types import LABEL_SET_V1

result = compute_metrics(y_true, y_pred, sorted(LABEL_SET_V1))
print(f"Macro-F1: {result['macro_f1']:.4f}")
```
class_distribution¶
Reports how many samples belong to each class, useful for detecting label imbalance before training.
Returns a dict mapping each label to {"count": int, "fraction": float}.
Labels absent from y_true appear with count 0.
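The counting logic can be sketched in a few lines of plain Python. This is an illustrative stand-in, not the library function; `class_distribution_sketch` is a hypothetical name:

```python
from collections import Counter

def class_distribution_sketch(y_true, label_names):
    """Rough stand-in for class_distribution: per-label counts and fractions."""
    counts = Counter(y_true)
    total = len(y_true)
    return {
        label: {
            "count": counts.get(label, 0),
            "fraction": counts.get(label, 0) / total if total else 0.0,
        }
        for label in label_names
    }

dist = class_distribution_sketch(["Build", "Build", "Meet"], ["Build", "Meet", "Other"])
# "Other" is absent from y_true, so it appears with count 0.
```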
confusion_matrix_df¶
Wraps sklearn.metrics.confusion_matrix into a pd.DataFrame with
label_names as both the row index (true labels) and column index
(predicted labels). Convenient for CSV export or display.
per_class_metrics¶
Returns per-class precision, recall, and F1 as a nested dict. By default
each class also includes support: the number of true instances of
that class in y_true (same notion as sklearn). Pass
include_support=False to omit support for callers that only need P/R/F1.
```python
{
    "Build": {"precision": 0.85, "recall": 0.90, "f1": 0.87, "support": 120},
    "Meet": {"precision": 0.92, "recall": 0.88, "f1": 0.90, "support": 45},
    ...
}
```
Uses zero_division=0 so classes with no predictions get 0.0 instead
of warnings.
top_confusion_pairs¶
Takes a square confusion matrix and label order; returns up to k
off-diagonal pairs {"true_label", "pred_label", "count"} sorted by
count descending. Used for bundle inspection and evaluation reports.
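The ranking step amounts to collecting non-zero off-diagonal cells and sorting. A minimal sketch (`top_confusion_pairs_sketch` is a hypothetical stand-in for the library function):

```python
def top_confusion_pairs_sketch(cm, label_names, k=20):
    """Rank off-diagonal confusion counts, largest first (illustrative sketch)."""
    pairs = [
        {"true_label": label_names[i], "pred_label": label_names[j], "count": cm[i][j]}
        for i in range(len(label_names))
        for j in range(len(label_names))
        if i != j and cm[i][j] > 0
    ]
    return sorted(pairs, key=lambda p: p["count"], reverse=True)[:k]

cm = [[50, 3, 0], [7, 40, 1], [0, 2, 30]]
pairs = top_confusion_pairs_sketch(cm, ["Build", "Meet", "Other"])
# Largest confusion: true Meet predicted as Build (7), then Build -> Meet (3).
```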
expected_calibration_error_multiclass¶
Computes a support-weighted mean of one-vs-rest binary expected
calibration error (uniform probability bins) across classes. Requires
integer-encoded true labels and a probability matrix (n_samples,
n_classes).
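The per-class building block is plain binary ECE; a minimal sketch of that piece, assuming uniform bins over [0, 1] (the real function then averages these per-class values weighted by class support):

```python
def binary_ece_sketch(y_true_bin, probs, n_bins=10):
    """One-vs-rest binary ECE with uniform probability bins (illustrative)."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true_bin, probs):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((y, p))
    n = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        acc = sum(y for y, _ in bucket) / len(bucket)   # observed positive rate
        conf = sum(p for _, p in bucket) / len(bucket)  # mean predicted probability
        ece += (len(bucket) / n) * abs(acc - conf)      # support-weighted gap
    return ece
```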
multiclass_brier_score / multiclass_log_loss_score¶
Probability-based scores aligned with the same y_proba used in
train.evaluate. Log loss uses clipped
probabilities to avoid log(0).
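Both scores reduce to simple arithmetic over the probability matrix. A pure-Python sketch under assumed normalisation (the library may average differently, e.g. summing over classes before averaging over samples):

```python
import math

def multiclass_brier_sketch(y_true_indices, y_proba):
    """MSE between one-hot true labels and predicted probabilities (sketch)."""
    n, k = len(y_proba), len(y_proba[0])
    total = 0.0
    for i, row in enumerate(y_proba):
        for j, p in enumerate(row):
            target = 1.0 if j == y_true_indices[i] else 0.0
            total += (p - target) ** 2
    return total / (n * k)

def multiclass_log_loss_sketch(y_true_indices, y_proba, eps=1e-15):
    """Mean negative log probability of the true class, with clipping (sketch)."""
    total = 0.0
    for i, row in enumerate(y_proba):
        p = min(max(row[y_true_indices[i]], eps), 1 - eps)  # clip to avoid log(0)
        total -= math.log(p)
    return total / len(y_proba)
```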
slice_metrics_by_columns¶
Default slice columns are
user_id, app_id, app_category, domain_category, hour_of_day
(:data:~taskclf.core.metrics.DEFAULT_SLICE_COLUMNS), intersected with
columns present in the frame. For each slice value (top groups by
frequency, capped per column), returns row count, macro/weighted F1,
reject rate, and per-class metrics.
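The frequency cap can be sketched with a counter; `top_slice_values` is a hypothetical helper illustrating the truncation step, not part of the library:

```python
from collections import Counter

def top_slice_values(column_values, max_groups=100):
    """Pick the most frequent slice values for one column (illustrative)."""
    counts = Counter(column_values)
    return [value for value, _ in counts.most_common(max_groups)]

values = top_slice_values(
    ["chrome", "chrome", "slack", "vim", "chrome", "slack"], max_groups=2
)
# Only the two most frequent apps survive the cap.
```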
unknown_category_rates¶
For each evaluated categorical column, reports the fraction of rows
whose string value is not in the fitted LabelEncoder.classes_
(what becomes __unknown__ / legacy -1 at encode time). Returns
per_column, overall_rate (mean of evaluated columns), and
columns_evaluated. Column set should match the feature schema (see
train.lgbm.get_categorical_columns).
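At its core this is a membership check against the fitted vocabulary. A minimal sketch for a single column (`unknown_rate_sketch` is a hypothetical stand-in; the real function works against `LabelEncoder.classes_` per column and aggregates):

```python
def unknown_rate_sketch(values, known_vocabulary):
    """Fraction of values a fitted encoder has never seen (illustrative)."""
    if not values:
        return 0.0
    known = set(known_vocabulary)
    return sum(1 for v in values if v not in known) / len(values)

rate = unknown_rate_sketch(
    ["slack", "chrome", "newapp"], ["slack", "chrome", "__unknown__"]
)
# "newapp" is outside the fitted vocabulary, so 1 of 3 rows is unknown.
```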
reject_rate¶
Computes the fraction of predictions matching the reject label
(default MIXED_UNKNOWN from core.defaults).
Returns 0.0 for empty input.
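The behaviour is a one-line fraction with an empty-input guard; a sketch (the default shown here takes the reject label explicitly rather than assuming the literal value of `MIXED_UNKNOWN`):

```python
def reject_rate_sketch(labels, reject_label):
    """Fraction of predictions equal to the reject label; 0.0 for empty input."""
    if not labels:
        return 0.0
    return sum(1 for label in labels if label == reject_label) / len(labels)
```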
compare_baselines¶
Evaluates multiple prediction methods against the same ground truth in
a single call. Each method receives its own macro_f1, weighted_f1,
reject_rate, per_class breakdown, and confusion_matrix.
```python
from taskclf.core.metrics import compare_baselines

results = compare_baselines(
    y_true,
    {"lgbm": lgbm_preds, "majority": majority_preds},
    label_names,
)
for name, m in results.items():
    print(f"{name}: F1={m['macro_f1']:.4f} reject={m['reject_rate']:.2%}")
```
The label vocabulary is extended with reject_label if it is not
already present, so reject predictions are counted in the matrix.
per_user_metrics¶
Groups predictions by user_ids and computes per-user macro-F1 plus
per-class F1 scores. Useful for identifying users whose data the
model struggles with.
Each user entry contains macro_f1, count, and {label}_f1 keys.
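The grouping step before scoring can be sketched as follows (`group_by_user` is a hypothetical helper; the real function then scores each user's pair of lists):

```python
from collections import defaultdict

def group_by_user(y_true, y_pred, user_ids):
    """Regroup aligned prediction arrays per user before scoring (illustrative)."""
    grouped = defaultdict(lambda: ([], []))
    for yt, yp, uid in zip(y_true, y_pred, user_ids):
        grouped[uid][0].append(yt)  # true labels for this user
        grouped[uid][1].append(yp)  # predictions for this user
    return dict(grouped)

groups = group_by_user(
    ["Build", "Meet", "Build"], ["Build", "Build", "Build"], ["u1", "u2", "u1"]
)
```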
calibration_curve_data¶
Generates per-class reliability diagram data using one-vs-rest
binarization. Requires integer-encoded true labels and a probability
matrix (n_samples, n_classes).
| Return key (per class) | Type | Description |
|---|---|---|
| `fraction_of_positives` | `list[float]` | Observed positive fraction per bin |
| `mean_predicted_value` | `list[float]` | Mean predicted probability per bin |
Classes with zero positive samples return empty lists.
user_stratification_report¶
Analyses per-user contribution to the training set. Flags users whose
row fraction exceeds dominance_threshold (default 0.5) as dominant,
emitting human-readable warnings.
| Return key | Type | Description |
|---|---|---|
| `per_user` | `dict` | Per-user count, fraction, label_distribution |
| `total_rows` | `int` | Total rows in the dataset |
| `user_count` | `int` | Number of distinct users |
| `warnings` | `list[str]` | Dominance warnings (empty if balanced) |
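The dominance check itself is a simple fraction comparison; a minimal sketch given per-user row counts (`dominance_warnings_sketch` is a hypothetical stand-in, and the warning wording is illustrative):

```python
def dominance_warnings_sketch(user_counts, dominance_threshold=0.5):
    """Flag users whose row fraction exceeds the threshold (illustrative)."""
    total = sum(user_counts.values())
    warnings = []
    for user, count in user_counts.items():
        fraction = count / total if total else 0.0
        if fraction > dominance_threshold:
            warnings.append(f"user {user} contributes {fraction:.0%} of training rows")
    return warnings

msgs = dominance_warnings_sketch({"u1": 900, "u2": 100})
# u1 holds 90% of rows, above the 0.5 default, so one warning is emitted.
```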
reject_rate_by_group¶
Computes reject rate grouped by (user_id, date) for drift detection.
Groups whose reject rate exceeds global_reject_rate * spike_multiplier
(default 2.0) are added to drift_flags.
| Return key | Type | Description |
|---|---|---|
| `global_reject_rate` | `float` | Overall reject fraction |
| `per_group` | `dict` | Keyed by `"user_id\|YYYY-MM-DD"` with `reject_rate`, `total`, `rejected` |
| `drift_flags` | `list[str]` | Group keys that exceed the spike threshold |
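The spike rule can be sketched end to end in plain Python. `reject_spikes_sketch` is a hypothetical stand-in that takes pre-formatted date strings and an explicit reject label rather than assuming the value of `MIXED_UNKNOWN`:

```python
from collections import defaultdict

def reject_spikes_sketch(labels, user_ids, dates, reject_label="R", spike_multiplier=2.0):
    """Flag (user, date) groups whose reject rate spikes above the global rate."""
    if not labels:
        return []
    global_rate = sum(1 for label in labels if label == reject_label) / len(labels)
    groups = defaultdict(lambda: [0, 0])  # key -> [rejected, total]
    for label, uid, date in zip(labels, user_ids, dates):
        key = f"{uid}|{date}"
        groups[key][0] += int(label == reject_label)
        groups[key][1] += 1
    return [
        key
        for key, (rejected, total) in groups.items()
        if total and rejected / total > global_rate * spike_multiplier
    ]

labels = ["A"] * 8 + ["R", "R", "A", "A"]
users = ["u1"] * 8 + ["u2"] * 4
dates = ["2024-01-01"] * 12
flags = reject_spikes_sketch(labels, users, dates)
# Global rate is 2/12; u2's group rejects 2/4, above 2x the global rate.
```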
See also¶
- `train.evaluate` -- model evaluation pipeline that calls these functions
- `infer.baseline` -- baseline comparisons using `compare_baselines`
- `core.defaults` -- `MIXED_UNKNOWN` reject label constant
taskclf.core.metrics¶
Evaluation metrics: macro-F1, confusion matrices, calibration, and per-user helpers.
compute_metrics(y_true, y_pred, label_names)¶
Return macro-F1, weighted-F1, and a nested confusion matrix.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `y_true` | `Sequence[str]` | Ground-truth label strings. | required |
| `y_pred` | `Sequence[str]` | Predicted label strings. | required |
| `label_names` | `Sequence[str]` | Ordered label vocabulary (defines row/column order of the matrix). | required |
Returns:
| Type | Description |
|---|---|
| `dict` | Dict with keys `macro_f1`, `weighted_f1`, `confusion_matrix`, and `label_names` (list of str). |
Source code in src/taskclf/core/metrics.py
class_distribution(y_true, label_names)¶
Per-class counts and fractions for imbalance reporting.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `y_true` | `Sequence[str]` | Ground-truth label strings. | required |
| `label_names` | `Sequence[str]` | Full label vocabulary (defines which classes appear in the output, even if absent from y_true). | required |
Returns:
| Type | Description |
|---|---|
| `dict[str, dict[str, float \| int]]` | Dict mapping each label to `{"count": int, "fraction": float}`. Fractions sum to 1.0 (within rounding tolerance). If y_true is empty, all fractions are 0.0. |
Source code in src/taskclf/core/metrics.py
confusion_matrix_df(y_true, y_pred, label_names)¶
Build a labelled confusion-matrix DataFrame suitable for CSV export.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `y_true` | `Sequence[str]` | Ground-truth label strings. | required |
| `y_pred` | `Sequence[str]` | Predicted label strings. | required |
| `label_names` | `Sequence[str]` | Ordered label vocabulary (used as both row and column index of the resulting DataFrame). | required |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | Square DataFrame with label_names as row and column labels. |
Source code in src/taskclf/core/metrics.py
reject_rate(labels, reject_label=MIXED_UNKNOWN)¶
Fraction of labels that equal reject_label.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `labels` | `Sequence[str]` | Predicted label strings. | required |
| `reject_label` | `str` | The label treated as a reject / unknown. | `MIXED_UNKNOWN` |
Returns:
| Type | Description |
|---|---|
| `float` | A float in [0.0, 1.0]; 0.0 for empty input. |
Source code in src/taskclf/core/metrics.py
per_class_metrics(y_true, y_pred, label_names, *, include_support=True)¶
Per-class precision, recall, F1, and optionally support (true-class counts).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `y_true` | `Sequence[str]` | Ground-truth label strings. | required |
| `y_pred` | `Sequence[str]` | Predicted label strings. | required |
| `label_names` | `Sequence[str]` | Ordered label vocabulary. | required |
| `include_support` | `bool` | When `True`, include each class's support (number of true instances in y_true). | `True` |
Returns:
| Type | Description |
|---|---|
| `dict[str, dict[str, float \| int]]` | Dict mapping each label to precision, recall, f1, and optionally support. |
Source code in src/taskclf/core/metrics.py
compare_baselines(y_true, predictions, label_names, reject_label=MIXED_UNKNOWN)¶
Compare multiple prediction methods against the same ground truth.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `y_true` | `Sequence[str]` | Ground-truth label strings. | required |
| `predictions` | `Mapping[str, Sequence[str]]` | Mapping of method name to predicted label strings. | required |
| `label_names` | `Sequence[str]` | Ordered core label vocabulary. | required |
| `reject_label` | `str` | The label treated as a reject. | `MIXED_UNKNOWN` |
Returns:
| Type | Description |
|---|---|
| `dict[str, dict]` | Dict keyed by method name, each containing `macro_f1`, `weighted_f1`, `reject_rate`, `per_class`, and `confusion_matrix`. |
Source code in src/taskclf/core/metrics.py
per_user_metrics(y_true, y_pred, user_ids, label_names)¶
Compute macro-F1 and per-class F1 grouped by user.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `y_true` | `Sequence[str]` | Ground-truth label strings (one per window). | required |
| `y_pred` | `Sequence[str]` | Predicted label strings (same length as y_true). | required |
| `user_ids` | `Sequence[str]` | User identifier per window (same length as y_true). | required |
| `label_names` | `Sequence[str]` | Ordered label vocabulary. | required |
Returns:
| Type | Description |
|---|---|
| `dict[str, dict[str, float]]` | Dict keyed by user_id, each containing `macro_f1`, `count`, and per-class `{label}_f1` keys. |
Source code in src/taskclf/core/metrics.py
calibration_curve_data(y_true_indices, y_proba, label_names, *, n_bins=10)¶
Per-class calibration curve data for reliability diagrams.
Uses one-vs-rest binarization so each class gets its own curve.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `y_true_indices` | `ndarray` | Integer-encoded true labels (shape `(n_samples,)`). | required |
| `y_proba` | `ndarray` | Predicted probability matrix (shape `(n_samples, n_classes)`). | required |
| `label_names` | `Sequence[str]` | Ordered label vocabulary matching columns of y_proba. | required |
| `n_bins` | `int` | Number of probability bins. | `10` |
Returns:
| Type | Description |
|---|---|
| `dict[str, dict[str, list[float]]]` | Dict keyed by label name, each containing `fraction_of_positives` and `mean_predicted_value` lists. |
Source code in src/taskclf/core/metrics.py
top_confusion_pairs(cm, label_names, *, k=20)¶
Rank largest off-diagonal confusion counts (true_class -> pred_class).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `cm` | `list[list[int]] \| ndarray` | Square confusion matrix (rows true, columns predicted). | required |
| `label_names` | `Sequence[str]` | Label order for rows/columns. | required |
| `k` | `int` | Maximum number of pairs to return. | `20` |
Returns:
| Type | Description |
|---|---|
| `list[dict[str, str \| int]]` | List of dicts with `true_label`, `pred_label`, and `count`, sorted by count descending (off-diagonal only). |
Source code in src/taskclf/core/metrics.py
expected_calibration_error_multiclass(y_true_indices, y_proba, label_names, *, n_bins=10)¶
Weighted mean of one-vs-rest binary ECE across classes with support.
Source code in src/taskclf/core/metrics.py
multiclass_brier_score(y_true_indices, y_proba)¶
Mean squared error between one-hot true labels and predicted probabilities.
Source code in src/taskclf/core/metrics.py
multiclass_log_loss_score(y_true_indices, y_proba, *, eps=1e-15)¶
Multiclass log loss with clipped probabilities.
Source code in src/taskclf/core/metrics.py
slice_metrics_by_columns(df, y_true, y_pred, label_names, slice_columns=None, *, max_groups_per_column=100, reject_label=MIXED_UNKNOWN)¶
Per-slice macro/weighted F1, reject rate, and row counts.
For each column, groups are sorted by frequency and truncated to max_groups_per_column to keep output bounded when cardinality is high.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | Feature rows aligned with y_true / y_pred. | required |
| `y_true` | `Sequence[str]` | Ground-truth labels. | required |
| `y_pred` | `Sequence[str]` | Predicted labels (after reject/smoothing if applicable). | required |
| `label_names` | `Sequence[str]` | Core label vocabulary for sklearn metrics. | required |
| `slice_columns` | `Sequence[str] \| None` | Columns to slice by; defaults to `DEFAULT_SLICE_COLUMNS`. | `None` |
| `max_groups_per_column` | `int` | Max distinct slice values per column. | `100` |
| `reject_label` | `str` | Label counted as rejected for per-slice reject_rate. | `MIXED_UNKNOWN` |
Returns:
| Type | Description |
|---|---|
| `dict[str, dict[str, dict[str, Any]]]` | Nested dict keyed by column, then slice value, each entry holding row count, macro/weighted F1, reject rate, and per-class metrics. |
Source code in src/taskclf/core/metrics.py
unknown_category_rates(df, cat_encoders, categorical_columns)¶
Fraction of rows where a categorical maps to unknown or legacy -1 encoding.
Mirrors inference-time behavior in `taskclf.train.lgbm.encode_categoricals`: values not in the fitted encoder vocabulary map to __unknown__ when present in the vocabulary, else -1.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | Feature rows (same rows as evaluation). | required |
| `cat_encoders` | `dict[str, LabelEncoder] \| None` | Fitted encoders from the training bundle (may be empty). | required |
| `categorical_columns` | `Sequence[str]` | Categorical column names for this schema (e.g. from `train.lgbm.get_categorical_columns`). | required |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | Dict with `per_column`, `overall_rate` (mean of rates over columns present), and `columns_evaluated`. |
Source code in src/taskclf/core/metrics.py
user_stratification_report(user_ids, labels, label_names, *, dominance_threshold=0.5)¶
Analyse per-user contribution to the training set and flag imbalance.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `user_ids` | `Sequence[str]` | User identifier per row. | required |
| `labels` | `Sequence[str]` | Label per row. | required |
| `label_names` | `Sequence[str]` | Ordered label vocabulary. | required |
| `dominance_threshold` | `float` | Fraction above which a single user is considered dominant and a warning is emitted. | `0.5` |
Returns:
| Type | Description |
|---|---|
| `dict` | Dict with `per_user`, `total_rows`, `user_count`, and `warnings` (one warning for any user exceeding dominance_threshold). |
Source code in src/taskclf/core/metrics.py
reject_rate_by_group(labels, user_ids, timestamps, *, reject_label=MIXED_UNKNOWN, spike_multiplier=2.0)¶
Compute reject rate grouped by (user_id, date) for drift detection.
A group is flagged as a drift signal when its reject rate exceeds spike_multiplier times the global reject rate.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `labels` | `Sequence[str]` | Predicted label strings (may include reject_label). | required |
| `user_ids` | `Sequence[str]` | User identifier per window. | required |
| `timestamps` | `Sequence` | Timestamp per window (any value parseable as a date). | required |
| `reject_label` | `str` | The label treated as a reject / unknown. | `MIXED_UNKNOWN` |
| `spike_multiplier` | `float` | A group's reject rate must exceed `global_reject_rate * spike_multiplier` to be flagged. | `2.0` |
Returns:
| Type | Description |
|---|---|
| `dict` | Dict with `global_reject_rate`, `per_group`, and `drift_flags`. |
Source code in src/taskclf/core/metrics.py