Acceptance Criteria & Quality Gates v1¶

Version: 1.0 Status: Stable Last Updated: 2026-02-23

This document defines measurable quality thresholds required for:

Baseline release
Model promotion
Personalization activation
Retraining approval

Models must meet these criteria before deployment.

1. Baseline (Heuristic) Acceptance¶

Before ML deployment, heuristic baseline must satisfy:

BreakIdle precision ≥ 0.95
Reject rate ≤ 40%
No catastrophic misclassification of idle as Build/Write

Baseline establishes minimum acceptable behavior.

2. Global Model Minimum Performance¶

Evaluated on test set.

2.1 Core Requirements¶

Macro F1 ≥ 0.65
Weighted F1 ≥ 0.70
BreakIdle precision ≥ 0.95
BreakIdle recall ≥ 0.90
No class precision < 0.50 (except Meet in cold-start users)

If any class precision < 0.50: - Model must not be promoted.

2.2 Seen vs Unseen Users¶

Seen users:

Macro F1 ≥ 0.70

Unseen users:

Macro F1 ≥ 0.60

If unseen user F1 < 0.55: - Cold-start UX must default to heuristic assist mode.

3. Calibration Requirements¶

After calibration:

Brier score improvement ≥ 5% over raw model OR
Reliability curve visually closer to diagonal

Overconfidence is not allowed.

4. Reject Rate Bounds¶

Reject rate must satisfy:

≥ 5% (avoid overconfidence)
≤ 30% (avoid unusable system)

If reject rate > 35%: - Model considered underfit or feature insufficient.

5. Label Stability (Flap Rate)¶

Define flap rate:

Number of label changes / total windows

Acceptance:

Flap rate ≤ 0.25 before smoothing
≤ 0.15 after smoothing

High flap rate indicates unstable predictions.

6. Smoothing Acceptance¶

After block merging:

≥ 80% of blocks ≥ MIN_BLOCK_DURATION
No more than 10% of blocks shorter than 2 minutes

7. Drift Monitoring Thresholds¶

Trigger investigation if:

Macro F1 drops by ≥ 10% relative to previous model
Reject rate increases by ≥ 10%
Feature PSI > 0.2 for any major feature
Class distribution shift > 15%

8. Personalization Activation Criteria¶

Per-user calibration enabled only if:

≥ 200 labeled windows
≥ 3 separate days of data
≥ 3 distinct core labels observed

Otherwise: - Use global calibration.

9. Training Reproducibility¶

Each promoted model must include:

Model artifact hash
Dataset snapshot hash
Feature schema version
Label schema version
Config parameters
Training timestamp

If training cannot be reproduced, model cannot be promoted.

10. Safety Rules¶

The following errors are blockers:

BreakIdle frequently misclassified as Build
System crashes on missing feature
Probability vector does not sum to 1.0
Inconsistent label ordering across inference calls

11. Production Promotion Checklist¶

Before release:

All acceptance criteria met
Drift test passed on most recent week
Manual sanity check performed on 3 users
Documentation updated

12. Versioning¶

Any change to:

Threshold values
Required metrics
Reject bounds
Personalization activation conditions

Requires version bump.