Acceptance Criteria & Quality Gates v1
Version: 1.0 · Status: Stable · Last Updated: 2026-02-23
This document defines measurable quality thresholds required for:
- Baseline release
- Model promotion
- Personalization activation
- Retraining approval
Models must meet these criteria before deployment.
1. Baseline (Heuristic) Acceptance
Before any ML deployment, the heuristic baseline must satisfy:
- BreakIdle precision ≥ 0.95
- Reject rate ≤ 40%
- No catastrophic misclassification of idle as Build/Write
Baseline establishes minimum acceptable behavior.
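The baseline gates above can be checked mechanically. A minimal sketch; the function and argument names are illustrative, not part of the spec:

```python
# Hypothetical gate check for the heuristic baseline (Section 1).
def baseline_passes(break_idle_precision: float, reject_rate: float,
                    idle_as_active_errors: int) -> bool:
    """True only when every baseline acceptance gate is met."""
    return (
        break_idle_precision >= 0.95     # BreakIdle precision gate
        and reject_rate <= 0.40          # reject-rate gate
        and idle_as_active_errors == 0   # no idle misclassified as Build/Write
    )
```

For example, a baseline with BreakIdle precision 0.96 and reject rate 0.45 fails on the reject-rate bound alone.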
2. Global Model Minimum Performance
All metrics are evaluated on the held-out test set.
2.1 Core Requirements
- Macro F1 ≥ 0.65
- Weighted F1 ≥ 0.70
- BreakIdle precision ≥ 0.95
- BreakIdle recall ≥ 0.90
- No class precision < 0.50 (except Meet for cold-start users)
If any class precision falls below 0.50, the model must not be promoted.
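As a sketch, the core promotion gate could be evaluated like this; the `metrics` dict layout and the `cold_start` flag are assumptions of the example, not part of the spec:

```python
def global_model_passes(metrics: dict, cold_start: bool = False) -> bool:
    """Apply the Section 2.1 gates to a metrics dict (illustrative layout)."""
    if metrics["macro_f1"] < 0.65 or metrics["weighted_f1"] < 0.70:
        return False
    if metrics["break_idle_precision"] < 0.95 or metrics["break_idle_recall"] < 0.90:
        return False
    for label, precision in metrics["per_class_precision"].items():
        if cold_start and label == "Meet":
            continue  # Meet is exempt for cold-start users
        if precision < 0.50:
            return False  # any class below 0.50 precision blocks promotion
    return True
```

The per-class loop makes the exemption explicit: only Meet, and only under the cold-start flag, may dip below 0.50.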
2.2 Seen vs Unseen Users
Seen users:
- Macro F1 ≥ 0.70
Unseen users:
- Macro F1 ≥ 0.60
If unseen-user macro F1 < 0.55, the cold-start UX must default to heuristic assist mode.
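The cold-start fallback is a simple threshold decision. A sketch; the mode names are illustrative:

```python
def cold_start_mode(unseen_macro_f1: float) -> str:
    """Choose the UX mode for new users from unseen-user macro F1."""
    if unseen_macro_f1 < 0.55:
        return "heuristic_assist"  # below the 0.55 floor: fall back
    return "model"                 # full model-driven UX
```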
3. Calibration Requirements
After calibration:
- Brier score improvement ≥ 5% over the raw model, OR
- Reliability curve visibly closer to the diagonal
Overconfidence is not allowed.
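Only the Brier branch of this gate is mechanical; the reliability-curve comparison is a visual check and is not modeled here. A minimal sketch of the Brier-score gate:

```python
def brier_score(probs, outcomes):
    """Mean squared error between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def calibration_accepted(raw_brier: float, calibrated_brier: float) -> bool:
    """Require >= 5% relative Brier improvement over the raw model."""
    return calibrated_brier <= raw_brier * 0.95
```

Lower Brier is better, so a 5% improvement means the calibrated score is at most 95% of the raw score.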
4. Reject Rate Bounds
Reject rate must satisfy:
- ≥ 5% (avoid overconfidence)
- ≤ 30% (avoid unusable system)
If the reject rate exceeds 35%, the model is considered underfit or the feature set insufficient.
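The bounds above can be folded into one status check. A sketch; the status names are this example's, not the spec's:

```python
def reject_rate_status(reject_rate: float) -> str:
    """Map a reject rate onto the Section 4 bounds."""
    if reject_rate > 0.35:
        return "underfit"        # model underfit or features insufficient
    if reject_rate > 0.30:
        return "too_high"        # above the usability bound
    if reject_rate < 0.05:
        return "overconfident"   # below the overconfidence floor
    return "ok"
```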
5. Label Stability (Flap Rate)
Flap rate: the fraction of consecutive prediction windows whose label differs from the previous window's.
Acceptance:
- Flap rate ≤ 0.25 before smoothing
- ≤ 0.15 after smoothing
High flap rate indicates unstable predictions.
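A minimal sketch, assuming flap rate is measured as the fraction of adjacent window pairs whose predicted label changes (the spec does not spell out the formula):

```python
def flap_rate(labels) -> float:
    """Fraction of consecutive window pairs whose label differs."""
    if len(labels) < 2:
        return 0.0  # a single window cannot flap
    flips = sum(a != b for a, b in zip(labels, labels[1:]))
    return flips / (len(labels) - 1)
```

For example, the sequence Build, Build, Write, Write, Build has 2 changes over 4 pairs, a flap rate of 0.5, which fails both thresholds.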
6. Smoothing Acceptance
After block merging:
- ≥ 80% of blocks ≥ MIN_BLOCK_DURATION
- No more than 10% of blocks shorter than 2 minutes
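Both smoothing gates are proportions over merged-block durations. A sketch, with an assumed value for `MIN_BLOCK_DURATION` (the spec references the constant but not its value):

```python
MIN_BLOCK_DURATION = 300  # seconds; assumed value for illustration

def smoothing_accepted(durations) -> bool:
    """durations: merged-block lengths in seconds, after block merging."""
    n = len(durations)
    long_enough = sum(d >= MIN_BLOCK_DURATION for d in durations)
    very_short = sum(d < 120 for d in durations)  # shorter than 2 minutes
    return long_enough / n >= 0.80 and very_short / n <= 0.10
```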
7. Drift Monitoring Thresholds
Trigger investigation if:
- Macro F1 drops by ≥ 10% relative to previous model
- Reject rate increases by ≥ 10%
- Feature PSI > 0.2 for any major feature
- Class distribution shift > 15%
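Of these triggers, feature PSI is worth pinning down in code. A sketch, assuming the standard Population Stability Index formula over matched histogram bins:

```python
import math

def psi(expected, actual, eps: float = 1e-6) -> float:
    """Population Stability Index between two binned proportion vectors."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # guard against empty bins
        total += (a - e) * math.log(a / e)
    return total
```

Identical distributions yield PSI 0; a shift like [0.25, 0.25, 0.25, 0.25] → [0.40, 0.30, 0.20, 0.10] scores above the 0.2 trigger.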
8. Personalization Activation Criteria
Per-user calibration is enabled only if:
- ≥ 200 labeled windows
- ≥ 3 separate days of data
- ≥ 3 distinct core labels observed
Otherwise, use global calibration.
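All three activation criteria must hold simultaneously. A minimal sketch of the eligibility check (names are illustrative):

```python
def personalization_eligible(labeled_windows: int, distinct_days: int,
                             distinct_core_labels: int) -> bool:
    """True if per-user calibration may be enabled; else use global calibration."""
    return (labeled_windows >= 200
            and distinct_days >= 3
            and distinct_core_labels >= 3)
```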
9. Training Reproducibility
Each promoted model must include:
- Model artifact hash
- Dataset snapshot hash
- Feature schema version
- Label schema version
- Config parameters
- Training timestamp
If training cannot be reproduced, model cannot be promoted.
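The required record can be assembled as a manifest at training time. A sketch, assuming the artifact and dataset snapshot are hashed with SHA-256 (the spec requires hashes but does not fix the algorithm):

```python
import hashlib

def model_manifest(model_bytes: bytes, dataset_bytes: bytes,
                   feature_schema_version: str, label_schema_version: str,
                   config: dict, trained_at: str) -> dict:
    """Assemble the reproducibility record required before promotion."""
    return {
        "model_hash": hashlib.sha256(model_bytes).hexdigest(),
        "dataset_hash": hashlib.sha256(dataset_bytes).hexdigest(),
        "feature_schema_version": feature_schema_version,
        "label_schema_version": label_schema_version,
        "config": config,
        "trained_at": trained_at,
    }
```

Storing the manifest alongside the artifact lets a reviewer recompute both hashes and refuse promotion on any mismatch.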
10. Safety Rules
The following errors are blockers:
- BreakIdle frequently misclassified as Build
- System crashes on missing feature
- Probability vector does not sum to 1.0
- Inconsistent label ordering across inference calls
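The last two blockers are checkable at every inference call. A sketch of such a runtime guard (function name and messages are illustrative):

```python
def safety_violations(probabilities, label_order, expected_order,
                      tol: float = 1e-6) -> list:
    """Return the blocker conditions detectable on a single inference result."""
    violations = []
    if abs(sum(probabilities) - 1.0) > tol:
        violations.append("probability vector does not sum to 1.0")
    if list(label_order) != list(expected_order):
        violations.append("inconsistent label ordering")
    return violations
```

An empty list means this call passed; any entry is a release blocker under Section 10.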
11. Production Promotion Checklist
Before release:
- All acceptance criteria met
- Drift test passed on most recent week
- Manual sanity check performed on 3 users
- Documentation updated
12. Versioning
Any change to:
- Threshold values
- Required metrics
- Reject bounds
- Personalization activation conditions
requires a version bump of this document.