Add benchmark-driven model promotion workflow and pipeline stages

Thales Maciel 2026-02-28 15:12:33 -03:00
parent 98b13d1069
commit 8c1f7c1e13
38 changed files with 5300 additions and 503 deletions


@@ -0,0 +1,70 @@
# Model Speed/Quality Methodology
## Goal
Find a local model + generation parameter set that significantly reduces latency while preserving output quality for Aman cleanup.
## Prompting Contract
All model candidates must run with the same prompt framing:
- XML-tagged system contract for pass 1 (draft) and pass 2 (audit)
- XML-tagged user messages (`<request>`, `<language>`, `<transcript>`, `<dictionary>`, output contract tags)
- Strict JSON output contracts:
- pass 1: `{"candidate_text":"...","decision_spans":[...]}`
- pass 2: `{"cleaned_text":"..."}`
## Pipeline
1. Draft pass: produce candidate cleaned text + ambiguity decisions
2. Audit pass: validate ambiguous corrections conservatively and emit final text
3. Optional heuristic alignment eval: run deterministic alignment against
timed-word fixtures (`heuristics_dataset.jsonl`)
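A minimal sketch of the two-pass flow, assuming a `run_model` callable that wraps the local model invocation; the stage names and payload shapes here are illustrative, and the audit pass's handling of `decision_spans` is elided for brevity.

```python
from typing import Callable

def clean_transcript(transcript: str,
                     run_model: Callable[[str, str], dict]) -> str:
    """Two-pass cleanup: draft, then conservative audit of the draft."""
    draft = run_model("draft", transcript)                # pass 1
    final = run_model("audit", draft["candidate_text"])   # pass 2
    return final["cleaned_text"]
```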
## Scoring
Per-run quality metrics:
- `parse_valid`: output parsed and contains `cleaned_text`
- `exact_match`: normalized exact match against expected output
- `similarity`: normalized text similarity
- `contract_compliance`: non-empty contract-compliant output
- `i_mean_literal_false_positive_rate`: rate of literal `I mean` cases wrongly converted to a correction
- `i_mean_correction_false_negative_rate`: rate of corrective `I mean` cases wrongly preserved as literal text
- `spelling_disambiguation_accuracy`: spelling hints resolved to expected final token
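For illustration, `exact_match` and `similarity` might be computed as follows; the normalization rules (lowercasing, whitespace collapse) are assumptions, not the benchmark's actual ones.

```python
import difflib
import re

def _normalize(text: str) -> str:
    # Assumed normalization: lowercase and collapse runs of whitespace.
    return re.sub(r"\s+", " ", text.strip().lower())

def exact_match(output: str, expected: str) -> bool:
    return _normalize(output) == _normalize(expected)

def similarity(output: str, expected: str) -> float:
    # Ratio in [0, 1] over normalized text.
    return difflib.SequenceMatcher(
        None, _normalize(output), _normalize(expected)).ratio()
```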
Per-run latency metrics:
- `pass1_ms`, `pass2_ms`, `total_ms`
Hybrid score:
`0.40*parse_valid + 0.20*exact_match + 0.30*similarity + 0.10*contract_compliance`
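The hybrid score is a direct weighted sum; a transcription of the formula above, with booleans treated as 0/1:

```python
def hybrid_score(parse_valid: bool, exact_match: bool,
                 similarity: float, contract_compliance: bool) -> float:
    # Weights taken verbatim from the hybrid score formula.
    return (0.40 * parse_valid
            + 0.20 * exact_match
            + 0.30 * similarity
            + 0.10 * contract_compliance)
```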
Heuristic score (when `--heuristic-dataset` is provided):
- `exact_match_rate` on aligned text
- `token_f1_avg`
- `rule_match_avg` (required/forbidden rule compliance + min applied decisions)
- `decision_rule_precision` / `decision_rule_recall`
- `combined_score_avg = 0.50*exact + 0.30*token_f1 + 0.20*rule_match`
Combined ranking score:
`combined_score = (1 - heuristic_weight) * hybrid_score_avg + heuristic_weight * heuristic_combined_score_avg`
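Both ranking formulas transcribe directly into code; the function and argument names below are assumptions, the weights are taken verbatim from the formulas above.

```python
def heuristic_combined(exact: float, token_f1: float,
                       rule_match: float) -> float:
    # combined_score_avg = 0.50*exact + 0.30*token_f1 + 0.20*rule_match
    return 0.50 * exact + 0.30 * token_f1 + 0.20 * rule_match

def combined_score(hybrid_score_avg: float,
                   heuristic_combined_score_avg: float,
                   heuristic_weight: float = 0.0) -> float:
    # With heuristic_weight = 0 (no --heuristic-dataset), ranking
    # reduces to the hybrid score alone.
    return ((1 - heuristic_weight) * hybrid_score_avg
            + heuristic_weight * heuristic_combined_score_avg)
```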
## Promotion Gate
A candidate can be promoted if:
- `parse_valid_rate >= 0.99`
- `hybrid_score_avg >= baseline_hybrid - 0.08`
- lower p50 latency than baseline on long-text cases
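The gate can be expressed as a predicate; the thresholds are taken verbatim from this section, while the metric key names are assumptions about the benchmark results schema.

```python
def can_promote(candidate: dict, baseline: dict) -> bool:
    """Promotion gate: parse validity, bounded quality drop, faster p50."""
    return (candidate["parse_valid_rate"] >= 0.99
            and candidate["hybrid_score_avg"]
                >= baseline["hybrid_score_avg"] - 0.08
            and candidate["p50_latency_ms_long"]
                < baseline["p50_latency_ms_long"])
```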
## Sources
- https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct
- https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct
- https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct
- https://github.com/ggml-org/llama.cpp
- https://github.com/abetlen/llama-cpp-python


@@ -4,14 +4,19 @@
 2. Bump `project.version` in `pyproject.toml`.
 3. Run quality and build gates:
    - `make release-check`
-4. Build packaging artifacts:
+   - `make check-default-model`
+4. Ensure model promotion artifacts are current:
+   - `benchmarks/results/latest.json` has the latest `winner_recommendation.name`
+   - `benchmarks/model_artifacts.json` contains that winner with URL + SHA256
+   - `make sync-default-model` (if constants drifted)
+5. Build packaging artifacts:
    - `make package`
-5. Verify artifacts:
+6. Verify artifacts:
    - `dist/*.whl`
    - `dist/*.tar.gz`
    - `dist/*.deb`
    - `dist/arch/PKGBUILD`
-6. Tag release:
+7. Tag release:
    - `git tag vX.Y.Z`
    - `git push origin vX.Y.Z`
-7. Publish release and upload package artifacts from `dist/`.
+8. Publish release and upload package artifacts from `dist/`.
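The "model promotion artifacts are current" step of the checklist lends itself to a quick consistency check. This sketch assumes JSON layouts implied by the file names above (a `winner_recommendation.name` in the results, a per-model entry with `url` and `sha256` keys in the artifacts map); the model name in the usage below is purely illustrative.

```python
def winner_artifact_ok(latest: dict, artifacts: dict) -> bool:
    """True if the benchmark winner has a complete artifact entry."""
    winner = latest["winner_recommendation"]["name"]
    entry = artifacts.get(winner)
    return bool(entry and entry.get("url") and entry.get("sha256"))
```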