Add benchmark-driven model promotion workflow and pipeline stages

2026-02-28 15:12:33 -03:00
commit 8c1f7c1e13
parent 98b13d1069
38 changed files with 5300 additions and 503 deletions
--- a/benchmarks/README.md
+++ b/benchmarks/README.md
@@ -0,0 +1,48 @@
+# Model Evaluation Benchmarks
+
+This folder defines the inputs for `aman eval-models`.
+
+## Files
+
+- `cleanup_dataset.jsonl`: expected-output cases for rewrite quality.
+- `heuristics_dataset.raw.jsonl`: source authoring file for heuristic-alignment evaluation.
+- `heuristics_dataset.jsonl`: canonical heuristic dataset with explicit timed words.
+- `model_matrix.small_first.json`: small-model candidate matrix and parameter sweeps.
+- `model_artifacts.json`: registry mapping model names to artifact URLs and SHA256 checksums, used for promotion.
+- `results/latest.json`: latest winner report used by `sync-default-model`.
+
+## Run
+
+```bash
+aman build-heuristic-dataset \
+  --input benchmarks/heuristics_dataset.raw.jsonl \
+  --output benchmarks/heuristics_dataset.jsonl
+
+aman eval-models \
+  --dataset benchmarks/cleanup_dataset.jsonl \
+  --matrix benchmarks/model_matrix.small_first.json \
+  --heuristic-dataset benchmarks/heuristics_dataset.jsonl \
+  --heuristic-weight 0.25 \
+  --output benchmarks/results/latest.json
+
+aman sync-default-model \
+  --report benchmarks/results/latest.json \
+  --artifacts benchmarks/model_artifacts.json \
+  --constants src/constants.py
+```
+
+## Notes
+
+- The matrix uses local GGUF model paths. Replace each `model_path` with the path to a file present on your machine.
+- All candidates are evaluated with the same XML-tagged prompt contract and the same user input shape.
+- The matrix baseline should be the currently promoted managed default model.
+- Keep `model_artifacts.json` in sync with candidate names so winner promotion remains deterministic.
+- `cleanup_dataset` tags drive additional LLM safety metrics:
+  - `i_mean_literal`
+  - `i_mean_correction`
+  - `spelling_disambiguation`
+- `heuristics_dataset` evaluates alignment behavior directly and reports:
+  - aligned text exact match
+  - token F1
+  - rule precision/recall
+  - per-tag breakdown
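
The README lists "aligned text exact match" and "token F1" among the heuristic metrics but does not say how they are computed. As a rough illustration only, the sketch below shows one common way to compute both for a pair of strings. The function names `token_f1` and `exact_match`, the whitespace tokenization, and the whitespace trimming are assumptions made for this example; they are not taken from `aman`, whose implementation may differ.

```python
from collections import Counter


def token_f1(predicted: str, expected: str) -> float:
    """Bag-of-tokens F1 between two strings.

    Assumption: tokens are produced by plain whitespace splitting.
    """
    pred_tokens = predicted.split()
    exp_tokens = expected.split()

    # Two empty strings agree perfectly; one empty side means no overlap.
    if not pred_tokens and not exp_tokens:
        return 1.0
    if not pred_tokens or not exp_tokens:
        return 0.0

    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(exp_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(exp_tokens)
    return 2 * precision * recall / (precision + recall)


def exact_match(predicted: str, expected: str) -> bool:
    """True when both strings are identical after trimming outer whitespace."""
    return predicted.strip() == expected.strip()


if __name__ == "__main__":
    # Hypothetical pair: the output gets two of three expected tokens right.
    print(token_f1("meet on Tuesday", "meet on Thursday"))
    print(exact_match("meet on Thursday ", "meet on Thursday"))
```

Token F1 gives partial credit when most of the aligned text is right, while exact match only passes when the whole string agrees, which is presumably why the report includes both.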