Add benchmark-driven model promotion workflow and pipeline stages
Some checks failed
ci / test-and-build (push) Has been cancelled

Thales Maciel 2026-02-28 15:12:33 -03:00
parent 98b13d1069
commit 8c1f7c1e13
38 changed files with 5300 additions and 503 deletions

benchmarks/README.md Normal file

@@ -0,0 +1,48 @@
# Model Evaluation Benchmarks
This folder defines the inputs for `aman eval-models`.
## Files
- `cleanup_dataset.jsonl`: expected-output cases for rewrite quality.
- `heuristics_dataset.raw.jsonl`: source authoring file for heuristic-alignment evaluation.
- `heuristics_dataset.jsonl`: canonical heuristic dataset with explicit timed words.
- `model_matrix.small_first.json`: small-model candidate matrix and parameter sweeps.
- `model_artifacts.json`: model-name to artifact URL/SHA256 registry used for promotion.
- `results/latest.json`: latest winner report used by `sync-default-model`.
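
The registry in `model_artifacts.json` pairs each model name with an artifact URL and a SHA256 digest. As a minimal sketch of how a downloaded artifact could be checked against such a digest, the snippet below hashes a file in chunks and compares the result. It assumes nothing about the actual registry schema or about how `aman` verifies files internally; the function names `sha256_of_file` and `artifact_matches` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def artifact_matches(path: Path, expected_sha256: str) -> bool:
    """Compare a file's digest with the expected value, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

Reading in chunks matters here because GGUF model files are often several gigabytes, so loading the whole file into memory just to hash it would be wasteful.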
## Run
```bash
aman build-heuristic-dataset \
--input benchmarks/heuristics_dataset.raw.jsonl \
--output benchmarks/heuristics_dataset.jsonl
aman eval-models \
--dataset benchmarks/cleanup_dataset.jsonl \
--matrix benchmarks/model_matrix.small_first.json \
--heuristic-dataset benchmarks/heuristics_dataset.jsonl \
--heuristic-weight 0.25 \
--output benchmarks/results/latest.json
aman sync-default-model \
--report benchmarks/results/latest.json \
--artifacts benchmarks/model_artifacts.json \
--constants src/constants.py
```
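
The three commands form a chain: the heuristic dataset built in the first step is an input to `eval-models`, and the report written by `eval-models` is the input to `sync-default-model`. A minimal sketch of a Python wrapper that makes this ordering explicit and stops at the first failing stage might look like the following. The wrapper itself (`pipeline_commands`, `run_pipeline`) is hypothetical and not part of `aman`; it only uses the subcommands and flags shown above.

```python
import subprocess
from pathlib import Path

BENCH = Path("benchmarks")


def pipeline_commands(bench: Path = BENCH) -> list[list[str]]:
    """Build the three documented commands as argument lists, in run order."""
    heuristics = (bench / "heuristics_dataset.jsonl").as_posix()
    report = (bench / "results" / "latest.json").as_posix()
    return [
        ["aman", "build-heuristic-dataset",
         "--input", (bench / "heuristics_dataset.raw.jsonl").as_posix(),
         "--output", heuristics],
        ["aman", "eval-models",
         "--dataset", (bench / "cleanup_dataset.jsonl").as_posix(),
         "--matrix", (bench / "model_matrix.small_first.json").as_posix(),
         "--heuristic-dataset", heuristics,  # output of the first stage
         "--heuristic-weight", "0.25",
         "--output", report],
        ["aman", "sync-default-model",
         "--report", report,  # output of the second stage
         "--artifacts", (bench / "model_artifacts.json").as_posix(),
         "--constants", "src/constants.py"],
    ]


def run_pipeline(bench: Path = BENCH) -> None:
    """Run each stage in order; check=True raises on the first non-zero exit."""
    for command in pipeline_commands(bench):
        subprocess.run(command, check=True)
```

Calling `run_pipeline()` from the repository root would then run all three stages, and a failed evaluation would prevent `sync-default-model` from promoting a stale report.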
## Notes
- The matrix uses local GGUF model paths. Replace each `model_path` with the path to a file present on your machine.
- All candidates are evaluated with the same XML-tagged prompt contract and the same user input shape.
- The matrix baseline should be the currently promoted managed default model.
- Keep `model_artifacts.json` in sync with candidate names so winner promotion remains deterministic.
- `cleanup_dataset` tags drive additional LLM safety metrics:
  - `i_mean_literal`
  - `i_mean_correction`
  - `spelling_disambiguation`
- `heuristics_dataset` evaluates alignment behavior directly and reports:
  - aligned text exact match
  - token F1
  - rule precision/recall
  - per-tag breakdown
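
For readers unfamiliar with token F1, the sketch below shows one common definition: precision and recall over whitespace-separated tokens, with repeated tokens counted as a multiset. This is an illustration only. The tokenization and edge-case handling used by `aman eval-models` are not specified here and may differ, and the function name `token_f1` is hypothetical.

```python
from collections import Counter


def token_f1(predicted: str, expected: str) -> float:
    """Token-level F1 over whitespace tokens, counting repeated tokens as a multiset."""
    pred_tokens = predicted.split()
    gold_tokens = expected.split()
    if not pred_tokens and not gold_tokens:
        return 1.0  # two empty outputs agree completely
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Unlike exact match, this score gives partial credit: an output that differs from the expected text by one token out of three still shares two tokens with it, so both precision and recall are 2/3.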