Add benchmark-driven model promotion workflow and pipeline stages
Some checks failed
ci / test-and-build (push) Has been cancelled
This commit is contained in:
parent
98b13d1069
commit
8c1f7c1e13
38 changed files with 5300 additions and 503 deletions
48
benchmarks/README.md
Normal file
@@ -0,0 +1,48 @@
# Model Evaluation Benchmarks

This folder defines the inputs for `aman eval-models`.

## Files

- `cleanup_dataset.jsonl`: expected-output cases for rewrite quality.
- `heuristics_dataset.raw.jsonl`: source authoring file for heuristic-alignment evaluation.
- `heuristics_dataset.jsonl`: canonical heuristic dataset with explicit timed words.
- `model_matrix.small_first.json`: small-model candidate matrix and parameter sweeps.
- `model_artifacts.json`: model-name to artifact URL/SHA256 registry used for promotion.
- `results/latest.json`: latest winner report used by `sync-default-model`.

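As a minimal sketch of how a registry such as `model_artifacts.json` could be used to check a downloaded model, the snippet below hashes a file in chunks and compares the digest with a registered one. The entry layout (a `sha256` key) and the function names `sha256_of` and `matches_registry` are assumptions for illustration; the real schema and the code behind `aman` may differ.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large GGUF files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_registry(path: Path, entry: dict) -> bool:
    """Compare a local file with a hypothetical registry entry.

    `entry` is assumed to look like {"sha256": "<hex digest>"}.
    """
    return sha256_of(path) == entry["sha256"].lower()
```

Hashing in chunks keeps memory use flat regardless of model size, which matters for multi-gigabyte artifacts.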
## Run

```bash
# 1. Build the canonical heuristic dataset from the raw authoring file.
aman build-heuristic-dataset \
  --input benchmarks/heuristics_dataset.raw.jsonl \
  --output benchmarks/heuristics_dataset.jsonl

# 2. Evaluate every candidate in the matrix and write the winner report.
aman eval-models \
  --dataset benchmarks/cleanup_dataset.jsonl \
  --matrix benchmarks/model_matrix.small_first.json \
  --heuristic-dataset benchmarks/heuristics_dataset.jsonl \
  --heuristic-weight 0.25 \
  --output benchmarks/results/latest.json

# 3. Sync the default model from the winner report and the artifact registry.
aman sync-default-model \
  --report benchmarks/results/latest.json \
  --artifacts benchmarks/model_artifacts.json \
  --constants src/constants.py
```

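This README does not state how `--heuristic-weight` enters the final ranking. One plausible reading, shown here purely as an assumption, is a linear blend of a cleanup score and a heuristic score, both in the range 0 to 1. The function name `combined_score` and the formula are hypothetical and not taken from `aman`.

```python
def combined_score(
    cleanup_score: float,
    heuristic_score: float,
    heuristic_weight: float = 0.25,
) -> float:
    """Blend two scores linearly (assumed formula, not confirmed by aman)."""
    if not 0.0 <= heuristic_weight <= 1.0:
        raise ValueError("heuristic_weight must be between 0 and 1")
    # The cleanup score receives the remaining weight so the two sum to 1.
    return (1.0 - heuristic_weight) * cleanup_score + heuristic_weight * heuristic_score
```

Under this reading, a weight of 0.25 would let the heuristic dataset adjust the ranking while the cleanup dataset still dominates it.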
## Notes

- The matrix uses local GGUF model paths. Replace each `model_path` with files present on your machine.
- All candidates are evaluated with the same XML-tagged prompt contract and the same user input shape.
- Matrix baseline should be the currently promoted managed default model.
- Keep `model_artifacts.json` in sync with candidate names so winner promotion remains deterministic.
- `cleanup_dataset` tags drive additional LLM safety metrics:
  - `i_mean_literal`
  - `i_mean_correction`
  - `spelling_disambiguation`
- `heuristics_dataset` evaluates alignment behavior directly and reports:
  - aligned text exact match
  - token F1
  - rule precision/recall
  - per-tag breakdown
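
To make the "token F1" metric above concrete, here is a minimal sketch of the common bag-of-tokens definition. Whitespace tokenization, the handling of empty strings, and the name `token_f1` are assumptions; the scoring code in `aman` may tokenize or normalize differently.

```python
from collections import Counter


def token_f1(predicted: str, expected: str) -> float:
    """Token-level F1 between two strings, using whitespace tokens."""
    pred_tokens = predicted.split()
    gold_tokens = expected.split()
    # Two empty strings are treated as a perfect match (an assumed convention).
    if not pred_tokens and not gold_tokens:
        return 1.0
    if not pred_tokens or not gold_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Unlike exact match, this score gives partial credit when the aligned text differs from the expected text by only a few tokens.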