# Model Evaluation Benchmarks

This folder defines the inputs for `aman eval-models`.

## Files
- `cleanup_dataset.jsonl`: expected-output cases for rewrite quality.
- `heuristics_dataset.raw.jsonl`: source authoring file for heuristic-alignment evaluation.
- `heuristics_dataset.jsonl`: canonical heuristic dataset with explicit timed words.
- `model_matrix.small_first.json`: small-model candidate matrix and parameter sweeps.
- `model_artifacts.json`: model-name-to-artifact URL/SHA256 registry used for promotion.
- `results/latest.json`: latest winner report used by `sync-default-model`.
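All of these datasets are line-delimited JSON, so each line must parse as a standalone object. A minimal validation sketch (the field names in the sample records are hypothetical, not the datasets' actual schema):

```python
import json

def load_jsonl(text):
    """Parse JSONL content into a list of records, failing on the first bad line."""
    records = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {lineno}: {exc}") from exc
    return records

# Hypothetical records; the real datasets define their own fields.
sample = "\n".join([
    '{"input": "i mean the the report", "expected": "I mean the report"}',
    '{"input": "teh quick fix", "expected": "the quick fix"}',
])
records = load_jsonl(sample)
```

Running a loader like this over an edited dataset before `aman eval-models` catches malformed lines early.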
## Run

```sh
aman build-heuristic-dataset \
  --input benchmarks/heuristics_dataset.raw.jsonl \
  --output benchmarks/heuristics_dataset.jsonl
```
```sh
aman eval-models \
  --dataset benchmarks/cleanup_dataset.jsonl \
  --matrix benchmarks/model_matrix.small_first.json \
  --heuristic-dataset benchmarks/heuristics_dataset.jsonl \
  --heuristic-weight 0.25 \
  --output benchmarks/results/latest.json
```
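The `--heuristic-weight 0.25` flag suggests the final ranking blends the cleanup score with the heuristic-alignment score. One plausible blend is a convex combination; this is a sketch, not necessarily the exact formula `aman eval-models` uses:

```python
def blended_score(cleanup_score, heuristic_score, heuristic_weight=0.25):
    """Convex combination: heuristic_weight goes to the heuristic dataset,
    the remainder to the cleanup dataset. Both inputs assumed in [0, 1]."""
    return (1.0 - heuristic_weight) * cleanup_score + heuristic_weight * heuristic_score

# e.g. a candidate scoring 0.80 on cleanup and 0.60 on heuristic alignment
score = blended_score(0.80, 0.60)
```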
```sh
aman sync-default-model \
  --report benchmarks/results/latest.json \
  --artifacts benchmarks/model_artifacts.json \
  --constants src/constants.py
```
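`sync-default-model` promotes the winner via the URL/SHA256 registry in `model_artifacts.json`. The integrity check that registry implies can be sketched as follows (the registry layout and entry name here are hypothetical):

```python
import hashlib

def verify_artifact(data: bytes, expected_sha256: str) -> bool:
    """Return True if the artifact bytes hash to the registered digest."""
    return hashlib.sha256(data).hexdigest() == expected_sha256.lower()

# Hypothetical registry entry mirroring a name -> {url, sha256} mapping.
registry = {
    "tiny-model-q4": {
        "url": "https://example.invalid/tiny-model-q4.gguf",
        "sha256": hashlib.sha256(b"model-bytes").hexdigest(),
    }
}
ok = verify_artifact(b"model-bytes", registry["tiny-model-q4"]["sha256"])
```

Pinning a digest per candidate name is what keeps promotion deterministic even if an artifact URL is re-uploaded.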
## Notes

- The matrix uses local GGUF model paths. Replace each `model_path` with files present on your machine.
- All candidates are evaluated with the same XML-tagged prompt contract and the same user input shape.
- The matrix baseline should be the currently promoted managed default model.
- Keep `model_artifacts.json` in sync with candidate names so winner promotion remains deterministic.
- `cleanup_dataset` tags drive additional LLM safety metrics:
  - `i_mean_literal`
  - `i_mean_correction`
  - `spelling_disambiguation`
- `heuristics_dataset` evaluates alignment behavior directly and reports:
  - aligned text exact match
  - token F1
  - rule precision/recall
  - per-tag breakdown
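Token F1 in the report is the standard metric: precision and recall over the multiset of tokens shared between the aligned output and the reference, combined by harmonic mean. A minimal sketch:

```python
from collections import Counter

def token_f1(predicted: str, reference: str) -> float:
    """Harmonic mean of token precision and recall, using multiset overlap."""
    pred_tokens = predicted.split()
    ref_tokens = reference.split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# All 4 predicted tokens appear in the 5-token reference:
# precision = 1.0, recall = 0.8, F1 = 8/9.
f1 = token_f1("i mean the report", "i mean the final report")
```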