# Model Evaluation Benchmarks

This folder defines the inputs for `aman eval-models`.

## Files

- `cleanup_dataset.jsonl`: expected-output cases for rewrite quality.
- `heuristics_dataset.raw.jsonl`: source authoring file for heuristic-alignment evaluation.
- `heuristics_dataset.jsonl`: canonical heuristic dataset with explicit timed words.
- `model_matrix.small_first.json`: small-model candidate matrix and parameter sweeps.
- `model_artifacts.json`: model-name to artifact URL/SHA256 registry used for promotion.
- `results/latest.json`: latest winner report used by `sync-default-model`.

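
When editing the `.jsonl` files by hand, a quick structural check can catch a broken line before an evaluation run does. The sketch below is not part of `aman`; `parse_jsonl` is a hypothetical helper that only assumes the usual JSONL convention of one JSON object per line, and it makes no claim about which fields each dataset requires.

```python
import json


def parse_jsonl(lines):
    """Parse an iterable of JSONL lines into a list of dicts.

    Hypothetical helper, not part of aman. Blank lines are skipped;
    any other line must be a single JSON object.
    """
    records = []
    for line_number, line in enumerate(lines, start=1):
        stripped = line.strip()
        if not stripped:
            continue  # tolerate blank lines between records
        try:
            record = json.loads(stripped)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_number}: invalid JSON ({exc.msg})") from exc
        if not isinstance(record, dict):
            raise ValueError(f"line {line_number}: expected a JSON object")
        records.append(record)
    return records
```

An open file handle can be passed directly, for example `parse_jsonl(open("benchmarks/cleanup_dataset.jsonl", encoding="utf-8"))`.
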
## Run

```bash
# 1. Build the canonical heuristic dataset from the raw authoring file.
aman build-heuristic-dataset \
  --input benchmarks/heuristics_dataset.raw.jsonl \
  --output benchmarks/heuristics_dataset.jsonl

# 2. Evaluate the candidates in the matrix and write the winner report.
aman eval-models \
  --dataset benchmarks/cleanup_dataset.jsonl \
  --matrix benchmarks/model_matrix.small_first.json \
  --heuristic-dataset benchmarks/heuristics_dataset.jsonl \
  --heuristic-weight 0.25 \
  --output benchmarks/results/latest.json

# 3. Promote the reported winner using the artifact registry.
aman sync-default-model \
  --report benchmarks/results/latest.json \
  --artifacts benchmarks/model_artifacts.json \
  --constants src/constants.py
```

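
Because `sync-default-model` reads both the winner report and `model_artifacts.json`, it may be worth confirming before promotion that every candidate name has a registry entry. The following is a minimal sketch, not part of `aman`: `missing_artifacts` is a hypothetical helper, and how the names are extracted from the two JSON files depends on their schemas, which this README does not describe.

```python
def missing_artifacts(candidate_names, artifact_names):
    """Return candidate names that have no entry in the artifact registry.

    Hypothetical helper, not part of aman. Both arguments are plain
    iterables of model-name strings; extracting them from the matrix and
    registry files is left to the caller because the schemas are not
    documented here.
    """
    return sorted(set(candidate_names) - set(artifact_names))


# Illustrative names only; real names come from your matrix and registry.
example_candidates = ["model-a", "model-b", "model-c"]
example_registry = ["model-a", "model-c"]
print(missing_artifacts(example_candidates, example_registry))
```

An empty list means every candidate can be resolved to an artifact entry.
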
## Notes

- The matrix uses local GGUF model paths. Replace each `model_path` with a file present on your machine.
- All candidates are evaluated with the same XML-tagged prompt contract and the same user input shape.
- The matrix baseline should be the currently promoted managed default model.
- Keep `model_artifacts.json` in sync with candidate names so winner promotion remains deterministic.
- `cleanup_dataset` tags drive additional LLM safety metrics:
  - `i_mean_literal`
  - `i_mean_correction`
  - `spelling_disambiguation`
- `heuristics_dataset` evaluates alignment behavior directly and reports:
  - aligned text exact match
  - token F1
  - rule precision/recall
  - per-tag breakdown
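
To make the exact-match and token F1 numbers easier to interpret, here is a minimal sketch of one common way to compute them. It is an assumption, not the `aman` implementation: it tokenizes on whitespace and counts overlap as a multiset intersection, and the function names `exact_match` and `token_f1` are hypothetical. The tool's own tokenization and normalization may differ.

```python
from collections import Counter


def exact_match(predicted, expected):
    """True when the two strings are identical after trimming whitespace."""
    return predicted.strip() == expected.strip()


def token_f1(predicted, expected):
    """Token-level F1 using whitespace tokens and multiset overlap.

    Assumed definition for illustration; not taken from aman.
    """
    pred_tokens = predicted.split()
    gold_tokens = expected.split()
    if not pred_tokens and not gold_tokens:
        return 1.0  # two empty outputs agree completely
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under this definition, a prediction that shares two of its three tokens with a three-token reference has precision and recall of 2/3 each, so its F1 is also 2/3.
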