# Model Evaluation Benchmarks

This folder defines the inputs for `aman eval-models`.

## Files

- `cleanup_dataset.jsonl`: expected-output cases for rewrite quality.
- `heuristics_dataset.raw.jsonl`: source authoring file for heuristic-alignment evaluation.
- `heuristics_dataset.jsonl`: canonical heuristic dataset with explicit timed words.
- `model_matrix.small_first.json`: small-model candidate matrix and parameter sweeps.
- `model_artifacts.json`: model-name to artifact URL/SHA256 registry used for promotion.
- `results/latest.json`: latest winner report used by `sync-default-model`.

## Run

```bash
aman build-heuristic-dataset \
  --input benchmarks/heuristics_dataset.raw.jsonl \
  --output benchmarks/heuristics_dataset.jsonl

aman eval-models \
  --dataset benchmarks/cleanup_dataset.jsonl \
  --matrix benchmarks/model_matrix.small_first.json \
  --heuristic-dataset benchmarks/heuristics_dataset.jsonl \
  --heuristic-weight 0.25 \
  --output benchmarks/results/latest.json

aman sync-default-model \
  --report benchmarks/results/latest.json \
  --artifacts benchmarks/model_artifacts.json \
  --constants src/constants.py
```

## Notes

- The matrix uses local GGUF model paths. Replace each `model_path` with files present on your machine.
- All candidates are evaluated with the same XML-tagged prompt contract and the same user input shape.
- Matrix baseline should be the currently promoted managed default model.
- Keep `model_artifacts.json` in sync with candidate names so winner promotion remains deterministic.
- `cleanup_dataset` tags drive additional LLM safety metrics:
  - `i_mean_literal`
  - `i_mean_correction`
  - `spelling_disambiguation`
- `heuristics_dataset` evaluates alignment behavior directly and reports:
  - aligned text exact match
  - token F1
  - rule precision/recall
  - per-tag breakdown
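
For readers unfamiliar with the token F1 metric reported for `heuristics_dataset`, here is a minimal sketch of one common definition. It assumes whitespace tokenization and multiset (duplicate-aware) overlap. The function name `token_f1` is hypothetical, and the scoring that `aman eval-models` actually performs may differ.

```python
from collections import Counter


def token_f1(predicted: str, expected: str) -> float:
    """Hypothetical helper: F1 over whitespace tokens, counting duplicates.

    Assumption: this mirrors a common token-F1 definition, not necessarily
    the one implemented by `aman eval-models`.
    """
    pred_tokens = predicted.split()
    gold_tokens = expected.split()

    # Assumption: two empty strings count as a perfect match.
    if not pred_tokens and not gold_tokens:
        return 1.0
    if not pred_tokens or not gold_tokens:
        return 0.0

    # Multiset intersection: each shared token counts at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under this definition, an output that contains every expected token plus extra ones keeps full recall but loses precision, so the score drops below 1.0 while exact match drops to zero.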