
# Model Evaluation Benchmarks

This folder defines the inputs for `aman eval-models`.

## Files

- `cleanup_dataset.jsonl`: expected-output cases for rewrite quality.
- `heuristics_dataset.raw.jsonl`: source authoring file for heuristic-alignment evaluation.
- `heuristics_dataset.jsonl`: canonical heuristic dataset with explicit timed words.
- `model_matrix.small_first.json`: small-model candidate matrix and parameter sweeps.
- `model_artifacts.json`: model-name to artifact URL/SHA256 registry used for promotion.
- `results/latest.json`: latest winner report used by `sync-default-model`.
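
Since `model_artifacts.json` records a SHA256 per model, it can be useful to check a downloaded file against that value before pointing the matrix at it. The following is a minimal sketch, not part of `aman`: the function names are hypothetical, and it deliberately does not parse the registry because its exact layout is not described here. Pass in the expected digest yourself.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks.

    Chunked reads keep memory use flat even for multi-gigabyte model files.
    """
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: Path, expected_sha256: str) -> bool:
    """Compare digests case-insensitively (hypothetical helper, not an aman API)."""
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

A call such as `matches_expected(Path("model.gguf"), expected)` returns `False` on a truncated or corrupted download; here `model.gguf` and `expected` are placeholders for your local file and the digest copied from the registry.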

## Run

```bash
# 1. Build the canonical heuristic dataset from the raw authoring file.
aman build-heuristic-dataset \
  --input benchmarks/heuristics_dataset.raw.jsonl \
  --output benchmarks/heuristics_dataset.jsonl

# 2. Evaluate the candidates in the matrix and write the winner report.
aman eval-models \
  --dataset benchmarks/cleanup_dataset.jsonl \
  --matrix benchmarks/model_matrix.small_first.json \
  --heuristic-dataset benchmarks/heuristics_dataset.jsonl \
  --heuristic-weight 0.25 \
  --output benchmarks/results/latest.json

# 3. Promote the winner from the report, using the artifact registry.
aman sync-default-model \
  --report benchmarks/results/latest.json \
  --artifacts benchmarks/model_artifacts.json \
  --constants src/constants.py
```

## Notes

- The matrix uses local GGUF model paths. Replace each `model_path` with the path to a file present on your machine.
- All candidates are evaluated with the same XML-tagged prompt contract and the same user input shape.
- The matrix baseline should be the currently promoted managed default model.
- Keep `model_artifacts.json` in sync with candidate names so winner promotion remains deterministic.
- `cleanup_dataset` tags drive additional LLM safety metrics:
  - `i_mean_literal`
  - `i_mean_correction`
  - `spelling_disambiguation`
- `heuristics_dataset` evaluates alignment behavior directly and reports:
  - aligned text exact match
  - token F1
  - rule precision/recall
  - per-tag breakdown
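
For readers unfamiliar with the heuristic metrics listed above, the sketch below shows one common way to compute exact match, token F1, and set-based precision/recall. It is an illustration under stated assumptions, not the scoring code used by `aman eval-models`: it assumes whitespace tokenization and treats applied rules as a set of names, and all function names are hypothetical.

```python
from collections import Counter


def exact_match(predicted: str, expected: str) -> bool:
    """True when the two texts are identical after trimming outer whitespace."""
    return predicted.strip() == expected.strip()


def token_f1(predicted: str, expected: str) -> float:
    """Harmonic mean of token precision and recall (whitespace tokens assumed)."""
    pred_tokens = predicted.split()
    gold_tokens = expected.split()
    if not pred_tokens and not gold_tokens:
        return 1.0  # two empty texts agree completely
    # Multiset intersection counts each shared token at most as often as it
    # appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def precision_recall(predicted: set[str], expected: set[str]) -> tuple[float, float]:
    """Set-based precision and recall, e.g. over the names of applied rules."""
    true_positives = len(predicted & expected)
    # An empty side is scored as 1.0 here; that convention is an assumption.
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(expected) if expected else 1.0
    return precision, recall
```

For example, `token_f1("the cat sat", "the cat sat down")` has precision 1.0 and recall 0.75, which gives an F1 of 6/7. A per-tag breakdown could then be built by grouping cases by tag and averaging these scores within each group.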