Model Evaluation Benchmarks

This folder defines the inputs for aman eval-models.

Files

cleanup_dataset.jsonl: expected-output cases for rewrite quality.
heuristics_dataset.raw.jsonl: source authoring file for heuristic-alignment evaluation.
heuristics_dataset.jsonl: canonical heuristic dataset with explicit timed words.
model_matrix.small_first.json: small-model candidate matrix and parameter sweeps.
model_artifacts.json: model-name to artifact URL/SHA256 registry used for promotion.
results/latest.json: latest winner report used by sync-default-model.
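
Because model_artifacts.json pairs each model name with a SHA256, a downloaded artifact can be checked against its registry entry before it is used. The sketch below is illustrative only: sha256_of_file and artifact_matches are hypothetical helper names, not part of aman, and the registry's JSON layout is not assumed, so the expected digest is passed in directly.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large GGUF files are never fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def artifact_matches(path: Path, expected_sha256: str) -> bool:
    """Compare a local file against an expected digest, ignoring case."""
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

Call artifact_matches with the local model path and the digest copied from the registry entry. A False result means the file should not be promoted.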

Run

aman build-heuristic-dataset \
  --input benchmarks/heuristics_dataset.raw.jsonl \
  --output benchmarks/heuristics_dataset.jsonl

aman eval-models \
  --dataset benchmarks/cleanup_dataset.jsonl \
  --matrix benchmarks/model_matrix.small_first.json \
  --heuristic-dataset benchmarks/heuristics_dataset.jsonl \
  --heuristic-weight 0.25 \
  --output benchmarks/results/latest.json

aman sync-default-model \
  --report benchmarks/results/latest.json \
  --artifacts benchmarks/model_artifacts.json \
  --constants src/constants.py
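
The three commands above depend on each other's outputs, so it can help to run them in order and stop at the first failure. The following is a minimal sketch of such a wrapper. build_pipeline and run_pipeline are hypothetical names introduced here, and the arguments are copied from the commands above rather than taken from any aman API.

```python
import subprocess
from pathlib import Path


def build_pipeline(bench: Path, constants: Path) -> list[list[str]]:
    """Return the three aman invocations, in order, as argv lists."""
    raw = str(bench / "heuristics_dataset.raw.jsonl")
    canonical = str(bench / "heuristics_dataset.jsonl")
    report = str(bench / "results" / "latest.json")
    return [
        ["aman", "build-heuristic-dataset", "--input", raw, "--output", canonical],
        [
            "aman", "eval-models",
            "--dataset", str(bench / "cleanup_dataset.jsonl"),
            "--matrix", str(bench / "model_matrix.small_first.json"),
            "--heuristic-dataset", canonical,
            "--heuristic-weight", "0.25",
            "--output", report,
        ],
        [
            "aman", "sync-default-model",
            "--report", report,
            "--artifacts", str(bench / "model_artifacts.json"),
            "--constants", str(constants),
        ],
    ]


def run_pipeline(commands: list[list[str]]) -> None:
    """Run each command in order; check=True raises on a non-zero exit."""
    for argv in commands:
        subprocess.run(argv, check=True)


# Example (requires aman on PATH):
# run_pipeline(build_pipeline(Path("benchmarks"), Path("src/constants.py")))
```

Keeping command construction separate from execution lets the argument lists be inspected or logged without invoking aman.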

Notes

The matrix uses local GGUF model paths. Replace each model_path with files present on your machine.
All candidates are evaluated with the same XML-tagged prompt contract and the same user input shape.
The matrix baseline should be the currently promoted managed default model.
Keep model_artifacts.json in sync with candidate names so winner promotion remains deterministic.
cleanup_dataset tags drive additional LLM safety metrics:
- i_mean_literal
- i_mean_correction
- spelling_disambiguation
heuristics_dataset evaluates alignment behavior directly and reports:
- aligned text exact match
- token F1
- rule precision/recall
- per-tag breakdown
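
To make the first two reported metrics concrete, here is a rough sketch of how exact match and token F1 are commonly computed. It assumes whitespace tokenization and multiset token overlap; the definitions aman actually uses may differ, and exact_match and token_f1 are hypothetical names.

```python
from collections import Counter


def exact_match(predicted: str, expected: str) -> bool:
    """True when the two texts are identical after trimming outer whitespace."""
    return predicted.strip() == expected.strip()


def token_f1(predicted: str, expected: str) -> float:
    """Harmonic mean of token precision and recall (whitespace tokens)."""
    pred_tokens = predicted.split()
    gold_tokens = expected.split()
    if not pred_tokens and not gold_tokens:
        return 1.0  # Two empty texts agree trivially.
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Exact match is strict and drops to False on any single-token difference, while token F1 gives partial credit, which is why reporting both is useful.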
