aman/benchmarks/README.md
Thales Maciel 8c1f7c1e13
Add benchmark-driven model promotion workflow and pipeline stages
2026-02-28 15:12:33 -03:00

Model Evaluation Benchmarks

This folder holds the inputs for aman eval-models and the report it writes, which aman sync-default-model then reads.

Files

  • cleanup_dataset.jsonl: expected-output cases for rewrite quality.
  • heuristics_dataset.raw.jsonl: source authoring file for heuristic-alignment evaluation.
  • heuristics_dataset.jsonl: canonical heuristic dataset with explicit timed words.
  • model_matrix.small_first.json: small-model candidate matrix and parameter sweeps.
  • model_artifacts.json: model-name to artifact URL/SHA256 registry used for promotion.
  • results/latest.json: latest winner report used by sync-default-model.
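
Since model_artifacts.json pairs each model name with a SHA256, a downloaded artifact can be checked against the registry before promotion. The following is a minimal sketch of that check, not aman's own code: the helper names sha256_of_file and artifact_matches are hypothetical, and the registry's exact JSON layout is not shown here, so the expected digest is passed in as a plain string.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks.

    Chunked reads keep memory use flat for multi-gigabyte GGUF files.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def artifact_matches(path, expected_sha256):
    """Compare a local file against a recorded checksum (hypothetical helper)."""
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

A mismatch usually means a truncated download or a registry entry that was not updated along with the model file.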

Run

aman build-heuristic-dataset \
  --input benchmarks/heuristics_dataset.raw.jsonl \
  --output benchmarks/heuristics_dataset.jsonl

aman eval-models \
  --dataset benchmarks/cleanup_dataset.jsonl \
  --matrix benchmarks/model_matrix.small_first.json \
  --heuristic-dataset benchmarks/heuristics_dataset.jsonl \
  --heuristic-weight 0.25 \
  --output benchmarks/results/latest.json

aman sync-default-model \
  --report benchmarks/results/latest.json \
  --artifacts benchmarks/model_artifacts.json \
  --constants src/constants.py
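
This README does not state how --heuristic-weight enters the final ranking. One plausible reading, shown here purely as an assumption, is a convex combination of the cleanup score and the heuristic score. The function name blended_score and the premise that both scores lie in [0, 1] are illustrative, not taken from aman.

```python
def blended_score(cleanup_score, heuristic_score, heuristic_weight=0.25):
    """Blend two scores with a convex combination.

    ASSUMPTION: this formula is a guess at what a heuristic weight could
    mean; check the aman source for the actual ranking rule.
    """
    if not 0.0 <= heuristic_weight <= 1.0:
        raise ValueError("heuristic_weight must be between 0 and 1")
    return (1.0 - heuristic_weight) * cleanup_score + heuristic_weight * heuristic_score


# With the weight from the command above, a candidate scoring 0.8 on
# cleanup and 0.4 on heuristics lands at roughly 0.7 under this reading.
print(round(blended_score(0.8, 0.4), 2))
```

Under this reading, a weight of 0.25 means cleanup quality still dominates the ranking and the heuristic dataset mainly separates candidates with similar cleanup scores.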

Notes

  • The matrix uses local GGUF model paths. Replace each model_path with files present on your machine.
  • All candidates are evaluated with the same XML-tagged prompt contract and the same user input shape.
  • The matrix baseline should be the model currently promoted as the managed default.
  • Keep model_artifacts.json in sync with candidate names so winner promotion remains deterministic.
  • cleanup_dataset tags drive additional LLM safety metrics:
    • i_mean_literal
    • i_mean_correction
    • spelling_disambiguation
  • heuristics_dataset evaluates alignment behavior directly and reports:
    • aligned text exact match
    • token F1
    • rule precision/recall
    • per-tag breakdown
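
For readers unfamiliar with the first two heuristic metrics, the sketch below shows the conventional definitions of exact match and bag-of-tokens F1. It assumes whitespace tokenization and whitespace-trimmed comparison. aman may tokenize or normalize differently, so treat the function names and details as illustrative.

```python
from collections import Counter


def exact_match(predicted, expected):
    """True when the two texts are identical after trimming whitespace."""
    return predicted.strip() == expected.strip()


def token_f1(predicted, expected):
    """Harmonic mean of token precision and recall (whitespace tokens)."""
    pred_tokens = predicted.split()
    gold_tokens = expected.split()
    if not pred_tokens and not gold_tokens:
        return 1.0  # two empty texts agree completely
    # Multiset intersection counts each shared token at most as often
    # as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Exact match is strict and drops to zero on any single-word difference, while token F1 gives partial credit, which is why reporting both is useful when comparing candidates.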