aman/docs/model-eval-methodology.md
Thales Maciel 8c1f7c1e13
Add benchmark-driven model promotion workflow and pipeline stages
2026-02-28 15:12:33 -03:00


Model Speed/Quality Methodology

Goal

Find a local model + generation parameter set that significantly reduces latency while preserving output quality for Aman cleanup.

Prompting Contract

All model candidates must run with the same prompt framing:

  • XML-tagged system contract for pass 1 (draft) and pass 2 (audit)
  • XML-tagged user messages (<request>, <language>, <transcript>, <dictionary>, output contract tags)
  • Strict JSON output contracts:
    • pass 1: {"candidate_text":"...","decision_spans":[...]}
    • pass 2: {"cleaned_text":"..."}
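Both output contracts can be validated with a strict JSON parse. A minimal sketch (the helper name `parse_pass_output` is illustrative, not from the Aman codebase):

```python
import json

def parse_pass_output(raw: str, required_key: str):
    """Parse a strict-JSON model reply; None signals a contract violation."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # The contract requires a JSON object carrying the stage's key
    # ("candidate_text" for pass 1, "cleaned_text" for pass 2).
    if not isinstance(obj, dict) or required_key not in obj:
        return None
    return obj
```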

Pipeline:

  1. Draft pass: produce candidate cleaned text + ambiguity decisions
  2. Audit pass: validate ambiguous corrections conservatively and emit final text
  3. Optional heuristic alignment eval: run deterministic alignment against timed-word fixtures (heuristics_dataset.jsonl)
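The two model passes can be wired together as below; `generate` is a hypothetical callable standing in for whatever runs the prompts, and the field names follow the output contracts above:

```python
def run_cleanup(transcript: str, generate) -> str:
    """Two-pass cleanup: draft, then a conservative audit of ambiguous spans."""
    # Pass 1: candidate cleaned text plus ambiguity decisions.
    draft = generate(stage="draft", transcript=transcript)
    # Pass 2: validate the ambiguous corrections and emit the final text.
    audit = generate(
        stage="audit",
        transcript=transcript,
        candidate=draft["candidate_text"],
        spans=draft["decision_spans"],
    )
    return audit["cleaned_text"]
```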

Scoring

Per-run quality metrics:

  • parse_valid: output parsed and contains cleaned_text
  • exact_match: normalized exact match against expected output
  • similarity: normalized text similarity
  • contract_compliance: non-empty contract-compliant output
  • i_mean_literal_false_positive_rate: literal "I mean" cases wrongly converted to corrections
  • i_mean_correction_false_negative_rate: corrective "I mean" cases wrongly preserved as literal
  • spelling_disambiguation_accuracy: spelling hints resolved to expected final token
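A plausible shape for the first four quality metrics; the normalization details here are assumptions, the real harness may differ:

```python
import difflib
import re

def normalize(text: str) -> str:
    # Assumed normalization: lowercase and collapse whitespace.
    return re.sub(r"\s+", " ", text.strip().lower())

def quality_metrics(output, expected: str) -> dict:
    parse_valid = isinstance(output, dict) and "cleaned_text" in output
    got = normalize(output["cleaned_text"]) if parse_valid else ""
    want = normalize(expected)
    return {
        "parse_valid": float(parse_valid),
        "exact_match": float(parse_valid and got == want),
        "similarity": difflib.SequenceMatcher(None, got, want).ratio() if parse_valid else 0.0,
        "contract_compliance": float(parse_valid and bool(got)),
    }
```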

Per-run latency metrics:

  • pass1_ms, pass2_ms, total_ms
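Each pass can be wall-clocked with a small helper (illustrative, not the project's actual timing code):

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, (time.perf_counter() - start) * 1000.0
```

Timing pass 1 this way yields `pass1_ms` as the second element, and likewise for pass 2.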

Hybrid score:

0.40*parse_valid + 0.20*exact_match + 0.30*similarity + 0.10*contract_compliance
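Stated as code, with the weights exactly as above (they sum to 1.0):

```python
def hybrid_score(m: dict) -> float:
    """Weighted blend of the four per-run quality metrics."""
    return (0.40 * m["parse_valid"]
            + 0.20 * m["exact_match"]
            + 0.30 * m["similarity"]
            + 0.10 * m["contract_compliance"])
```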

Heuristic score (when --heuristic-dataset is provided):

  • exact_match_rate on aligned text
  • token_f1_avg
  • rule_match_avg (required/forbidden rule compliance + min applied decisions)
  • decision_rule_precision / decision_rule_recall
  • combined_score_avg = 0.50*exact + 0.30*token_f1 + 0.20*rule_match
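The heuristic blend, written out:

```python
def heuristic_combined(exact_match_rate: float, token_f1_avg: float,
                       rule_match_avg: float) -> float:
    """combined_score_avg for the heuristic alignment eval."""
    return 0.50 * exact_match_rate + 0.30 * token_f1_avg + 0.20 * rule_match_avg
```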

Combined ranking score:

combined_score = (1 - heuristic_weight) * hybrid_score_avg + heuristic_weight * heuristic_combined_score_avg
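As a function; a `heuristic_weight` of 0 reduces the ranking to the hybrid score alone:

```python
def combined_score(hybrid_score_avg: float,
                   heuristic_combined_score_avg: float,
                   heuristic_weight: float = 0.0) -> float:
    """Blend LLM-eval and heuristic-eval averages into one ranking score."""
    return ((1 - heuristic_weight) * hybrid_score_avg
            + heuristic_weight * heuristic_combined_score_avg)
```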

Promotion Gate

A candidate can be promoted only if all of the following hold:

  • parse_valid_rate >= 0.99
  • hybrid_score_avg >= baseline_hybrid - 0.08
  • lower p50 latency than baseline on long-text cases
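The gate can be checked mechanically; the dict keys below (`long_text_p50_ms` in particular) are assumed names for the aggregated run stats:

```python
def can_promote(candidate: dict, baseline: dict) -> bool:
    """All three gate conditions must hold for promotion."""
    return (candidate["parse_valid_rate"] >= 0.99
            and candidate["hybrid_score_avg"] >= baseline["hybrid_score_avg"] - 0.08
            and candidate["long_text_p50_ms"] < baseline["long_text_p50_ms"])
```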

Sources