Model Speed/Quality Methodology

Goal

Find a local model + generation parameter set that significantly reduces latency while preserving output quality for Aman cleanup.

Prompting Contract

All model candidates must run with the same prompt framing:

  • A single cleanup system prompt shared across all local model candidates
  • XML-tagged user messages (<request>, <language>, <transcript>, <dictionary>, output contract tags)
  • Strict JSON output contract: {"cleaned_text":"..."}
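
For illustration, a user message under this contract might look like the following; the sample content and tag order are hypothetical, and only the tags listed above come from the spec:

```
<request>Clean up this transcript.</request>
<language>en</language>
<transcript>so um i went to the the store yesterday</transcript>
<dictionary>Aman</dictionary>
```

with the model required to reply with exactly:

```
{"cleaned_text": "I went to the store yesterday."}
```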

Pipeline:

  1. A single local cleanup pass emits the final cleaned-text JSON
  2. Optional heuristic alignment eval: run deterministic alignment against timed-word fixtures (heuristics_dataset.jsonl)
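
A minimal sketch of this loop in Python, assuming the model call and the alignment step are passed in as callables (their names and signatures here are assumptions, not the repo's API):

```python
import json

def evaluate_case(case, cleanup_fn, align_fn=None, heuristic_dataset=None):
    # Step 1: one local cleanup pass; the model must return the strict
    # JSON contract {"cleaned_text": "..."} as a raw string.
    raw = cleanup_fn(case)
    try:
        cleaned = json.loads(raw)["cleaned_text"]
        if not isinstance(cleaned, str):
            raise ValueError("cleaned_text must be a string")
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        cleaned = None  # recorded as a parse failure by the scorer

    result = {"cleaned_text": cleaned}

    # Step 2 (optional): deterministic alignment against timed-word fixtures.
    if align_fn is not None and heuristic_dataset is not None:
        result["alignment"] = align_fn(cleaned, heuristic_dataset)
    return result
```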

Scoring

Per-run quality metrics:

  • parse_valid: output parsed and contains cleaned_text
  • exact_match: normalized exact match against expected output
  • similarity: normalized text similarity
  • contract_compliance: non-empty contract-compliant output
  • i_mean_literal_false_positive_rate: rate of literal "I mean" cases wrongly converted into corrections
  • i_mean_correction_false_negative_rate: rate of corrective "I mean" cases wrongly preserved as literal text
  • spelling_disambiguation_accuracy: rate at which spelling hints resolve to the expected final token
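
A sketch of how the first four metrics could be computed for one run; the normalization and the use of difflib for similarity are plausible-implementation assumptions, not the repo's code:

```python
import difflib
import json

def normalize(text):
    # Assumed normalization: lowercase and collapse whitespace.
    return " ".join(text.lower().split())

def quality_metrics(raw_output, expected):
    try:
        cleaned = json.loads(raw_output)["cleaned_text"]
        if not isinstance(cleaned, str):
            raise ValueError
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return {"parse_valid": 0.0, "exact_match": 0.0,
                "similarity": 0.0, "contract_compliance": 0.0}

    got, want = normalize(cleaned), normalize(expected)
    return {
        "parse_valid": 1.0,
        "exact_match": 1.0 if got == want else 0.0,
        "similarity": difflib.SequenceMatcher(None, got, want).ratio(),
        "contract_compliance": 1.0 if cleaned.strip() else 0.0,
    }
```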

Per-run latency metrics:

  • pass1_ms, pass2_ms, total_ms

Compatibility note:

  • The runtime editor is single-pass today.
  • Reports keep pass1_ms and pass2_ms for schema stability.
  • In current runs, pass1_ms should remain 0.0 and pass2_ms should carry the full editor latency.
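
For example, a single-pass run today would report latency fields along these lines (values illustrative; whether total_ms covers time outside the editor pass is not specified here):

```
{"pass1_ms": 0.0, "pass2_ms": 412.7, "total_ms": 430.2}
```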

Hybrid score:

hybrid_score = 0.40*parse_valid + 0.20*exact_match + 0.30*similarity + 0.10*contract_compliance
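
The same weights, as a Python helper (per-run inputs in [0, 1]):

```python
def hybrid_score(parse_valid, exact_match, similarity, contract_compliance):
    # Weights fixed by the methodology above.
    return (0.40 * parse_valid + 0.20 * exact_match
            + 0.30 * similarity + 0.10 * contract_compliance)
```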

Heuristic score (when --heuristic-dataset is provided):

  • exact_match_rate on aligned text
  • token_f1_avg
  • rule_match_avg (compliance with required/forbidden rules plus a minimum number of applied decisions)
  • decision_rule_precision / decision_rule_recall
  • combined_score_avg = 0.50*exact + 0.30*token_f1 + 0.20*rule_match
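
The aggregate, written out (inputs are the averaged rates listed above):

```python
def heuristic_combined_score(exact_match_rate, token_f1_avg, rule_match_avg):
    # combined_score_avg as defined in the bullet list above.
    return 0.50 * exact_match_rate + 0.30 * token_f1_avg + 0.20 * rule_match_avg
```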

Combined ranking score:

combined_score = (1 - heuristic_weight) * hybrid_score_avg + heuristic_weight * heuristic_combined_score_avg
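
Equivalently (heuristic_weight = 0 reduces the ranking to the hybrid score alone):

```python
def combined_score(hybrid_score_avg, heuristic_combined_score_avg, heuristic_weight):
    w = heuristic_weight
    return (1 - w) * hybrid_score_avg + w * heuristic_combined_score_avg
```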

Promotion Gate

A candidate can be promoted only if all of the following hold:

  • parse_valid_rate >= 0.99
  • hybrid_score_avg >= baseline_hybrid - 0.08
  • lower p50 latency than baseline on long-text cases
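
As a predicate over aggregate report metrics (the field names, and long_text_p50_ms in particular, are assumed for illustration):

```python
def can_promote(candidate, baseline):
    return (
        candidate["parse_valid_rate"] >= 0.99
        and candidate["hybrid_score_avg"] >= baseline["hybrid_score_avg"] - 0.08
        and candidate["long_text_p50_ms"] < baseline["long_text_p50_ms"]
    )
```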
