aman/docs/model-eval-methodology.md
Thales Maciel 94ead25737 Prune stale editor and Wayland surface area
Stop shipping code that implied Aman supported a two-pass editor, external API cleanup, or a Wayland scaffold when the runtime only exercises single-pass local cleanup on X11.

Collapse aiprocess to the active single-pass Llama contract, delete desktop_wayland and the empty wayland extra, and make model_eval reject pass1_/pass2_ tuning keys while keeping pass1_ms/pass2_ms as report compatibility fields.

Remove the unused pillow dependency, switch to SPDX-style license metadata, and clean setuptools build state before packaging so deleted modules do not leak into wheels. Update the methodology and repo guidance docs, and add focused tests for desktop adapter selection, stale param rejection, and portable wheel contents.

Validate with uv lock, python3 -m unittest discover -s tests -p 'test_*.py', python3 -m py_compile src/*.py tests/*.py, and python3 -m build --wheel --sdist --no-isolation.
2026-03-14 17:48:23 -03:00


# Model Speed/Quality Methodology
## Goal
Find a local model + generation parameter set that significantly reduces latency while preserving output quality for Aman cleanup.
## Prompting Contract
All model candidates must run with the same prompt framing:
- A single cleanup system prompt shared across all local model candidates
- XML-tagged user messages (`<request>`, `<language>`, `<transcript>`, `<dictionary>`, output contract tags)
- Strict JSON output contract: `{"cleaned_text":"..."}`
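As a sketch of this framing, the contract above can be enforced in a few lines. The function names below are illustrative assumptions; only the tag set and the `{"cleaned_text":"..."}` contract come from this document:

```python
import json


def build_user_message(request: str, language: str,
                       transcript: str, dictionary: str) -> str:
    """Assemble the XML-tagged user message shared by all candidates."""
    return (
        f"<request>{request}</request>\n"
        f"<language>{language}</language>\n"
        f"<transcript>{transcript}</transcript>\n"
        f"<dictionary>{dictionary}</dictionary>"
    )


def parse_model_output(raw: str) -> str:
    """Enforce the strict JSON output contract: {"cleaned_text": "..."}."""
    obj = json.loads(raw)
    if not isinstance(obj, dict) or "cleaned_text" not in obj:
        raise ValueError("output violates the cleaned_text contract")
    return obj["cleaned_text"]
```

A candidate that emits anything other than a single JSON object with `cleaned_text` fails this parse and scores `parse_valid = 0` for that run.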
Pipeline:
1. Single local cleanup pass emits final text JSON
2. Optional heuristic alignment eval: run deterministic alignment against
timed-word fixtures (`heuristics_dataset.jsonl`)
## Scoring
Per-run quality metrics:
- `parse_valid`: output parsed and contains `cleaned_text`
- `exact_match`: normalized exact match against expected output
- `similarity`: normalized text similarity
- `contract_compliance`: non-empty contract-compliant output
- `i_mean_literal_false_positive_rate`: literal `I mean` cases wrongly converted to correction
- `i_mean_correction_false_negative_rate`: correction `I mean` cases wrongly preserved literally
- `spelling_disambiguation_accuracy`: spelling hints resolved to expected final token
Per-run latency metrics:
- `pass1_ms`, `pass2_ms`, `total_ms`
Compatibility note:
- The runtime editor is single-pass today.
- Reports keep `pass1_ms` and `pass2_ms` for schema stability.
- In current runs, `pass1_ms` should remain `0.0` and `pass2_ms` carries the
full editor latency.
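Under today's single-pass runtime, the latency fields of a report row can be sketched as follows (the helper name is hypothetical; the field names and the `pass1_ms = 0.0` convention are from the note above):

```python
def latency_fields(editor_ms: float) -> dict:
    """Hypothetical report row: single-pass runtime keeps pass1_ms at 0.0,
    pass2_ms carries the full editor latency, total_ms is their sum."""
    return {"pass1_ms": 0.0, "pass2_ms": editor_ms, "total_ms": 0.0 + editor_ms}
```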
Hybrid score:
`0.40*parse_valid + 0.20*exact_match + 0.30*similarity + 0.10*contract_compliance`
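The hybrid score is a straight weighted sum of the four quality metrics, each already normalized to `[0, 1]` (a minimal sketch; the function name is an assumption, the weights are from the formula above):

```python
def hybrid_score(parse_valid: float, exact_match: float,
                 similarity: float, contract_compliance: float) -> float:
    """Weighted quality score; weights 0.40/0.20/0.30/0.10 sum to 1.0."""
    return (0.40 * parse_valid + 0.20 * exact_match
            + 0.30 * similarity + 0.10 * contract_compliance)
```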
Heuristic score (when `--heuristic-dataset` is provided):
- `exact_match_rate` on aligned text
- `token_f1_avg`
- `rule_match_avg` (required/forbidden rule compliance + min applied decisions)
- `decision_rule_precision` / `decision_rule_recall`
- `combined_score_avg = 0.50*exact + 0.30*token_f1 + 0.20*rule_match`
Combined ranking score:
`combined_score = (1 - heuristic_weight) * hybrid_score_avg + heuristic_weight * heuristic_combined_score_avg`
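Both combination formulas above can be sketched directly (function names are assumptions; the weights and the `heuristic_weight` interpolation are from this section — with `heuristic_weight = 0.0` the ranking reduces to the hybrid score alone):

```python
def heuristic_combined(exact: float, token_f1: float, rule_match: float) -> float:
    """Per-run heuristic score: 0.50*exact + 0.30*token_f1 + 0.20*rule_match."""
    return 0.50 * exact + 0.30 * token_f1 + 0.20 * rule_match


def combined_score(hybrid_score_avg: float,
                   heuristic_combined_score_avg: float,
                   heuristic_weight: float) -> float:
    """Final ranking score: linear interpolation between the two averages."""
    return ((1 - heuristic_weight) * hybrid_score_avg
            + heuristic_weight * heuristic_combined_score_avg)
```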
## Promotion Gate
A candidate can be promoted only if all of the following hold:
- `parse_valid_rate >= 0.99`
- `hybrid_score_avg >= baseline_hybrid - 0.08`
- lower p50 latency than baseline on long-text cases
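The gate above is a conjunction of three checks, which can be sketched as one predicate (the function and parameter names are illustrative assumptions; the thresholds are from this section):

```python
def passes_promotion_gate(parse_valid_rate: float,
                          hybrid_score_avg: float,
                          baseline_hybrid: float,
                          p50_latency_ms: float,
                          baseline_p50_latency_ms: float) -> bool:
    """All three gate conditions must hold; latencies are measured
    on long-text cases."""
    return (parse_valid_rate >= 0.99
            and hybrid_score_avg >= baseline_hybrid - 0.08
            and p50_latency_ms < baseline_p50_latency_ms)
```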
## Sources
- https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct
- https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct
- https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct
- https://github.com/ggml-org/llama.cpp
- https://github.com/abetlen/llama-cpp-python