# Model Speed/Quality Methodology

## Goal

Find a local model + generation parameter set that significantly reduces latency while preserving output quality for Aman cleanup.

## Prompting Contract

All model candidates must run with the same prompt framing:

- A single cleanup system prompt shared across all local model candidates
- XML-tagged user messages (``, ``, ``, ``, output contract tags)
- Strict JSON output contract: `{"cleaned_text":"..."}`

Pipeline:

1. A single local cleanup pass emits the final text JSON
2. Optional heuristic alignment eval: run deterministic alignment against timed-word fixtures (`heuristics_dataset.jsonl`)

## Scoring

Per-run quality metrics:

- `parse_valid`: output parsed and contains `cleaned_text`
- `exact_match`: normalized exact match against the expected output
- `similarity`: normalized text similarity
- `contract_compliance`: non-empty, contract-compliant output
- `i_mean_literal_false_positive_rate`: literal `I mean` cases wrongly converted to corrections
- `i_mean_correction_false_negative_rate`: correction `I mean` cases wrongly preserved literally
- `spelling_disambiguation_accuracy`: spelling hints resolved to the expected final token

Per-run latency metrics:

- `pass1_ms`, `pass2_ms`, `total_ms`

Compatibility note:

- The runtime editor is single-pass today.
- Reports keep `pass1_ms` and `pass2_ms` for schema stability.
- In current runs, `pass1_ms` should remain `0.0` and `pass2_ms` carries the full editor latency.
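The per-run quality metrics above can be sketched as a small scoring helper. This is a hypothetical illustration, not the actual evaluation harness: `score_run` and its normalization are assumptions, and only the four contract-level metrics are shown (the `I mean` and spelling metrics need labeled fixtures).

```python
import difflib
import json


def score_run(raw_output: str, expected: str) -> dict:
    """Score one cleanup run against the strict JSON output contract.

    Returns parse_valid, exact_match, similarity, and contract_compliance
    as 0.0/1.0 flags (similarity is a ratio in [0, 1]).
    """
    metrics = {"parse_valid": 0.0, "exact_match": 0.0,
               "similarity": 0.0, "contract_compliance": 0.0}

    # parse_valid: output parsed and contains `cleaned_text`
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return metrics
    if not isinstance(payload, dict) or "cleaned_text" not in payload:
        return metrics
    metrics["parse_valid"] = 1.0

    # contract_compliance: non-empty, contract-compliant output
    cleaned = payload["cleaned_text"]
    if isinstance(cleaned, str) and cleaned.strip():
        metrics["contract_compliance"] = 1.0

    # Normalize before comparing (assumed normalization: lowercase,
    # collapsed whitespace).
    def norm(s: object) -> str:
        return " ".join(str(s).lower().split())

    a, b = norm(cleaned), norm(expected)
    metrics["exact_match"] = 1.0 if a == b else 0.0
    metrics["similarity"] = difflib.SequenceMatcher(None, a, b).ratio()
    return metrics
```

A malformed or non-compliant output scores zero on every metric, which is what lets `parse_valid_rate` act as a hard gate later.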
Hybrid score: `0.40*parse_valid + 0.20*exact_match + 0.30*similarity + 0.10*contract_compliance`

Heuristic score (when `--heuristic-dataset` is provided):

- `exact_match_rate` on aligned text
- `token_f1_avg`
- `rule_match_avg` (required/forbidden rule compliance + min applied decisions)
- `decision_rule_precision` / `decision_rule_recall`
- `combined_score_avg = 0.50*exact + 0.30*token_f1 + 0.20*rule_match`

Combined ranking score:

`combined_score = (1 - heuristic_weight) * hybrid_score_avg + heuristic_weight * heuristic_combined_score_avg`

## Promotion Gate

A candidate can be promoted if:

- `parse_valid_rate >= 0.99`
- `hybrid_score_avg >= baseline_hybrid - 0.08`
- lower p50 latency than the baseline on long-text cases

## Sources

- https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct
- https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct
- https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct
- https://github.com/ggml-org/llama.cpp
- https://github.com/abetlen/llama-cpp-python
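The hybrid score, combined ranking score, and promotion gate above can be sketched as follows. Function names and argument shapes are hypothetical; the weights and thresholds are taken directly from the formulas in this document.

```python
def hybrid_score(m: dict) -> float:
    """Weighted quality score: 0.40*parse_valid + 0.20*exact_match
    + 0.30*similarity + 0.10*contract_compliance."""
    return (0.40 * m["parse_valid"]
            + 0.20 * m["exact_match"]
            + 0.30 * m["similarity"]
            + 0.10 * m["contract_compliance"])


def combined_score(hybrid_score_avg: float,
                   heuristic_combined_score_avg: float,
                   heuristic_weight: float) -> float:
    """Blend model-quality and heuristic-alignment averages for ranking."""
    return ((1 - heuristic_weight) * hybrid_score_avg
            + heuristic_weight * heuristic_combined_score_avg)


def passes_promotion_gate(parse_valid_rate: float,
                          hybrid_score_avg: float,
                          baseline_hybrid: float,
                          p50_latency_ms: float,
                          baseline_p50_ms: float) -> bool:
    """Apply the three promotion criteria from the Promotion Gate section.

    Latency is compared on long-text cases; callers are expected to pass
    p50 values computed over that subset.
    """
    return (parse_valid_rate >= 0.99
            and hybrid_score_avg >= baseline_hybrid - 0.08
            and p50_latency_ms < baseline_p50_ms)
```

Note that the gate tolerates up to a 0.08 drop in average hybrid score relative to the baseline, but parse validity and latency are hard requirements: a faster model that emits malformed JSON never promotes.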