Model Speed/Quality Methodology
Goal
Find a local model + generation parameter set that significantly reduces latency while preserving output quality for Aman cleanup.
Prompting Contract
All model candidates must run with the same prompt framing:
- XML-tagged system contract for pass 1 (draft) and pass 2 (audit)
- XML-tagged user messages (<request>, <language>, <transcript>, <dictionary>, output contract tags)
- Strict JSON output contracts:
  - pass 1: {"candidate_text":"...","decision_spans":[...]}
  - pass 2: {"cleaned_text":"..."}
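The JSON contracts above can be checked mechanically. The following is a minimal sketch, assuming that only the top-level key names and types shown in the contracts are enforced. The function names `parse_pass1` and `parse_pass2` are hypothetical and not part of the benchmark code.

```python
import json


def _load_object(raw):
    """Parse raw model output; return a dict, or None if it is not a JSON object."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None


def parse_pass1(raw):
    """Hypothetical helper: validate the draft-pass contract."""
    obj = _load_object(raw)
    if obj is None:
        return None
    if not isinstance(obj.get("candidate_text"), str):
        return None
    if not isinstance(obj.get("decision_spans"), list):
        return None
    return obj


def parse_pass2(raw):
    """Hypothetical helper: validate the audit-pass contract."""
    obj = _load_object(raw)
    if obj is None or not isinstance(obj.get("cleaned_text"), str):
        return None
    return obj
```

Returning None rather than raising keeps a malformed response from aborting a whole benchmark run; the caller can record it as a parse failure instead.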
Pipeline:
- Draft pass: produce candidate cleaned text + ambiguity decisions
- Audit pass: validate ambiguous corrections conservatively and emit final text
- Optional heuristic alignment eval: run deterministic alignment against timed-word fixtures (heuristics_dataset.jsonl)
Scoring
Per-run quality metrics:
- parse_valid: output parsed and contains cleaned_text
- exact_match: normalized exact match against expected output
- similarity: normalized text similarity
- contract_compliance: non-empty contract-compliant output
- i_mean_literal_false_positive_rate: literal "I mean" cases wrongly converted to correction
- i_mean_correction_false_negative_rate: correction "I mean" cases wrongly preserved literally
- spelling_disambiguation_accuracy: spelling hints resolved to expected final token
Per-run latency metrics:
- pass1_ms, pass2_ms, total_ms
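One way the three latency fields could be collected is sketched below. It assumes a hypothetical `generate(pass_name, payload)` callable that wraps the model and returns raw JSON text, that the audit pass receives the draft's candidate_text, and that total_ms is the sum of the two pass timings. None of these details are confirmed by this document.

```python
import json
import time


def run_two_pass(generate, transcript):
    """Run draft then audit, timing each pass in milliseconds.

    `generate` is a hypothetical callable: (pass_name, payload) -> raw JSON string.
    """
    start = time.perf_counter()
    draft = json.loads(generate("draft", transcript))
    mid = time.perf_counter()
    final = json.loads(generate("audit", draft["candidate_text"]))
    end = time.perf_counter()

    pass1_ms = (mid - start) * 1000.0
    pass2_ms = (end - mid) * 1000.0
    return {
        "cleaned_text": final["cleaned_text"],
        "pass1_ms": pass1_ms,
        "pass2_ms": pass2_ms,
        # Assumption: total latency is the sum of both passes.
        "total_ms": pass1_ms + pass2_ms,
    }
```

`time.perf_counter` is used because it is a monotonic clock, so wall-clock adjustments during a run do not skew the measurement.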
Hybrid score:
0.40*parse_valid + 0.20*exact_match + 0.30*similarity + 0.10*contract_compliance
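As a worked illustration of the hybrid score, here is a small sketch. It assumes parse_valid, exact_match, and contract_compliance are recorded as 0 or 1 and similarity lies in [0, 1]. Under that assumption the weights sum to 1.0, so the score also stays in [0, 1].

```python
def hybrid_score(parse_valid, exact_match, similarity, contract_compliance):
    """Weighted per-run quality score using the weights from the formula above."""
    return (
        0.40 * parse_valid
        + 0.20 * exact_match
        + 0.30 * similarity
        + 0.10 * contract_compliance
    )


# A run that parses and complies but is only half-similar and not an exact match:
# 0.40*1 + 0.20*0 + 0.30*0.5 + 0.10*1 = 0.65
example = hybrid_score(1, 0, 0.5, 1)
```

Because parse_valid carries the largest weight, a run that fails to parse can score at most 0.60, even if its text would otherwise have matched.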
Heuristic score (when --heuristic-dataset is provided):
- exact_match_rate on aligned text
- token_f1_avg
- rule_match_avg (required/forbidden rule compliance + min applied decisions)
- decision_rule_precision / decision_rule_recall
- combined_score_avg = 0.50*exact + 0.30*token_f1 + 0.20*rule_match
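The heuristic metrics can be illustrated with a short sketch. The token F1 shown here is one common definition (whitespace tokens, multiset overlap) and is an assumption; the benchmark's actual tokenization and normalization are not specified in this document. Only the 0.50/0.30/0.20 weights come from the formula above.

```python
from collections import Counter


def token_f1(predicted, expected):
    """Token-level F1 over whitespace tokens (assumed definition)."""
    pred_tokens = predicted.split()
    exp_tokens = expected.split()
    if not pred_tokens and not exp_tokens:
        return 1.0
    if not pred_tokens or not exp_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it
    # appears in both sequences.
    overlap = sum((Counter(pred_tokens) & Counter(exp_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(exp_tokens)
    return 2 * precision * recall / (precision + recall)


def heuristic_combined(exact, token_f1_value, rule_match):
    """Per-case heuristic score using the weights from the formula above."""
    return 0.50 * exact + 0.30 * token_f1_value + 0.20 * rule_match
```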
Combined ranking score:
combined_score = (1 - heuristic_weight) * hybrid_score_avg + heuristic_weight * heuristic_combined_score_avg
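The ranking score is a linear blend of the two averages. A minimal sketch follows; the range check on heuristic_weight is an added assumption, not something this document states.

```python
def combined_score(hybrid_score_avg, heuristic_combined_score_avg, heuristic_weight):
    """Blend the hybrid and heuristic averages into one ranking score."""
    # Assumption: the weight is a fraction in [0, 1].
    if not 0.0 <= heuristic_weight <= 1.0:
        raise ValueError("heuristic_weight must be in [0, 1]")
    return (
        (1 - heuristic_weight) * hybrid_score_avg
        + heuristic_weight * heuristic_combined_score_avg
    )


# With heuristic_weight = 0.25: 0.75*0.8 + 0.25*0.6 = 0.75
example_rank = combined_score(0.8, 0.6, 0.25)
```

A weight of 0 ranks on the hybrid score alone, and a weight of 1 ranks on the heuristic score alone.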
Promotion Gate
Candidate can be promoted if:
- parse_valid_rate >= 0.99
- hybrid_score_avg >= baseline_hybrid - 0.08
- lower p50 latency than baseline on long-text cases
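The gate can be expressed as a single predicate. This sketch assumes that all three conditions must hold together, that p50 means the median of total_ms, and that the caller has already filtered the latency samples to long-text cases. The function name `can_promote` is hypothetical.

```python
from statistics import median


def can_promote(
    parse_valid_rate,
    hybrid_score_avg,
    baseline_hybrid,
    candidate_long_ms,
    baseline_long_ms,
):
    """Return True if a candidate passes all three promotion conditions.

    The *_long_ms arguments are non-empty lists of total_ms samples from
    long-text cases only (assumption: filtering happens upstream).
    """
    return (
        parse_valid_rate >= 0.99
        and hybrid_score_avg >= baseline_hybrid - 0.08
        and median(candidate_long_ms) < median(baseline_long_ms)
    )
```

For example, a candidate with parse_valid_rate 0.995, hybrid_score_avg 0.85 against a baseline of 0.90, and a lower median latency passes, while the same candidate with hybrid_score_avg 0.80 fails the quality condition.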