# Model Speed/Quality Methodology
## Goal
Find a local model + generation parameter set that significantly reduces latency while preserving output quality for Aman cleanup.
## Prompting Contract
All model candidates must run with the same prompt framing:
- XML-tagged system contract for pass 1 (draft) and pass 2 (audit)
- XML-tagged user messages (`<request>`, `<language>`, `<transcript>`, `<dictionary>`, output contract tags)
- Strict JSON output contracts:
  - pass 1: `{"candidate_text":"...","decision_spans":[...]}`
  - pass 2: `{"cleaned_text":"..."}`
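The shared framing can be sketched as follows. Only the tag names and JSON shapes come from the contract above; the helper names and the exact message layout are assumptions.

```python
import json

def build_user_message(request: str, language: str, transcript: str,
                       dictionary: list[str]) -> str:
    """Wrap the inputs in the XML tags every candidate model receives."""
    dict_block = "\n".join(dictionary)
    return (
        f"<request>{request}</request>\n"
        f"<language>{language}</language>\n"
        f"<transcript>{transcript}</transcript>\n"
        f"<dictionary>{dict_block}</dictionary>"
    )

def parse_pass1(raw: str) -> dict:
    """Enforce the pass-1 contract: candidate_text + decision_spans."""
    obj = json.loads(raw)
    assert "candidate_text" in obj and "decision_spans" in obj
    return obj
```

Keeping the framing identical across candidates means any score difference is attributable to the model and generation parameters, not the prompt.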
Pipeline:
1. Draft pass: produce candidate cleaned text + ambiguity decisions
2. Audit pass: validate ambiguous corrections conservatively and emit final text
3. Optional heuristic alignment eval: run deterministic alignment against timed-word fixtures (`heuristics_dataset.jsonl`)
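The two mandatory passes can be sketched as a single driver. `run_model` is a hypothetical callable (stage, prompt → raw JSON string) standing in for any candidate backend; the audit-input layout is an assumption.

```python
import json
import time

def run_two_pass(run_model, user_message: str) -> tuple[str, dict]:
    """Draft pass then audit pass, timing each in milliseconds."""
    t0 = time.perf_counter()
    # Pass 1: candidate_text + decision_spans
    draft = json.loads(run_model("pass1", user_message))
    t1 = time.perf_counter()
    # Pass 2: audit the draft conservatively, emit cleaned_text only
    audit_input = user_message + "\n" + json.dumps(draft)
    final = json.loads(run_model("pass2", audit_input))
    t2 = time.perf_counter()
    timings = {"pass1_ms": (t1 - t0) * 1000,
               "pass2_ms": (t2 - t1) * 1000,
               "total_ms": (t2 - t0) * 1000}
    return final["cleaned_text"], timings
```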
## Scoring
Per-run quality metrics:
- `parse_valid`: output parsed and contains `cleaned_text`
- `exact_match`: normalized exact match against expected output
- `similarity`: normalized text similarity
- `contract_compliance`: non-empty contract-compliant output
- `i_mean_literal_false_positive_rate`: literal `I mean` cases wrongly converted to correction
- `i_mean_correction_false_negative_rate`: correction `I mean` cases wrongly preserved literally
- `spelling_disambiguation_accuracy`: spelling hints resolved to expected final token
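The first four metrics can be sketched as below. The normalization (lowercase, collapsed whitespace) and the `difflib`-based similarity are assumptions; the doc only names the metrics.

```python
import difflib
import json

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparison (assumed scheme)."""
    return " ".join(text.lower().split())

def score_run(raw_output: str, expected: str) -> dict:
    """Score one model run against the expected cleaned text."""
    try:
        cleaned = json.loads(raw_output).get("cleaned_text")
        parse_valid = cleaned is not None
    except (json.JSONDecodeError, AttributeError):
        cleaned, parse_valid = None, False
    got = normalize(cleaned or "")
    want = normalize(expected)
    return {
        "parse_valid": float(parse_valid),
        "exact_match": float(got == want),
        "similarity": difflib.SequenceMatcher(None, got, want).ratio(),
        "contract_compliance": float(parse_valid and bool(got)),
    }
```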
Per-run latency metrics:
- `pass1_ms`, `pass2_ms`, `total_ms`
Hybrid score:
`0.40*parse_valid + 0.20*exact_match + 0.30*similarity + 0.10*contract_compliance`
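A direct transcription of this weighting, taking the metrics dict from a scored run:

```python
def hybrid_score(m: dict) -> float:
    """Weighted hybrid quality score; weights sum to 1.0."""
    return (0.40 * m["parse_valid"] + 0.20 * m["exact_match"]
            + 0.30 * m["similarity"] + 0.10 * m["contract_compliance"])
```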
Heuristic score (when `--heuristic-dataset` is provided):
- `exact_match_rate` on aligned text
- `token_f1_avg`
- `rule_match_avg` (required/forbidden rule compliance + min applied decisions)
- `decision_rule_precision` / `decision_rule_recall`
- `combined_score_avg = 0.50*exact + 0.30*token_f1 + 0.20*rule_match`
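A sketch of the per-case heuristic score: the 0.50/0.30/0.20 weights come from the formula above, while the bag-of-words token F1 over whitespace tokens is an assumption.

```python
from collections import Counter

def token_f1(got: str, want: str) -> float:
    """Bag-of-words F1 between aligned and expected text (assumed tokenization)."""
    g, w = Counter(got.split()), Counter(want.split())
    overlap = sum((g & w).values())  # multiset intersection
    if not overlap:
        return 0.0
    precision = overlap / sum(g.values())
    recall = overlap / sum(w.values())
    return 2 * precision * recall / (precision + recall)

def heuristic_combined(exact: float, f1: float, rule_match: float) -> float:
    """Per-case combined heuristic score; weights sum to 1.0."""
    return 0.50 * exact + 0.30 * f1 + 0.20 * rule_match
```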
Combined ranking score:
`combined_score = (1 - heuristic_weight) * hybrid_score_avg + heuristic_weight * heuristic_combined_score_avg`
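As a function, with the assumption that `heuristic_weight` defaults to `0.0` when no `--heuristic-dataset` is supplied (so ranking falls back to the hybrid score alone):

```python
def combined_score(hybrid_avg: float, heuristic_avg: float,
                   heuristic_weight: float = 0.0) -> float:
    """Blend hybrid and heuristic averages into one ranking score."""
    return (1 - heuristic_weight) * hybrid_avg + heuristic_weight * heuristic_avg
```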
## Promotion Gate
Candidate can be promoted if:
- `parse_valid_rate >= 0.99`
- `hybrid_score_avg >= baseline_hybrid - 0.08`
- lower p50 latency than baseline on long-text cases
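The gate can be sketched as a predicate over per-model summary dicts; the field names mirror the criteria above, but the summary-dict shape itself (in particular `long_text_p50_ms`) is an assumption.

```python
def can_promote(candidate: dict, baseline: dict) -> bool:
    """All three gate criteria must hold for promotion."""
    return (candidate["parse_valid_rate"] >= 0.99
            and candidate["hybrid_score_avg"] >= baseline["hybrid_score_avg"] - 0.08
            and candidate["long_text_p50_ms"] < baseline["long_text_p50_ms"])
```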
## Sources
- https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct
- https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct
- https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct
- https://github.com/ggml-org/llama.cpp
- https://github.com/abetlen/llama-cpp-python