Stop shipping code that implied Aman supported a two-pass editor, external API cleanup, or a Wayland scaffold when the runtime only exercises single-pass local cleanup on X11.

Collapse aiprocess to the active single-pass Llama contract, delete desktop_wayland and the empty wayland extra, and make model_eval reject pass1_/pass2_ tuning keys while keeping pass1_ms/pass2_ms as report compatibility fields.

Remove the unused pillow dependency, switch to SPDX-style license metadata, and clean setuptools build state before packaging so deleted modules do not leak into wheels. Update the methodology and repo guidance docs, and add focused tests for desktop adapter selection, stale param rejection, and portable wheel contents.

Validate with `uv lock`, `python3 -m unittest discover -s tests -p 'test_*.py'`, `python3 -m py_compile src/*.py tests/*.py`, and `python3 -m build --wheel --sdist --no-isolation`.
# Model Speed/Quality Methodology

## Goal

Find a local model + generation parameter set that significantly reduces latency while preserving output quality for Aman cleanup.
## Prompting Contract

All model candidates must run with the same prompt framing:

- A single cleanup system prompt shared across all local model candidates
- XML-tagged user messages (`<request>`, `<language>`, `<transcript>`, `<dictionary>`, output contract tags)
- Strict JSON output contract: `{"cleaned_text":"..."}`
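A minimal sketch of the contract's two halves, assembling the XML-tagged user message and validating the strict JSON reply. The helper names (`build_user_message`, `parse_cleaned_text`) are illustrative, not Aman's actual API:

```python
import json


def build_user_message(request: str, language: str,
                       transcript: str, dictionary: str) -> str:
    """Wrap the cleanup inputs in the XML tags the contract expects."""
    return (
        f"<request>{request}</request>\n"
        f"<language>{language}</language>\n"
        f"<transcript>{transcript}</transcript>\n"
        f"<dictionary>{dictionary}</dictionary>"
    )


def parse_cleaned_text(raw_output: str):
    """Return cleaned_text if the model obeyed the strict JSON contract, else None."""
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return None
    text = payload.get("cleaned_text") if isinstance(payload, dict) else None
    return text if isinstance(text, str) else None
```

Anything that fails `parse_cleaned_text` counts against `parse_valid` below.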
Pipeline:

- Single local cleanup pass emits final text JSON
- Optional heuristic alignment eval: run deterministic alignment against timed-word fixtures (`heuristics_dataset.jsonl`)
## Scoring

Per-run quality metrics:

- `parse_valid`: output parsed and contains `cleaned_text`
- `exact_match`: normalized exact match against expected output
- `similarity`: normalized text similarity
- `contract_compliance`: non-empty contract-compliant output
- `i_mean_literal_false_positive_rate`: literal "I mean" cases wrongly converted to a correction
- `i_mean_correction_false_negative_rate`: correction "I mean" cases wrongly preserved literally
- `spelling_disambiguation_accuracy`: spelling hints resolved to the expected final token
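The normalized match and similarity metrics can be sketched as follows. The normalization rules (lowercase, whitespace collapse) and the use of `difflib` are assumptions for illustration; the eval's real implementation may differ:

```python
import difflib


def normalize(text: str) -> str:
    """Illustrative normalization: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def exact_match(candidate: str, expected: str) -> float:
    """1.0 if the normalized texts are identical, else 0.0."""
    return 1.0 if normalize(candidate) == normalize(expected) else 0.0


def similarity(candidate: str, expected: str) -> float:
    """Normalized text similarity in [0, 1] via difflib's ratio."""
    matcher = difflib.SequenceMatcher(None, normalize(candidate), normalize(expected))
    return matcher.ratio()
```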
Per-run latency metrics:

- `pass1_ms`, `pass2_ms`, `total_ms`
Compatibility note:

- The runtime editor is single-pass today.
- Reports keep `pass1_ms` and `pass2_ms` for schema stability.
- In current runs, `pass1_ms` should remain `0.0` and `pass2_ms` carries the full editor latency.
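A single-pass run reported against the two-pass schema might look like this. The field names come from the report contract above; the helper itself is a sketch, not the actual report builder:

```python
def latency_report(editor_ms: float) -> dict:
    """Report a single-pass run under the two-pass schema.

    pass1_ms stays 0.0 for schema stability; pass2_ms carries the
    full editor latency, so total_ms equals pass2_ms in current runs.
    """
    return {
        "pass1_ms": 0.0,
        "pass2_ms": editor_ms,
        "total_ms": 0.0 + editor_ms,
    }
```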
Hybrid score:

`0.40*parse_valid + 0.20*exact_match + 0.30*similarity + 0.10*contract_compliance`
Heuristic score (when `--heuristic-dataset` is provided):

- `exact_match_rate` on aligned text
- `token_f1_avg`
- `rule_match_avg` (required/forbidden rule compliance + min applied decisions)
- `decision_rule_precision` / `decision_rule_recall`
- `combined_score_avg = 0.50*exact + 0.30*token_f1 + 0.20*rule_match`
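The `combined_score_avg` weighting, as a function (name is illustrative; weights come from the formula above):

```python
def heuristic_combined_score(exact: float, token_f1: float,
                             rule_match: float) -> float:
    """Heuristic combined score: 0.50*exact + 0.30*token_f1 + 0.20*rule_match."""
    return 0.50 * exact + 0.30 * token_f1 + 0.20 * rule_match
```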
Combined ranking score:

`combined_score = (1 - heuristic_weight) * hybrid_score_avg + heuristic_weight * heuristic_combined_score_avg`
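The ranking blend, as a function. The default `heuristic_weight` of 0.0 is an assumption (consistent with the heuristic eval being optional), not a documented default:

```python
def combined_score(hybrid_score_avg: float,
                   heuristic_combined_score_avg: float,
                   heuristic_weight: float = 0.0) -> float:
    """Blend hybrid and heuristic averages by heuristic_weight in [0, 1]."""
    return ((1 - heuristic_weight) * hybrid_score_avg
            + heuristic_weight * heuristic_combined_score_avg)
```

With `heuristic_weight = 0.0`, ranking reduces to `hybrid_score_avg` alone.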
## Promotion Gate

A candidate can be promoted if:

- `parse_valid_rate >= 0.99`
- `hybrid_score_avg >= baseline_hybrid - 0.08`
- lower p50 latency than baseline on long-text cases
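The gate conditions above can be sketched as a single predicate; the argument names are illustrative, and the latency inputs are assumed to be p50 values measured on long-text cases:

```python
def can_promote(parse_valid_rate: float,
                hybrid_score_avg: float,
                baseline_hybrid: float,
                p50_latency_ms: float,
                baseline_p50_latency_ms: float) -> bool:
    """All three gate conditions must hold for promotion."""
    return (parse_valid_rate >= 0.99
            and hybrid_score_avg >= baseline_hybrid - 0.08
            and p50_latency_ms < baseline_p50_latency_ms)
```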