Stop shipping code that implied Aman supported a two-pass editor, external API cleanup, or a Wayland scaffold when the runtime only exercises single-pass local cleanup on X11.

Collapse aiprocess to the active single-pass Llama contract, delete desktop_wayland and the empty wayland extra, and make model_eval reject pass1_/pass2_ tuning keys while keeping pass1_ms/pass2_ms as report compatibility fields.

Remove the unused pillow dependency, switch to SPDX-style license metadata, and clean setuptools build state before packaging so deleted modules do not leak into wheels. Update the methodology and repo guidance docs, and add focused tests for desktop adapter selection, stale param rejection, and portable wheel contents.

Validate with `uv lock`, `python3 -m unittest discover -s tests -p 'test_*.py'`, `python3 -m py_compile src/*.py tests/*.py`, and `python3 -m build --wheel --sdist --no-isolation`.
# Model Speed/Quality Methodology

## Goal

Find a local model + generation parameter set that significantly reduces latency while preserving output quality for Aman cleanup.
## Prompting Contract

All model candidates must run with the same prompt framing:

- A single cleanup system prompt shared across all local model candidates
- XML-tagged user messages (`<request>`, `<language>`, `<transcript>`, `<dictionary>`, output contract tags)
- Strict JSON output contract: `{"cleaned_text":"..."}`
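A minimal sketch of the contract's two halves, assembling the XML-tagged user message and validating the strict JSON reply. The helper names (`build_user_message`, `parse_cleaned_text`) are illustrative, not Aman's actual API:

```python
import json


def build_user_message(request: str, language: str,
                       transcript: str, dictionary: str) -> str:
    """Wrap the cleanup inputs in the XML tags the contract expects."""
    return (
        f"<request>{request}</request>\n"
        f"<language>{language}</language>\n"
        f"<transcript>{transcript}</transcript>\n"
        f"<dictionary>{dictionary}</dictionary>"
    )


def parse_cleaned_text(raw_output: str):
    """Return cleaned_text if the model obeyed the strict JSON contract, else None."""
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return None
    text = payload.get("cleaned_text") if isinstance(payload, dict) else None
    return text if isinstance(text, str) else None
```

Anything that fails `parse_cleaned_text` counts against `parse_valid` below.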
Pipeline:

- Single local cleanup pass emits final text JSON
- Optional heuristic alignment eval: run deterministic alignment against timed-word fixtures (`heuristics_dataset.jsonl`)
## Scoring

Per-run quality metrics:

- `parse_valid`: output parsed and contains `cleaned_text`
- `exact_match`: normalized exact match against expected output
- `similarity`: normalized text similarity
- `contract_compliance`: non-empty contract-compliant output
- `i_mean_literal_false_positive_rate`: literal "I mean" cases wrongly converted to a correction
- `i_mean_correction_false_negative_rate`: correction "I mean" cases wrongly preserved literally
- `spelling_disambiguation_accuracy`: spelling hints resolved to the expected final token
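The normalized match and similarity metrics can be sketched as follows. The normalization rules (lowercase, whitespace collapse) and the use of `difflib` are assumptions for illustration; the eval's real implementation may differ:

```python
import difflib


def normalize(text: str) -> str:
    """Illustrative normalization: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def exact_match(candidate: str, expected: str) -> float:
    """1.0 if the normalized texts are identical, else 0.0."""
    return 1.0 if normalize(candidate) == normalize(expected) else 0.0


def similarity(candidate: str, expected: str) -> float:
    """Normalized text similarity in [0, 1] via difflib's ratio."""
    matcher = difflib.SequenceMatcher(None, normalize(candidate), normalize(expected))
    return matcher.ratio()
```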
Per-run latency metrics:

- `pass1_ms`, `pass2_ms`, `total_ms`
Compatibility note:

- The runtime editor is single-pass today.
- Reports keep `pass1_ms` and `pass2_ms` for schema stability.
- In current runs, `pass1_ms` should remain `0.0` and `pass2_ms` carries the full editor latency.
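A single-pass run reported against the two-pass schema might look like this. The field names come from the report contract above; the helper itself is a sketch, not the actual report builder:

```python
def latency_report(editor_ms: float) -> dict:
    """Report a single-pass run under the two-pass schema.

    pass1_ms stays 0.0 for schema stability; pass2_ms carries the
    full editor latency, so total_ms equals pass2_ms in current runs.
    """
    return {
        "pass1_ms": 0.0,
        "pass2_ms": editor_ms,
        "total_ms": 0.0 + editor_ms,
    }
```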
Hybrid score:

`0.40*parse_valid + 0.20*exact_match + 0.30*similarity + 0.10*contract_compliance`
Heuristic score (when `--heuristic-dataset` is provided):

- `exact_match_rate` on aligned text
- `token_f1_avg`
- `rule_match_avg` (required/forbidden rule compliance + min applied decisions)
- `decision_rule_precision` / `decision_rule_recall`
- `combined_score_avg = 0.50*exact + 0.30*token_f1 + 0.20*rule_match`
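The `combined_score_avg` weighting, as a function (name is illustrative; weights come from the formula above):

```python
def heuristic_combined_score(exact: float, token_f1: float,
                             rule_match: float) -> float:
    """Heuristic combined score: 0.50*exact + 0.30*token_f1 + 0.20*rule_match."""
    return 0.50 * exact + 0.30 * token_f1 + 0.20 * rule_match
```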
Combined ranking score:

`combined_score = (1 - heuristic_weight) * hybrid_score_avg + heuristic_weight * heuristic_combined_score_avg`
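The ranking blend, as a function. The default `heuristic_weight` of 0.0 is an assumption (consistent with the heuristic eval being optional), not a documented default:

```python
def combined_score(hybrid_score_avg: float,
                   heuristic_combined_score_avg: float,
                   heuristic_weight: float = 0.0) -> float:
    """Blend hybrid and heuristic averages by heuristic_weight in [0, 1]."""
    return ((1 - heuristic_weight) * hybrid_score_avg
            + heuristic_weight * heuristic_combined_score_avg)
```

With `heuristic_weight = 0.0`, ranking reduces to `hybrid_score_avg` alone.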
## Promotion Gate

A candidate can be promoted if:

- `parse_valid_rate >= 0.99`
- `hybrid_score_avg >= baseline_hybrid - 0.08`
- lower p50 latency than baseline on long-text cases
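The gate conditions above can be sketched as a single predicate; the argument names are illustrative, and the latency inputs are assumed to be p50 values measured on long-text cases:

```python
def can_promote(parse_valid_rate: float,
                hybrid_score_avg: float,
                baseline_hybrid: float,
                p50_latency_ms: float,
                baseline_p50_latency_ms: float) -> bool:
    """All three gate conditions must hold for promotion."""
    return (parse_valid_rate >= 0.99
            and hybrid_score_avg >= baseline_hybrid - 0.08
            and p50_latency_ms < baseline_p50_latency_ms)
```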