Add Vosk keystroke eval tooling and findings

2026-02-28 17:20:09 -03:00 · 510d280b74
commit 510d280b74
parent 8c1f7c1e13
15 changed files with 2219 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -294,6 +294,51 @@ aman bench --text-file ./bench-input.txt --repeat 20 --json
 the processing path from input transcript text through alignment/editor/fact-guard/vocabulary cleanup and
 prints timing summaries.

+Internal Vosk exploration (fixed-phrase dataset collection):
+
+```bash
+aman collect-fixed-phrases \
+  --phrases-file exploration/vosk/fixed_phrases/phrases.txt \
+  --out-dir exploration/vosk/fixed_phrases \
+  --samples-per-phrase 10
+```
+
+This internal command prompts for each allowed phrase and records labeled WAV
+samples with manual start/stop (Enter to start, Enter to stop). It does not run
+Vosk decoding and does not execute desktop commands. Output includes:
+- `exploration/vosk/fixed_phrases/samples/`
+- `exploration/vosk/fixed_phrases/manifest.jsonl`
+
+Internal Vosk exploration (keystroke dictation: literal vs NATO):
+
+```bash
+# collect literal-key dataset
+aman collect-fixed-phrases \
+  --phrases-file exploration/vosk/keystrokes/literal/phrases.txt \
+  --out-dir exploration/vosk/keystrokes/literal \
+  --samples-per-phrase 10
+
+# collect NATO-key dataset
+aman collect-fixed-phrases \
+  --phrases-file exploration/vosk/keystrokes/nato/phrases.txt \
+  --out-dir exploration/vosk/keystrokes/nato \
+  --samples-per-phrase 10
+
+# evaluate both grammars across available Vosk models
+aman eval-vosk-keystrokes \
+  --literal-manifest exploration/vosk/keystrokes/literal/manifest.jsonl \
+  --nato-manifest exploration/vosk/keystrokes/nato/manifest.jsonl \
+  --intents exploration/vosk/keystrokes/intents.json \
+  --output-dir exploration/vosk/keystrokes/eval_runs \
+  --models-file exploration/vosk/keystrokes/models.example.json
+```
+
+`eval-vosk-keystrokes` writes a structured report (`summary.json`) with:
+- intent accuracy and unknown-rate by grammar
+- per-intent/per-letter confusion tables
+- latency (avg/p50/p95), real-time factor (RTF), and model-load time
+- strict grammar compliance checks (out-of-grammar hypotheses hard-fail the model run)
+
 Model evaluation lab (dataset + matrix sweep):

 ```bash
@@ -344,6 +389,8 @@ aman run --config ~/.config/aman/config.json
 aman doctor --config ~/.config/aman/config.json --json
 aman self-check --config ~/.config/aman/config.json --json
 aman bench --text "example transcript" --repeat 5 --warmup 1
+aman collect-fixed-phrases --phrases-file exploration/vosk/fixed_phrases/phrases.txt --out-dir exploration/vosk/fixed_phrases --samples-per-phrase 10
+aman eval-vosk-keystrokes --literal-manifest exploration/vosk/keystrokes/literal/manifest.jsonl --nato-manifest exploration/vosk/keystrokes/nato/manifest.jsonl --intents exploration/vosk/keystrokes/intents.json --output-dir exploration/vosk/keystrokes/eval_runs --json
 aman build-heuristic-dataset --input benchmarks/heuristics_dataset.raw.jsonl --output benchmarks/heuristics_dataset.jsonl --json
 aman eval-models --dataset benchmarks/cleanup_dataset.jsonl --matrix benchmarks/model_matrix.small_first.json --heuristic-dataset benchmarks/heuristics_dataset.jsonl --heuristic-weight 0.25 --json
 aman sync-default-model --check --report benchmarks/results/latest.json --artifacts benchmarks/model_artifacts.json --constants src/constants.py
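
The README hunk above lists the metrics that `eval-vosk-keystrokes` writes to `summary.json`. As a rough illustration of how such numbers could be derived from per-sample results, here is a minimal Python sketch. It is not taken from the commit: the row keys (`expected`, `predicted`, `hypothesis`, `latency_ms`, `audio_s`, `decode_s`), the function names, the nearest-rank percentile method, and the aggregate RTF definition (total decode time divided by total audio duration) are all assumptions made for the example.

```python
import math
from collections import Counter


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(rows, grammar):
    """Aggregate hypothetical per-sample rows into report-style metrics.

    Each row is a dict with assumed keys: expected, predicted (None when
    no intent matched), hypothesis, latency_ms, audio_s, decode_s.
    """
    if not rows:
        raise ValueError("no rows to summarize")
    confusion = Counter()
    correct = unknown = 0
    for row in rows:
        # Strict compliance: a non-empty hypothesis outside the grammar
        # aborts the whole run instead of being counted as an error.
        if row["hypothesis"] and row["hypothesis"] not in grammar:
            raise ValueError(f"out-of-grammar hypothesis: {row['hypothesis']!r}")
        predicted = row["predicted"]
        if predicted is None:
            unknown += 1
        elif predicted == row["expected"]:
            correct += 1
        confusion[(row["expected"], predicted)] += 1
    latencies = [row["latency_ms"] for row in rows]
    total = len(rows)
    return {
        "intent_accuracy": correct / total,
        "unknown_rate": unknown / total,
        "latency_ms": {
            "avg": sum(latencies) / total,
            "p50": percentile(latencies, 50),
            "p95": percentile(latencies, 95),
        },
        # RTF below 1.0 means decoding ran faster than real time.
        "rtf": sum(r["decode_s"] for r in rows) / sum(r["audio_s"] for r in rows),
        "confusion": dict(confusion),
    }
```

Raising on the first out-of-grammar hypothesis mirrors the "hard-fail" wording in the README; a real report may record the failure per model rather than raise, which the diff shown here does not specify.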