Add vocabulary correction pipeline and example config

This commit is contained in:
Thales Maciel 2026-02-25 10:03:32 -03:00
parent f9224621fa
commit c3503fbbde
9 changed files with 865 additions and 23 deletions

@@ -1,6 +1,6 @@
# lel
Python X11 STT daemon that records audio, runs Whisper, and injects text. It can optionally run local AI post-processing before injection.
Python X11 STT daemon that records audio, runs Whisper, applies local AI cleanup, and injects text.
## Requirements
@@ -92,21 +92,50 @@ Create `~/.config/lel/config.json`:
  "stt": { "model": "base", "device": "cpu" },
  "injection": { "backend": "clipboard" },
  "ai": { "enabled": true },
  "logging": { "log_transcript": false }
  "logging": { "log_transcript": false },
  "vocabulary": {
    "replacements": [
      { "from": "Martha", "to": "Marta" },
      { "from": "docker", "to": "Docker" }
    ],
    "terms": ["Systemd", "Kubernetes"],
    "max_rules": 500,
    "max_terms": 500
  },
  "domain_inference": { "enabled": true, "mode": "auto" }
}
```
Recording input can be a device index (preferred) or a substring of the device
name.
`ai.enabled` controls local cleanup. When enabled, the LLM model is downloaded
on first use to `~/.cache/lel/models/` and uses the locked Llama-3.2-3B GGUF
model.
`ai.enabled` is accepted for compatibility but currently has no runtime effect.
AI cleanup is always enabled and uses the locked local Llama-3.2-3B GGUF model
downloaded to `~/.cache/lel/models/` on first use.
`logging.log_transcript` controls whether recognized/processed text is written
to logs. This is disabled by default. `-v/--verbose` also enables transcript
logging and llama.cpp logs; llama logs are prefixed with `llama::`.
Vocabulary correction:
- `vocabulary.replacements` defines deterministic corrections (`from -> to`).
- `vocabulary.terms` is a list of preferred spellings used as hinting context.
- Wildcards are intentionally rejected (`*`, `?`, `[`, `]`, `{`, `}`) to avoid ambiguous rules.
- Rules are deduplicated case-insensitively; conflicting replacements are rejected.
- Limits are bounded by `max_rules` and `max_terms`.
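As a rough illustration of how such rules could be validated and applied (a minimal sketch; the function names, whole-word matching, and error handling here are hypothetical, not the project's actual implementation):

```python
import re

WILDCARDS = set("*?[]{}")

def validate_rules(replacements, max_rules=500):
    """Reject wildcard characters and conflicting case-insensitive duplicates."""
    seen = {}
    for rule in replacements[:max_rules]:
        src, dst = rule["from"], rule["to"]
        if WILDCARDS & set(src + dst):
            raise ValueError(f"wildcards not allowed: {src!r} -> {dst!r}")
        key = src.lower()
        if key in seen and seen[key] != dst:
            raise ValueError(f"conflicting replacement for {src!r}")
        seen[key] = dst
    return seen

def apply_rules(text, rules):
    """Whole-word, case-insensitive replacement."""
    for src, dst in rules.items():
        text = re.sub(rf"\b{re.escape(src)}\b", dst, text, flags=re.IGNORECASE)
    return text

rules = validate_rules([
    {"from": "Martha", "to": "Marta"},
    {"from": "docker", "to": "Docker"},
])
print(apply_rules("Martha deployed docker today", rules))
# -> Marta deployed Docker today
```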
Domain inference:
- `domain_inference.mode` currently supports only `auto`.
- Domain context is advisory only and is used to improve cleanup prompts.
- When confidence is low, inference falls back to the `general` context.
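The low-confidence fallback can be sketched like this (the scoring source and the `0.6` threshold are assumptions for illustration, not values from the project):

```python
def choose_domain(scores, threshold=0.6):
    """Pick the highest-scoring domain; fall back to 'general' when confidence is low."""
    domain, confidence = max(scores.items(), key=lambda kv: kv[1])
    return domain if confidence >= threshold else "general"

print(choose_domain({"devops": 0.8, "medical": 0.1}))   # -> devops
print(choose_domain({"devops": 0.35, "medical": 0.2}))  # -> general
```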
STT hinting:
- Vocabulary is passed to Whisper as `hotwords`/`initial_prompt` only when those
arguments are supported by the installed `faster-whisper` runtime.
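One way to implement that runtime feature check is to filter keyword arguments against the callee's signature; a minimal sketch (the `supported_kwargs` helper and the stand-in `transcribe` function are hypothetical, standing in for the installed `faster-whisper` API):

```python
import inspect

def supported_kwargs(func, candidate_kwargs):
    """Keep only the kwargs that func's signature actually accepts."""
    params = inspect.signature(func).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(candidate_kwargs)  # func takes **kwargs: pass everything
    return {k: v for k, v in candidate_kwargs.items() if k in params}

# Stand-in for an older runtime whose transcribe() lacks `hotwords`.
def transcribe(audio, initial_prompt=None):
    return f"prompted with: {initial_prompt}"

hints = {"initial_prompt": "Kubernetes Systemd", "hotwords": "Marta"}
print(transcribe("clip.wav", **supported_kwargs(transcribe, hints)))
# -> prompted with: Kubernetes Systemd
```

Unsupported hint arguments are silently dropped rather than raising a `TypeError` on older runtimes.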
## systemd user service
```bash