aman/docs/x11-ga/03-runtime-reliability-and-diagnostics.md
Thales Maciel ed1b59240b
Harden runtime diagnostics for milestone 3
Make the milestone 3 runtime story predictable instead of treating doctor, self-check, and startup failures as loosely related surfaces.

Split doctor and self-check into distinct read-only flows, add tri-state diagnostic status with stable IDs and next steps, and reuse that wording in CLI output, service logs, and tray-triggered diagnostics. Add non-mutating config/model probes, a make runtime-check gate, and public recovery/validation docs for the X11 GA roadmap.

Validation: make runtime-check; PYTHONPATH=src python3 -m unittest discover -s tests -p 'test_*.py'; python3 -m py_compile src/*.py tests/*.py; PYTHONPATH=src python3 -m aman doctor --help; PYTHONPATH=src python3 -m aman self-check --help. Leave milestone 3 open in the roadmap until the manual X11 validation rows are filled.
2026-03-12 17:41:23 -03:00

Milestone 3: Runtime Reliability and Diagnostics

Why this milestone exists

Once Aman is installed, the next GA risk is not feature depth. It is whether the product behaves predictably, fails loudly, and tells the user what to do next. This milestone turns diagnostics and recovery into a first-class product surface.

Problems it closes

  • Startup readiness and failure paths are not yet shaped into one user-facing recovery model.
  • Diagnostics exist, but their roles are not clearly separated.
  • Audio, hotkey, injection, and model-cache failures can still feel like implementation details instead of guided support flows.
  • The release process does not yet require restart, recovery, or soak evidence.

In scope

  • Define aman doctor as the fast preflight check for config, runtime dependencies, hotkey validity, audio device resolution, and service prerequisites.
  • Define aman self-check as the deeper installed-system readiness check, including managed model availability, writable cache locations, and end-to-end startup prerequisites.
  • Make diagnostics return actionable messages with one next step, not generic failures.
  • Standardize startup and runtime error wording across CLI output, service logs, tray-triggered diagnostics, and docs.
  • Cover recovery paths for:
    • broken config
    • missing audio device
    • hotkey registration failure
    • X11 injection failure
    • model download or cache failure
    • service startup failure
  • Add repeated-run validation, restart validation, and offline-start validation to release gates.
  • Treat journalctl --user -u aman and aman run --verbose as the default support escalations after diagnostics.
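
The scope items above ask for actionable results with one next step and shared wording across surfaces. A minimal sketch of what such a result could look like follows. The names `Status`, `DiagnosticResult`, `format_line`, and `overall_status`, and the example check IDs, are hypothetical and are not taken from Aman's code.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    """Tri-state outcome for a single diagnostic check."""
    OK = "ok"
    WARN = "warn"
    FAIL = "fail"


@dataclass(frozen=True)
class DiagnosticResult:
    """One check result. Field names are illustrative assumptions."""
    check_id: str        # stable ID, e.g. "audio.input" (hypothetical)
    status: Status
    message: str
    next_step: str = ""  # a single suggested action for non-OK results

    def format_line(self) -> str:
        # One formatter shared by every surface keeps the wording identical
        # in CLI output, service logs, and tray-triggered diagnostics.
        line = f"[{self.status.value.upper()}] {self.check_id}: {self.message}"
        if self.status is not Status.OK and self.next_step:
            line += f" | next step: {self.next_step}"
        return line


def overall_status(results):
    """Collapse many results into one status: FAIL beats WARN beats OK."""
    statuses = {r.status for r in results}
    if Status.FAIL in statuses:
        return Status.FAIL
    if Status.WARN in statuses:
        return Status.WARN
    return Status.OK


if __name__ == "__main__":
    results = [
        DiagnosticResult("config.load", Status.OK, "config parsed"),
        DiagnosticResult(
            "audio.input",
            Status.FAIL,
            "no input device found",
            "check the configured input device",
        ),
    ]
    for result in results:
        print(result.format_line())
    print("overall:", overall_status(results).value)
```

Because each surface renders the same object through the same formatter, a log line and a CLI line for the same failure cannot drift apart.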

Out of scope

  • New dictation features unrelated to supportability.
  • Remote telemetry or cloud monitoring.
  • Non-X11 backends.

Dependencies

  • Milestone 1 support contract.
  • Milestone 2 portable install layout and service lifecycle.
  • Existing diagnostics commands and systemd service behavior.

Definition of done: objective

  • doctor and self-check have distinct documented roles.
  • The main end-user failure modes each produce an actionable diagnostic result or service-log message.
  • No failure on the supported happy path is known to be silent.
  • Restart after reboot and restart after service crash are part of the validation matrix.
  • Offline start with already-cached models is part of the validation matrix.
  • Release gates include repeated-run and recovery scenarios, not only unit tests.
  • Support docs map each common failure class to a matching diagnostic command or log path.
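
The validation-matrix criteria above could be enforced mechanically. The sketch below assumes the matrix is available as a mapping from scenario name to an evidence string. The scenario keys and the functions `missing_evidence` and `milestone_can_close` are hypothetical names chosen for illustration.

```python
from __future__ import annotations

# Scenarios named in the definition of done; the key spellings are assumptions.
REQUIRED_SCENARIOS = (
    "repeated-run",
    "restart-after-reboot",
    "restart-after-service-crash",
    "offline-start-with-cached-models",
)


def missing_evidence(matrix: dict[str, str]) -> list[str]:
    """Return required scenarios whose evidence cell is absent or blank."""
    return [
        name
        for name in REQUIRED_SCENARIOS
        if not matrix.get(name, "").strip()
    ]


def milestone_can_close(matrix: dict[str, str]) -> bool:
    """The milestone stays open until every required row has evidence."""
    return not missing_evidence(matrix)


if __name__ == "__main__":
    matrix = {
        "repeated-run": "10 consecutive runs, no failures",
        "restart-after-reboot": "",
    }
    print("missing:", missing_evidence(matrix))
    print("can close:", milestone_can_close(matrix))
```

Treating a blank cell the same as a missing row means an unfilled manual validation row keeps the milestone open.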

Definition of done: subjective

  • When Aman fails, the user can usually answer "what broke?" and "what should I try next?" without reading source code.
  • Daily use feels predictable even when the environment is imperfect.
  • The support story feels unified instead of scattered across commands and logs.

Evidence required to close

  • Updated command help and docs for doctor and self-check, including a public runtime recovery guide.
  • Diagnostic output examples for success, warning, and failure cases.
  • A release validation report covering restart, offline-start, and representative recovery scenarios.
  • Manual support runbooks that use diagnostics first and verbose foreground mode second.
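
A runbook that follows the "diagnostics first, verbose foreground mode second" rule could encode its escalation order as data. The four commands below all appear in this document. Their exact order is an assumption (in particular, running `aman doctor` before `aman self-check` and reading the journal before the verbose foreground run), and `next_escalation` is a hypothetical helper.

```python
# Escalation order assumed for illustration: fast preflight, deeper
# readiness check, service logs, then verbose foreground mode.
ESCALATION_ORDER = (
    "aman doctor",
    "aman self-check",
    "journalctl --user -u aman",
    "aman run --verbose",
)


def next_escalation(already_tried):
    """Return the first command not yet tried, or None when exhausted."""
    tried = set(already_tried)
    for command in ESCALATION_ORDER:
        if command not in tried:
            return command
    return None


if __name__ == "__main__":
    print(next_escalation([]))
    print(next_escalation(["aman doctor", "aman self-check"]))
```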