aman/docs/x11-ga/03-runtime-reliability-and-diagnostics.md

# Milestone 3: Runtime Reliability and Diagnostics

## Why this milestone exists

Once Aman is installed, the next GA risk is not feature depth. It is whether the product behaves predictably, fails loudly, and tells the user what to do next. This milestone turns diagnostics and recovery into a first-class product surface.

## Problems it closes

- Startup readiness and failure paths are not yet shaped into one user-facing recovery model.
- Diagnostics exist, but their roles are not clearly separated.
- Audio, hotkey, injection, and model-cache failures can still feel like implementation details instead of guided support flows.
- The release process does not yet require restart, recovery, or soak evidence.

## In scope

- Define `aman doctor` as the fast preflight check for config, runtime dependencies, hotkey validity, audio device resolution, and service prerequisites.
- Define `aman self-check` as the deeper installed-system readiness check, including managed model availability, writable cache locations, and end-to-end startup prerequisites.
- Make diagnostics return actionable messages with one next step, not generic failures.
- Standardize startup and runtime error wording across CLI output, service logs, tray notifications, and docs.
- Cover recovery paths for:
  - broken config
  - missing audio device
  - hotkey registration failure
  - X11 injection failure
  - model download or cache failure
  - service startup failure
- Add repeated-run validation, restart validation, and offline-start validation to release gates.
- Treat `journalctl --user -u aman` and `aman run --verbose` as the default support escalations after diagnostics.
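
The "actionable message with one next step" requirement can be sketched as a small result type. This is only an illustration, not Aman's actual implementation: the names `Status`, `CheckResult`, `format_result`, and `exit_code`, the check ids, and the output layout are all hypothetical. The point it shows is that a warning or failure without a next step can be rejected at construction time.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    OK = "ok"
    WARN = "warn"
    FAIL = "fail"


@dataclass(frozen=True)
class CheckResult:
    """One diagnostic check outcome (hypothetical shape, not Aman's real type)."""

    check_id: str
    status: Status
    message: str
    next_step: str = ""

    def __post_init__(self):
        # Enforce the milestone rule: every non-OK result carries a next step.
        if self.status is not Status.OK and not self.next_step.strip():
            raise ValueError(f"{self.check_id}: non-OK results need a next step")


def format_result(result):
    """Render a result as a status line, plus a 'next:' line when one is set."""
    line = f"[{result.status.value.upper()}] {result.check_id}: {result.message}"
    if result.next_step:
        line += f"\n  next: {result.next_step}"
    return line


def exit_code(results):
    """Return 1 if any check failed, else 0 (warnings do not fail the run)."""
    return 1 if any(r.status is Status.FAIL for r in results) else 0
```

Under these assumptions, a missing-microphone result would be built as `CheckResult("audio.input", Status.FAIL, "no input device found", "connect a microphone, then rerun aman doctor")`, and the same formatter could feed CLI output, service logs, and tray notifications so the wording stays consistent.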

## Out of scope

- New dictation features unrelated to supportability.
- Remote telemetry or cloud monitoring.
- Non-X11 backends.

## Dependencies

- Milestone 1 support contract.
- Milestone 2 portable install layout and service lifecycle.
- Existing diagnostics commands and systemd service behavior.

## Definition of done: objective

- `doctor` and `self-check` have distinct documented roles.
- The main end-user failure modes each produce an actionable diagnostic result or service-log message.
- No known failure on a supported path is silent.
- Restart after reboot and restart after service crash are part of the validation matrix.
- Offline start with already-cached models is part of the validation matrix.
- Release gates include repeated-run and recovery scenarios, not only unit tests.
- Support docs map each common failure class to a matching diagnostic command or log path.
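
The last criterion could be checked mechanically with a small coverage test, sketched below. The failure classes come from the in-scope recovery list and the commands are the ones this milestone names, but the pairing of each class with a first command is an assumption made for illustration. `FAILURE_CLASSES`, `SUPPORT_MAP`, and `unmapped` are hypothetical names.

```python
# Failure classes from the "Cover recovery paths for" list in this milestone.
FAILURE_CLASSES = (
    "broken config",
    "missing audio device",
    "hotkey registration failure",
    "X11 injection failure",
    "model download or cache failure",
    "service startup failure",
)

# Assumed pairing of each class with its first support step. The commands are
# named in this document; which class maps to which command is illustrative.
SUPPORT_MAP = {
    "broken config": "aman doctor",
    "missing audio device": "aman doctor",
    "hotkey registration failure": "aman doctor",
    "X11 injection failure": "aman run --verbose",
    "model download or cache failure": "aman self-check",
    "service startup failure": "journalctl --user -u aman",
}


def unmapped(classes, support_map):
    """Return the failure classes that have no diagnostic command or log path."""
    return [c for c in classes if not support_map.get(c)]
```

A release gate could then assert that `unmapped(FAILURE_CLASSES, SUPPORT_MAP)` is empty, so a newly added failure class cannot ship without a documented support path.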

## Definition of done: subjective

- When Aman fails, the user can usually answer "what broke?" and "what should I try next?" without reading source code.
- Daily use feels predictable even when the environment is imperfect.
- The support story feels unified instead of scattered across commands and logs.

## Evidence required to close

- Updated command help and docs for `doctor` and `self-check`.
- Diagnostic output examples for success, warning, and failure cases.
- A release validation report covering restart, offline-start, and representative recovery scenarios.
- Manual support runbooks that use diagnostics first and verbose foreground mode second.
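
The restart portion of the release validation report could be produced by a loop like the following. It is a sketch under stated assumptions: the user unit is named `aman` (as in the `journalctl --user -u aman` command above), `aman self-check` exits non-zero on failure, and a two-second settle time is enough. The names `restart_cycles` and `run`, and the parameter defaults, are hypothetical.

```python
import subprocess
import time


def run(cmd):
    """Run a command and return its exit code."""
    return subprocess.run(cmd, capture_output=True, text=True).returncode


def restart_cycles(cycles, runner=run, settle_seconds=2.0, sleep=time.sleep):
    """Restart the user service repeatedly and record the step that failed.

    Returns a list of (cycle_number, step) tuples; an empty list means
    every cycle passed. Assumes a systemd user unit named 'aman'.
    """
    failures = []
    for cycle in range(1, cycles + 1):
        if runner(["systemctl", "--user", "restart", "aman"]) != 0:
            failures.append((cycle, "restart"))
            continue
        sleep(settle_seconds)  # give the service time to finish starting
        if runner(["systemctl", "--user", "is-active", "--quiet", "aman"]) != 0:
            failures.append((cycle, "is-active"))
            continue
        # Assumption: self-check signals failure through its exit code.
        if runner(["aman", "self-check"]) != 0:
            failures.append((cycle, "self-check"))
    return failures
```

Calling `restart_cycles(20)` on a test machine would yield the per-cycle failures to include in the report. The `runner` and `sleep` parameters exist so the loop itself can be tested without touching systemd.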