# Milestone 3: Runtime Reliability and Diagnostics

## Why this milestone exists

Once Aman is installed, the next GA risk is not feature depth. It is whether the product behaves predictably, fails loudly, and tells the user what to do next. This milestone turns diagnostics and recovery into a first-class product surface.

## Problems it closes

- Startup readiness and failure paths are not yet shaped into one user-facing recovery model.
- Diagnostics exist, but their roles are not clearly separated.
- Audio, hotkey, injection, and model-cache failures can still feel like implementation details instead of guided support flows.
- The release process does not yet require restart, recovery, or soak evidence.

## In scope

- Define `aman doctor` as the fast preflight check for config, runtime dependencies, hotkey validity, audio device resolution, and service prerequisites.
- Define `aman self-check` as the deeper installed-system readiness check, including managed model availability, writable cache locations, and end-to-end startup prerequisites.
- Make diagnostics return actionable messages with one next step, not generic failures.
- Standardize startup and runtime error wording across CLI output, service logs, tray-triggered diagnostics, and docs.
- Cover recovery paths for:
  - broken config
  - missing audio device
  - hotkey registration failure
  - X11 injection failure
  - model download or cache failure
  - service startup failure
- Add repeated-run validation, restart validation, and offline-start validation
  to release gates, and manually validate them on at least one representative
  distro family for milestone closeout.
- Treat `journalctl --user -u aman` and `aman run --verbose` as the default support escalation steps after diagnostics.
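
The "actionable message with one next step" requirement above can be sketched as a small result type. This is a minimal illustration, not Aman's actual schema: `CheckResult`, `Status`, `render`, `exit_code`, and the `check_id` values are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    OK = "ok"
    WARN = "warn"
    FAIL = "fail"


@dataclass(frozen=True)
class CheckResult:
    """One diagnostic finding. Field names are illustrative, not Aman's schema."""

    check_id: str        # hypothetical identifier, e.g. "audio.input"
    status: Status
    message: str         # what broke, in user-facing wording
    next_step: str = ""  # the single suggested action

    def __post_init__(self) -> None:
        # A warning or failure without a next step is a generic failure in disguise.
        if self.status is not Status.OK and not self.next_step.strip():
            raise ValueError(f"{self.check_id}: non-OK result needs a next step")


def render(result: CheckResult) -> str:
    """Format a result as one line, reusable for CLI output and service logs."""
    line = f"[{result.status.value.upper()}] {result.check_id}: {result.message}"
    if result.next_step:
        line += f" | next: {result.next_step}"
    return line


def exit_code(results: list[CheckResult]) -> int:
    """Return non-zero if any check failed, so scripts and release gates can react."""
    return 1 if any(r.status is Status.FAIL for r in results) else 0
```

Rejecting a non-OK result that lacks a next step at construction time is one way to make "no generic failures" a property of the code rather than a review convention. Sharing a single `render` function across surfaces is likewise one possible route to the standardized wording called for above.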

## Out of scope

- New dictation features unrelated to supportability.
- Remote telemetry or cloud monitoring.
- Non-X11 backends.

## Dependencies

- Milestone 1 support contract.
- Milestone 2 portable install layout and service lifecycle.
- Existing diagnostics commands and systemd service behavior.

## Definition of done: objective

- `doctor` and `self-check` have distinct documented roles.
- The main end-user failure modes each produce an actionable diagnostic result or service-log message.
- No failure on a supported path is known to be silent.
- Restart after reboot and restart after service crash are part of the
  validation matrix and are manually validated on at least one representative
  distro family for milestone closeout.
- Offline start with already-cached models is part of the validation matrix and
  is manually validated on at least one representative distro family for
  milestone closeout.
- Release gates include repeated-run and recovery scenarios, not only unit tests.
- Support docs map each common failure class to a matching diagnostic command or log path.
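
A repeated-run gate of the kind listed above could look roughly like the following. It is a sketch under assumptions: `repeated_run` is a hypothetical helper, and it treats a non-zero exit status or a timeout as a failed run, which the checked command would need to honor.

```python
import subprocess


def repeated_run(cmd: list[str], runs: int, timeout_s: float = 60.0) -> list[int]:
    """Run `cmd` `runs` times and return the 1-based indices of failed runs."""
    failures: list[int] = []
    for i in range(1, runs + 1):
        try:
            proc = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout_s
            )
            ok = proc.returncode == 0
        except subprocess.TimeoutExpired:
            ok = False  # a hang counts as a failure, not a skip
        if not ok:
            failures.append(i)
    return failures
```

A release gate might call `repeated_run(["aman", "self-check"], runs=20)` and block the release if the returned list is non-empty. That usage assumes `aman self-check` exits non-zero on failure, which this document does not state.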

## Definition of done: subjective

- When Aman fails, the user can usually answer "what broke?" and "what should I try next?" without reading source code.
- Daily use feels predictable even when the environment is imperfect.
- The support story feels unified instead of scattered across commands and logs.

## Evidence required to close

- Updated command help and docs for `doctor` and `self-check`, including a public runtime recovery guide.
- Diagnostic output examples for success, warning, and failure cases.
- A release validation report covering restart, offline-start, and
  representative recovery scenarios. One pass on a real distro is sufficient
  for milestone closeout; full four-family coverage is deferred to milestone 5
  GA signoff.
- Manual support runbooks that use diagnostics first and verbose foreground mode second.