Harden runtime diagnostics for milestone 3

Make the milestone 3 runtime story predictable instead of treating doctor, self-check, and startup failures as loosely related surfaces.

Split doctor and self-check into distinct read-only flows, add tri-state diagnostic status with stable IDs and next steps, and reuse that wording in CLI output, service logs, and tray-triggered diagnostics. Add non-mutating config/model probes, a make runtime-check gate, and public recovery/validation docs for the X11 GA roadmap.

Validation:
- make runtime-check
- PYTHONPATH=src python3 -m unittest discover -s tests -p 'test_*.py'
- python3 -m py_compile src/*.py tests/*.py
- PYTHONPATH=src python3 -m aman doctor --help
- PYTHONPATH=src python3 -m aman self-check --help

Leave milestone 3 open in the roadmap until the manual X11 validation rows are filled.
2026-03-12 17:41:23 -03:00 · 2026-03-12 17:41:23 -03:00 · ed1b59240b
commit ed1b59240b
parent a3368056ff
16 changed files with 1298 additions and 248 deletions
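
Reviewer sketch (not part of the commit): the tri-state status model described in the commit message can be illustrated roughly as follows. This is a minimal sketch that assumes the `ok`/`warn`/`fail` semantics, exit codes, and line format shown in `docs/runtime-recovery.md`. The names `DiagnosticCheck`, `format_check`, `overall_status`, and `exit_code` are hypothetical and may not match the actual module in `src/`.

```python
from dataclasses import dataclass
from typing import Iterable

# Severity ranking for the three documented states (hypothetical helper).
_SEVERITY = {"ok": 0, "warn": 1, "fail": 2}


@dataclass(frozen=True)
class DiagnosticCheck:
    """One diagnostic result with a stable ID and an optional next step."""
    id: str
    status: str  # "ok", "warn", or "fail"
    message: str
    next_step: str = ""


def format_check(check: DiagnosticCheck) -> str:
    """Render a check in the documented `[STATUS] id: message | next_step: ...` shape."""
    line = f"[{check.status.upper()}] {check.id}: {check.message}"
    if check.next_step:
        line += f" | next_step: {check.next_step}"
    return line


def overall_status(checks: Iterable[DiagnosticCheck]) -> str:
    """Return the worst status seen; an empty report counts as ok."""
    worst = "ok"
    for check in checks:
        if _SEVERITY[check.status] > _SEVERITY[worst]:
            worst = check.status
    return worst


def exit_code(checks: Iterable[DiagnosticCheck]) -> int:
    """Warnings still exit 0; only a failure exits 2, as the recovery guide states."""
    return 2 if overall_status(checks) == "fail" else 0
```

Keeping the status ranking in one table means the human-readable output and any machine-readable output can share the same rule for deriving the `overall:` line.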
--- a/docs/portable-install.md
+++ b/docs/portable-install.md
@@ -144,3 +144,6 @@ If installation succeeds but runtime behavior is wrong, use the supported recove
 2. `aman self-check --config ~/.config/aman/config.json`
 3. `journalctl --user -u aman -f`
 4. `aman run --config ~/.config/aman/config.json --verbose`
+
+The failure IDs and example outputs for this flow are documented in
+[`docs/runtime-recovery.md`](./runtime-recovery.md).
--- a/docs/release-checklist.md
+++ b/docs/release-checklist.md
@@ -7,6 +7,7 @@ GA signoff bar. The GA signoff sections are required for `v1.0.0` and later.
 2. Bump `project.version` in `pyproject.toml`.
 3. Run quality and build gates:
   - `make release-check`
+   - `make runtime-check`
   - `make check-default-model`
 4. Ensure model promotion artifacts are current:
   - `benchmarks/results/latest.json` has the latest `winner_recommendation.name`
@@ -34,7 +35,11 @@ GA signoff bar. The GA signoff sections are required for `v1.0.0` and later.
   - The support matrix names X11, runtime dependency ownership, `systemd --user`, and the representative distro families.
   - Service mode is documented as the default daily-use path and `aman run` as the manual support/debug path.
   - The recovery sequence `aman doctor` -> `aman self-check` -> `journalctl --user -u aman` -> `aman run --verbose` is documented consistently.
-11. GA validation signoff (`v1.0.0` and later):
+11. GA runtime reliability signoff (`v1.0.0` and later):
+   - `make runtime-check` passes.
+   - [`docs/runtime-recovery.md`](./runtime-recovery.md) matches the shipped diagnostic IDs and next-step wording.
+   - [`docs/x11-ga/runtime-validation-report.md`](./x11-ga/runtime-validation-report.md) contains current automated evidence and release-specific manual validation entries.
+12. GA validation signoff (`v1.0.0` and later):
   - Validation evidence exists for Debian/Ubuntu, Arch, Fedora, and openSUSE.
   - The portable installer, upgrade path, and uninstall path are validated.
   - End-user docs and release notes match the shipped artifact set.
--- a/docs/runtime-recovery.md
+++ b/docs/runtime-recovery.md
@@ -0,0 +1,48 @@
+# Runtime Recovery Guide
+
+Use this guide when Aman is installed but not behaving correctly.
+
+## Command roles
+
+- `aman doctor --config ~/.config/aman/config.json` is the fast, read-only preflight for config, X11 session, audio runtime, input device resolution, hotkey availability, injection backend selection, and service prerequisites.
+- `aman self-check --config ~/.config/aman/config.json` is the deeper, still read-only readiness check. It includes every `doctor` check plus the managed model cache, cache writability, installed user service, current service state, and startup readiness.
+- Tray `Run Diagnostics` uses the same deeper `self-check` path and logs any non-`ok` results.
+
+## Reading the output
+
+- `ok`: the checked surface is ready.
+- `warn`: the checked surface is degraded or incomplete, but the command still exits `0`.
+- `fail`: the supported path is blocked, and the command exits `2`.
+
+Example output:
+
+```text
+[OK] config.load: loaded config from /home/user/.config/aman/config.json
+[WARN] model.cache: managed editor model is not cached at /home/user/.cache/aman/models/Qwen2.5-1.5B-Instruct-Q4_K_M.gguf | next_step: start Aman once on a networked connection so it can download the managed editor model, then rerun `aman self-check --config /home/user/.config/aman/config.json`
+[FAIL] service.state: user service is installed but failed to start | next_step: inspect `journalctl --user -u aman -f` to see why aman.service is failing
+overall: fail
+```
+
+## Failure map
+
+| Symptom | First command | Diagnostic ID | Meaning | Next step |
+| --- | --- | --- | --- | --- |
+| Config missing or invalid | `aman doctor` | `config.load` | Config is absent or cannot be parsed | Save settings, fix the JSON, or rerun `aman init --force`, then rerun `doctor` |
+| No X11 session | `aman doctor` | `session.x11` | `DISPLAY` is missing or Wayland was detected | Start Aman from the same X11 user session you expect to use daily |
+| Audio runtime or microphone missing | `aman doctor` | `runtime.audio` or `audio.input` | PortAudio or the selected input device is unavailable | Install runtime dependencies, connect a microphone, or choose a valid `recording.input` |
+| Hotkey cannot be registered | `aman doctor` | `hotkey.parse` | The configured hotkey is invalid or already taken | Choose a different hotkey in Settings |
+| Output injection fails | `aman doctor` | `injection.backend` | The chosen X11 output path is not usable | Switch to a supported backend or rerun in the foreground with `--verbose` |
+| Managed editor model missing or corrupt | `aman self-check` | `model.cache` | The managed model is absent or has a bad checksum | Start Aman once on a networked connection, or clear the broken cache and retry |
+| Model cache directory is not writable | `aman self-check` | `cache.writable` | Aman cannot create or update its managed model cache | Fix permissions on `~/.cache/aman/models/` |
+| User service missing or disabled | `aman self-check` | `service.unit` or `service.state` | The service was not installed cleanly or is not active | Reinstall Aman or run `systemctl --user enable --now aman` |
+| Startup still fails after install | `aman self-check` | `startup.readiness` | Aman can load config but cannot assemble its runtime without failing | Fix the named runtime dependency, custom model path, or editor dependency, then rerun `self-check` |
+
+## Escalation order
+
+1. Run `aman doctor --config ~/.config/aman/config.json`.
+2. Run `aman self-check --config ~/.config/aman/config.json`.
+3. Inspect `journalctl --user -u aman -f`.
+4. Re-run Aman in the foreground with `aman run --config ~/.config/aman/config.json --verbose`.
+
+If you are collecting evidence for a release or support handoff, copy the first
+non-`ok` diagnostic line and the first matching `journalctl` failure block.
--- a/docs/x11-ga/03-runtime-reliability-and-diagnostics.md
+++ b/docs/x11-ga/03-runtime-reliability-and-diagnostics.md
@@ -16,7 +16,7 @@ Once Aman is installed, the next GA risk is not feature depth. It is whether the
 - Define `aman doctor` as the fast preflight check for config, runtime dependencies, hotkey validity, audio device resolution, and service prerequisites.
 - Define `aman self-check` as the deeper installed-system readiness check, including managed model availability, writable cache locations, and end-to-end startup prerequisites.
 - Make diagnostics return actionable messages with one next step, not generic failures.
-- Standardize startup and runtime error wording across CLI output, service logs, tray notifications, and docs.
+- Standardize startup and runtime error wording across CLI output, service logs, tray-triggered diagnostics, and docs.
 - Cover recovery paths for:
  - broken config
  - missing audio device
@@ -57,7 +57,7 @@ Once Aman is installed, the next GA risk is not feature depth. It is whether the

 ## Evidence required to close

-- Updated command help and docs for `doctor` and `self-check`.
+- Updated command help and docs for `doctor` and `self-check`, including a public runtime recovery guide.
 - Diagnostic output examples for success, warning, and failure cases.
 - A release validation report covering restart, offline-start, and representative recovery scenarios.
 - Manual support runbooks that use diagnostics first and verbose foreground mode second.
--- a/docs/x11-ga/README.md
+++ b/docs/x11-ga/README.md
@@ -6,14 +6,13 @@ Aman is not starting from zero. It already has a working X11 daemon, a settings-

 The current gaps are:

-- No single distro-agnostic end-user install, update, and uninstall path. The repo documents a Debian package path and partial Arch support, but not one canonical path for X11 users on Fedora, openSUSE, or other mainstream distros.
-- No explicit support contract for "X11 users on any distro." The current docs describe target personas and a package-first approach, but they do not define the exact environment that GA will support.
-- No clear split between service mode and foreground/manual mode. The docs describe enabling a user service and also tell users to run `aman run`, which leaves the default lifecycle ambiguous.
-- No representative distro validation matrix. There is no evidence standard that says which distros must pass install, first run, update, restart, and uninstall checks before release.
+- The canonical portable install, update, and uninstall path now exists, but the representative distro rows still need real manual validation evidence before it can count as a GA-ready channel.
+- The X11 support contract and service-versus-foreground split are now documented, but the public release surface still needs the remaining trust and support work from milestones 4 and 5.
+- Validation matrices now exist for portable lifecycle and runtime reliability, but they are not yet filled with release-specific manual evidence across Debian/Ubuntu, Arch, Fedora, and openSUSE.
 - Incomplete trust surface. The project still needs a real license file, real maintainer/contact metadata, real project URLs, published release artifacts, and public checksums.
 - Incomplete first-run story. The product describes a settings window and tray workflow, but there is no short happy path, no expected-result walkthrough, and no visual proof that the experience is real.
-- Diagnostics exist, but they are not yet the canonical recovery path for end users. `doctor` and `self-check` are present, but the docs do not yet teach users to rely on them first.
-- Release process exists, but not yet as a GA signoff system. The current release checklist is a good base, but it does not yet enforce the broader validation and support evidence required for a public 1.0 release.
+- Diagnostics are now the canonical recovery path, but milestone 3 still needs release-specific X11 evidence for restart, offline-start, tray diagnostics, and recovery scenarios.
+- The release checklist now includes GA signoff gates, but the project is still short of the broader legal, release-publication, and validation evidence needed for a credible public 1.0 release.

 ## GA target

@@ -93,7 +92,13 @@ Any future docs, tray copy, and release notes should point users to this same se
  [`portable-validation-matrix.md`](./portable-validation-matrix.md) are filled
  with real manual validation evidence.
 - [ ] [Milestone 3: Runtime Reliability and Diagnostics](./03-runtime-reliability-and-diagnostics.md)
-  Make startup, failure handling, and recovery predictable.
+  Implementation landed on 2026-03-12: `doctor` and `self-check` now have
+  distinct read-only roles, runtime failures log stable IDs plus next steps,
+  `make runtime-check` is part of the release surface, and the runtime recovery
+  guide plus validation report now exist. Leave this milestone open until the
+  release-specific manual rows in
+  [`runtime-validation-report.md`](./runtime-validation-report.md) are filled
+  with real X11 validation evidence.
 - [ ] [Milestone 4: First-Run UX and Support Docs](./04-first-run-ux-and-support-docs.md)
  Turn the product from "documented by the author" into "understandable by a new user."
 - [ ] [Milestone 5: GA Candidate Validation and Release](./05-ga-candidate-validation-and-release.md)
--- a/docs/x11-ga/runtime-validation-report.md
+++ b/docs/x11-ga/runtime-validation-report.md
@@ -0,0 +1,44 @@
+# Runtime Validation Report
+
+This document tracks milestone 3 evidence for runtime reliability and
+diagnostics.
+
+## Automated evidence
+
+Completed on 2026-03-12:
+
+- `PYTHONPATH=src python3 -m unittest tests.test_diagnostics tests.test_aman_cli tests.test_aman tests.test_aiprocess`
+  - covers `doctor` versus `self-check`, tri-state diagnostic output, warning
+    versus failure exit codes, read-only model cache probing, and actionable
+    runtime log wording for audio, hotkey, injection, editor, and startup
+    failures
+- `PYTHONPATH=src python3 -m unittest discover -s tests -p 'test_*.py'`
+  - confirms the runtime and diagnostics changes do not regress the broader
+    daemon, CLI, config, and portable bundle flows
+- `python3 -m py_compile src/*.py tests/*.py`
+  - verifies the updated runtime and diagnostics modules compile cleanly
+
+## Automated scenario coverage
+
+| Scenario | Evidence | Status | Notes |
+| --- | --- | --- | --- |
+| `doctor` and `self-check` have distinct roles | `tests.test_diagnostics`, `tests.test_aman_cli` | Complete | `self-check` extends `doctor` with service/model/startup readiness checks |
+| Missing config remains read-only | `tests.test_diagnostics` | Complete | Missing config yields `warn` and does not write a default file |
+| Managed model cache probing is read-only | `tests.test_diagnostics`, `tests.test_aiprocess` | Complete | `self-check` uses cache probing and does not download or repair |
+| Warning-only diagnostics exit `0`; failures exit `2` | `tests.test_aman_cli` | Complete | Human and JSON output share the same status model |
+| Runtime failures log stable IDs and one next step | `tests.test_aman_cli`, `tests.test_aman` | Complete | Covers hotkey, audio-input, injection, editor, and startup failure wording |
+| Repeated start/stop and shutdown return to `idle` | `tests.test_aman` | Complete | Current daemon tests cover start, stop, cancel, pause, and shutdown paths |
+
+## Manual X11 validation
+
+These rows must be filled with release-specific evidence before milestone 3 can
+be closed as complete for GA signoff.
+
+| Scenario | Debian/Ubuntu | Arch | Fedora | openSUSE | Reviewer | Status | Notes |
+| --- | --- | --- | --- | --- | --- | --- | --- |
+| Service restart after a successful install | Pending | Pending | Pending | Pending | Pending | Pending | Verify `systemctl --user restart aman` returns to the tray/ready state |
+| Reboot followed by successful reuse | Pending | Pending | Pending | Pending | Pending | Pending | Validate recovery after a real session restart |
+| Offline startup with an already-cached model | Pending | Pending | Pending | Pending | Pending | Pending | Disable network, then confirm the cached path still starts |
+| Missing runtime dependency recovery | Pending | Pending | Pending | Pending | Pending | Pending | Remove one documented dependency, verify diagnostics point to the correct fix |
+| Tray-triggered diagnostics logging | Pending | Pending | Pending | Pending | Pending | Pending | Use `Run Diagnostics` and confirm the same IDs/messages appear in logs |
+| Service-failure escalation path | Pending | Pending | Pending | Pending | Pending | Pending | Confirm `doctor` -> `self-check` -> `journalctl` -> `aman run --verbose` is enough to explain the failure |
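
Reviewer sketch (not part of the commit): the recovery guide asks evidence collectors to copy the first non-`ok` diagnostic line. A minimal sketch of extracting that line from human-readable output is shown below. It assumes the `[STATUS] id: message | next_step: ...` format from the guide's example output. The helper names `parse_line` and `first_non_ok` are hypothetical. The validation report mentions a JSON output mode, which would likely be a more robust thing to parse, but its schema is not shown in this diff, so it is not used here.

```python
import re
from typing import Optional

# Matches lines such as "[WARN] model.cache: message | next_step: ...".
_LINE = re.compile(
    r"^\[(?P<status>OK|WARN|FAIL)\]\s+(?P<id>[A-Za-z0-9_.]+):\s+(?P<rest>.*)$"
)


def parse_line(line: str) -> Optional[dict]:
    """Parse one diagnostic line; return None for anything else (e.g. 'overall: fail')."""
    match = _LINE.match(line.strip())
    if match is None:
        return None
    # Split the message from the optional next step on the documented separator.
    message, sep, next_step = match.group("rest").partition(" | next_step: ")
    return {
        "status": match.group("status").lower(),
        "id": match.group("id"),
        "message": message,
        "next_step": next_step if sep else "",
    }


def first_non_ok(output: str) -> Optional[dict]:
    """Return the first warn or fail entry in the output, or None if every check is ok."""
    for line in output.splitlines():
        parsed = parse_line(line)
        if parsed is not None and parsed["status"] != "ok":
            return parsed
    return None
```

Applied to the example output in `docs/runtime-recovery.md`, `first_non_ok` would return the `model.cache` warning, because it appears before the `service.state` failure.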