Harden runtime diagnostics for milestone 3

Make the milestone 3 runtime story predictable instead of treating doctor, self-check, and startup failures as loosely related surfaces.

Split doctor and self-check into distinct read-only flows, add tri-state diagnostic status with stable IDs and next steps, and reuse that wording in CLI output, service logs, and tray-triggered diagnostics. Add non-mutating config/model probes, a make runtime-check gate, and public recovery/validation docs for the X11 GA roadmap.

Validation: make runtime-check; PYTHONPATH=src python3 -m unittest discover -s tests -p 'test_*.py'; python3 -m py_compile src/*.py tests/*.py; PYTHONPATH=src python3 -m aman doctor --help; PYTHONPATH=src python3 -m aman self-check --help. Leave milestone 3 open in the roadmap until the manual X11 validation rows are filled.
Thales Maciel 2026-03-12 17:41:23 -03:00
parent a3368056ff
commit ed1b59240b
16 changed files with 1298 additions and 248 deletions


@@ -144,3 +144,6 @@ If installation succeeds but runtime behavior is wrong, use the supported recove
2. `aman self-check --config ~/.config/aman/config.json`
3. `journalctl --user -u aman -f`
4. `aman run --config ~/.config/aman/config.json --verbose`
+The failure IDs and example outputs for this flow are documented in
+[`docs/runtime-recovery.md`](./runtime-recovery.md).


@@ -7,6 +7,7 @@ GA signoff bar. The GA signoff sections are required for `v1.0.0` and later.
2. Bump `project.version` in `pyproject.toml`.
3. Run quality and build gates:
- `make release-check`
+- `make runtime-check`
- `make check-default-model`
4. Ensure model promotion artifacts are current:
- `benchmarks/results/latest.json` has the latest `winner_recommendation.name`
@@ -34,7 +35,11 @@ GA signoff bar. The GA signoff sections are required for `v1.0.0` and later.
- The support matrix names X11, runtime dependency ownership, `systemd --user`, and the representative distro families.
- Service mode is documented as the default daily-use path and `aman run` as the manual support/debug path.
- The recovery sequence `aman doctor` -> `aman self-check` -> `journalctl --user -u aman` -> `aman run --verbose` is documented consistently.
-11. GA validation signoff (`v1.0.0` and later):
+11. GA runtime reliability signoff (`v1.0.0` and later):
+- `make runtime-check` passes.
+- [`docs/runtime-recovery.md`](./runtime-recovery.md) matches the shipped diagnostic IDs and next-step wording.
+- [`docs/x11-ga/runtime-validation-report.md`](./x11-ga/runtime-validation-report.md) contains current automated evidence and release-specific manual validation entries.
+12. GA validation signoff (`v1.0.0` and later):
- Validation evidence exists for Debian/Ubuntu, Arch, Fedora, and openSUSE.
- The portable installer, upgrade path, and uninstall path are validated.
- End-user docs and release notes match the shipped artifact set.

docs/runtime-recovery.md Normal file

@@ -0,0 +1,48 @@
# Runtime Recovery Guide
Use this guide when Aman is installed but not behaving correctly.
## Command roles
- `aman doctor --config ~/.config/aman/config.json` is the fast, read-only preflight for config, X11 session, audio runtime, input device resolution, hotkey availability, injection backend selection, and service prerequisites.
- `aman self-check --config ~/.config/aman/config.json` is the deeper, still read-only readiness check. It includes every `doctor` check plus the managed model cache, cache writability, installed user service, current service state, and startup readiness.
- Tray `Run Diagnostics` uses the same deeper `self-check` path and logs any non-`ok` results.
## Reading the output
- `ok`: the checked surface is ready.
- `warn`: the checked surface is degraded or incomplete, but the command still exits `0`.
- `fail`: the supported path is blocked, and the command exits `2`.
Example output:
```text
[OK] config.load: loaded config from /home/user/.config/aman/config.json
[WARN] model.cache: managed editor model is not cached at /home/user/.cache/aman/models/Qwen2.5-1.5B-Instruct-Q4_K_M.gguf | next_step: start Aman once on a networked connection so it can download the managed editor model, then rerun `aman self-check --config /home/user/.config/aman/config.json`
[FAIL] service.state: user service is installed but failed to start | next_step: inspect `journalctl --user -u aman -f` to see why aman.service is failing
overall: fail
```
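The tri-state model above can be sketched in a few lines of Python. This is an illustration only, not Aman's actual implementation: the `Check` class and the helper names are hypothetical, and the line format is inferred from the example output.

```python
from dataclasses import dataclass
from typing import Sequence

# Hypothetical ranking used to pick the worst status in a report.
_SEVERITY = {"ok": 0, "warn": 1, "fail": 2}


@dataclass(frozen=True)
class Check:
    id: str              # stable diagnostic ID, e.g. "config.load"
    status: str          # "ok", "warn", or "fail"
    message: str
    next_step: str = ""  # empty when the check has nothing to recommend


def format_line(check: Check) -> str:
    """Render one check in the same shape as the example output."""
    line = f"[{check.status.upper()}] {check.id}: {check.message}"
    if check.next_step:
        line += f" | next_step: {check.next_step}"
    return line


def overall_status(checks: Sequence[Check]) -> str:
    """The worst individual status wins; an empty report counts as ok."""
    return max((c.status for c in checks), key=_SEVERITY.__getitem__, default="ok")


def exit_code(checks: Sequence[Check]) -> int:
    """Only `fail` exits 2; `ok` and `warn` both exit 0."""
    return 2 if overall_status(checks) == "fail" else 0
```

Keeping the exit code a pure function of the collected statuses is one way to make the human-readable and machine-readable reports agree by construction.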
## Failure map
| Symptom | First command | Diagnostic ID | Meaning | Next step |
| --- | --- | --- | --- | --- |
| Config missing or invalid | `aman doctor` | `config.load` | Config is absent or cannot be parsed | Save settings, fix the JSON, or rerun `aman init --force`, then rerun `doctor` |
| No X11 session | `aman doctor` | `session.x11` | `DISPLAY` is missing or Wayland was detected | Start Aman from the same X11 user session you expect to use daily |
| Audio runtime or microphone missing | `aman doctor` | `runtime.audio` or `audio.input` | PortAudio or the selected input device is unavailable | Install runtime dependencies, connect a microphone, or choose a valid `recording.input` |
| Hotkey cannot be registered | `aman doctor` | `hotkey.parse` | The configured hotkey is invalid or already taken | Choose a different hotkey in Settings |
| Output injection fails | `aman doctor` | `injection.backend` | The chosen X11 output path is not usable | Switch to a supported backend or rerun in the foreground with `--verbose` |
| Managed editor model missing or corrupt | `aman self-check` | `model.cache` | The managed model is absent or has a bad checksum | Start Aman once on a networked connection, or clear the broken cache and retry |
| Model cache directory is not writable | `aman self-check` | `cache.writable` | Aman cannot create or update its managed model cache | Fix permissions on `~/.cache/aman/models/` |
| User service missing or disabled | `aman self-check` | `service.unit` or `service.state` | The service was not installed cleanly or is not active | Reinstall Aman or run `systemctl --user enable --now aman` |
| Startup still fails after install | `aman self-check` | `startup.readiness` | Aman can load config but cannot assemble its runtime without failing | Fix the named runtime dependency, custom model path, or editor dependency, then rerun `self-check` |
## Escalation order
1. Run `aman doctor --config ~/.config/aman/config.json`.
2. Run `aman self-check --config ~/.config/aman/config.json`.
3. Inspect `journalctl --user -u aman -f`.
4. Re-run Aman in the foreground with `aman run --config ~/.config/aman/config.json --verbose`.
If you are collecting evidence for a release or support handoff, copy the first
non-`ok` diagnostic line and the first matching `journalctl` failure block.
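For support scripts, the first two escalation steps can be automated along these lines. This is a hedged sketch: `first_blocked_step` and `run_cli` are hypothetical helpers, and the only behavior taken from this guide is the command order and the rule that a passing or warning-only run exits `0`. Any non-zero exit is treated as a stop signal, which is an assumption.

```python
import os
import subprocess
from typing import Callable, Sequence

DEFAULT_CONFIG = "~/.config/aman/config.json"  # path used throughout this guide

Runner = Callable[[Sequence[str]], int]


def run_cli(argv: Sequence[str]) -> int:
    """Run a command and return its exit code without raising on failure."""
    return subprocess.run(list(argv), check=False).returncode


def first_blocked_step(run: Runner = run_cli, config: str = DEFAULT_CONFIG) -> str:
    """Walk the read-only checks in order and report where to look next."""
    config_path = os.path.expanduser(config)
    for command in ("doctor", "self-check"):
        if run(["aman", command, "--config", config_path]) != 0:
            # Stop at the first blocking check; its output names the next step.
            return command
    # Both checks exited 0, so the remaining evidence is in the service logs.
    return "journalctl"
```

Passing the runner in as a parameter keeps the ordering logic testable without Aman installed.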


@@ -16,7 +16,7 @@ Once Aman is installed, the next GA risk is not feature depth. It is whether the
- Define `aman doctor` as the fast preflight check for config, runtime dependencies, hotkey validity, audio device resolution, and service prerequisites.
- Define `aman self-check` as the deeper installed-system readiness check, including managed model availability, writable cache locations, and end-to-end startup prerequisites.
- Make diagnostics return actionable messages with one next step, not generic failures.
-- Standardize startup and runtime error wording across CLI output, service logs, tray notifications, and docs.
+- Standardize startup and runtime error wording across CLI output, service logs, tray-triggered diagnostics, and docs.
- Cover recovery paths for:
- broken config
- missing audio device
@@ -57,7 +57,7 @@ Once Aman is installed, the next GA risk is not feature depth. It is whether the
## Evidence required to close
-- Updated command help and docs for `doctor` and `self-check`.
+- Updated command help and docs for `doctor` and `self-check`, including a public runtime recovery guide.
- Diagnostic output examples for success, warning, and failure cases.
- A release validation report covering restart, offline-start, and representative recovery scenarios.
- Manual support runbooks that use diagnostics first and verbose foreground mode second.


@@ -6,14 +6,13 @@ Aman is not starting from zero. It already has a working X11 daemon, a settings-
The current gaps are:
-- No single distro-agnostic end-user install, update, and uninstall path. The repo documents a Debian package path and partial Arch support, but not one canonical path for X11 users on Fedora, openSUSE, or other mainstream distros.
-- No explicit support contract for "X11 users on any distro." The current docs describe target personas and a package-first approach, but they do not define the exact environment that GA will support.
-- No clear split between service mode and foreground/manual mode. The docs describe enabling a user service and also tell users to run `aman run`, which leaves the default lifecycle ambiguous.
-- No representative distro validation matrix. There is no evidence standard that says which distros must pass install, first run, update, restart, and uninstall checks before release.
+- The canonical portable install, update, and uninstall path now exists, but the representative distro rows still need real manual validation evidence before it can count as a GA-ready channel.
+- The X11 support contract and service-versus-foreground split are now documented, but the public release surface still needs the remaining trust and support work from milestones 4 and 5.
+- Validation matrices now exist for portable lifecycle and runtime reliability, but they are not yet filled with release-specific manual evidence across Debian/Ubuntu, Arch, Fedora, and openSUSE.
- Incomplete trust surface. The project still needs a real license file, real maintainer/contact metadata, real project URLs, published release artifacts, and public checksums.
- Incomplete first-run story. The product describes a settings window and tray workflow, but there is no short happy path, no expected-result walkthrough, and no visual proof that the experience is real.
-- Diagnostics exist, but they are not yet the canonical recovery path for end users. `doctor` and `self-check` are present, but the docs do not yet teach users to rely on them first.
-- Release process exists, but not yet as a GA signoff system. The current release checklist is a good base, but it does not yet enforce the broader validation and support evidence required for a public 1.0 release.
+- Diagnostics are now the canonical recovery path, but milestone 3 still needs release-specific X11 evidence for restart, offline-start, tray diagnostics, and recovery scenarios.
+- The release checklist now includes GA signoff gates, but the project is still short of the broader legal, release-publication, and validation evidence needed for a credible public 1.0 release.
## GA target
@@ -93,7 +92,13 @@ Any future docs, tray copy, and release notes should point users to this same se
[`portable-validation-matrix.md`](./portable-validation-matrix.md) are filled
with real manual validation evidence.
- [ ] [Milestone 3: Runtime Reliability and Diagnostics](./03-runtime-reliability-and-diagnostics.md)
-Make startup, failure handling, and recovery predictable.
+Implementation landed on 2026-03-12: `doctor` and `self-check` now have
+distinct read-only roles, runtime failures log stable IDs plus next steps,
+`make runtime-check` is part of the release surface, and the runtime recovery
+guide plus validation report now exist. Leave this milestone open until the
+release-specific manual rows in
+[`runtime-validation-report.md`](./runtime-validation-report.md) are filled
+with real X11 validation evidence.
- [ ] [Milestone 4: First-Run UX and Support Docs](./04-first-run-ux-and-support-docs.md)
Turn the product from "documented by the author" into "understandable by a new user."
- [ ] [Milestone 5: GA Candidate Validation and Release](./05-ga-candidate-validation-and-release.md)


@@ -0,0 +1,44 @@
# Runtime Validation Report
This document tracks milestone 3 evidence for runtime reliability and
diagnostics.
## Automated evidence
Completed on 2026-03-12:
- `PYTHONPATH=src python3 -m unittest tests.test_diagnostics tests.test_aman_cli tests.test_aman tests.test_aiprocess`
- covers `doctor` versus `self-check`, tri-state diagnostic output, warning
versus failure exit codes, read-only model cache probing, and actionable
runtime log wording for audio, hotkey, injection, editor, and startup
failures
- `PYTHONPATH=src python3 -m unittest discover -s tests -p 'test_*.py'`
- confirms the runtime and diagnostics changes do not regress the broader
daemon, CLI, config, and portable bundle flows
- `python3 -m py_compile src/*.py tests/*.py`
- verifies the updated runtime and diagnostics modules compile cleanly
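The read-only model cache probing mentioned above could look roughly like the sketch below. It is not the code under test: `probe_model_cache` is a hypothetical name, the use of SHA-256 is an assumption, and so is reporting a checksum mismatch as `fail`. Only the `warn` result for an uncached model follows the example output in the recovery guide.

```python
import hashlib
from pathlib import Path
from typing import Tuple


def probe_model_cache(model_path: Path, expected_sha256: str) -> Tuple[str, str]:
    """Inspect the cached model without downloading, repairing, or writing."""
    if not model_path.is_file():
        return ("warn", f"managed editor model is not cached at {model_path}")
    digest = hashlib.sha256()
    with model_path.open("rb") as handle:
        # Hash in 1 MiB chunks so large model files are never loaded whole.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    if digest.hexdigest() != expected_sha256:
        # Assumption: a corrupt cache blocks startup, so report it as fail.
        return ("fail", f"managed editor model at {model_path} has a bad checksum")
    return ("ok", f"managed editor model is cached at {model_path}")
```

The probe opens the file in read mode only, so a test can assert that a missing model stays missing after the check runs.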
## Automated scenario coverage
| Scenario | Evidence | Status | Notes |
| --- | --- | --- | --- |
| `doctor` and `self-check` have distinct roles | `tests.test_diagnostics`, `tests.test_aman_cli` | Complete | `self-check` extends `doctor` with service/model/startup readiness checks |
| Missing config remains read-only | `tests.test_diagnostics` | Complete | Missing config yields `warn` and does not write a default file |
| Managed model cache probing is read-only | `tests.test_diagnostics`, `tests.test_aiprocess` | Complete | `self-check` uses cache probing and does not download or repair |
| Warning-only diagnostics exit `0`; failures exit `2` | `tests.test_aman_cli` | Complete | Human and JSON output share the same status model |
| Runtime failures log stable IDs and one next step | `tests.test_aman_cli`, `tests.test_aman` | Complete | Covers hotkey, audio-input, injection, editor, and startup failure wording |
| Repeated start/stop and shutdown return to `idle` | `tests.test_aman` | Complete | Current daemon tests cover start, stop, cancel, pause, and shutdown paths |
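As an illustration of the "missing config remains read-only" row, a non-mutating config probe might be shaped like this. The function name and messages are hypothetical. The `warn` result for a missing file follows the table, while treating an unparseable file as `fail` is an assumption.

```python
import json
from pathlib import Path
from typing import Tuple


def probe_config(path: Path) -> Tuple[str, str]:
    """Check that the config exists and parses, without creating a default."""
    if not path.exists():
        # Report the gap instead of writing a default config file.
        return ("warn", f"config not found at {path}")
    try:
        json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError) as exc:
        # json.JSONDecodeError and UnicodeDecodeError both subclass ValueError.
        return ("fail", f"config at {path} cannot be read or parsed: {exc}")
    return ("ok", f"loaded config from {path}")
```

A test for this behavior can run the probe against an empty temporary directory and assert that the directory is still empty afterwards.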
## Manual X11 validation
These rows must be filled with release-specific evidence before milestone 3 can
be closed as complete for GA signoff.
| Scenario | Debian/Ubuntu | Arch | Fedora | openSUSE | Reviewer | Status | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Service restart after a successful install | Pending | Pending | Pending | Pending | Pending | Pending | Verify `systemctl --user restart aman` returns to the tray/ready state |
| Reboot followed by successful reuse | Pending | Pending | Pending | Pending | Pending | Pending | Validate recovery after a real session restart |
| Offline startup with an already-cached model | Pending | Pending | Pending | Pending | Pending | Pending | Disable network, then confirm the cached path still starts |
| Missing runtime dependency recovery | Pending | Pending | Pending | Pending | Pending | Pending | Remove one documented dependency, verify diagnostics point to the correct fix |
| Tray-triggered diagnostics logging | Pending | Pending | Pending | Pending | Pending | Pending | Use `Run Diagnostics` and confirm the same IDs/messages appear in logs |
| Service-failure escalation path | Pending | Pending | Pending | Pending | Pending | Pending | Confirm `doctor` -> `self-check` -> `journalctl` -> `aman run --verbose` is enough to explain the failure |