smoke: discoverable scenarios + selectable runs + parallel dispatch

`scripts/smoke.sh` was a 600-line linear script: no way to see what it
covers without reading the whole thing, and no way to run a single
scenario when iterating. Every iteration paid for the full ~5-10 min
suite, which made the feedback loop slow enough that the suite got
skipped.

Refactor into a registry + per-scenario functions:

- Top-of-file SMOKE_SCENARIOS (ordered) + SMOKE_DESCS (one-line desc per
  scenario) + SMOKE_CLASS (pure / repodir / global) drive both listing
  and dispatch. The 21 existing scenario blocks become scenario_<name>
  functions. Bodies are the inline blocks verbatim, modulo the workspace
  fixture move described below.
- New CLI: --list (cheap discovery, no install / no env-vars),
  --scenario NAME (or NAME,NAME,...), --jobs N (parallel dispatch),
  -h / --help.
- New setup_fixtures runs once after the install/doctor/restart preamble
  and produces the throwaway git repo at $repodir that 'repodir'-class
  scenarios consume. Lifted out of scenario_workspace_run so single-
  scenario invocations (e.g. --scenario workspace_dryrun) get the
  fixture even when the scenario that historically built it isn't
  selected.
- Wipe ~/.local/state/banger/ssh/known_hosts in the install preamble.
  `system uninstall --purge` clears /var/lib/banger but the user-side
  known_hosts persists by design — and smoke creates VMs that reuse
  guest IPs (172.16.0.2 etc.) with fresh host keys every run, so a
  leftover entry trips StrictHostKeyChecking and the daemon's wait-
  for-ssh sees only timeouts. This was the real cause of the "guest
  ssh did not come up" flakes that surface across smoke iterations.
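
A minimal sketch of how such a registry can drive both listing and
selection. Only the names SMOKE_SCENARIOS, SMOKE_DESCS, SMOKE_CLASS and
the scenario_<name> convention come from this commit; the example
scenario names, descriptions, helper functions (list_scenarios,
select_scenarios) and the choice to emit selections in registry order
are assumptions for illustration, not the contents of smoke.sh.

```shell
#!/usr/bin/env bash
# Illustrative registry sketch (needs bash >= 4 for associative arrays).
set -euo pipefail

# Ordered scenario names; these three entries are made-up examples.
SMOKE_SCENARIOS=(bare_run workspace_run nat)

# One-line description and class (pure / repodir / global) per scenario.
declare -A SMOKE_DESCS=(
  [bare_run]="boot a VM and run one command"
  [workspace_run]="run a command inside the fixture repo"
  [nat]="check outbound NAT from the guest"
)
declare -A SMOKE_CLASS=(
  [bare_run]=pure
  [workspace_run]=repodir
  [nat]=global
)

# Hypothetical helper: print "name class description" in registry order.
list_scenarios() {
  local name
  for name in "${SMOKE_SCENARIOS[@]}"; do
    printf '%-16s %-8s %s\n' \
      "$name" "${SMOKE_CLASS[$name]}" "${SMOKE_DESCS[$name]}"
  done
}

# Hypothetical helper: expand "NAME,NAME,..." into registry order.
# Returns 2 (the usage-error code) on an unknown name.
select_scenarios() {
  local csv=$1 name want
  local -a wanted
  IFS=',' read -r -a wanted <<<"$csv"
  for want in "${wanted[@]}"; do
    if [[ -z ${SMOKE_CLASS[$want]+set} ]]; then
      echo "unknown scenario: $want" >&2
      return 2
    fi
  done
  for name in "${SMOKE_SCENARIOS[@]}"; do
    for want in "${wanted[@]}"; do
      if [[ $name == "$want" ]]; then
        echo "$name"
      fi
    done
  done
  return 0
}
```

Because listing only reads the arrays, a --list path built this way can
return before any install or environment check runs.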

Parallel dispatch:

- --jobs N opts into a slot-limited pool: 'pure' scenarios fan out as
  individual jobs; 'repodir' scenarios fuse into a single serial chain
  (since they mutate $repodir in registry order); 'global' scenarios
  run serially after the pool, one at a time.
- Cap is min(N, 8) — each parallel slot runs an 8 GiB VM, so RAM is
  the binding constraint.
- In parallel mode, each scenario's stdout/stderr is buffered to a
  per-scenario log, and one PASS/FAIL line is emitted on completion;
  on FAIL the buffered log is dumped. Serial mode (--jobs 1, the
  default) keeps stdout unbuffered exactly as before.
- Parallelism is documented as experimental in --help: it surfaces
  real daemon-side concurrency bugs (image auto-pull manifest race,
  work-seed-refresh race on the shared work-seed.ext4) that don't
  appear in serial mode and that need their own fix in the daemon.
  Serial (--jobs 1) is the reliable path; --jobs N is for fast-
  iteration dev work where occasional re-runs are acceptable.
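
The slot-limited pool and buffered output described above can be
sketched roughly as follows. This is an assumption-laden illustration,
not the code in smoke.sh: the function names (effective_jobs,
run_buffered, run_pool), the log layout and the use of `wait -n`
(bash >= 4.3) are all hypothetical choices.

```shell
#!/usr/bin/env bash
# Illustrative slot-limited pool; requires bash >= 4.3 for `wait -n`.
set -uo pipefail

MAX_JOBS=8   # cap from the commit message: min(N, 8)

# Clamp the requested job count to the cap.
effective_jobs() {
  local n=$1
  if (( n > MAX_JOBS )); then echo "$MAX_JOBS"; else echo "$n"; fi
}

# Run scenario_<name> with output captured to LOGDIR/<name>.log,
# print one PASS/FAIL line, and dump the log on failure.
run_buffered() {
  local name=$1 logdir=$2 log
  log="$logdir/$name.log"
  if "scenario_$name" >"$log" 2>&1; then
    echo "PASS $name"
  else
    echo "FAIL $name"
    cat "$log"
    return 1
  fi
}

# run_pool JOBS LOGDIR NAME...: keep at most min(JOBS, 8) scenarios
# in flight; return 1 if any of them failed.
run_pool() {
  local jobs logdir=$2 running=0 failed=0 name
  jobs=$(effective_jobs "$1")
  shift 2
  for name in "$@"; do
    if (( running >= jobs )); then
      wait -n || failed=1          # block until one slot frees up
      running=$(( running - 1 ))
    fi
    run_buffered "$name" "$logdir" &
    running=$(( running + 1 ))
  done
  while (( running > 0 )); do      # drain the remaining jobs
    wait -n || failed=1
    running=$(( running - 1 ))
  done
  return "$failed"
}
```

In this shape, the fused 'repodir' chain would be submitted as one more
pool entry whose function calls the repodir scenarios in registry
order, and 'global' scenarios would run in a plain loop after run_pool
returns.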

Exit codes: 0 ok, 1 assertion failed, 2 usage error (unknown
scenario, missing SCENARIO=), 77 explicitly selected scenario skipped
(nat, when sudo iptables is unavailable AND nat is the only selected
scenario; otherwise nat is soft-skipped and the run continues).
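
The 77 rule can be read as a small decision function. The sketch below
is hypothetical: the constant names, nat_skip_decision, and the idea of
passing availability in as "yes"/"no" are assumptions, and how smoke.sh
actually probes for sudo iptables is not shown here.

```shell
#!/usr/bin/env bash
# Illustrative exit-code constants matching the list above.
EXIT_OK=0
EXIT_FAIL=1
EXIT_USAGE=2
EXIT_SKIP=77

# nat_skip_decision NAT_AVAILABLE SELECTED...
# Prints "run" when NAT can be tested, "hard-skip" when nat was the only
# selected scenario (caller would exit $EXIT_SKIP), else "soft-skip".
nat_skip_decision() {
  local nat_ok=$1
  shift
  if [[ $nat_ok == yes ]]; then
    echo run
  elif (( $# == 1 )) && [[ $1 == nat ]]; then
    echo hard-skip
  else
    echo soft-skip
  fi
}
```

A caller would translate hard-skip into `exit "$EXIT_SKIP"` and treat
soft-skip as a note in the output without changing the final status.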

Makefile additions:

- `make smoke-list` — cheap discovery, no smoke-build dep, no env vars.
- `make smoke-one SCENARIO=name` — single-scenario run, full preamble.
  MAKECMDGOALS guard catches missing SCENARIO= before any rebuild.
- `make smoke JOBS=N` — passes through to the script's --jobs N.
- Help text covers all three.

Verified: serial full suite passes 21/21 in ~140s on this host;
make smoke-one SCENARIO=workspace_restart runs the recently-added
regression test alone in ~50s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Thales Maciel 2026-04-27 16:56:57 -03:00
parent c9358ab390
commit 115eec8576
2 changed files with 872 additions and 402 deletions