Two new pure scenarios:
* detach_run: -d --rm and -d -- <cmd> combos rejected before VM
creation; bare -d leaves the VM running and ssh-able afterward.
* bootstrap_precondition: workspace with a .mise.toml is refused
without --nat; --no-bootstrap bypasses the precondition and the
run completes normally.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three load-bearing fixes that together let `banger update` (and its
auto-rollback path) restart the helper + daemon without killing
every running VM. New smoke scenarios prove the property end-to-end.
Bug fixes:
1. Disable the firecracker SDK's signal-forwarding goroutine. The
default ForwardSignals = [SIGINT, SIGQUIT, SIGTERM, SIGHUP,
SIGABRT] installs a handler in the helper that propagates the
helper's SIGTERM (sent by systemd on `systemctl stop bangerd-
root.service`) to every running firecracker child. Set
ForwardSignals to an empty (non-nil) slice so setupSignals
short-circuits at len()==0.
2. Add SendSIGKILL=no to bangerd-root.service. KillMode=process
limits the initial SIGTERM to the helper main, but systemd
still SIGKILLs leftover cgroup processes during the
FinalKillSignal stage unless SendSIGKILL=no.
3. Route restart-helper / restart-daemon / wait-daemon-ready
failures through rollbackAndRestart instead of rollbackAndWrap.
rollbackAndWrap restored .previous binaries but didn't re-
restart the failed unit, leaving the helper dead with the
rolled-back binary on disk after a failed update.
Testing infrastructure (production binaries unaffected):
- Hidden --manifest-url and --pubkey-file flags on `banger update`
let the smoke harness redirect the updater at locally-built
release artefacts. Marked Hidden in cobra; not advertised in
--help.
- FetchManifestFrom / VerifyBlobSignatureWithKey /
FetchAndVerifySignatureWithKey export the existing logic against
caller-supplied URL / pubkey. The default entry points still
call them with the embedded canonical values.
Smoke scenarios:
- update_check: --check against fake manifest reports update
available
- update_to_unknown: --to v9.9.9 fails before any host mutation
- update_no_root: refuses without sudo, install untouched
- update_dry_run: stages + verifies, no swap, version unchanged
- update_keeps_vm_alive: real swap to v0.smoke.0; same VM (same
boot_id) answers SSH after the daemon restart
- update_rollback_keeps_vm_alive: v0.smoke.broken-bangerd ships a
bangerd that passes --check-migrations but exits 1 as the
daemon. The post-swap `systemctl restart bangerd` fails,
rollbackAndRestart fires, the .previous binaries are restored
and re-restarted; the same VM still answers SSH afterwards
- daemon_admin (separate prep): covers `banger daemon socket`,
`bangerd --check-migrations --system`, `sudo banger daemon
stop`
The smoke release builder generates a fresh ECDSA P-256 keypair
with openssl, signs SHA256SUMS cosign-compatibly, and serves
artefacts from a backgrounded python http.server.
verify_smoke_check_test.go pins the openssl/cosign signature
equivalence so the smoke release builder can't silently drift.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three quality-of-life improvements now that the daemon-side races
that gated parallel mode are fixed:
1. **Smol VMs by default.** Smoke installs a tuned config.toml at
/etc/banger/config.toml between `system install` and `system
restart` so the respawned daemon picks up:
vcpu = 2
memory_mib = 1024
disk_size = "2G"
system_overlay_size = "2G"
Smoke scenarios assert behavior, not capacity — they don't need
4 vCPU / 8 GiB / 8 GiB / 8 GiB. Per-VM RAM cost drops from 8 GiB
to 1 GiB; nominal disk drops from 16 GiB to 4 GiB (sparse, so
actual use is small either way, but the new ceiling is gentler
on hosts that can't overcommit). Scenarios that test
reconfiguration (vm_set's --vcpu 2 → 4) still pass --vcpu
explicitly, so this default doesn't perturb their assertions.
2. **JOBS defaults to nproc.** The Makefile resolves JOBS to
`$(shell nproc)` if unset; the smoke script's existing cap of 8
keeps the parallel pool sane on bigger hosts. The script always
passes --jobs N now, so behavior is consistent. Override with
`make smoke JOBS=1` for a fully serial run.
3. **Help text catches up.** --help no longer flags parallelism as
experimental (the underlying daemon races are fixed) and now
describes the small-VM default. `make help` mentions the new
default and how to override.
Verified: `make smoke` (no JOBS) on a 32-core box auto-runs with
JOBS=8, smol VMs, 21/21 PASS in 172s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three concurrency bugs surfaced by `make smoke JOBS=4` that all stem
from `vm.create` paths assuming single-caller semantics:
1. **Kernel auto-pull manifest race.** Parallel `vm.create` calls that
each need to auto-pull the same kernel ref both run kernelcat.Fetch
in parallel against the same /var/lib/banger/kernels/<name>/. Fetch
writes manifest.json non-atomically (truncate + write); the peer
reads it back mid-write and trips
"parse manifest for X: unexpected end of JSON input".
Fix: per-name `sync.Mutex` map on `ImageService` (kernelPullLock).
`KernelPull` and `readOrAutoPullKernel` both acquire it and re-check
`kernelcat.ReadLocal` after the lock so a peer who finished while we
waited is treated as success — `readOrAutoPullKernel` does NOT call
`s.KernelPull` because that path errors with "already pulled" on a
peer-success, which would be wrong for auto-pull. Different kernels
stay parallel.
2. **Image auto-pull race.** Same shape as the kernel race but on the
image side: parallel `vm.create` calls both run pullFromBundle /
pullFromOCI for the missing image (each ~minutes of OCI fetch +
ext4 build). The publishImage atom under imageOpsMu only protects
the rename + UpsertImage commit, so the loser does all the work
only to fail at the recheck with "image already exists".
Fix: per-name `sync.Mutex` map on `ImageService` (imagePullLock).
`findOrAutoPullImage` acquires it, re-checks FindImage, and only
then calls PullImage. Loser short-circuits with the
freshly-published image instead of redoing minutes of work.
PullImage's own publishImage recheck stays as defense-in-depth
for callers that bypass the auto-pull path.
3. **Work-seed refresh race.** When the host's SSH key has rotated
since an image was last refreshed, `ensureAuthorizedKeyOnWorkDisk`
triggers `refreshManagedWorkSeedFingerprint`, which rewrote the
shared work-seed.ext4 in place via e2rm + e2cp. Peer `vm.create`
calls doing parallel `MaterializeWorkDisk` rdumps observed a torn
ext4 image — "Superblock checksum does not match superblock".
Fix: stage the rewrite on a sibling tmpfile (`<seed>.refresh.<pid>-<ns>.tmp`)
and atomic-rename. Concurrent readers either have the file open
(kernel keeps the pre-rename inode alive) or open after the rename
(see the new inode) — never observe a partial state. Two parallel
refreshes are idempotent (same daemon, same SSH key) so unique tmp
names are enough; whichever rename lands last wins, with identical
content. UpsertImage runs after the rename so the recorded
fingerprint always matches what's on disk.
Plus one smoke harness fix: reclassify `vm_prune` from `pure` to
`global`. `vm prune -f` removes ALL stopped VMs system-wide, not just
the ones the scenario created — so a parallel peer scenario that
happens to have its VM in `created`/`stopped` momentarily gets wiped.
Moving prune to the post-pool serial phase keeps it from racing with
in-flight scenarios.
After all four fixes, `make smoke JOBS=4` passes 21/21 in 174s
(serial baseline 141s; the small overhead is the buffered-output and
`wait -n` semaphore cost — well worth the parallelism for fast-iter
work on a 32-core box).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`scripts/smoke.sh` was a 600-line linear script: no way to see what it
covers without reading the whole thing, and no way to run a single
scenario when iterating. Every iteration paid the full ~5-10 min suite,
which made fast feedback loops painful enough to avoid the suite.
Refactor into a registry + per-scenario functions:
- Top-of-file SMOKE_SCENARIOS (ordered) + SMOKE_DESCS (one-line desc per
scenario) + SMOKE_CLASS (pure / repodir / global) drive both listing
and dispatch. The 21 existing scenario blocks become scenario_<name>
functions. Bodies are the inline blocks verbatim, modulo the workspace
fixture move described below.
- New CLI: --list (cheap discovery, no install / no env-vars),
--scenario NAME (or NAME,NAME,...), --jobs N (parallel dispatch),
-h / --help.
- New setup_fixtures runs once after the install/doctor/restart preamble
and produces the throwaway git repo at $repodir that 'repodir'-class
scenarios consume. Lifted out of scenario_workspace_run so single-
scenario invocations (e.g. --scenario workspace_dryrun) get the
fixture even when the scenario that historically built it isn't
selected.
- Wipe ~/.local/state/banger/ssh/known_hosts in the install preamble.
`system uninstall --purge` clears /var/lib/banger but the user-side
known_hosts persists by design — and smoke creates VMs that reuse
guest IPs (172.16.0.2 etc.) with fresh host keys every run, so a
leftover entry trips StrictHostKeyChecking and the daemon's wait-
for-ssh sees only timeouts. This was the real cause of the "guest
ssh did not come up" flakes that surface across smoke iterations.
Parallel dispatch:
- --jobs N opts into a slot-limited pool: 'pure' scenarios fan out as
individual jobs; 'repodir' scenarios fuse into a single serial chain
(since they mutate $repodir in registry order); 'global' scenarios
run serially after the pool, one at a time.
- Cap is min(N, 8) — each parallel slot runs an 8 GiB VM, so RAM is
the binding constraint.
- Parallel-mode stdout/stderr per scenario buffer to per-scenario
logs and emit one PASS/FAIL line on completion; on FAIL the buffer
is dumped. Serial mode (--jobs 1, the default) keeps stdout
unbuffered exactly as before.
- Parallelism is documented as experimental in --help: it surfaces
real daemon-side concurrency bugs (image auto-pull manifest race,
work-seed-refresh race on the shared work-seed.ext4) that don't
appear in serial mode and that need their own fix in the daemon.
Serial (--jobs 1) is the reliable path; --jobs N is for fast-
iteration dev work where occasional re-runs are acceptable.
Exit codes: 0 ok, 1 assertion failed, 2 usage error (unknown
scenario, missing SCENARIO=), 77 explicit selection skipped (NAT
when sudo iptables is unavailable AND nat is the only selected
scenario; soft-skip otherwise).
Makefile additions:
- `make smoke-list` — cheap discovery, no smoke-build dep, no env vars.
- `make smoke-one SCENARIO=name` — single-scenario run, full preamble.
MAKECMDGOALS guard catches missing SCENARIO= before any rebuild.
- `make smoke JOBS=N` — passes through to the script's --jobs N.
- Help text covers all three.
Verified: serial full suite passes 21/21 in ~140s on this host;
make smoke-one SCENARIO=workspace_restart runs the recently-added
regression test alone in ~50s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
VM stop has been quietly losing data freshly written via
`vm workspace prepare`: stop+start of a workspace-prepared VM would
come back with /root/repo wiped on the work disk.
Root cause is firecracker + Debian's systemd defaults. FC's
SendCtrlAltDel (the only "graceful shutdown" action FC exposes) just
delivers the keystroke; what the guest does with it is its choice.
Debian routes ctrl-alt-del.target -> reboot.target, so the guest
reboots, FC stays alive, the daemon's 10s wait_for_exit window
expires, and the SIGKILL fallback drops anything still in FC's
userspace I/O path. For an idle VM that's invisible. For one that
just took 100s of small writes through a workspace prepare, it's
data loss.
Fix is to dial the guest over SSH inside StopVM and run
`sync; systemctl --no-block poweroff || /sbin/poweroff -f &` before
the existing SendCtrlAltDel path. The synchronous `sync` is the
load-bearing piece — it blocks until every dirty page hits virtio-blk
and lands in the on-host root.ext4. Whether poweroff completes
before SIGKILL fires is incidental; sync has already run. SSH
unreachable falls back to the old SendCtrlAltDel behaviour so a
broken-network guest can't make stop hang.
Bounded by a 5s SSH-dial timeout so a half-broken guest can't extend
the overall stop window past gracefulShutdownWait.
Also adds two smoke scenarios:
- `workspace + stop/start`: prepare -> stop -> start -> assert
marker survives. This is the regression that caught the bug.
- `vm exec`: end-to-end coverage for d59425a — auto-cd into the
prepared workspace, exit-code propagation, dirty-host warning,
--auto-prepare resync, refusal on stopped VM.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds three small but high-leverage presentation tweaks for v0.1:
1. internal/cli/style is a new ~70 LOC package with Pass/Fail/Warn/
Dim/Bold helpers. Each is TTY-gated and obeys NO_COLOR. No
external dep. Wired into the doctor PASS/FAIL/WARN status, the
"banger:" error prefix on stderr, and the dim 'ready in <elapsed>'
line.
2. internal/cli/errors translates rpc.ErrorResponse into user-facing
text. operation_failed becomes invisible (the message wins);
not_found, already_exists, bad_request, bad_version, unauthorized,
unknown_method get short labels; unknown codes pass through. The
daemon-attached op_id lands in dim parens — paste into
journalctl --grep to find the daemon log line that produced the
failure.
3. Tabwriter config converges on (0, 8, 2, ' ', 0) across every
list/table command. The vm prune confirmation table picked up the
right config; system install + system status switched from bare
"key: value\n" lines to tabular form. printVMSpecLine drops its
Unicode middle dot for an ASCII '|' so terminals without UTF-8
render cleanly.
Tests cover translateRPCError for every code, style helpers no-op
on non-TTY and under NO_COLOR. Smoke status greps switch from
"key: value" to "key value" to match the new format.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move the supported systemd path to two services: an owner-user bangerd for
orchestration and a narrow root helper for bridge/tap, NAT/resolver, dm/loop,
and Firecracker ownership. This removes repeated sudo from daily vm and image
flows without leaving the general daemon running as root.
Add install metadata, system install/status/restart/uninstall commands, and a
system-owned runtime layout. Keep user SSH/config material in the owner home,
lock file_sync to the owner home, and move daemon known_hosts handling out of
the old root-owned control path.
Route privileged lifecycle steps through typed privilegedOps calls, harden the
two systemd units, and rewrite smoke plus docs around the supported service
model.
Verified with make build, make test, make lint, and make smoke on the
supported systemd host path.
A VM name flows into five places that all have narrower grammars
than "arbitrary string":
- the guest's /etc/hostname (vm_disk.patchRootOverlay)
- the guest's /etc/hosts (same)
- the <name>.vm DNS record (vmdns.RecordName)
- the kernel command line (system.BuildBootArgs*)
- VM-dir file-path fragments (layout.VMsDir/<id>, etc.)
Nothing in the chain was validating the input. A name with
whitespace, newline, dot, slash, colon, or = would produce broken
hostnames, weird DNS labels, smuggled kernel cmdline tokens, or
(in the worst case) surprising traversal through the on-disk
layout. Not host shell injection — we already avoid shelling out
with the raw name — but a real correctness and supportability bug.
New: model.ValidateVMName. Rules:
- 1..63 chars (DNS label max per RFC 1123; also a comfortable
/etc/hostname cap)
- lowercase ASCII letters, digits, '-' only
- no leading or trailing '-'
- no normalization — the name is the user-visible identifier
(store key, `ssh <name>.vm`, `vm show`); silently rewriting
"MyVM" → "myvm" would hand the user back something different
than they typed
Called from two places:
- internal/cli/commands_vm.go vmCreateParamsFromFlags — rejects
bad `--name` values before any RPC. Empty name still passes
through so the daemon can generate one.
- internal/daemon/vm_create.go reserveVM — defense in depth for
any non-CLI RPC caller (SDK, direct JSON over the socket).
Tests:
- internal/model/vm_name_test.go — exhaustive character-class
matrix (space, newline, tab, dot, slash, colon, equals, quote,
control chars, unicode letters, uppercase, leading/trailing
hyphen, over-length, max-length-exact, digits-only).
- internal/cli TestVMCreateParamsFromFlagsRejectsInvalidName —
CLI wire-through + empty-name passthrough.
- internal/daemon TestReserveVMRejectsInvalidName — daemon
defense-in-depth (including `box/../evil` path-traversal).
- scripts/smoke.sh — end-to-end rejection + no-leaked-row
assertion.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
--readonly ran `chmod -R a-w` over the workspace after copying, but
every banger guest boots as root, and root bypasses DAC mode checks.
So a user running `vm workspace prepare ... --readonly` got the
mode bits set to 0444 but `echo x >> file` in the guest still
succeeded. The flag promised enforcement it couldn't deliver.
The feature also doesn't match the product model: workspaces are
prepared precisely so the guest CAN edit them, and `workspace
export` exists to pull those edits back as a patch. A
"read-only workspace" contradicts that loop.
Removed:
- CLI flag `--readonly` on `vm workspace prepare`
- api.VMWorkspacePrepareParams.ReadOnly field
- model.WorkspacePrepareResult.ReadOnly field
- daemon chmod dispatch in prepareVMWorkspaceGuestIO
- smoke scenario pinning the (advisory) mode-bit behavior
- misleading "exportbox-readonly" VM name in an unrelated export
test (the test is about not mutating the real git index;
renamed to exportbox-noindex-mutation)
If real enforcement becomes a user need later, the right primitive
is `chattr +i` (immutable bit — root CAN'T write) or a ro bind-mount.
Reintroducing a new flag is cheaper than debugging what the current
one actually guarantees.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Audit of banger's advertised CLI surface vs. what smoke was exercising
turned up several gaps where a regression would have shipped silently.
New scenarios:
- NAT: asserts the per-VM POSTROUTING MASQUERADE rule is installed
with --nat (scoped to the guest /32), idempotent across stop/start,
and torn down on delete. End-to-end curl tests don't work here
because the bridge IP and uplink IP both belong to the host — a
guest reaching the uplink lands on host-local loopback whether
MASQUERADE is set up or not — so the test pins the iptables rule
itself. Skipped if passwordless `sudo iptables` isn't available.
- vm ports: sshd :22 must be visible with the <name>.vm endpoint
(not localhost, not the raw guest IP — the daemon prefers the
DNS record when one exists).
- vm restart: dedicated verb, not a stop+start alias. Asserts a
fresh boot_id to prove the kernel actually recycled.
- vm kill --signal KILL: forceful termination path (distinct from
`vm stop`'s graceful Ctrl-Alt-Del). Post-kill state must be
'stopped' (not 'error') and the dm-snapshot must be cleaned up.
- vm prune -f: batch delete of non-running VMs while preserving any
that are still running. Regression guard for the case where prune
could wipe a live session.
- workspace prepare --readonly: mode bits on /root/repo/<file>
must drop all write bits. Enforcement is advisory against a root
guest, so the test asserts the bits, not EACCES.
- workspace prepare --mode full_copy: alternate transfer path
(tarred into rootfs, no overlay) still lands the repo contents
at /root/repo.
- workspace export --base-commit: guest-side commits captured in
the patch when the pre-commit SHA is pinned. The feature's whole
reason for existing; it had zero coverage. Includes a control
assertion that the plain (no --base-commit) export does NOT see
the committed file.
- ssh-config --install / --uninstall: HOME-isolated to a smoke
tempdir so we don't touch the invoking user's ~/.ssh/config.
Seeds a pre-existing config to catch any regression where
install clobbers instead of appending. Asserts idempotency
(second install doesn't duplicate the Include line) and clean
round-trip (uninstall leaves the user's own content intact).
Coverage deltas from smoke (vs the last run):
internal/hostnat 14.1% → 64.1% (+50pp — NAT rule dance)
internal/daemon/opstate 56.2% → 87.5% (+31pp)
internal/daemon 43.4% → 49.4% (+6pp)
internal/cli 36.1% → 40.4% (+4pp)
internal/daemon/workspace 64.1% → 67.5% (+3pp)
Scenario count: 12 → 21.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two defects compounded to make `vm create X` → `vm stop X` → `vm start X`
→ `vm ssh X` fail with `not_running: vm X is not running` even though
`vm show` reports `state=running`.
1. firecracker-go-sdk's startVMM spawns a goroutine that SIGTERMs
firecracker when the ctx passed to Machine.Start cancels — and
retains that ctx for the lifetime of the VMM, not just the boot
phase. Our Machine.Start wrapper was plumbing the caller's ctx
through, which on `vm.start` is the RPC request ctx. daemon.go's
handleConn cancels reqCtx via `defer cancel()` right after
writing the response. Net effect: firecracker is killed ~150ms
after the `vm start` RPC "completes", invisibly, and the next
`vm ssh` sees a dead PID. `vm.create` side-stepped the bug
because BeginVMCreate detaches to context.Background() before
calling startVMLocked; `vm.start` used the RPC ctx directly.
Fix: Machine.Start now passes context.Background() to the SDK.
We own firecracker lifecycle explicitly (StopVM / KillVM /
cleanupRuntime), so ctx-driven cancellation here was never
actually wired into anything useful.
2. With (1) fixed, the same scenario exposed a second defect:
patchRootOverlay's e2cp/e2rm refuses to touch the dm-snapshot
with "Inode bitmap checksum does not match bitmap" on a restart,
because the COW holds stale free-block/free-inode counters from
the previous guest boot. Kernel ext4 is fine with this; e2fsprogs
is not. Fix: run `e2fsck -fy` on the snapshot between the
dm_snapshot and patch_root_overlay stages. Idempotent on a fresh
snapshot, reconciles the bitmaps on a reused COW.
Regression coverage:
- scripts/repro-restart-bug.sh — minimal create→stop→start→ssh
reproducer with rich on-failure diagnostics (daemon log trace,
firecracker.log tail, handles.json, pgrep-by-apiSock, apiSock
stat). Exits non-zero if the bug returns.
- scripts/smoke.sh — lifecycle scenario (create/ssh/stop/start/
ssh/delete) and vm-set scenario (--vcpu 2 → stop → set --vcpu 4
→ start → assert nproc=4). Both were pulled when the bug was
first found; now restored.
Supporting:
- internal/system/system.ExitCode — extracts exec.ExitError's
code without forcing callers to import os/exec. Needed by the
e2fsck caller (policy test pins os/exec to the shell-out
packages).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The export round-trip (`vm create` → `workspace prepare` → guest edit →
`workspace export`) exposed a reproducible failure on Debian bookworm
guests: `git read-tree HEAD --index-output=/tmp/...` returns exit 128
"unable to write new index file" when the target lives on tmpfs while
`.git` is on the workspace overlay. Move the temp index into
`$(git rev-parse --git-dir)` so it shares a filesystem with `.git/index`
and the lockfile + rename + hardlink dance git does internally works.
Alongside:
- new workspace-export smoke scenario that would have caught this at
the boundary between daemon and guest git
- `make smoke-fresh` = `smoke-clean && smoke` for release-time runs
that want first-install paths (migrations, image pull) stamped into
the coverage report
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five new smoke scenarios layered on top of the existing bare + workspace
vm-run tests:
- exit-code propagation: `sh -c 'exit 42'` must rc=42
- workspace dry-run: --dry-run lists tracked files without a VM
- workspace --include-untracked: opt-in ships files outside the git
index (regression guard on the security-default flip from review 4)
- concurrent vm runs: two --rm invocations in parallel both succeed
(stresses per-VM locks, createVMMu reservation window, tap pool)
- invalid spec rejection: --vcpu 0 must fail with no VM row left
behind (the "cleanup on partial failure" path the review flagged)
The exit-code scenario caught a real bug on first run:
`banger vm run --rm -- sh -c 'exit 42'` returned rc=0, not 42.
Root cause in internal/cli/ssh.go's sshCommandArgs: extra args were
appended to the ssh argv verbatim, relying on ssh(1)'s implicit
space-join to deliver the remote command. That works for single
tokens (echo hello) but re-tokenises multi-word commands on the
remote side: `ssh host sh -c 'exit 42'` becomes remote
`sh -c exit 42`, where `42` is $0 for the already-completed `exit`,
and the exit code the user asked for is lost.
Fix: shell-quote every element of extra (`'sh'` `'-c'` `'exit 42'`)
and join them into a single trailing argv entry. ssh's space-join
then produces exactly the command the user typed on the remote
shell. TestSSHCommandArgs was updated to pin the quoting; the
existing TestRunVMRunCommandModePropagatesExitCode test needed a
one-word quote tweak (`false` → `'false'`).
Smoke run after fix passes all seven scenarios in ~2 min on warm
state. cmd/banger coverage jumped to 100% (the invalid-spec
scenario hits the error-reporting path that wasn't covered
before).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The unit + integration tests can't cross machine.Start — the SDK
boundary would need a fake firecracker that reimplements the
control-plane HTTP API, and the ongoing maintenance cost of keeping
that fake honest with upstream kills the value. Instead, add a
pre-release smoke target that drives REAL Firecracker + real KVM,
captures coverage from the -cover-instrumented binaries, and
surfaces per-package deltas so regressions in the boot path don't
ship silently.
scripts/smoke.sh:
- Isolated XDG_{CONFIG,STATE,CACHE,RUNTIME} so the smoke run can't
touch real user state (state/cache persist under build/smoke/xdg
for fast reruns; runtime is mktemp'd fresh per-run because
sockets can't be reused)
- Preflight: `banger doctor` must pass; UDP :42069 must be free
(otherwise the user's real daemon is up and the smoke daemon
can't bind its DNS listener — fail with an actionable message)
- Scenario 1 — bare: `banger vm run --rm -- echo smoke-bare-ok`
exercises create → start → socket ownership chown → machine.Start
→ SDK waitForSocket race → vsock agent readiness → guest SSH
wait → exec → cleanup → delete
- Scenario 2 — workspace: creates a throwaway git repo, runs
`banger vm run --rm <repo> -- cat /root/repo/smoke-file.txt`,
verifies the tracked file reached the guest (exercises
workDisk capability PrepareHost + workspace.prepare)
- `banger daemon stop` at the end so instrumented binaries flush
GOCOVERDIR pods before the script exits
Makefile additions:
- smoke-build: builds banger/bangerd under build/smoke/bin/ with
`go build -cover`
- smoke: runs the script with GOCOVERDIR set, reports per-package
coverage via `go tool covdata percent`
- smoke-coverage-html: textfmt + go tool cover for a browsable
report
- smoke-clean: nukes build/smoke/ including the persisted XDG
state
Bonus fix uncovered during the first smoke run: doctor treated a
missing state.db as a FAIL ("out of memory" from SQLite
SQLITE_CANTOPEN), which red-flagged every fresh install. Split
the store check: DB file absent → PASS with "will be created on
first daemon start" detail; DB present but unreadable → FAIL as
before. New TestDoctorReport_StoreMissingSurfacesAsPassForFreshInstall
pins the behaviour.
Concrete coverage delta from the first successful smoke run
(compared to `make coverage-total`'s unit-test-only 37.8%):
internal/firecracker 43.6% → 75.0%
internal/daemon/workspace 33.8% → 60.8%
internal/store 40.1% → 56.3%
internal/guest 63.7% → 57.4% (different mix: smoke
exercises real SSH;
unit tests cover more
error branches)
The packages the review flagged are the ones that moved most —
which is the point.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>