banger/internal
Thales Maciel b4afe13b2a
daemon: fix vm start (on a stopped VM) + regression coverage
Two defects compounded to make `vm create X` → `vm stop X` → `vm start X`
→ `vm ssh X` fail with `not_running: vm X is not running` even though
`vm show` reports `state=running`.

1. firecracker-go-sdk's startVMM spawns a goroutine that SIGTERMs
   firecracker when the ctx passed to Machine.Start cancels — and
   retains that ctx for the lifetime of the VMM, not just the boot
   phase. Our Machine.Start wrapper was plumbing the caller's ctx
   through, which on `vm.start` is the RPC request ctx. daemon.go's
   handleConn cancels reqCtx via `defer cancel()` right after
   writing the response. Net effect: firecracker is killed ~150ms
   after the `vm start` RPC "completes", invisibly, and the next
   `vm ssh` sees a dead PID. `vm.create` side-stepped the bug
   because BeginVMCreate detaches to context.Background() before
   calling startVMLocked; `vm.start` used the RPC ctx directly.
   Fix: Machine.Start now passes context.Background() to the SDK.
   We own firecracker lifecycle explicitly (StopVM / KillVM /
   cleanupRuntime), so ctx-driven cancellation here was never
   actually wired into anything useful.

2. With (1) fixed, the same scenario exposed a second defect:
   patchRootOverlay's e2cp/e2rm refuses to touch the dm-snapshot
   with "Inode bitmap checksum does not match bitmap" on a restart,
   because the COW holds stale free-block/free-inode counters from
   the previous guest boot. Kernel ext4 is fine with this; e2fsprogs
   is not. Fix: run `e2fsck -fy` on the snapshot between the
   dm_snapshot and patch_root_overlay stages. Idempotent on a fresh
   snapshot, reconciles the bitmaps on a reused COW.

Regression coverage:
  - scripts/repro-restart-bug.sh — minimal create→stop→start→ssh
    reproducer with rich on-failure diagnostics (daemon log trace,
    firecracker.log tail, handles.json, pgrep-by-apiSock, apiSock
    stat). Exits non-zero if the bug returns.
  - scripts/smoke.sh — lifecycle scenario (create/ssh/stop/start/
    ssh/delete) and vm-set scenario (--vcpu 2 → stop → set --vcpu 4
    → start → assert nproc=4). Both were pulled when the bug was
    first found; now restored.

Supporting:
  - internal/system/system.ExitCode — extracts exec.ExitError's
    code without forcing callers to import os/exec. Needed by the
    e2fsck caller (policy test pins os/exec to the shell-out
    packages).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 12:01:46 -03:00
..
api vm run: ship tracked files only by default; add --include-untracked + --dry-run 2026-04-21 19:53:17 -03:00
buildinfo Stamp shared build metadata into banger binaries 2026-03-22 17:14:06 -03:00
cli smoke: five more scenarios + fix exit-code propagation bug the new ones caught 2026-04-22 19:37:07 -03:00
config ssh-config: harden sameDirOrParent against symlinks + add edge tests 2026-04-22 17:48:06 -03:00
daemon daemon: fix vm start (on a stopped VM) + regression coverage 2026-04-23 12:01:46 -03:00
firecracker daemon: fix vm start (on a stopped VM) + regression coverage 2026-04-23 12:01:46 -03:00
guest ssh: trust-on-first-use host key pinning everywhere 2026-04-19 16:46:03 -03:00
guestconfig Refactor VM lifecycle around capabilities 2026-03-18 19:28:26 -03:00
guestnet Stop using kernel IP autoconfig for runtime VMs 2026-03-21 21:54:18 -03:00
hostnat coverage: medium batch — hostnat runner, store guest-sessions, daemon helpers 2026-04-18 18:03:37 -03:00
imagecat publish-golden-image: content-addressed tarball names 2026-04-18 15:26:57 -03:00
imagepull imagepull/BuildExt4: omit positional fs-size; rely on file truncation 2026-04-18 14:58:42 -03:00
kernelcat Prune legacy void/alpine + customize.sh flows 2026-04-18 15:39:53 -03:00
model config + store: remove dead knobs and stale schema 2026-04-22 10:54:01 -03:00
namegen coverage: make targets + close zero-cov gaps (namegen, sessionstream) 2026-04-18 17:44:37 -03:00
paths runtime sockets: close the local-user race window around control-plane creation 2026-04-20 12:53:47 -03:00
policy Add vsock-backed VM port inspection 2026-03-19 15:52:11 -03:00
rpc Propagate RPC cancellation to daemon requests 2026-03-16 18:28:33 -03:00
store store: edge-path tests for migrations and Open 2026-04-22 17:51:52 -03:00
system daemon: fix vm start (on a stopped VM) + regression coverage 2026-04-23 12:01:46 -03:00
toolingplan coverage: easy-wins batch across cli, system, paths, vmdns, toolingplan 2026-04-18 17:57:05 -03:00
vmdns coverage: easy-wins batch across cli, system, paths, vmdns, toolingplan 2026-04-18 17:57:05 -03:00
vsockagent Add vsock-backed VM port inspection 2026-03-19 15:52:11 -03:00