banger/internal/daemon
Thales Maciel b4afe13b2a
daemon: fix vm start (on a stopped VM) + regression coverage
Two defects compounded to make `vm create X` → `vm stop X` → `vm start X`
→ `vm ssh X` fail with `not_running: vm X is not running` even though
`vm show` reports `state=running`.

1. firecracker-go-sdk's startVMM spawns a goroutine that SIGTERMs
   firecracker when the ctx passed to Machine.Start cancels — and
   retains that ctx for the lifetime of the VMM, not just the boot
   phase. Our Machine.Start wrapper was plumbing the caller's ctx
   through, which on `vm.start` is the RPC request ctx. daemon.go's
   handleConn cancels reqCtx via `defer cancel()` right after
   writing the response. Net effect: firecracker is killed ~150ms
   after the `vm start` RPC "completes", invisibly, and the next
   `vm ssh` sees a dead PID. `vm.create` side-stepped the bug
   because BeginVMCreate detaches to context.Background() before
   calling startVMLocked; `vm.start` used the RPC ctx directly.
   Fix: Machine.Start now passes context.Background() to the SDK.
   We own firecracker lifecycle explicitly (StopVM / KillVM /
   cleanupRuntime), so ctx-driven cancellation here was never
   actually wired into anything useful.

2. With (1) fixed, the same scenario exposed a second defect:
   patchRootOverlay's e2cp/e2rm calls refuse to touch the dm-snapshot
   with "Inode bitmap checksum does not match bitmap" on a restart,
   because the COW holds stale free-block/free-inode counters from
   the previous guest boot. Kernel ext4 is fine with this; e2fsprogs
   is not. Fix: run `e2fsck -fy` on the snapshot between the
   dm_snapshot and patch_root_overlay stages. Idempotent on a fresh
   snapshot, reconciles the bitmaps on a reused COW.

Regression coverage:
  - scripts/repro-restart-bug.sh — minimal create→stop→start→ssh
    reproducer with rich on-failure diagnostics (daemon log trace,
    firecracker.log tail, handles.json, pgrep-by-apiSock, apiSock
    stat). Exits non-zero if the bug returns.
  - scripts/smoke.sh — lifecycle scenario (create/ssh/stop/start/
    ssh/delete) and vm-set scenario (--vcpu 2 → stop → set --vcpu 4
    → start → assert nproc=4). Both were pulled when the bug was
    first found; now restored.

Supporting:
  - internal/system/system.ExitCode — extracts exec.ExitError's
    code without forcing callers to import os/exec. Needed by the
    e2fsck caller (policy test pins os/exec to the shell-out
    packages).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 12:01:46 -03:00
dmsnap Extract opstate and dmsnap into subpackages 2026-04-15 16:02:43 -03:00
fcproc fcproc: targeted tests for waitForPath + EnsureSocketAccess error paths 2026-04-22 17:49:42 -03:00
imagemgr Remove image build --from-image; doctor treats catalog images as OK 2026-04-18 15:54:29 -03:00
opstate coverage: medium batch — hostnat runner, store guest-sessions, daemon helpers 2026-04-18 18:03:37 -03:00
workspace seams: move the last four package globals onto instance fields 2026-04-22 12:07:14 -03:00
ARCHITECTURE.md docs: resync package docs, AGENTS, and kernel-catalog with current code 2026-04-22 13:01:11 -03:00
autopull_test.go daemon split (6/n): extract wireServices + drop lazy service getters 2026-04-21 15:55:28 -03:00
capabilities.go daemon split (7/n): narrow capability interfaces, wire deps at construction 2026-04-21 15:59:09 -03:00
capabilities_test.go daemon split (7/n): narrow capability interfaces, wire deps at construction 2026-04-21 15:59:09 -03:00
concurrency_test.go daemon split (6/n): extract wireServices + drop lazy service getters 2026-04-21 15:55:28 -03:00
daemon.go vmservice: delete dead guestWaitForSSH + guestDial seams 2026-04-22 12:45:27 -03:00
daemon_test.go daemon split (6/n): extract wireServices + drop lazy service getters 2026-04-21 15:55:28 -03:00
daemon_testing_test.go test: add newTestDaemon harness + options 2026-04-22 17:45:43 -03:00
dns_routing.go seams: move the last four package globals onto instance fields 2026-04-22 12:07:14 -03:00
dns_routing_test.go seams: move the last four package globals onto instance fields 2026-04-22 12:07:14 -03:00
doc.go docs: resync package docs, AGENTS, and kernel-catalog with current code 2026-04-22 13:01:11 -03:00
doctor.go make smoke: end-to-end boot suite with coverage from real VM runs 2026-04-22 18:59:57 -03:00
doctor_test.go make smoke: end-to-end boot suite with coverage from real VM runs 2026-04-22 18:59:57 -03:00
fake_firecracker_test.go remove vm session feature 2026-04-20 12:47:58 -03:00
fastpath_test.go daemon split (6/n): extract wireServices + drop lazy service getters 2026-04-21 15:55:28 -03:00
guest_ssh.go remove vm session feature 2026-04-20 12:47:58 -03:00
host_network.go seams: move the last four package globals onto instance fields 2026-04-22 12:07:14 -03:00
image_seed.go daemon split (2/5): extract *ImageService service 2026-04-20 20:30:32 -03:00
image_service.go daemon split (6/n): extract wireServices + drop lazy service getters 2026-04-21 15:55:28 -03:00
images.go daemon split (2/5): extract *ImageService service 2026-04-20 20:30:32 -03:00
images_helpers_test.go coverage: medium batch — hostnat runner, store guest-sessions, daemon helpers 2026-04-18 18:03:37 -03:00
images_pull.go daemon split (2/5): extract *ImageService service 2026-04-20 20:30:32 -03:00
images_pull_bundle_test.go daemon split (6/n): extract wireServices + drop lazy service getters 2026-04-21 15:55:28 -03:00
images_pull_test.go daemon split (6/n): extract wireServices + drop lazy service getters 2026-04-21 15:55:28 -03:00
kernels.go daemon split (2/5): extract *ImageService service 2026-04-20 20:30:32 -03:00
kernels_test.go daemon split (6/n): extract wireServices + drop lazy service getters 2026-04-21 15:55:28 -03:00
lifecycle_flow_test.go test: end-to-end VMService lifecycle flow harness 2026-04-22 17:55:04 -03:00
logger.go vm state: split transient kernel/process handles off the durable schema 2026-04-19 14:18:13 -03:00
logger_test.go seams: move the last four package globals onto instance fields 2026-04-22 12:07:14 -03:00
nat.go daemon split (1/5): extract *HostNetwork service 2026-04-20 20:11:46 -03:00
nat_capability_test.go tests: targeted coverage for doctor, workspace rejections, and nat capability 2026-04-22 12:58:12 -03:00
nat_test.go vm state: split transient kernel/process handles off the durable schema 2026-04-19 14:18:13 -03:00
open_close_test.go daemon split (6/n): extract wireServices + drop lazy service getters 2026-04-21 15:55:28 -03:00
ports.go daemon split (4/5): extract *VMService service 2026-04-20 20:57:05 -03:00
preflight.go seams: move the last four package globals onto instance fields 2026-04-22 12:07:14 -03:00
runtime_assets.go daemon split (4/5): extract *VMService service 2026-04-20 20:57:05 -03:00
snapshot.go daemon split (1/5): extract *HostNetwork service 2026-04-20 20:11:46 -03:00
snapshot_test.go daemon split (6/n): extract wireServices + drop lazy service getters 2026-04-21 15:55:28 -03:00
ssh_client_config.go ssh-config: harden sameDirOrParent against symlinks + add edge tests 2026-04-22 17:48:06 -03:00
ssh_client_config_test.go ssh-config: harden sameDirOrParent against symlinks + add edge tests 2026-04-22 17:48:06 -03:00
sshd_config_test.go guest sshd: drop DEBUG3 + StrictModes no; normalise /root perms 2026-04-19 13:40:40 -03:00
tap_pool.go daemon split (1/5): extract *HostNetwork service 2026-04-20 20:11:46 -03:00
vm.go daemon split (4/5): extract *VMService service 2026-04-20 20:57:05 -03:00
vm_authsync.go daemon split (3/5): extract *WorkspaceService service 2026-04-20 20:42:31 -03:00
vm_create.go daemon split (4/5): extract *VMService service 2026-04-20 20:57:05 -03:00
vm_create_ops.go daemon split (4/5): extract *VMService service 2026-04-20 20:57:05 -03:00
vm_create_test.go daemon split (6/n): extract wireServices + drop lazy service getters 2026-04-21 15:55:28 -03:00
vm_disk.go daemon split (4/5): extract *VMService service 2026-04-20 20:57:05 -03:00
vm_handles.go daemon split (4/5): extract *VMService service 2026-04-20 20:57:05 -03:00
vm_handles_test.go daemon split (6/n): extract wireServices + drop lazy service getters 2026-04-21 15:55:28 -03:00
vm_lifecycle.go daemon: fix vm start (on a stopped VM) + regression coverage 2026-04-23 12:01:46 -03:00
vm_locks.go Move subsystem state/locks off Daemon into owning types 2026-04-15 15:58:33 -03:00
vm_service.go vmservice: delete dead guestWaitForSSH + guestDial seams 2026-04-22 12:45:27 -03:00
vm_set.go daemon split (4/5): extract *VMService service 2026-04-20 20:57:05 -03:00
vm_stats.go daemon split (4/5): extract *VMService service 2026-04-20 20:57:05 -03:00
vm_test.go daemon split (6/n): extract wireServices + drop lazy service getters 2026-04-21 15:55:28 -03:00
workspace.go smoke: workspace export scenario + smoke-fresh target + fix the export bug it caught 2026-04-23 11:34:55 -03:00
workspace_rejection_test.go tests: targeted coverage for doctor, workspace rejections, and nat capability 2026-04-22 12:58:12 -03:00
workspace_service.go seams: move the last four package globals onto instance fields 2026-04-22 12:07:14 -03:00
workspace_test.go vm run: ship tracked files only by default; add --include-untracked + --dry-run 2026-04-21 19:53:17 -03:00