daemon: fix vm start (on a stopped VM) + regression coverage

Two defects compounded to make `vm create X` → `vm stop X` → `vm start X`
→ `vm ssh X` fail with `not_running: vm X is not running` even though
`vm show` reports `state=running`.

1. firecracker-go-sdk's startVMM spawns a goroutine that SIGTERMs
   firecracker when the ctx passed to Machine.Start cancels — and
   retains that ctx for the lifetime of the VMM, not just the boot
   phase. Our Machine.Start wrapper was plumbing the caller's ctx
   through, which on `vm.start` is the RPC request ctx. daemon.go's
   handleConn cancels reqCtx via `defer cancel()` right after
   writing the response. Net effect: firecracker is killed ~150ms
   after the `vm start` RPC "completes", invisibly, and the next
   `vm ssh` sees a dead PID. `vm.create` side-stepped the bug
   because BeginVMCreate detaches to context.Background() before
   calling startVMLocked; `vm.start` used the RPC ctx directly.

   Fix: Machine.Start now passes context.Background() to the SDK.
   We own firecracker lifecycle explicitly (StopVM / KillVM /
   cleanupRuntime), so ctx-driven cancellation here was never
   actually wired into anything useful.

2. With (1) fixed, the same scenario exposed a second defect:
   patchRootOverlay's e2cp/e2rm refuse to touch the dm-snapshot
   with "Inode bitmap checksum does not match bitmap" on a restart,
   because the COW holds stale free-block/free-inode counters from
   the previous guest boot. Kernel ext4 is fine with this; e2fsprogs
   is not.

   Fix: run `e2fsck -fy` on the snapshot between the dm_snapshot
   and patch_root_overlay stages. Idempotent on a fresh snapshot,
   reconciles the bitmaps on a reused COW.
Regression coverage:
- scripts/repro-restart-bug.sh — minimal create→stop→start→ssh
reproducer with rich on-failure diagnostics (daemon log trace,
firecracker.log tail, handles.json, pgrep-by-apiSock, apiSock
stat). Exits non-zero if the bug returns.
- scripts/smoke.sh — lifecycle scenario (create/ssh/stop/start/
ssh/delete) and vm-set scenario (--vcpu 2 → stop → set --vcpu 4
→ start → assert nproc=4). Both were pulled when the bug was
first found; now restored.
Supporting:
- internal/system/system.ExitCode — extracts exec.ExitError's
code without forcing callers to import os/exec. Needed by the
e2fsck caller (policy test pins os/exec to the shell-out
packages).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parent e94e7c4dcc · commit b4afe13b2a · 5 changed files with 303 additions and 1 deletion
```diff
@@ -75,10 +75,28 @@ func NewMachine(ctx context.Context, cfg MachineConfig) (*Machine, error) {
 }
 
 func (m *Machine) Start(ctx context.Context) error {
-	if err := m.machine.Start(ctx); err != nil {
+	// The caller's ctx is INTENTIONALLY not forwarded to the SDK.
+	// firecracker-go-sdk's startVMM (machine.go) spawns a goroutine
+	// that SIGTERMs firecracker the instant this ctx cancels, and
+	// retains it for the lifetime of the VMM — not just the boot
+	// phase. Plumbing an RPC request ctx through would mean
+	// firecracker dies the moment the daemon writes its RPC response
+	// (daemon.go:handleConn defers cancel). That silently breaks
+	// `vm start` on a stopped VM: start "succeeds", the handler
+	// returns, ctx cancels, firecracker is SIGTERMed, and the next
+	// `vm ssh` hits `vmAlive = false`. `vm.create` sidesteps the bug
+	// because BeginVMCreate detaches to a background ctx before
+	// calling startVMLocked.
+	//
+	// We own firecracker lifecycle explicitly — StopVM / KillVM /
+	// cleanupRuntime — so losing ctx-driven cancellation here is
+	// deliberate. The SDK still enforces its own boot-phase timeouts
+	// (socket wait, HTTP) with internal deadlines.
+	if err := m.machine.Start(context.Background()); err != nil {
 		m.closeLog()
 		return err
 	}
+	_ = ctx
 
 	go func() {
 		_ = m.machine.Wait(context.Background())
```