Two defects compounded to make `vm create X` → `vm stop X` → `vm start X`
→ `vm ssh X` fail with `not_running: vm X is not running` even though
`vm show` reports `state=running`.

1. firecracker-go-sdk's startVMM spawns a goroutine that SIGTERMs
   firecracker when the ctx passed to Machine.Start cancels — and
   retains that ctx for the lifetime of the VMM, not just the boot
   phase. Our Machine.Start wrapper was plumbing the caller's ctx
   through, which on `vm.start` is the RPC request ctx. daemon.go's
   handleConn cancels reqCtx via `defer cancel()` right after
   writing the response. Net effect: firecracker is killed ~150ms
   after the `vm start` RPC "completes", invisibly, and the next
   `vm ssh` sees a dead PID. `vm.create` side-stepped the bug
   because BeginVMCreate detaches to context.Background() before
   calling startVMLocked; `vm.start` used the RPC ctx directly.
   Fix: Machine.Start now passes context.Background() to the SDK.
   We own firecracker lifecycle explicitly (StopVM / KillVM /
   cleanupRuntime), so ctx-driven cancellation here was never
   actually wired into anything useful.

2. With (1) fixed, the same scenario exposed a second defect:
   patchRootOverlay's e2cp/e2rm refuses to touch the dm-snapshot
   with "Inode bitmap checksum does not match bitmap" on a restart,
   because the COW holds stale free-block/free-inode counters from
   the previous guest boot. Kernel ext4 is fine with this; e2fsprogs
   is not. Fix: run `e2fsck -fy` on the snapshot between the
   dm_snapshot and patch_root_overlay stages. Idempotent on a fresh
   snapshot, reconciles the bitmaps on a reused COW.
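
For illustration, the ctx-lifetime trap in defect (1) can be reproduced without Firecracker. The sketch below is a minimal stand-in, not the SDK's code: `startWorker`, `handleRequest`, and `wasKilled` are hypothetical names, and the "kill" is just a closed channel. It assumes only what is described above: a watcher goroutine that holds on to the ctx it was started with, and a handler that cancels its request ctx on return.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// startWorker stands in for an SDK call that retains ctx for the whole
// lifetime of the process it supervises (hypothetical, not the SDK's API).
// The returned channel is closed when the watcher "kills" the worker.
func startWorker(ctx context.Context) <-chan struct{} {
	killed := make(chan struct{})
	go func() {
		<-ctx.Done() // fires on cancel, long after "boot" has finished
		close(killed)
	}()
	return killed
}

// handleRequest mimics an RPC handler that cancels its request ctx as soon
// as the response has been written.
func handleRequest(detach bool) <-chan struct{} {
	reqCtx, cancel := context.WithCancel(context.Background())
	defer cancel()

	ctx := reqCtx
	if detach {
		// Detach the worker's lifetime from the request.
		ctx = context.Background()
	}
	return startWorker(ctx)
}

// wasKilled reports whether the worker was killed within 200ms.
func wasKilled(killed <-chan struct{}) bool {
	select {
	case <-killed:
		return true
	case <-time.After(200 * time.Millisecond):
		return false
	}
}

func main() {
	fmt.Println("request ctx: killed =", wasKilled(handleRequest(false)))
	fmt.Println("detached ctx: killed =", wasKilled(handleRequest(true)))
}
```

With the request ctx the worker is killed as soon as the handler returns; with the detached ctx it survives. In the detached case nothing cancels the watcher, so something else has to own shutdown explicitly, which is the role the message assigns to StopVM / KillVM / cleanupRuntime.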
Regression coverage:
  - scripts/repro-restart-bug.sh — minimal create→stop→start→ssh
    reproducer with rich on-failure diagnostics (daemon log trace,
    firecracker.log tail, handles.json, pgrep-by-apiSock, apiSock
    stat). Exits non-zero if the bug returns.
  - scripts/smoke.sh — lifecycle scenario (create/ssh/stop/start/
    ssh/delete) and vm-set scenario (--vcpu 2 → stop → set --vcpu 4
    → start → assert nproc=4). Both were pulled when the bug was
    first found; now restored.

Supporting:
  - internal/system/system.ExitCode — extracts exec.ExitError's
    code without forcing callers to import os/exec. Needed by the
    e2fsck caller (policy test pins os/exec to the shell-out
    packages).
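
A rough sketch of what such an exit-code helper could look like, and how an e2fsck caller might use it. The signature of `exitCode` is an assumption (the real system.ExitCode is not shown here), `fsckClean` encodes an assumed policy over e2fsck's documented exit bitmask, and `simulate` is a hypothetical stand-in that runs `sh -c "exit N"` so the example runs without a filesystem image.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// exitCode reports the exit status carried by err. ok is false when err is
// neither nil nor an *exec.ExitError (for example, the binary was not found).
// Hypothetical signature; the real system.ExitCode may differ.
func exitCode(err error) (code int, ok bool) {
	if err == nil {
		return 0, true
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		return exitErr.ExitCode(), true
	}
	return -1, false
}

// fsckClean applies an assumed policy to e2fsck's documented exit bitmask:
// 0 = no errors, 1 = errors corrected, 2 = corrected (reboot advised),
// 4 and above = uncorrected errors or operational failures.
func fsckClean(code int) bool {
	return code >= 0 && code < 4
}

// simulate runs a shell that exits with the given status, standing in for
// an e2fsck invocation.
func simulate(status int) error {
	return exec.Command("sh", "-c", fmt.Sprintf("exit %d", status)).Run()
}

func main() {
	for _, status := range []int{0, 1, 4} {
		code, ok := exitCode(simulate(status))
		fmt.Printf("status=%d ok=%v clean=%v\n", code, ok, ok && fsckClean(code))
	}
}
```

Keeping the `errors.As` unwrap inside one helper means a caller only handles an `error` and an `int`, so it does not need its own os/exec import.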
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
315 lines
14 KiB
Bash
Executable file
#!/usr/bin/env bash
#
# scripts/smoke.sh — end-to-end smoke suite for banger.
#
# Drives a real create → start → ssh → exec → delete cycle against
# real Firecracker + real KVM on the host. Intended as a pre-release
# gate: the Go unit + integration tests don't and can't cover the
# post-machine.Start path (socket ownership, guest boot, vsock agent
# wait, guest SSH, workspace prepare). If this suite fails, don't
# ship.
#
# State lives under $BANGER_SMOKE_XDG_DIR (set by `make smoke`,
# defaults to build/smoke/xdg). It's ISOLATED from the invoking
# user's real banger install via XDG_{CONFIG,STATE,CACHE,RUNTIME}
# overrides, but PERSISTED across runs — so the first smoke pulls
# the golden image, subsequent smokes reuse it. `make smoke-clean`
# wipes it.
#
# Invoked via `make smoke`, which sets the three env vars below.
# Don't run this directly unless you know they're set.

set -euo pipefail

log() { printf '[smoke] %s\n' "$*" >&2; }
die() { printf '[smoke] FAIL: %s\n' "$*" >&2; exit 1; }

# wait_for_ssh polls `vm ssh <vm> -- true` until it succeeds or the
# timeout expires. `vm ssh` — unlike `vm run` — does not itself wait
# for guest sshd, so scenarios that call `vm create` / `vm start`
# back-to-back with `vm ssh` need this shim. 60s matches
# vmRunSSHTimeout.
wait_for_ssh() {
    local vm="$1"
    local deadline=$(( $(date +%s) + 60 ))
    while (( $(date +%s) < deadline )); do
        if "$BANGER" vm ssh "$vm" -- true >/dev/null 2>&1; then
            return 0
        fi
        sleep 1
    done
    return 1
}

: "${BANGER_SMOKE_BIN_DIR:?must point at the instrumented binary dir, set by make smoke}"
: "${BANGER_SMOKE_COVER_DIR:?must point at the coverage dir, set by make smoke}"
: "${BANGER_SMOKE_XDG_DIR:?must point at the isolated XDG root, set by make smoke}"

BANGER="$BANGER_SMOKE_BIN_DIR/banger"
BANGERD="$BANGER_SMOKE_BIN_DIR/bangerd"
VSOCK_AGENT="$BANGER_SMOKE_BIN_DIR/banger-vsock-agent"

for bin in "$BANGER" "$BANGERD" "$VSOCK_AGENT"; do
    [[ -x "$bin" ]] || die "binary missing or not executable: $bin"
done

# Persistent XDG dirs (state, cache, config) so repeated smoke
# runs reuse the pulled golden image instead of re-downloading
# ~300MB each time. Runtime dir needs to be fresh per-run because
# it holds sockets the daemon cleans up on stop and refuses to
# reuse if any are stale.
mkdir -p \
    "$BANGER_SMOKE_XDG_DIR/config" \
    "$BANGER_SMOKE_XDG_DIR/state" \
    "$BANGER_SMOKE_XDG_DIR/cache"
runtime_dir="$(mktemp -d -t banger-smoke-runtime-XXXXXX)"
# shellcheck disable=SC2064
trap "rm -rf '$runtime_dir'" EXIT
chmod 0700 "$runtime_dir"

export XDG_CONFIG_HOME="$BANGER_SMOKE_XDG_DIR/config"
export XDG_STATE_HOME="$BANGER_SMOKE_XDG_DIR/state"
export XDG_CACHE_HOME="$BANGER_SMOKE_XDG_DIR/cache"
export XDG_RUNTIME_DIR="$runtime_dir"

# Point banger at its companion binaries inside the smoke build.
export BANGER_DAEMON_BIN="$BANGERD"
export BANGER_VSOCK_AGENT_BIN="$VSOCK_AGENT"

# Instrumented binaries dump coverage here on clean exit.
export GOCOVERDIR="$BANGER_SMOKE_COVER_DIR"
mkdir -p "$GOCOVERDIR"

# Any smoke daemon left behind from a prior run that crashed mid-
# scenario would reuse the stale socket path and confuse
# ensureDaemon. Best-effort stop; ignore if nothing is running.
"$BANGER" daemon stop >/dev/null 2>&1 || true

# banger's vmDNS binds 127.0.0.1:42069 (UDP) hard. If the user's
# real (non-smoke) daemon is running, its listener holds the port
# and the smoke daemon's Open() fails before any scenario runs.
# Fail fast with an actionable message — don't guess whether to
# stop the user's daemon for them.
if command -v ss >/dev/null 2>&1 && ss -Huln 2>/dev/null | awk '{print $4}' | grep -q '[:.]42069$'; then
    die 'port 127.0.0.1:42069 is already bound (likely your real banger daemon); stop it with `banger daemon stop` and re-run `make smoke`'
fi

# --- doctor -----------------------------------------------------------
log 'doctor: checking host readiness'
if ! "$BANGER" doctor; then
    die 'doctor reported failures; fix the host before running smoke'
fi

# --- bare vm run ------------------------------------------------------
log "bare vm run: create + start + ssh + exec 'echo smoke-bare-ok' + --rm"
bare_out="$("$BANGER" vm run --rm -- echo smoke-bare-ok)" || die "bare vm run exit $?"
grep -q 'smoke-bare-ok' <<<"$bare_out" || die "bare vm run stdout missing marker: $bare_out"

# --- workspace vm run -------------------------------------------------
log 'workspace vm run: preparing a throwaway git repo'
repodir="$runtime_dir/fake-repo"
mkdir -p "$repodir"
(
    cd "$repodir"
    git init -q -b main
    git config commit.gpgsign false
    git config user.name smoke
    git config user.email smoke@smoke
    echo 'smoke-workspace-marker' > smoke-file.txt
    git add .
    git commit -q -m init
)

log "workspace vm run: create + start + workspace prepare + cat guest file + --rm"
ws_out="$("$BANGER" vm run --rm "$repodir" -- cat /root/repo/smoke-file.txt)" || die "workspace vm run exit $?"
grep -q 'smoke-workspace-marker' <<<"$ws_out" || die "workspace vm run didn't ship smoke-file.txt: $ws_out"

# --- command exit-code propagation ------------------------------------
# A non-zero exit from the guest command must surface as banger's own
# exit code. Regressions here are hard to catch any other way — the
# local Go tests don't cross the SSH boundary, and users expect their
# CI scripts that wrap `banger vm run` to fail when the thing inside
# the VM failed.
log 'exit-code propagation: guest `sh -c "exit 42"` must produce rc=42'
set +e
"$BANGER" vm run --rm -- sh -c 'exit 42'
rc=$?
set -e
[[ "$rc" -eq 42 ]] || die "exit-code propagation: got rc=$rc, want 42"

# --- workspace dry-run (no VM) ----------------------------------------
# Pure CLI-side path — no VM, no sudo, just the local git inspection
# against d.repoInspector. Fast; catches regressions in the preview
# output (file list shape, mode line) that the Go tests already pin
# but that could still be broken by a client-side wiring change.
log 'workspace dry-run: list tracked files without creating a VM'
dry_out="$("$BANGER" vm run --dry-run "$repodir")" || die "dry-run exit $?"
grep -q 'smoke-file.txt' <<<"$dry_out" || die "dry-run didn't list smoke-file.txt: $dry_out"
grep -q 'mode: tracked only' <<<"$dry_out" || die "dry-run mode line missing or wrong: $dry_out"

# --- workspace --include-untracked -----------------------------------
# The default is tracked-only (review cycle 4). Opt-in must ship
# untracked files too. Write one, run with --include-untracked, verify
# it reaches the guest.
log 'workspace --include-untracked: opt-in ships files outside the git index'
echo 'untracked-marker' > "$repodir/smoke-untracked.txt"
inc_out="$("$BANGER" vm run --rm --include-untracked "$repodir" -- cat /root/repo/smoke-untracked.txt)" || die "include-untracked vm run exit $?"
grep -q 'untracked-marker' <<<"$inc_out" || die "--include-untracked didn't ship the untracked file: $inc_out"
# Restore repo to tracked-only state for any later scenarios.
rm -f "$repodir/smoke-untracked.txt"

# --- workspace export round-trip --------------------------------------
# Exercises ExportVMWorkspace: create a VM, prepare the workspace,
# write a new file inside the guest, then export and assert the
# emitted patch sees the guest-side change. If the export pipeline
# (temp-index, git add -A, diff --binary) ever stops capturing
# guest-side changes, this scenario catches it.
log 'workspace export: create + prepare + guest edit + export + assert marker'
export_vm='smoke-export'
cleanup_export_vm() {
    "$BANGER" vm delete "$export_vm" >/dev/null 2>&1 || true
}
# Chain the VM cleanup with the existing runtime_dir trap so a mid-
# scenario failure still tears the VM down before the script exits.
# shellcheck disable=SC2064
trap "cleanup_export_vm; rm -rf '$runtime_dir'" EXIT

"$BANGER" vm create --name "$export_vm" --image debian-bookworm >/dev/null \
    || die "export: vm create exit $?"
"$BANGER" vm workspace prepare "$export_vm" "$repodir" >/dev/null \
    || die "export: workspace prepare exit $?"
"$BANGER" vm ssh "$export_vm" -- sh -c 'echo guest-edit > /root/repo/new-guest-file.txt' \
    || die "export: guest-side file write exit $?"
export_patch="$runtime_dir/smoke-export.diff"
"$BANGER" vm workspace export "$export_vm" --output "$export_patch" \
    || die "export: workspace export exit $?"
[[ -s "$export_patch" ]] || die "export: patch file empty at $export_patch"
grep -q 'new-guest-file.txt' "$export_patch" \
    || die "export: patch missing new-guest-file.txt marker (head: $(head -c 400 "$export_patch"))"

cleanup_export_vm
# shellcheck disable=SC2064
trap "rm -rf '$runtime_dir'" EXIT

# --- concurrent vm runs -----------------------------------------------
# Stresses per-VM lock scoping, the tap pool warm-up path, and
# createVMMu's narrow reservation window. Two `vm run --rm` invocations
# that actually overlap should both succeed. A regression that
# serialises create path too aggressively would make this slow but
# still pass; a regression that breaks tap allocation or name
# uniqueness would fail one of them.
log 'concurrent vm runs: two --rm invocations must both succeed'
tmpA="$runtime_dir/concurrent-a.out"
tmpB="$runtime_dir/concurrent-b.out"
"$BANGER" vm run --rm -- echo smoke-concurrent-a > "$tmpA" 2>&1 &
pidA=$!
"$BANGER" vm run --rm -- echo smoke-concurrent-b > "$tmpB" 2>&1 &
pidB=$!
wait "$pidA" || die "concurrent VM A exited non-zero: $(cat "$tmpA")"
wait "$pidB" || die "concurrent VM B exited non-zero: $(cat "$tmpB")"
grep -q 'smoke-concurrent-a' "$tmpA" || die "concurrent VM A missing marker: $(cat "$tmpA")"
grep -q 'smoke-concurrent-b' "$tmpB" || die "concurrent VM B missing marker: $(cat "$tmpB")"

# --- vm lifecycle (create → stop → start → delete) --------------------
# Exercises lifecycle verbs directly instead of the --rm convenience
# path. The critical assertion is the second `vm ssh` AFTER stop/start:
# that path (a) rebuilds the handle cache via rediscoverHandles,
# (b) runs the e2fsck-snapshot sanitize step before patchRootOverlay
# on the dirty COW, and (c) shouldn't die from the SDK's
# ctx-SIGTERM-on-RPC-close goroutine. All three were bugs at one
# point; this scenario guards all three at once.
log 'vm lifecycle: explicit create / stop / start / ssh / delete'
lifecycle_name=smoke-lifecycle
# shellcheck disable=SC2064
trap "\"$BANGER\" vm delete $lifecycle_name >/dev/null 2>&1 || true; rm -rf '$runtime_dir'" EXIT

"$BANGER" vm create --name "$lifecycle_name" >/dev/null || die "vm create $lifecycle_name failed"
show_out="$("$BANGER" vm show "$lifecycle_name")" || die "vm show after create failed"
grep -q '"state": "running"' <<<"$show_out" || die "post-create state not running: $show_out"

wait_for_ssh "$lifecycle_name" || die 'vm lifecycle: ssh did not come up after create'
ssh_out="$("$BANGER" vm ssh "$lifecycle_name" -- echo hello-1)" || die "vm ssh #1 failed"
grep -q 'hello-1' <<<"$ssh_out" || die "vm ssh #1 missing marker: $ssh_out"

"$BANGER" vm stop "$lifecycle_name" >/dev/null || die "vm stop failed"
show_out="$("$BANGER" vm show "$lifecycle_name")" || die "vm show after stop failed"
grep -q '"state": "stopped"' <<<"$show_out" || die "post-stop state not stopped: $show_out"

"$BANGER" vm start "$lifecycle_name" >/dev/null || die "vm start (from stopped) failed"
show_out="$("$BANGER" vm show "$lifecycle_name")" || die "vm show after start failed"
grep -q '"state": "running"' <<<"$show_out" || die "post-start state not running: $show_out"

wait_for_ssh "$lifecycle_name" || die 'vm lifecycle: ssh did not come up after restart'
ssh_out="$("$BANGER" vm ssh "$lifecycle_name" -- echo hello-2)" || die "vm ssh #2 (post-restart) failed"
grep -q 'hello-2' <<<"$ssh_out" || die "vm ssh #2 missing marker: $ssh_out"

"$BANGER" vm delete "$lifecycle_name" >/dev/null || die "vm delete failed"
set +e
"$BANGER" vm show "$lifecycle_name" >/dev/null 2>&1
rc=$?
set -e
[[ "$rc" -ne 0 ]] || die "vm show still finds $lifecycle_name after delete"
# shellcheck disable=SC2064
trap "rm -rf '$runtime_dir'" EXIT

# --- vm set reconfiguration (vcpu change + restart) -------------------
# Exercises SetVM + configChangeCapability. Create with --vcpu 2,
# stop, `vm set --vcpu 4`, restart, confirm the guest sees the new
# count. Regression guard: a restart that reuses the pre-change spec
# would leave nproc at 2.
log 'vm set: create --vcpu 2 → stop → set --vcpu 4 → restart → nproc=4'
# shellcheck disable=SC2064
trap "\"$BANGER\" vm delete smoke-set >/dev/null 2>&1 || true; rm -rf '$runtime_dir'" EXIT

"$BANGER" vm create --name smoke-set --vcpu 2 >/dev/null || die 'vm set: create failed'
wait_for_ssh smoke-set || die 'vm set: initial ssh did not come up'

set +e
nproc_before="$("$BANGER" vm ssh smoke-set -- nproc 2>/dev/null)"
rc=$?
set -e
[[ "$rc" -eq 0 ]] || die "vm set: initial nproc ssh exit $rc"
[[ "$(printf '%s' "$nproc_before" | tr -d '[:space:]')" == "2" ]] \
    || die "vm set: initial nproc got '$nproc_before', want 2"

"$BANGER" vm stop smoke-set >/dev/null || die 'vm set: stop failed'
"$BANGER" vm set smoke-set --vcpu 4 >/dev/null || die 'vm set: reconfigure failed'
"$BANGER" vm start smoke-set >/dev/null || die 'vm set: restart failed'
wait_for_ssh smoke-set || die 'vm set: post-reconfig ssh did not come up'

set +e
nproc_after="$("$BANGER" vm ssh smoke-set -- nproc 2>/dev/null)"
rc=$?
set -e
[[ "$rc" -eq 0 ]] || die "vm set: post-reconfig nproc ssh exit $rc"
[[ "$(printf '%s' "$nproc_after" | tr -d '[:space:]')" == "4" ]] \
    || die "vm set: post-reconfig nproc got '$nproc_after', want 4 (spec change didn't land)"

"$BANGER" vm delete smoke-set >/dev/null || die 'vm set: delete failed'
# shellcheck disable=SC2064
trap "rm -rf '$runtime_dir'" EXIT

# --- invalid spec rejection + no artifact leak ------------------------
# Tests the negative-path create flow: a blatantly invalid VM spec
# must fail before any VM row is persisted. The review cycle flagged
# "cleanup on partial failure" as under-tested; this scenario pins
# that a rejected create doesn't leak a reservation we then have to
# clean up by hand.
log 'invalid spec rejection: --vcpu 0 must fail and leave no VM behind'
pre_vms="$("$BANGER" vm list --all 2>/dev/null | wc -l)"
set +e
"$BANGER" vm run --rm --vcpu 0 -- echo unused >/dev/null 2>&1
rc=$?
set -e
[[ "$rc" -ne 0 ]] || die 'invalid spec: vm run succeeded despite --vcpu 0'
post_vms="$("$BANGER" vm list --all 2>/dev/null | wc -l)"
[[ "$pre_vms" == "$post_vms" ]] || die "invalid spec leaked a VM row: pre=$pre_vms, post=$post_vms"

# --- daemon stop (flushes coverage) -----------------------------------
log 'stopping daemon so instrumented binaries flush coverage'
"$BANGER" daemon stop >/dev/null 2>&1 || true
# Give the daemon a moment to write its covdata pod before the trap
# tears down runtime_dir.
sleep 0.5

log 'all scenarios passed'