update: VMs survive banger update and rollback

Three load-bearing fixes that together let `banger update` (and its
auto-rollback path) restart the helper + daemon without killing every
running VM. New smoke scenarios prove the property end-to-end.

Bug fixes:

1. Disable the firecracker SDK's signal-forwarding goroutine. The
   default ForwardSignals = [SIGINT, SIGQUIT, SIGTERM, SIGHUP,
   SIGABRT] installs a handler in the helper that propagates the
   helper's SIGTERM (sent by systemd on `systemctl stop
   bangerd-root.service`) to every running firecracker child. Set
   ForwardSignals to an empty (non-nil) slice so setupSignals
   short-circuits at len()==0.

2. Add SendSIGKILL=no to bangerd-root.service. KillMode=process
   limits the initial SIGTERM to the helper main, but systemd still
   SIGKILLs leftover cgroup processes during the FinalKillSignal
   stage unless SendSIGKILL=no.

3. Route restart-helper / restart-daemon / wait-daemon-ready failures
   through rollbackAndRestart instead of rollbackAndWrap.
   rollbackAndWrap restored .previous binaries but didn't re-restart
   the failed unit, leaving the helper dead with the rolled-back
   binary on disk after a failed update.

Testing infrastructure (production binaries unaffected):

- Hidden --manifest-url and --pubkey-file flags on `banger update`
  let the smoke harness redirect the updater at locally-built
  release artefacts. Marked Hidden in cobra; not advertised in
  --help.
- FetchManifestFrom / VerifyBlobSignatureWithKey /
  FetchAndVerifySignatureWithKey export the existing logic against
  caller-supplied URL / pubkey. The default entry points still call
  them with the embedded canonical values.
Smoke scenarios:

- update_check: --check against fake manifest reports update
  available
- update_to_unknown: --to v9.9.9 fails before any host mutation
- update_no_root: refuses without sudo, install untouched
- update_dry_run: stages + verifies, no swap, version unchanged
- update_keeps_vm_alive: real swap to v0.smoke.0; same VM (same
  boot_id) answers SSH after the daemon restart
- update_rollback_keeps_vm_alive: v0.smoke.broken-bangerd ships a
  bangerd that passes --check-migrations but exits 1 as the daemon.
  The post-swap `systemctl restart bangerd` fails,
  rollbackAndRestart fires, the .previous binaries are restored and
  re-restarted; the same VM still answers SSH afterwards
- daemon_admin (separate prep): covers `banger daemon socket`,
  `bangerd --check-migrations --system`, `sudo banger daemon stop`

The smoke release builder generates a fresh ECDSA P-256 keypair with
openssl, signs SHA256SUMS cosign-compatibly, and serves artefacts from
a backgrounded python http.server. verify_smoke_check_test.go pins the
openssl/cosign signature equivalence so the smoke release builder
can't silently drift.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 12:08:08 -03:00
commit 2606bfbabb
parent 7e528f30b3
8 changed files with 609 additions and 50 deletions
--- a/internal/cli/commands_system.go
+++ b/internal/cli/commands_system.go
@@ -364,18 +364,34 @@ func renderRootHelperSystemdUnit() string {
 		"ExecStart=" + systemBangerdBin + " --root-helper",
 		"Restart=on-failure",
 		"RestartSec=1s",
-		// KillMode=process is load-bearing: the helper unit's cgroup is
-		// where every banger-launched firecracker process lives (see
-		// validateFirecrackerPID). Without this, `systemctl restart
-		// bangerd-root.service` — which `banger update` runs — would
-		// SIGKILL every in-flight VM along with the helper because
-		// systemd's default KillMode=control-group nukes the whole cgroup.
-		// With process mode, only the helper PID is signaled; firecracker
-		// children survive, the new helper instance re-attaches via the
-		// helper RPC, daemon reconcile re-seeds in-memory state, VM keeps
-		// running. `banger system uninstall` and the daemon's vm-stop
-		// path explicitly stop firecracker processes when actually needed.
+		// KillMode=process + SendSIGKILL=no together make the helper
+		// safe to restart while banger-launched firecrackers are
+		// running. firecracker lives in this unit's cgroup (jailer
+		// doesn't open a sub-cgroup), so:
+		//
+		//   - Default control-group mode SIGKILLs every process in
+		//     the cgroup on stop.
+		//   - KillMode=process limits the initial SIGTERM to the
+		//     helper main PID; systemd leaves remaining cgroup
+		//     processes alone (and logs "Unit process N (firecracker)
+		//     remains running after unit stopped").
+		//   - SendSIGKILL=no disables the FinalKillSignal escalation
+		//     that would otherwise SIGKILL leftovers after the timeout.
+		//
+		// One more pitfall: the firecracker SDK installs a default
+		// signal-forwarding goroutine in the helper that catches
+		// SIGTERM (etc.) and forwards it to every firecracker child.
+		// We disable that explicitly via ForwardSignals: []os.Signal{}
+		// in firecracker.buildConfig — without that override, systemd
+		// signaling the helper main would propagate to every running
+		// VM regardless of what these directives do.
+		//
+		// `banger system uninstall` and the daemon's vm-stop path
+		// explicitly stop firecracker processes when actually needed,
+		// so we don't lose the systemd-driven kill as a real safety
+		// net — banger drives those kills itself.
 		"KillMode=process",
+		"SendSIGKILL=no",
 		"Environment=PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
 		"Environment=TMPDIR=" + installmeta.DefaultRootHelperRuntimeDir,
 		"UMask=0077",