update: VMs survive banger update and rollback

Three load-bearing fixes that together let `banger update` (and its
auto-rollback path) restart the helper + daemon without killing
every running VM. New smoke scenarios prove the property end-to-end.

Bug fixes:

1. Disable the firecracker SDK's signal-forwarding goroutine. The
   default ForwardSignals = [SIGINT, SIGQUIT, SIGTERM, SIGHUP,
   SIGABRT] installs a handler in the helper that propagates the
   helper's SIGTERM (sent by systemd on
   `systemctl stop bangerd-root.service`) to every running
   firecracker child. Set ForwardSignals to an empty (non-nil)
   slice so setupSignals short-circuits at len()==0.

2. Add SendSIGKILL=no to bangerd-root.service. KillMode=process
   limits the initial SIGTERM to the helper main, but systemd
   still SIGKILLs leftover cgroup processes during the
   FinalKillSignal stage unless SendSIGKILL=no.

3. Route restart-helper / restart-daemon / wait-daemon-ready
   failures through rollbackAndRestart instead of rollbackAndWrap.
   rollbackAndWrap restored the .previous binaries but did not
   restart the failed unit again, so a failed update left the
   helper dead with the rolled-back binary on disk.

Testing infrastructure (production binaries unaffected):

- Hidden --manifest-url and --pubkey-file flags on `banger update`
  let the smoke harness redirect the updater at locally-built
  release artefacts. Marked Hidden in cobra; not advertised in
  --help.
- FetchManifestFrom / VerifyBlobSignatureWithKey /
  FetchAndVerifySignatureWithKey export the existing logic against
  caller-supplied URL / pubkey. The default entry points still
  call them with the embedded canonical values.

Smoke scenarios:

- update_check: --check against fake manifest reports update
  available
- update_to_unknown: --to v9.9.9 fails before any host mutation
- update_no_root: refuses without sudo, install untouched
- update_dry_run: stages + verifies, no swap, version unchanged
- update_keeps_vm_alive: real swap to v0.smoke.0; same VM (same
  boot_id) answers SSH after the daemon restart
- update_rollback_keeps_vm_alive: v0.smoke.broken-bangerd ships a
  bangerd that passes --check-migrations but exits 1 as the
  daemon. The post-swap `systemctl restart bangerd` fails,
  rollbackAndRestart fires, the .previous binaries are restored
  and re-restarted; the same VM still answers SSH afterwards
- daemon_admin (separate prep): covers `banger daemon socket`,
  `bangerd --check-migrations --system`, `sudo banger daemon
  stop`

The smoke release builder generates a fresh ECDSA P-256 keypair
with openssl, signs SHA256SUMS cosign-compatibly, and serves
artefacts from a backgrounded python http.server.
verify_smoke_check_test.go pins the openssl/cosign signature
equivalence so the smoke release builder can't silently drift.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Thales Maciel 2026-05-01 12:08:08 -03:00
parent 7e528f30b3
commit 2606bfbabb
GPG key ID: 33112E6833C34679
8 changed files with 609 additions and 50 deletions

@@ -364,18 +364,34 @@ func renderRootHelperSystemdUnit() string {
 "ExecStart=" + systemBangerdBin + " --root-helper",
 "Restart=on-failure",
 "RestartSec=1s",
-// KillMode=process is load-bearing: the helper unit's cgroup is
-// where every banger-launched firecracker process lives (see
-// validateFirecrackerPID). Without this, `systemctl restart
-// bangerd-root.service` — which `banger update` runs — would
-// SIGKILL every in-flight VM along with the helper because
-// systemd's default KillMode=control-group nukes the whole cgroup.
-// With process mode, only the helper PID is signaled; firecracker
-// children survive, the new helper instance re-attaches via the
-// helper RPC, daemon reconcile re-seeds in-memory state, VM keeps
-// running. `banger system uninstall` and the daemon's vm-stop
-// path explicitly stop firecracker processes when actually needed.
+// KillMode=process + SendSIGKILL=no together make the helper
+// safe to restart while banger-launched firecrackers are
+// running. firecracker lives in this unit's cgroup (jailer
+// doesn't open a sub-cgroup), so:
+//
+// - Default control-group mode SIGKILLs every process in
+// the cgroup on stop.
+// - KillMode=process limits the initial SIGTERM to the
+// helper main PID; systemd leaves remaining cgroup
+// processes alone (and logs "Unit process N (firecracker)
+// remains running after unit stopped").
+// - SendSIGKILL=no disables the FinalKillSignal escalation
+// that would otherwise SIGKILL leftovers after the timeout.
+//
+// One more pitfall: the firecracker SDK installs a default
+// signal-forwarding goroutine in the helper that catches
+// SIGTERM (etc.) and forwards it to every firecracker child.
+// We disable that explicitly via ForwardSignals: []os.Signal{}
+// in firecracker.buildConfig — without that override, systemd
+// signaling the helper main would propagate to every running
+// VM regardless of what these directives do.
+//
+// `banger system uninstall` and the daemon's vm-stop path
+// explicitly stop firecracker processes when actually needed,
+// so we don't lose the systemd-driven kill as a real safety
+// net — banger drives those kills itself.
 "KillMode=process",
+"SendSIGKILL=no",
 "Environment=PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
 "Environment=TMPDIR=" + installmeta.DefaultRootHelperRuntimeDir,
 "UMask=0077",