Survive banger update with running VMs

Two coupled fixes that together make the daemon-restart path of
`banger update` non-destructive for running guests:

1. Unit templates set `KillMode=process` on bangerd.service and
   bangerd-root.service. The default control-group behaviour sent
   SIGKILL to every process in the cgroup on stop/restart — including
   jailer-spawned firecracker children, since fork/exec doesn't
   escape a systemd cgroup. With process mode only the unit's main
   PID is signalled; FC children stay alive in the (unowned)
   cgroup until the new helper instance starts up and re-claims them.

2. `fcproc.FindPID` falls back to the jailer-written pidfile at
   `<chroot>/firecracker.pid` (sibling of the api-sock target) when
   `pgrep -n -f <api-sock>` doesn't find a match. pgrep can't see
   jailer'd FCs because their cmdline only carries the chroot-relative
   `--api-sock /firecracker.socket`, not the host-side path. The
   pidfile is jailer's actual record of the post-exec FC PID, so
   reconcile can verify the surviving process is the right one
   (comm == "firecracker") and re-seed handles.json without tearing
   down the VM's dm-snapshot.

Verified live on the dev host: started a VM, restarted the helper
unit, restarted the daemon unit, and confirmed the FC PID was
unchanged, vm list still showed the guest as running, and
`banger vm ssh` returned the same boot_id pre and post restart.
The systemd journal now reports "firecracker remains running after
unit stopped" and "Found left-over process X (firecracker) in
control group while starting unit. Ignoring." — exactly the shape
`KillMode=process` is supposed to produce.

Tests cover both the parser (parseVersionOutput from the v0.1.2
fix) and the new pidfile lookup: happy path, missing pidfile,
stale pid, wrong comm, garbage content, non-symlink api-sock,
whitespace tolerance.

CHANGELOG corrects v0.1.0's misleading "daemon restarts do not
interrupt running guests" line and documents the unit-refresh
caveat: existing v0.1.0–v0.1.3 installs need a one-time
`sudo banger system install` after updating to v0.1.4 to pick up
the new KillMode directive (`banger update` swaps binaries, not
unit files).
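
As a rough aid for that caveat, a host-side check for whether an
installed unit already carries the directive might look like the Go
sketch below. The helper name is hypothetical and nothing like it
ships in this commit. It only inspects the text it is handed, so it
ignores drop-in overrides and line continuations.

```go
package main

import (
	"bufio"
	"strings"
)

// unitSetsKillModeProcess reports whether the [Service] section of the given
// systemd unit text sets KillMode=process. Hypothetical helper, shown only
// to illustrate the unit-refresh caveat.
func unitSetsKillModeProcess(unitText string) bool {
	section := ""
	found := false
	sc := bufio.NewScanner(strings.NewReader(unitText))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		// Skip blank lines and comments.
		if line == "" || strings.HasPrefix(line, "#") || strings.HasPrefix(line, ";") {
			continue
		}
		// Track which section the following assignments belong to.
		if strings.HasPrefix(line, "[") && strings.HasSuffix(line, "]") {
			section = line
			continue
		}
		if section != "[Service]" {
			continue
		}
		key, value, ok := strings.Cut(line, "=")
		if ok && strings.TrimSpace(key) == "KillMode" {
			// A later assignment overrides an earlier one.
			found = strings.TrimSpace(value) == "process"
		}
	}
	return found
}
```

If this returns false for a unit installed by v0.1.0–v0.1.3, the
host still needs the one-time `sudo banger system install`.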

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Thales Maciel 2026-04-29 17:09:15 -03:00
parent 9c2e6a4647
commit cec7291184
GPG key ID: 33112E6833C34679
5 changed files with 310 additions and 3 deletions


@@ -10,6 +10,45 @@ changed between versions.
## [Unreleased]
## [v0.1.4] - 2026-04-29
### Fixed
- Daemon restarts no longer kill running VMs. Two changes together:
  - The `bangerd-root.service` and `bangerd.service` unit templates
    now set `KillMode=process`. The default (`control-group`) sent
    SIGKILL to every process in the unit's cgroup on stop/restart,
    including the jailer-spawned firecracker children, since fork/exec
    doesn't escape a systemd cgroup. With `KillMode=process` only
    the unit's main PID is signalled; firecracker children survive.
  - `fcproc.FindPID` now also looks up jailer'd firecracker
    processes via the pidfile jailer writes at
    `<chroot>/firecracker.pid` (sibling of the api-sock target).
    Previously the only lookup path was `pgrep -n -f <api-sock>`,
    which can't see jailer'd processes because their cmdline only
    carries the chroot-relative `--api-sock /firecracker.socket`.
    Reconcile after a daemon restart now correctly re-attaches to
    surviving guests instead of mistaking them for stale and tearing
    down their dm-snapshot.
### Notes
- v0.1.0's CHANGELOG line "daemon restarts do not interrupt running
guests" was wrong: it was true at the systemd cgroup layer in
theory but the default `KillMode` defeated it, and even with
`KillMode=process` the daemon's reconcile would mistake
surviving FCs for stale and tear them down. v0.1.4 is the version
where this actually works end-to-end.
- Updating from v0.1.0–v0.1.3 to v0.1.4 still kills running VMs
because the *driver* of the update is the buggy older binary.
Updates from v0.1.4 onward preserve running VMs across the
helper+daemon restart that `banger update` performs.
- Existing v0.1.0–v0.1.3 installs that update to v0.1.4 do NOT
automatically pick up the new unit files — `banger update` swaps
binaries, not systemd units. Run `sudo banger system install` once
on those hosts after updating to refresh the units. New v0.1.4+
installs get the correct units from the start.
## [v0.1.3] - 2026-04-29
No functional changes. Verification release: v0.1.2 fixed
@@ -145,7 +184,8 @@ root filesystem and network, and exits on demand.
the swap rather than starting up against an incompatible store.
- Linux only. amd64 only. KVM required.
[Unreleased]: https://git.thaloco.com/thaloco/banger/compare/v0.1.3...HEAD
[Unreleased]: https://git.thaloco.com/thaloco/banger/compare/v0.1.4...HEAD
[v0.1.4]: https://git.thaloco.com/thaloco/banger/releases/tag/v0.1.4
[v0.1.3]: https://git.thaloco.com/thaloco/banger/releases/tag/v0.1.3
[v0.1.2]: https://git.thaloco.com/thaloco/banger/releases/tag/v0.1.2
[v0.1.1]: https://git.thaloco.com/thaloco/banger/releases/tag/v0.1.1