Survive banger update with running VMs

Two coupled fixes that together make the daemon-restart path of `banger update` non-destructive for running guests:

1. Unit templates set `KillMode=process` on bangerd.service and bangerd-root.service. The default control-group behaviour sent SIGKILL to every process in the cgroup on stop/restart, including jailer-spawned firecracker children, since fork/exec doesn't escape a systemd cgroup. With process mode only the unit's main PID is signalled; FC children stay alive in the (unowned) cgroup until the new helper instance starts up and re-claims them.

2. `fcproc.FindPID` falls back to the jailer-written pidfile at `<chroot>/firecracker.pid` (sibling of the api-sock target) when `pgrep -n -f <api-sock>` doesn't find a match. pgrep can't see jailer'd FCs because their cmdline only carries the chroot-relative `--api-sock /firecracker.socket`, not the host-side path. The pidfile is jailer's actual record of the post-exec FC PID, so reconcile can verify the surviving process is the right one (comm == "firecracker") and re-seed handles.json without tearing down the VM's dm-snapshot.

Verified live on the dev host: started a VM, restarted the helper unit, restarted the daemon unit, and confirmed the FC PID was unchanged, vm list still showed the guest as running, and `banger vm ssh` returned the same boot_id pre and post restart. The systemd journal now reports "firecracker remains running after unit stopped" and "Found left-over process X (firecracker) in control group while starting unit. Ignoring.", which is exactly the shape `KillMode=process` is supposed to produce.

Tests cover both the parser (parseVersionOutput from the v0.1.2 fix) and the new pidfile lookup: happy path, missing pidfile, stale pid, wrong comm, garbage content, non-symlink api-sock, whitespace tolerance.

CHANGELOG corrects v0.1.0's misleading "daemon restarts do not interrupt running guests" line and documents the unit-refresh caveat: existing v0.1.0–v0.1.3 installs need a one-time `sudo banger system install` after updating to v0.1.4 to pick up the new KillMode directive (`banger update` swaps binaries, not unit files).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
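The pidfile fallback described in the commit message can be sketched roughly as follows. This is not the actual `fcproc.FindPID` code: the names `parsePID` and `pidFromJailerPidfile` and the `procRoot` parameter are hypothetical, and the sketch assumes the host-side api-sock is a symlink whose target lives inside the jailer chroot next to `firecracker.pid`.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parsePID parses pidfile content, tolerating surrounding whitespace.
func parsePID(content string) (int, error) {
	pid, err := strconv.Atoi(strings.TrimSpace(content))
	if err != nil {
		return 0, fmt.Errorf("parse pidfile: %w", err)
	}
	if pid <= 0 {
		return 0, errors.New("pidfile holds a non-positive pid")
	}
	return pid, nil
}

// pidFromJailerPidfile is a hypothetical stand-in for the fallback path.
// apiSock is the host-side api-sock symlink; procRoot is normally "/proc"
// and is a parameter only so the sketch can run against a fake tree.
func pidFromJailerPidfile(apiSock, procRoot string) (int, error) {
	// A non-symlink api-sock gives no way to locate the chroot.
	info, err := os.Lstat(apiSock)
	if err != nil {
		return 0, err
	}
	if info.Mode()&os.ModeSymlink == 0 {
		return 0, errors.New("api-sock is not a symlink")
	}
	target, err := os.Readlink(apiSock)
	if err != nil {
		return 0, err
	}
	if !filepath.IsAbs(target) {
		target = filepath.Join(filepath.Dir(apiSock), target)
	}
	// The pidfile is a sibling of the symlink target.
	raw, err := os.ReadFile(filepath.Join(filepath.Dir(target), "firecracker.pid"))
	if err != nil {
		return 0, err
	}
	pid, err := parsePID(string(raw))
	if err != nil {
		return 0, err
	}
	// Reject stale pids (no proc entry) and recycled pids (wrong comm).
	comm, err := os.ReadFile(filepath.Join(procRoot, strconv.Itoa(pid), "comm"))
	if err != nil {
		return 0, fmt.Errorf("pid %d is not running: %w", pid, err)
	}
	if name := strings.TrimSpace(string(comm)); name != "firecracker" {
		return 0, fmt.Errorf("pid %d has comm %q, not firecracker", pid, name)
	}
	return pid, nil
}

func main() {
	// Build a fake chroot and a fake /proc in a temp dir to exercise the lookup.
	dir, err := os.MkdirTemp("", "fcproc-sketch")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	chroot := filepath.Join(dir, "chroot")
	procDir := filepath.Join(dir, "proc", "4242")
	os.MkdirAll(chroot, 0o755)
	os.MkdirAll(procDir, 0o755)
	os.WriteFile(filepath.Join(chroot, "firecracker.pid"), []byte(" 4242\n"), 0o644)
	os.WriteFile(filepath.Join(procDir, "comm"), []byte("firecracker\n"), 0o644)
	apiSock := filepath.Join(dir, "api.sock")
	os.Symlink(filepath.Join(chroot, "firecracker.socket"), apiSock)

	pid, err := pidFromJailerPidfile(apiSock, filepath.Join(dir, "proc"))
	fmt.Println(pid, err)
}
```

Checking `comm` before trusting the pidfile is what distinguishes a surviving guest from a recycled PID; the cases listed in the commit message (missing pidfile, stale pid, wrong comm, garbage content, non-symlink api-sock) each map to one of the error returns above.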
parent 9c2e6a4647
commit cec7291184
5 changed files with 310 additions and 3 deletions
42
CHANGELOG.md
@@ -10,6 +10,45 @@ changed between versions.
 
 ## [Unreleased]
 
+## [v0.1.4] - 2026-04-29
+
+### Fixed
+
+- Daemon restarts no longer kill running VMs. Two changes together:
+  - The `bangerd-root.service` and `bangerd.service` unit templates
+    now set `KillMode=process`. The default (`control-group`) sent
+    SIGKILL to every process in the unit's cgroup on stop/restart,
+    including the jailer-spawned firecracker children — fork/exec
+    doesn't escape a systemd cgroup. With `KillMode=process` only
+    the unit's main PID is signalled; firecracker children survive.
+  - `fcproc.FindPID` now also looks up jailer'd firecracker
+    processes via the pidfile jailer writes at
+    `<chroot>/firecracker.pid` (sibling of the api-sock target).
+    Previously the only lookup path was `pgrep -n -f <api-sock>`,
+    which can't see jailer'd processes because their cmdline only
+    carries the chroot-relative `--api-sock /firecracker.socket`.
+    Reconcile after a daemon restart now correctly re-attaches to
+    surviving guests instead of mistaking them for stale and tearing
+    down their dm-snapshot.
+
+### Notes
+
+- v0.1.0's CHANGELOG line "daemon restarts do not interrupt running
+  guests" was wrong: it was true at the systemd cgroup layer in
+  theory but the default `KillMode` defeated it, and even with
+  `KillMode=process` the daemon's reconcile would mistake
+  surviving FCs for stale and tear them down. v0.1.4 is the version
+  where this actually works end-to-end.
+- Updating from v0.1.0–v0.1.3 to v0.1.4 still kills running VMs
+  because the *driver* of the update is the buggy older binary.
+  Updates from v0.1.4 onward preserve running VMs across the
+  helper+daemon restart that `banger update` performs.
+- Existing v0.1.0–v0.1.3 installs that update to v0.1.4 do NOT
+  automatically pick up the new unit files — `banger update` swaps
+  binaries, not systemd units. Run `sudo banger system install` once
+  on those hosts after updating to refresh the units. New v0.1.4+
+  installs get the correct units from the start.
+
 ## [v0.1.3] - 2026-04-29
 
 No functional changes. Verification release: v0.1.2 fixed
@@ -145,7 +184,8 @@ root filesystem and network, and exits on demand.
   the swap rather than starting up against an incompatible store.
 - Linux only. amd64 only. KVM required.
 
-[Unreleased]: https://git.thaloco.com/thaloco/banger/compare/v0.1.3...HEAD
+[Unreleased]: https://git.thaloco.com/thaloco/banger/compare/v0.1.4...HEAD
+[v0.1.4]: https://git.thaloco.com/thaloco/banger/releases/tag/v0.1.4
 [v0.1.3]: https://git.thaloco.com/thaloco/banger/releases/tag/v0.1.3
 [v0.1.2]: https://git.thaloco.com/thaloco/banger/releases/tag/v0.1.2
 [v0.1.1]: https://git.thaloco.com/thaloco/banger/releases/tag/v0.1.1
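Because `banger update` swaps binaries but not unit files, a host updated from v0.1.0–v0.1.3 may still be running units without the new directive. A minimal sketch of how one might check a unit file's text for it is shown below. The `killMode` helper and the sample unit text (including the `ExecStart` path) are hypothetical, and the sketch deliberately ignores drop-in overrides and line continuations.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// killMode returns the KillMode= value found in the [Service] section of a
// unit file's text, or "control-group" (the systemd default) when it is unset.
func killMode(unit string) string {
	mode := "control-group"
	section := ""
	sc := bufio.NewScanner(strings.NewReader(unit))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		// Skip blank lines and comments.
		if line == "" || strings.HasPrefix(line, "#") || strings.HasPrefix(line, ";") {
			continue
		}
		// Track which section we are in.
		if strings.HasPrefix(line, "[") && strings.HasSuffix(line, "]") {
			section = line
			continue
		}
		key, val, ok := strings.Cut(line, "=")
		if ok && section == "[Service]" && strings.TrimSpace(key) == "KillMode" {
			mode = strings.TrimSpace(val) // a later assignment overrides an earlier one
		}
	}
	return mode
}

func main() {
	// Hypothetical unit text; the real templates ship with banger.
	unit := `[Unit]
Description=banger daemon (example)

[Service]
ExecStart=/usr/local/bin/bangerd
KillMode=process
`
	fmt.Println(killMode(unit))
}
```

If a check like this reports `control-group` for `bangerd.service` or `bangerd-root.service` on an updated host, the units predate v0.1.4 and the one-time `sudo banger system install` from the Notes section applies.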