banger/internal
Thales Maciel 5eceebe49f
daemon: persist tap device on VM.Runtime so NAT teardown survives handle-cache loss
Cleanup identity for kernel objects was split across two sources of
truth: vm.Runtime (DB-backed, durable) held paths and the guest IP,
but the TAP name lived only in the in-process handle cache + the
best-effort handles.json scratch file next to the VM dir. Every
other cleanup-identifying datum has a fallback — firecracker PID
can be rediscovered via `pgrep -f <apiSock>`, loops via losetup, dm
name from the deterministic ShortID(vm.ID). The tap is the one
truly cache-only datum (allocated from a pool, not derivable).

That made NAT teardown fragile:

  - daemon crash between `acquireTap` and the handles.json write
  - handles.json corrupt on the next daemon start
  - partial cleanup that already zeroed the cache

In any of those cases natCapability.Cleanup short-circuited
("skipping nat cleanup without runtime network handles") and the
per-VM POSTROUTING MASQUERADE + the two FORWARD rules keyed off
the tap would leak. The VM row in the DB still existed, so a retry
couldn't close the loop — the tap name was simply gone.

Fix: mirror TapDevice onto model.VMRuntime (serialised via the
existing runtime_json column, omitempty so existing rows upgrade
cleanly). Set it in startVMLocked right next to the
s.setVMHandles call that seeds the in-memory cache; clear it at
every post-cleanup reset site (stop normal path + stop stale
branch, kill normal path + kill stale branch, cleanupOnErr in
start, reconcile's stale-vm branch, the stats poller's auto-stop
path).

Fallbacks now cascade:

  - natCapability.Cleanup: handles cache → Runtime.TapDevice
  - cleanupRuntime (releaseTap): handles cache → Runtime.TapDevice

Both surfaces refuse gracefully (old behaviour) only when neither
source has a value, which really does mean "no tap was ever
allocated for this VM" rather than "we lost track of it."

Test: TestNATCapabilityCleanup_FallsBackToRuntimeTapDevice clears
the handle cache, sets vm.Runtime.TapDevice, and asserts Cleanup
reaches the runner — the exact scenario the review flagged as a
plausible leak and the exact code path that now guarantees it
doesn't.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 14:21:13 -03:00
..
api workspace: drop --readonly flag — advisory only against root guests 2026-04-23 13:04:33 -03:00
buildinfo Stamp shared build metadata into banger binaries 2026-03-22 17:14:06 -03:00
cli model: validate VM names as DNS labels at CLI + daemon 2026-04-23 14:06:40 -03:00
config cleanup: drop pre-v0.1 migration scaffolding + legacy-behavior refs 2026-04-23 13:56:32 -03:00
daemon daemon: persist tap device on VM.Runtime so NAT teardown survives handle-cache loss 2026-04-23 14:21:13 -03:00
firecracker daemon: fix vm start (on a stopped VM) + regression coverage 2026-04-23 12:01:46 -03:00
guest ssh: trust-on-first-use host key pinning everywhere 2026-04-19 16:46:03 -03:00
guestconfig Refactor VM lifecycle around capabilities 2026-03-18 19:28:26 -03:00
guestnet Stop using kernel IP autoconfig for runtime VMs 2026-03-21 21:54:18 -03:00
hostnat coverage: medium batch — hostnat runner, store guest-sessions, daemon helpers 2026-04-18 18:03:37 -03:00
imagecat publish-golden-image: content-addressed tarball names 2026-04-18 15:26:57 -03:00
imagepull imagepull/BuildExt4: omit positional fs-size; rely on file truncation 2026-04-18 14:58:42 -03:00
kernelcat Prune legacy void/alpine + customize.sh flows 2026-04-18 15:39:53 -03:00
model daemon: persist tap device on VM.Runtime so NAT teardown survives handle-cache loss 2026-04-23 14:21:13 -03:00
namegen coverage: make targets + close zero-cov gaps (namegen, sessionstream) 2026-04-18 17:44:37 -03:00
paths runtime sockets: close the local-user race window around control-plane creation 2026-04-20 12:53:47 -03:00
policy Add vsock-backed VM port inspection 2026-03-19 15:52:11 -03:00
rpc Propagate RPC cancellation to daemon requests 2026-03-16 18:28:33 -03:00
store cleanup: drop pre-v0.1 migration scaffolding + legacy-behavior refs 2026-04-23 13:56:32 -03:00
system daemon: fix vm start (on a stopped VM) + regression coverage 2026-04-23 12:01:46 -03:00
toolingplan coverage: easy-wins batch across cli, system, paths, vmdns, toolingplan 2026-04-18 17:57:05 -03:00
vmdns coverage: easy-wins batch across cli, system, paths, vmdns, toolingplan 2026-04-18 17:57:05 -03:00
vsockagent Add vsock-backed VM port inspection 2026-03-19 15:52:11 -03:00