banger/internal/daemon
Thales Maciel 5eceebe49f
daemon: persist tap device on VM.Runtime so NAT teardown survives handle-cache loss
Cleanup identity for kernel objects was split across two sources of
truth: vm.Runtime (DB-backed, durable) held paths and the guest IP,
but the TAP name lived only in the in-process handle cache + the
best-effort handles.json scratch file next to the VM dir. Every
other cleanup-identifying datum has a fallback: the firecracker PID
can be rediscovered via `pgrep -f <apiSock>`, loop devices via
losetup, and the dm name from the deterministic ShortID(vm.ID). The
tap is the one truly cache-only datum (allocated from a pool, not
derivable).

That made NAT teardown fragile:

  - daemon crash between `acquireTap` and the handles.json write
  - handles.json corrupt on the next daemon start
  - partial cleanup that already zeroed the cache

In any of those cases natCapability.Cleanup short-circuited
("skipping nat cleanup without runtime network handles") and the
per-VM POSTROUTING MASQUERADE + the two FORWARD rules keyed off
the tap would leak. The VM row in the DB still existed, so a retry
couldn't close the loop — the tap name was simply gone.

Fix: mirror TapDevice onto model.VMRuntime (serialised via the
existing runtime_json column, omitempty so existing rows upgrade
cleanly). Set it in startVMLocked right next to the
s.setVMHandles call that seeds the in-memory cache; clear it at
every post-cleanup reset site (stop normal path + stop stale
branch, kill normal path + kill stale branch, cleanupOnErr in
start, reconcile's stale-vm branch, the stats poller's auto-stop
path).
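
A minimal sketch of why omitempty lets existing rows upgrade cleanly,
assuming a trimmed-down stand-in for model.VMRuntime. The JSON key
names, the GuestIP field, and the helper names here are hypothetical;
only the TapDevice field and the runtime_json column come from this
change:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// VMRuntime is a hypothetical, reduced stand-in for model.VMRuntime.
// The JSON tags are assumptions made for illustration.
type VMRuntime struct {
	GuestIP   string `json:"guest_ip,omitempty"`
	TapDevice string `json:"tap_device,omitempty"`
}

// decodeRuntime parses a runtime_json value into a VMRuntime.
func decodeRuntime(raw string) (VMRuntime, error) {
	var rt VMRuntime
	err := json.Unmarshal([]byte(raw), &rt)
	return rt, err
}

// encodeRuntime serialises a VMRuntime back into its JSON form.
func encodeRuntime(rt VMRuntime) (string, error) {
	b, err := json.Marshal(rt)
	return string(b), err
}

func main() {
	// A row written before this change has no tap key at all.
	old, err := decodeRuntime(`{"guest_ip":"172.16.0.2"}`)
	if err != nil {
		panic(err)
	}
	fmt.Printf("tap from old row: %q\n", old.TapDevice)

	// Re-encoding an untouched old row does not add an empty key.
	unchanged, _ := encodeRuntime(old)
	fmt.Println(unchanged)

	// Once a tap is recorded, the key appears in the stored JSON.
	old.TapDevice = "tap-example0"
	updated, _ := encodeRuntime(old)
	fmt.Println(updated)
}
```

An old row decodes with an empty TapDevice and round-trips without
gaining a new key, so no migration step is needed under these
assumptions.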

Fallbacks now cascade:

  - natCapability.Cleanup: handles cache → Runtime.TapDevice
  - cleanupRuntime (releaseTap): handles cache → Runtime.TapDevice

Both surfaces refuse gracefully (old behaviour) only when neither
source has a value, which really does mean "no tap was ever
allocated for this VM" rather than "we lost track of it."

Test: TestNATCapabilityCleanup_FallsBackToRuntimeTapDevice clears
the handle cache, sets vm.Runtime.TapDevice, and asserts Cleanup
reaches the runner. That is the exact scenario the review flagged as
a plausible leak, exercised through the code path that now prevents
it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 14:21:13 -03:00
dmsnap Extract opstate and dmsnap into subpackages 2026-04-15 16:02:43 -03:00
fcproc fcproc: targeted tests for waitForPath + EnsureSocketAccess error paths 2026-04-22 17:49:42 -03:00
imagemgr Remove image build --from-image; doctor treats catalog images as OK 2026-04-18 15:54:29 -03:00
opstate coverage: medium batch — hostnat runner, store guest-sessions, daemon helpers 2026-04-18 18:03:37 -03:00
workspace seams: move the last four package globals onto instance fields 2026-04-22 12:07:14 -03:00
ARCHITECTURE.md docs: resync package docs, AGENTS, and kernel-catalog with current code 2026-04-22 13:01:11 -03:00
autopull_test.go daemon split (6/n): extract wireServices + drop lazy service getters 2026-04-21 15:55:28 -03:00
capabilities.go daemon: persist tap device on VM.Runtime so NAT teardown survives handle-cache loss 2026-04-23 14:21:13 -03:00
capabilities_test.go daemon split (7/n): narrow capability interfaces, wire deps at construction 2026-04-21 15:59:09 -03:00
concurrency_test.go daemon split (6/n): extract wireServices + drop lazy service getters 2026-04-21 15:55:28 -03:00
daemon.go daemon: persist tap device on VM.Runtime so NAT teardown survives handle-cache loss 2026-04-23 14:21:13 -03:00
daemon_test.go daemon split (6/n): extract wireServices + drop lazy service getters 2026-04-21 15:55:28 -03:00
daemon_testing_test.go test: add newTestDaemon harness + options 2026-04-22 17:45:43 -03:00
dns_routing.go seams: move the last four package globals onto instance fields 2026-04-22 12:07:14 -03:00
dns_routing_test.go seams: move the last four package globals onto instance fields 2026-04-22 12:07:14 -03:00
doc.go docs: resync package docs, AGENTS, and kernel-catalog with current code 2026-04-22 13:01:11 -03:00
doctor.go make smoke: end-to-end boot suite with coverage from real VM runs 2026-04-22 18:59:57 -03:00
doctor_test.go cleanup: drop pre-v0.1 migration scaffolding + legacy-behavior refs 2026-04-23 13:56:32 -03:00
fake_firecracker_test.go remove vm session feature 2026-04-20 12:47:58 -03:00
fastpath_test.go daemon split (6/n): extract wireServices + drop lazy service getters 2026-04-21 15:55:28 -03:00
guest_ssh.go remove vm session feature 2026-04-20 12:47:58 -03:00
host_network.go seams: move the last four package globals onto instance fields 2026-04-22 12:07:14 -03:00
image_seed.go daemon split (2/5): extract *ImageService service 2026-04-20 20:30:32 -03:00
image_service.go daemon split (6/n): extract wireServices + drop lazy service getters 2026-04-21 15:55:28 -03:00
images.go daemon split (2/5): extract *ImageService service 2026-04-20 20:30:32 -03:00
images_helpers_test.go coverage: medium batch — hostnat runner, store guest-sessions, daemon helpers 2026-04-18 18:03:37 -03:00
images_pull.go daemon split (2/5): extract *ImageService service 2026-04-20 20:30:32 -03:00
images_pull_bundle_test.go daemon split (6/n): extract wireServices + drop lazy service getters 2026-04-21 15:55:28 -03:00
images_pull_test.go daemon split (6/n): extract wireServices + drop lazy service getters 2026-04-21 15:55:28 -03:00
kernels.go daemon split (2/5): extract *ImageService service 2026-04-20 20:30:32 -03:00
kernels_test.go daemon split (6/n): extract wireServices + drop lazy service getters 2026-04-21 15:55:28 -03:00
lifecycle_flow_test.go test: end-to-end VMService lifecycle flow harness 2026-04-22 17:55:04 -03:00
logger.go vm state: split transient kernel/process handles off the durable schema 2026-04-19 14:18:13 -03:00
logger_test.go seams: move the last four package globals onto instance fields 2026-04-22 12:07:14 -03:00
nat.go daemon split (1/5): extract *HostNetwork service 2026-04-20 20:11:46 -03:00
nat_capability_test.go daemon: persist tap device on VM.Runtime so NAT teardown survives handle-cache loss 2026-04-23 14:21:13 -03:00
nat_test.go vm state: split transient kernel/process handles off the durable schema 2026-04-19 14:18:13 -03:00
open_close_test.go daemon split (6/n): extract wireServices + drop lazy service getters 2026-04-21 15:55:28 -03:00
ports.go daemon split (4/5): extract *VMService service 2026-04-20 20:57:05 -03:00
preflight.go seams: move the last four package globals onto instance fields 2026-04-22 12:07:14 -03:00
runtime_assets.go daemon split (4/5): extract *VMService service 2026-04-20 20:57:05 -03:00
snapshot.go daemon split (1/5): extract *HostNetwork service 2026-04-20 20:11:46 -03:00
snapshot_test.go daemon split (6/n): extract wireServices + drop lazy service getters 2026-04-21 15:55:28 -03:00
ssh_client_config.go cleanup: drop pre-v0.1 migration scaffolding + legacy-behavior refs 2026-04-23 13:56:32 -03:00
ssh_client_config_test.go cleanup: drop pre-v0.1 migration scaffolding + legacy-behavior refs 2026-04-23 13:56:32 -03:00
sshd_config_test.go guest sshd: drop DEBUG3 + StrictModes no; normalise /root perms 2026-04-19 13:40:40 -03:00
tap_pool.go daemon split (1/5): extract *HostNetwork service 2026-04-20 20:11:46 -03:00
vm.go daemon: persist tap device on VM.Runtime so NAT teardown survives handle-cache loss 2026-04-23 14:21:13 -03:00
vm_authsync.go file_sync: skip nested symlinks during recursive copy 2026-04-23 14:11:58 -03:00
vm_create.go model: validate VM names as DNS labels at CLI + daemon 2026-04-23 14:06:40 -03:00
vm_create_ops.go daemon split (4/5): extract *VMService service 2026-04-20 20:57:05 -03:00
vm_create_test.go model: validate VM names as DNS labels at CLI + daemon 2026-04-23 14:06:40 -03:00
vm_disk.go cleanup: drop pre-v0.1 migration scaffolding + legacy-behavior refs 2026-04-23 13:56:32 -03:00
vm_handles.go cleanup: drop pre-v0.1 migration scaffolding + legacy-behavior refs 2026-04-23 13:56:32 -03:00
vm_handles_test.go daemon split (6/n): extract wireServices + drop lazy service getters 2026-04-21 15:55:28 -03:00
vm_lifecycle.go daemon: persist tap device on VM.Runtime so NAT teardown survives handle-cache loss 2026-04-23 14:21:13 -03:00
vm_locks.go Move subsystem state/locks off Daemon into owning types 2026-04-15 15:58:33 -03:00
vm_service.go vmservice: delete dead guestWaitForSSH + guestDial seams 2026-04-22 12:45:27 -03:00
vm_set.go daemon split (4/5): extract *VMService service 2026-04-20 20:57:05 -03:00
vm_stats.go daemon: persist tap device on VM.Runtime so NAT teardown survives handle-cache loss 2026-04-23 14:21:13 -03:00
vm_test.go file_sync: skip nested symlinks during recursive copy 2026-04-23 14:11:58 -03:00
workspace.go workspace: drop --readonly flag — advisory only against root guests 2026-04-23 13:04:33 -03:00
workspace_rejection_test.go tests: targeted coverage for doctor, workspace rejections, and nat capability 2026-04-22 12:58:12 -03:00
workspace_service.go seams: move the last four package globals onto instance fields 2026-04-22 12:07:14 -03:00
workspace_test.go cleanup: drop pre-v0.1 migration scaffolding + legacy-behavior refs 2026-04-23 13:56:32 -03:00