daemon: persist tap device on VM.Runtime so NAT teardown survives handle-cache loss

Cleanup identity for kernel objects was split across two sources of
truth: vm.Runtime (DB-backed, durable) held paths and the guest IP,
but the tap name lived only in the in-process handle cache plus the
best-effort handles.json scratch file next to the VM dir. Every
other cleanup-identifying datum has a fallback: the firecracker PID
can be rediscovered via `pgrep -f <apiSock>`, loop devices via
losetup, the dm name from the deterministic ShortID(vm.ID). The tap
is the one truly cache-only datum: allocated from a pool, not
derivable.
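
For instance, the PID fallback is recoverable from disk state alone.
A sketch (the function name is illustrative; uses os/exec, fmt,
strconv, strings):

    // Sketch: the firecracker PID can be rebuilt from the API socket
    // path alone, so losing the cache is harmless for this datum.
    func rediscoverFirecrackerPID(apiSock string) (int, error) {
        out, err := exec.Command("pgrep", "-f", apiSock).Output()
        if err != nil {
            return 0, fmt.Errorf("pgrep -f %s: %w", apiSock, err)
        }
        pids := strings.Fields(string(out))
        if len(pids) == 0 {
            return 0, fmt.Errorf("no firecracker process for %s", apiSock)
        }
        return strconv.Atoi(pids[0])
    }

The tap has no equivalent: once the pool allocation is forgotten,
the name cannot be rederived from anything on disk.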

That made NAT teardown fragile:

  - daemon crash between `acquireTap` and the handles.json write
  - handles.json corrupt on the next daemon start
  - partial cleanup that already zeroed the cache

In any of those cases, natCapability.Cleanup short-circuited
("skipping nat cleanup without runtime network handles") and the
per-VM POSTROUTING MASQUERADE rule and the two FORWARD rules keyed
on the tap would leak. The VM row in the DB still existed, so a
retry couldn't close the loop: the tap name was simply gone.
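
The old guard looked roughly like this (a sketch of the
short-circuit; getVMHandles stands in for whatever reads the
in-process cache):

    h := s.getVMHandles(vm)
    if h == nil || h.TapDevice == "" {
        s.log.Warn("skipping nat cleanup without runtime network handles")
        return nil // the MASQUERADE + FORWARD rules stay installed
    }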

Fix: mirror TapDevice onto model.VMRuntime (serialised via the
existing runtime_json column, omitempty so existing rows upgrade
cleanly). Set it in startVMLocked right next to the s.setVMHandles
call that seeds the in-memory cache; clear it at every post-cleanup
reset site: the stop and kill paths (normal and stale branches of
each), cleanupOnErr in start, reconcile's stale-vm branch, and the
stats poller's auto-stop path.
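
Roughly, with unrelated fields elided (the json tag name is assumed;
only TapDevice is new):

    // model.VMRuntime, persisted through the existing runtime_json column.
    type VMRuntime struct {
        // ... paths, guest IP, and the other durable fields ...
        TapDevice string `json:"tap_device,omitempty"` // omitempty: old rows stay valid
    }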

Fallbacks now cascade:

  - natCapability.Cleanup: handles cache → Runtime.TapDevice
  - cleanupRuntime (releaseTap): handles cache → Runtime.TapDevice

Both surfaces refuse gracefully (old behaviour) only when neither
source has a value, which really does mean "no tap was ever
allocated for this VM" rather than "we lost track of it."
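
The shared resolution order, as a sketch (accessor names are
illustrative, not the daemon's literal code):

    // Cache first, then the durable runtime; refuse only if both are empty.
    tap := ""
    if h := s.getVMHandles(vm); h != nil {
        tap = h.TapDevice // fast path: in-process handle cache
    }
    if tap == "" {
        tap = vm.Runtime.TapDevice // fallback: DB-backed runtime_json
    }
    if tap == "" {
        return nil // no tap was ever allocated; nothing to tear down
    }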

Test: TestNATCapabilityCleanup_FallsBackToRuntimeTapDevice clears
the handle cache, sets vm.Runtime.TapDevice, and asserts Cleanup
reaches the runner. That is the exact scenario the review flagged
as a plausible leak, and the exact code path that now guarantees
the leak can't happen.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@@ -174,3 +174,26 @@ func TestNATCapabilityCleanup_ReversesNATWhenRuntimePresent(t *testing.T) {
 		t.Fatal("runner calls = 0, want ensureNAT(false) to execute when runtime wiring exists")
 	}
 }
+
+// TestNATCapabilityCleanup_FallsBackToRuntimeTapDevice simulates the
+// post-crash / corrupt-handles.json scenario: the in-memory handle
+// cache is empty, but the DB-backed VM.Runtime still carries the
+// tap name (startVMLocked persists it alongside the handle cache).
+// Cleanup must use that fallback so the iptables FORWARD rules
+// keyed on the tap are actually removed — if Cleanup short-circuits
+// the way it did before this fix, those rules leak forever.
+func TestNATCapabilityCleanup_FallsBackToRuntimeTapDevice(t *testing.T) {
+	f := newNATCapabilityFixture(t, true)
+	// Wipe the handle cache, as if the daemon had just restarted
+	// against a corrupt (or missing) handles.json.
+	f.d.vm.clearVMHandles(f.vm)
+	// But the VM row in the DB still has the tap recorded.
+	f.vm.Runtime.TapDevice = "tap-nat-42"
+
+	if err := f.cap.Cleanup(context.Background(), f.vm); err != nil {
+		t.Fatalf("Cleanup: %v", err)
+	}
+	if n := f.runner.total(); n == 0 {
+		t.Fatal("runner calls = 0, want ensureNAT(false) to execute via the Runtime.TapDevice fallback; NAT rules would leak across daemon restarts")
+	}
+}