Cleanup identity for kernel objects was split across two sources of
truth: vm.Runtime (DB-backed, durable) held paths and the guest IP,
but the TAP name lived only in the in-process handle cache + the
best-effort handles.json scratch file next to the VM dir. Every
other cleanup-identifying datum has a fallback — firecracker PID
can be rediscovered via `pgrep -f <apiSock>`, loops via losetup, dm
name from the deterministic ShortID(vm.ID). The tap is the one
truly cache-only datum (allocated from a pool, not derivable).
That made NAT teardown fragile:
- daemon crash between `acquireTap` and the handles.json write
- handles.json corrupt on the next daemon start
- partial cleanup that already zeroed the cache
In any of those cases, natCapability.Cleanup short-circuited
("skipping nat cleanup without runtime network handles") and the
per-VM POSTROUTING MASQUERADE rule + the two FORWARD rules keyed
off the tap would leak. The VM row in the DB still existed, so a
retry couldn't close the loop: the tap name was simply gone.
Fix: mirror TapDevice onto model.VMRuntime (serialised via the
existing runtime_json column, tagged omitempty so existing rows
upgrade cleanly). Set it in startVMLocked right next to the
s.setVMHandles call that seeds the in-memory cache; clear it at
every post-cleanup reset site (stop normal path + stop stale
branch, kill normal path + kill stale branch, cleanupOnErr in
start, reconcile's stale-vm branch, the stats poller's auto-stop
path).
Fallbacks now cascade:
- natCapability.Cleanup: handles cache → Runtime.TapDevice
- cleanupRuntime (releaseTap): handles cache → Runtime.TapDevice
Both surfaces refuse gracefully (old behaviour) only when neither
source has a value, which really does mean "no tap was ever
allocated for this VM" rather than "we lost track of it."
Test: TestNATCapabilityCleanup_FallsBackToRuntimeTapDevice clears
the handle cache, sets vm.Runtime.TapDevice, and asserts Cleanup
reaches the runner: the exact scenario the review flagged as a
plausible leak, and the exact code path that now guarantees it
doesn't happen.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three thematic test files pinning behavior surfaces that had none
before, following the review's recommendation to plug concrete
error/cleanup branches rather than chase a coverage percentage.
doctor_test.go
Covers Daemon.doctorReport end-to-end with a permissive runner +
fake executables on PATH. Pins: store error surfaces as fail,
store success as pass, missing firecracker kills the host-runtime
check, the three default capability feature checks (work disk,
vm dns, nat) are emitted, vm-defaults is always-pass with
provenance. Previously 0%; now the Doctor() command's contract
with the CLI is under guard.
workspace_rejection_test.go
Covers the four early-exit branches of PrepareVMWorkspace that
the existing happy-path + lock-release tests never hit: malformed
mode, --from without --branch, VM not running, VM not found.
Each one returns before any SSH I/O, so the fake-firecracker
infra the happy-path test needs is unnecessary: a bare wired
daemon with a stored VMRecord suffices.
nat_capability_test.go
Covers natCapability.ApplyConfigChange (unchanged flag → no-op,
VM not alive → no-op, toggle on live VM → runner reached) and
natCapability.Cleanup (NAT disabled → no-op, runtime handles
missing → defensive no-op, full wiring → ensureNAT(false)). A
countingRunner + startFakeFirecracker fixture stands in for the
real host plumbing, with waitForVMAlive polling past the
exec -a race window that startFakeFirecracker exposes on
loaded CI boxes.
make coverage-total 37.8% → 38.6%. The number isn't the point:
these tests exist so the next refactor in this area has to break
an explicit assertion before it can drift.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>