Commit graph

5 commits

Author SHA1 Message Date
362009d747
daemon split (1/5): extract *HostNetwork service
First phase of splitting the daemon god-struct into focused services
with explicit ownership.

HostNetwork now owns everything host-networking: the TAP interface
pool (initializeTapPool / ensureTapPool / acquireTap / releaseTap /
createTap), bridge + socket dir setup, firecracker process primitives
(find/resolve/kill/wait/ensureSocketAccess/sendCtrlAltDel), DM
snapshot lifecycle, NAT rule enforcement, guest DNS server lifecycle
+ routing setup, and the vsock-agent readiness probe. That's 7 files
whose receivers flipped from *Daemon to *HostNetwork, plus a new
host_network.go that declares the struct, its hostNetworkDeps, and
the factored firecracker + DNS helpers that used to live in vm.go.

Daemon gives up the tapPool and vmDNS fields entirely; they're now
HostNetwork's business. Construction goes through newHostNetwork in
Daemon.Open with an explicit dependency bag (runner, logger, config,
layout, closing). A lazy-init hostNet() helper on Daemon supports
test literals that don't wire net explicitly — production always
populates it eagerly.

Signature tightenings where the old receiver reached into VM-service
state:
 - ensureNAT(ctx, vm, enable) → ensureNAT(ctx, guestIP, tap, enable).
   Callers resolve tap from the handle cache themselves.
 - initializeTapPool(ctx) → initializeTapPool(usedTaps []string).
   Daemon.Open enumerates VMs, collects taps from handles, hands the
   slice in.

rebuildDNS stays on *Daemon as the orchestrator — it filters by
vm-alive (a VMService concern handles will move to in phase 4) then
calls HostNetwork.replaceDNS with the already-filtered map.

Capability hooks continue to take *Daemon; they now use it as a
facade to reach services (d.net.ensureNAT, d.hostNet().*). Planned
CapabilityHost interface extraction is orthogonal, left for later.

Tests: dns_routing_test.go + fastpath_test.go + nat_test.go +
snapshot_test.go + open_close_test.go were touched to construct
HostNetwork literals where they exercise its methods directly, or
route through d.hostNet() where they exercise the Daemon entry
points.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 20:11:46 -03:00
fdab4a7e68
Extract opstate and dmsnap into subpackages
Two leaves of the daemon package that carry no back-references to Daemon
move out:

- internal/daemon/opstate: generic Registry[T AsyncOp]. The AsyncOp
  interface methods are capitalised (ID, IsDone, UpdatedAt, Cancel);
  vmCreateOperationState and imageBuildOperationState implement it.
- internal/daemon/dmsnap: Create, Cleanup, Remove plus the Handles type
  for device-mapper snapshot lifecycle. Takes an explicit Runner
  interface. The daemon-package snapshot.go keeps thin forwarders and a
  type alias so existing call sites and tests are untouched.

Skipped on purpose: tap_pool has too many Daemon-scoped dependencies
(config, store, closing, createTap) for a clean extraction at this
stage; nat.go is already a thin facade over internal/hostnat;
dns_routing.go tests tightly couple to package internals, so extraction
would be more churn than payoff. Each can be revisited when a
subsystem-level refactor forces the boundary.

All tests green.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-15 16:02:43 -03:00
7b7f7e676c
Harden VM stop cleanup for stale snapshots
Stop and delete could fail with device-mapper busy errors when the persisted Firecracker PID was stale or the kernel needed longer to release the root snapshot.

Rediscover a live Firecracker process by API socket during cleanup, kill and wait on that PID instead of trusting only the stored runtime PID, and extend dm snapshot removal retries for transient busy handles.

Add daemon regressions for stale-runtime reconcile, rediscovered process cleanup, and repeated busy dm removal. Validate with go test ./..., make build, and a live ./banger vm stop debug-ssh run that now exits cleanly.
2026-03-18 12:28:15 -03:00
60294e8c90
Fix VM lifecycle issues behind verify.sh
Make the Firecracker and bangerd processes outlive short-lived CLI request contexts so vm create no longer kills the VMM or daemon as soon as the RPC returns.

Fix fresh-VM SSH by flattening the seeded /root work disk when the copied home tree lands under a nested root/ directory, and write a guest sshd override to keep root pubkey auth explicit while debugging.

Harden teardown and smoke diagnostics: verify.sh now reports early Firecracker exit and delete failures directly, while dm snapshot cleanup tolerates already-gone handles and retries busy mapper removal long enough for Firecracker to release the device.

Validation: go test ./..., make build, bash -n verify.sh, direct SSH against a fresh VM, and a live ./verify.sh run that now completes with [verify] ok.
2026-03-17 14:43:09 -03:00
375900cf65
Rollback partial dm snapshot startup
Prevent partial VM startup failures from leaking loop devices and dm state on the host.

Move root snapshot setup into a rollback-safe helper that records loop and mapper handles incrementally, tears them down in reverse order on failure, and reuses the same dm/loop cleanup path during normal runtime teardown. Also switch the daemon runner field to a small command-runner interface so the snapshot path can be tested with injected failures.

Add failure-injection coverage for losetup, blockdev, dmsetup, partial teardown, and joined rollback errors. Validated with go test ./... and make build.
2026-03-16 14:06:17 -03:00