banger/internal/daemon/ARCHITECTURE.md
Thales Maciel e69810610a
daemon: correct ARCHITECTURE doc to match actual package shape + lock scope
Two promises the doc was making that the code doesn't keep:

1. "Helpers moved out so the package stays focused on orchestration."
   The package still has ~29 files and ~130 func (d *Daemon) methods
   wiring VM lifecycle, image management, host networking, background
   reconciliation, and JSON-RPC dispatch. Calling it "just orchestration"
   sets readers up for surprise. Rewrite the subpackages preamble to
   say so, and flag the service split as a post-v0.1.0 project.

2. "vmLocks[id] is held only across short synchronous state validation
   and DB mutations." That's what workspace.prepare does; regular
   lifecycle ops (start/stop/delete/set) go through withVMLockByRef
   and hold the lock across the whole callback body, which for `start`
   means preflight + bridge + firecracker spawn + post-boot wiring.
   Rewrite the vmLocks bullet and the lock-ordering section to say
   that explicitly, so readers don't build "surely my long flow under
   the lock can't be what the doc means" reasoning on top of a false
   premise.

Doc-only change. Code behaviour is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:02:36 -03:00


internal/daemon architecture

This document describes the current daemon package layout: the Daemon composition root, the subpackages that own stateless helpers and shared primitives, and the lock ordering every caller must respect.

Composition

Daemon is the composition root. Subsystem state and locks live on their owning types:

  • Layout, config, store, runner, logger, pid — infrastructure handles.
  • vmLocks vmLockSet — per-VM *sync.Mutex, one per VM ID. Held for the entire lifecycle op on that VM: a start holds it across preflight, bridge setup, firecracker spawn, and post-boot wiring (seconds to tens of seconds). Two start/stop/delete/set calls against the same VM therefore serialise; calls against different VMs run independently. If you need a slow guest-side operation to NOT block lifecycle ops on the same VM, scope it out of the lock explicitly the way workspace.prepare does (see below).
  • workspaceLocks vmLockSet — per-VM mutex scoped to workspace.prepare / workspace.export. These ops acquire vmLocks[id] only long enough to validate VM state and snapshot the fields they need, release it, and then acquire workspaceLocks[id] for the slow guest I/O phase. That keeps vm stop / delete / restart from queueing behind a running tar import.
  • handles *handleCache — in-memory map of per-VM transient kernel/process handles (PID, tap device, loop devices, DM target). The cache is rebuildable: each VM directory holds a small handles.json scratch file that the daemon reads at startup to reconstruct the cache and verify processes against /proc via pgrep. Nothing in the durable vms SQLite row describes transient kernel state. See internal/daemon/vm_handles.go.
  • createVMMu sync.Mutex — serialises CreateVM (guards name uniqueness and the guest IP allocation window).
  • imageOpsMu sync.Mutex — serialises image-registry mutations (PullImage, RegisterImage, PromoteImage, DeleteImage).
  • createOps opstate.Registry[*vmCreateOperationState] — in-flight VM create operations; owns its own lock.
  • tapPool tapPool — TAP interface pool; owns its own lock.
  • listener, vmDNS — networking.
  • vmCaps — registered VM capability hooks.
  • pullAndFlatten, finalizePulledRootfs, bundleFetch, requestHandler, guestWaitForSSH, guestDial, workspaceInspectRepo, workspaceImport — injectable seams used by tests.

Subpackages

Stateless helpers that don't need the Daemon composition root have been lifted into subpackages. Lifecycle orchestration, image-registry orchestration, host networking bootstrap, background reconciliation, and the JSON-RPC dispatch all still live in this package, so it is not "just orchestration": roughly 29 files and about 130 func (d *Daemon) methods share the root struct today. Splitting VM lifecycle, image management, and the background reconciler into services with explicit interfaces is a possible post-v0.1.0 project; it is out of scope for v0.1.0.

Each subpackage takes explicit dependencies (typically a system.Runner-compatible interface) and holds no global state beyond small test seams.

| Subpackage | Purpose |
|---|---|
| internal/daemon/opstate | Generic Registry[T AsyncOp] for async-operation bookkeeping. |
| internal/daemon/dmsnap | Device-mapper COW snapshot create/cleanup/remove. |
| internal/daemon/fcproc | Firecracker process primitives (bridge, tap, binary, PID, kill, wait). |
| internal/daemon/imagemgr | Image subsystem pure helpers: validators, staging, build script gen. |
| internal/daemon/workspace | Workspace helpers: git inspection, copy prep, guest import script. |

All subpackages are leaves — no intra-daemon subpackage imports another.

Lock ordering

Acquire in this order, release in reverse. Never acquire in the opposite direction.

vmLocks[id]  →  workspaceLocks[id]  →  {createVMMu, imageOpsMu}  →  subsystem-local locks

vmLocks[id] and workspaceLocks[id] are NEVER held at the same time. workspace.prepare acquires vmLocks[id] just long enough to validate VM state, releases it, then acquires workspaceLocks[id] for the guest I/O phase. Regular lifecycle ops (start, stop, delete, set) do NOT do this split — they hold vmLocks[id] across the whole flow.

Subsystem-local locks (tapPool.mu, the opstate.Registry mutex) are leaves: no other lock is acquired while one is held, so they never nest and impose no ordering among themselves.

Notes:

  • vmLocks[id] is the outer lock for any operation scoped to a single VM. It is acquired via withVMLockByID / withVMLockByRef, and the callback runs under the lock, so treat the whole callback body as a critical section.
  • createVMMu and imageOpsMu are narrow: each guards one family of mutations and is released before any blocking guest I/O.
  • Holding a subsystem-local lock while calling into guest SSH is discouraged; copy needed state out under the lock and release before blocking I/O.
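A per-ID lock set and a callback wrapper in the spirit of vmLockSet and withVMLockByID can be sketched as below. This is a hypothetical illustration, not the package's actual code, and its names and signatures are assumptions. The guard mutex is released before the per-VM mutex is locked, so operations on different VMs never queue behind each other. This sketch also never deletes entries for removed VMs.

```go
package main

import "sync"

// vmLockSet (HYPOTHETICAL sketch): one *sync.Mutex per VM ID, created
// lazily under a small guard mutex.
type vmLockSet struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

// get returns the mutex for id, creating it on first use. The guard is
// released on return, before any caller locks the per-VM mutex.
func (s *vmLockSet) get(id string) *sync.Mutex {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.locks == nil {
		s.locks = make(map[string]*sync.Mutex)
	}
	m, ok := s.locks[id]
	if !ok {
		m = &sync.Mutex{}
		s.locks[id] = m
	}
	return m
}

// withVMLockByID runs fn with the VM's mutex held for fn's entire body:
// everything fn does is inside the critical section.
func withVMLockByID(s *vmLockSet, id string, fn func() error) error {
	m := s.get(id)
	m.Lock()
	defer m.Unlock()
	return fn()
}
```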

External API

Only internal/cli imports this package. The surface is:

  • daemon.Open(ctx) (*Daemon, error)
  • (*Daemon).Serve(ctx) error
  • (*Daemon).Close() error
  • daemon.Doctor(...) — host diagnostics (no receiver).

All other *Daemon methods are reached only through the RPC dispatch switch in daemon.go and are free to move/rename during refactoring.