banger/internal/daemon/ARCHITECTURE.md
Thales Maciel 99d0811097
daemon: shrink createVMMu + imageOpsMu to reservation/publication windows
Before: createVMMu was held across the whole of CreateVM — including
image resolution (which could fire a full auto-pull) and startVMLocked
(boot of multiple seconds). imageOpsMu was held across the whole of
PullImage/RegisterImage/PromoteImage/DeleteImage, so any slow OCI pull,
bundle download, or file copy blocked every other image mutation and
every other VM create that needed to auto-pull. The async create API
bought nothing if all creates serialised on the same mutex.

CreateVM is now three phases:

 1. Validate + resolve image (possibly auto-pulling). No global lock.
 2. reserveVM: take createVMMu only long enough to re-check the name
    is free, allocate the next guest IP, and UpsertVM the "created"
    row. Milliseconds.
 3. startVMLocked: run the full boot flow under the per-VM lock only.

Parallel creates of different VMs now overlap on image resolution +
boot; they contend only across the reservation claim.

For the image surface a new publishImage helper isolates the commit
atom (recheck name free, atomic rename stagingDir→finalDir, UpsertImage)
under imageOpsMu. pullFromBundle + pullFromOCI do their network fetch
+ ext4 build + ownership fixup + agent injection outside the lock;
Register moves validation + kernel resolution outside; Promote moves
file copy + SSH-key seeding outside; Delete keeps a brief lock over
the lookup + reference check + store delete and does file cleanup
unlocked.

Two concurrency tests assert the new behaviour:
 - TestPullImageDoesNotSerialiseOnDifferentNames fails the old code
   (second pull blocks on imageOpsMu and never reaches the body).
 - TestPullImageRejectsNameClashAtPublish confirms the publish-window
   recheck is what enforces name uniqueness now that the body runs
   unlocked — exactly one winner.

ARCHITECTURE.md updated to describe the new scope explicitly instead
of calling the locks "narrow".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:44:22 -03:00

internal/daemon architecture

This document describes the current daemon package layout: the Daemon composition root, the subpackages that own stateless helpers and shared primitives, and the lock ordering every caller must respect.

Composition

Daemon is the composition root. Subsystem state and locks live on their owning types:

  • Layout, config, store, runner, logger, pid — infrastructure handles.
  • vmLocks vmLockSet — per-VM *sync.Mutex, one per VM ID. Held for the entire lifecycle op on that VM: a start holds it across preflight, bridge setup, firecracker spawn, and post-boot wiring (seconds to tens of seconds). Two start/stop/delete/set calls against the same VM therefore serialise; calls against different VMs run independently. If you need a slow guest-side operation to NOT block lifecycle ops on the same VM, scope it out of the lock explicitly the way workspace.prepare does (see below).
  • workspaceLocks vmLockSet — per-VM mutex scoped to workspace.prepare / workspace.export. These ops acquire vmLocks[id] only long enough to validate VM state + snapshot the fields they need, release it, then acquire workspaceLocks[id] for the slow guest I/O phase. That keeps vm stop / delete / restart from queueing behind a running tar import.
  • handles *handleCache — in-memory map of per-VM transient kernel/process handles (PID, tap device, loop devices, DM target). The cache is rebuildable: each VM directory holds a small handles.json scratch file that the daemon reads at startup to reconstruct the cache and verify processes against /proc via pgrep. Nothing in the durable vms SQLite row describes transient kernel state. See internal/daemon/vm_handles.go.
  • createVMMu sync.Mutex — narrow reservation mutex. CreateVM resolves the image (possibly auto-pulling, which self-locks on imageOpsMu) and parses sizing flags outside this lock, then holds createVMMu only to re-check that the requested VM name is still free, allocate the next guest IP, and insert the initial "created" row. The subsequent boot flow runs under the per-VM lock only. Parallel vm create calls therefore overlap on image resolution and boot; they contend only across the millisecond-scale name+IP claim.
  • imageOpsMu sync.Mutex — narrow publication mutex. PullImage (both bundle and OCI paths), RegisterImage, PromoteImage, and DeleteImage do their slow work (network fetch, ext4 build, ownership fixup, file copy, SSH-key seeding) without this lock and acquire it only for the commit atom: recheck name free, atomic rename of the staging dir to its final home, upsert the store row. Two pulls for different images run fully in parallel; two pulls that race to the same name are resolved at the recheck — the loser fails fast and its staging dir is cleaned up.
  • createOps opstate.Registry[*vmCreateOperationState] — in-flight VM create operations; owns its own lock.
  • tapPool tapPool — TAP interface pool; owns its own lock.
  • listener, vmDNS — networking.
  • vmCaps — registered VM capability hooks.
  • pullAndFlatten, finalizePulledRootfs, bundleFetch, requestHandler, guestWaitForSSH, guestDial, workspaceInspectRepo, workspaceImport — injectable seams used by tests.
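
The createVMMu reservation window described above can be sketched roughly as follows. This is a minimal illustration, not the daemon's actual code: the registry type, the reserve and createVM functions, the errNameTaken sentinel, and the 10.0.0.x addressing are all assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errNameTaken is a hypothetical sentinel for a lost name race.
var errNameTaken = errors.New("vm name already taken")

// registry is an illustrative stand-in for the store plus the IP allocator.
type registry struct {
	createMu sync.Mutex        // plays the role of createVMMu
	byName   map[string]string // VM name -> guest IP
	nextHost int               // last octet of the next guest IP (assumed scheme)
}

func newRegistry() *registry {
	return &registry{byName: make(map[string]string), nextHost: 2}
}

// reserve holds createMu only for the name re-check, the IP allocation,
// and the insert of the initial row.
func (r *registry) reserve(name string) (string, error) {
	r.createMu.Lock()
	defer r.createMu.Unlock()
	if _, taken := r.byName[name]; taken {
		return "", errNameTaken
	}
	ip := fmt.Sprintf("10.0.0.%d", r.nextHost)
	r.nextHost++
	r.byName[name] = ip
	return ip, nil
}

// createVM keeps the slow phases (resolve, boot) outside the reservation mutex.
func (r *registry) createVM(name string, resolve, boot func() error) (string, error) {
	if err := resolve(); err != nil { // phase 1: no global lock
		return "", err
	}
	ip, err := r.reserve(name) // phase 2: brief global lock
	if err != nil {
		return "", err
	}
	if err := boot(); err != nil { // phase 3: would run under the per-VM lock
		return "", err
	}
	return ip, nil
}

func main() {
	r := newRegistry()
	noop := func() error { return nil }
	var wg sync.WaitGroup
	for _, name := range []string{"a", "b", "c"} {
		wg.Add(1)
		go func(n string) {
			defer wg.Done()
			ip, err := r.createVM(n, noop, noop)
			fmt.Println(n, ip, err)
		}(name)
	}
	wg.Wait()
}
```

Because resolve and boot run outside createMu, three concurrent creates only queue for the few map operations inside reserve.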

Subpackages

Stateless helpers that don't need the Daemon composition root have been lifted into subpackages. Lifecycle orchestration, image-registry orchestration, host networking bootstrap, background reconciliation, and the JSON-RPC dispatch all still live in this package — it is not "just orchestration." ~29 files and ~130 func (d *Daemon) methods share the root struct today. A future project would be to split VM lifecycle, image management, and the background reconciler into services with explicit interfaces; that's out of scope for v0.1.0.

Each subpackage takes explicit dependencies (typically a system.Runner-compatible interface) and holds no global state beyond small test seams.

| Subpackage | Purpose |
| --- | --- |
| internal/daemon/opstate | Generic Registry[T AsyncOp] for async-operation bookkeeping. |
| internal/daemon/dmsnap | Device-mapper COW snapshot create/cleanup/remove. |
| internal/daemon/fcproc | Firecracker process primitives (bridge, tap, binary, PID, kill, wait). |
| internal/daemon/imagemgr | Image subsystem pure helpers: validators, staging, build script gen. |
| internal/daemon/workspace | Workspace helpers: git inspection, copy prep, guest import script. |

All subpackages are leaves — no intra-daemon subpackage imports another.
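
To illustrate what "explicit dependencies" can look like for a leaf helper, here is a small sketch. The Runner interface, its Run signature, the removeTarget helper, and the target name are hypothetical; the real system.Runner method set is not shown in this document and may differ.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// Runner is a hypothetical command-runner interface (assumed signature).
type Runner interface {
	Run(ctx context.Context, name string, args ...string) ([]byte, error)
}

// removeTarget is an illustrative leaf helper: it keeps no global state and
// receives the runner it needs as an argument.
func removeTarget(ctx context.Context, r Runner, target string) error {
	if target == "" {
		return fmt.Errorf("empty device-mapper target name")
	}
	if _, err := r.Run(ctx, "dmsetup", "remove", target); err != nil {
		return fmt.Errorf("remove %s: %w", target, err)
	}
	return nil
}

// fakeRunner records commands instead of executing them, as a test seam would.
type fakeRunner struct{ calls []string }

func (f *fakeRunner) Run(_ context.Context, name string, args ...string) ([]byte, error) {
	f.calls = append(f.calls, strings.Join(append([]string{name}, args...), " "))
	return nil, nil
}

// demo runs the helper against the fake and returns the recorded commands.
func demo() []string {
	f := &fakeRunner{}
	if err := removeTarget(context.Background(), f, "vm1-rootfs"); err != nil {
		panic(err)
	}
	return f.calls
}

func main() {
	fmt.Println(demo())
}
```

Passing the runner in keeps the helper testable without root privileges or a real device-mapper target.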

Lock ordering

Acquire in this order, release in reverse. Never acquire in the opposite direction.

vmLocks[id]  →  workspaceLocks[id]  →  {createVMMu, imageOpsMu}  →  subsystem-local locks

vmLocks[id] and workspaceLocks[id] are NEVER held at the same time. workspace.prepare acquires vmLocks[id] just long enough to validate VM state, releases it, then acquires workspaceLocks[id] for the guest I/O phase. Regular lifecycle ops (start, stop, delete, set) do NOT do this split — they hold vmLocks[id] across the whole flow.
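
The hand-off between the two per-VM locks might look like the sketch below. The lockSet, vm, and daemon types and the prepareWorkspace function are simplified stand-ins invented for this example; only the locking pattern is taken from the text.

```go
package main

import (
	"fmt"
	"sync"
)

// lockSet hands out one mutex per VM ID; a hypothetical stand-in for vmLockSet.
type lockSet struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func (s *lockSet) get(id string) *sync.Mutex {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.locks == nil {
		s.locks = make(map[string]*sync.Mutex)
	}
	l, ok := s.locks[id]
	if !ok {
		l = &sync.Mutex{}
		s.locks[id] = l
	}
	return l
}

type vm struct {
	ID      string
	Running bool
	GuestIP string
}

type daemon struct {
	vmLocks        lockSet
	workspaceLocks lockSet
	vms            map[string]*vm
}

// prepareWorkspace validates and snapshots under the VM lock, releases it,
// and only then takes the workspace lock for the slow phase.
func (d *daemon) prepareWorkspace(id string, slowIO func(guestIP string) error) error {
	vl := d.vmLocks.get(id)
	vl.Lock()
	v, ok := d.vms[id]
	if !ok || !v.Running {
		vl.Unlock()
		return fmt.Errorf("vm %q is not running", id)
	}
	guestIP := v.GuestIP // copy what the slow phase needs
	vl.Unlock()          // released before the workspace lock is taken

	wl := d.workspaceLocks.get(id)
	wl.Lock()
	defer wl.Unlock()
	return slowIO(guestIP)
}

func main() {
	d := &daemon{vms: map[string]*vm{"vm1": {ID: "vm1", Running: true, GuestIP: "10.0.0.2"}}}
	err := d.prepareWorkspace("vm1", func(ip string) error {
		// A lifecycle op on the same VM could take the VM lock while this runs.
		l := d.vmLocks.get("vm1")
		if l.TryLock() {
			fmt.Println("vm lock is free during guest I/O")
			l.Unlock()
		}
		return nil
	})
	fmt.Println("err:", err)
}
```

The two locks are never held together, so a stop or delete on the same VM does not wait for the slow phase to finish.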

Subsystem-local locks (tapPool.mu, opstate.Registry mu) are leaves. They do not contend with each other.

Notes:

  • vmLocks[id] is the outer lock for any operation scoped to a single VM. Acquired via withVMLockByID / withVMLockByRef. The callback runs under the lock, so treat the whole function body as a critical section.
  • createVMMu is held only across the VM-name reservation + IP allocation + initial UpsertVM. Image resolution and the full boot flow happen outside it.
  • imageOpsMu is held only across the publication atom (recheck name + atomic rename + UpsertImage, or the equivalent for Register / Promote / Delete). Network fetch, ext4 build, and file copies run unlocked.
  • Holding a subsystem-local lock while calling into guest SSH is discouraged; copy needed state out under the lock and release before blocking I/O.
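
The imageOpsMu publication atom from the notes above can be sketched as follows. The imageStore type, the publish and pull functions, the errImageExists sentinel, and the rootfs.ext4 file name are assumptions for illustration and do not reproduce the real publishImage helper.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

// errImageExists is a hypothetical sentinel for a name clash at publish time.
var errImageExists = errors.New("image name already published")

// imageStore is an illustrative stand-in for the image rows plus on-disk layout.
type imageStore struct {
	opsMu  sync.Mutex        // plays the role of imageOpsMu
	root   string            // directory that holds staging and final image dirs
	images map[string]string // image name -> final dir
}

// newTempStore builds a store in a temp dir and returns a cleanup function.
func newTempStore() (*imageStore, func()) {
	root, err := os.MkdirTemp("", "images-")
	if err != nil {
		panic(err)
	}
	return &imageStore{root: root, images: make(map[string]string)}, func() { os.RemoveAll(root) }
}

// publish is the commit atom: recheck the name, rename staging into place,
// then record the row. Nothing slow happens under the lock.
func (s *imageStore) publish(name, stagingDir string) error {
	s.opsMu.Lock()
	defer s.opsMu.Unlock()
	if _, exists := s.images[name]; exists {
		return errImageExists
	}
	finalDir := filepath.Join(s.root, name)
	if err := os.Rename(stagingDir, finalDir); err != nil {
		return err
	}
	s.images[name] = finalDir
	return nil
}

// pull runs the slow build unlocked and removes its staging dir on any failure,
// including losing the name race at publish.
func (s *imageStore) pull(name string, build func(stagingDir string) error) error {
	stagingDir, err := os.MkdirTemp(s.root, ".staging-"+name+"-")
	if err != nil {
		return err
	}
	if err := build(stagingDir); err != nil {
		os.RemoveAll(stagingDir)
		return err
	}
	if err := s.publish(name, stagingDir); err != nil {
		os.RemoveAll(stagingDir)
		return err
	}
	return nil
}

func main() {
	s, cleanup := newTempStore()
	defer cleanup()
	build := func(dir string) error {
		return os.WriteFile(filepath.Join(dir, "rootfs.ext4"), []byte("stub"), 0o644)
	}
	errs := make([]error, 2)
	var wg sync.WaitGroup
	for i := range errs {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			errs[i] = s.pull("debian", build) // both race for the same name
		}(i)
	}
	wg.Wait()
	winners := 0
	for _, err := range errs {
		if err == nil {
			winners++
		}
	}
	fmt.Println("winners:", winners) // prints "winners: 1"
}
```

Creating the staging directory under the same root as the final directory keeps the rename on one filesystem, which is what lets it act as a single commit step.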

External API

Only internal/cli imports this package. The surface is:

  • daemon.Open(ctx) (*Daemon, error)
  • (*Daemon).Serve(ctx) error
  • (*Daemon).Close() error
  • daemon.Doctor(...) — host diagnostics (no receiver).

All other *Daemon methods are reached only through the RPC dispatch switch in daemon.go and are free to move/rename during refactoring.