Before: createVMMu was held across the whole of CreateVM — including
image resolution (which could fire a full auto-pull) and startVMLocked
(boot of multiple seconds). imageOpsMu was held across the whole of
PullImage/RegisterImage/PromoteImage/DeleteImage, so any slow OCI pull,
bundle download, or file copy blocked every other image mutation and
every other VM create that needed to auto-pull. The async create API
bought nothing if all creates serialised on the same mutex.
CreateVM is now three phases:
1. Validate + resolve image (possibly auto-pulling). No global lock.
2. reserveVM: take createVMMu only long enough to re-check the name
is free, allocate the next guest IP, and UpsertVM the "created"
row. Milliseconds.
3. startVMLocked: run the full boot flow under the per-VM lock only.
Parallel creates of different VMs now overlap on image resolution +
boot; they contend only across the reservation claim.
For the image surface, a new publishImage helper isolates the commit
atom (recheck name free, atomic rename stagingDir→finalDir, UpsertImage)
under imageOpsMu. pullFromBundle + pullFromOCI do their network fetch
+ ext4 build + ownership fixup + agent injection outside the lock;
Register moves validation + kernel resolution outside; Promote moves
file copy + SSH-key seeding outside; Delete keeps a brief lock over
the lookup + reference check + store delete and does file cleanup
unlocked.
Two concurrency tests assert the new behaviour:
- TestPullImageDoesNotSerialiseOnDifferentNames fails the old code
(second pull blocks on imageOpsMu and never reaches the body).
- TestPullImageRejectsNameClashAtPublish confirms the publish-window
recheck is what enforces name uniqueness now that the body runs
unlocked — exactly one winner.
ARCHITECTURE.md updated to describe the new scope explicitly instead
of calling the locks "narrow".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# internal/daemon architecture
This document describes the current daemon package layout: the Daemon
composition root, the subpackages that own stateless helpers and shared
primitives, and the lock ordering every caller must respect.
## Composition
Daemon is the composition root. Subsystem state and locks live on their
owning types:
- Layout, config, store, runner, logger, pid — infrastructure handles.
- `vmLocks vmLockSet` — per-VM `*sync.Mutex`, one per VM ID. Held for the
  entire lifecycle op on that VM: a `start` holds it across preflight,
  bridge setup, firecracker spawn, and post-boot wiring (seconds to tens
  of seconds). Two `start`/`stop`/`delete`/`set` calls against the same VM
  therefore serialise; calls against different VMs run independently. If
  you need a slow guest-side operation to NOT block lifecycle ops on the
  same VM, scope it out of the lock explicitly the way `workspace.prepare`
  does (see below).
- `workspaceLocks vmLockSet` — per-VM mutex scoped to
  `workspace.prepare`/`workspace.export`. These ops acquire `vmLocks[id]`
  only long enough to validate VM state + snapshot the fields they need,
  release it, then acquire `workspaceLocks[id]` for the slow guest I/O
  phase. That keeps `vm stop`/`delete`/`restart` from queueing behind a
  running tar import.
- `handles *handleCache` — in-memory map of per-VM transient
  kernel/process handles (PID, tap device, loop devices, DM target). The
  cache is rebuildable: each VM directory holds a small `handles.json`
  scratch file that the daemon reads at startup to reconstruct the cache
  and verify processes against `/proc` via pgrep. Nothing in the durable
  `vms` SQLite row describes transient kernel state. See
  `internal/daemon/vm_handles.go`.
- `createVMMu sync.Mutex` — narrow reservation mutex. `CreateVM` resolves
  the image (possibly auto-pulling, which self-locks on `imageOpsMu`) and
  parses sizing flags outside this lock, then holds `createVMMu` only to
  re-check that the requested VM name is still free, allocate the next
  guest IP, and insert the initial "created" row. The subsequent boot flow
  runs under the per-VM lock only. Parallel `vm create` calls therefore
  overlap on image resolution and boot; they contend only across the
  millisecond-scale name+IP claim.
- `imageOpsMu sync.Mutex` — narrow publication mutex. `PullImage` (both
  bundle and OCI paths), `RegisterImage`, `PromoteImage`, and
  `DeleteImage` do their slow work (network fetch, ext4 build, ownership
  fixup, file copy, SSH-key seeding) without this lock and acquire it only
  for the commit atom: recheck name free, atomic rename of the staging dir
  to its final home, upsert the store row. Two pulls for different images
  run fully in parallel; two pulls that race to the same name are resolved
  at the recheck — the loser fails fast and its staging dir is cleaned up.
- `createOps opstate.Registry[*vmCreateOperationState]` — in-flight VM
  create operations; owns its own lock.
- `tapPool tapPool` — TAP interface pool; owns its own lock.
- `listener`, `vmDNS` — networking.
- `vmCaps` — registered VM capability hooks.
- `pullAndFlatten`, `finalizePulledRootfs`, `bundleFetch`,
  `requestHandler`, `guestWaitForSSH`, `guestDial`,
  `workspaceInspectRepo`, `workspaceImport` — injectable seams used by
  tests.
## Subpackages
Stateless helpers that don't need the Daemon composition root have
been lifted into subpackages. Lifecycle orchestration, image-registry
orchestration, host networking bootstrap, background reconciliation,
and the JSON-RPC dispatch all still live in this package — it is not
"just orchestration." ~29 files and ~130 func (d *Daemon) methods
share the root struct today. A future project would be to split VM
lifecycle, image management, and the background reconciler into
services with explicit interfaces; that's out of scope for v0.1.0.
Each subpackage takes explicit dependencies (typically a
system.Runner-compatible interface) and holds no global state beyond
small test seams.
| Subpackage | Purpose |
|---|---|
| `internal/daemon/opstate` | Generic `Registry[T AsyncOp]` for async-operation bookkeeping. |
| `internal/daemon/dmsnap` | Device-mapper COW snapshot create/cleanup/remove. |
| `internal/daemon/fcproc` | Firecracker process primitives (bridge, tap, binary, PID, kill, wait). |
| `internal/daemon/imagemgr` | Image subsystem pure helpers: validators, staging, build script gen. |
| `internal/daemon/workspace` | Workspace helpers: git inspection, copy prep, guest import script. |
All subpackages are leaves — no intra-daemon subpackage imports another.
## Lock ordering
Acquire in this order, release in reverse. Never acquire in the opposite direction.
`vmLocks[id]` → `workspaceLocks[id]` → {`createVMMu`, `imageOpsMu`} → subsystem-local locks
`vmLocks[id]` and `workspaceLocks[id]` are NEVER held at the same time.
`workspace.prepare` acquires `vmLocks[id]` just long enough to validate
VM state, releases it, then acquires `workspaceLocks[id]` for the guest
I/O phase. Regular lifecycle ops (`start`, `stop`, `delete`, `set`) do
NOT do this split — they hold `vmLocks[id]` across the whole flow.

Subsystem-local locks (`tapPool.mu`, `opstate.Registry` mu) are leaves.
They do not contend with each other.
Notes:
- `vmLocks[id]` is the outer lock for any operation scoped to a single
  VM. Acquired via `withVMLockByID`/`withVMLockByRef`. The callback runs
  under the lock — treat the whole function body as a critical section.
- `createVMMu` is held only across the VM-name reservation + IP
  allocation + initial UpsertVM. Image resolution and the full boot flow
  happen outside it.
- `imageOpsMu` is held only across the publication atom (recheck name +
  atomic rename + UpsertImage, or the equivalent for Register / Promote /
  Delete). Network fetch, ext4 build, and file copies run unlocked.
- Holding a subsystem-local lock while calling into guest SSH is discouraged; copy needed state out under the lock and release before blocking I/O.
## External API
Only internal/cli imports this package. The surface is:
- `daemon.Open(ctx) (*Daemon, error)`
- `(*Daemon).Serve(ctx) error`
- `(*Daemon).Close() error`
- `daemon.Doctor(...)` — host diagnostics (no receiver).
All other *Daemon methods are reached only through the RPC dispatch
switch in daemon.go and are free to move/rename during refactoring.