Before: createVMMu was held across the whole of CreateVM, including
image resolution (which could fire a full auto-pull) and startVMLocked
(boot of multiple seconds). imageOpsMu was held across the whole of
PullImage/RegisterImage/PromoteImage/DeleteImage, so any slow OCI pull,
bundle download, or file copy blocked every other image mutation and
every other VM create that needed to auto-pull. The async create API
bought nothing if all creates serialised on the same mutex.

CreateVM is now three phases:

1. Validate + resolve image (possibly auto-pulling). No global lock.
2. reserveVM: take createVMMu only long enough to re-check the name
   is free, allocate the next guest IP, and UpsertVM the "created"
   row. Milliseconds.
3. startVMLocked: run the full boot flow under the per-VM lock only.

Parallel creates of different VMs now overlap on image resolution +
boot; they contend only across the reservation claim.

For the image surface a new publishImage helper isolates the commit
atom (recheck name free, atomic rename stagingDir→finalDir, UpsertImage)
under imageOpsMu. pullFromBundle + pullFromOCI do their network fetch
+ ext4 build + ownership fixup + agent injection outside the lock;
Register moves validation + kernel resolution outside; Promote moves
file copy + SSH-key seeding outside; Delete keeps a brief lock over
the lookup + reference check + store delete and does file cleanup
unlocked.

Two concurrency tests assert the new behaviour:

- TestPullImageDoesNotSerialiseOnDifferentNames fails the old code
  (second pull blocks on imageOpsMu and never reaches the body).
- TestPullImageRejectsNameClashAtPublish confirms the publish-window
  recheck is what enforces name uniqueness now that the body runs
  unlocked: exactly one winner.

ARCHITECTURE.md updated to describe the new scope explicitly instead
of calling the locks "narrow".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# `internal/daemon` architecture

This document describes the current daemon package layout: the `Daemon`
composition root, the subpackages that own stateless helpers and shared
primitives, and the lock ordering every caller must respect.

## Composition

`Daemon` is the composition root. Subsystem state and locks live on their
owning types:

- Layout, config, store, runner, logger, pid: infrastructure handles.
- `vmLocks vmLockSet`: per-VM `*sync.Mutex`, one per VM ID. Held for
  the **entire lifecycle op** on that VM: a `start` holds it across
  preflight, bridge setup, firecracker spawn, and post-boot wiring
  (seconds to tens of seconds). Two `start`/`stop`/`delete`/`set` calls
  against the same VM therefore serialise; calls against different VMs
  run independently. If you need a slow guest-side operation to NOT
  block lifecycle ops on the same VM, scope it out of the lock
  explicitly the way `workspace.prepare` does (see below).
- `workspaceLocks vmLockSet`: per-VM mutex scoped to
  `workspace.prepare` / `workspace.export`. These ops acquire
  `vmLocks[id]` only long enough to validate VM state + snapshot the
  fields they need, release it, then acquire `workspaceLocks[id]` for
  the slow guest I/O phase. That keeps `vm stop` / `delete` / `restart`
  from queueing behind a running tar import.
- `handles *handleCache`: in-memory map of per-VM transient
  kernel/process handles (PID, tap device, loop devices, DM target).
  The cache is rebuildable: each VM directory holds a small
  `handles.json` scratch file that the daemon reads at startup to
  reconstruct the cache and verify processes against `/proc` via
  pgrep. Nothing in the durable `vms` SQLite row describes transient
  kernel state. See `internal/daemon/vm_handles.go`.
- `createVMMu sync.Mutex`: narrow **reservation** mutex. `CreateVM`
  resolves the image (possibly auto-pulling, which self-locks on
  `imageOpsMu`) and parses sizing flags outside this lock, then holds
  `createVMMu` only to re-check that the requested VM name is still
  free, allocate the next guest IP, and insert the initial "created"
  row. The subsequent boot flow runs under the per-VM lock only.
  Parallel `vm create` calls therefore overlap on image resolution and
  boot; they contend only across the millisecond-scale name+IP claim.
- `imageOpsMu sync.Mutex`: narrow **publication** mutex. `PullImage`
  (both bundle and OCI paths), `RegisterImage`, `PromoteImage`, and
  `DeleteImage` do their slow work (network fetch, ext4 build,
  ownership fixup, file copy, SSH-key seeding) without this lock and
  acquire it only for the commit atom: recheck name free, atomic
  rename of the staging dir to its final home, upsert the store row.
  Two pulls for different images run fully in parallel; two pulls that
  race to the same name are resolved at the recheck: the loser fails
  fast and its staging dir is cleaned up.
- `createOps opstate.Registry[*vmCreateOperationState]`: in-flight VM
  create operations; owns its own lock.
- `tapPool tapPool`: TAP interface pool; owns its own lock.
- `listener`, `vmDNS`: networking.
- `vmCaps`: registered VM capability hooks.
- `pullAndFlatten`, `finalizePulledRootfs`, `bundleFetch`,
  `requestHandler`, `guestWaitForSSH`, `guestDial`,
  `workspaceInspectRepo`, `workspaceImport`: injectable seams used by tests.
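
The `createVMMu` reservation window described in the list above can be sketched as follows. This is a minimal illustration, not the daemon's code: the `daemon` and `vmRecord` types, the in-memory map standing in for the store, the `10.0.0.x` addressing, and the `resolve`/`boot` callbacks are all assumptions made for the example. Only the shape (slow work outside the lock, a brief name+IP claim inside it) comes from the text.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// vmRecord is a hypothetical stand-in for the durable "created" row.
type vmRecord struct {
	Name    string
	GuestIP string
}

// daemon is a toy composition root; only the reservation state is modelled.
type daemon struct {
	createVMMu sync.Mutex
	vms        map[string]vmRecord // stand-in for the store
	nextHost   int                 // last octet of the next guest IP (illustrative)
}

var errNameTaken = errors.New("vm name already in use")

// reserveVM is the only step that holds createVMMu: re-check the name,
// allocate the next guest IP, and insert the initial row.
func (d *daemon) reserveVM(name string) (vmRecord, error) {
	d.createVMMu.Lock()
	defer d.createVMMu.Unlock()
	if _, ok := d.vms[name]; ok {
		return vmRecord{}, errNameTaken
	}
	rec := vmRecord{Name: name, GuestIP: fmt.Sprintf("10.0.0.%d", d.nextHost)}
	d.nextHost++
	d.vms[name] = rec
	return rec, nil
}

// createVM runs the slow phases (resolve, boot) outside createVMMu.
// Cleanup of the reserved row after a failed boot is not modelled here.
func (d *daemon) createVM(name string, resolve, boot func() error) (vmRecord, error) {
	if err := resolve(); err != nil { // phase 1: no global lock
		return vmRecord{}, err
	}
	rec, err := d.reserveVM(name) // phase 2: brief global lock
	if err != nil {
		return vmRecord{}, err
	}
	// Phase 3: in the real daemon this runs under the per-VM lock only.
	if err := boot(); err != nil {
		return vmRecord{}, err
	}
	return rec, nil
}

func main() {
	d := &daemon{vms: map[string]vmRecord{}, nextHost: 2}
	noop := func() error { return nil }

	var wg sync.WaitGroup
	for _, name := range []string{"a", "b", "c"} {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			rec, err := d.createVM(name, noop, noop)
			fmt.Println(rec.Name, rec.GuestIP, err) // order varies between runs
		}(name)
	}
	wg.Wait()

	_, err := d.createVM("a", noop, noop)
	fmt.Println(errors.Is(err, errNameTaken)) // true: "a" was claimed above
}
```

Because the name check and the insert happen under the same lock acquisition, two creates racing for one name cannot both pass the check, while creates for different names only queue for the duration of the map update.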

## Subpackages

Stateless helpers that don't need the `Daemon` composition root have
been lifted into subpackages. Lifecycle orchestration, image-registry
orchestration, host networking bootstrap, background reconciliation,
and the JSON-RPC dispatch all still live in this package; it is not
"just orchestration." ~29 files and ~130 `func (d *Daemon)` methods
share the root struct today. A future project would be to split VM
lifecycle, image management, and the background reconciler into
services with explicit interfaces; that's out of scope for v0.1.0.

Each subpackage takes explicit dependencies (typically a
`system.Runner`-compatible interface) and holds no global state beyond
small test seams.

| Subpackage                  | Purpose                                                                |
| --------------------------- | ---------------------------------------------------------------------- |
| `internal/daemon/opstate`   | Generic `Registry[T AsyncOp]` for async-operation bookkeeping.         |
| `internal/daemon/dmsnap`    | Device-mapper COW snapshot create/cleanup/remove.                      |
| `internal/daemon/fcproc`    | Firecracker process primitives (bridge, tap, binary, PID, kill, wait). |
| `internal/daemon/imagemgr`  | Image subsystem pure helpers: validators, staging, build script gen.   |
| `internal/daemon/workspace` | Workspace helpers: git inspection, copy prep, guest import script.     |

All subpackages are leaves: no intra-daemon subpackage imports another.
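
To make the `Registry[T AsyncOp]` entry in the table more concrete, here is a rough sketch of a generic registry that owns its own lock. The `AsyncOp` method set (`ID() string`), the `Insert`/`Get`/`Delete` names, and the `createOp` type are assumptions for illustration; the real `opstate` package may look different.

```go
package main

import (
	"fmt"
	"sync"
)

// AsyncOp is an assumed constraint; the real interface in opstate may
// require different methods.
type AsyncOp interface {
	ID() string
}

// Registry tracks in-flight operations and owns its own lock, so callers
// never need an outer mutex to use it.
type Registry[T AsyncOp] struct {
	mu  sync.Mutex
	ops map[string]T
}

// Insert records op under its ID, lazily creating the map.
func (r *Registry[T]) Insert(op T) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.ops == nil {
		r.ops = make(map[string]T)
	}
	r.ops[op.ID()] = op
}

// Get returns the operation with the given ID, if present.
func (r *Registry[T]) Get(id string) (T, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	op, ok := r.ops[id]
	return op, ok
}

// Delete forgets the operation with the given ID.
func (r *Registry[T]) Delete(id string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.ops, id)
}

// createOp is a hypothetical operation type used only for this example.
type createOp struct {
	id     string
	vmName string
}

func (o *createOp) ID() string { return o.id }

func main() {
	var reg Registry[*createOp]
	reg.Insert(&createOp{id: "op-1", vmName: "dev"})
	if op, ok := reg.Get("op-1"); ok {
		fmt.Println(op.vmName) // dev
	}
	reg.Delete("op-1")
	_, ok := reg.Get("op-1")
	fmt.Println(ok) // false
}
```

Keeping the mutex private to the registry is what lets it sit at the leaf position in the lock ordering: no caller can hold it while acquiring anything else.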

## Lock ordering

Acquire in this order, release in reverse. Never acquire in the opposite
direction.

```
vmLocks[id] → workspaceLocks[id] → {createVMMu, imageOpsMu} → subsystem-local locks
```

`vmLocks[id]` and `workspaceLocks[id]` are NEVER held at the same
time. `workspace.prepare` acquires `vmLocks[id]` just long enough to
validate VM state, releases it, then acquires `workspaceLocks[id]`
for the guest I/O phase. Regular lifecycle ops (`start`, `stop`,
`delete`, `set`) do NOT do this split; they hold `vmLocks[id]`
across the whole flow.
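
The split described above can be sketched like this. It is a simplified model under stated assumptions: `lockSet` is a guess at what a per-ID mutex set might look like (the real `vmLockSet` may differ), and `vmState`, `prepareWorkspace`, `stop`, and the `slowIO` callback are hypothetical names introduced for the example.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// lockSet is a minimal per-ID mutex set; the real vmLockSet may differ.
type lockSet struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

// get returns the mutex for id, creating it on first use.
func (s *lockSet) get(id string) *sync.Mutex {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.locks == nil {
		s.locks = make(map[string]*sync.Mutex)
	}
	l, ok := s.locks[id]
	if !ok {
		l = &sync.Mutex{}
		s.locks[id] = l
	}
	return l
}

type vmState struct {
	Running bool
	GuestIP string
}

// daemon holds a fixed set of VMs; each *vmState is guarded by vmLocks[id].
type daemon struct {
	vmLocks        lockSet
	workspaceLocks lockSet
	vms            map[string]*vmState
}

// prepareWorkspace snapshots VM state under vmLocks[id], releases it, and
// only then takes workspaceLocks[id] for the slow phase.
func (d *daemon) prepareWorkspace(id string, slowIO func(guestIP string) error) error {
	vl := d.vmLocks.get(id)
	vl.Lock()
	var st vmState
	p, ok := d.vms[id]
	if ok {
		st = *p // copy the fields needed later
	}
	vl.Unlock() // released before any guest I/O
	if !ok || !st.Running {
		return errors.New("vm not running")
	}
	wl := d.workspaceLocks.get(id)
	wl.Lock()
	defer wl.Unlock()
	return slowIO(st.GuestIP)
}

// stop models a lifecycle op: it holds vmLocks[id] for its whole body.
func (d *daemon) stop(id string) {
	l := d.vmLocks.get(id)
	l.Lock()
	defer l.Unlock()
	if p, ok := d.vms[id]; ok {
		p.Running = false
	}
}

func main() {
	d := &daemon{vms: map[string]*vmState{"vm1": {Running: true, GuestIP: "10.0.0.2"}}}
	started, release := make(chan struct{}), make(chan struct{})
	done := make(chan error, 1)
	go func() {
		done <- d.prepareWorkspace("vm1", func(string) error {
			close(started)
			<-release // simulate a long tar import
			return nil
		})
	}()
	<-started
	d.stop("vm1") // does not block: vmLocks["vm1"] was already released
	fmt.Println("stopped while import in flight")
	close(release)
	fmt.Println("prepare err:", <-done)
}
```

The trade-off is that the slow phase works from a snapshot: the VM may be stopped underneath it, so the guest I/O has to tolerate failure rather than assume the state it validated still holds.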

Subsystem-local locks (`tapPool.mu`, `opstate.Registry` mu) are leaves.
They do not contend with each other.
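
As an illustration of what "leaf" means in practice, the sketch below shows a pool whose mutex is taken only inside its own short methods and never while another lock is acquired or blocking work runs. The `free` slice, the method names, and the device names are assumptions; the real `tapPool` is not shown in this document.

```go
package main

import (
	"fmt"
	"sync"
)

// tapPool is a toy pool of pre-created TAP device names. Its mutex is a
// leaf: no other lock is acquired while it is held.
type tapPool struct {
	mu   sync.Mutex
	free []string
}

// acquire pops a device name under the lock and returns it; the caller
// does any slow work with the name after the lock is released.
func (p *tapPool) acquire() (string, bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if len(p.free) == 0 {
		return "", false
	}
	name := p.free[len(p.free)-1]
	p.free = p.free[:len(p.free)-1]
	return name, true
}

// release returns a device name to the pool.
func (p *tapPool) release(name string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.free = append(p.free, name)
}

// size reports how many devices are currently free.
func (p *tapPool) size() int {
	p.mu.Lock()
	defer p.mu.Unlock()
	return len(p.free)
}

func main() {
	p := &tapPool{free: []string{"tap0", "tap1"}}
	name, ok := p.acquire()
	fmt.Println(name, ok) // tap1 true
	// Slow work (for example configuring the device) would happen here,
	// with p.mu already released.
	p.release(name)
	fmt.Println(p.size()) // 2
}
```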

Notes:

- `vmLocks[id]` is the outer lock for any operation scoped to a single VM.
  Acquired via `withVMLockByID` / `withVMLockByRef`. The callback runs
  under the lock, so treat the whole function body as a critical section.
- `createVMMu` is held only across the VM-name reservation + IP
  allocation + initial UpsertVM. Image resolution and the full boot
  flow happen outside it.
- `imageOpsMu` is held only across the publication atom (recheck name,
  atomic rename, UpsertImage, or the equivalent for Register /
  Promote / Delete). Network fetch, ext4 build, and file copies run
  unlocked.
- Holding a subsystem-local lock while calling into guest SSH is
  discouraged; copy needed state out under the lock and release before
  blocking I/O.

## External API

Only `internal/cli` imports this package. The surface is:

- `daemon.Open(ctx) (*Daemon, error)`
- `(*Daemon).Serve(ctx) error`
- `(*Daemon).Close() error`
- `daemon.Doctor(...)`: host diagnostics (no receiver).

All other `*Daemon` methods are reached only through the RPC `dispatch`
switch in `daemon.go` and are free to move/rename during refactoring.