# `internal/daemon` architecture
This document describes the current daemon package layout: the `Daemon`
composition root, the four services it wires together, the subpackages
that own stateless helpers, the privileged-ops seam used by the
supported system install, and the lock ordering every caller must
respect.
## Supported service topology
On the supported host path (`banger system install` on a `systemd`
host), banger runs as two cooperating services:
- `bangerd.service` runs as the configured owner user. It owns the
public RPC socket, store, image state, workspace prep, and the
lifecycle state machine.
- `bangerd-root.service` runs as root. It owns only the privileged
host-kernel operations: bridge/tap, NAT/resolver routing, dm/loop
snapshot plumbing, privileged ext4 mutation on dm devices, and
firecracker process/socket ownership.
The owner daemon talks to the root helper through the `privilegedOps`
seam. Non-system/dev paths still use the same seam, but it is backed
by an in-process adapter instead of the helper RPC client.
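Concretely, the seam is a narrow Go interface with two backings. A minimal sketch, assuming illustrative method names (the real surface lives in the package):
```go
package daemon

import "context"

// privilegedOps is the only route to privileged host-kernel work.
// Method names here are illustrative, not the real surface.
type privilegedOps interface {
	EnsureBridge(ctx context.Context, name string) error
	CreateTap(ctx context.Context, bridge string) (tap string, err error)
	EnsureNAT(ctx context.Context, guestIP, tap string) error
	CreateSnapshot(ctx context.Context, base string) (device string, err error)
	KillFirecracker(ctx context.Context, pid int) error
}
```
Callers never branch on deployment mode: the system path plugs in an RPC client for `bangerd-root`, the dev path an in-process adapter.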
## Composition
`Daemon` is a thin composition root. It holds shared infrastructure
(store, runner, logger, layout, config, listener, privileged-ops
adapter) plus pointers to four focused services. RPC dispatch is a
pure forwarder into those services; no lifecycle / image / workspace /
networking behaviour lives on `*Daemon` itself.
```
Daemon
├── *HostNetwork      — bridge, tap pool, NAT, DNS, firecracker process,
│                       DM snapshots, vsock readiness
├── *ImageService     — register, promote, delete, pull (bundle + OCI),
│                       kernel catalog, managed-seed refresh
├── *WorkspaceService — workspace.prepare / workspace.export, auth-key
│                       + git-identity sync onto the work disk
└── *VMService        — VM lifecycle (create/start/stop/restart/kill/
                        delete/set), stats polling, ports query,
                        handle cache, per-VM lock set, create-op
                        registry, preflight validation
```
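In code, the composition root is roughly a bag of shared infrastructure plus the four service pointers. A sketch with assumed field names:
```go
package daemon

import (
	"log/slog"
	"net"
)

// Field names are illustrative; only the shape matters. RPC dispatch
// forwards into the four services and keeps no behaviour of its own.
type Daemon struct {
	// shared infrastructure
	logger   *slog.Logger
	listener net.Listener
	priv     privilegedOps

	// the four focused services
	hostNet *HostNetwork
	img     *ImageService
	ws      *WorkspaceService
	vm      *VMService
}
```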
Each service owns its own state. Cross-service calls go through narrow
consumer-defined seams:
- `WorkspaceService` does not hold a `*VMService` pointer. It takes
function-typed deps (`vmResolver`, `aliveChecker`, `withVMLockByRef`,
`imageResolver`, `imageWorkSeed`) so it sees exactly the operations
it needs and nothing more. Those deps are captured as closures so
construction-order cycles don't recur (see the sketch after this list).
- `VMService` holds direct pointers to `*HostNetwork`, `*ImageService`,
and `*WorkspaceService`. Orchestrating a VM start really does compose
all three (bridge + tap + image resolution + work-disk sync), and
declaring a function-typed interface for every call would balloon
the surface for no win — services are unexported, so package-external
code can never reach them.
- Capability hooks do not take `*Daemon`. Each capability is a struct
with explicit service-pointer fields (`workDiskCapability{vm, ws,
store, defaultImageName}`, `dnsCapability{net}`, `natCapability{vm,
net, logger}`) populated at wiring time. `VMService` invokes them
through a `capabilityHooks` struct (function-typed bag) populated at
construction; neither the service nor any capability has a `*Daemon`
pointer.
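A sketch of both seam styles; `VMRecord`, `ImageRecord`, and the exact signatures are stand-ins for whatever the package actually declares:
```go
// WorkspaceService sees function-typed slices of its neighbours, nothing more.
type WorkspaceService struct {
	workspaceLocks vmLockSet

	vmResolver      func(ref string) (*VMRecord, error)
	aliveChecker    func(id string) bool
	withVMLockByRef func(ref string, fn func(*VMRecord) error) error
	imageResolver   func(name string) (*ImageRecord, error)
	imageWorkSeed   func(name string) (string, error)
}

// Capabilities carry explicit service pointers, never *Daemon.
type dnsCapability struct{ net *HostNetwork }
type natCapability struct {
	vm     *VMService
	net    *HostNetwork
	logger *slog.Logger
}
```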
Services + capabilities are built eagerly by `wireServices(d)`, called
once from `Daemon.Open` after the composition root's infrastructure is
populated, and once per test that constructs a `&Daemon{...}` literal.
Tests that want to stub a particular service or the capability list
assign the field before calling `wireServices` — the helper is
idempotent and skips anything already set.
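A sketch of the helper's shape, reusing the assumed field names from the sketches above; the nil guards are what make it idempotent, and the closures read `d.vm` at call time, which is what breaks construction-order cycles:
```go
func wireServices(d *Daemon) {
	if d.hostNet == nil {
		d.hostNet = &HostNetwork{priv: d.priv}
	}
	if d.img == nil {
		d.img = &ImageService{}
	}
	if d.ws == nil {
		d.ws = &WorkspaceService{
			// Closures capture d, not d.vm: d.vm is dereferenced when
			// the dep is called, not when it is built, so the order of
			// the blocks in this function never matters.
			vmResolver:   func(ref string) (*VMRecord, error) { return d.vm.resolve(ref) },
			aliveChecker: func(id string) bool { return d.vm.alive(id) },
		}
	}
	if d.vm == nil {
		d.vm = &VMService{net: d.hostNet, img: d.img, ws: d.ws}
	}
}
```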
## Service state
### `HostNetwork` (`host_network.go`, `nat.go`, `dns_routing.go`, `tap_pool.go`, `snapshot.go`)
- `tapPool` — TAP interface pool, owns its own lock.
- `vmDNS *vmdns.Server` — in-process DNS server for `.vm` names.
- `privilegedOps` — the host-kernel seam used for bridge/tap/NAT,
resolver routing, dm snapshots, privileged ext4 mutation, and
firecracker ownership/kill flows.
- No direct VM-state access. Where an operation needs per-VM details
(e.g. `ensureNAT`), the signature takes plain `guestIP` and `tap`
strings so the caller (`VMService`) resolves them first, as sketched
below.
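For example (signature shape assumed):
```go
// HostNetwork never reads VM state; VMService resolves the guest IP
// and tap name first and passes plain strings down to the seam.
func (n *HostNetwork) ensureNAT(ctx context.Context, guestIP, tap string) error {
	return n.priv.EnsureNAT(ctx, guestIP, tap)
}
```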
### `ImageService` (`image_service.go`, `images.go`, `images_pull.go`, `image_seed.go`, `kernels.go`)
- `imageOpsMu sync.Mutex` — the publication-window lock. Held only
across the recheck-name + atomic-rename + UpsertImage commit atom.
Slow work (network fetch, ext4 build, SSH-key seeding) runs unlocked
(see the sketch after this list).
- Test seams `pullAndFlatten`, `finalizePulledRootfs`, `bundleFetch`
are struct fields (not package globals), so tests inject per-instance
fakes.
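Pull, for instance, follows this shape; apart from the documented `pullAndFlatten` seam and the `UpsertImage` commit, the helper names are assumed:
```go
func (s *ImageService) pull(ctx context.Context, name, ref string) error {
	// Slow phase, unlocked: network fetch + ext4 build via the seam.
	staged, err := s.pullAndFlatten(ctx, ref)
	if err != nil {
		return err
	}

	// Publication window: recheck-name + atomic-rename + UpsertImage.
	s.imageOpsMu.Lock()
	defer s.imageOpsMu.Unlock()
	if err := s.recheckNameFree(name); err != nil { // name may have appeared meanwhile
		return err
	}
	if err := os.Rename(staged, s.imagePath(name)); err != nil { // atomic publish
		return err
	}
	return s.store.UpsertImage(ctx, name) // commit the row under the same lock
}
```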
### `WorkspaceService` (`workspace_service.go`, `workspace.go`, `vm_authsync.go`)
- `workspaceLocks vmLockSet` — per-VM mutex scoped to
`workspace.prepare` / `workspace.export`. These ops acquire
`vmLocks[id]` (on VMService) only long enough to validate VM state
and snapshot the fields they need, then release it and acquire
`workspaceLocks[id]` for the slow guest I/O phase. That keeps
`vm stop` / `delete` / `restart` from queueing behind a running tar
import (see the sketch after this list).
- Test seams `workspaceInspectRepo`, `workspaceImport` are per-instance
fields.
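A sketch of that split, with assumed names for the snapshot fields and the lock-set accessor:
```go
func (s *WorkspaceService) prepare(ctx context.Context, ref string) error {
	// Phase 1: validate and snapshot under the per-VM lifecycle lock.
	var snap struct{ id, guestIP string }
	err := s.withVMLockByRef(ref, func(vm *VMRecord) error {
		if !s.aliveChecker(vm.ID) {
			return errors.New("vm is not running")
		}
		snap.id, snap.guestIP = vm.ID, vm.GuestIP
		return nil
	})
	if err != nil {
		return err
	}
	// vmLocks[id] is released here; stop/delete/restart can proceed.

	// Phase 2: slow guest I/O under workspaceLocks[id] only.
	mu := s.workspaceLocks.get(snap.id)
	mu.Lock()
	defer mu.Unlock()
	return s.importIntoGuest(ctx, snap.id, snap.guestIP)
}
```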
### `VMService` (`vm_service.go`, `vm_lifecycle.go`, `vm_create.go`, `vm_create_ops.go`, `vm_stats.go`, `vm_set.go`, `vm_disk.go`, `vm_handles.go`, `vm_authsync.go` (via WorkspaceService), `preflight.go`, `ports.go`, `vm.go`)
- `vmLocks vmLockSet` — per-VM `*sync.Mutex`, one per VM ID. Held for
the **entire lifecycle op** on that VM: `start` holds it across
preflight, bridge setup, firecracker spawn, and post-boot wiring
(seconds to tens of seconds). Two `start`/`stop`/`delete`/`set`
calls against the same VM therefore serialise; calls against
different VMs run independently.
- `createVMMu sync.Mutex` — narrow **reservation** mutex. `CreateVM`
resolves the image (possibly auto-pulling, which self-locks on
`imageOpsMu`) and parses sizing flags outside this lock, then holds
`createVMMu` only to re-check that the requested VM name is still
free, allocate the next guest IP, and insert the initial "created"
row. The subsequent boot flow runs under the per-VM lock only (see
the sketch after this list).
- `createOps opstate.Registry[*vmCreateOperationState]` — in-flight
async create operations; owns its own lock.
- `handles *handleCache` — in-memory map of per-VM transient kernel/
process handles (PID, tap device, loop devices, DM target). Each
VM directory holds a small `handles.json` scratch file so the
cache can be rebuilt at daemon startup.
- `vsockHostDevice` — path to `/dev/vhost-vsock` that the preflight
and doctor checks `RequireFile` against. Defaulted in `wireServices`;
tests point it at a tempfile so the check passes without the
kernel module loaded. Guest-SSH test seams live on `*Daemon`
(`d.guestWaitForSSH`, `d.guestDial`), not on `VMService` — workspace
prepare is the only path that reaches guest SSH, and it gets
there through closures `WorkspaceService` captured at wiring time.
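A sketch of the `CreateVM` reservation pattern referenced above; all helper names are assumed:
```go
func (s *VMService) createVM(ctx context.Context, name, imageRef string) (string, error) {
	// Outside the lock: image resolution may auto-pull and takes
	// imageOpsMu internally; never nest it under createVMMu.
	img, err := s.img.resolve(ctx, imageRef)
	if err != nil {
		return "", err
	}

	// Reservation window: name recheck + IP allocation + initial row.
	s.createVMMu.Lock()
	if s.nameTaken(name) {
		s.createVMMu.Unlock()
		return "", fmt.Errorf("vm %q already exists", name)
	}
	ip := s.nextGuestIP()
	id, err := s.store.InsertCreatedVM(ctx, name, img.ID, ip)
	s.createVMMu.Unlock()
	if err != nil {
		return "", err
	}

	// Boot flow runs under the per-VM lock only.
	return id, s.withVMLockByID(id, func() error { return s.boot(ctx, id) })
}
```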
## Subpackages
Stateless helpers with no need for a service pointer live in
subpackages. Each takes explicit dependencies (typically a
`system.Runner`-compatible interface) and holds no global state beyond
small test seams.
| Subpackage | Purpose |
| ---------------------------- | ---------------------------------------------------------------------- |
| `internal/daemon/opstate` | Generic `Registry[T AsyncOp]` for async-operation bookkeeping. |
| `internal/daemon/dmsnap` | Device-mapper COW snapshot create/cleanup/remove. |
| `internal/daemon/fcproc` | Firecracker process primitives (bridge, tap, binary, PID, kill, wait). |
| `internal/daemon/imagemgr` | Image subsystem pure helpers: validators, staging, build script gen. |
| `internal/daemon/workspace` | Workspace helpers: git inspection, copy prep, guest import script. |
All subpackages are leaves — no intra-daemon subpackage imports another.
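As an illustration of the leaf style, `opstate.Registry` plausibly reduces to a mutex-guarded generic map; the `AsyncOp` method set here is assumed:
```go
package opstate

import "sync"

// AsyncOp's method set is assumed for this sketch.
type AsyncOp interface{ Done() bool }

// Registry owns its own lock, so it sits at the leaf of the lock order.
type Registry[T AsyncOp] struct {
	mu  sync.Mutex
	ops map[string]T
}

func (r *Registry[T]) Put(id string, op T) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.ops == nil {
		r.ops = make(map[string]T)
	}
	r.ops[id] = op
}

func (r *Registry[T]) Get(id string) (T, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	op, ok := r.ops[id]
	return op, ok
}

// Prune drops completed operations; a ticker drives this for creates.
func (r *Registry[T]) Prune() {
	r.mu.Lock()
	defer r.mu.Unlock()
	for id, op := range r.ops {
		if op.Done() {
			delete(r.ops, id)
		}
	}
}
```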
## Lock ordering
Acquire in this order, release in reverse. Never acquire in the
opposite direction.
```
VMService.vmLocks[id] → WorkspaceService.workspaceLocks[id]
    → {VMService.createVMMu, ImageService.imageOpsMu}
    → subsystem-local locks
```
`vmLocks[id]` and `workspaceLocks[id]` are NEVER held at the same
time. `workspace.prepare` acquires `vmLocks[id]` just long enough to
validate VM state, releases it, then acquires `workspaceLocks[id]`
for the guest I/O phase. Regular lifecycle ops (`start`, `stop`,
`delete`, `set`) do NOT do this split — they hold `vmLocks[id]`
across the whole flow.
Subsystem-local locks (`tapPool.mu`, `opstate.Registry` mu,
`handleCache.mu`) are leaves. They do not contend with each other.
Notes:
- `vmLocks[id]` is the outer lock for any operation scoped to a single
VM. Acquired via `VMService.withVMLockByID` / `withVMLockByRef`. The
callback runs under the lock — treat the whole function body as
critical section.
- `createVMMu` is held only across the VM-name reservation + IP
allocation + initial UpsertVM. Image resolution and the full boot
flow happen outside it.
- `imageOpsMu` is held only across the publication atom (recheck name
+ atomic rename + UpsertImage, or the equivalent for Register /
Promote / Delete). Network fetch, ext4 build, and file copies run
unlocked.
- Holding a subsystem-local lock while calling into guest SSH is
discouraged; copy needed state out under the lock and release before
blocking I/O.
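For example, a handle-cache read that feeds guest SSH holds the leaf lock only for the map access (field names assumed); the blocking dial happens in the caller, after the lock is released:
```go
// handleCache is a leaf lock: held for the map read only, never across I/O.
func (c *handleCache) guestIP(id string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	h, ok := c.m[id]
	if !ok {
		return "", false
	}
	return h.GuestIP, true // caller dials guest SSH with no lock held
}
```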
## Reconcile and background work
`Daemon.reconcile(ctx)` is the orchestrator run at startup. It
rehydrates the handle cache, reaps stale VMs, and republishes DNS
records. `Daemon.backgroundLoop()` is the ticker fan-out —
`VMService.pollStats`, `VMService.stopStaleVMs`, and
`VMService.pruneVMCreateOperations` run on independent tickers. On the
supported system path, any reconcile-time host cleanup that needs
privilege goes through `privilegedOps`, not directly through the owner
daemon process.
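A sketch of the fan-out, with assumed intervals and an assumed `ctx` parameter:
```go
func (d *Daemon) backgroundLoop(ctx context.Context) {
	stats := time.NewTicker(2 * time.Second)  // pollStats
	stale := time.NewTicker(30 * time.Second) // stopStaleVMs
	prune := time.NewTicker(time.Minute)      // pruneVMCreateOperations
	defer stats.Stop()
	defer stale.Stop()
	defer prune.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-stats.C:
			d.vm.pollStats(ctx)
		case <-stale.C:
			d.vm.stopStaleVMs(ctx)
		case <-prune.C:
			d.vm.pruneVMCreateOperations()
		}
	}
}
```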
## External API
Only `internal/cli` imports this package. The surface is:
- `daemon.Open(ctx) (*Daemon, error)`
- `daemon.OpenSystem(ctx) (*Daemon, error)`
- `(*Daemon).Serve(ctx) error`
- `(*Daemon).Close() error`
- `daemon.Doctor(...)` — host diagnostics (no receiver).
All other methods live on the four services and are reached only
through the RPC `dispatch` switch in `daemon.go`. They are free to
move/rename during refactoring.
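The forwarder itself is a plain switch; method strings beyond the documented `workspace.*` pair and all handler signatures are assumed:
```go
// dispatch routes RPC methods to services; no behaviour beyond routing.
func (d *Daemon) dispatch(ctx context.Context, method string, req json.RawMessage) (any, error) {
	switch method {
	case "vm.start":
		return d.vm.start(ctx, req)
	case "image.pull":
		return d.img.pull(ctx, req)
	case "workspace.prepare":
		return d.ws.prepare(ctx, req)
	// ... one case per RPC method
	default:
		return nil, fmt.Errorf("unknown method %q", method)
	}
}
```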