Move the supported systemd path to two services: an owner-user bangerd for
orchestration and a narrow root helper for bridge/tap, NAT/resolver, dm/loop,
and Firecracker ownership. This removes repeated sudo from daily vm and image
flows without leaving the general daemon running as root.
Add install metadata, system install/status/restart/uninstall commands, and a
system-owned runtime layout. Keep user SSH/config material in the owner home,
lock file_sync to the owner home, and move daemon known_hosts handling out of
the old root-owned control path.
Route privileged lifecycle steps through typed privilegedOps calls, harden the
two systemd units, and rewrite smoke plus docs around the supported service
model.
Verified with make build, make test, make lint, and make smoke on the
supported systemd host path.
Four drift fixes from a doc sweep.
internal/daemon/doc.go
Replace the capability-hook description that still said "Hook
methods take *Daemon; VMService reaches them through a
capabilityHooks seam." Current reality: every capability is a
plain struct carrying its own service pointers
(workDiskCapability{vm,ws,store}, dnsCapability{net},
natCapability{vm,net,logger}); wireServices builds the default
list; no hook reaches *Daemon.
internal/daemon/ARCHITECTURE.md
The VMService field list still claimed guestWaitForSSH and
guestDial were "per-instance fields." Those were deleted as
refactor residue. Update the note to say the seams live on
*Daemon (reached by WorkspaceService via closures wired at
construction) and document the vsockHostDevice field that
replaced the old package-global vsockHostDevicePath.
AGENTS.md
Drop the "experimental web UI" mention (removed) and the
`session` subpackage (removed). Mention banger-vsock-agent as
the third cmd/ binary while we're here — AGENTS hadn't listed
it.
docs/kernel-catalog.md
The trust-model section still read as if upstream kernel sources
were fetched by HTTPS alone. Add a paragraph covering the PGP
verification make-generic-kernel.sh now does against the
detached .tar.sign and the three kernel.org release signing keys.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Update ARCHITECTURE.md's Composition section to reflect the finished
split: capabilities carry explicit service-pointer fields, nothing
reaches *Daemon at dispatch time, and wireServices(d) is the single
entry point that builds services + capabilities eagerly (from Open
in production, from tests after constructing &Daemon{...} literals).
Removes the paragraph admitting capability→*Daemon coupling and the
lazy-init getters justification, neither of which applies anymore.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 5 of the daemon god-struct refactor. Code motion landed in
phases 1-4; this commit retells the architecture so the docs match
the structure.
ARCHITECTURE.md loses the "deferred v0.2 project" hedge about
splitting services. The Composition section now describes the four
services (HostNetwork, ImageService, WorkspaceService, VMService)
that own behaviour, the consumer-defined seam pattern for
cross-service calls, and the lazy-init getter pattern that keeps
existing test literals compiling.
doc.go inventories which methods live on which service, and the
lock-ordering section gains the service prefixes (e.g.
VMService.vmLocks instead of bare vmLocks) so readers don't have to
guess which type owns which mutex.
No code changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Before: createVMMu was held across the whole of CreateVM — including
image resolution (which could fire a full auto-pull) and startVMLocked
(boot of multiple seconds). imageOpsMu was held across the whole of
PullImage/RegisterImage/PromoteImage/DeleteImage, so any slow OCI pull,
bundle download, or file copy blocked every other image mutation and
every other VM create that needed to auto-pull. The async create API
bought nothing if all creates serialised on the same mutex.
CreateVM is now three phases:
1. Validate + resolve image (possibly auto-pulling). No global lock.
2. reserveVM: take createVMMu only long enough to re-check the name
is free, allocate the next guest IP, and UpsertVM the "created"
row. Milliseconds.
3. startVMLocked: run the full boot flow under the per-VM lock only.
Parallel creates of different VMs now overlap on image resolution +
boot; they contend only across the reservation claim.
For the image surface a new publishImage helper isolates the commit
atom (recheck name free, atomic rename stagingDir→finalDir, UpsertImage)
under imageOpsMu. pullFromBundle + pullFromOCI do their network fetch
+ ext4 build + ownership fixup + agent injection outside the lock;
Register moves validation + kernel resolution outside; Promote moves
file copy + SSH-key seeding outside; Delete keeps a brief lock over
the lookup + reference check + store delete and does file cleanup
unlocked.
Two concurrency tests assert the new behaviour:
- TestPullImageDoesNotSerialiseOnDifferentNames fails the old code
(second pull blocks on imageOpsMu and never reaches the body).
- TestPullImageRejectsNameClashAtPublish confirms the publish-window
recheck is what enforces name uniqueness now that the body runs
unlocked — exactly one winner.
ARCHITECTURE.md updated to describe the new scope explicitly instead
of calling the locks "narrow".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two promises the doc was making that the code doesn't keep:
1. "Helpers moved out so the package stays focused on orchestration."
The package still has ~29 files and ~130 func (d *Daemon) methods
wiring VM lifecycle, image management, host networking, background
reconciliation, and JSON-RPC dispatch. Calling it "just orchestration"
sets readers up for surprise. Rewrite the subpackages preamble to
say so, and flag the service split as a post-v0.1.0 project.
2. "vmLocks[id] is held only across short synchronous state validation
and DB mutations." That's what workspace.prepare does; regular
lifecycle ops (start/stop/delete/set) go through withVMLockByRef
and hold the lock across the whole callback body, which for `start`
means preflight + bridge + firecracker spawn + post-boot wiring.
Rewrite the vmLocks bullet and the lock-ordering section to say
that explicitly, so readers don't build "surely my long flow under
the lock can't be what the doc means" reasoning on top of a false
premise.
Doc-only change. Code behaviour is unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cuts the daemon-managed guest-session machinery (start/list/show/
logs/stop/kill/attach/send). The feature shipped aimed at agent-
orchestration workflows (programmatic stdin piping into a long-lived
guest process) that aren't driving any concrete user today, and the
~2.3K LOC of daemon surface area — attach bridge, FIFO keepalive,
controller registry, sessionstream framing, SQLite persistence — was
locking in an API we'd have to keep through v0.1.0.
Anything session-flavoured that people actually need today can be
done with `vm ssh + tmux` or `vm run -- cmd`.
Deleted:
- internal/cli/commands_vm_session.go
- internal/daemon/{guest_sessions,session_lifecycle,session_attach,session_stream,session_controller}.go
- internal/daemon/session/ (guest-session helpers package)
- internal/sessionstream/ (framing package)
- internal/daemon/guest_sessions_test.go
- internal/store/guest_session_test.go
- GuestSession* types from internal/{api,model}
- Store UpsertGuestSession/GetGuestSession/ListGuestSessionsByVM/DeleteGuestSession + scanner helpers
- guest.session.* RPC dispatch entries
- 5 CLI session tests, 2 completion tests, 2 printer tests
Extracted:
- ShellQuote + FormatStepError lifted to internal/daemon/workspace/util.go
(only non-session consumer); workspace package now self-contained
- internal/daemon/guest_ssh.go keeps guestSSHClient + dialGuest +
waitForGuestSSH — still used by workspace prepare/export
- internal/daemon/fake_firecracker_test.go preserves the test helper
that used to live in guest_sessions_test.go
Store schema: CREATE TABLE guest_sessions and its column migrations
removed. Existing dev DBs keep an orphan table (harmless, pre-v0.1.0).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The web UI shipped as "experimental" and was never finished — no nav
off the dashboard, no live updates, no settled design, never a
supported surface. It was opt-in by default already; leaving the code
in the tree for v0.1.0 only invited "does this work?" questions and
kept HostSummary/BangerSummary/SudoStatus types on the public RPC
surface that nothing else uses.
Removed:
internal/webui/ (all Go + templates + assets)
internal/daemon/web.go (server start / Layout / Config / ListVMs / ListImages)
internal/daemon/dashboard.go (DashboardSummary aggregator)
Simplified:
internal/api/types.go drop WebURL on PingResult, drop
HostSummary / SudoStatus / BangerSummary /
DashboardSummary / DashboardSummaryResult
internal/model/types.go drop DaemonConfig.WebListenAddr
internal/config/config.go drop web_listen_addr from fileConfig + Load
internal/daemon/daemon.go drop webListener / webServer / webURL fields +
startWebServer() call + ping WebURL population
internal/cli/banger.go `daemon status` output no longer branches on web
internal/daemon/{doc.go,ARCHITECTURE.md} drop web UI sections
README.md drop web_listen_addr config bullet + security paragraph
Tests updated to reflect the new shape. Coverage 57.3 -> 58.9% (the
webui package was largely untested; its removal lifts the ratio
without moving the numerator). `banger daemon status` output and
--help are web-free. Lint + full suite green.
Separates what a VM IS (durable intent + identity + deterministic
derived paths — `VMRuntime`) from what is CURRENTLY TRUE about it
(firecracker PID, tap device, loop devices, dm-snapshot target — new
`VMHandles`). The durable state lives in the SQLite `vms` row; the
transient state lives in an in-memory cache on the daemon plus a
per-VM `handles.json` scratch file inside VMDir, rebuilt at startup
from OS inspection. Nothing kernel-level rides the SQLite schema
anymore.
Why:
Persisting ephemeral process handles to SQLite forced reconcile to
treat "running with a stale PID" as a first-class case and mix it
with real state transitions. The schema described what we last
observed, not what the VM is. Every time the observation model
shifted (tap pool, DM naming, pgrep fallback) the reconcile logic
grew a new branch. Splitting lets each layer own what it's good at:
durable records describe intent, in-memory cache + scratch file
describe momentary reality.
Shape:
- `model.VMHandles` = PID, TapDevice, BaseLoop, COWLoop, DMName,
DMDev. Never in SQLite.
- `VMRuntime` keeps: State, GuestIP, APISockPath, VSockPath,
VSockCID, LogPath, MetricsPath, DNSName, VMDir, SystemOverlay,
WorkDiskPath, LastError. All durable or deterministic.
- `handleCache` on `*Daemon` — mutex-guarded map + scratch-file
plumbing (`writeHandlesFile` / `readHandlesFile` /
`rediscoverHandles`). See `internal/daemon/vm_handles.go`.
- `d.vmAlive(vm)` replaces the 20+ inline
`vm.State==Running && ProcessRunning(vm.Runtime.PID, apiSock)`
spreads. Single source of truth for liveness.
- Startup reconcile: per running VM, load the scratch file, pgrep
the api sock, either keep (cache seeded from scratch) or demote
to stopped (scratch handles passed to cleanupRuntime first so DM
/ loops / tap actually get torn down).
Verification:
- `go test ./...` green.
- Live: `banger vm run --name handles-test -- cat /etc/hostname`
starts; `handles.json` appears in VMDir with the expected PID,
tap, loops, DM.
- `kill -9 $(pgrep bangerd)` while the VM is running, re-invoke the
CLI, daemon auto-starts, reconcile recognises the VM as alive,
`banger vm ssh` still connects, `banger vm delete` cleans up.
Tests added:
- vm_handles_test.go: scratch-file roundtrip, missing/corrupt file
behaviour, cache concurrency, rediscoverHandles prefers pgrep
over scratch, returns scratch contents even when process is
dead (so cleanup can tear down kernel state).
- vm_test.go: reconcile test rewritten to exercise the new flow
(write scratch → reconcile reads it → verifies process is gone →
issues dmsetup/losetup teardown).
ARCHITECTURE.md updated; `handles` added to Daemon field docs.
Previously withVMLockByRef held the per-VM mutex across InspectRepo,
waitForGuestSSH, dialGuest, ImportRepoToGuest (the tar stream!), and
the readonly chmod. A large repo could block `vm stop` / `vm delete`
/ `vm restart` on the same VM for however long the import took.
Split into two phases:
1. VM mutex held briefly to validate state (running + PID alive)
and snapshot the fields needed for SSH (guest IP, api sock).
2. VM mutex released. Acquire workspaceLocks[id] — a separate
per-VM mutex scoped to workspace.prepare / workspace.export —
for the guest I/O phase.
Lifecycle ops (stop/delete/restart/set) only take vmLocks, so they
no longer queue behind a slow import. Two concurrent prepares on the
same VM still serialise via workspaceLocks so tar streams don't
interleave. ExportVMWorkspace also acquires workspaceLocks to avoid
snapshotting a half-streamed import.
Two regression tests (sequential — they swap package-level seams):
ReleasesVMLockDuringGuestIO: stall the import fake, assert the VM
mutex is acquirable from another goroutine during the stall.
SerialisesConcurrentPreparesOnSameVM: 3 concurrent prepares, assert
Import is only ever invoked 1-at-a-time per VM.
ARCHITECTURE.md documents the split + updated lock ordering.
The `image build` flow spun up a transient Firecracker VM, SSHed in,
and ran a large bash provisioning script to derive a new managed
image from an existing one. It overlapped heavily with the golden-
image Dockerfile flow (same mise/docker/tmux/opencode install logic
duplicated in Go as `imagemgr.BuildProvisionScript`) and had far more
machinery: async op state, RPC begin/status/cancel, webui form +
operation page, preflight checks, API types, tests. For custom
images, writing a Dockerfile is simpler and more reproducible.
Removed end-to-end:
- CLI `image build` subcommand + `absolutizeImageBuildPaths`.
- Daemon: BuildImage method, imagebuild.go (transient-VM orchestration),
image_build_ops.go (async begin/status/cancel), imagemgr/build.go
(the 247-line provisioning script generator and all its append*
helpers), validateImageBuildPrereqs + addImageBuildPrereqs.
- RPC dispatches for image.build / .begin / .status / .cancel.
- opstate registry `imageBuildOps`, daemon seam `imageBuild`,
background pruner call.
- API types: ImageBuildParams, ImageBuildOperation, ImageBuildBeginResult,
ImageBuildStatusParams, ImageBuildStatusResult; model type
ImageBuildRequest.
- Web UI: Backend interface methods, handlers, form, routes, template
branches (images.html build form, operation.html build branch,
dashboard.html Build button).
- Tests that directly exercised BuildImage.
Doctor polish (task C):
- Drop the "image build" preflight section entirely (its raison d'être
is gone).
- Default-image check now accepts "not local but in imagecat" as OK:
vm create auto-pulls on first use. Only flag when the image is
neither locally registered nor in the catalog.
Net: 24 files touched, 1,373 lines deleted, 25 added.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
internal/daemon/doc.go and ARCHITECTURE.md were written before the
subpackage extractions and still referenced old structure (in-progress
phrasing, missing opstate/dmsnap/fcproc/imagemgr/session/workspace,
mentions of opRegistry by its old name). Both now describe the current
shape: composition root + six leaf subpackages, lock ordering rooted
at vmLocks[id], and the one intra-package dependency (workspace →
session for ShellQuote + FormatStepError).
README.md and AGENTS.md mark the local web UI as experimental. It is
still enabled by default at 127.0.0.1:7777, but the docs now state
plainly that its surface is not stable or hardened and not intended for
anything beyond single-user localhost use. AGENTS.md also points at
ARCHITECTURE.md for the subpackage layout.
No code changes; tests still green.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Daemon no longer owns a coarse mu shared across unrelated concerns.
Each subsystem now carries its own state and lock:
- tapPool: entries, next, and mu move onto a new tapPool struct.
- sessionRegistry: sessionControllers + its mutex move off Daemon.
- opRegistry[T asyncOp]: generic registry collapses the two ad-hoc
vm-create and image-build operation maps (and their mutexes) into one
shared type; the Begin/Status/Cancel/Prune methods simplify.
- vmLockSet: the sync.Map of per-VM mutexes moves into its own type;
lockVMID forwards.
- Daemon.mu splits into imageOpsMu (image-registry mutations) and
createVMMu (CreateVM serialisation) so image ops and VM creates no
longer block each other.
Lock ordering collapses to vmLocks[id] -> {createVMMu, imageOpsMu} ->
subsystem-local leaves. doc.go and ARCHITECTURE.md updated.
No behavior change; tests green.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
vm.go (1529 LOC) splits into vm_create, vm_lifecycle, vm_set, vm_stats,
vm_disk, vm_authsync; firecracker/DNS/helpers stay in vm.go.
guest_sessions.go (1266 LOC) splits into session_controller,
session_lifecycle, session_attach, session_stream; scripts and helpers
stay in guest_sessions.go.
Mechanical move only. No behavior change. Adds doc.go and
ARCHITECTURE.md capturing subsystem map and current lock ordering as
the baseline for the upcoming subsystem extraction.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>