vm run ./repo (and the explicit vm workspace prepare) imports the
host user's own checkout. Any .mise.toml that lands in the guest
would otherwise prompt on the first guest command — 'mise trust:
hash mismatch, run "mise trust"' — and stall what should be a
zero-friction sandbox launch. The repo just came from the host,
the guest is single-tenant root@<vm>.vm, the user already trusts
this checkout: auto-trust is the right default here.
After workspaceImportHook succeeds, run
if command -v mise >/dev/null 2>&1; then
mise trust --quiet --all <guest_path> || true
fi
inside the guest. Best effort: a missing mise binary, a non-zero
exit, or a no-op trust all log at debug only and never fail
prepare. The path is shell-quoted via ws.ShellQuote so guest
paths with spaces or quotes don't break the argument.
Tests pin the script shape (command -v guard + --quiet --all flag
+ trailing `|| true`) and assert the script actually fires after
a successful import. A path with an apostrophe round-trips via
ws.ShellQuote without truncation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three recovery-path errors were silently dropped:
- vm_lifecycle.go startVMLocked persisted the VMStateError record
with `_ = s.store.UpsertVM(...)`. If the persist failed the user
saw the original start error but operators had no way to find
out the store had also drifted out of sync.
- vm_lifecycle.go deleteVMLocked killed the firecracker process
with `_ = s.net.killVMProcess(...)`. cleanupRuntime tears it
down regardless, so the explicit kill is best-effort, but a
permission-denied / EPERM was still worth logging.
- capabilities.go cleanupPreparedCapabilities collected per-cap
errors with errors.Join. Callers get the aggregated value but
couldn't tell which capability failed when more than one did.
All three now log Warn before the original behaviour continues.
The aggregate return value, control flow, and user-visible error
strings are unchanged — this is purely a "less silence in the
journal" pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Today there's no way to correlate a CLI failure with a daemon log
line. operationLog records relative timing but no id, two concurrent
vm.start calls log indistinguishably, and the async
vmCreateOperationState.ID is user-facing yet never reaches the
journal. The root helper logs plain text to stderr while bangerd
logs JSON, so a merged journalctl is hard to grep across the
trust-boundary split.
Mint a per-RPC op id at dispatch entry, store it on context, and
include it as an "op_id" attr on every operationLog record. The
id is stamped onto every error response (including the early
short-circuit paths bad_version and unknown_method). rpc.Call
forwards the context op id on requests so a daemon RPC and the
helper RPCs it triggers all share one id. The helper now logs
JSON to match bangerd, adopts the inbound id, and emits a single
"helper rpc completed" / "helper rpc failed" line per call so
operators can see at a glance how long each privileged op took.
vmCreateOperationState.ID is now the same id dispatch generated
for vm.create.begin — one identifier between client status polls,
daemon logs, and helper logs.
The wire format gains two optional fields: rpc.Request.OpID and
rpc.ErrorResponse.OpID, both omitempty so older peers (and the
opposite direction) ignore them. ErrorResponse.Error() now appends
"(op-XXXXXX)" to its string form when set; existing callers that
just print err.Error() get the id for free.
Tests cover: dispatch stamps op_id on unknown_method, bad_version,
and handler-returned errors; rpc.Call exposes the typed
*ErrorResponse via errors.As so the CLI can read code/op_id; ctx
op_id is forwarded to the server in the request envelope.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The fsck_snapshot lifecycle step exists to repair stale bitmaps in
a COW file reused from a prior aborted start — without it, the
later e2cp/e2rm calls in patch_root_overlay refuse to touch the
snapshot. On a freshly-created COW there are no stale bitmaps to
repair, so e2fsck -fy is pure overhead.
system_overlay already tracks whether it created the file this run
(sc.systemOverlayCreated, used to drive the rollback path). Reuse
that flag to skip e2fsck entirely on the create-fresh path. The
reused-COW path keeps the fsck for safety. Saves a few hundred ms
per VM create — small absolute win on top of the lazy-mkfs change,
but free.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mkfs.ext4 zeroes the entire inode table and journal at format time
unless told otherwise. On an 8 GiB work disk that's roughly 500-700ms
of host CPU/IO per 'banger vm create', for a one-time small per-write
penalty inside the guest the first time it touches an unwritten
inode that nobody can perceive.
Centralise the canonical mkfs -E option list as
system.MkfsExtraOptions and use it everywhere banger calls mkfs.ext4
on a VM-internal image: the no-seed work disk, MaterializeWorkDisk,
BuildWorkSeedImage, and the imagepull rootfs builder. The work-disk
paths feed vm create directly; the others are one-off but still
benefit from the faster format.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Old flow on every 'banger vm run' that hit the seeded path:
CopyFilePreferClone the seed file (FICLONE attempt + io.Copy + fsync
fallback), then e2fsck -fp + resize2fs to grow the FS to the spec
size. On filesystems without reflink support that meant pushing
512+ MiB through the kernel followed by a full filesystem check
and resize, even though the seed only carries a few KB of dotfiles
— minWorkSeedBytes is 512 MiB but the actual payload is tiny.
That is the minute-long stall on the 'cloning work seed' stage
users see today.
Replace the copy with a sized fresh ext4: truncate to
WorkDiskSizeBytes, mkfs.ext4 -F -E root_owner=0:0, debugfs rdump
to extract the seed's contents, then ingest each file via the
sudoless ext4 toolkit (MkdirExt4 / WriteExt4FileOwned, root:root,
mode preserved). Sub-second regardless of seed size or requested
work-disk size; no fsck or resize needed because the FS is created
at its final size from the start.
Also drop the now-implementation-pinned
TestEnsureWorkDiskClonesSeedImageAndResizes — its premise (a
scripted e2fsck/resize2fs sequence) no longer reflects the code,
and smoke covers the new flow end to end. Stage label changed
from 'cloning work seed' to 'applying work seed' to match what
actually happens.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 'docker' bit on model.Image was unused at runtime — every code
path that branched on it had been removed earlier, leaving only the
field, the SQL column, the --docker flag, and the
#feature:docker sentinel that BuildMetadataPackages emitted into a
hash file. None of those have callers anymore.
Strip the field from the model, the API params, the SQLite column,
the CLI flag, and BuildMetadataPackages's signature. Add migration
2 (drop_images_docker) so existing installs lose the column on next
daemon start. ALTER TABLE ... DROP COLUMN is fine: SQLite has
supported it since 3.35 (2021).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous check tried to bind 127.0.0.1:42069 and warned on
'address already in use' — which is exactly the state when the
banger daemon is running, the case the user ran 'doctor' to
confirm. The warning was actively misleading.
Now, on 'address already in use', probe the listener with a
*.vm DNS query that only banger's vmdns server answers
authoritatively (NXDOMAIN with Authoritative=true). If the shape
matches we pass; if the port is held by something else we still
warn. Tests cover both branches: a real vmdns server is accepted,
and a silent UDP listener on the same port is rejected.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move the supported systemd path to two services: an owner-user bangerd for
orchestration and a narrow root helper for bridge/tap, NAT/resolver, dm/loop,
and Firecracker ownership. This removes repeated sudo from daily vm and image
flows without leaving the general daemon running as root.
Add install metadata, system install/status/restart/uninstall commands, and a
system-owned runtime layout. Keep user SSH/config material in the owner home,
lock file_sync to the owner home, and move daemon known_hosts handling out of
the old root-owned control path.
Route privileged lifecycle steps through typed privilegedOps calls, harden the
two systemd units, and rewrite smoke plus docs around the supported service
model.
Verified with make build, make test, make lint, and make smoke on the
supported systemd host path.
Before this change `banger image pull` (both OCI-direct and bundle
paths) shipped images with an empty WorkSeedPath — the BuildWorkSeedImage
helper existed only behind the hidden `banger internal work-seed` CLI.
Every pulled image hit ensureWorkDisk's no-seed branch, and the guest
booted with a bare /root (no .bashrc, no .profile, none of the distro
defaults).
Pull now calls BuildWorkSeedImage after the rootfs is finalised (OCI)
or fetched (bundle). The builder is behind a new `workSeedBuilder` test
seam so existing pull tests don't accidentally demand sudo mount. The
build failure is non-fatal: any error logs a warning and leaves
WorkSeedPath empty — images stay publishable even if the pulled rootfs
has no /root to extract.
Verified end-to-end by wiping the cached smoke image and re-pulling:
work-seed.ext4 lands in the artifact dir next to rootfs.ext4, and all
21 smoke scenarios pass.
Also refreshes the "feature /root work disk" fallback tooling check —
the no-seed path no longer touches mount/umount/cp after commit
0e28504, so the doctor check now only requires truncate + mkfs.ext4.
The warn copy updates from "new VM creates will be slower" to "guest
/root will be empty", which matches the actual tradeoff post-refactor.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both helpers are stranded: commit f068536 dropped their last callers
from ensureAuthorizedKeyOnWorkDisk and seedAuthorizedKeyOnExt4Image,
and commit 6ab1a2b dropped the ensureGitIdentity / runFileSync calls
that still held them up. Every on-disk-patch code path now drives the
ext4 image directly via MkdirExt4 / WriteExt4FileOwned /
EnsureExt4RootPerms.
Also drops TestFlattenNestedWorkHomeCopiesEntriesIndividually —
premise gone with the function. The sshd_config_test comment
referencing normaliseHomeDirPerms now points at EnsureExt4RootPerms.
Net sudo reduction across the five-commit series: work-disk creation,
authsync, image seeding, git identity sync, and file_sync all drop
sudo entirely against user-owned ext4 files. Remaining sudo in
internal/daemon is confined to firecracker process launch, tap/dm
device setup, iptables/NAT, and dmsnap/fcproc — things that
legitimately need CAP_SYS_ADMIN or CAP_NET_ADMIN. MountTempDir stays
on exclusively as an image-build helper.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ensureGitIdentityOnWorkDisk, writeGitIdentity, runFileSync, and
copyHostDir all dropped their mount + sudo install/mkdir/chmod/chown
scaffolding. Every write now goes through MkdirExt4,
WriteExt4FileOwned, ReadExt4File, and the new MkdirAllExt4 helper —
all sudoless against user-owned ext4 images.
Net effect with the prior two commits: ensureWorkDisk, authsync, image
seeding, git identity sync, and file_sync no longer mount the work
disk or spawn sudo mkdir/chmod/chown/cat/install. Only the
image-build path (which legitimately produces root-owned artifacts)
still touches MountTempDir.
The filesystemRunner test harness grew a small debugfs/e2cp/e2rm
emulator so the WorkspaceService tests keep exercising their real
code paths without a live ext4 image. The mock is deliberately
dumb — it only implements the subset runFileSync and writeGitIdentity
drive.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ensureAuthorizedKeyOnWorkDisk and seedAuthorizedKeyOnExt4Image both
drove mount + sudo mkdir/chmod/chown/cat/install to patch
/.ssh/authorized_keys into a work disk or work-seed. Both now delegate
to a shared provisionAuthorizedKey helper that uses the ext4 toolkit
introduced in 7704396 — EnsureExt4RootPerms + MkdirExt4 +
Ext4PathExists/ReadExt4File + WriteExt4FileOwned. No mount, no sudo,
no host-path staging.
Drops ~10 sudo call sites from the VM create and image pull flows
and deletes the TestEnsureAuthorizedKeyOnWorkDiskRepairsNestedRootLayout
premise (flattenNestedWorkHome will disappear entirely in the next
commit — the no-seed path no longer copies /root, and the work-seed
path produces flat seeds).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The no-seed branch used to mount the base rootfs read-only, mount
the freshly mkfs'd work disk read-write, sudo-cp /root from one to
the other, then flatten any accidental /root/root/ nesting. Five
sudo call sites packed into a fallback that the common image path
doesn't even exercise.
Replace with: `mkfs.ext4 -F -E root_owner=0:0` and nothing else.
mkfs already stamps inode 2 as root:root:0755 — sshd's StrictModes
walks that dir's ownership when the work disk mounts at /root in
the guest, so getting it right from mkfs means authsync can just
write authorized_keys without any repair pass.
Tradeoff: no-seed VMs lose the base rootfs's default /root dotfiles
(.bashrc, .profile). The no-seed path is explicitly the degraded
fallback — `banger doctor` already warns about it — and users who
want those back have two documented knobs: rebuild the image with
a work-seed, or land them via [[file_sync]].
Sudo call sites removed: 5 (MountTempDir × 2, sudo cp -a,
flattenNestedWorkHome's chmod/cp/rm). flattenNestedWorkHome itself
stays alive for now — authsync + image_seed still call it — and
gets deleted in commit 5 once its last caller goes away.
While here: fix the freshly-added EnsureExt4RootPerms helper.
`set_inode_field <2> mode N` overwrites the full i_mode word
instead of preserving the type nibble, so the initial
implementation that passed just the permission bits (0755) would
reset the fs root to regular-file shape and break the next kernel
mount with "Structure needs cleaning." The corrected call OR's in
S_IFDIR (0o040000) explicitly. Test updated to match.
Smoke: 21/21 scenarios green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Preserve cleanup after daemon restarts and harden OCI and tar imports
against filenames that debugfs cannot encode safely.
Mirror tap, loop, and dm teardown identity onto VM.Runtime, teach
cleanup and reconcile to fall back to those persisted fields when
handles.json is missing or corrupt, and clear the recovery state on
stop, error, and delete paths.
Reject debugfs-hostile entry names during flattening and in
ApplyOwnership itself, then add regression coverage for corrupt
handles.json recovery and unsafe import paths.
Verified with targeted go tests, make lint-go, make lint-shell, and
make build.
Closes commit 3 of the god-service decomposition. VMService still
owned 45+ methods after the startVMLocked extraction and RPC table
landed in commits 1 and 2. Stats / ports / health / vsock-ping sit
in a corner of that surface that doesn't share any state with
lifecycle orchestration — nothing about "what's this VM's CPU
doing" belongs in the same service as Create/Start/Stop/Delete/Set.
New StatsService owns:
- GetVMStats / getVMStatsLocked / collectStats (stats collection)
- HealthVM / PingVM (vsock-agent health probe)
- PortsVM + buildVMPorts + probeWebListener + probeHTTPScheme +
dedupeVMPorts (listening-port enumeration)
- pollStats (background ticker refresh)
- stopStaleVMs (auto-stop sweep past config.AutoStopStaleAfter)
The three VMService touch-points stats genuinely needs — vmAlive,
vmHandles, the per-VM lock helpers, plus cleanupRuntime for the
stale-sweep tear-down — come in as function-typed closures, not a
*VMService pointer. StatsService has no back-reference to its
sibling. Mirrors the dependency-struct pattern WorkspaceService
already uses.
Wiring: d.stats is populated in wireServices AFTER d.vm (closures
must see a non-nil d.vm at call time). Dispatch table's four
entries (vm.stats / vm.health / vm.ping / vm.ports) now resolve
through d.stats. Background loop's pollStats / stopStaleVMs
tickers do the same. Dispatch surface from the RPC client's
perspective is byte-identical.
After this commit:
- vm_stats.go and ports.go are deleted; their content (plus the
stats-specific fields) lives in stats_service.go.
- VMService loses 12 methods. It's still the biggest service
(~30 methods, all lifecycle-supporting: handle cache, disk
provisioning, preflight, create-ops registry, lock helpers,
the lifecycle verbs themselves) but it's finally one coherent
concern instead of five.
Tests:
- TestWireServicesInstantiatesStatsService — pins that the
wiring order puts d.stats non-nil + its five closures all
populated. Prevents a silent background-loop regression.
- All existing tests that called d.vm.HealthVM / d.vm.PingVM /
d.vm.PortsVM / d.vm.collectStats were re-pointed at d.stats.
Smoke: all 21 scenarios green, including vm ports (exercises the
new PortsVM entry end-to-end) and the long-running workspace
scenarios (exercise the background stats poller implicitly).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The dispatch method was a single ~240-line switch of 34 cases, each
following the same pattern: decode params into some type P, call a
service method returning (R, error), wrap R in a result struct and
either marshalResultOrError-encode or return a raw rpc.NewError.
Adding a method was a 4-line ceremony per site, and grepping for
"methods banger speaks" meant reading the full switch.
New shape, in internal/daemon/dispatch.go:
- handler is the uniform `func(ctx, d, req) rpc.Response` type
every method dispatches through.
- paramHandler[P, R] is the generic wrapper that absorbs 28 of
the 34 cases (decode, call, marshal). No reflection — P and R
are deduced from the service-call literal, so each map entry
is a one-liner referencing a small adapter func.
- noParamHandler[R] is the decode-free variant for 6 methods
that don't carry params.
- rpcHandlers is the single source of truth for which methods
exist and which adapter they dispatch to.
- Four specials (ping, shutdown, vm.logs, vm.ssh) stay as named
`handler`-typed functions: ping/shutdown encode with raw
rpc.NewResult, vm.logs/vm.ssh need pre-service validation to
emit distinct error codes (not_found, not_running) that the
generic wrapper maps uniformly to operation_failed.
Daemon.dispatch shrinks from a 240-line switch to 11 lines:
version check, test-only handler short-circuit, table lookup,
invoke-or-unknown.
Tests:
- TestRPCHandlersMatchDocumentedMethods — keyset guard. Adding
or removing a method without updating the expected slice is a
red flag the test surfaces.
- TestRPCHandlersAllNonNil — catches nil-function registrations.
All pre-existing dispatch tests (param decode, error codes, etc.)
keep passing unchanged — the handler contract for any given
method is byte-identical from the RPC client's perspective. Smoke
(all 21 scenarios) exercises every code path end-to-end.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
startVMLocked was a ~260-line method running 18 sequential phases
with one lumped error path: on any failure, cleanupOnErr called
cleanupRuntime — a catch-all teardown that didn't distinguish
"this phase acquired resources we should undo" from "this phase is
idempotent." The blast radius was the entire VM lifecycle. Every
tweak to boot, NAT, disk, or auth-sync orchestration had to reason
about a closure that could fire at any of 18 points.
This commit extracts the phases into a data-driven pipeline:
- startContext threads the mutable state (vm, live, apiSock,
dmName, tapName, etc.) through every step by pointer so step
bodies mutate in place without returning copies.
- startStep carries the op.stage name, optional vmCreateStage
progress ping, optional log attrs, a run closure, and an
optional undo closure.
- runStartSteps walks steps in order, appends the failing step
to the rollback set (so partial-acquire failures like
machine.Start's post-spawn HTTP config get their undo fired),
then iterates the rollback set in reverse and joins errors
via errors.Join.
Each phase that acquires a resource now owns its own undo:
system_overlay removes a file it created, dm_snapshot cleans up
the loop + DM handles it set, prepare_host_features delegates to
capHooks.cleanupState, tap releases via releaseTap, metrics_file
removes the file, firecracker_launch kills the spawned PID and
drops the sockets, post_start_features calls capHooks.cleanupState
again (capability Cleanup hooks are idempotent — safe to call
whether PostStart reached every cap or not). The 11 phases with
no teardown obligation leave `undo` nil and the driver silently
skips them on rollback.
cleanupRuntime is retired from the start-failure path. It stays
intact for reconcile, stopVMLocked, killVMLocked, deleteVMLocked,
stopStaleVMs — the crash-recovery / lifecycle-teardown contract
those paths rely on is unchanged.
startVMLocked shrinks from ~225 lines of sequential-phase code
plus a cleanupOnErr closure to ~45 lines: compute derived paths,
build the step list, drive it, persist ERROR state on failure.
Stage names preserved 1:1 so existing log grep + the async-create
progress stream stay compatible.
Tests:
- TestRunStartSteps_RollsBackInReverseOnFailure — the contract
is pinned: succeeded-before-failing run, all their undos in
reverse, failing step's undo also fires, original err still
visible via errors.Is.
- TestRunStartSteps_SkipsNilUndos — optional-undo contract.
- TestRunStartSteps_JoinsRollbackErrors — undo failures don't
hide the root cause.
- TestRunStartSteps_HappyPathNoRollback — success path never
fires any undo.
Smoke: all 21 scenarios pass, including the start-path ones
(bare vm run, workspace vm run, vm restart, vm lifecycle, vm set
reconfig) that exercise real firecracker boots end-to-end.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cleanup identity for kernel objects was split across two sources of
truth: vm.Runtime (DB-backed, durable) held paths and the guest IP,
but the TAP name lived only in the in-process handle cache + the
best-effort handles.json scratch file next to the VM dir. Every
other cleanup-identifying datum has a fallback — firecracker PID
can be rediscovered via `pgrep -f <apiSock>`, loops via losetup, dm
name from the deterministic ShortID(vm.ID). The tap is the one
truly cache-only datum (allocated from a pool, not derivable).
That made NAT teardown fragile:
- daemon crash between `acquireTap` and the handles.json write
- handles.json corrupt on the next daemon start
- partial cleanup that already zeroed the cache
In any of those cases natCapability.Cleanup short-circuited
("skipping nat cleanup without runtime network handles") and the
per-VM POSTROUTING MASQUERADE + the two FORWARD rules keyed off
the tap would leak. The VM row in the DB still existed, so a retry
couldn't close the loop — the tap name was simply gone.
Fix: mirror TapDevice onto model.VMRuntime (serialised via the
existing runtime_json column, omitempty so existing rows upgrade
cleanly). Set it in startVMLocked right next to the
s.setVMHandles call that seeds the in-memory cache; clear it at
every post-cleanup reset site (stop normal path + stop stale
branch, kill normal path + kill stale branch, cleanupOnErr in
start, reconcile's stale-vm branch, the stats poller's auto-stop
path).
Fallbacks now cascade:
- natCapability.Cleanup: handles cache → Runtime.TapDevice
- cleanupRuntime (releaseTap): handles cache → Runtime.TapDevice
Both surfaces refuse gracefully (old behaviour) only when neither
source has a value, which really does mean "no tap was ever
allocated for this VM" rather than "we lost track of it."
Test: TestNATCapabilityCleanup_FallsBackToRuntimeTapDevice clears
the handle cache, sets vm.Runtime.TapDevice, and asserts Cleanup
reaches the runner — the exact scenario the review flagged as a
plausible leak and the exact code path that now guarantees it
doesn't.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A user who sets `[[file_sync]] host = "~/.aws"` (per the README's
own example) can unintentionally copy files from outside that
directory if .aws contains symlinks. copyHostDir used os.Stat
during recursion, which transparently follows: a symlink to a
credential dir elsewhere would be recursed into, materialising
unrelated secrets inside the guest. For credential trees that's
an avoidable sprawl vector.
Switched copyHostDir's per-entry probe from os.Stat to os.Lstat
and added a default skip-with-warning branch for ModeSymlink.
Files and dirs at the SAME level copy as before; symlinks (both
file and directory flavours) surface a "file_sync skipped
symlink (would escape the requested tree)" warn log and are
otherwise omitted.
Top-level entry paths still follow — the Stat in runFileSync is
unchanged. The user explicitly named that path, so resolving
"~/.aws" through a symlink out of $HOME is on them.
Tests:
- TestRunFileSyncSkipsNestedSymlinks — builds a synced dir with
both a file symlink and a directory symlink pointing outside
the tree; asserts real files copy, symlinks do not materialise
anywhere in the guest mount, and each skipped symlink surfaces
a warn log entry.
README updated with a one-line note about the skip behaviour so
users know to expect it rather than chasing "why didn't my file
show up."
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A VM name flows into five places that all have narrower grammars
than "arbitrary string":
- the guest's /etc/hostname (vm_disk.patchRootOverlay)
- the guest's /etc/hosts (same)
- the <name>.vm DNS record (vmdns.RecordName)
- the kernel command line (system.BuildBootArgs*)
- VM-dir file-path fragments (layout.VMsDir/<id>, etc.)
Nothing in the chain was validating the input. A name with
whitespace, newline, dot, slash, colon, or = would produce broken
hostnames, weird DNS labels, smuggled kernel cmdline tokens, or
(in the worst case) surprising traversal through the on-disk
layout. Not host shell injection — we already avoid shelling out
with the raw name — but a real correctness and supportability bug.
New: model.ValidateVMName. Rules:
- 1..63 chars (DNS label max per RFC 1123; also a comfortable
/etc/hostname cap)
- lowercase ASCII letters, digits, '-' only
- no leading or trailing '-'
- no normalization — the name is the user-visible identifier
(store key, `ssh <name>.vm`, `vm show`); silently rewriting
"MyVM" → "myvm" would hand the user back something different
than they typed
Called from two places:
- internal/cli/commands_vm.go vmCreateParamsFromFlags — rejects
bad `--name` values before any RPC. Empty name still passes
through so the daemon can generate one.
- internal/daemon/vm_create.go reserveVM — defense in depth for
any non-CLI RPC caller (SDK, direct JSON over the socket).
Tests:
- internal/model/vm_name_test.go — exhaustive character-class
matrix (space, newline, tab, dot, slash, colon, equals, quote,
control chars, unicode letters, uppercase, leading/trailing
hyphen, over-length, max-length-exact, digits-only).
- internal/cli TestVMCreateParamsFromFlagsRejectsInvalidName —
CLI wire-through + empty-name passthrough.
- internal/daemon TestReserveVMRejectsInvalidName — daemon
defense-in-depth (including `box/../evil` path-traversal).
- scripts/smoke.sh — end-to-end rejection + no-leaked-row
assertion.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
banger hasn't shipped a public release — every "legacy", "pre-opt-in",
"previously", "migration note", "no longer" reference in the tree is
pinning against a state no real user's install has ever been in.
That scaffolding has weight: it's a coordinate system future readers
have to decode, and it keeps dead code alive.
Removed (code):
- internal/daemon/ssh_client_config.go
- vmSSHConfigIncludeBegin / vmSSHConfigIncludeEnd constants and
every `removeManagedBlock(existing, vm...)` call they enabled
(legacy inline `Host *.vm` block scrub)
- cleanupLegacySSHConfigDir (+ its caller in syncVMSSHClientConfig)
— wiped a pre-opt-in sibling file under $ConfigDir/ssh
- sameDirOrParent + resolvePathForComparison — only ever used
by cleanupLegacySSHConfigDir
- the "also check legacy marker" fallback in
UserSSHIncludeInstalled / UninstallUserSSHInclude
- internal/store/migrations.go
- migrateDropDeadImageColumns (migration 2) + its slice entry
- dropColumnIfExists (orphaned after the above)
- addColumnIfMissing + the whole "columns added across the pre-
versioning lifetime" block at the end of migrateBaseline —
subsumed into the baseline CREATE TABLE
- `packages_path TEXT` column on the images table (the
throwaway migration 2 dropped it, but there was never any
reader)
- internal/daemon/vm.go
- vmDNSRecordName local wrapper — was justified as "avoid
pulling vmdns into every file"; three of four callers already
imported vmdns directly, so inline the one stray call
- internal/cli/cli_test.go
- TestLegacyRemovedCommandIsRejected (`tui` subcommand never
shipped)
Removed / simplified (tests):
- ssh_client_config_test.go: dropped TestSameDirOrParentHandlesSymlinks,
TestSyncVMSSHClientConfigPreservesUserKeyInLegacyDir,
TestSyncVMSSHClientConfigNarrowsCleanupToLegacyFile,
TestSyncVMSSHClientConfigLeavesUnexpectedLegacyContents,
TestInstallUserSSHIncludeMigratesLegacyInlineBlock, plus the
"legacy posture" regression strings in the remaining happy-path
test; TestUninstallUserSSHIncludeRemovesBothMarkerBlocks collapsed
to a single-block test
- migrations_test.go: dropped TestMigrateDropDeadImageColumns_AcrossInstallPaths,
TestDropColumnIfExistsIsIdempotent; TestOpenReadOnlyDoesNotRunMigrations
simplified to test against the baseline marker
Removed (docs):
- README.md "**Migration note.**" blockquote about the SSH-key path move
- docs/advanced.md parenthetical "(the old behaviour)"
Reworded (comments):
- Dropped "Previously this file also contained LogLevel DEBUG3..."
history from vm_disk.go's sshdGuestConfig doc
- Dropped "Call sites that previously read vm.Runtime.{PID,...}"
from vm_handles.go; now documents the current contract
- Dropped "Pre-v0.1 the defaults are" scaffolding in doctor_test.go
- Dropped "no longer does its own git inspection" phrasing in vm_run.go
- Dropped the "(also cleans up legacy inline block from pre-opt-in
builds)" aside on the `ssh-config` CLI docstring
- Renamed test var `legacyKey` → `existingKey` in vm_test.go; its
purpose was "pre-existing authorized_keys line," not banger-legacy
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
--readonly ran `chmod -R a-w` over the workspace after copying, but
every banger guest boots as root, and root bypasses DAC mode checks.
So a user running `vm workspace prepare ... --readonly` got the
mode bits set to 0444 but `echo x >> file` in the guest still
succeeded. The flag promised enforcement it couldn't deliver.
The feature also doesn't match the product model: workspaces are
prepared precisely so the guest CAN edit them, and `workspace
export` exists to pull those edits back as a patch. A
"read-only workspace" contradicts that loop.
Removed:
- CLI flag `--readonly` on `vm workspace prepare`
- api.VMWorkspacePrepareParams.ReadOnly field
- model.WorkspacePrepareResult.ReadOnly field
- daemon chmod dispatch in prepareVMWorkspaceGuestIO
- smoke scenario pinning the (advisory) mode-bit behavior
- misleading "exportbox-readonly" VM name in an unrelated export
test (the test is about not mutating the real git index;
renamed to exportbox-noindex-mutation)
If real enforcement becomes a user need later, the right primitive
is `chattr +i` (immutable bit — root CAN'T write) or a ro bind-mount.
Reintroducing a new flag is cheaper than debugging what the current
one actually guarantees.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two defects compounded to make `vm create X` → `vm stop X` → `vm start X`
→ `vm ssh X` fail with `not_running: vm X is not running` even though
`vm show` reports `state=running`.
1. firecracker-go-sdk's startVMM spawns a goroutine that SIGTERMs
firecracker when the ctx passed to Machine.Start cancels — and
retains that ctx for the lifetime of the VMM, not just the boot
phase. Our Machine.Start wrapper was plumbing the caller's ctx
through, which on `vm.start` is the RPC request ctx. daemon.go's
handleConn cancels reqCtx via `defer cancel()` right after
writing the response. Net effect: firecracker is killed ~150ms
after the `vm start` RPC "completes", invisibly, and the next
`vm ssh` sees a dead PID. `vm.create` side-stepped the bug
because BeginVMCreate detaches to context.Background() before
calling startVMLocked; `vm.start` used the RPC ctx directly.
Fix: Machine.Start now passes context.Background() to the SDK.
We own firecracker lifecycle explicitly (StopVM / KillVM /
cleanupRuntime), so ctx-driven cancellation here was never
actually wired into anything useful.
2. With (1) fixed, the same scenario exposed a second defect:
patchRootOverlay's e2cp/e2rm refuses to touch the dm-snapshot
with "Inode bitmap checksum does not match bitmap" on a restart,
because the COW holds stale free-block/free-inode counters from
the previous guest boot. Kernel ext4 is fine with this; e2fsprogs
is not. Fix: run `e2fsck -fy` on the snapshot between the
dm_snapshot and patch_root_overlay stages. Idempotent on a fresh
snapshot, reconciles the bitmaps on a reused COW.
Regression coverage:
- scripts/repro-restart-bug.sh — minimal create→stop→start→ssh
reproducer with rich on-failure diagnostics (daemon log trace,
firecracker.log tail, handles.json, pgrep-by-apiSock, apiSock
stat). Exits non-zero if the bug returns.
- scripts/smoke.sh — lifecycle scenario (create/ssh/stop/start/
ssh/delete) and vm-set scenario (--vcpu 2 → stop → set --vcpu 4
→ start → assert nproc=4). Both were pulled when the bug was
first found; now restored.
Supporting:
- internal/system/system.ExitCode — extracts exec.ExitError's
code without forcing callers to import os/exec. Needed by the
e2fsck caller (policy test pins os/exec to the shell-out
packages).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The export round-trip (`vm create` → `workspace prepare` → guest edit →
`workspace export`) exposed a reproducible failure on Debian bookworm
guests: `git read-tree HEAD --index-output=/tmp/...` returns exit 128
"unable to write new index file" when the target lives on tmpfs while
`.git` is on the workspace overlay. Move the temp index into
`$(git rev-parse --git-dir)` so it shares a filesystem with `.git/index`
and the lockfile + rename + hardlink dance git does internally works.
Alongside:
- new workspace-export smoke scenario that would have caught this at
the boundary between daemon and guest git
- `make smoke-fresh` = `smoke-clean && smoke` for release-time runs
that want first-install paths (migrations, image pull) stamped into
the coverage report
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The unit + integration tests can't cross machine.Start — the SDK
boundary would need a fake firecracker that reimplements the
control-plane HTTP API, and the ongoing maintenance cost of keeping
that fake honest with upstream kills the value. Instead, add a
pre-release smoke target that drives REAL Firecracker + real KVM,
captures coverage from the -cover-instrumented binaries, and
surfaces per-package deltas so regressions in the boot path don't
ship silently.
scripts/smoke.sh:
- Isolated XDG_{CONFIG,STATE,CACHE,RUNTIME} so the smoke run can't
touch real user state (state/cache persist under build/smoke/xdg
for fast reruns; runtime is mktemp'd fresh per-run because
sockets can't be reused)
- Preflight: `banger doctor` must pass; UDP :42069 must be free
(otherwise the user's real daemon is up and the smoke daemon
can't bind its DNS listener — fail with an actionable message)
- Scenario 1 — bare: `banger vm run --rm -- echo smoke-bare-ok`
exercises create → start → socket ownership chown → machine.Start
→ SDK waitForSocket race → vsock agent readiness → guest SSH
wait → exec → cleanup → delete
- Scenario 2 — workspace: creates a throwaway git repo, runs
`banger vm run --rm <repo> -- cat /root/repo/smoke-file.txt`,
verifies the tracked file reached the guest (exercises
workDisk capability PrepareHost + workspace.prepare)
- `banger daemon stop` at the end so instrumented binaries flush
GOCOVERDIR pods before the script exits
Makefile additions:
- smoke-build: builds banger/bangerd under build/smoke/bin/ with
`go build -cover`
- smoke: runs the script with GOCOVERDIR set, reports per-package
coverage via `go tool covdata percent`
- smoke-coverage-html: textfmt + go tool cover for a browsable
report
- smoke-clean: nukes build/smoke/ including the persisted XDG
state
Bonus fix uncovered during the first smoke run: doctor treated a
missing state.db as a FAIL ("out of memory" from SQLite
SQLITE_CANTOPEN), which red-flagged every fresh install. Split
the store check: DB file absent → PASS with "will be created on
first daemon start" detail; DB present but unreadable → FAIL as
before. New TestDoctorReport_StoreMissingSurfacesAsPassForFreshInstall
pins the behaviour.
Concrete coverage delta from the first successful smoke run
(compared to `make coverage-total`'s unit-test-only 37.8%):
internal/firecracker 43.6% → 75.0%
internal/daemon/workspace 33.8% → 60.8%
internal/store 40.1% → 56.3%
internal/guest 63.7% → 57.4% (different mix: smoke
exercises real SSH;
unit tests cover more
error branches)
The packages the review flagged are the ones that moved most —
which is the point.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Uses the newTestDaemon harness added earlier in this series to
drive VMService.CreateVM → FindVM → (duplicate CreateVM reject) →
DeleteVM through the real code path. Pins contracts the
single-responsibility tests can't:
- image resolution finds a locally-upserted row
- reserveVM allocates an IP, mkdirs VMDir, persists the Stopped
row
- FindVM round-trips the record
- duplicate-name CreateVM fails without persisting a second row
- DeleteVM runs cleanupRuntime with zero handles (no sudo calls
needed), removes the store row, removes the VMDir
Plus TestVMCreateWithUnknownImageFails for the error branch: if
CreateVM can't resolve an image, it must error before mutating
any state.
Scope: everything except firecracker boot. NoStart: true skips
machine.Start, which is the upstream SDK boundary we can't cross
without a real firecracker binary. The integration we GET
exercises name/IP reservation, per-VM lock lifecycle, store
round-trip, VMDir lifecycle, and the never-started delete path —
all of which were only indirectly covered by unit tests before.
-race clean across the two tests; nothing touches goroutines but
the harness does initialize the tap pool background goroutine on
wireServices, which the race detector validated.
Coverage is flat at the global level (37.8% → 37.8%) because
this slice tests integration of already-covered units, not new
branches. That's expected — the slice's value is as a regression
bedrock for future refactors, not a line-count bump.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every non-happy branch in fcproc was zero-covered before this. Given
that EnsureSocketAccess gates the firecracker control plane on the
daemon's ability to chown the API + vsock sockets off root, those
failure paths are exactly the ones we need pinned.
New file internal/daemon/fcproc/fcproc_test.go adds a local scripted
Runner (fcproc is a leaf package — can't pull the daemon's
scriptedRunner in) and six tests:
waitForPath:
- TestWaitForPathReturnsDeadlineExceededWhenSocketNeverAppears —
timeout branch wraps context.DeadlineExceeded with the label,
and waits at least one poll tick before giving up
- TestWaitForPathReturnsOnceSocketAppears — happy path with a
mid-wait file creation via goroutine
- TestWaitForPathRespectsContextCancellation — ctx.Done() beats
the poll interval so a cancelled request doesn't stall
EnsureSocketAccess:
- TestEnsureSocketAccessChownFailureBubbles — chown error surfaces
untouched; chmod not attempted when chown fails
- TestEnsureSocketAccessChmodFailureBubbles — chmod error surfaces
after chown succeeds
- TestEnsureSocketAccessTimesOutBeforeTouchingRunner — ordering
contract: no sudo calls when the socket never materialises
Package function coverage moved 55.2% → 62.1%.
Integration-level chown-race test was considered (run a real shell
that exercises buildProcessRunner's script with a fake firecracker
binary) but skipped — requires `sudo -n` in the test env and makes
CI fragile. The socket-ownership regression this slice is meant to
guard against is covered at the unit level here; the
manual-smoke in the plan's verification section remains the
end-to-end check.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The symlink test in this commit catches a real bug: sameDirOrParent
used filepath.Abs for both sides of the "is the key inside the
legacy dir?" check, but filepath.Abs doesn't resolve symlinks. A
user whose ssh_key_path pointed into ConfigDir/ssh via a symlinked
spelling (e.g. ConfigDir itself is a symlink, or the user
maintains an alias tree) would have their key silently deleted by
the legacy-dir scrub — the gate thought the key lived elsewhere
because the two spellings didn't match lexically.
Fix: resolvePathForComparison tries filepath.EvalSymlinks first,
falls back to filepath.Abs when the path doesn't exist yet (new
install, pre-first-Open). Both sides of the sameDirOrParent
comparison now use this helper, so a symlinked key + canonical
dir (or the reverse) lands in the same physical path before the
Rel check.
Tests added in this commit:
internal/daemon/ssh_client_config_test.go
TestSameDirOrParentHandlesSymlinks — symlinked-key + canonical-dir
and the reverse are both reported "inside"; unrelated paths stay
out. Skips if the filesystem doesn't support symlinks.
internal/config/config_test.go
TestLoadNormalizesAbsoluteSSHKeyPath — trailing slash, duplicate
slashes, dot segments all collapse via filepath.Clean, so two
spellings of the same path compare equal downstream.
TestEnsureDefaultSSHKeyRejectsCorruptExistingFile — regression
guard against a future "regenerate if invalid" patch that would
silently nuke a real user key.
TestResolveSSHKeyPathRejectsEmptySSHDirAndStateDir — pins the
absolute-path guard that stops a bad layout from scribbling
into cwd (this was the test that caught the stray
internal/config/ssh/ a few commits back).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
No Daemon test in this package has a shared constructor. Every file
re-derives the same pattern — &Daemon{...}, wireServices(d), maybe
override a field — which means new lifecycle / integration tests
spend half their length standing up infrastructure instead of
exercising behaviour.
Consolidate into internal/daemon/daemon_testing_test.go:
func newTestDaemon(t *testing.T, opts ...testDaemonOption) *Daemon
Defaults: tempdir layout (distinct StateDir/ConfigDir/SSHDir/...),
fresh store.Store with migrations auto-run, permissiveRunner,
io.Discard logger, empty vmCaps (so default workDisk/dns/nat
capabilities don't fire real side effects in tests that just want
to exercise VMService plumbing).
Options so far:
- withRunner(system.CommandRunner)
- withConfig(model.DaemonConfig)
- withStore(*store.Store)
- withLogger(*slog.Logger)
- withLayout(paths.Layout)
- withVMCaps(caps ...vmCapability)
- withVsockHostDevice(string)
withVMCaps tracks a vmCapsSet flag so tests that explicitly pass no
caps (i.e. the default) still get the empty-slice behaviour — the
reset after wireServices only fires when the caller didn't opt in.
That keeps wireServices's production semantics unchanged: if you
construct a real Daemon without pre-populating vmCaps, you still
get the default three.
Two smoke tests pin:
- zero-option call wires every service, gives an empty-vmCaps
daemon with the default vsock device, store non-nil
- each option actually lands on the resulting Daemon (guards
against silent rename)
Existing tests unchanged — this is purely additive. Later slices
(Firecracker error-path tests, store migration edges, lifecycle
flow harness) will adopt the helper.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bug: syncVMSSHClientConfig did os.RemoveAll on $ConfigDir/ssh every
daemon Open. The intent was to migrate off the pre-opt-in layout,
where banger used to write $ConfigDir/ssh/ssh_config. But a user who
sets ssh_key_path = "~/.config/banger/ssh/id_ed25519" in config.toml
has their key live exactly in that dir — and the scrub deletes it
along with every other file in the tree.
This is the same class of bug that cost the default key until
ebe6517 moved it to StateDir, but that fix was scoped to the default
path. A configured ssh_key_path pointed under the legacy dir still
dies.
Fix: replace os.RemoveAll with a narrow two-step cleanup:
1. Skip the cleanup entirely when the configured ssh_key_path
resolves under the legacy dir. A user who pointed banger at a
key there must keep the enclosing directory.
2. Otherwise, os.Remove the specific legacy file ($ConfigDir/ssh/
ssh_config) and then os.Remove the directory. The second
os.Remove fails with ENOTEMPTY if the dir still holds anything
(e.g. a user-managed sibling file we don't own). Both errors
are swallowed — this is best-effort migration, not a hard
failure.
Tests pin all three paths: user key under legacy dir survives,
legacy dir empties and is removed when the user moved on, and a
user-managed sibling file in the legacy dir is preserved.
Also fix stale doc claims in README.md and AGENTS.md — both still
pointed at the old ~/.config/banger/ssh/id_ed25519 default, which
moved to ~/.local/state/banger/ssh/id_ed25519 in ebe6517.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four drift fixes from a doc sweep.
internal/daemon/doc.go
Replace the capability-hook description that still said "Hook
methods take *Daemon; VMService reaches them through a
capabilityHooks seam." Current reality: every capability is a
plain struct carrying its own service pointers
(workDiskCapability{vm,ws,store}, dnsCapability{net},
natCapability{vm,net,logger}); wireServices builds the default
list; no hook reaches *Daemon.
internal/daemon/ARCHITECTURE.md
The VMService field list still claimed guestWaitForSSH and
guestDial were "per-instance fields." Those were deleted as
refactor residue. Update the note to say the seams live on
*Daemon (reached by WorkspaceService via closures wired at
construction) and document the vsockHostDevice field that
replaced the old package-global vsockHostDevicePath.
AGENTS.md
Drop the "experimental web UI" mention (removed) and the
`session` subpackage (removed). Mention banger-vsock-agent as
the third cmd/ binary while we're here — AGENTS hadn't listed
it.
docs/kernel-catalog.md
The trust-model section still read as if upstream kernel sources
were fetched by HTTPS alone. Add a paragraph covering the PGP
verification make-generic-kernel.sh now does against the
detached .tar.sign and the three kernel.org release signing keys.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three thematic test files pinning behavior surfaces that had none
before, following the review's recommendation to plug concrete
error/cleanup branches rather than chase a coverage percentage.
doctor_test.go
Covers Daemon.doctorReport end-to-end with a permissive runner +
fake executables on PATH. Pins: store error surfaces as fail,
store success as pass, missing firecracker kills the host-runtime
check, the three default capability feature checks (work disk,
vm dns, nat) are emitted, vm-defaults is always-pass with
provenance. Previously 0% — now the Doctor() command's contract
with the CLI is under guard.
workspace_rejection_test.go
Covers the four early-exit branches of PrepareVMWorkspace that
the existing happy-path + lock-release tests never hit: malformed
mode, --from without --branch, VM not running, VM not found.
Each one returns before any SSH I/O, so the fake-firecracker
infra the happy-path test needs is unnecessary — a bare wired
daemon with a stored VMRecord suffices.
nat_capability_test.go
Covers natCapability.ApplyConfigChange (unchanged flag → no-op,
VM not alive → no-op, toggle on live VM → runner reached) and
natCapability.Cleanup (NAT disabled → no-op, runtime handles
missing → defensive no-op, full wiring → ensureNAT(false)). A
countingRunner + startFakeFirecracker fixture stands in for the
real host plumbing, with waitForVMAlive polling past the
exec -a race window that startFakeFirecracker exposes on
loaded CI boxes.
make coverage-total 37.8% → 38.6%. The number isn't the point —
these tests exist so the next refactor in this area has to
break an explicit assertion to drift.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
VMService carried guestWaitForSSH and guestDial fields + matching
constructor args ever since the daemon split, but nothing on
VMService ever read them. The live guest-SSH path runs on *Daemon
(d.waitForGuestSSH / d.dialGuest in guest_ssh.go); WorkspaceService
reaches those through closures wired in wireServices.
So the VMService copies were refactor residue: they made the
service look more decoupled than it actually is, and any future
test that stubbed VMService.guestDial would be stubbing nothing.
Delete the fields, the deps entries, the newVMService assignments,
and the wireServices passes.
Real seams on *Daemon are unchanged — those are the ones tests
(e.g. workspace_test.go) already set directly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three test seams were still package-level mutable vars, which tests
had to swap before use. That's the classic path to flaky parallel
tests — two goroutines fighting over the same global fake. Push each
down to the struct that owns the behaviour.
internal/daemon/dns_routing.go
lookupExecutableFunc + vmDNSAddrFunc → fields on *HostNetwork,
defaulted at newHostNetwork time. dns_routing_test builds
HostNetwork{..., lookupExecutable: stub, vmDNSAddr: stub} inline,
no more t.Cleanup dance around package-level vars.
internal/daemon/preflight.go + doctor.go
vsockHostDevicePath (mutable string) → vsockHostDevice field on
*VMService, defaulted via defaultVsockHostDevice constant in
newVMService. Preflight reads s.vsockHostDevice; doctor reads
d.vm.vsockHostDevice. Logger test sets d.vm.vsockHostDevice = tmp
after wireServices.
internal/daemon/workspace/workspace.go
HostCommandOutputFunc → *Inspector struct with a Runner field.
Every git-using helper (GitOutput, GitTrimmedOutput,
GitResolvedConfigValue, RunHostCommand, ListSubmodules,
ListOverlayPaths, CountUntrackedPaths, InspectRepo,
ImportRepoToGuest, PrepareRepoCopy) is now a method on *Inspector.
NewInspector() wraps the real host runner for production;
WorkspaceService holds one via repoInspector, CLI deps holds one
too. cli_test.go's submodule-rejection test builds its own
Inspector with a scripted Runner instead of patching a global.
Pure helpers (FinalizeScript, ResolveSourcePath, ParsePrepareMode,
ShellQuote, FormatStepError, GitFileURL, ParseNullSeparatedOutput)
stay free functions since they don't touch the host.
Sentinel: grep for HostCommandOutputFunc, lookupExecutableFunc,
vmDNSAddrFunc, vsockHostDevicePath is now empty across internal/.
make lint test green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`banger doctor` used to call store.Open, which unconditionally runs
migrations on the way up. Diagnostics mutating persistent state is a
surprise — particularly now that migration 2 drops a column, so a
plain `doctor` invocation against an old DB would silently schema-
evolve it.
Add store.OpenReadOnly: separate DSN builder with mode=ro and a
minimal pragma set (foreign_keys, busy_timeout — no journal_mode=WAL,
no wal_autocheckpoint), skips runMigrations, and pings on open so a
missing DB fails up front rather than at first query. doctor.go now
uses OpenReadOnly; the existing storeErr fallback path surfaces any
failure as a failing check, unchanged.
Tests pin two invariants:
- OpenReadOnly against a DB whose migration 2 marker was removed and
packages_path re-added must leave both alone (i.e. no drift is
applied behind the user's back).
- Any write attempted through the read-only handle is rejected at
the driver layer (belt-and-braces for future refactors).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three drift items surfaced in review, each dead on arrival and each
worth trusting a little more at v0.1.0.
config: drop MetricsPollInterval. The field was parsed from TOML
(metrics_poll_interval), stored on DaemonConfig, and ignored by every
consumer — only StatsPollInterval drives the background poll loop.
Users setting it in config.toml saw zero effect. Removed from the TOML
surface, the model constant, and the config test.
daemon: delete ensureDefaultImage. No callers, body was `_ = ctx;
return nil`. Dead since whatever flow used to call it got removed.
store: drop packages_path from the images table. The column was
carried by the baseline migration but never referenced by UpsertImage
(no INSERT / UPDATE mention) or any Go model field — a ghost from a
build pipeline that no longer exists. Added migration id=2
(drop_dead_image_columns) with an idempotent dropColumnIfExists
helper: fresh installs run baseline (creates the column) + 2 (drops
it); legacy DBs where the column was never added get a no-op. Updated
the direct-INSERT SQL in TestGetImageRejectsMalformedTimestamp to
drop the column reference, and added a migration test covering both
install paths (fresh + legacy).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Workspace-mode vm run and vm workspace prepare used to copy both
tracked AND untracked non-ignored files into the guest. That silently
catches local .env files, scratch notes, credentials, and any other
working-tree state a developer hasn't explicitly gitignored — a real
data-exposure footgun given the golden image ships Docker and the
usual dev tooling.
Flip the default to tracked-only. Users who actually want the fuller
set opt in with --include-untracked (documented in both commands'
help). Gitignored files are still always excluded regardless of the
flag.
Add --dry-run to both vm run and vm workspace prepare. Dry-run
inspects the repo CLI-side (no VM created, no daemon RPC needed since
the daemon is always local and the inspection is a pure git read),
prints the exact file list + mode, and exits. A byte-level preview of
what would land in the guest.
When running real (non-dry) and untracked files exist in the repo but
are being skipped under the new default, print a one-line notice
pointing to --include-untracked so users aren't surprised when the
guest is missing something they expected.
Signature changes:
- ListOverlayPaths takes an includeUntracked bool (tracked always;
untracked gated by flag).
- InspectRepo takes the same flag and passes it through.
- VMWorkspacePrepareParams gains IncludeUntracked.
- WorkspaceService.workspaceInspectRepo seam signature widened to
match (4 callers in tests updated).
New workspace package tests cover both modes and verify that
gitignored files never leak regardless of the flag.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Update ARCHITECTURE.md's Composition section to reflect the finished
split: capabilities carry explicit service-pointer fields, nothing
reaches *Daemon at dispatch time, and wireServices(d) is the single
entry point that builds services + capabilities eagerly (from Open
in production, from tests after constructing &Daemon{...} literals).
Removes the paragraph admitting capability→*Daemon coupling and the
lazy-init getters justification, neither of which applies anymore.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stop passing *Daemon into capability hooks. Each capability
implementation is now a struct with explicit service-pointer fields
populated at wireServices time; the six dynamic-dispatch interfaces
(AddStartPreflight, PrepareHost, PostStart, Cleanup, ApplyConfigChange,
AddDoctorChecks) no longer have a *Daemon parameter. Capability
methods reach their dependencies through struct fields, not through
d.vm / d.ws / d.net.
- workDiskCapability carries {vm, ws, store, defaultImageName}
- dnsCapability carries {net}
- natCapability carries {vm, net, logger}
Daemon.defaultCapabilities() builds the production list from the
already-constructed services and is called from wireServices so
d.vmCaps is populated eagerly. Tests that preinstall d.vmCaps with
stubs still work — wireServices only overwrites an empty slice.
registeredCapabilities() is gone (every dispatch loop now reads
d.vmCaps directly). capabilities_test.go's testCapability fake drops
*Daemon from its method set to match the new interfaces.
This finishes the daemon service split: capability implementations
no longer reach through the composition root, there's no path back
to *Daemon from any service or capability, and test construction
goes through one explicit wireServices call instead of lazy getters.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Factor the service + capability wiring out of Daemon.Open() into
wireServices(d), an idempotent helper that constructs HostNetwork,
ImageService, WorkspaceService, and VMService from whatever
infrastructure (runner, store, config, layout, logger, closing) is
already set on d. Open() calls it once after filling the composition
root; tests that build &Daemon{...} literals call it to get a working
service graph, preinstalling stubs on the fields they want to fake.
Drops the four lazy-init getters on *Daemon — d.hostNet(),
d.imageSvc(), d.workspaceSvc(), d.vmSvc() — whose sole purpose was
keeping test literals working. Every production call site now reads
d.net / d.img / d.ws / d.vm directly; the services are guaranteed
non-nil once Open returns. No behavior change.
Mechanical: all existing `d.xxxSvc()` calls (production + tests)
rewritten to field access; each `d := &Daemon{...}` in tests gets a
trailing wireServices(d) so the literal + wiring are side-by-side.
Tests that override a pre-built service (e.g. d.img = &ImageService{
bundleFetch: stub}) now set the override before wireServices so the
replacement propagates into VMService's peer pointer.
Also nil-guards HostNetwork.stopVMDNS and d.store in Close() so
partially-initialised daemons (pre-reconcile open failure) still
tear down cleanly — same contract the old lazy getters provided.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 5 of the daemon god-struct refactor. Code motion landed in
phases 1-4; this commit retells the architecture so the docs match
the structure.
ARCHITECTURE.md loses the "deferred v0.2 project" hedge about
splitting services. The Composition section now describes the four
services (HostNetwork, ImageService, WorkspaceService, VMService)
that own behaviour, the consumer-defined seam pattern for
cross-service calls, and the lazy-init getter pattern that keeps
existing test literals compiling.
doc.go inventories which methods live on which service, and the
lock-ordering section gains the service prefixes (e.g.
VMService.vmLocks instead of bare vmLocks) so readers don't have to
guess which type owns which mutex.
No code changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 4 of the daemon god-struct refactor. VM lifecycle, create-op
registry, handle cache, disk provisioning, stats polling, ports
query, and the per-VM lock set all move off *Daemon onto *VMService.
Daemon keeps thin forwarders only for FindVM / TouchVM (dispatch
surface) and is otherwise out of VM lifecycle. Lazy-init via
d.vmSvc() mirrors the earlier services so test literals like
\`&Daemon{store: db, runner: r}\` still get a functional service
without spelling one out.
Three small cleanups along the way:
* preflight helpers (validateStartPrereqs / addBaseStartPrereqs
/ addBaseStartCommandPrereqs / validateWorkDiskResizePrereqs)
move with the VM methods that call them.
* cleanupRuntime / rebuildDNS move to *VMService, with
HostNetwork primitives (findFirecrackerPID, cleanupDMSnapshot,
killVMProcess, releaseTap, waitForExit, sendCtrlAltDel)
reached through s.net instead of the hostNet() facade.
* vsockAgentBinary becomes a package-level function so both
*Daemon (doctor) and *VMService (preflight) call one entry
point instead of each owning a forwarder method.
WorkspaceService's peer deps switch from eager method values to
closures — vmSvc() constructs VMService with WorkspaceService as a
peer, so resolving d.vmSvc().FindVM at construction time recursed
through workspaceSvc() → vmSvc(). Closures defer the lookup to call
time.
Pure code motion: build + unit tests green, lint clean. No RPC
surface or lock-ordering changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Third phase of splitting the daemon god-struct. WorkspaceService now
owns workspace.prepare / workspace.export plus the ssh-key +
git-identity + arbitrary-file sync that runs as part of VM start's
prepare_work_disk capability hook. workspaceLocks (the per-VM tar
serialisation set) lives on the service.
workspace.go and vm_authsync.go flipped receivers from *Daemon to
*WorkspaceService. The workspaceInspectRepo / workspaceImport test
seams moved onto the service as fields.
Peer-service dependencies go through narrow function-typed fields:
vmResolver, aliveChecker, waitGuestSSH, dialGuest, imageResolver,
imageWorkSeed, withVMLockByRef, beginOperation. WorkspaceService
never touches VMService / HostNetwork / ImageService directly —
only the exact operations the Daemon hands it at construction.
Daemon lazy-init helper workspaceSvc() mirrors the Phase 1/2
pattern. Test literals still write `&Daemon{store: db, runner: r}`
and get a wired workspace service for free. Tests that override the
inspect/import seams (workspace_test.go, ~4 sites) assign them on
d.workspaceSvc() instead of on the daemon literal.
Dispatch in daemon.go: vm.workspace.prepare and vm.workspace.export
now forward one-liners to d.workspaceSvc().
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Second phase of splitting the daemon god-struct. ImageService now owns
all image + kernel registry operations: register/promote/delete/pull
for images (bundle + OCI paths), the six kernel commands, and the
shared SSH-key/work-seed injection helpers. imageOpsMu (the
publication-window lock) lives on the service; so do the three OCI
pull test seams pullAndFlatten / finalizePulledRootfs / bundleFetch.
The four files images.go, images_pull.go, image_seed.go, kernels.go
flipped their receivers from *Daemon to *ImageService.
FindImage moved with the service. Daemon keeps a thin FindImage
forwarder so callers reading the dispatch code see the obvious
facade and tests that pre-date the split still compile.
flattenNestedWorkHome — called from image_seed.go, vm_authsync.go,
and vm_disk.go across future service boundaries — became a
package-level helper taking a CommandRunner explicitly. Daemon keeps
a deprecated forwarder for now; the other services will use the
package form.
Lazy-init helper imageSvc() on Daemon mirrors hostNet() from
Phase 1, so test literals like &Daemon{store: db, runner: r, ...}
that don't spell out an ImageService still get a working one.
Tests that override the image test seams (autopull_test,
concurrency_test, images_pull_test, images_pull_bundle_test) now
assign d.img = &ImageService{...seams...}; the two-statement pattern
matches what Phase 1 established for HostNetwork.
Dispatch in daemon.go is cleaner now: every image/kernel RPC handler
is a single-liner forwarding to d.imageSvc().*. Phase 5 will do the
same for VM lifecycle.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First phase of splitting the daemon god-struct into focused services
with explicit ownership.
HostNetwork now owns everything host-networking: the TAP interface
pool (initializeTapPool / ensureTapPool / acquireTap / releaseTap /
createTap), bridge + socket dir setup, firecracker process primitives
(find/resolve/kill/wait/ensureSocketAccess/sendCtrlAltDel), DM
snapshot lifecycle, NAT rule enforcement, guest DNS server lifecycle
+ routing setup, and the vsock-agent readiness probe. That's 7 files
whose receivers flipped from *Daemon to *HostNetwork, plus a new
host_network.go that declares the struct, its hostNetworkDeps, and
the factored firecracker + DNS helpers that used to live in vm.go.
Daemon gives up the tapPool and vmDNS fields entirely; they're now
HostNetwork's business. Construction goes through newHostNetwork in
Daemon.Open with an explicit dependency bag (runner, logger, config,
layout, closing). A lazy-init hostNet() helper on Daemon supports
test literals that don't wire net explicitly — production always
populates it eagerly.
Signature tightenings where the old receiver reached into VM-service
state:
- ensureNAT(ctx, vm, enable) → ensureNAT(ctx, guestIP, tap, enable).
Callers resolve tap from the handle cache themselves.
- initializeTapPool(ctx) → initializeTapPool(usedTaps []string).
Daemon.Open enumerates VMs, collects taps from handles, hands the
slice in.
rebuildDNS stays on *Daemon as the orchestrator — it filters by
vm-alive (a VMService concern handles will move to in phase 4) then
calls HostNetwork.replaceDNS with the already-filtered map.
Capability hooks continue to take *Daemon; they now use it as a
facade to reach services (d.net.ensureNAT, d.hostNet().*). Planned
CapabilityHost interface extraction is orthogonal, left for later.
Tests: dns_routing_test.go + fastpath_test.go + nat_test.go +
snapshot_test.go + open_close_test.go were touched to construct
HostNetwork literals where they exercise its methods directly, or
route through d.hostNet() where they exercise the Daemon entry
points.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
reserveVM's duplicate-name guard routed through Daemon.FindVM, which
falls back to prefix-matching on both ids and names when no exact
match is found. That turns the uniqueness check into a correctness
bug: a brand-new VM name can be rejected because it happens to
prefix an existing VM's id, or an existing VM's name. So `vm create
--name beta` fails when `beta-sandbox` already exists.
Swap in a dedicated store.GetVMByName that does a literal `WHERE
name = ?` lookup, and use it from reserveVM. FindVM keeps its
prefix-matching behaviour for user-facing lookup paths (`vm ssh
<partial>`, `vm stop <partial>`) where "did you mean" semantics
are the feature.
Tests:
- TestReserveVMAllowsNameThatPrefixesExistingVM — seeds a VM whose
id + name both start with "longname", then reserves two new VMs
named "longname" and "longname-sandbox". Both must succeed.
Under the old FindVM-based check, both would fail.
- TestReserveVMRejectsExactDuplicateName — actual collisions are
still rejected after the swap.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Before this change, every daemon.Open() wrote a Host *.vm stanza into
~/.ssh/config in a marker-fenced block. That's a real footgun for users
who manage their SSH config declaratively (chezmoi, dotfiles, NixOS):
banger was mutating host state outside its own directory on every
daemon start, easy to miss and hard to audit.
New contract: the daemon only ever writes its own ssh_config file at
~/.config/banger/ssh_config. ~/.ssh/config is untouched unless the user
opts in. `banger vm ssh <name>` still works out of the box — the
shortcut only matters for plain `ssh sandbox.vm` from any terminal.
The opt-in surface is `banger ssh-config`:
banger ssh-config # prints path + include-line +
# install/uninstall hints
banger ssh-config --install # adds `Include <bangerConfig>` to
# ~/.ssh/config inside a marker-fenced
# block; idempotent; migrates any
# legacy inline Host *.vm block from
# pre-opt-in builds
banger ssh-config --uninstall # removes the new Include block AND
# any legacy inline block
Doctor gains a gentle warn-level note when banger's ssh_config exists
but the user hasn't wired it in — not a fail, since the shortcut is
convenience and `banger vm ssh` covers the essential case.
Tests cover: daemon writes banger file and does NOT touch ~/.ssh/config,
Install adds the block, Install is idempotent, Install migrates the
legacy inline block cleanly (removing it, preserving unrelated
entries, adding the new Include block), Uninstall removes both marker
variants, Uninstall is a no-op when ~/.ssh/config is absent, and
UserSSHIncludeInstalled detects both marker shapes.
README reframes the feature as optional convenience.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Before: createVMMu was held across the whole of CreateVM — including
image resolution (which could fire a full auto-pull) and startVMLocked
(boot of multiple seconds). imageOpsMu was held across the whole of
PullImage/RegisterImage/PromoteImage/DeleteImage, so any slow OCI pull,
bundle download, or file copy blocked every other image mutation and
every other VM create that needed to auto-pull. The async create API
bought nothing if all creates serialised on the same mutex.
CreateVM is now three phases:
1. Validate + resolve image (possibly auto-pulling). No global lock.
2. reserveVM: take createVMMu only long enough to re-check the name
is free, allocate the next guest IP, and UpsertVM the "created"
row. Milliseconds.
3. startVMLocked: run the full boot flow under the per-VM lock only.
Parallel creates of different VMs now overlap on image resolution +
boot; they contend only across the reservation claim.
For the image surface a new publishImage helper isolates the commit
atom (recheck name free, atomic rename stagingDir→finalDir, UpsertImage)
under imageOpsMu. pullFromBundle + pullFromOCI do their network fetch
+ ext4 build + ownership fixup + agent injection outside the lock;
Register moves validation + kernel resolution outside; Promote moves
file copy + SSH-key seeding outside; Delete keeps a brief lock over
the lookup + reference check + store delete and does file cleanup
unlocked.
Two concurrency tests assert the new behaviour:
- TestPullImageDoesNotSerialiseOnDifferentNames fails the old code
(second pull blocks on imageOpsMu and never reaches the body).
- TestPullImageRejectsNameClashAtPublish confirms the publish-window
recheck is what enforces name uniqueness now that the body runs
unlocked — exactly one winner.
ARCHITECTURE.md updated to describe the new scope explicitly instead
of calling the locks "narrow".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The README sold the product as "Linux with /dev/kvm"; the deeper docs
admit that the Makefile pins companion builds to GOARCH=amd64, the
kernel catalog ships only x86_64 entries, and OCI import pulls
linux/amd64 layers. arm64 users who show up through the README only
discover that after install fails in non-obvious ways.
Two surface-level fixes:
- README requirements list leads with "x86_64 / amd64 Linux — arm64 is
not supported today", with a short note on the three places that
assumption lives so users understand it's not a last-mile gap.
- `banger doctor` now runs an architecture check that passes on amd64
and FAILS (not warns) on anything else, referencing the three
downstream assumptions. Hard-fail rather than warn so a user on an
arm64 machine doesn't waste time chasing unrelated preflight items.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>