Commit graph

243 commits

Author SHA1 Message Date
a3a51e06c4
daemon: build the work disk fresh instead of cloning the seed file
Old flow on every 'banger vm run' that hit the seeded path:
CopyFilePreferClone the seed file (FICLONE attempt + io.Copy + fsync
fallback), then e2fsck -fp + resize2fs to grow the FS to the spec
size. On filesystems without reflink support that meant pushing
512+ MiB through the kernel followed by a full filesystem check
and resize, even though the seed only carries a few KB of dotfiles
— minWorkSeedBytes is 512 MiB but the actual payload is tiny.
That is the minute-long stall on the 'cloning work seed' stage
users see today.

Replace the copy with a sized fresh ext4: truncate to
WorkDiskSizeBytes, mkfs.ext4 -F -E root_owner=0:0, debugfs rdump
to extract the seed's contents, then ingest each file via the
sudoless ext4 toolkit (MkdirExt4 / WriteExt4FileOwned, root:root,
mode preserved). Sub-second regardless of seed size or requested
work-disk size; no fsck or resize needed because the FS is created
at its final size from the start.

Also drop the now-implementation-pinned
TestEnsureWorkDiskClonesSeedImageAndResizes — its premise (a
scripted e2fsck/resize2fs sequence) no longer reflects the code,
and smoke covers the new flow end to end. Stage label changed
from 'cloning work seed' to 'applying work seed' to match what
actually happens.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 20:42:10 -03:00
6c37fec17b
images: remove the docker field
The 'docker' bit on model.Image was unused at runtime — every code
path that branched on it had been removed earlier, leaving only the
field, the SQL column, the --docker flag, and the
#feature:docker sentinel that BuildMetadataPackages emitted into a
hash file. None of those have callers anymore.

Strip the field from the model, the API params, the SQLite column,
the CLI flag, and BuildMetadataPackages's signature. Add migration
2 (drop_images_docker) so existing installs lose the column on next
daemon start. ALTER TABLE ... DROP COLUMN is fine: SQLite has
supported it since 3.35 (2021).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 20:28:40 -03:00
408ad6756c
system: build work-seed without sudo
BuildWorkSeedImage used to mount the source rootfs and the new seed
image — both via sudo. After the privilege split (59e48e8) the owner
daemon runs without sudo and those mounts fail silently inside the
image-pull pipeline (runBuildWorkSeed swallows errors), so every
freshly pulled image landed in the store with an empty WorkSeedPath
and 'banger doctor' kept warning that /root would be empty.

Rewrite the builder around the existing sudoless toolkit:

  1. RdumpExt4Dir extracts /root from the source rootfs into a host
     tempdir (debugfs, no mount).
  2. truncate + mkfs.ext4 -F -E root_owner=0:0 produces an empty
     user-owned ext4 file.
  3. A Go walk over the staged tree calls MkdirExt4 /
     WriteExt4FileOwned for every dir + regular file, forcing
     root:root and preserving mode bits.

Symlinks and special files in /root are skipped — extremely rare on
a stock distro and not part of what makes a useful seed.

Fix won't retroactively populate already-pulled images: re-pull the
default image (e.g. 'banger image delete debian-bookworm && banger
image pull debian-bookworm') to get a seeded work-seed.ext4.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 20:18:23 -03:00
3ec357090a
daemon: doctor passes vm dns when banger itself owns the port
The previous check tried to bind 127.0.0.1:42069 and warned on
'address already in use' — which is exactly the state when the
banger daemon is running, the case the user ran 'doctor' to
confirm. The warning was actively misleading.

Now, on 'address already in use', probe the listener with a
*.vm DNS query that only banger's vmdns server answers
authoritatively (NXDOMAIN with Authoritative=true). If the shape
matches we pass; if the port is held by something else we still
warn. Tests cover both branches: a real vmdns server is accepted,
and a silent UDP listener on the same port is rejected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 18:57:27 -03:00
35bfac3f13
cli: rewrite help text for AI-driven discovery
Frontier models tend to discover a CLI by running --help, scanning
the Long description, and inferring the dominant workflow from the
examples. Today's banger help reads like a man page index — every
verb has a one-line Short and nothing else. This rewrites the
groups (banger, vm, vm workspace, image, kernel, system,
ssh-config) so each landing page answers "what is this for, what's
the 80% command, what comes next" in three to ten lines, with
runnable examples.

Also disambiguates the near-twin lifecycle commands so a model
reading the subcommand index can tell stop/kill/delete apart at a
glance:

  start    Start a stopped VM
  stop     Stop a running VM gracefully
  restart  Stop then start a VM
  kill     Force-kill a VM (use when 'vm stop' hangs)
  delete   Stop a VM and remove its disks (irreversible)

vm create / vm ssh / vm logs / vm show pick up Long descriptions
and examples for the same reason. No behaviour changes; help text
only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 15:02:08 -03:00
41ced66a54
mise: pin go and shellcheck
go 1.25.0 matches go.mod's toolchain. shellcheck is the only non-go
tool make lint hard-requires.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 13:11:51 -03:00
b0b1300314
docs: add the privilege model document
Explain what runs as the owner user vs root, every helper RPC method
and its validation gate, the on-disk paths banger writes, network
mutations, and how install/uninstall work end to end. The aim is to
give a reader enough information to grant or refuse the privileges
banger asks for during system install with their eyes open.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 12:55:18 -03:00
47d83ce4d7
gitignore: exclude the entire build directory
Replace the per-subdir entries with a single /build/ to cover any
new outputs Make or scripts add later (build/manual exists today;
future docs/coverage variants would otherwise need new lines).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 12:55:11 -03:00
59e48e830b
daemon: split owner daemon from root helper
Move the supported systemd path to two services: an owner-user bangerd for
orchestration and a narrow root helper for bridge/tap, NAT/resolver, dm/loop,
and Firecracker ownership. This removes repeated sudo from daily vm and image
flows without leaving the general daemon running as root.

Add install metadata, system install/status/restart/uninstall commands, and a
system-owned runtime layout. Keep user SSH/config material in the owner home,
lock file_sync to the owner home, and move daemon known_hosts handling out of
the old root-owned control path.

Route privileged lifecycle steps through typed privilegedOps calls, harden the
two systemd units, and rewrite smoke plus docs around the supported service
model.

Verified with make build, make test, make lint, and make smoke on the
supported systemd host path.
2026-04-26 12:43:17 -03:00
3edd7c6de7
daemon: build a work-seed during image pull, refresh doctor check
Before this change `banger image pull` (both OCI-direct and bundle
paths) shipped images with an empty WorkSeedPath — the BuildWorkSeedImage
helper existed only behind the hidden `banger internal work-seed` CLI.
Every pulled image hit ensureWorkDisk's no-seed branch, and the guest
booted with a bare /root (no .bashrc, no .profile, none of the distro
defaults).

Pull now calls BuildWorkSeedImage after the rootfs is finalised (OCI)
or fetched (bundle). The builder is behind a new `workSeedBuilder` test
seam so existing pull tests don't accidentally demand sudo mount. The
build failure is non-fatal: any error logs a warning and leaves
WorkSeedPath empty — images stay publishable even if the pulled rootfs
has no /root to extract.

Verified end-to-end by wiping the cached smoke image and re-pulling:
work-seed.ext4 lands in the artifact dir next to rootfs.ext4, and all
21 smoke scenarios pass.

Also refreshes the "feature /root work disk" fallback tooling check —
the no-seed path no longer touches mount/umount/cp after commit
0e28504, so the doctor check now only requires truncate + mkfs.ext4.
The warn copy updates from "new VM creates will be slower" to "guest
/root will be empty", which matches the actual tradeoff post-refactor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 20:24:10 -03:00
02773c1cf5
daemon: delete flattenNestedWorkHome and normaliseHomeDirPerms
Both helpers are stranded: commit f068536 dropped their last callers
from ensureAuthorizedKeyOnWorkDisk and seedAuthorizedKeyOnExt4Image,
and commit 6ab1a2b dropped the ensureGitIdentity / runFileSync calls
that still held them up. Every on-disk-patch code path now drives the
ext4 image directly via MkdirExt4 / WriteExt4FileOwned /
EnsureExt4RootPerms.

Also drops TestFlattenNestedWorkHomeCopiesEntriesIndividually —
premise gone with the function. The sshd_config_test comment
referencing normaliseHomeDirPerms now points at EnsureExt4RootPerms.

Net sudo reduction across the five-commit series: work-disk creation,
authsync, image seeding, git identity sync, and file_sync all drop
sudo entirely against user-owned ext4 files. Remaining sudo in
internal/daemon is confined to firecracker process launch, tap/dm
device setup, iptables/NAT, and dmsnap/fcproc — things that
legitimately need CAP_SYS_ADMIN or CAP_NET_ADMIN. MountTempDir stays
on exclusively as an image-build helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 18:33:06 -03:00
6ab1a2b844
daemon: rewrite git identity sync + file_sync on ext4 toolkit
ensureGitIdentityOnWorkDisk, writeGitIdentity, runFileSync, and
copyHostDir all dropped their mount + sudo install/mkdir/chmod/chown
scaffolding. Every write now goes through MkdirExt4,
WriteExt4FileOwned, ReadExt4File, and the new MkdirAllExt4 helper —
all sudoless against user-owned ext4 images.

Net effect with the prior two commits: ensureWorkDisk, authsync, image
seeding, git identity sync, and file_sync no longer mount the work
disk or spawn sudo mkdir/chmod/chown/cat/install. Only the
image-build path (which legitimately produces root-owned artifacts)
still touches MountTempDir.

The filesystemRunner test harness grew a small debugfs/e2cp/e2rm
emulator so the WorkspaceService tests keep exercising their real
code paths without a live ext4 image. The mock is deliberately
dumb — it only implements the subset runFileSync and writeGitIdentity
drive.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 18:29:30 -03:00
f0685366ec
daemon: rewrite authsync + image seeding on ext4 toolkit
ensureAuthorizedKeyOnWorkDisk and seedAuthorizedKeyOnExt4Image both
drove mount + sudo mkdir/chmod/chown/cat/install to patch
/.ssh/authorized_keys into a work disk or work-seed. Both now delegate
to a shared provisionAuthorizedKey helper that uses the ext4 toolkit
introduced in 7704396 — EnsureExt4RootPerms + MkdirExt4 +
Ext4PathExists/ReadExt4File + WriteExt4FileOwned. No mount, no sudo,
no host-path staging.

Drops ~10 sudo call sites from the VM create and image pull flows
and deletes the TestEnsureAuthorizedKeyOnWorkDiskRepairsNestedRootLayout
premise (flattenNestedWorkHome will disappear entirely in the next
commit — the no-seed path no longer copies /root, and the work-seed
path produces flat seeds).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 18:21:50 -03:00
0e28504892
daemon: rewrite ensureWorkDisk no-seed path to skip the mount + cp
The no-seed branch used to mount the base rootfs read-only, mount
the freshly mkfs'd work disk read-write, sudo-cp /root from one to
the other, then flatten any accidental /root/root/ nesting. Five
sudo call sites packed into a fallback that the common image path
doesn't even exercise.

Replace with: `mkfs.ext4 -F -E root_owner=0:0` and nothing else.
mkfs already stamps inode 2 as root:root:0755 — sshd's StrictModes
walks that dir's ownership when the work disk mounts at /root in
the guest, so getting it right from mkfs means authsync can just
write authorized_keys without any repair pass.

Tradeoff: no-seed VMs lose the base rootfs's default /root dotfiles
(.bashrc, .profile). The no-seed path is explicitly the degraded
fallback — `banger doctor` already warns about it — and users who
want those back have two documented knobs: rebuild the image with
a work-seed, or land them via [[file_sync]].

Sudo call sites removed: 5 (MountTempDir × 2, sudo cp -a,
flattenNestedWorkHome's chmod/cp/rm). flattenNestedWorkHome itself
stays alive for now — authsync + image_seed still call it — and
gets deleted in commit 5 once its last caller goes away.

While here: fix the freshly-added EnsureExt4RootPerms helper.
`set_inode_field <2> mode N` overwrites the full i_mode word
instead of preserving the type nibble, so the initial
implementation that passed just the permission bits (0755) would
reset the fs root to regular-file shape and break the next kernel
mount with "Structure needs cleaning." The corrected call OR's in
S_IFDIR (0o040000) explicitly. Test updated to match.

Smoke: 21/21 scenarios green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 18:09:32 -03:00
77043966d4
system: add ext4 toolkit for non-sudo work-disk writes
The daemon mounts every VM's work disk on the host via sudo, copies
files in as root, chmods+chowns them, and unmounts. That's ~18 of
banger's runtime RunSudo calls. The ext4 image is a regular file the
daemon user owns; e2cp / debugfs can write to it directly and bake
uid/gid/mode into the filesystem metadata without the caller being
root. `imagepull.ApplyOwnership` already proves this works in
production (OCI layer flattening writes 0/0/root-owned inodes from
an unprivileged daemon).

This commit adds the toolkit layer. Callers land in the next four
commits:

  - MkdirExt4 — idempotent directory create + metadata reset, single
    debugfs batch
  - WriteExt4FileOwned — e2cp + debugfs-driven uid/gid/mode, auto-
    cleans the host tempfile
  - SetExt4Ownership — sif + set_inode_field batch for existing
    inodes (no mkdir implied)
  - EnsureExt4RootPerms — fixes inode <2> (the fs root, which is
    `/root` once the work disk is mounted inside the guest), the
    thing sshd's StrictModes walks
  - Ext4PathExists — yes/no probe via `debugfs -R "stat ..."` with
    "File not found" detection
  - ReadExt4File — bytes-returning wrapper around the existing
    ReadDebugFSText with the same path rejection

Design notes:

  - extfsRun auto-switches Run ↔ RunSudo on imagePath's type: regular
    files get the unprivileged path, block devices (dm-snapshot,
    loops) get sudo. The same helper works for both patchRootOverlay
    (dm device) and work-disk writes (user-owned file). No caller
    flag needed — os.Stat tells us.
  - debugfsScript batches set_inode_field + sif + mkdir lines into
    one `debugfs -w -f -` stdin invocation on any Runner that
    implements StdinRunner (production's system.Runner does). Matches
    imagepull.ApplyOwnership's existing pattern; dramatically cheaper
    than per-call subprocesses.
  - Paths are escaped for debugfs on the way in: spaces get double-
    quoted, double-quote/backslash/newline are rejected outright
    (debugfs's hand-rolled parser doesn't reliably escape those and
    we'd rather fail fast than silently scribble over the wrong
    inode).

Tests: seven behaviour assertions via scripted + stdin-scripted
runners — existence probe (found + missing + rejection), read
passthrough, mkdir batch contents (new vs. pre-existing path), write
tempfile cleanup + mode line shape, root-inode addressing, and the
full rejectDebugfsUnsafePath matrix.

No production wiring change in this commit — the helpers land
unused. `make smoke` stays green (21/21) because nothing else
shifted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 16:31:50 -03:00
d743a8ba4b
daemon: persist teardown fallbacks and reject unsafe import paths
Preserve cleanup after daemon restarts and harden OCI and tar imports
against filenames that debugfs cannot encode safely.

Mirror tap, loop, and dm teardown identity onto VM.Runtime, teach
cleanup and reconcile to fall back to those persisted fields when
handles.json is missing or corrupt, and clear the recovery state on
stop, error, and delete paths.

Reject debugfs-hostile entry names during flattening and in
ApplyOwnership itself, then add regression coverage for corrupt
handles.json recovery and unsafe import paths.

Verified with targeted go tests, make lint-go, make lint-shell, and
make build.
2026-04-23 16:21:59 -03:00
86a56fedb3
daemon: extract StatsService sibling; shrink VMService's surface
Closes commit 3 of the god-service decomposition. VMService still
owned 45+ methods after the startVMLocked extraction and RPC table
landed in commits 1 and 2. Stats / ports / health / vsock-ping sit
in a corner of that surface that doesn't share any state with
lifecycle orchestration — nothing about "what's this VM's CPU
doing" belongs in the same service as Create/Start/Stop/Delete/Set.

New StatsService owns:

  - GetVMStats / getVMStatsLocked / collectStats (stats collection)
  - HealthVM / PingVM (vsock-agent health probe)
  - PortsVM + buildVMPorts + probeWebListener + probeHTTPScheme +
    dedupeVMPorts (listening-port enumeration)
  - pollStats (background ticker refresh)
  - stopStaleVMs (auto-stop sweep past config.AutoStopStaleAfter)

The three VMService touch-points stats genuinely needs — vmAlive,
vmHandles, the per-VM lock helpers, plus cleanupRuntime for the
stale-sweep tear-down — come in as function-typed closures, not a
*VMService pointer. StatsService has no back-reference to its
sibling. Mirrors the dependency-struct pattern WorkspaceService
already uses.

Wiring: d.stats is populated in wireServices AFTER d.vm (closures
must see a non-nil d.vm at call time). Dispatch table's four
entries (vm.stats / vm.health / vm.ping / vm.ports) now resolve
through d.stats. Background loop's pollStats / stopStaleVMs
tickers do the same. Dispatch surface from the RPC client's
perspective is byte-identical.

After this commit:

  - vm_stats.go and ports.go are deleted; their content (plus the
    stats-specific fields) lives in stats_service.go.
  - VMService loses 12 methods. It's still the biggest service
    (~30 methods, all lifecycle-supporting: handle cache, disk
    provisioning, preflight, create-ops registry, lock helpers,
    the lifecycle verbs themselves) but it's finally one coherent
    concern instead of five.

Tests:
  - TestWireServicesInstantiatesStatsService — pins that the
    wiring order puts d.stats non-nil + its five closures all
    populated. Prevents a silent background-loop regression.
  - All existing tests that called d.vm.HealthVM / d.vm.PingVM /
    d.vm.PortsVM / d.vm.collectStats were re-pointed at d.stats.

Smoke: all 21 scenarios green, including vm ports (exercises the
new PortsVM entry end-to-end) and the long-running workspace
scenarios (exercise the background stats poller implicitly).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 15:46:59 -03:00
366e1560c9
daemon: replace RPC switch with generic method-to-handler table
The dispatch method was a single ~240-line switch of 34 cases, each
following the same pattern: decode params into some type P, call a
service method returning (R, error), wrap R in a result struct and
either marshalResultOrError-encode or return a raw rpc.NewError.
Adding a method was a 4-line ceremony per site, and grepping for
"methods banger speaks" meant reading the full switch.

New shape, in internal/daemon/dispatch.go:

  - handler is the uniform `func(ctx, d, req) rpc.Response` type
    every method dispatches through.
  - paramHandler[P, R] is the generic wrapper that absorbs 28 of
    the 34 cases (decode, call, marshal). No reflection — P and R
    are deduced from the service-call literal, so each map entry
    is a one-liner referencing a small adapter func.
  - noParamHandler[R] is the decode-free variant for 6 methods
    that don't carry params.
  - rpcHandlers is the single source of truth for which methods
    exist and which adapter they dispatch to.
  - Four specials (ping, shutdown, vm.logs, vm.ssh) stay as named
    `handler`-typed functions: ping/shutdown encode with raw
    rpc.NewResult, vm.logs/vm.ssh need pre-service validation to
    emit distinct error codes (not_found, not_running) that the
    generic wrapper maps uniformly to operation_failed.

Daemon.dispatch shrinks from a 240-line switch to 11 lines:
version check, test-only handler short-circuit, table lookup,
invoke-or-unknown.

Tests:

  - TestRPCHandlersMatchDocumentedMethods — keyset guard. Adding
    or removing a method without updating the expected slice is a
    red flag the test surfaces.
  - TestRPCHandlersAllNonNil — catches nil-function registrations.

All pre-existing dispatch tests (param decode, error codes, etc.)
keep passing unchanged — the handler contract for any given
method is byte-identical from the RPC client's perspective. Smoke
(all 21 scenarios) exercises every code path end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 15:40:08 -03:00
11a33604c0
daemon: extract startVMLocked into step runner with per-step rollback
startVMLocked was a ~260-line method running 18 sequential phases
with one lumped error path: on any failure, cleanupOnErr called
cleanupRuntime — a catch-all teardown that didn't distinguish
"this phase acquired resources we should undo" from "this phase is
idempotent." The blast radius was the entire VM lifecycle. Every
tweak to boot, NAT, disk, or auth-sync orchestration had to reason
about a closure that could fire at any of 18 points.

This commit extracts the phases into a data-driven pipeline:

  - startContext threads the mutable state (vm, live, apiSock,
    dmName, tapName, etc.) through every step by pointer so step
    bodies mutate in place without returning copies.
  - startStep carries the op.stage name, optional vmCreateStage
    progress ping, optional log attrs, a run closure, and an
    optional undo closure.
  - runStartSteps walks steps in order, appends the failing step
    to the rollback set (so partial-acquire failures like
    machine.Start's post-spawn HTTP config get their undo fired),
    then iterates the rollback set in reverse and joins errors
    via errors.Join.

Each phase that acquires a resource now owns its own undo:
system_overlay removes a file it created, dm_snapshot cleans up
the loop + DM handles it set, prepare_host_features delegates to
capHooks.cleanupState, tap releases via releaseTap, metrics_file
removes the file, firecracker_launch kills the spawned PID and
drops the sockets, post_start_features calls capHooks.cleanupState
again (capability Cleanup hooks are idempotent — safe to call
whether PostStart reached every cap or not). The 11 phases with
no teardown obligation leave `undo` nil and the driver silently
skips them on rollback.

cleanupRuntime is retired from the start-failure path. It stays
intact for reconcile, stopVMLocked, killVMLocked, deleteVMLocked,
stopStaleVMs — the crash-recovery / lifecycle-teardown contract
those paths rely on is unchanged.

startVMLocked shrinks from ~225 lines of sequential-phase code
plus a cleanupOnErr closure to ~45 lines: compute derived paths,
build the step list, drive it, persist ERROR state on failure.
Stage names preserved 1:1 so existing log grep + the async-create
progress stream stay compatible.

Tests:

  - TestRunStartSteps_RollsBackInReverseOnFailure — the contract
    is pinned: succeeded-before-failing run, all their undos in
    reverse, failing step's undo also fires, original err still
    visible via errors.Is.
  - TestRunStartSteps_SkipsNilUndos — optional-undo contract.
  - TestRunStartSteps_JoinsRollbackErrors — undo failures don't
    hide the root cause.
  - TestRunStartSteps_HappyPathNoRollback — success path never
    fires any undo.

Smoke: all 21 scenarios pass, including the start-path ones
(bare vm run, workspace vm run, vm restart, vm lifecycle, vm set
reconfig) that exercise real firecracker boots end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 15:34:34 -03:00
2ebd2b64bb
imagepull: update stale package + BuildExt4 docs
The package doc in internal/imagepull/imagepull.go still described
a two-step Pull + Flatten + BuildExt4 pipeline and warned that the
resulting image was "suitable as input to `image build` but not
directly bootable" because ownership preservation was deferred.
That's been wrong for a while: ApplyOwnership
(internal/imagepull/ownership.go) restores tar-header uid/gid/mode
via a debugfs set_inode_field batch, and InjectGuestAgents
(internal/imagepull/inject.go) writes banger's guest-side assets
into the image. `image pull` now produces a directly bootable
rootfs end-to-end.

Updated:
  - imagepull.go package doc — describes the full
    Pull → Flatten → BuildExt4 → ApplyOwnership → InjectGuestAgents
    pipeline and drops the "Phase A limitations" list that spoke
    of deferred ownership.
  - ext4.go BuildExt4 doc — notes that the filesystem is root-owned
    via `-E root_owner=0:0` and points at ApplyOwnership as the
    step that handles per-file ownership, instead of the previous
    "see the package doc for the implications" handwave.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 14:34:25 -03:00
5eceebe49f
daemon: persist tap device on VM.Runtime so NAT teardown survives handle-cache loss
Cleanup identity for kernel objects was split across two sources of
truth: vm.Runtime (DB-backed, durable) held paths and the guest IP,
but the TAP name lived only in the in-process handle cache + the
best-effort handles.json scratch file next to the VM dir. Every
other cleanup-identifying datum has a fallback — firecracker PID
can be rediscovered via `pgrep -f <apiSock>`, loops via losetup, dm
name from the deterministic ShortID(vm.ID). The tap is the one
truly cache-only datum (allocated from a pool, not derivable).

That made NAT teardown fragile:

  - daemon crash between `acquireTap` and the handles.json write
  - handles.json corrupt on the next daemon start
  - partial cleanup that already zeroed the cache

In any of those cases natCapability.Cleanup short-circuited
("skipping nat cleanup without runtime network handles") and the
per-VM POSTROUTING MASQUERADE + the two FORWARD rules keyed off
the tap would leak. The VM row in the DB still existed, so a retry
couldn't close the loop — the tap name was simply gone.

Fix: mirror TapDevice onto model.VMRuntime (serialised via the
existing runtime_json column, omitempty so existing rows upgrade
cleanly). Set it in startVMLocked right next to the
s.setVMHandles call that seeds the in-memory cache; clear it at
every post-cleanup reset site (stop normal path + stop stale
branch, kill normal path + kill stale branch, cleanupOnErr in
start, reconcile's stale-vm branch, the stats poller's auto-stop
path).

Fallbacks now cascade:

  - natCapability.Cleanup: handles cache → Runtime.TapDevice
  - cleanupRuntime (releaseTap): handles cache → Runtime.TapDevice

Both surfaces refuse gracefully (old behaviour) only when neither
source has a value, which really does mean "no tap was ever
allocated for this VM" rather than "we lost track of it."

Test: TestNATCapabilityCleanup_FallsBackToRuntimeTapDevice clears
the handle cache, sets vm.Runtime.TapDevice, and asserts Cleanup
reaches the runner — the exact scenario the review flagged as a
plausible leak and the exact code path that now guarantees it
doesn't.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 14:21:13 -03:00
1850904d9c
file_sync: skip nested symlinks during recursive copy
A user who sets `[[file_sync]] host = "~/.aws"` (per the README's
own example) can unintentionally copy files from outside that
directory if .aws contains symlinks. copyHostDir used os.Stat
during recursion, which transparently follows: a symlink to a
credential dir elsewhere would be recursed into, materialising
unrelated secrets inside the guest. For credential trees that's
an avoidable sprawl vector.

Switched copyHostDir's per-entry probe from os.Stat to os.Lstat
and added a default skip-with-warning branch for ModeSymlink.
Files and dirs at the SAME level copy as before; symlinks (both
file and directory flavours) surface a "file_sync skipped
symlink (would escape the requested tree)" warn log and are
otherwise omitted.

Top-level entry paths still follow — the Stat in runFileSync is
unchanged. The user explicitly named that path, so resolving
"~/.aws" through a symlink out of $HOME is on them.

Tests:
- TestRunFileSyncSkipsNestedSymlinks — builds a synced dir with
  both a file symlink and a directory symlink pointing outside
  the tree; asserts real files copy, symlinks do not materialise
  anywhere in the guest mount, and each skipped symlink surfaces
  a warn log entry.

README updated with a one-line note about the skip behaviour so
users know to expect it rather than chasing "why didn't my file
show up."

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 14:11:58 -03:00
caa6a2b996
model: validate VM names as DNS labels at CLI + daemon
A VM name flows into five places that all have narrower grammars
than "arbitrary string":

  - the guest's /etc/hostname  (vm_disk.patchRootOverlay)
  - the guest's /etc/hosts      (same)
  - the <name>.vm DNS record    (vmdns.RecordName)
  - the kernel command line     (system.BuildBootArgs*)
  - VM-dir file-path fragments  (layout.VMsDir/<id>, etc.)

Nothing in the chain was validating the input. A name with
whitespace, newline, dot, slash, colon, or = would produce broken
hostnames, weird DNS labels, smuggled kernel cmdline tokens, or
(in the worst case) surprising traversal through the on-disk
layout. Not host shell injection — we already avoid shelling out
with the raw name — but a real correctness and supportability bug.

New: model.ValidateVMName. Rules:

  - 1..63 chars (DNS label max per RFC 1123; also a comfortable
    /etc/hostname cap)
  - lowercase ASCII letters, digits, '-' only
  - no leading or trailing '-'
  - no normalization — the name is the user-visible identifier
    (store key, `ssh <name>.vm`, `vm show`); silently rewriting
    "MyVM" → "myvm" would hand the user back something different
    than they typed

Called from two places:

  - internal/cli/commands_vm.go vmCreateParamsFromFlags — rejects
    bad `--name` values before any RPC. Empty name still passes
    through so the daemon can generate one.
  - internal/daemon/vm_create.go reserveVM — defense in depth for
    any non-CLI RPC caller (SDK, direct JSON over the socket).

Tests:

  - internal/model/vm_name_test.go — exhaustive character-class
    matrix (space, newline, tab, dot, slash, colon, equals, quote,
    control chars, unicode letters, uppercase, leading/trailing
    hyphen, over-length, max-length-exact, digits-only).
  - internal/cli TestVMCreateParamsFromFlagsRejectsInvalidName —
    CLI wire-through + empty-name passthrough.
  - internal/daemon TestReserveVMRejectsInvalidName — daemon
    defense-in-depth (including `box/../evil` path-traversal).
  - scripts/smoke.sh — end-to-end rejection + no-leaked-row
    assertion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 14:06:40 -03:00
700a1e6e60
cleanup: drop pre-v0.1 migration scaffolding + legacy-behavior refs
banger hasn't shipped a public release — every "legacy", "pre-opt-in",
"previously", "migration note", "no longer" reference in the tree is
pinning against a state no real user's install has ever been in.
That scaffolding has weight: it's a coordinate system future readers
have to decode, and it keeps dead code alive.

Removed (code):
- internal/daemon/ssh_client_config.go
    - vmSSHConfigIncludeBegin / vmSSHConfigIncludeEnd constants and
      every `removeManagedBlock(existing, vm...)` call they enabled
      (legacy inline `Host *.vm` block scrub)
    - cleanupLegacySSHConfigDir (+ its caller in syncVMSSHClientConfig)
      — wiped a pre-opt-in sibling file under $ConfigDir/ssh
    - sameDirOrParent + resolvePathForComparison — only ever used
      by cleanupLegacySSHConfigDir
    - the "also check legacy marker" fallback in
      UserSSHIncludeInstalled / UninstallUserSSHInclude
- internal/store/migrations.go
    - migrateDropDeadImageColumns (migration 2) + its slice entry
    - dropColumnIfExists (orphaned after the above)
    - addColumnIfMissing + the whole "columns added across the pre-
      versioning lifetime" block at the end of migrateBaseline —
      subsumed into the baseline CREATE TABLE
    - `packages_path TEXT` column on the images table (the
      throwaway migration 2 dropped it, but there was never any
      reader)
- internal/daemon/vm.go
    - vmDNSRecordName local wrapper — was justified as "avoid
      pulling vmdns into every file"; three of four callers already
      imported vmdns directly, so inline the one stray call
- internal/cli/cli_test.go
    - TestLegacyRemovedCommandIsRejected (`tui` subcommand never
      shipped)

Removed / simplified (tests):
- ssh_client_config_test.go: dropped TestSameDirOrParentHandlesSymlinks,
  TestSyncVMSSHClientConfigPreservesUserKeyInLegacyDir,
  TestSyncVMSSHClientConfigNarrowsCleanupToLegacyFile,
  TestSyncVMSSHClientConfigLeavesUnexpectedLegacyContents,
  TestInstallUserSSHIncludeMigratesLegacyInlineBlock, plus the
  "legacy posture" regression strings in the remaining happy-path
  test; TestUninstallUserSSHIncludeRemovesBothMarkerBlocks collapsed
  to a single-block test
- migrations_test.go: dropped TestMigrateDropDeadImageColumns_AcrossInstallPaths,
  TestDropColumnIfExistsIsIdempotent; TestOpenReadOnlyDoesNotRunMigrations
  simplified to test against the baseline marker

Removed (docs):
- README.md "**Migration note.**" blockquote about the SSH-key path move
- docs/advanced.md parenthetical "(the old behaviour)"

Reworded (comments):
- Dropped "Previously this file also contained LogLevel DEBUG3..."
  history from vm_disk.go's sshdGuestConfig doc
- Dropped "Call sites that previously read vm.Runtime.{PID,...}"
  from vm_handles.go; now documents the current contract
- Dropped "Pre-v0.1 the defaults are" scaffolding in doctor_test.go
- Dropped "no longer does its own git inspection" phrasing in vm_run.go
- Dropped the "(also cleans up legacy inline block from pre-opt-in
  builds)" aside on the `ssh-config` CLI docstring
- Renamed test var `legacyKey` → `existingKey` in vm_test.go; its
  purpose was "pre-existing authorized_keys line," not banger-legacy

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 13:56:32 -03:00
5791466498
make: coverage-combined — merge unit-test and smoke covdata
Unit tests and the smoke suite cover different halves of the
codebase: unit for pure-Go branching (error paths, parsers, handler
wiring); smoke for the sudo / firecracker / dm-snap / real-KVM paths
unit tests physically can't reach. Separate reports each tell half
the story.

`make coverage-combined` runs the unit suite with
`-test.gocoverdir` pointed at a fresh binary-format dir, then
merges it with the existing smoke covdata via `go tool covdata
merge`. Modes must match; smoke uses the default 'set', so the
unit run aligns by NOT passing -covermode=atomic.

Output matches the existing `make coverage` layout (per-package
list + total) so the two targets read the same in CI.

`make coverage-combined-html` also emits an HTML report at
build/combined.cover.html for clicking through the uncovered
lines that neither suite touches.

Combined total right now: 72.7% (vs 37.7% unit-only / 49% daemon
via smoke).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 13:17:17 -03:00
235758e5b2
workspace: drop --readonly flag — advisory only against root guests
--readonly ran `chmod -R a-w` over the workspace after copying, but
every banger guest boots as root, and root bypasses DAC mode checks.
So a user running `vm workspace prepare ... --readonly` got the
mode bits set to 0444 but `echo x >> file` in the guest still
succeeded. The flag promised enforcement it couldn't deliver.

The feature also doesn't match the product model: workspaces are
prepared precisely so the guest CAN edit them, and `workspace
export` exists to pull those edits back as a patch. A
"read-only workspace" contradicts that loop.

Removed:
  - CLI flag `--readonly` on `vm workspace prepare`
  - api.VMWorkspacePrepareParams.ReadOnly field
  - model.WorkspacePrepareResult.ReadOnly field
  - daemon chmod dispatch in prepareVMWorkspaceGuestIO
  - smoke scenario pinning the (advisory) mode-bit behavior
  - misleading "exportbox-readonly" VM name in an unrelated export
    test (the test is about not mutating the real git index;
    renamed to exportbox-noindex-mutation)

If real enforcement becomes a user need later, the right primitive
is `chattr +i` (immutable bit — root CAN'T write) or a ro bind-mount.
Reintroducing a new flag is cheaper than debugging what the current
one actually guarantees.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 13:04:33 -03:00
bafe816fc7
smoke: cover the gaps — NAT, vm ports/restart/kill/prune, workspace variants, ssh-config
Audit of banger's advertised CLI surface vs. what smoke was exercising
turned up several gaps where a regression would have shipped silently.
New scenarios:

- NAT: asserts the per-VM POSTROUTING MASQUERADE rule is installed
  with --nat (scoped to the guest /32), idempotent across stop/start,
  and torn down on delete. End-to-end curl tests don't work here
  because the bridge IP and uplink IP both belong to the host — a
  guest reaching the uplink lands on host-local loopback whether
  MASQUERADE is set up or not — so the test pins the iptables rule
  itself. Skipped if passwordless `sudo iptables` isn't available.

- vm ports: sshd :22 must be visible with the <name>.vm endpoint
  (not localhost, not the raw guest IP — the daemon prefers the
  DNS record when one exists).

- vm restart: dedicated verb, not a stop+start alias. Asserts a
  fresh boot_id to prove the kernel actually recycled.

- vm kill --signal KILL: forceful termination path (distinct from
  `vm stop`'s graceful Ctrl-Alt-Del). Post-kill state must be
  'stopped' (not 'error') and the dm-snapshot must be cleaned up.

- vm prune -f: batch delete of non-running VMs while preserving any
  that are still running. Regression guard for the case where prune
  could wipe a live session.

- workspace prepare --readonly: mode bits on /root/repo/<file>
  must drop all write bits. Enforcement is advisory against a root
  guest, so the test asserts the bits, not EACCES.

- workspace prepare --mode full_copy: alternate transfer path
  (tarred into rootfs, no overlay) still lands the repo contents
  at /root/repo.

- workspace export --base-commit: guest-side commits captured in
  the patch when the pre-commit SHA is pinned. The feature's whole
  reason for existing; it had zero coverage. Includes a control
  assertion that the plain (no --base-commit) export does NOT see
  the committed file.

- ssh-config --install / --uninstall: HOME-isolated to a smoke
  tempdir so we don't touch the invoking user's ~/.ssh/config.
  Seeds a pre-existing config to catch any regression where
  install clobbers instead of appending. Asserts idempotency
  (second install doesn't duplicate the Include line) and clean
  round-trip (uninstall leaves the user's own content intact).

Coverage deltas from smoke (vs the last run):
  internal/hostnat          14.1% → 64.1%  (+50pp — NAT rule dance)
  internal/daemon/opstate   56.2% → 87.5%  (+31pp)
  internal/daemon           43.4% → 49.4%  (+6pp)
  internal/cli              36.1% → 40.4%  (+4pp)
  internal/daemon/workspace 64.1% → 67.5%  (+3pp)

Scenario count: 12 → 21.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 12:50:39 -03:00
b4afe13b2a
daemon: fix vm start (on a stopped VM) + regression coverage
Two defects compounded to make `vm create X` → `vm stop X` → `vm start X`
→ `vm ssh X` fail with `not_running: vm X is not running` even though
`vm show` reports `state=running`.

1. firecracker-go-sdk's startVMM spawns a goroutine that SIGTERMs
   firecracker when the ctx passed to Machine.Start cancels — and
   retains that ctx for the lifetime of the VMM, not just the boot
   phase. Our Machine.Start wrapper was plumbing the caller's ctx
   through, which on `vm.start` is the RPC request ctx. daemon.go's
   handleConn cancels reqCtx via `defer cancel()` right after
   writing the response. Net effect: firecracker is killed ~150ms
   after the `vm start` RPC "completes", invisibly, and the next
   `vm ssh` sees a dead PID. `vm.create` side-stepped the bug
   because BeginVMCreate detaches to context.Background() before
   calling startVMLocked; `vm.start` used the RPC ctx directly.
   Fix: Machine.Start now passes context.Background() to the SDK.
   We own firecracker lifecycle explicitly (StopVM / KillVM /
   cleanupRuntime), so ctx-driven cancellation here was never
   actually wired into anything useful.

2. With (1) fixed, the same scenario exposed a second defect:
   patchRootOverlay's e2cp/e2rm refuses to touch the dm-snapshot
   with "Inode bitmap checksum does not match bitmap" on a restart,
   because the COW holds stale free-block/free-inode counters from
   the previous guest boot. Kernel ext4 is fine with this; e2fsprogs
   is not. Fix: run `e2fsck -fy` on the snapshot between the
   dm_snapshot and patch_root_overlay stages. Idempotent on a fresh
   snapshot, reconciles the bitmaps on a reused COW.

Regression coverage:
  - scripts/repro-restart-bug.sh — minimal create→stop→start→ssh
    reproducer with rich on-failure diagnostics (daemon log trace,
    firecracker.log tail, handles.json, pgrep-by-apiSock, apiSock
    stat). Exits non-zero if the bug returns.
  - scripts/smoke.sh — lifecycle scenario (create/ssh/stop/start/
    ssh/delete) and vm-set scenario (--vcpu 2 → stop → set --vcpu 4
    → start → assert nproc=4). Both were pulled when the bug was
    first found; now restored.

Supporting:
  - internal/system/system.ExitCode — extracts exec.ExitError's
    code without forcing callers to import os/exec. Needed by the
    e2fsck caller (policy test pins os/exec to the shell-out
    packages).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 12:01:46 -03:00
e94e7c4dcc
smoke: workspace export scenario + smoke-fresh target + fix the export bug it caught
The export round-trip (`vm create` → `workspace prepare` → guest edit →
`workspace export`) exposed a reproducible failure on Debian bookworm
guests: `git read-tree HEAD --index-output=/tmp/...` returns exit 128
"unable to write new index file" when the target lives on tmpfs while
`.git` is on the workspace overlay. Move the temp index into
`$(git rev-parse --git-dir)` so it shares a filesystem with `.git/index`
and the lockfile + rename + hardlink dance git does internally works.

Alongside:
- new workspace-export smoke scenario that would have caught this at
  the boundary between daemon and guest git
- `make smoke-fresh` = `smoke-clean && smoke` for release-time runs
  that want first-install paths (migrations, image pull) stamped into
  the coverage report

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 11:34:55 -03:00
672d7151e9
smoke: five more scenarios + fix exit-code propagation bug the new ones caught
Five new smoke scenarios layered on top of the existing bare + workspace
vm-run tests:

  - exit-code propagation: `sh -c 'exit 42'` must rc=42
  - workspace dry-run: --dry-run lists tracked files without a VM
  - workspace --include-untracked: opt-in ships files outside the git
    index (regression guard on the security-default flip from review 4)
  - concurrent vm runs: two --rm invocations in parallel both succeed
    (stresses per-VM locks, createVMMu reservation window, tap pool)
  - invalid spec rejection: --vcpu 0 must fail with no VM row left
    behind (the "cleanup on partial failure" path the review flagged)

The exit-code scenario caught a real bug on first run:

  `banger vm run --rm -- sh -c 'exit 42'` returned rc=0, not 42.

Root cause in internal/cli/ssh.go's sshCommandArgs: extra args were
appended to the ssh argv verbatim, relying on ssh(1)'s implicit
space-join to deliver the remote command. That works for single
tokens (echo hello) but re-tokenises multi-word commands on the
remote side: `ssh host sh -c 'exit 42'` becomes remote
`sh -c exit 42`, where `42` is $0 for the already-completed `exit`,
and the exit code the user asked for is lost.

Fix: shell-quote every element of extra (`'sh'` `'-c'` `'exit 42'`)
and join them into a single trailing argv entry. ssh's space-join
then produces exactly the command the user typed on the remote
shell. TestSSHCommandArgs was updated to pin the quoting; the
existing TestRunVMRunCommandModePropagatesExitCode test needed a
one-word quote tweak (`false` → `'false'`).

Smoke run after fix passes all seven scenarios in ~2 min on warm
state. cmd/banger coverage jumped to 100% (the invalid-spec
scenario hits the error-reporting path that wasn't covered
before).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 19:37:07 -03:00
5f81332b0a
make smoke: end-to-end boot suite with coverage from real VM runs
The unit + integration tests can't cross machine.Start — the SDK
boundary would need a fake firecracker that reimplements the
control-plane HTTP API, and the ongoing maintenance cost of keeping
that fake honest with upstream kills the value. Instead, add a
pre-release smoke target that drives REAL Firecracker + real KVM,
captures coverage from the -cover-instrumented binaries, and
surfaces per-package deltas so regressions in the boot path don't
ship silently.

scripts/smoke.sh:
  - Isolated XDG_{CONFIG,STATE,CACHE,RUNTIME} so the smoke run can't
    touch real user state (state/cache persist under build/smoke/xdg
    for fast reruns; runtime is mktemp'd fresh per-run because
    sockets can't be reused)
  - Preflight: `banger doctor` must pass; UDP :42069 must be free
    (otherwise the user's real daemon is up and the smoke daemon
    can't bind its DNS listener — fail with an actionable message)
  - Scenario 1 — bare: `banger vm run --rm -- echo smoke-bare-ok`
    exercises create → start → socket ownership chown → machine.Start
    → SDK waitForSocket race → vsock agent readiness → guest SSH
    wait → exec → cleanup → delete
  - Scenario 2 — workspace: creates a throwaway git repo, runs
    `banger vm run --rm <repo> -- cat /root/repo/smoke-file.txt`,
    verifies the tracked file reached the guest (exercises
    workDisk capability PrepareHost + workspace.prepare)
  - `banger daemon stop` at the end so instrumented binaries flush
    GOCOVERDIR pods before the script exits

Makefile additions:
  - smoke-build: builds banger/bangerd under build/smoke/bin/ with
    `go build -cover`
  - smoke: runs the script with GOCOVERDIR set, reports per-package
    coverage via `go tool covdata percent`
  - smoke-coverage-html: textfmt + go tool cover for a browsable
    report
  - smoke-clean: nukes build/smoke/ including the persisted XDG
    state

Bonus fix uncovered during the first smoke run: doctor treated a
missing state.db as a FAIL ("out of memory" from SQLite
SQLITE_CANTOPEN), which red-flagged every fresh install. Split
the store check: DB file absent → PASS with "will be created on
first daemon start" detail; DB present but unreadable → FAIL as
before. New TestDoctorReport_StoreMissingSurfacesAsPassForFreshInstall
pins the behaviour.

Concrete coverage delta from the first successful smoke run
(compared to `make coverage-total`'s unit-test-only 37.8%):

  internal/firecracker        43.6% → 75.0%
  internal/daemon/workspace   33.8% → 60.8%
  internal/store              40.1% → 56.3%
  internal/guest              63.7% → 57.4%  (different mix: smoke
                                              exercises real SSH;
                                              unit tests cover more
                                              error branches)

The packages the review flagged are the ones that moved most —
which is the point.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 18:59:57 -03:00
52612b516b
README: describe the SSH-key migration as a VM restart, not workspace sync
The migration note said existing VMs needed a "fresh workspace
sync" to pick up a new host SSH key fingerprint. That's wrong:
workspace.prepare (vm workspace prepare) touches the git checkout,
not authorized_keys. The authorized_keys rewrite happens on the
vm start path — specifically in workDiskCapability.PrepareHost
calling WorkspaceService.ensureAuthorizedKeyOnWorkDisk, which runs
during start, not during an explicit workspace sync.

Rewrite the note to name the actual recovery action: stop-and-start
(or vm restart). Leave the --rm caveat — those flows always boot
fresh and don't carry the problem.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 18:06:52 -03:00
e2ac70631b
test: end-to-end VMService lifecycle flow harness
Uses the newTestDaemon harness added earlier in this series to
drive VMService.CreateVM → FindVM → (duplicate CreateVM reject) →
DeleteVM through the real code path. Pins contracts the
single-responsibility tests can't:

  - image resolution finds a locally-upserted row
  - reserveVM allocates an IP, mkdirs VMDir, persists the Stopped
    row
  - FindVM round-trips the record
  - duplicate-name CreateVM fails without persisting a second row
  - DeleteVM runs cleanupRuntime with zero handles (no sudo calls
    needed), removes the store row, removes the VMDir

Plus TestVMCreateWithUnknownImageFails for the error branch: if
CreateVM can't resolve an image, it must error before mutating
any state.

Scope: everything except firecracker boot. NoStart: true skips
machine.Start, which is the upstream SDK boundary we can't cross
without a real firecracker binary. The integration we GET
exercises name/IP reservation, per-VM lock lifecycle, store
round-trip, VMDir lifecycle, and the never-started delete path —
all of which were only indirectly covered by unit tests before.

-race clean across the two tests; nothing touches goroutines but
the harness does initialize the tap pool background goroutine on
wireServices, which the race detector validated.

Coverage is flat at the global level (37.8% → 37.8%) because
this slice tests integration of already-covered units, not new
branches. That's expected — the slice's value is as a regression
bedrock for future refactors, not a line-count bump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 17:55:04 -03:00
7820960706
store: edge-path tests for migrations and Open
Three gaps from the coverage plan, none of which were covered before.

internal/store/migrations_test.go:

  TestRunMigrationsIgnoresUnknownAppliedIDs — simulates a DB
  written by a newer banger opened by an older one: schema_migrations
  carries an id (9001) the current binary doesn't know about. The
  runner must leave the alien row alone AND still apply its own
  known migrations. Without this, forward-then-backward upgrades or
  running two daemon versions against the same state dir would
  either fail or start destructively reinterpreting rows.

  TestDropColumnIfExistsIsIdempotent — pins the "run twice, no harm"
  property. A daemon restart after migration 2 succeeded on a fresh
  install must not fail because the column is already gone.
  dropColumnIfExists is what makes that idempotent.

internal/store/store_test.go:

  TestOpenRejectsCorruptDB — writes garbage to state.db, Open must
  error cleanly (not panic, not silently overwrite). Also verifies
  the garbage bytes are untouched so the operator can hand the
  file to a recovery tool.

  TestOpenReadOnlyRejectsMissingDB — the doctor path must not
  silently create an empty DB when none exists; that would make
  "no VMs yet" and "your state is missing" indistinguishable.

Package function coverage nudged 39.1% → 40.1%.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 17:51:52 -03:00
2f3db9b104
fcproc: targeted tests for waitForPath + EnsureSocketAccess error paths
Every non-happy branch in fcproc was zero-covered before this. Given
that EnsureSocketAccess gates the firecracker control plane on the
daemon's ability to chown the API + vsock sockets off root, those
failure paths are exactly the ones we need pinned.

New file internal/daemon/fcproc/fcproc_test.go adds a local scripted
Runner (fcproc is a leaf package — can't pull the daemon's
scriptedRunner in) and six tests:

waitForPath:
  - TestWaitForPathReturnsDeadlineExceededWhenSocketNeverAppears —
    timeout branch wraps context.DeadlineExceeded with the label,
    and waits at least one poll tick before giving up
  - TestWaitForPathReturnsOnceSocketAppears — happy path with a
    mid-wait file creation via goroutine
  - TestWaitForPathRespectsContextCancellation — ctx.Done() beats
    the poll interval so a cancelled request doesn't stall

EnsureSocketAccess:
  - TestEnsureSocketAccessChownFailureBubbles — chown error surfaces
    untouched; chmod not attempted when chown fails
  - TestEnsureSocketAccessChmodFailureBubbles — chmod error surfaces
    after chown succeeds
  - TestEnsureSocketAccessTimesOutBeforeTouchingRunner — ordering
    contract: no sudo calls when the socket never materialises

Package function coverage moved 55.2% → 62.1%.

Integration-level chown-race test was considered (run a real shell
that exercises buildProcessRunner's script with a fake firecracker
binary) but skipped — requires `sudo -n` in the test env and makes
CI fragile. The socket-ownership regression this slice is meant to
guard against is covered at the unit level here; the
manual-smoke in the plan's verification section remains the
end-to-end check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 17:49:42 -03:00
cef9bf92a5
ssh-config: harden sameDirOrParent against symlinks + add edge tests
The symlink test in this commit catches a real bug: sameDirOrParent
used filepath.Abs for both sides of the "is the key inside the
legacy dir?" check, but filepath.Abs doesn't resolve symlinks. A
user whose ssh_key_path pointed into ConfigDir/ssh via a symlinked
spelling (e.g. ConfigDir itself is a symlink, or the user
maintains an alias tree) would have their key silently deleted by
the legacy-dir scrub — the gate thought the key lived elsewhere
because the two spellings didn't match lexically.

Fix: resolvePathForComparison tries filepath.EvalSymlinks first,
falls back to filepath.Abs when the path doesn't exist yet (new
install, pre-first-Open). Both sides of the sameDirOrParent
comparison now use this helper, so a symlinked key + canonical
dir (or the reverse) lands in the same physical path before the
Rel check.

Tests added in this commit:

internal/daemon/ssh_client_config_test.go
  TestSameDirOrParentHandlesSymlinks — symlinked-key + canonical-dir
  and the reverse are both reported "inside"; unrelated paths stay
  out. Skips if the filesystem doesn't support symlinks.

internal/config/config_test.go
  TestLoadNormalizesAbsoluteSSHKeyPath — trailing slash, duplicate
  slashes, dot segments all collapse via filepath.Clean, so two
  spellings of the same path compare equal downstream.
  TestEnsureDefaultSSHKeyRejectsCorruptExistingFile — regression
  guard against a future "regenerate if invalid" patch that would
  silently nuke a real user key.
  TestResolveSSHKeyPathRejectsEmptySSHDirAndStateDir — pins the
  absolute-path guard that stops a bad layout from scribbling
  into cwd (this was the test that caught the stray
  internal/config/ssh/ a few commits back).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 17:48:06 -03:00
b2756f5e7e
test: add newTestDaemon harness + options
No Daemon test in this package has a shared constructor. Every file
re-derives the same pattern — &Daemon{...}, wireServices(d), maybe
override a field — which means new lifecycle / integration tests
spend half their length standing up infrastructure instead of
exercising behaviour.

Consolidate into internal/daemon/daemon_testing_test.go:

  func newTestDaemon(t *testing.T, opts ...testDaemonOption) *Daemon

Defaults: tempdir layout (distinct StateDir/ConfigDir/SSHDir/...),
fresh store.Store with migrations auto-run, permissiveRunner,
io.Discard logger, empty vmCaps (so default workDisk/dns/nat
capabilities don't fire real side effects in tests that just want
to exercise VMService plumbing).

Options so far:
  - withRunner(system.CommandRunner)
  - withConfig(model.DaemonConfig)
  - withStore(*store.Store)
  - withLogger(*slog.Logger)
  - withLayout(paths.Layout)
  - withVMCaps(caps ...vmCapability)
  - withVsockHostDevice(string)

withVMCaps tracks a vmCapsSet flag so tests that explicitly pass no
caps (i.e. the default) still get the empty-slice behaviour — the
reset after wireServices only fires when the caller didn't opt in.
That keeps wireServices's production semantics unchanged: if you
construct a real Daemon without pre-populating vmCaps, you still
get the default three.

Two smoke tests pin:
  - zero-option call wires every service, gives an empty-vmCaps
    daemon with the default vsock device, store non-nil
  - each option actually lands on the resulting Daemon (guards
    against silent rename)

Existing tests unchanged — this is purely additive. Later slices
(Firecracker error-path tests, store migration edges, lifecycle
flow harness) will adopt the helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 17:45:43 -03:00
e2885060dc
README: add ssh_key_path migration note + document normalization rules
Docs pointing at the new state-dir default were updated in b1fbf69;
what was still missing is the migration guidance the review asked for.

Add a short note under the ssh_key_path bullet covering:

- what moved (~/.config/banger/ssh/id_ed25519 →
  ~/.local/state/banger/ssh/id_ed25519)
- that users with the old path hardcoded in config.toml are safe
  (the narrowed legacy-dir cleanup preserves the enclosing dir when
  ssh_key_path points inside it)
- that unsetting the key and letting banger manage the new default
  is also fine — the only caveat is existing VMs need a
  stop-and-start to re-sync authorized_keys

Also document the new normalization rules (~/ expansion, absolute
required) on the ssh_key_path bullet itself so users know what's
accepted before they hit a load error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 17:14:00 -03:00
617008e8f1
config: normalize ssh_key_path — expand ~/, reject non-absolute
Bug: resolveSSHKeyPath returned a configured ssh_key_path verbatim.
That meant:

- ssh_key_path = "~/.ssh/id_ed25519" kept the literal "~" — downstream
  readers (internal/guest/ssh.go, internal/daemon/image_seed.go,
  internal/daemon/vm_authsync.go, internal/cli/ssh.go) do raw
  os.ReadFile on the path and fail at runtime with a path that looks
  fine but isn't.
- ssh_key_path = "id_ed25519" (relative) silently worked or didn't
  depending on the daemon's cwd — the daemon process's cwd is not
  the user's shell cwd, so behavior was non-obvious.

Fix: add normalizeSSHKeyPath() run over configured values. It:

  - expands "~/..." against $HOME
  - rejects bare "~" (ambiguous)
  - rejects "~user/..." (we don't do user-tilde)
  - rejects relative paths outright
  - returns filepath.Clean'd absolute paths

Tests cover the accepting case (home-anchored expansion) and every
rejection branch via a table-driven subtests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 17:00:34 -03:00
b1fbf695ca
ssh-config: narrow the legacy-dir cleanup so it can't delete a user key
Bug: syncVMSSHClientConfig did os.RemoveAll on $ConfigDir/ssh every
daemon Open. The intent was to migrate off the pre-opt-in layout,
where banger used to write $ConfigDir/ssh/ssh_config. But a user who
sets ssh_key_path = "~/.config/banger/ssh/id_ed25519" in config.toml
has their key live exactly in that dir — and the scrub deletes it
along with every other file in the tree.

This is the same class of bug that cost the default key until
ebe6517 moved it to StateDir, but that fix was scoped to the default
path. A configured ssh_key_path pointed under the legacy dir still
dies.

Fix: replace os.RemoveAll with a narrow two-step cleanup:

 1. Skip the cleanup entirely when the configured ssh_key_path
    resolves under the legacy dir. A user who pointed banger at a
    key there must keep the enclosing directory.
 2. Otherwise, os.Remove the specific legacy file ($ConfigDir/ssh/
    ssh_config) and then os.Remove the directory. The second
    os.Remove fails with ENOTEMPTY if the dir still holds anything
    (e.g. a user-managed sibling file we don't own). Both errors
    are swallowed — this is best-effort migration, not a hard
    failure.

Tests pin all three paths: user key under legacy dir survives,
legacy dir empties and is removed when the user moved on, and a
user-managed sibling file in the legacy dir is preserved.

Also fix stale doc claims in README.md and AGENTS.md — both still
pointed at the old ~/.config/banger/ssh/id_ed25519 default, which
moved to ~/.local/state/banger/ssh/id_ed25519 in ebe6517.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 16:31:07 -03:00
fba30f26d4
firecracker: chown API + vsock sockets inside the sudo shell
Bug: Firecracker creates its API and vsock sockets as root:root 0700
(enforced by the intentional umask 077 in buildProcessRunner). The
daemon, running as the invoking user, then can't connect(2) to
either — AF_UNIX connect needs write permission on the socket file
and 0700 root-owned leaves thales without any.

firecracker-go-sdk's Machine.Start() blocks on waitForSocket, which
probes the socket with both os.Stat (succeeds — parent dir is the
user's XDG_RUNTIME_DIR) and an HTTP GET over the socket (fails —
EACCES on connect). The SDK loops for 3 seconds then fails with
"Firecracker did not create API socket ... context deadline exceeded".

The daemon's EnsureSocketAccess chown was meant to fix permissions,
but it runs *after* Machine.Start returns — and Start never returns
because it's still looping on the SDK's probe. Chicken-and-egg.

Fix: inside the sudo'd shell that launches firecracker, spawn a
background subshell that polls for each expected socket (API + vsock,
when configured) and chowns it to $SUDO_UID:$SUDO_GID as soon as it
appears. The background polling is bounded at 1s (20 × 50ms) so a
broken firecracker invocation doesn't leak a waiting shell.

Post-fix: socket appears root-owned 0600 briefly, is chowned to the
invoking user within ~50ms, SDK's HTTP probe succeeds, Machine.Start
returns normally. EnsureSocketAccess's later chmod 600 remains the
belt-and-braces guarantee on final mode.

Verified: manual repro of the shell script produces a socket owned
by thales:thales that a non-root python socket.connect() accepts.
Without the fix the same setup gives "PermissionError: [Errno 13]
Permission denied".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 16:09:02 -03:00
60f90eb8be
config: harden resolveSSHKeyPath against relative paths + drop stray test keys
Prior commit's test (TestLoadDefaultsResolveFirecrackerAndGenerateSSHKey)
was contained, but every OTHER config test called
Load(paths.Layout{ConfigDir: ...}) without SSHDir or StateDir set.
Under the new code path that meant resolveSSHKeyPath produced a
relative target ("ssh/id_ed25519") which go test happily wrote
against the package's own source directory — and I caught that in
the commit after the fact, in the form of internal/config/ssh/
showing up as tracked files.

Two changes:
- resolveSSHKeyPath now errors if the resolved path is not absolute.
  paths.Resolve always produces an absolute SSHDir in production;
  this just stops a fumbled layout from silently scribbling into
  cwd.
- Every existing config test that was relying on the old
  ConfigDir/ssh path gets an explicit SSHDir: t.TempDir() added,
  restoring the key-generation surface under tempdir isolation.

Delete internal/config/ssh/ — those files were test-generated by the
prior commit and have no business in the repo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 14:31:11 -03:00
ebe651762f
config: put the default SSH key under the state dir, not ConfigDir/ssh
Bug: every daemon Open deleted the freshly-generated default SSH key
before returning, so the next VM create failed reading it.

Sequence:
  1. Open → config.Load → resolveSSHKeyPath generates
     ~/.config/banger/ssh/id_ed25519
  2. Open → ensureVMSSHClientConfig → syncVMSSHClientConfig scrubs
     ~/.config/banger/ssh entirely as a migration step for the
     pre-opt-in layout (commit 108f7a0)

The scrub was added for a file that used to live at
ConfigDir/ssh/ssh_config, but it os.RemoveAll'd the whole
ConfigDir/ssh dir — including the id_ed25519 the key generator had
just put there.

Fix: point the default key at layout.SSHDir (a StateDir-rooted path
that paths.Ensure already creates). The scrub can keep cleaning up
ConfigDir/ssh because nothing banger writes under it anymore.

Users whose ssh_key_path is explicitly set in config.toml are
unaffected — configured wins. Users on the default path will get a
fresh key at StateDir/ssh/id_ed25519 on their next daemon Open;
existing VMs' authorized_keys re-sync on next start/create through
ensureAuthorizedKeyOnWorkDisk, so no manual intervention is needed
beyond restarting the daemon.

Regression test pins the new placement and asserts the legacy path
stays empty.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 14:29:34 -03:00
80ae4d6667
docs: resync package docs, AGENTS, and kernel-catalog with current code
Four drift fixes from a doc sweep.

internal/daemon/doc.go
  Replace the capability-hook description that still said "Hook
  methods take *Daemon; VMService reaches them through a
  capabilityHooks seam." Current reality: every capability is a
  plain struct carrying its own service pointers
  (workDiskCapability{vm,ws,store}, dnsCapability{net},
  natCapability{vm,net,logger}); wireServices builds the default
  list; no hook reaches *Daemon.

internal/daemon/ARCHITECTURE.md
  The VMService field list still claimed guestWaitForSSH and
  guestDial were "per-instance fields." Those were deleted as
  refactor residue. Update the note to say the seams live on
  *Daemon (reached by WorkspaceService via closures wired at
  construction) and document the vsockHostDevice field that
  replaced the old package-global vsockHostDevicePath.

AGENTS.md
  Drop the "experimental web UI" mention (removed) and the
  `session` subpackage (removed). Mention banger-vsock-agent as
  the third cmd/ binary while we're here — AGENTS hadn't listed
  it.

docs/kernel-catalog.md
  The trust-model section still read as if upstream kernel sources
  were fetched by HTTPS alone. Add a paragraph covering the PGP
  verification make-generic-kernel.sh now does against the
  detached .tar.sign and the three kernel.org release signing keys.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 13:01:11 -03:00
88bc466d58
tests: targeted coverage for doctor, workspace rejections, and nat capability
Three thematic test files pinning behavior surfaces that had none
before, following the review's recommendation to plug concrete
error/cleanup branches rather than chase a coverage percentage.

doctor_test.go
  Covers Daemon.doctorReport end-to-end with a permissive runner +
  fake executables on PATH. Pins: store error surfaces as fail,
  store success as pass, missing firecracker kills the host-runtime
  check, the three default capability feature checks (work disk,
  vm dns, nat) are emitted, vm-defaults is always-pass with
  provenance. Previously 0% — now the Doctor() command's contract
  with the CLI is under guard.

workspace_rejection_test.go
  Covers the four early-exit branches of PrepareVMWorkspace that
  the existing happy-path + lock-release tests never hit: malformed
  mode, --from without --branch, VM not running, VM not found.
  Each one returns before any SSH I/O, so the fake-firecracker
  infra the happy-path test needs is unnecessary — a bare wired
  daemon with a stored VMRecord suffices.

nat_capability_test.go
  Covers natCapability.ApplyConfigChange (unchanged flag → no-op,
  VM not alive → no-op, toggle on live VM → runner reached) and
  natCapability.Cleanup (NAT disabled → no-op, runtime handles
  missing → defensive no-op, full wiring → ensureNAT(false)). A
  countingRunner + startFakeFirecracker fixture stands in for the
  real host plumbing, with waitForVMAlive polling past the
  exec -a race window that startFakeFirecracker exposes on
  loaded CI boxes.

make coverage-total 37.8% → 38.6%. The number isn't the point —
these tests exist so the next refactor in this area has to
break an explicit assertion to drift.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 12:58:12 -03:00
3aec590deb
vmservice: delete dead guestWaitForSSH + guestDial seams
VMService carried guestWaitForSSH and guestDial fields + matching
constructor args ever since the daemon split, but nothing on
VMService ever read them. The live guest-SSH path runs on *Daemon
(d.waitForGuestSSH / d.dialGuest in guest_ssh.go); WorkspaceService
reaches those through closures wired in wireServices.

So the VMService copies were refactor residue: they made the
service look more decoupled than it actually is, and any future
test that stubbed VMService.guestDial would be stubbing nothing.
Delete the fields, the deps entries, the newVMService assignments,
and the wireServices passes.

Real seams on *Daemon are unchanged — those are the ones tests
(e.g. workspace_test.go) already set directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 12:45:27 -03:00
bbd187391e
noteUntrackedSkipped: fix subdir underreport + be best-effort everywhere
Two bugs in the untracked-skipped warning, both surfaced by review.

1. Wrong scope for subdir inputs. The helper accepted any path the
   caller had (sourcePath, which may be a user-supplied subdirectory)
   and ran `git -C <path> ls-files --others --exclude-standard`. Git
   scopes that output to the cwd, so pointing `vm run ./repo/sub` at
   a subdir silently underreported untracked files living elsewhere
   in the repo — exactly the files the warning exists to surface.
   Fix: resolve sourcePath to the repo root inside the helper via
   `rev-parse --show-toplevel` before counting.

2. Inconsistent failure handling. The comment said the helper should
   be silent when the count can't be determined; the body returned
   the error. vm_run.go treated the error as non-fatal (logged a
   warning, continued); workspace prepare and --dry-run aborted the
   whole operation on the same helper failure. A courtesy notice
   shouldn't kill the operation.
   Fix: make the helper best-effort in signature and body — no error
   return, swallows rev-parse + count failures, emits nothing when
   there's nothing to say. All three callers lose their error
   branches.

Regression tests:
- subdir input reports the root-level untracked file (the bug case)
- non-repo path produces silence, not a fatal error
- inspector whose runner errors on every call produces silence

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 12:42:33 -03:00
ecb18ce6ca
seams: move the last four package globals onto instance fields
Three test seams were still package-level mutable vars, which tests
had to swap before use. That's the classic path to flaky parallel
tests — two goroutines fighting over the same global fake. Push each
down to the struct that owns the behaviour.

internal/daemon/dns_routing.go
  lookupExecutableFunc + vmDNSAddrFunc → fields on *HostNetwork,
  defaulted at newHostNetwork time. dns_routing_test builds
  HostNetwork{..., lookupExecutable: stub, vmDNSAddr: stub} inline,
  no more t.Cleanup dance around package-level vars.

internal/daemon/preflight.go + doctor.go
  vsockHostDevicePath (mutable string) → vsockHostDevice field on
  *VMService, defaulted via defaultVsockHostDevice constant in
  newVMService. Preflight reads s.vsockHostDevice; doctor reads
  d.vm.vsockHostDevice. Logger test sets d.vm.vsockHostDevice = tmp
  after wireServices.

internal/daemon/workspace/workspace.go
  HostCommandOutputFunc → *Inspector struct with a Runner field.
  Every git-using helper (GitOutput, GitTrimmedOutput,
  GitResolvedConfigValue, RunHostCommand, ListSubmodules,
  ListOverlayPaths, CountUntrackedPaths, InspectRepo,
  ImportRepoToGuest, PrepareRepoCopy) is now a method on *Inspector.
  NewInspector() wraps the real host runner for production;
  WorkspaceService holds one via repoInspector, CLI deps holds one
  too. cli_test.go's submodule-rejection test builds its own
  Inspector with a scripted Runner instead of patching a global.
  Pure helpers (FinalizeScript, ResolveSourcePath, ParsePrepareMode,
  ShellQuote, FormatStepError, GitFileURL, ParseNullSeparatedOutput)
  stay free functions since they don't touch the host.

Sentinel: grep for HostCommandOutputFunc, lookupExecutableFunc,
vmDNSAddrFunc, vsockHostDevicePath is now empty across internal/.
make lint test green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 12:07:14 -03:00
2685bc73f8
doctor: open the state DB read-only so inspection never mutates it
`banger doctor` used to call store.Open, which unconditionally runs
migrations on the way up. Diagnostics mutating persistent state is a
surprise — particularly now that migration 2 drops a column, so a
plain `doctor` invocation against an old DB would silently schema-
evolve it.

Add store.OpenReadOnly: separate DSN builder with mode=ro and a
minimal pragma set (foreign_keys, busy_timeout — no journal_mode=WAL,
no wal_autocheckpoint), skips runMigrations, and pings on open so a
missing DB fails up front rather than at first query. doctor.go now
uses OpenReadOnly; the existing storeErr fallback path surfaces any
failure as a failing check, unchanged.

Tests pin two invariants:
- OpenReadOnly against a DB whose migration 2 marker was removed and
  packages_path re-added must leave both alone (i.e. no drift is
  applied behind the user's back).
- Any write attempted through the read-only handle is rejected at
  the driver layer (belt-and-braces for future refactors).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 11:05:23 -03:00
129475be20
config + store: remove dead knobs and stale schema
Three drift items surfaced in review, each dead on arrival and each
worth trusting a little more at v0.1.0.

config: drop MetricsPollInterval. The field was parsed from TOML
(metrics_poll_interval), stored on DaemonConfig, and ignored by every
consumer — only StatsPollInterval drives the background poll loop.
Users setting it in config.toml saw zero effect. Removed from the TOML
surface, the model constant, and the config test.

daemon: delete ensureDefaultImage. No callers, body was `_ = ctx;
return nil`. Dead since whatever flow used to call it got removed.

store: drop packages_path from the images table. The column was
carried by the baseline migration but never referenced by UpsertImage
(no INSERT / UPDATE mention) or any Go model field — a ghost from a
build pipeline that no longer exists. Added migration id=2
(drop_dead_image_columns) with an idempotent dropColumnIfExists
helper: fresh installs run baseline (creates the column) + 2 (drops
it); legacy DBs where the column was never added get a no-op. Updated
the direct-INSERT SQL in TestGetImageRejectsMalformedTimestamp to
drop the column reference, and added a migration test covering both
install paths (fresh + legacy).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 10:54:01 -03:00