Commit graph

247 commits

Author SHA1 Message Date
bbd187391e
noteUntrackedSkipped: fix subdir underreport + be best-effort everywhere
Two bugs in the untracked-skipped warning, both surfaced by review.

1. Wrong scope for subdir inputs. The helper accepted any path the
   caller had (sourcePath, which may be a user-supplied subdirectory)
   and ran `git -C <path> ls-files --others --exclude-standard`. Git
   scopes that output to the cwd, so pointing `vm run ./repo/sub` at
   a subdir silently underreported untracked files living elsewhere
   in the repo — exactly the files the warning exists to surface.
   Fix: resolve sourcePath to the repo root inside the helper via
   `rev-parse --show-toplevel` before counting.

2. Inconsistent failure handling. The comment said the helper should
   be silent when the count can't be determined; the body returned
   the error. vm_run.go treated the error as non-fatal (logged a
   warning, continued); workspace prepare and --dry-run aborted the
   whole operation on the same helper failure. A courtesy notice
   shouldn't kill the operation.
   Fix: make the helper best-effort in signature and body — no error
   return, swallows rev-parse + count failures, emits nothing when
   there's nothing to say. All three callers lose their error
   branches.

Regression tests:
- subdir input reports the root-level untracked file (the bug case)
- non-repo path produces silence, not a fatal error
- inspector whose runner errors on every call produces silence

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 12:42:33 -03:00
ecb18ce6ca
seams: move the last four package globals onto instance fields
Three test seams were still package-level mutable vars, which tests
had to swap before use. That's the classic path to flaky parallel
tests — two goroutines fighting over the same global fake. Push each
down to the struct that owns the behaviour.

internal/daemon/dns_routing.go
  lookupExecutableFunc + vmDNSAddrFunc → fields on *HostNetwork,
  defaulted at newHostNetwork time. dns_routing_test builds
  HostNetwork{..., lookupExecutable: stub, vmDNSAddr: stub} inline,
  no more t.Cleanup dance around package-level vars.

internal/daemon/preflight.go + doctor.go
  vsockHostDevicePath (mutable string) → vsockHostDevice field on
  *VMService, defaulted via defaultVsockHostDevice constant in
  newVMService. Preflight reads s.vsockHostDevice; doctor reads
  d.vm.vsockHostDevice. Logger test sets d.vm.vsockHostDevice = tmp
  after wireServices.

internal/daemon/workspace/workspace.go
  HostCommandOutputFunc → *Inspector struct with a Runner field.
  Every git-using helper (GitOutput, GitTrimmedOutput,
  GitResolvedConfigValue, RunHostCommand, ListSubmodules,
  ListOverlayPaths, CountUntrackedPaths, InspectRepo,
  ImportRepoToGuest, PrepareRepoCopy) is now a method on *Inspector.
  NewInspector() wraps the real host runner for production;
  WorkspaceService holds one via repoInspector, CLI deps holds one
  too. cli_test.go's submodule-rejection test builds its own
  Inspector with a scripted Runner instead of patching a global.
  Pure helpers (FinalizeScript, ResolveSourcePath, ParsePrepareMode,
  ShellQuote, FormatStepError, GitFileURL, ParseNullSeparatedOutput)
  stay free functions since they don't touch the host.

Sentinel: grep for HostCommandOutputFunc, lookupExecutableFunc,
vmDNSAddrFunc, vsockHostDevicePath is now empty across internal/.
make lint test green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 12:07:14 -03:00
2685bc73f8
doctor: open the state DB read-only so inspection never mutates it
`banger doctor` used to call store.Open, which unconditionally runs
migrations on the way up. Diagnostics mutating persistent state is a
surprise — particularly now that migration 2 drops a column, so a
plain `doctor` invocation against an old DB would silently schema-
evolve it.

Add store.OpenReadOnly: separate DSN builder with mode=ro and a
minimal pragma set (foreign_keys, busy_timeout — no journal_mode=WAL,
no wal_autocheckpoint), skips runMigrations, and pings on open so a
missing DB fails up front rather than at first query. doctor.go now
uses OpenReadOnly; the existing storeErr fallback path surfaces any
failure as a failing check, unchanged.

Tests pin two invariants:
- OpenReadOnly against a DB whose migration 2 marker was removed and
  packages_path re-added must leave both alone (i.e. no drift is
  applied behind the user's back).
- Any write attempted through the read-only handle is rejected at
  the driver layer (belt-and-braces for future refactors).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 11:05:23 -03:00
129475be20
config + store: remove dead knobs and stale schema
Three drift items surfaced in review, each dead on arrival and each
worth trusting a little more at v0.1.0.

config: drop MetricsPollInterval. The field was parsed from TOML
(metrics_poll_interval), stored on DaemonConfig, and ignored by every
consumer — only StatsPollInterval drives the background poll loop.
Users setting it in config.toml saw zero effect. Removed from the TOML
surface, the model constant, and the config test.

daemon: delete ensureDefaultImage. No callers, body was `_ = ctx;
return nil`. Dead since whatever flow used to call it got removed.

store: drop packages_path from the images table. The column was
carried by the baseline migration but never referenced by UpsertImage
(no INSERT / UPDATE mention) or any Go model field — a ghost from a
build pipeline that no longer exists. Added migration id=2
(drop_dead_image_columns) with an idempotent dropColumnIfExists
helper: fresh installs run baseline (creates the column) + 2 (drops
it); legacy DBs where the column was never added get a no-op. Updated
the direct-INSERT SQL in TestGetImageRejectsMalformedTimestamp to
drop the column reference, and added a migration test covering both
install paths (fresh + legacy).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 10:54:01 -03:00
2a7f55f028
vm run: ship tracked files only by default; add --include-untracked + --dry-run
Workspace-mode vm run and vm workspace prepare used to copy both
tracked AND untracked non-ignored files into the guest. That silently
catches local .env files, scratch notes, credentials, and any other
working-tree state a developer hasn't explicitly gitignored — a real
data-exposure footgun given the golden image ships Docker and the
usual dev tooling.

Flip the default to tracked-only. Users who actually want the fuller
set opt in with --include-untracked (documented in both commands'
help). Gitignored files are still always excluded regardless of the
flag.

Add --dry-run to both vm run and vm workspace prepare. Dry-run
inspects the repo CLI-side (no VM created, no daemon RPC needed since
the daemon is always local and the inspection is a pure git read),
prints the exact file list + mode, and exits. A byte-level preview of
what would land in the guest.

When running real (non-dry) and untracked files exist in the repo but
are being skipped under the new default, print a one-line notice
pointing to --include-untracked so users aren't surprised when the
guest is missing something they expected.

Signature changes:
- ListOverlayPaths takes an includeUntracked bool (tracked always;
  untracked gated by flag).
- InspectRepo takes the same flag and passes it through.
- VMWorkspacePrepareParams gains IncludeUntracked.
- WorkspaceService.workspaceInspectRepo seam signature widened to
  match (4 callers in tests updated).

New workspace package tests cover both modes and verify that
gitignored files never leak regardless of the flag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 19:53:17 -03:00
25a1466947
supply chain: verify signatures and pins across image + kernel builds
Three independent hardenings, addressing a review finding that the
kernel and image build pipelines were relying on HTTPS alone for
artifact integrity.

scripts/make-generic-kernel.sh
- Fetch the detached PGP signature (linux-<ver>.tar.sign) alongside
  the tarball and verify it with gpg before extraction. An isolated
  $GNUPGHOME under the tempdir keeps the kernel signers out of the
  invoking user's keyring.
- Import the three kernel.org release signing keys (Greg KH / Linus /
  Sasha Levin) from keyserver.ubuntu.com, falling back to
  keys.openpgp.org. Ubuntu comes first because keys.openpgp.org strips
  unverified UIDs on upload, leaving gpg with UID-less keys it
  refuses to trust.
- Require VALIDSIG (cryptographic proof) rather than GOODSIG
  (printed even for expired keys) before proceeding. Verified
  end-to-end against a clean tarball (accepts) and a byte-flipped
  tampered copy (rejects with BADSIG).
- gpg + gpgv + xz added to the required-tools check.

images/golden/Dockerfile
- Pin Docker's apt signing key by fingerprint. After downloading
  /etc/apt/keyrings/docker.asc we gpg --show-keys --with-colons it,
  extract the fpr, and compare against the expected
  9DC858229FC7DD38854AE2D88D81803C0EBFCD88. A tampered or swapped key
  aborts the build before any apt repo metadata is fetched.
- Replace `curl https://mise.run | sh` with a pinned GitHub release
  binary (mise v2026.4.18, linux-x64) verified against its published
  sha256. Refuses to build on unknown architectures rather than
  silently installing a binary we have no hash for.
- Add gnupg to the ESSENTIAL apt-get install so the fingerprint check
  has gpg available.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 19:38:13 -03:00
011b59a72f
daemon split (8/8): document capability decoupling + wireServices
Update ARCHITECTURE.md's Composition section to reflect the finished
split: capabilities carry explicit service-pointer fields, nothing
reaches *Daemon at dispatch time, and wireServices(d) is the single
entry point that builds services + capabilities eagerly (from Open
in production, from tests after constructing &Daemon{...} literals).

Removes the paragraph admitting capability→*Daemon coupling and the
lazy-init getters justification, neither of which applies anymore.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 15:59:39 -03:00
9c73155e17
daemon split (7/n): narrow capability interfaces, wire deps at construction
Stop passing *Daemon into capability hooks. Each capability
implementation is now a struct with explicit service-pointer fields
populated at wireServices time; the six dynamic-dispatch interfaces
(AddStartPreflight, PrepareHost, PostStart, Cleanup, ApplyConfigChange,
AddDoctorChecks) no longer have a *Daemon parameter. Capability
methods reach their dependencies through struct fields, not through
d.vm / d.ws / d.net.

- workDiskCapability carries {vm, ws, store, defaultImageName}
- dnsCapability carries {net}
- natCapability carries {vm, net, logger}

Daemon.defaultCapabilities() builds the production list from the
already-constructed services and is called from wireServices so
d.vmCaps is populated eagerly. Tests that preinstall d.vmCaps with
stubs still work — wireServices only overwrites an empty slice.

registeredCapabilities() is gone (every dispatch loop now reads
d.vmCaps directly). capabilities_test.go's testCapability fake drops
*Daemon from its method set to match the new interfaces.

This finishes the daemon service split: capability implementations
no longer reach through the composition root, there's no path back
to *Daemon from any service or capability, and test construction
goes through one explicit wireServices call instead of lazy getters.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 15:59:09 -03:00
16702bd5e1
daemon split (6/n): extract wireServices + drop lazy service getters
Factor the service + capability wiring out of Daemon.Open() into
wireServices(d), an idempotent helper that constructs HostNetwork,
ImageService, WorkspaceService, and VMService from whatever
infrastructure (runner, store, config, layout, logger, closing) is
already set on d. Open() calls it once after filling the composition
root; tests that build &Daemon{...} literals call it to get a working
service graph, preinstalling stubs on the fields they want to fake.

Drops the four lazy-init getters on *Daemon — d.hostNet(),
d.imageSvc(), d.workspaceSvc(), d.vmSvc() — whose sole purpose was
keeping test literals working. Every production call site now reads
d.net / d.img / d.ws / d.vm directly; the services are guaranteed
non-nil once Open returns. No behavior change.

Mechanical: all existing `d.xxxSvc()` calls (production + tests)
rewritten to field access; each `d := &Daemon{...}` in tests gets a
trailing wireServices(d) so the literal + wiring are side-by-side.
Tests that override a pre-built service (e.g. d.img = &ImageService{
bundleFetch: stub}) now set the override before wireServices so the
replacement propagates into VMService's peer pointer.

Also nil-guards HostNetwork.stopVMDNS and d.store in Close() so
partially-initialised daemons (pre-reconcile open failure) still
tear down cleanly — same contract the old lazy getters provided.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 15:55:28 -03:00
0cfd8a5451
daemon split (5/5): document the service composition
Phase 5 of the daemon god-struct refactor. Code motion landed in
phases 1-4; this commit retells the architecture so the docs match
the structure.

ARCHITECTURE.md loses the "deferred v0.2 project" hedge about
splitting services. The Composition section now describes the four
services (HostNetwork, ImageService, WorkspaceService, VMService)
that own behaviour, the consumer-defined seam pattern for
cross-service calls, and the lazy-init getter pattern that keeps
existing test literals compiling.

doc.go inventories which methods live on which service, and the
lock-ordering section gains the service prefixes (e.g.
VMService.vmLocks instead of bare vmLocks) so readers don't have to
guess which type owns which mutex.

No code changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 20:58:53 -03:00
466a7c30c4
daemon split (4/5): extract *VMService service
Phase 4 of the daemon god-struct refactor. VM lifecycle, create-op
registry, handle cache, disk provisioning, stats polling, ports
query, and the per-VM lock set all move off *Daemon onto *VMService.

Daemon keeps thin forwarders only for FindVM / TouchVM (dispatch
surface) and is otherwise out of VM lifecycle. Lazy-init via
d.vmSvc() mirrors the earlier services so test literals like
\`&Daemon{store: db, runner: r}\` still get a functional service
without spelling one out.

Three small cleanups along the way:

  * preflight helpers (validateStartPrereqs / addBaseStartPrereqs
    / addBaseStartCommandPrereqs / validateWorkDiskResizePrereqs)
    move with the VM methods that call them.
  * cleanupRuntime / rebuildDNS move to *VMService, with
    HostNetwork primitives (findFirecrackerPID, cleanupDMSnapshot,
    killVMProcess, releaseTap, waitForExit, sendCtrlAltDel)
    reached through s.net instead of the hostNet() facade.
  * vsockAgentBinary becomes a package-level function so both
    *Daemon (doctor) and *VMService (preflight) call one entry
    point instead of each owning a forwarder method.

WorkspaceService's peer deps switch from eager method values to
closures — vmSvc() constructs VMService with WorkspaceService as a
peer, so resolving d.vmSvc().FindVM at construction time recursed
through workspaceSvc() → vmSvc(). Closures defer the lookup to call
time.

Pure code motion: build + unit tests green, lint clean. No RPC
surface or lock-ordering changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 20:57:05 -03:00
c0d456e734
daemon split (3/5): extract *WorkspaceService service
Third phase of splitting the daemon god-struct. WorkspaceService now
owns workspace.prepare / workspace.export plus the ssh-key +
git-identity + arbitrary-file sync that runs as part of VM start's
prepare_work_disk capability hook. workspaceLocks (the per-VM tar
serialisation set) lives on the service.

workspace.go and vm_authsync.go flipped receivers from *Daemon to
*WorkspaceService. The workspaceInspectRepo / workspaceImport test
seams moved onto the service as fields.

Peer-service dependencies go through narrow function-typed fields:
vmResolver, aliveChecker, waitGuestSSH, dialGuest, imageResolver,
imageWorkSeed, withVMLockByRef, beginOperation. WorkspaceService
never touches VMService / HostNetwork / ImageService directly —
only the exact operations the Daemon hands it at construction.

Daemon lazy-init helper workspaceSvc() mirrors the Phase 1/2
pattern. Test literals still write `&Daemon{store: db, runner: r}`
and get a wired workspace service for free. Tests that override the
inspect/import seams (workspace_test.go, ~4 sites) assign them on
d.workspaceSvc() instead of on the daemon literal.

Dispatch in daemon.go: vm.workspace.prepare and vm.workspace.export
now forward one-liners to d.workspaceSvc().

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 20:42:31 -03:00
d7614a3b2b
daemon split (2/5): extract *ImageService service
Second phase of splitting the daemon god-struct. ImageService now owns
all image + kernel registry operations: register/promote/delete/pull
for images (bundle + OCI paths), the six kernel commands, and the
shared SSH-key/work-seed injection helpers. imageOpsMu (the
publication-window lock) lives on the service; so do the three OCI
pull test seams pullAndFlatten / finalizePulledRootfs / bundleFetch.
The four files images.go, images_pull.go, image_seed.go, kernels.go
flipped their receivers from *Daemon to *ImageService.

FindImage moved with the service. Daemon keeps a thin FindImage
forwarder so callers reading the dispatch code see the obvious
facade and tests that pre-date the split still compile.

flattenNestedWorkHome — called from image_seed.go, vm_authsync.go,
and vm_disk.go across future service boundaries — became a
package-level helper taking a CommandRunner explicitly. Daemon keeps
a deprecated forwarder for now; the other services will use the
package form.

Lazy-init helper imageSvc() on Daemon mirrors hostNet() from
Phase 1, so test literals like &Daemon{store: db, runner: r, ...}
that don't spell out an ImageService still get a working one.
Tests that override the image test seams (autopull_test,
concurrency_test, images_pull_test, images_pull_bundle_test) now
assign d.img = &ImageService{...seams...}; the two-statement pattern
matches what Phase 1 established for HostNetwork.

Dispatch in daemon.go is cleaner now: every image/kernel RPC handler
is a single-liner forwarding to d.imageSvc().*. Phase 5 will do the
same for VM lifecycle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 20:30:32 -03:00
362009d747
daemon split (1/5): extract *HostNetwork service
First phase of splitting the daemon god-struct into focused services
with explicit ownership.

HostNetwork now owns everything host-networking: the TAP interface
pool (initializeTapPool / ensureTapPool / acquireTap / releaseTap /
createTap), bridge + socket dir setup, firecracker process primitives
(find/resolve/kill/wait/ensureSocketAccess/sendCtrlAltDel), DM
snapshot lifecycle, NAT rule enforcement, guest DNS server lifecycle
+ routing setup, and the vsock-agent readiness probe. That's 7 files
whose receivers flipped from *Daemon to *HostNetwork, plus a new
host_network.go that declares the struct, its hostNetworkDeps, and
the factored firecracker + DNS helpers that used to live in vm.go.

Daemon gives up the tapPool and vmDNS fields entirely; they're now
HostNetwork's business. Construction goes through newHostNetwork in
Daemon.Open with an explicit dependency bag (runner, logger, config,
layout, closing). A lazy-init hostNet() helper on Daemon supports
test literals that don't wire net explicitly — production always
populates it eagerly.

Signature tightenings where the old receiver reached into VM-service
state:
 - ensureNAT(ctx, vm, enable) → ensureNAT(ctx, guestIP, tap, enable).
   Callers resolve tap from the handle cache themselves.
 - initializeTapPool(ctx) → initializeTapPool(usedTaps []string).
   Daemon.Open enumerates VMs, collects taps from handles, hands the
   slice in.

rebuildDNS stays on *Daemon as the orchestrator — it filters by
vm-alive (a VMService concern handles will move to in phase 4) then
calls HostNetwork.replaceDNS with the already-filtered map.

Capability hooks continue to take *Daemon; they now use it as a
facade to reach services (d.net.ensureNAT, d.hostNet().*). Planned
CapabilityHost interface extraction is orthogonal, left for later.

Tests: dns_routing_test.go + fastpath_test.go + nat_test.go +
snapshot_test.go + open_close_test.go were touched to construct
HostNetwork literals where they exercise its methods directly, or
route through d.hostNet() where they exercise the Daemon entry
points.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 20:11:46 -03:00
eba9a553bf
daemon: use exact-name lookup for VM-create uniqueness
reserveVM's duplicate-name guard routed through Daemon.FindVM, which
falls back to prefix-matching on both ids and names when no exact
match is found. That turns the uniqueness check into a correctness
bug: a brand-new VM name can be rejected because it happens to
prefix an existing VM's id, or an existing VM's name. So `vm create
--name beta` fails when `beta-sandbox` already exists.

Swap in a dedicated store.GetVMByName that does a literal `WHERE
name = ?` lookup, and use it from reserveVM. FindVM keeps its
prefix-matching behaviour for user-facing lookup paths (`vm ssh
<partial>`, `vm stop <partial>`) where "did you mean" semantics
are the feature.

Tests:
 - TestReserveVMAllowsNameThatPrefixesExistingVM — seeds a VM whose
   id + name both start with "longname", then reserves two new VMs
   named "longname" and "longname-sandbox". Both must succeed.
   Under the old FindVM-based check, both would fail.
 - TestReserveVMRejectsExactDuplicateName — actual collisions are
   still rejected after the swap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 14:00:33 -03:00
108f7a0600
ssh-config: make the ssh <name>.vm shortcut opt-in
Before this change, every daemon.Open() wrote a Host *.vm stanza into
~/.ssh/config in a marker-fenced block. That's a real footgun for users
who manage their SSH config declaratively (chezmoi, dotfiles, NixOS):
banger was mutating host state outside its own directory on every
daemon start, easy to miss and hard to audit.

New contract: the daemon only ever writes its own ssh_config file at
~/.config/banger/ssh_config. ~/.ssh/config is untouched unless the user
opts in. `banger vm ssh <name>` still works out of the box — the
shortcut only matters for plain `ssh sandbox.vm` from any terminal.

The opt-in surface is `banger ssh-config`:

  banger ssh-config              # prints path + include-line +
                                 # install/uninstall hints
  banger ssh-config --install    # adds `Include <bangerConfig>` to
                                 # ~/.ssh/config inside a marker-fenced
                                 # block; idempotent; migrates any
                                 # legacy inline Host *.vm block from
                                 # pre-opt-in builds
  banger ssh-config --uninstall  # removes the new Include block AND
                                 # any legacy inline block

Doctor gains a gentle warn-level note when banger's ssh_config exists
but the user hasn't wired it in — not a fail, since the shortcut is
convenience and `banger vm ssh` covers the essential case.

Tests cover: daemon writes banger file and does NOT touch ~/.ssh/config,
Install adds the block, Install is idempotent, Install migrates the
legacy inline block cleanly (removing it, preserving unrelated
entries, adding the new Include block), Uninstall removes both marker
variants, Uninstall is a no-op when ~/.ssh/config is absent, and
UserSSHIncludeInstalled detects both marker shapes.

README reframes the feature as optional convenience.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:57:26 -03:00
99d0811097
daemon: shrink createVMMu + imageOpsMu to reservation/publication windows
Before: createVMMu was held across the whole of CreateVM — including
image resolution (which could fire a full auto-pull) and startVMLocked
(boot of multiple seconds). imageOpsMu was held across the whole of
PullImage/RegisterImage/PromoteImage/DeleteImage, so any slow OCI pull,
bundle download, or file copy blocked every other image mutation and
every other VM create that needed to auto-pull. The async create API
bought nothing if all creates serialised on the same mutex.

CreateVM is now three phases:

 1. Validate + resolve image (possibly auto-pulling). No global lock.
 2. reserveVM: take createVMMu only long enough to re-check the name
    is free, allocate the next guest IP, and UpsertVM the "created"
    row. Milliseconds.
 3. startVMLocked: run the full boot flow under the per-VM lock only.

Parallel creates of different VMs now overlap on image resolution +
boot; they contend only across the reservation claim.

For the image surface a new publishImage helper isolates the commit
atom (recheck name free, atomic rename stagingDir→finalDir, UpsertImage)
under imageOpsMu. pullFromBundle + pullFromOCI do their network fetch
+ ext4 build + ownership fixup + agent injection outside the lock;
Register moves validation + kernel resolution outside; Promote moves
file copy + SSH-key seeding outside; Delete keeps a brief lock over
the lookup + reference check + store delete and does file cleanup
unlocked.

Two concurrency tests assert the new behaviour:
 - TestPullImageDoesNotSerialiseOnDifferentNames fails the old code
   (second pull blocks on imageOpsMu and never reaches the body).
 - TestPullImageRejectsNameClashAtPublish confirms the publish-window
   recheck is what enforces name uniqueness now that the body runs
   unlocked — exactly one winner.

ARCHITECTURE.md updated to describe the new scope explicitly instead
of calling the locks "narrow".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:44:22 -03:00
afe91e805a
drop unused bench-create script + Makefile target
The script carried a python3 dep for one json.dumps on a VM name
that's always alphanumeric-plus-dashes anyway, it was never wired
into CI or docs, and `time banger vm create` covers the same need
ad hoc when anyone wants to measure create latency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:33:09 -03:00
58464ac28c
docs + doctor: be honest about amd64-only support
The README sold the product as "Linux with /dev/kvm"; the deeper docs
admit that the Makefile pins companion builds to GOARCH=amd64, the
kernel catalog ships only x86_64 entries, and OCI import pulls
linux/amd64 layers. arm64 users who show up through the README only
discover that after install fails in non-obvious ways.

Two surface-level fixes:

- README requirements list leads with "x86_64 / amd64 Linux — arm64 is
  not supported today", with a short note on the three places that
  assumption lives so users understand it's not a last-mile gap.
- `banger doctor` now runs an architecture check that passes on amd64
  and FAILS (not warns) on anything else, referencing the three
  downstream assumptions. Hard-fail rather than warn so a user on an
  arm64 machine doesn't waste time chasing unrelated preflight items.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:03:50 -03:00
e69810610a
daemon: correct ARCHITECTURE doc to match actual package shape + lock scope
Two promises the doc was making that the code doesn't keep:

1. "Helpers moved out so the package stays focused on orchestration."
   The package still has ~29 files and ~130 func (d *Daemon) methods
   wiring VM lifecycle, image management, host networking, background
   reconciliation, and JSON-RPC dispatch. Calling it "just orchestration"
   sets readers up for surprise. Rewrite the subpackages preamble to
   say so, and flag the service split as a post-v0.1.0 project.

2. "vmLocks[id] is held only across short synchronous state validation
   and DB mutations." That's what workspace.prepare does; regular
   lifecycle ops (start/stop/delete/set) go through withVMLockByRef
   and hold the lock across the whole callback body, which for `start`
   means preflight + bridge + firecracker spawn + post-boot wiring.
   Rewrite the vmLocks bullet and the lock-ordering section to say
   that explicitly, so readers don't build "surely my long flow under
   the lock can't be what the doc means" reasoning on top of a false
   premise.

Doc-only change. Code behaviour is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:02:36 -03:00
34dd7644d8
store: introduce versioned migrations with ordered runner + atomic apply
The old migrate() helper only knew how to re-run a fixed slab of CREATE
TABLE IF NOT EXISTS plus per-column ensureColumnExists calls. That worked
while every schema change was a benign additive column; it falls apart
as soon as we need a data backfill, an index, a rename, or anything that
has to happen exactly once in a known order.

Replaces it with a schema_migrations table + ordered []migration slice.
Each migration has a unique id, a human-readable name, and a func(*Tx)
body; the runner opens a transaction per migration so DDL and any data
changes either both land and get recorded or both roll back together,
leaving the DB in a state where retrying on next Open() reapplies from
the same point.

Migration 1 ("baseline") collapses the current schema into one entry:
fresh databases apply it in one shot; existing dev databases see
idempotent `CREATE TABLE IF NOT EXISTS` + `ALTER TABLE … ADD COLUMN`
statements that succeed as no-ops, and the only net effect is the
schema_migrations row that brings them into the versioned system.

Tests cover fresh apply, idempotent re-open, skipping already-applied
ids, rollback on body error (the transient table the migration created
must not survive), and duplicate-id rejection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 12:59:42 -03:00
b930c51990
runtime sockets: close the local-user race window around control-plane creation
Previously the daemon socket, per-VM firecracker API socket, and vsock
socket were transiently world-exposed on hosts without XDG_RUNTIME_DIR:
the runtime directory landed in /tmp at 0755, Firecracker ran with
umask 000 (mode 0666 sockets), and only a follow-up chown/chmod in
EnsureSocketAccess tightened them. A local attacker could race into
bangerd.sock or the firecracker API socket during that window.

Three changes:

- internal/paths/paths.go: RuntimeDir is now created (and re-chmod'd if
  stale) at 0700 unconditionally. When XDG_RUNTIME_DIR is unset and we
  fall back to /tmp/banger-runtime-<uid>, Ensure() now verifies the
  parent dir is owned by the current uid and 0700 mode — refusing to
  place sockets inside a directory someone else created. Symlink swaps
  rejected via Lstat.

- internal/firecracker/client.go: launch firecracker with umask 077
  instead of umask 000 so the API socket is mode 0600 from birth. The
  chown in fcproc.EnsureSocketAccess still transfers ownership from
  root to the invoking user afterwards.

- internal/daemon/fcproc/fcproc.go: EnsureSocketDir now creates (and
  re-chmod's) the runtime socket directory at 0700.

Tests cover the tightening path — an existing 0755 RuntimeDir is
re-chmod'd on Ensure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 12:53:47 -03:00
2b6437d1b4
remove vm session feature
Cuts the daemon-managed guest-session machinery (start/list/show/
logs/stop/kill/attach/send). The feature shipped aimed at agent-
orchestration workflows (programmatic stdin piping into a long-lived
guest process) that aren't driving any concrete user today, and the
~2.3K LOC of daemon surface area — attach bridge, FIFO keepalive,
controller registry, sessionstream framing, SQLite persistence — was
locking in an API we'd have to keep through v0.1.0.

Anything session-flavoured that people actually need today can be
done with `vm ssh + tmux` or `vm run -- cmd`.

Deleted:
- internal/cli/commands_vm_session.go
- internal/daemon/{guest_sessions,session_lifecycle,session_attach,session_stream,session_controller}.go
- internal/daemon/session/ (guest-session helpers package)
- internal/sessionstream/ (framing package)
- internal/daemon/guest_sessions_test.go
- internal/store/guest_session_test.go
- GuestSession* types from internal/{api,model}
- Store UpsertGuestSession/GetGuestSession/ListGuestSessionsByVM/DeleteGuestSession + scanner helpers
- guest.session.* RPC dispatch entries
- 5 CLI session tests, 2 completion tests, 2 printer tests

Extracted:
- ShellQuote + FormatStepError lifted to internal/daemon/workspace/util.go
  (only non-session consumer); workspace package now self-contained
- internal/daemon/guest_ssh.go keeps guestSSHClient + dialGuest +
  waitForGuestSSH — still used by workspace prepare/export
- internal/daemon/fake_firecracker_test.go preserves the test helper
  that used to live in guest_sessions_test.go

Store schema: CREATE TABLE guest_sessions and its column migrations
removed. Existing dev DBs keep an orphan table (harmless, pre-v0.1.0).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 12:47:58 -03:00
c42fcbe012
cli + daemon: move test seams off package globals onto injected structs
CLI: introduce internal/cli.deps which owns every RPC/SSH/host-command
seam the tree used to reach through mutable package vars. Command
builders, orchestrators, and the completion helpers become methods on
*deps. Tests construct their own deps per case, so fakes no longer leak
across cases and tests are free to run in parallel.

Daemon: move workspaceInspectRepoFunc + workspaceImportFunc onto the
Daemon struct (workspaceInspectRepo / workspaceImport), mirroring the
existing guestWaitForSSH / guestDial pattern. Workspace-prepare tests
drop t.Parallel() guards now that they no longer mutate process-wide
state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 19:03:55 -03:00
d38f580e00
doctor: surface state store open failure as failing check
Previously store.Open errors were silently swallowed, so `banger
doctor` could report green while the default-image check (and any
other store-dependent diagnostic) was silently skipped because
d.store was nil.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:35:27 -03:00
3f6ecb4376
cli: split banger.go god file into focused files
Pure code motion — banger.go 3508→240 LOC, same-package
decomposition keeps all identifiers visible without export changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:34:32 -03:00
3a5f4cd40d
cli: delete vm run's dead import path + duplicated git inspection
The CLI carried a full second copy of the workspace import
implementation that `vm run` never actually used:

  - importVMRunRepoToGuest (no callers — the live flow calls the
    daemon's PrepareVMWorkspace RPC instead)
  - prepareVMRunRepoCopy, vmRunCheckoutCommit, vmRunCheckoutScript,
    gitFileURL, runHostCommand (all reachable only from the dead
    importVMRunRepoToGuest)

Plus a duplicated repo-inspection surface that shadowed the
daemon's:

  - inspectVMRunRepo ran every git query the daemon re-ran during
    workspace.prepare (HEAD, branch, identity, origin, overlay list)
  - gitOutput / gitTrimmedOutput / gitResolvedConfigValue /
    parseNullSeparatedOutput / listSubmodules / listOverlayPaths /
    resolveVMRunSourcePath — all identical to the exported
    workspace.* versions
  - vmRunRepoSpec — same fields as workspace.RepoSpec

Replaced with a single minimal preflight:

  func vmRunPreflightRepo(ctx, rawPath) (absPath, err error)

The preflight only checks what the user can fix locally before
banger creates a VM (path exists, sits in a non-bare git repo, no
submodules). The daemon's workspace.prepare RPC does the full
inspection — and returns RepoRoot + RepoName in the response, which
the CLI now threads into the tooling harness instead of computing
them a second time.

Signature changes:

  runVMRun(ctx, ..., *vmRunRepo, ...)   // was: *vmRunRepoSpec
  startVMRunToolingHarness(ctx, client, repoRoot, repoName, progress)
                                        // was: (ctx, client, spec, progress)
  vmRunToolingHarnessScript(plan)       // was: (spec, plan)
  vmRunToolingHarnessLaunchScript(repoName)  // was: (spec)

Tests: the CLI-side git-inspection tests are replaced by a single
TestVMRunPreflightRejectsSubmodules that exercises the preflight.
Everything else (tooling harness script, progress renderer, SSH args,
runVMRun flows) keeps working. The shallow-copy / checkout-script
tests are gone — that code now lives only in
internal/daemon/workspace and is tested there.

Also fixed a latent bug the refactor exposed: vm run's --from flag
defaults to "HEAD", which the daemon reads as "from without branch"
and rejects. CLI now scrubs fromRef when branchName is empty.

Live verified: `banger vm run --name X . -- cmd` boots, workspace
materialises at /root/repo with matching HEAD, exit code propagates.
2026-04-19 17:01:26 -03:00
ae14b9499d
ssh: trust-on-first-use host key pinning everywhere
Guest host-key verification was off in all three SSH paths:

  * Go SSH (internal/guest/ssh.go) used ssh.InsecureIgnoreHostKey
  * `banger vm ssh` passed StrictHostKeyChecking=no
    + UserKnownHostsFile=/dev/null
  * `~/.ssh/config` Host *.vm shipped the same posture into the
    user's global config

Now each path verifies against a banger-owned known_hosts file at
`~/.local/state/banger/ssh/known_hosts` with TOFU semantics:

  * First dial to a VM pins the key.
  * Subsequent dials require an exact match. A mismatch fails with
    an explicit "possible MITM" error.
  * `vm delete` removes the entries so a future VM reusing the IP
    or name re-pins cleanly.
  * The user's `~/.ssh/known_hosts` is untouched.

Changes:

  internal/guest/known_hosts.go (new) — OpenSSH-compatible parser,
    TOFUHostKeyCallback, RemoveKnownHosts. Process-wide mutex
    around the file.
  internal/guest/ssh.go — Dial and WaitForSSH grew a knownHostsPath
    parameter threaded through the callback. Empty path keeps the
    insecure callback (tests + throwaway tools only; documented).
  internal/daemon/{guest_sessions,session_attach,session_lifecycle,
    session_stream}.go — call sites pass d.layout.KnownHostsPath.
  internal/daemon/ssh_client_config.go — the ~/.ssh/config Host *.vm
    block now points at banger's known_hosts and uses
    StrictHostKeyChecking=accept-new. Missing path → fail closed.
  internal/daemon/vm_lifecycle.go — deleteVMLocked drops known_hosts
    entries for the VM's IP and DNS name via removeVMKnownHosts.
  internal/cli/banger.go — sshCommandArgs swaps StrictHostKeyChecking
    no + /dev/null for banger's file + accept-new. Path resolution
    failure falls through to StrictHostKeyChecking=yes.
  internal/paths/paths.go — Layout gains SSHDir + KnownHostsPath;
    Ensure creates SSHDir at 0700.

Tests (internal/guest/known_hosts_test.go): pin on first use, accept
matching key on second dial, reject mismatch, empty path skips
checking, RemoveKnownHosts drops the entry, re-pin works after
remove. Existing daemon + cli tests updated to assert the new
posture and regression-guard against the old flags.

Live verified: vm run writes the pin to banger's known_hosts at 0600
inside a 0700 dir; banger vm ssh + ssh root@<vm>.vm both succeed
using the pin; vm delete clears it.
2026-04-19 16:46:03 -03:00
a59958d4f5
daemon: roll back host state on any Open() failure
Open() touched several pieces of host state before hitting the step
that returned the error:

  * SQLite handle (store.Open)
  * managed SSH client config block (ensureVMSSHClientConfig)
  * vm-DNS UDP listener goroutine (startVMDNS)
  * systemd-resolved per-interface routing (ensureVMDNSResolverRouting)

The only deferred cleanup guarded stopVMDNS. A reconcile() or
initializeTapPool() failure therefore left the listener running, the
resolver wiring in place, and the SQLite handle open. A subsequent
startup attempt ran into "port 42069 already in use" or silently
published stale state.

Fix: once `d` exists, defer `d.Close()` on `err != nil`. Close is
idempotent (sync.Once) and every teardown step (listener close, DNS
listener close, resolver revert, session registry close, store close)
is nil-guarded, so calling it on a daemon that never got past the
first startup step is safe.

Tests (internal/daemon/open_close_test.go):

  - TestCloseOnPartiallyInitialisedDaemon: Close survives a daemon
    with only store + closing channel, and with a vmDNS listener but
    nothing else. Catches regressions where a teardown step forgets
    to nil-check.
  - TestCloseIdempotentUnderConcurrency: 5 goroutines racing on
    Close() never panic (sync.Once + close(d.closing) survive).
  - TestOpenFailureRunsCloseCleanup: structural check that the
    `defer cleanup() if err != nil` pattern actually fires.

Live: `banger daemon stop` cleanly, `banger vm ls` restarts daemon
without a residual listener on port 42069.
2026-04-19 16:36:29 -03:00
d1b9a8c102
remove experimental web UI
The web UI shipped as "experimental" and was never finished — no nav
off the dashboard, no live updates, no settled design, never a
supported surface. It was opt-in by default already; leaving the code
in the tree for v0.1.0 only invited "does this work?" questions and
kept HostSummary/BangerSummary/SudoStatus types on the public RPC
surface that nothing else uses.

Removed:

  internal/webui/                         (all Go + templates + assets)
  internal/daemon/web.go                  (server start / Layout / Config / ListVMs / ListImages)
  internal/daemon/dashboard.go            (DashboardSummary aggregator)

Simplified:

  internal/api/types.go                   drop WebURL on PingResult, drop
                                          HostSummary / SudoStatus / BangerSummary /
                                          DashboardSummary / DashboardSummaryResult
  internal/model/types.go                 drop DaemonConfig.WebListenAddr
  internal/config/config.go               drop web_listen_addr from fileConfig + Load
  internal/daemon/daemon.go               drop webListener / webServer / webURL fields +
                                          startWebServer() call + ping WebURL population
  internal/cli/banger.go                  `daemon status` output no longer branches on web
  internal/daemon/{doc.go,ARCHITECTURE.md} drop web UI sections
  README.md                               drop web_listen_addr config bullet + security paragraph

Tests updated to reflect the new shape. Coverage 57.3 -> 58.9% (the
webui package was largely untested; its removal lifts the ratio
without moving the numerator). `banger daemon status` output and
--help are web-free. Lint + full suite green.
2026-04-19 14:28:08 -03:00
687fcf0b59
vm state: split transient kernel/process handles off the durable schema
Separates what a VM IS (durable intent + identity + deterministic
derived paths — `VMRuntime`) from what is CURRENTLY TRUE about it
(firecracker PID, tap device, loop devices, dm-snapshot target — new
`VMHandles`). The durable state lives in the SQLite `vms` row; the
transient state lives in an in-memory cache on the daemon plus a
per-VM `handles.json` scratch file inside VMDir, rebuilt at startup
from OS inspection. Nothing kernel-level rides the SQLite schema
anymore.

Why:

  Persisting ephemeral process handles to SQLite forced reconcile to
  treat "running with a stale PID" as a first-class case and mix it
  with real state transitions. The schema described what we last
  observed, not what the VM is. Every time the observation model
  shifted (tap pool, DM naming, pgrep fallback) the reconcile logic
  grew a new branch. Splitting lets each layer own what it's good at:
  durable records describe intent, in-memory cache + scratch file
  describe momentary reality.

Shape:

  - `model.VMHandles` = PID, TapDevice, BaseLoop, COWLoop, DMName,
    DMDev. Never in SQLite.
  - `VMRuntime` keeps: State, GuestIP, APISockPath, VSockPath,
    VSockCID, LogPath, MetricsPath, DNSName, VMDir, SystemOverlay,
    WorkDiskPath, LastError. All durable or deterministic.
  - `handleCache` on `*Daemon` — mutex-guarded map + scratch-file
    plumbing (`writeHandlesFile` / `readHandlesFile` /
    `rediscoverHandles`). See `internal/daemon/vm_handles.go`.
  - `d.vmAlive(vm)` replaces the 20+ inline
    `vm.State==Running && ProcessRunning(vm.Runtime.PID, apiSock)`
    spreads. Single source of truth for liveness.
  - Startup reconcile: per running VM, load the scratch file, pgrep
    the api sock, either keep (cache seeded from scratch) or demote
    to stopped (scratch handles passed to cleanupRuntime first so DM
    / loops / tap actually get torn down).

Verification:

  - `go test ./...` green.
  - Live: `banger vm run --name handles-test -- cat /etc/hostname`
    starts; `handles.json` appears in VMDir with the expected PID,
    tap, loops, DM.
  - `kill -9 $(pgrep bangerd)` while the VM is running, re-invoke the
    CLI, daemon auto-starts, reconcile recognises the VM as alive,
    `banger vm ssh` still connects, `banger vm delete` cleans up.

Tests added:

  - vm_handles_test.go: scratch-file roundtrip, missing/corrupt file
    behaviour, cache concurrency, rediscoverHandles prefers pgrep
    over scratch, returns scratch contents even when process is
    dead (so cleanup can tear down kernel state).
  - vm_test.go: reconcile test rewritten to exercise the new flow
    (write scratch → reconcile reads it → verifies process is gone →
    issues dmsetup/losetup teardown).

ARCHITECTURE.md updated; `handles` added to Daemon field docs.
2026-04-19 14:18:13 -03:00
2e6e64bc04
guest sshd: drop DEBUG3 + StrictModes no; normalise /root perms
Previously /etc/ssh/sshd_config.d/99-banger.conf landed with:

  LogLevel DEBUG3
  PermitRootLogin yes
  PubkeyAuthentication yes
  AuthorizedKeysFile /root/.ssh/authorized_keys
  StrictModes no

DEBUG3 was debug leftover that floods journald in normal use.
StrictModes no was a workaround for /root perm drift on the work
disk — the real fix is to make those perms correct at provisioning
time.

New drop-in:

  PermitRootLogin prohibit-password
  PubkeyAuthentication yes
  PasswordAuthentication no
  KbdInteractiveAuthentication no
  AuthorizedKeysFile /root/.ssh/authorized_keys

prohibit-password blocks password root login even if PasswordAuth
gets flipped on elsewhere; KbdInteractiveAuth no closes the last
interactive fallback; StrictModes is now on (sshd's default).

normaliseHomeDirPerms chown/chmods /root to 0755 root:root at every
work-disk mount (ensureAuthorizedKeyOnWorkDisk,
seedAuthorizedKeyOnExt4Image); the .ssh dir also explicitly
chown'd root:root. Verified end-to-end against a real VM:
`sshd -T` reports strictmodes yes and all five directives match.

Regression test (sshd_config_test.go) pins the allow-list and the
deny-list (DEBUG3, StrictModes no, bare `PermitRootLogin yes`) so
the next accidental reintroduction fails fast.

README's Security section updated to reflect the new posture.
2026-04-19 13:40:40 -03:00
6cd52d12f4
workspace prepare: release VM mutex before guest I/O
Previously withVMLockByRef held the per-VM mutex across InspectRepo,
waitForGuestSSH, dialGuest, ImportRepoToGuest (the tar stream!), and
the readonly chmod. A large repo could block `vm stop` / `vm delete`
/ `vm restart` on the same VM for however long the import took.

Split into two phases:

  1. VM mutex held briefly to validate state (running + PID alive)
     and snapshot the fields needed for SSH (guest IP, api sock).
  2. VM mutex released. Acquire workspaceLocks[id] — a separate
     per-VM mutex scoped to workspace.prepare / workspace.export —
     for the guest I/O phase.

Lifecycle ops (stop/delete/restart/set) only take vmLocks, so they
no longer queue behind a slow import. Two concurrent prepares on the
same VM still serialise via workspaceLocks so tar streams don't
interleave. ExportVMWorkspace also acquires workspaceLocks to avoid
snapshotting a half-streamed import.

Two regression tests (sequential — they swap package-level seams):

  ReleasesVMLockDuringGuestIO: stall the import fake, assert the VM
  mutex is acquirable from another goroutine during the stall.

  SerialisesConcurrentPreparesOnSameVM: 3 concurrent prepares, assert
  Import is only ever invoked 1-at-a-time per VM.

ARCHITECTURE.md documents the split + updated lock ordering.
2026-04-19 13:32:42 -03:00
99de42385f
workspace export: stop mutating the guest repo index
Previously `banger vm workspace export` ran `git add -A` against the
guest's real `.git/index`, so the observation step left staged
changes behind that users never asked for. Reconnecting later (ssh,
another export) surfaced them and looked like phantom work.

Route `git add -A` through a throwaway index file instead:

  tmp_idx=$(mktemp ...)
  trap 'rm -f "$tmp_idx"' EXIT
  git read-tree <ref> --index-output="$tmp_idx"
  GIT_INDEX_FILE="$tmp_idx" git add -A
  GIT_INDEX_FILE="$tmp_idx" git diff --cached <ref> --binary|--name-only

The real .git/index, working tree, and refs stay exactly as the user
left them. Same diff content — commits past <ref>, uncommitted edits,
and untracked files (minus .gitignore) all captured.

Regression test locks the invariant: every export script must route
add -A through GIT_INDEX_FILE and clean the temp index on exit. CLI
help text updated to say "non-mutating".
2026-04-19 13:20:56 -03:00
21b74639d8
vm defaults: host-aware sizing + spec line on spawn + doctor check
Replaces the static model.Default* constants that drove --vcpu / --memory
/ --disk-size with a three-layer resolver:

  1. [vm_defaults] in ~/.config/banger/config.toml (if set)
  2. host-derived heuristics (cpus/4 capped at 4; ram/8 capped at 8 GiB)
  3. baked-in constants (floor)

Visibility:

- Every `vm run` / `vm create` prints a `spec:` line before progress
  begins: `spec: 4 vcpu · 8192 MiB · 8G disk`. Matches the VM that
  actually gets created because the CLI is now the single source of
  truth — it resolves, populates the flag defaults, and forwards the
  explicit values to the daemon.
- `banger doctor` adds a "vm defaults" check showing per-field
  provenance (config|auto|builtin) and the config file path for
  overrides.
- `--help` shows the resolved defaults (e.g. `--vcpu int (default 4)`
  on an 8-core host).

No `banger config init` command, no first-run side effects, no writes
to the user's filesystem behind their back. Users who want explicit
control set the keys; everyone else gets sensible numbers that track
their hardware.
2026-04-19 13:06:51 -03:00
78ff482bfa
release prep: opt-in web UI, make uninstall, fix stale kernel-catalog docs
- WebListenAddr default is now "" (empty). The experimental web UI was
  running on 127.0.0.1:7777 by default, which surprises users who never
  opted in. Users who want it set `web_listen_addr = "127.0.0.1:7777"`
  in config.toml.
- `make uninstall` stops the daemon (if any) and removes the installed
  binaries. Preserves user data on disk but prints the paths so `rm -rf`
  can follow for a full purge. Documented in README next to install.
- docs/kernel-catalog.md: replace the `void-6.12` and `alpine-3.23`
  examples (never published) with `generic-6.12` (the only cataloged
  kernel today). Updates the versioning-convention example too.
2026-04-19 12:43:58 -03:00
221fb03d68
cli QoL: vm prune, list→ls aliases, delete→rm aliases
- `banger vm prune` sweeps every non-running VM (stopped, created,
  error) with an interactive confirmation; -f/--force skips the prompt.
  Partial failures report which VM failed and exit non-zero.
- list commands gain `ls` alias: vm list already had it; added to image
  list, kernel list, and vm session list.
- delete commands gain `rm` alias: vm delete and image delete. kernel
  rm already aliased delete/remove.

Uses new test seams (vmListFunc) plus the existing vmDeleteFunc so
prune unit-tests without touching the daemon socket.
2026-04-19 12:17:46 -03:00
e3eaa0c797
cli: shell completion via cobra + dynamic resource name lookups
Re-enable cobra's default `completion` subcommand (`banger completion
bash|zsh|fish|powershell`). Plus live resource-name suggestions that
hit the running daemon via the same RPC the real commands use:

  vm start/stop/restart/delete/kill/set         → completeVMNames (variadic)
  vm ssh/show/logs/stats/ports/...              → completeVMNameOnlyAtPos0
  vm session list/start                         → completeVMNameOnlyAtPos0
  vm session show/logs/stop/kill/attach/send    → completeSessionNames (vm + session)
  image show/delete/promote                     → completeImageNameOnlyAtPos0
  kernel show/rm                                → completeKernelNameOnlyAtPos0
  vm run/create --image, image pull/register --kernel-ref → flag-value completion

Design notes in internal/cli/completion.go: completers never auto-start
the daemon (ping-check, bail with NoFileComp on miss), so tab-completion
stays a zero-cost probe. Variadic completers exclude already-entered
args to avoid duplicate suggestions.

README: install recipes for bash / zsh / fish.
2026-04-19 12:12:40 -03:00
346eaba673
coverage: medium batch — hostnat runner, store guest-sessions, daemon helpers
Reuses existing fixtures (CommandRunner fakes, SQLite tempfile store,
pure-Go seams). No new infra needed.

  hostnat                  50% -> 98%   (iptables orchestration via fake runner)
  store                    78% -> 91%   (guest_sessions CRUD roundtrip)
  daemon/session           57% -> 95%   (script gen, state parse, snapshot apply)
  daemon/opstate           67% -> 100%  (Registry Insert/Get/Prune)
  daemon (firstNonEmpty)   slight bump

Total 54.0% -> 56.5%.
2026-04-18 18:03:37 -03:00
f8979de58a
coverage: easy-wins batch across cli, system, paths, vmdns, toolingplan
Pure-Go tests for formatters, layout resolution, and validators — no
fixtures, no external processes. Targets previously-zero functions the
triage scan flagged as low-hanging fruit.

  cli          55% -> 65%
  paths        64% -> 91%
  system       65% -> 75%
  vmdns        72% -> 86%
  toolingplan  73% -> 78%

Total 52.6% -> 54.0%.
2026-04-18 17:57:05 -03:00
a3cc296523
guest: tests for fingerprint, shellQuote, tar-entries edge cases, nil receivers
Pure-Go additions (no SSH server fixture): AuthorizedPublicKeyFingerprint,
shellQuote escaping, writeTarEntriesArchive error paths (.., ., missing,
duplicates, blank entries) and symlink handling, StreamSession/Client
nil-receiver safety, WaitForSSH context cancellation.

internal/guest coverage 17.8% -> 47.6%. Total 52.1% -> 52.6%. The
remaining uncovered paths need a real in-process SSH server; skip.
2026-04-18 17:47:24 -03:00
18bf89eae9
coverage: make targets + close zero-cov gaps (namegen, sessionstream)
Adds `make coverage` (per-package + total via -coverpkg=./...),
`make coverage-html`, and `make coverage-total` (CI-friendly). Wires
coverage.out/coverage.html through `make clean` and .gitignore.

Closes the two easy zero-coverage packages: namegen (77.8%) and
sessionstream (93.5%). Total statement coverage 51.7% -> 52.1%.
2026-04-18 17:44:37 -03:00
88425fb857
docs: DNS routing guide; README aimed at common users
Adds docs/dns-routing.md covering how `<vm>.vm` resolution works:
auto-configuration on systemd-resolved hosts (what the daemon
already does), and per-resolver recipes for dnsmasq /
NetworkManager+dnsmasq / /etc/resolv.conf / macOS `/etc/resolver/`
/ WSL. Plus verification via `dig @127.0.0.1 -p 42069` and
troubleshooting for the common failure modes.

README reshape: lead with the three things a common user needs —
quick start, what `vm run` does, where to put hostnames + image +
config — and push the rest to docs. `vm create` / OCI `image pull`
/ `image register` / workspace-and-session primitives are all still
documented, just under docs/advanced.md where they're not in the
first-time reader's way. Web UI and unnecessary implementation
notes dropped; the "further reading" section at the bottom
enumerates the five docs pages so nothing becomes hard to find.

README shrinks from 208 → 158 lines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 17:24:50 -03:00
2584f94828
image/kernel pull: heartbeat dots so slow pulls look alive
Bundle downloads can take 20–60s on a typical connection and the
CLI was going silent between "resolving daemon" and the final image
summary. Users wondered whether banger had wedged.

New `withHeartbeat` helper wraps an RPC call with a dot-every-2s
ticker on stderr. No-op when stderr isn't a terminal, so piped or
scripted invocations stay quiet. Wired into `image pull` and `kernel
pull`, the two commands that actually download bytes.

Example:

    $ banger image pull debian-bookworm
    [image pull] ..........
    id  name             managed  ...

Tests cover the non-TTY short-circuit and error propagation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 17:08:30 -03:00
b5c13e3938
Remove opencode package + vm acp command (dead code)
The `internal/opencode` package and the `opencodeCapability` that
consumed it were hard-wired to wait for opencode on guest port 4096
when an image shipped an initrd. After the prune commits (void /
alpine / customize.sh / image build all removed), nothing banger
produces today carries an initrd, so the capability's wait path was
unreachable: every startup short-circuited to the "direct-boot, skip
opencode" branch.

Same logic for `banger vm acp`: it SSHes to `opencode acp --cwd
<path>`, a binary the golden image no longer ships. Users who run
their own image with opencode can still invoke
`ssh vm -- opencode acp --cwd /root/repo` directly — no banger
scaffolding required.

Removed:
- internal/opencode/ (whole package, 255 LOC incl. tests)
- internal/daemon/opencode.go (opencodeCapability)
- cli `vm acp` command + its helpers (runVMACP, sshACPCommandArgs,
  vmACPRemoteCommand) + their tests
- The opencodeCapability{} entry in registeredCapabilities() plus
  the test that pinned its presence
- `wait_opencode` progress-stage label from the vm-create renderer
- Stale mentions in daemon/doc.go, README, and webui test fixtures

~480 lines gone, 12 added. `banger/internal` is now 25 packages
instead of 26.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 16:54:37 -03:00
0933deaeb1
file_sync: config-driven replacement for hardcoded auth sync
Replace the three hardcoded host→guest credential syncs (opencode,
claude, pi) with a generic `[[file_sync]]` config list. Default is
empty — users opt in to exactly what they want synced, with no
surprise about which tools banger "supports".

```toml
[[file_sync]]
host = "~/.local/share/opencode/auth.json"
guest = "~/.local/share/opencode/auth.json"

[[file_sync]]
host = "~/.aws"          # directories are copied recursively
guest = "~/.aws"

[[file_sync]]
host = "~/bin/my-script"
guest = "~/bin/my-script"
mode = "0755"            # optional; default 0600 for files
```

Semantics:
- Host `~/...` expands against the host user's $HOME. Absolute host
  paths are used as-is.
- Guest must live under `~/` or `/root/...` — banger's work disk is
  mounted at /root in the guest, so that's the syncable namespace.
  Anything outside is rejected at config load.
- Validation at config load: reject empty paths, relative paths,
  `..` traversal, `~user/...`, malformed mode strings. Errors name
  the offending entry index.
- Missing host paths are a soft skip with a warn log (existing
  behaviour). Other errors (read, mkdir, install) abort VM create.
- File entries: `install -o 0 -g 0 -m <mode>` (default 0600).
- Directory entries: walked in Go; each source file is installed
  with its own source permissions preserved. The entry's `mode` is
  ignored for directories.

Removed (all dead after this):
- `ensureOpencodeAuthOnWorkDisk`, `ensureClaudeAuthOnWorkDisk`,
  `ensurePiAuthOnWorkDisk`, the shared `ensureAuthFileOnWorkDisk`,
  their `warn*Skipped` helpers, `resolveHost{Opencode,Claude,Pi}AuthPath`,
  and the work-disk relative-path + default display-path constants.
- The capability hook registering the three syncs now calls the
  generic `runFileSync` once.

Seven tests exercising the old codepath deleted; six new tests cover
the new runFileSync (no-op on empty config, file copy, custom mode,
missing-host-skip, overwrite, recursive directory). Config-layer
test adds happy-path parsing and a case-per-shape table of invalid
entries (empty, relative host, guest outside /root, '..' traversal,
`~user`, bad mode).

README updated: replaces the "Credential sync" section with a
"File sync" section showing the new config shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 16:40:11 -03:00
843314be5e
vm_authsync: s/repairing/provisioning/ in SSH work-disk stage
The "repairing SSH access on work disk" stage detail sounded
remedial, like something had gone wrong. It's just writing banger's
SSH key to /root/.ssh/authorized_keys on the work disk for the first
time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 16:29:18 -03:00
cdd857b288
vm run --rm: suppress the still-running reminder
The deferred --rm delete fires AFTER runSSHSession returns, but
runSSHSession prints "vm X is still running (stop with ...)" before
returning. Net effect: the user sees the reminder, then the VM gets
deleted behind it — misleading.

Thread a skipReminder bool into runSSHSession. `vm run` passes the
same value as removeOnExit; other callers (`vm ssh`) pass false.
Reinforced by a new assertion in the --rm happy-path test that the
reminder string never appears in stderr.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 16:10:29 -03:00
b33f24865c
vm run --rm: ephemeral sandboxes
New `--rm` flag deletes the VM once the ssh session or `-- cmd`
exits, making `vm run` one-shot. Exit code from command mode still
propagates correctly.

Semantics:
- Create fails → no VM to delete, nothing to do.
- SSH-wait timeout → VM intentionally kept alive so `vm logs <name>`
  shows why; the timeout error already pointed users at that. Even
  with --rm, this path skips delete — a wedged sshd is exactly when
  you want post-mortem access.
- Session/command ends (any exit code, any reason) → VM is deleted
  via `vm.delete` RPC. Uses a fresh 10s context so Ctrl-C during the
  session doesn't abort the cleanup.

New vmDeleteFunc seam at the top of banger.go alongside the other
RPC seams. Two tests cover the happy path (session ends cleanly →
delete fires with correct ref) and the skip-on-timeout path (ssh
wait errors → delete does NOT fire).

README updated with an ephemeral example and a note about the
timeout-skip behaviour.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 16:06:46 -03:00
3aa64a63c1
vm run: bound the ssh wait and give a useful error on timeout
Before: `guestWaitForSSHFunc` loops forever bounded only by context
cancellation, so if sshd fails to start in the guest `vm run` hangs
indefinitely — which burned a long debugging session during the
golden-image bring-up.

After: the ssh wait gets its own 90s deadline. On guest-side timeout
the error names the VM, explains sshd is the likely suspect, points
at `banger vm logs <name>` for the console output, and notes the VM
is still alive for inspection (or `vm delete` to clean up). Parent
context cancellation (Ctrl-C, caller timeout) still surfaces as-is
without the hint.

`vmRunSSHTimeout` is a var rather than a const so tests can shrink
it; the new TestRunVMRunSSHTimeoutReturnsActionableError sets it to
50ms and asserts the error message contains the actionable bits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 15:59:27 -03:00