Commit graph

308 commits

Author SHA1 Message Date
cef9bf92a5
ssh-config: harden sameDirOrParent against symlinks + add edge tests
The symlink test in this commit catches a real bug: sameDirOrParent
used filepath.Abs for both sides of the "is the key inside the
legacy dir?" check, but filepath.Abs doesn't resolve symlinks. A
user whose ssh_key_path pointed into ConfigDir/ssh via a symlinked
spelling (e.g. ConfigDir itself is a symlink, or the user
maintains an alias tree) would have their key silently deleted by
the legacy-dir scrub — the gate thought the key lived elsewhere
because the two spellings didn't match lexically.

Fix: resolvePathForComparison tries filepath.EvalSymlinks first,
falls back to filepath.Abs when the path doesn't exist yet (new
install, pre-first-Open). Both sides of the sameDirOrParent
comparison now use this helper, so a symlinked key + canonical
dir (or the reverse) lands in the same physical path before the
Rel check.

Tests added in this commit:

internal/daemon/ssh_client_config_test.go
  TestSameDirOrParentHandlesSymlinks — symlinked-key + canonical-dir
  and the reverse are both reported "inside"; unrelated paths stay
  out. Skips if the filesystem doesn't support symlinks.

internal/config/config_test.go
  TestLoadNormalizesAbsoluteSSHKeyPath — trailing slash, duplicate
  slashes, dot segments all collapse via filepath.Clean, so two
  spellings of the same path compare equal downstream.
  TestEnsureDefaultSSHKeyRejectsCorruptExistingFile — regression
  guard against a future "regenerate if invalid" patch that would
  silently nuke a real user key.
  TestResolveSSHKeyPathRejectsEmptySSHDirAndStateDir — pins the
  absolute-path guard that stops a bad layout from scribbling
  into cwd (this was the test that caught the stray
  internal/config/ssh/ a few commits back).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 17:48:06 -03:00
b2756f5e7e
test: add newTestDaemon harness + options
No Daemon test in this package has a shared constructor. Every file
re-derives the same pattern — &Daemon{...}, wireServices(d), maybe
override a field — which means new lifecycle / integration tests
spend half their length standing up infrastructure instead of
exercising behaviour.

Consolidate into internal/daemon/daemon_testing_test.go:

  func newTestDaemon(t *testing.T, opts ...testDaemonOption) *Daemon

Defaults: tempdir layout (distinct StateDir/ConfigDir/SSHDir/...),
fresh store.Store with migrations auto-run, permissiveRunner,
io.Discard logger, empty vmCaps (so default workDisk/dns/nat
capabilities don't fire real side effects in tests that just want
to exercise VMService plumbing).

Options so far:
  - withRunner(system.CommandRunner)
  - withConfig(model.DaemonConfig)
  - withStore(*store.Store)
  - withLogger(*slog.Logger)
  - withLayout(paths.Layout)
  - withVMCaps(caps ...vmCapability)
  - withVsockHostDevice(string)

withVMCaps tracks a vmCapsSet flag so tests that explicitly pass no
caps (i.e. the default) still get the empty-slice behaviour — the
reset after wireServices only fires when the caller didn't opt in.
That keeps wireServices's production semantics unchanged: if you
construct a real Daemon without pre-populating vmCaps, you still
get the default three.

Two smoke tests pin:
  - zero-option call wires every service, gives an empty-vmCaps
    daemon with the default vsock device, store non-nil
  - each option actually lands on the resulting Daemon (guards
    against silent rename)

Existing tests unchanged — this is purely additive. Later slices
(Firecracker error-path tests, store migration edges, lifecycle
flow harness) will adopt the helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 17:45:43 -03:00
e2885060dc
README: add ssh_key_path migration note + document normalization rules
Docs pointing at the new state-dir default were updated in b1fbf69;
what was still missing is the migration guidance the review asked for.

Add a short note under the ssh_key_path bullet covering:

- what moved (~/.config/banger/ssh/id_ed25519 →
  ~/.local/state/banger/ssh/id_ed25519)
- that users with the old path hardcoded in config.toml are safe
  (the narrowed legacy-dir cleanup preserves the enclosing dir when
  ssh_key_path points inside it)
- that unsetting the key and letting banger manage the new default
  is also fine — the only caveat is existing VMs need a
  stop-and-start to re-sync authorized_keys

Also document the new normalization rules (~/ expansion, absolute
required) on the ssh_key_path bullet itself so users know what's
accepted before they hit a load error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 17:14:00 -03:00
617008e8f1
config: normalize ssh_key_path — expand ~/, reject non-absolute
Bug: resolveSSHKeyPath returned a configured ssh_key_path verbatim.
That meant:

- ssh_key_path = "~/.ssh/id_ed25519" kept the literal "~" — downstream
  readers (internal/guest/ssh.go, internal/daemon/image_seed.go,
  internal/daemon/vm_authsync.go, internal/cli/ssh.go) do raw
  os.ReadFile on the path and fail at runtime with a path that looks
  fine but isn't.
- ssh_key_path = "id_ed25519" (relative) silently worked or didn't
  depending on the daemon's cwd — the daemon process's cwd is not
  the user's shell cwd, so behavior was non-obvious.

Fix: add normalizeSSHKeyPath() run over configured values. It:

  - expands "~/..." against $HOME
  - rejects bare "~" (ambiguous)
  - rejects "~user/..." (we don't do user-tilde)
  - rejects relative paths outright
  - returns filepath.Clean'd absolute paths

Tests cover the accepting case (home-anchored expansion) and every
rejection branch via a table-driven subtests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 17:00:34 -03:00
b1fbf695ca
ssh-config: narrow the legacy-dir cleanup so it can't delete a user key
Bug: syncVMSSHClientConfig did os.RemoveAll on $ConfigDir/ssh every
daemon Open. The intent was to migrate off the pre-opt-in layout,
where banger used to write $ConfigDir/ssh/ssh_config. But a user who
sets ssh_key_path = "~/.config/banger/ssh/id_ed25519" in config.toml
has their key live exactly in that dir — and the scrub deletes it
along with every other file in the tree.

This is the same class of bug that cost the default key until
ebe6517 moved it to StateDir, but that fix was scoped to the default
path. A configured ssh_key_path pointed under the legacy dir still
dies.

Fix: replace os.RemoveAll with a narrow two-step cleanup:

 1. Skip the cleanup entirely when the configured ssh_key_path
    resolves under the legacy dir. A user who pointed banger at a
    key there must keep the enclosing directory.
 2. Otherwise, os.Remove the specific legacy file ($ConfigDir/ssh/
    ssh_config) and then os.Remove the directory. The second
    os.Remove fails with ENOTEMPTY if the dir still holds anything
    (e.g. a user-managed sibling file we don't own). Both errors
    are swallowed — this is best-effort migration, not a hard
    failure.

Tests pin all three paths: user key under legacy dir survives,
legacy dir empties and is removed when the user moved on, and a
user-managed sibling file in the legacy dir is preserved.

Also fix stale doc claims in README.md and AGENTS.md — both still
pointed at the old ~/.config/banger/ssh/id_ed25519 default, which
moved to ~/.local/state/banger/ssh/id_ed25519 in ebe6517.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 16:31:07 -03:00
fba30f26d4
firecracker: chown API + vsock sockets inside the sudo shell
Bug: Firecracker creates its API and vsock sockets as root:root 0700
(enforced by the intentional umask 077 in buildProcessRunner). The
daemon, running as the invoking user, then can't connect(2) to
either — AF_UNIX connect needs write permission on the socket file
and 0700 root-owned leaves thales without any.

firecracker-go-sdk's Machine.Start() blocks on waitForSocket, which
probes the socket with both os.Stat (succeeds — parent dir is the
user's XDG_RUNTIME_DIR) and an HTTP GET over the socket (fails —
EACCES on connect). The SDK loops for 3 seconds then fails with
"Firecracker did not create API socket ... context deadline exceeded".

The daemon's EnsureSocketAccess chown was meant to fix permissions,
but it runs *after* Machine.Start returns — and Start never returns
because it's still looping on the SDK's probe. Chicken-and-egg.

Fix: inside the sudo'd shell that launches firecracker, spawn a
background subshell that polls for each expected socket (API + vsock,
when configured) and chowns it to $SUDO_UID:$SUDO_GID as soon as it
appears. The background polling is bounded at 1s (20 × 50ms) so a
broken firecracker invocation doesn't leak a waiting shell.

Post-fix: socket appears root-owned 0600 briefly, is chowned to the
invoking user within ~50ms, SDK's HTTP probe succeeds, Machine.Start
returns normally. EnsureSocketAccess's later chmod 600 remains the
belt-and-braces guarantee on final mode.

Verified: manual repro of the shell script produces a socket owned
by thales:thales that a non-root python socket.connect() accepts.
Without the fix the same setup gives "PermissionError: [Errno 13]
Permission denied".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 16:09:02 -03:00
60f90eb8be
config: harden resolveSSHKeyPath against relative paths + drop stray test keys
Prior commit's test (TestLoadDefaultsResolveFirecrackerAndGenerateSSHKey)
was contained, but every OTHER config test called
Load(paths.Layout{ConfigDir: ...}) without SSHDir or StateDir set.
Under the new code path that meant resolveSSHKeyPath produced a
relative target ("ssh/id_ed25519") which go test happily wrote
against the package's own source directory — and I caught that in
the commit after the fact, in the form of internal/config/ssh/
showing up as tracked files.

Two changes:
- resolveSSHKeyPath now errors if the resolved path is not absolute.
  paths.Resolve always produces an absolute SSHDir in production;
  this just stops a fumbled layout from silently scribbling into
  cwd.
- Every existing config test that was relying on the old
  ConfigDir/ssh path gets an explicit SSHDir: t.TempDir() added,
  restoring the key-generation surface under tempdir isolation.

Delete internal/config/ssh/ — those files were test-generated by the
prior commit and have no business in the repo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 14:31:11 -03:00
ebe651762f
config: put the default SSH key under the state dir, not ConfigDir/ssh
Bug: every daemon Open deleted the freshly-generated default SSH key
before returning, so the next VM create failed reading it.

Sequence:
  1. Open → config.Load → resolveSSHKeyPath generates
     ~/.config/banger/ssh/id_ed25519
  2. Open → ensureVMSSHClientConfig → syncVMSSHClientConfig scrubs
     ~/.config/banger/ssh entirely as a migration step for the
     pre-opt-in layout (commit 108f7a0)

The scrub was added for a file that used to live at
ConfigDir/ssh/ssh_config, but it os.RemoveAll'd the whole
ConfigDir/ssh dir — including the id_ed25519 the key generator had
just put there.

Fix: point the default key at layout.SSHDir (a StateDir-rooted path
that paths.Ensure already creates). The scrub can keep cleaning up
ConfigDir/ssh because nothing banger writes under it anymore.

Users whose ssh_key_path is explicitly set in config.toml are
unaffected — configured wins. Users on the default path will get a
fresh key at StateDir/ssh/id_ed25519 on their next daemon Open;
existing VMs' authorized_keys re-sync on next start/create through
ensureAuthorizedKeyOnWorkDisk, so no manual intervention is needed
beyond restarting the daemon.

Regression test pins the new placement and asserts the legacy path
stays empty.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 14:29:34 -03:00
80ae4d6667
docs: resync package docs, AGENTS, and kernel-catalog with current code
Four drift fixes from a doc sweep.

internal/daemon/doc.go
  Replace the capability-hook description that still said "Hook
  methods take *Daemon; VMService reaches them through a
  capabilityHooks seam." Current reality: every capability is a
  plain struct carrying its own service pointers
  (workDiskCapability{vm,ws,store}, dnsCapability{net},
  natCapability{vm,net,logger}); wireServices builds the default
  list; no hook reaches *Daemon.

internal/daemon/ARCHITECTURE.md
  The VMService field list still claimed guestWaitForSSH and
  guestDial were "per-instance fields." Those were deleted as
  refactor residue. Update the note to say the seams live on
  *Daemon (reached by WorkspaceService via closures wired at
  construction) and document the vsockHostDevice field that
  replaced the old package-global vsockHostDevicePath.

AGENTS.md
  Drop the "experimental web UI" mention (removed) and the
  `session` subpackage (removed). Mention banger-vsock-agent as
  the third cmd/ binary while we're here — AGENTS hadn't listed
  it.

docs/kernel-catalog.md
  The trust-model section still read as if upstream kernel sources
  were fetched by HTTPS alone. Add a paragraph covering the PGP
  verification make-generic-kernel.sh now does against the
  detached .tar.sign and the three kernel.org release signing keys.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 13:01:11 -03:00
88bc466d58
tests: targeted coverage for doctor, workspace rejections, and nat capability
Three thematic test files pinning behavior surfaces that had none
before, following the review's recommendation to plug concrete
error/cleanup branches rather than chase a coverage percentage.

doctor_test.go
  Covers Daemon.doctorReport end-to-end with a permissive runner +
  fake executables on PATH. Pins: store error surfaces as fail,
  store success as pass, missing firecracker kills the host-runtime
  check, the three default capability feature checks (work disk,
  vm dns, nat) are emitted, vm-defaults is always-pass with
  provenance. Previously 0% — now the Doctor() command's contract
  with the CLI is under guard.

workspace_rejection_test.go
  Covers the four early-exit branches of PrepareVMWorkspace that
  the existing happy-path + lock-release tests never hit: malformed
  mode, --from without --branch, VM not running, VM not found.
  Each one returns before any SSH I/O, so the fake-firecracker
  infra the happy-path test needs is unnecessary — a bare wired
  daemon with a stored VMRecord suffices.

nat_capability_test.go
  Covers natCapability.ApplyConfigChange (unchanged flag → no-op,
  VM not alive → no-op, toggle on live VM → runner reached) and
  natCapability.Cleanup (NAT disabled → no-op, runtime handles
  missing → defensive no-op, full wiring → ensureNAT(false)). A
  countingRunner + startFakeFirecracker fixture stands in for the
  real host plumbing, with waitForVMAlive polling past the
  exec -a race window that startFakeFirecracker exposes on
  loaded CI boxes.

make coverage-total 37.8% → 38.6%. The number isn't the point —
these tests exist so the next refactor in this area has to
break an explicit assertion to drift.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 12:58:12 -03:00
3aec590deb
vmservice: delete dead guestWaitForSSH + guestDial seams
VMService carried guestWaitForSSH and guestDial fields + matching
constructor args ever since the daemon split, but nothing on
VMService ever read them. The live guest-SSH path runs on *Daemon
(d.waitForGuestSSH / d.dialGuest in guest_ssh.go); WorkspaceService
reaches those through closures wired in wireServices.

So the VMService copies were refactor residue: they made the
service look more decoupled than it actually is, and any future
test that stubbed VMService.guestDial would be stubbing nothing.
Delete the fields, the deps entries, the newVMService assignments,
and the wireServices passes.

Real seams on *Daemon are unchanged — those are the ones tests
(e.g. workspace_test.go) already set directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 12:45:27 -03:00
bbd187391e
noteUntrackedSkipped: fix subdir underreport + be best-effort everywhere
Two bugs in the untracked-skipped warning, both surfaced by review.

1. Wrong scope for subdir inputs. The helper accepted any path the
   caller had (sourcePath, which may be a user-supplied subdirectory)
   and ran `git -C <path> ls-files --others --exclude-standard`. Git
   scopes that output to the cwd, so pointing `vm run ./repo/sub` at
   a subdir silently underreported untracked files living elsewhere
   in the repo — exactly the files the warning exists to surface.
   Fix: resolve sourcePath to the repo root inside the helper via
   `rev-parse --show-toplevel` before counting.

2. Inconsistent failure handling. The comment said the helper should
   be silent when the count can't be determined; the body returned
   the error. vm_run.go treated the error as non-fatal (logged a
   warning, continued); workspace prepare and --dry-run aborted the
   whole operation on the same helper failure. A courtesy notice
   shouldn't kill the operation.
   Fix: make the helper best-effort in signature and body — no error
   return, swallows rev-parse + count failures, emits nothing when
   there's nothing to say. All three callers lose their error
   branches.

Regression tests:
- subdir input reports the root-level untracked file (the bug case)
- non-repo path produces silence, not a fatal error
- inspector whose runner errors on every call produces silence

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 12:42:33 -03:00
ecb18ce6ca
seams: move the last four package globals onto instance fields
Three test seams were still package-level mutable vars, which tests
had to swap before use. That's the classic path to flaky parallel
tests — two goroutines fighting over the same global fake. Push each
down to the struct that owns the behaviour.

internal/daemon/dns_routing.go
  lookupExecutableFunc + vmDNSAddrFunc → fields on *HostNetwork,
  defaulted at newHostNetwork time. dns_routing_test builds
  HostNetwork{..., lookupExecutable: stub, vmDNSAddr: stub} inline,
  no more t.Cleanup dance around package-level vars.

internal/daemon/preflight.go + doctor.go
  vsockHostDevicePath (mutable string) → vsockHostDevice field on
  *VMService, defaulted via defaultVsockHostDevice constant in
  newVMService. Preflight reads s.vsockHostDevice; doctor reads
  d.vm.vsockHostDevice. Logger test sets d.vm.vsockHostDevice = tmp
  after wireServices.

internal/daemon/workspace/workspace.go
  HostCommandOutputFunc → *Inspector struct with a Runner field.
  Every git-using helper (GitOutput, GitTrimmedOutput,
  GitResolvedConfigValue, RunHostCommand, ListSubmodules,
  ListOverlayPaths, CountUntrackedPaths, InspectRepo,
  ImportRepoToGuest, PrepareRepoCopy) is now a method on *Inspector.
  NewInspector() wraps the real host runner for production;
  WorkspaceService holds one via repoInspector, CLI deps holds one
  too. cli_test.go's submodule-rejection test builds its own
  Inspector with a scripted Runner instead of patching a global.
  Pure helpers (FinalizeScript, ResolveSourcePath, ParsePrepareMode,
  ShellQuote, FormatStepError, GitFileURL, ParseNullSeparatedOutput)
  stay free functions since they don't touch the host.

Sentinel: grep for HostCommandOutputFunc, lookupExecutableFunc,
vmDNSAddrFunc, vsockHostDevicePath is now empty across internal/.
make lint test green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 12:07:14 -03:00
2685bc73f8
doctor: open the state DB read-only so inspection never mutates it
`banger doctor` used to call store.Open, which unconditionally runs
migrations on the way up. Diagnostics mutating persistent state is a
surprise — particularly now that migration 2 drops a column, so a
plain `doctor` invocation against an old DB would silently schema-
evolve it.

Add store.OpenReadOnly: separate DSN builder with mode=ro and a
minimal pragma set (foreign_keys, busy_timeout — no journal_mode=WAL,
no wal_autocheckpoint), skips runMigrations, and pings on open so a
missing DB fails up front rather than at first query. doctor.go now
uses OpenReadOnly; the existing storeErr fallback path surfaces any
failure as a failing check, unchanged.

Tests pin two invariants:
- OpenReadOnly against a DB whose migration 2 marker was removed and
  packages_path re-added must leave both alone (i.e. no drift is
  applied behind the user's back).
- Any write attempted through the read-only handle is rejected at
  the driver layer (belt-and-braces for future refactors).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 11:05:23 -03:00
129475be20
config + store: remove dead knobs and stale schema
Three drift items surfaced in review, each dead on arrival and each
worth trusting a little more at v0.1.0.

config: drop MetricsPollInterval. The field was parsed from TOML
(metrics_poll_interval), stored on DaemonConfig, and ignored by every
consumer — only StatsPollInterval drives the background poll loop.
Users setting it in config.toml saw zero effect. Removed from the TOML
surface, the model constant, and the config test.

daemon: delete ensureDefaultImage. No callers, body was `_ = ctx;
return nil`. Dead since whatever flow used to call it got removed.

store: drop packages_path from the images table. The column was
carried by the baseline migration but never referenced by UpsertImage
(no INSERT / UPDATE mention) or any Go model field — a ghost from a
build pipeline that no longer exists. Added migration id=2
(drop_dead_image_columns) with an idempotent dropColumnIfExists
helper: fresh installs run baseline (creates the column) + 2 (drops
it); legacy DBs where the column was never added get a no-op. Updated
the direct-INSERT SQL in TestGetImageRejectsMalformedTimestamp to
drop the column reference, and added a migration test covering both
install paths (fresh + legacy).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 10:54:01 -03:00
2a7f55f028
vm run: ship tracked files only by default; add --include-untracked + --dry-run
Workspace-mode vm run and vm workspace prepare used to copy both
tracked AND untracked non-ignored files into the guest. That silently
catches local .env files, scratch notes, credentials, and any other
working-tree state a developer hasn't explicitly gitignored — a real
data-exposure footgun given the golden image ships Docker and the
usual dev tooling.

Flip the default to tracked-only. Users who actually want the fuller
set opt in with --include-untracked (documented in both commands'
help). Gitignored files are still always excluded regardless of the
flag.

Add --dry-run to both vm run and vm workspace prepare. Dry-run
inspects the repo CLI-side (no VM created, no daemon RPC needed since
the daemon is always local and the inspection is a pure git read),
prints the exact file list + mode, and exits. A byte-level preview of
what would land in the guest.

When running real (non-dry) and untracked files exist in the repo but
are being skipped under the new default, print a one-line notice
pointing to --include-untracked so users aren't surprised when the
guest is missing something they expected.

Signature changes:
- ListOverlayPaths takes an includeUntracked bool (tracked always;
  untracked gated by flag).
- InspectRepo takes the same flag and passes it through.
- VMWorkspacePrepareParams gains IncludeUntracked.
- WorkspaceService.workspaceInspectRepo seam signature widened to
  match (4 callers in tests updated).

New workspace package tests cover both modes and verify that
gitignored files never leak regardless of the flag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 19:53:17 -03:00
25a1466947
supply chain: verify signatures and pins across image + kernel builds
Three independent hardenings, addressing a review finding that the
kernel and image build pipelines were relying on HTTPS alone for
artifact integrity.

scripts/make-generic-kernel.sh
- Fetch the detached PGP signature (linux-<ver>.tar.sign) alongside
  the tarball and verify it with gpg before extraction. An isolated
  $GNUPGHOME under the tempdir keeps the kernel signers out of the
  invoking user's keyring.
- Import the three kernel.org release signing keys (Greg KH / Linus /
  Sasha Levin) from keyserver.ubuntu.com, falling back to
  keys.openpgp.org. Ubuntu comes first because keys.openpgp.org strips
  unverified UIDs on upload, leaving gpg with UID-less keys it
  refuses to trust.
- Require VALIDSIG (cryptographic proof) rather than GOODSIG
  (printed even for expired keys) before proceeding. Verified
  end-to-end against a clean tarball (accepts) and a byte-flipped
  tampered copy (rejects with BADSIG).
- gpg + gpgv + xz added to the required-tools check.

images/golden/Dockerfile
- Pin Docker's apt signing key by fingerprint. After downloading
  /etc/apt/keyrings/docker.asc we gpg --show-keys --with-colons it,
  extract the fpr, and compare against the expected
  9DC858229FC7DD38854AE2D88D81803C0EBFCD88. A tampered or swapped key
  aborts the build before any apt repo metadata is fetched.
- Replace `curl https://mise.run | sh` with a pinned GitHub release
  binary (mise v2026.4.18, linux-x64) verified against its published
  sha256. Refuses to build on unknown architectures rather than
  silently installing a binary we have no hash for.
- Add gnupg to the ESSENTIAL apt-get install so the fingerprint check
  has gpg available.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 19:38:13 -03:00
011b59a72f
daemon split (8/8): document capability decoupling + wireServices
Update ARCHITECTURE.md's Composition section to reflect the finished
split: capabilities carry explicit service-pointer fields, nothing
reaches *Daemon at dispatch time, and wireServices(d) is the single
entry point that builds services + capabilities eagerly (from Open
in production, from tests after constructing &Daemon{...} literals).

Removes the paragraph admitting capability→*Daemon coupling and the
lazy-init getters justification, neither of which applies anymore.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 15:59:39 -03:00
9c73155e17
daemon split (7/n): narrow capability interfaces, wire deps at construction
Stop passing *Daemon into capability hooks. Each capability
implementation is now a struct with explicit service-pointer fields
populated at wireServices time; the six dynamic-dispatch interfaces
(AddStartPreflight, PrepareHost, PostStart, Cleanup, ApplyConfigChange,
AddDoctorChecks) no longer have a *Daemon parameter. Capability
methods reach their dependencies through struct fields, not through
d.vm / d.ws / d.net.

- workDiskCapability carries {vm, ws, store, defaultImageName}
- dnsCapability carries {net}
- natCapability carries {vm, net, logger}

Daemon.defaultCapabilities() builds the production list from the
already-constructed services and is called from wireServices so
d.vmCaps is populated eagerly. Tests that preinstall d.vmCaps with
stubs still work — wireServices only overwrites an empty slice.

registeredCapabilities() is gone (every dispatch loop now reads
d.vmCaps directly). capabilities_test.go's testCapability fake drops
*Daemon from its method set to match the new interfaces.

This finishes the daemon service split: capability implementations
no longer reach through the composition root, there's no path back
to *Daemon from any service or capability, and test construction
goes through one explicit wireServices call instead of lazy getters.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 15:59:09 -03:00
16702bd5e1
daemon split (6/n): extract wireServices + drop lazy service getters
Factor the service + capability wiring out of Daemon.Open() into
wireServices(d), an idempotent helper that constructs HostNetwork,
ImageService, WorkspaceService, and VMService from whatever
infrastructure (runner, store, config, layout, logger, closing) is
already set on d. Open() calls it once after filling the composition
root; tests that build &Daemon{...} literals call it to get a working
service graph, preinstalling stubs on the fields they want to fake.

Drops the four lazy-init getters on *Daemon — d.hostNet(),
d.imageSvc(), d.workspaceSvc(), d.vmSvc() — whose sole purpose was
keeping test literals working. Every production call site now reads
d.net / d.img / d.ws / d.vm directly; the services are guaranteed
non-nil once Open returns. No behavior change.

Mechanical: all existing `d.xxxSvc()` calls (production + tests)
rewritten to field access; each `d := &Daemon{...}` in tests gets a
trailing wireServices(d) so the literal + wiring are side-by-side.
Tests that override a pre-built service (e.g. d.img = &ImageService{
bundleFetch: stub}) now set the override before wireServices so the
replacement propagates into VMService's peer pointer.

Also nil-guards HostNetwork.stopVMDNS and d.store in Close() so
partially-initialised daemons (pre-reconcile open failure) still
tear down cleanly — same contract the old lazy getters provided.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 15:55:28 -03:00
0cfd8a5451
daemon split (5/5): document the service composition
Phase 5 of the daemon god-struct refactor. Code motion landed in
phases 1-4; this commit retells the architecture so the docs match
the structure.

ARCHITECTURE.md loses the "deferred v0.2 project" hedge about
splitting services. The Composition section now describes the four
services (HostNetwork, ImageService, WorkspaceService, VMService)
that own behaviour, the consumer-defined seam pattern for
cross-service calls, and the lazy-init getter pattern that keeps
existing test literals compiling.

doc.go inventories which methods live on which service, and the
lock-ordering section gains the service prefixes (e.g.
VMService.vmLocks instead of bare vmLocks) so readers don't have to
guess which type owns which mutex.

No code changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 20:58:53 -03:00
466a7c30c4
daemon split (4/5): extract *VMService service
Phase 4 of the daemon god-struct refactor. VM lifecycle, create-op
registry, handle cache, disk provisioning, stats polling, ports
query, and the per-VM lock set all move off *Daemon onto *VMService.

Daemon keeps thin forwarders only for FindVM / TouchVM (dispatch
surface) and is otherwise out of VM lifecycle. Lazy-init via
d.vmSvc() mirrors the earlier services so test literals like
\`&Daemon{store: db, runner: r}\` still get a functional service
without spelling one out.

Three small cleanups along the way:

  * preflight helpers (validateStartPrereqs / addBaseStartPrereqs
    / addBaseStartCommandPrereqs / validateWorkDiskResizePrereqs)
    move with the VM methods that call them.
  * cleanupRuntime / rebuildDNS move to *VMService, with
    HostNetwork primitives (findFirecrackerPID, cleanupDMSnapshot,
    killVMProcess, releaseTap, waitForExit, sendCtrlAltDel)
    reached through s.net instead of the hostNet() facade.
  * vsockAgentBinary becomes a package-level function so both
    *Daemon (doctor) and *VMService (preflight) call one entry
    point instead of each owning a forwarder method.

WorkspaceService's peer deps switch from eager method values to
closures — vmSvc() constructs VMService with WorkspaceService as a
peer, so resolving d.vmSvc().FindVM at construction time recursed
through workspaceSvc() → vmSvc(). Closures defer the lookup to call
time.

Pure code motion: build + unit tests green, lint clean. No RPC
surface or lock-ordering changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 20:57:05 -03:00
c0d456e734
daemon split (3/5): extract *WorkspaceService service
Third phase of splitting the daemon god-struct. WorkspaceService now
owns workspace.prepare / workspace.export plus the ssh-key +
git-identity + arbitrary-file sync that runs as part of VM start's
prepare_work_disk capability hook. workspaceLocks (the per-VM tar
serialisation set) lives on the service.

workspace.go and vm_authsync.go flipped receivers from *Daemon to
*WorkspaceService. The workspaceInspectRepo / workspaceImport test
seams moved onto the service as fields.

Peer-service dependencies go through narrow function-typed fields:
vmResolver, aliveChecker, waitGuestSSH, dialGuest, imageResolver,
imageWorkSeed, withVMLockByRef, beginOperation. WorkspaceService
never touches VMService / HostNetwork / ImageService directly —
only the exact operations the Daemon hands it at construction.

Daemon lazy-init helper workspaceSvc() mirrors the Phase 1/2
pattern. Test literals still write `&Daemon{store: db, runner: r}`
and get a wired workspace service for free. Tests that override the
inspect/import seams (workspace_test.go, ~4 sites) assign them on
d.workspaceSvc() instead of on the daemon literal.

Dispatch in daemon.go: vm.workspace.prepare and vm.workspace.export
now forward one-liners to d.workspaceSvc().

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 20:42:31 -03:00
d7614a3b2b
daemon split (2/5): extract *ImageService service
Second phase of splitting the daemon god-struct. ImageService now owns
all image + kernel registry operations: register/promote/delete/pull
for images (bundle + OCI paths), the six kernel commands, and the
shared SSH-key/work-seed injection helpers. imageOpsMu (the
publication-window lock) lives on the service; so do the three OCI
pull test seams pullAndFlatten / finalizePulledRootfs / bundleFetch.
The four files images.go, images_pull.go, image_seed.go, kernels.go
flipped their receivers from *Daemon to *ImageService.

FindImage moved with the service. Daemon keeps a thin FindImage
forwarder so callers reading the dispatch code see the obvious
facade and tests that pre-date the split still compile.

flattenNestedWorkHome — called from image_seed.go, vm_authsync.go,
and vm_disk.go across future service boundaries — became a
package-level helper taking a CommandRunner explicitly. Daemon keeps
a deprecated forwarder for now; the other services will use the
package form.

Lazy-init helper imageSvc() on Daemon mirrors hostNet() from
Phase 1, so test literals like &Daemon{store: db, runner: r, ...}
that don't spell out an ImageService still get a working one.
Tests that override the image test seams (autopull_test,
concurrency_test, images_pull_test, images_pull_bundle_test) now
assign d.img = &ImageService{...seams...}; the two-statement pattern
matches what Phase 1 established for HostNetwork.

Dispatch in daemon.go is cleaner now: every image/kernel RPC handler
is a single-liner forwarding to d.imageSvc().*. Phase 5 will do the
same for VM lifecycle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 20:30:32 -03:00
362009d747
daemon split (1/5): extract *HostNetwork service
First phase of splitting the daemon god-struct into focused services
with explicit ownership.

HostNetwork now owns everything host-networking: the TAP interface
pool (initializeTapPool / ensureTapPool / acquireTap / releaseTap /
createTap), bridge + socket dir setup, firecracker process primitives
(find/resolve/kill/wait/ensureSocketAccess/sendCtrlAltDel), DM
snapshot lifecycle, NAT rule enforcement, guest DNS server lifecycle
+ routing setup, and the vsock-agent readiness probe. That's 7 files
whose receivers flipped from *Daemon to *HostNetwork, plus a new
host_network.go that declares the struct, its hostNetworkDeps, and
the factored firecracker + DNS helpers that used to live in vm.go.

Daemon gives up the tapPool and vmDNS fields entirely; they're now
HostNetwork's business. Construction goes through newHostNetwork in
Daemon.Open with an explicit dependency bag (runner, logger, config,
layout, closing). A lazy-init hostNet() helper on Daemon supports
test literals that don't wire net explicitly — production always
populates it eagerly.

Signature tightenings where the old receiver reached into VM-service
state:
 - ensureNAT(ctx, vm, enable) → ensureNAT(ctx, guestIP, tap, enable).
   Callers resolve tap from the handle cache themselves.
 - initializeTapPool(ctx) → initializeTapPool(usedTaps []string).
   Daemon.Open enumerates VMs, collects taps from handles, hands the
   slice in.

rebuildDNS stays on *Daemon as the orchestrator — it filters by
vm-alive (a VMService concern handles will move to in phase 4) then
calls HostNetwork.replaceDNS with the already-filtered map.

Capability hooks continue to take *Daemon; they now use it as a
facade to reach services (d.net.ensureNAT, d.hostNet().*). Planned
CapabilityHost interface extraction is orthogonal, left for later.

Tests: dns_routing_test.go + fastpath_test.go + nat_test.go +
snapshot_test.go + open_close_test.go were touched to construct
HostNetwork literals where they exercise its methods directly, or
route through d.hostNet() where they exercise the Daemon entry
points.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 20:11:46 -03:00
eba9a553bf
daemon: use exact-name lookup for VM-create uniqueness
reserveVM's duplicate-name guard routed through Daemon.FindVM, which
falls back to prefix-matching on both ids and names when no exact
match is found. That turns the uniqueness check into a correctness
bug: a brand-new VM name can be rejected because it happens to
prefix an existing VM's id, or an existing VM's name. So `vm create
--name beta` fails when `beta-sandbox` already exists.

Swap in a dedicated store.GetVMByName that does a literal `WHERE
name = ?` lookup, and use it from reserveVM. FindVM keeps its
prefix-matching behaviour for user-facing lookup paths (`vm ssh
<partial>`, `vm stop <partial>`) where "did you mean" semantics
are the feature.

Tests:
 - TestReserveVMAllowsNameThatPrefixesExistingVM — seeds a VM whose
   id + name both start with "longname", then reserves two new VMs
   named "longname" and "longname-sandbox". Both must succeed.
   Under the old FindVM-based check, both would fail.
 - TestReserveVMRejectsExactDuplicateName — actual collisions are
   still rejected after the swap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 14:00:33 -03:00
108f7a0600
ssh-config: make the ssh <name>.vm shortcut opt-in
Before this change, every daemon.Open() wrote a Host *.vm stanza into
~/.ssh/config in a marker-fenced block. That's a real footgun for users
who manage their SSH config declaratively (chezmoi, dotfiles, NixOS):
banger was mutating host state outside its own directory on every
daemon start, easy to miss and hard to audit.

New contract: the daemon only ever writes its own ssh_config file at
~/.config/banger/ssh_config. ~/.ssh/config is untouched unless the user
opts in. `banger vm ssh <name>` still works out of the box — the
shortcut only matters for plain `ssh sandbox.vm` from any terminal.

The opt-in surface is `banger ssh-config`:

  banger ssh-config              # prints path + include-line +
                                 # install/uninstall hints
  banger ssh-config --install    # adds `Include <bangerConfig>` to
                                 # ~/.ssh/config inside a marker-fenced
                                 # block; idempotent; migrates any
                                 # legacy inline Host *.vm block from
                                 # pre-opt-in builds
  banger ssh-config --uninstall  # removes the new Include block AND
                                 # any legacy inline block

Doctor gains a gentle warn-level note when banger's ssh_config exists
but the user hasn't wired it in — not a fail, since the shortcut is
convenience and `banger vm ssh` covers the essential case.

Tests cover: daemon writes banger file and does NOT touch ~/.ssh/config,
Install adds the block, Install is idempotent, Install migrates the
legacy inline block cleanly (removing it, preserving unrelated
entries, adding the new Include block), Uninstall removes both marker
variants, Uninstall is a no-op when ~/.ssh/config is absent, and
UserSSHIncludeInstalled detects both marker shapes.

README reframes the feature as optional convenience.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:57:26 -03:00
99d0811097
daemon: shrink createVMMu + imageOpsMu to reservation/publication windows
Before: createVMMu was held across the whole of CreateVM — including
image resolution (which could fire a full auto-pull) and startVMLocked
(boot of multiple seconds). imageOpsMu was held across the whole of
PullImage/RegisterImage/PromoteImage/DeleteImage, so any slow OCI pull,
bundle download, or file copy blocked every other image mutation and
every other VM create that needed to auto-pull. The async create API
bought nothing if all creates serialised on the same mutex.

CreateVM is now three phases:

 1. Validate + resolve image (possibly auto-pulling). No global lock.
 2. reserveVM: take createVMMu only long enough to re-check the name
    is free, allocate the next guest IP, and UpsertVM the "created"
    row. Milliseconds.
 3. startVMLocked: run the full boot flow under the per-VM lock only.

Parallel creates of different VMs now overlap on image resolution +
boot; they contend only across the reservation claim.

For the image surface a new publishImage helper isolates the commit
atom (recheck name free, atomic rename stagingDir→finalDir, UpsertImage)
under imageOpsMu. pullFromBundle + pullFromOCI do their network fetch
+ ext4 build + ownership fixup + agent injection outside the lock;
Register moves validation + kernel resolution outside; Promote moves
file copy + SSH-key seeding outside; Delete keeps a brief lock over
the lookup + reference check + store delete and does file cleanup
unlocked.

Two concurrency tests assert the new behaviour:
 - TestPullImageDoesNotSerialiseOnDifferentNames fails the old code
   (second pull blocks on imageOpsMu and never reaches the body).
 - TestPullImageRejectsNameClashAtPublish confirms the publish-window
   recheck is what enforces name uniqueness now that the body runs
   unlocked — exactly one winner.

ARCHITECTURE.md updated to describe the new scope explicitly instead
of calling the locks "narrow".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:44:22 -03:00
afe91e805a
drop unused bench-create script + Makefile target
The script carried a python3 dep for one json.dumps on a VM name
that's always alphanumeric-plus-dashes anyway, it was never wired
into CI or docs, and `time banger vm create` covers the same need
ad hoc when anyone wants to measure create latency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:33:09 -03:00
58464ac28c
docs + doctor: be honest about amd64-only support
The README sold the product as "Linux with /dev/kvm"; the deeper docs
admit that the Makefile pins companion builds to GOARCH=amd64, the
kernel catalog ships only x86_64 entries, and OCI import pulls
linux/amd64 layers. arm64 users who show up through the README only
discover that after install fails in non-obvious ways.

Two surface-level fixes:

- README requirements list leads with "x86_64 / amd64 Linux — arm64 is
  not supported today", with a short note on the three places that
  assumption lives so users understand it's not a last-mile gap.
- `banger doctor` now runs an architecture check that passes on amd64
  and FAILS (not warns) on anything else, referencing the three
  downstream assumptions. Hard-fail rather than warn so a user on an
  arm64 machine doesn't waste time chasing unrelated preflight items.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:03:50 -03:00
e69810610a
daemon: correct ARCHITECTURE doc to match actual package shape + lock scope
Two promises the doc was making that the code doesn't keep:

1. "Helpers moved out so the package stays focused on orchestration."
   The package still has ~29 files and ~130 func (d *Daemon) methods
   wiring VM lifecycle, image management, host networking, background
   reconciliation, and JSON-RPC dispatch. Calling it "just orchestration"
   sets readers up for surprise. Rewrite the subpackages preamble to
   say so, and flag the service split as a post-v0.1.0 project.

2. "vmLocks[id] is held only across short synchronous state validation
   and DB mutations." That's what workspace.prepare does; regular
   lifecycle ops (start/stop/delete/set) go through withVMLockByRef
   and hold the lock across the whole callback body, which for `start`
   means preflight + bridge + firecracker spawn + post-boot wiring.
   Rewrite the vmLocks bullet and the lock-ordering section to say
   that explicitly, so readers don't build "surely my long flow under
   the lock can't be what the doc means" reasoning on top of a false
   premise.

Doc-only change. Code behaviour is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:02:36 -03:00
34dd7644d8
store: introduce versioned migrations with ordered runner + atomic apply
The old migrate() helper only knew how to re-run a fixed slab of CREATE
TABLE IF NOT EXISTS plus per-column ensureColumnExists calls. That worked
while every schema change was a benign additive column; it falls apart
as soon as we need a data backfill, an index, a rename, or anything that
has to happen exactly once in a known order.

Replaces it with a schema_migrations table + ordered []migration slice.
Each migration has a unique id, a human-readable name, and a func(*Tx)
body; the runner opens a transaction per migration so DDL and any data
changes either both land and get recorded or both roll back together,
leaving the DB in a state where retrying on next Open() reapplies from
the same point.

Migration 1 ("baseline") collapses the current schema into one entry:
fresh databases apply it in one shot; existing dev databases see
idempotent `CREATE TABLE IF NOT EXISTS` + `ALTER TABLE … ADD COLUMN`
statements that succeed as no-ops, and the only net effect is the
schema_migrations row that brings them into the versioned system.

Tests cover fresh apply, idempotent re-open, skipping already-applied
ids, rollback on body error (the transient table the migration created
must not survive), and duplicate-id rejection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 12:59:42 -03:00
b930c51990
runtime sockets: close the local-user race window around control-plane creation
Previously the daemon socket, per-VM firecracker API socket, and vsock
socket were transiently world-exposed on hosts without XDG_RUNTIME_DIR:
the runtime directory landed in /tmp at 0755, Firecracker ran with
umask 000 (mode 0666 sockets), and only a follow-up chown/chmod in
EnsureSocketAccess tightened them. A local attacker could race into
bangerd.sock or the firecracker API socket during that window.

Three changes:

- internal/paths/paths.go: RuntimeDir is now created (and re-chmod'd if
  stale) at 0700 unconditionally. When XDG_RUNTIME_DIR is unset and we
  fall back to /tmp/banger-runtime-<uid>, Ensure() now verifies the
  parent dir is owned by the current uid and 0700 mode — refusing to
  place sockets inside a directory someone else created. Symlink swaps
  rejected via Lstat.

- internal/firecracker/client.go: launch firecracker with umask 077
  instead of umask 000 so the API socket is mode 0600 from birth. The
  chown in fcproc.EnsureSocketAccess still transfers ownership from
  root to the invoking user afterwards.

- internal/daemon/fcproc/fcproc.go: EnsureSocketDir now creates (and
  re-chmod's) the runtime socket directory at 0700.

Tests cover the tightening path — an existing 0755 RuntimeDir is
re-chmod'd on Ensure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 12:53:47 -03:00
2b6437d1b4
remove vm session feature
Cuts the daemon-managed guest-session machinery (start/list/show/
logs/stop/kill/attach/send). The feature shipped aimed at agent-
orchestration workflows (programmatic stdin piping into a long-lived
guest process) that aren't driving any concrete user today, and the
~2.3K LOC of daemon surface area — attach bridge, FIFO keepalive,
controller registry, sessionstream framing, SQLite persistence — was
locking in an API we'd have to keep through v0.1.0.

Anything session-flavoured that people actually need today can be
done with `vm ssh + tmux` or `vm run -- cmd`.

Deleted:
- internal/cli/commands_vm_session.go
- internal/daemon/{guest_sessions,session_lifecycle,session_attach,session_stream,session_controller}.go
- internal/daemon/session/ (guest-session helpers package)
- internal/sessionstream/ (framing package)
- internal/daemon/guest_sessions_test.go
- internal/store/guest_session_test.go
- GuestSession* types from internal/{api,model}
- Store UpsertGuestSession/GetGuestSession/ListGuestSessionsByVM/DeleteGuestSession + scanner helpers
- guest.session.* RPC dispatch entries
- 5 CLI session tests, 2 completion tests, 2 printer tests

Extracted:
- ShellQuote + FormatStepError lifted to internal/daemon/workspace/util.go
  (only non-session consumer); workspace package now self-contained
- internal/daemon/guest_ssh.go keeps guestSSHClient + dialGuest +
  waitForGuestSSH — still used by workspace prepare/export
- internal/daemon/fake_firecracker_test.go preserves the test helper
  that used to live in guest_sessions_test.go

Store schema: CREATE TABLE guest_sessions and its column migrations
removed. Existing dev DBs keep an orphan table (harmless, pre-v0.1.0).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 12:47:58 -03:00
c42fcbe012
cli + daemon: move test seams off package globals onto injected structs
CLI: introduce internal/cli.deps which owns every RPC/SSH/host-command
seam the tree used to reach through mutable package vars. Command
builders, orchestrators, and the completion helpers become methods on
*deps. Tests construct their own deps per case, so fakes no longer leak
across cases and tests are free to run in parallel.

Daemon: move workspaceInspectRepoFunc + workspaceImportFunc onto the
Daemon struct (workspaceInspectRepo / workspaceImport), mirroring the
existing guestWaitForSSH / guestDial pattern. Workspace-prepare tests
drop t.Parallel() guards now that they no longer mutate process-wide
state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 19:03:55 -03:00
d38f580e00
doctor: surface state store open failure as failing check
Previously store.Open errors were silently swallowed, so `banger
doctor` could report green while the default-image check (and any
other store-dependent diagnostic) was silently skipped because
d.store was nil.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:35:27 -03:00
3f6ecb4376
cli: split banger.go god file into focused files
Pure code motion — banger.go 3508→240 LOC, same-package
decomposition keeps all identifiers visible without export changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:34:32 -03:00
3a5f4cd40d
cli: delete vm run's dead import path + duplicated git inspection
The CLI carried a full second copy of the workspace import
implementation that `vm run` never actually used:

  - importVMRunRepoToGuest (no callers — the live flow calls the
    daemon's PrepareVMWorkspace RPC instead)
  - prepareVMRunRepoCopy, vmRunCheckoutCommit, vmRunCheckoutScript,
    gitFileURL, runHostCommand (all reachable only from the dead
    importVMRunRepoToGuest)

Plus a duplicated repo-inspection surface that shadowed the
daemon's:

  - inspectVMRunRepo ran every git query the daemon re-ran during
    workspace.prepare (HEAD, branch, identity, origin, overlay list)
  - gitOutput / gitTrimmedOutput / gitResolvedConfigValue /
    parseNullSeparatedOutput / listSubmodules / listOverlayPaths /
    resolveVMRunSourcePath — all identical to the exported
    workspace.* versions
  - vmRunRepoSpec — same fields as workspace.RepoSpec

Replaced with a single minimal preflight:

  func vmRunPreflightRepo(ctx, rawPath) (absPath, err error)

The preflight only checks what the user can fix locally before
banger creates a VM (path exists, sits in a non-bare git repo, no
submodules). The daemon's workspace.prepare RPC does the full
inspection — and returns RepoRoot + RepoName in the response, which
the CLI now threads into the tooling harness instead of computing
them a second time.

Signature changes:

  runVMRun(ctx, ..., *vmRunRepo, ...)   // was: *vmRunRepoSpec
  startVMRunToolingHarness(ctx, client, repoRoot, repoName, progress)
                                        // was: (ctx, client, spec, progress)
  vmRunToolingHarnessScript(plan)       // was: (spec, plan)
  vmRunToolingHarnessLaunchScript(repoName)  // was: (spec)

Tests: the CLI-side git-inspection tests are replaced by a single
TestVMRunPreflightRejectsSubmodules that exercises the preflight.
Everything else (tooling harness script, progress renderer, SSH args,
runVMRun flows) keeps working. The shallow-copy / checkout-script
tests are gone — that code now lives only in
internal/daemon/workspace and is tested there.

Also fixed a latent bug the refactor exposed: vm run's --from flag
defaults to "HEAD", which the daemon reads as "from without branch"
and rejects. CLI now scrubs fromRef when branchName is empty.

Live verified: `banger vm run --name X . -- cmd` boots, workspace
materialises at /root/repo with matching HEAD, exit code propagates.
2026-04-19 17:01:26 -03:00
ae14b9499d
ssh: trust-on-first-use host key pinning everywhere
Guest host-key verification was off in all three SSH paths:

  * Go SSH (internal/guest/ssh.go) used ssh.InsecureIgnoreHostKey
  * `banger vm ssh` passed StrictHostKeyChecking=no
    + UserKnownHostsFile=/dev/null
  * `~/.ssh/config` Host *.vm shipped the same posture into the
    user's global config

Now each path verifies against a banger-owned known_hosts file at
`~/.local/state/banger/ssh/known_hosts` with TOFU semantics:

  * First dial to a VM pins the key.
  * Subsequent dials require an exact match. A mismatch fails with
    an explicit "possible MITM" error.
  * `vm delete` removes the entries so a future VM reusing the IP
    or name re-pins cleanly.
  * The user's `~/.ssh/known_hosts` is untouched.

Changes:

  internal/guest/known_hosts.go (new) — OpenSSH-compatible parser,
    TOFUHostKeyCallback, RemoveKnownHosts. Process-wide mutex
    around the file.
  internal/guest/ssh.go — Dial and WaitForSSH grew a knownHostsPath
    parameter threaded through the callback. Empty path keeps the
    insecure callback (tests + throwaway tools only; documented).
  internal/daemon/{guest_sessions,session_attach,session_lifecycle,
    session_stream}.go — call sites pass d.layout.KnownHostsPath.
  internal/daemon/ssh_client_config.go — the ~/.ssh/config Host *.vm
    block now points at banger's known_hosts and uses
    StrictHostKeyChecking=accept-new. Missing path → fail closed.
  internal/daemon/vm_lifecycle.go — deleteVMLocked drops known_hosts
    entries for the VM's IP and DNS name via removeVMKnownHosts.
  internal/cli/banger.go — sshCommandArgs swaps StrictHostKeyChecking
    no + /dev/null for banger's file + accept-new. Path resolution
    failure falls through to StrictHostKeyChecking=yes.
  internal/paths/paths.go — Layout gains SSHDir + KnownHostsPath;
    Ensure creates SSHDir at 0700.

Tests (internal/guest/known_hosts_test.go): pin on first use, accept
matching key on second dial, reject mismatch, empty path skips
checking, RemoveKnownHosts drops the entry, re-pin works after
remove. Existing daemon + cli tests updated to assert the new
posture and regression-guard against the old flags.

Live verified: vm run writes the pin to banger's known_hosts at 0600
inside a 0700 dir; banger vm ssh + ssh root@<vm>.vm both succeed
using the pin; vm delete clears it.
2026-04-19 16:46:03 -03:00
a59958d4f5
daemon: roll back host state on any Open() failure
Open() touched several pieces of host state before hitting the step
that returned the error:

  * SQLite handle (store.Open)
  * managed SSH client config block (ensureVMSSHClientConfig)
  * vm-DNS UDP listener goroutine (startVMDNS)
  * systemd-resolved per-interface routing (ensureVMDNSResolverRouting)

The only deferred cleanup guarded stopVMDNS. A reconcile() or
initializeTapPool() failure therefore left the listener running, the
resolver wiring in place, and the SQLite handle open. A subsequent
startup attempt ran into "port 42069 already in use" or silently
published stale state.

Fix: once `d` exists, defer `d.Close()` on `err != nil`. Close is
idempotent (sync.Once) and every teardown step (listener close, DNS
listener close, resolver revert, session registry close, store close)
is nil-guarded, so calling it on a daemon that never got past the
first startup step is safe.

Tests (internal/daemon/open_close_test.go):

  - TestCloseOnPartiallyInitialisedDaemon: Close survives a daemon
    with only store + closing channel, and with a vmDNS listener but
    nothing else. Catches regressions where a teardown step forgets
    to nil-check.
  - TestCloseIdempotentUnderConcurrency: 5 goroutines racing on
    Close() never panic (sync.Once + close(d.closing) survive).
  - TestOpenFailureRunsCloseCleanup: structural check that the
    `defer cleanup() if err != nil` pattern actually fires.

Live: `banger daemon stop` cleanly, `banger vm ls` restarts daemon
without a residual listener on port 42069.
2026-04-19 16:36:29 -03:00
d1b9a8c102
remove experimental web UI
The web UI shipped as "experimental" and was never finished — no nav
off the dashboard, no live updates, no settled design, never a
supported surface. It was opt-in by default already; leaving the code
in the tree for v0.1.0 only invited "does this work?" questions and
kept HostSummary/BangerSummary/SudoStatus types on the public RPC
surface that nothing else uses.

Removed:

  internal/webui/                         (all Go + templates + assets)
  internal/daemon/web.go                  (server start / Layout / Config / ListVMs / ListImages)
  internal/daemon/dashboard.go            (DashboardSummary aggregator)

Simplified:

  internal/api/types.go                   drop WebURL on PingResult, drop
                                          HostSummary / SudoStatus / BangerSummary /
                                          DashboardSummary / DashboardSummaryResult
  internal/model/types.go                 drop DaemonConfig.WebListenAddr
  internal/config/config.go               drop web_listen_addr from fileConfig + Load
  internal/daemon/daemon.go               drop webListener / webServer / webURL fields +
                                          startWebServer() call + ping WebURL population
  internal/cli/banger.go                  `daemon status` output no longer branches on web
  internal/daemon/{doc.go,ARCHITECTURE.md} drop web UI sections
  README.md                               drop web_listen_addr config bullet + security paragraph

Tests updated to reflect the new shape. Coverage 57.3 -> 58.9% (the
webui package was largely untested; its removal lifts the ratio
without moving the numerator). `banger daemon status` output and
--help are web-free. Lint + full suite green.
2026-04-19 14:28:08 -03:00
687fcf0b59
vm state: split transient kernel/process handles off the durable schema
Separates what a VM IS (durable intent + identity + deterministic
derived paths — `VMRuntime`) from what is CURRENTLY TRUE about it
(firecracker PID, tap device, loop devices, dm-snapshot target — new
`VMHandles`). The durable state lives in the SQLite `vms` row; the
transient state lives in an in-memory cache on the daemon plus a
per-VM `handles.json` scratch file inside VMDir, rebuilt at startup
from OS inspection. Nothing kernel-level rides the SQLite schema
anymore.

Why:

  Persisting ephemeral process handles to SQLite forced reconcile to
  treat "running with a stale PID" as a first-class case and mix it
  with real state transitions. The schema described what we last
  observed, not what the VM is. Every time the observation model
  shifted (tap pool, DM naming, pgrep fallback) the reconcile logic
  grew a new branch. Splitting lets each layer own what it's good at:
  durable records describe intent, in-memory cache + scratch file
  describe momentary reality.

Shape:

  - `model.VMHandles` = PID, TapDevice, BaseLoop, COWLoop, DMName,
    DMDev. Never in SQLite.
  - `VMRuntime` keeps: State, GuestIP, APISockPath, VSockPath,
    VSockCID, LogPath, MetricsPath, DNSName, VMDir, SystemOverlay,
    WorkDiskPath, LastError. All durable or deterministic.
  - `handleCache` on `*Daemon` — mutex-guarded map + scratch-file
    plumbing (`writeHandlesFile` / `readHandlesFile` /
    `rediscoverHandles`). See `internal/daemon/vm_handles.go`.
  - `d.vmAlive(vm)` replaces the 20+ inline
    `vm.State==Running && ProcessRunning(vm.Runtime.PID, apiSock)`
    spreads. Single source of truth for liveness.
  - Startup reconcile: per running VM, load the scratch file, pgrep
    the api sock, either keep (cache seeded from scratch) or demote
    to stopped (scratch handles passed to cleanupRuntime first so DM
    / loops / tap actually get torn down).

Verification:

  - `go test ./...` green.
  - Live: `banger vm run --name handles-test -- cat /etc/hostname`
    starts; `handles.json` appears in VMDir with the expected PID,
    tap, loops, DM.
  - `kill -9 $(pgrep bangerd)` while the VM is running, re-invoke the
    CLI, daemon auto-starts, reconcile recognises the VM as alive,
    `banger vm ssh` still connects, `banger vm delete` cleans up.

Tests added:

  - vm_handles_test.go: scratch-file roundtrip, missing/corrupt file
    behaviour, cache concurrency, rediscoverHandles prefers pgrep
    over scratch, returns scratch contents even when process is
    dead (so cleanup can tear down kernel state).
  - vm_test.go: reconcile test rewritten to exercise the new flow
    (write scratch → reconcile reads it → verifies process is gone →
    issues dmsetup/losetup teardown).

ARCHITECTURE.md updated; `handles` added to Daemon field docs.
2026-04-19 14:18:13 -03:00
2e6e64bc04
guest sshd: drop DEBUG3 + StrictModes no; normalise /root perms
Previously /etc/ssh/sshd_config.d/99-banger.conf landed with:

  LogLevel DEBUG3
  PermitRootLogin yes
  PubkeyAuthentication yes
  AuthorizedKeysFile /root/.ssh/authorized_keys
  StrictModes no

DEBUG3 was debug leftover that floods journald in normal use.
StrictModes no was a workaround for /root perm drift on the work
disk — the real fix is to make those perms correct at provisioning
time.

New drop-in:

  PermitRootLogin prohibit-password
  PubkeyAuthentication yes
  PasswordAuthentication no
  KbdInteractiveAuthentication no
  AuthorizedKeysFile /root/.ssh/authorized_keys

prohibit-password blocks password root login even if PasswordAuth
gets flipped on elsewhere; KbdInteractiveAuth no closes the last
interactive fallback; StrictModes is now on (sshd's default).

normaliseHomeDirPerms chown/chmods /root to 0755 root:root at every
work-disk mount (ensureAuthorizedKeyOnWorkDisk,
seedAuthorizedKeyOnExt4Image); the .ssh dir also explicitly
chown'd root:root. Verified end-to-end against a real VM:
`sshd -T` reports strictmodes yes and all five directives match.

Regression test (sshd_config_test.go) pins the allow-list and the
deny-list (DEBUG3, StrictModes no, bare `PermitRootLogin yes`) so
the next accidental reintroduction fails fast.

README's Security section updated to reflect the new posture.
2026-04-19 13:40:40 -03:00
6cd52d12f4
workspace prepare: release VM mutex before guest I/O
Previously withVMLockByRef held the per-VM mutex across InspectRepo,
waitForGuestSSH, dialGuest, ImportRepoToGuest (the tar stream!), and
the readonly chmod. A large repo could block `vm stop` / `vm delete`
/ `vm restart` on the same VM for however long the import took.

Split into two phases:

  1. VM mutex held briefly to validate state (running + PID alive)
     and snapshot the fields needed for SSH (guest IP, api sock).
  2. VM mutex released. Acquire workspaceLocks[id] — a separate
     per-VM mutex scoped to workspace.prepare / workspace.export —
     for the guest I/O phase.

Lifecycle ops (stop/delete/restart/set) only take vmLocks, so they
no longer queue behind a slow import. Two concurrent prepares on the
same VM still serialise via workspaceLocks so tar streams don't
interleave. ExportVMWorkspace also acquires workspaceLocks to avoid
snapshotting a half-streamed import.

Two regression tests (sequential — they swap package-level seams):

  ReleasesVMLockDuringGuestIO: stall the import fake, assert the VM
  mutex is acquirable from another goroutine during the stall.

  SerialisesConcurrentPreparesOnSameVM: 3 concurrent prepares, assert
  Import is only ever invoked 1-at-a-time per VM.

ARCHITECTURE.md documents the split + updated lock ordering.
2026-04-19 13:32:42 -03:00
99de42385f
workspace export: stop mutating the guest repo index
Previously `banger vm workspace export` ran `git add -A` against the
guest's real `.git/index`, so the observation step left staged
changes behind that users never asked for. Reconnecting later (ssh,
another export) surfaced them and looked like phantom work.

Route `git add -A` through a throwaway index file instead:

  tmp_idx=$(mktemp ...)
  trap 'rm -f "$tmp_idx"' EXIT
  git read-tree <ref> --index-output="$tmp_idx"
  GIT_INDEX_FILE="$tmp_idx" git add -A
  GIT_INDEX_FILE="$tmp_idx" git diff --cached <ref> --binary|--name-only

The real .git/index, working tree, and refs stay exactly as the user
left them. Same diff content — commits past <ref>, uncommitted edits,
and untracked files (minus .gitignore) all captured.

Regression test locks the invariant: every export script must route
add -A through GIT_INDEX_FILE and clean the temp index on exit. CLI
help text updated to say "non-mutating".
2026-04-19 13:20:56 -03:00
21b74639d8
vm defaults: host-aware sizing + spec line on spawn + doctor check
Replaces the static model.Default* constants that drove --vcpu / --memory
/ --disk-size with a three-layer resolver:

  1. [vm_defaults] in ~/.config/banger/config.toml (if set)
  2. host-derived heuristics (cpus/4 capped at 4; ram/8 capped at 8 GiB)
  3. baked-in constants (floor)

Visibility:

- Every `vm run` / `vm create` prints a `spec:` line before progress
  begins: `spec: 4 vcpu · 8192 MiB · 8G disk`. Matches the VM that
  actually gets created because the CLI is now the single source of
  truth — it resolves, populates the flag defaults, and forwards the
  explicit values to the daemon.
- `banger doctor` adds a "vm defaults" check showing per-field
  provenance (config|auto|builtin) and the config file path for
  overrides.
- `--help` shows the resolved defaults (e.g. `--vcpu int (default 4)`
  on an 8-core host).

No `banger config init` command, no first-run side effects, no writes
to the user's filesystem behind their back. Users who want explicit
control set the keys; everyone else gets sensible numbers that track
their hardware.
2026-04-19 13:06:51 -03:00
78ff482bfa
release prep: opt-in web UI, make uninstall, fix stale kernel-catalog docs
- WebListenAddr default is now "" (empty). The experimental web UI was
  running on 127.0.0.1:7777 by default, which surprises users who never
  opted in. Users who want it set `web_listen_addr = "127.0.0.1:7777"`
  in config.toml.
- `make uninstall` stops the daemon (if any) and removes the installed
  binaries. Preserves user data on disk but prints the paths so `rm -rf`
  can follow for a full purge. Documented in README next to install.
- docs/kernel-catalog.md: replace the `void-6.12` and `alpine-3.23`
  examples (never published) with `generic-6.12` (the only cataloged
  kernel today). Updates the versioning-convention example too.
2026-04-19 12:43:58 -03:00
221fb03d68
cli QoL: vm prune, list→ls aliases, delete→rm aliases
- `banger vm prune` sweeps every non-running VM (stopped, created,
  error) with an interactive confirmation; -f/--force skips the prompt.
  Partial failures report which VM failed and exit non-zero.
- list commands gain `ls` alias: vm list already had it; added to image
  list, kernel list, and vm session list.
- delete commands gain `rm` alias: vm delete and image delete. kernel
  rm already aliased delete/remove.

Uses new test seams (vmListFunc) plus the existing vmDeleteFunc so
prune unit-tests without touching the daemon socket.
2026-04-19 12:17:46 -03:00
e3eaa0c797
cli: shell completion via cobra + dynamic resource name lookups
Re-enable cobra's default `completion` subcommand (`banger completion
bash|zsh|fish|powershell`). Plus live resource-name suggestions that
hit the running daemon via the same RPC the real commands use:

  vm start/stop/restart/delete/kill/set         → completeVMNames (variadic)
  vm ssh/show/logs/stats/ports/...              → completeVMNameOnlyAtPos0
  vm session list/start                         → completeVMNameOnlyAtPos0
  vm session show/logs/stop/kill/attach/send    → completeSessionNames (vm + session)
  image show/delete/promote                     → completeImageNameOnlyAtPos0
  kernel show/rm                                → completeKernelNameOnlyAtPos0
  vm run/create --image, image pull/register --kernel-ref → flag-value completion

Design notes in internal/cli/completion.go: completers never auto-start
the daemon (ping-check, bail with NoFileComp on miss), so tab-completion
stays a zero-cost probe. Variadic completers exclude already-entered
args to avoid duplicate suggestions.

README: install recipes for bash / zsh / fish.
2026-04-19 12:12:40 -03:00
346eaba673
coverage: medium batch — hostnat runner, store guest-sessions, daemon helpers
Reuses existing fixtures (CommandRunner fakes, SQLite tempfile store,
pure-Go seams). No new infra needed.

  hostnat                  50% -> 98%   (iptables orchestration via fake runner)
  store                    78% -> 91%   (guest_sessions CRUD roundtrip)
  daemon/session           57% -> 95%   (script gen, state parse, snapshot apply)
  daemon/opstate           67% -> 100%  (Registry Insert/Get/Prune)
  daemon (firstNonEmpty)   slight bump

Total 54.0% -> 56.5%.
2026-04-18 18:03:37 -03:00