Commit graph

238 commits

Author SHA1 Message Date
362009d747
daemon split (1/5): extract *HostNetwork service
First phase of splitting the daemon god-struct into focused services
with explicit ownership.

HostNetwork now owns everything host-networking: the TAP interface
pool (initializeTapPool / ensureTapPool / acquireTap / releaseTap /
createTap), bridge + socket dir setup, firecracker process primitives
(find/resolve/kill/wait/ensureSocketAccess/sendCtrlAltDel), DM
snapshot lifecycle, NAT rule enforcement, guest DNS server lifecycle
+ routing setup, and the vsock-agent readiness probe. That's 7 files
whose receivers flipped from *Daemon to *HostNetwork, plus a new
host_network.go that declares the struct, its hostNetworkDeps, and
the factored firecracker + DNS helpers that used to live in vm.go.

Daemon gives up the tapPool and vmDNS fields entirely; they're now
HostNetwork's business. Construction goes through newHostNetwork in
Daemon.Open with an explicit dependency bag (runner, logger, config,
layout, closing). A lazy-init hostNet() helper on Daemon supports
test literals that don't wire net explicitly — production always
populates it eagerly.

Signature tightenings where the old receiver reached into VM-service
state:
 - ensureNAT(ctx, vm, enable) → ensureNAT(ctx, guestIP, tap, enable).
   Callers resolve tap from the handle cache themselves.
 - initializeTapPool(ctx) → initializeTapPool(usedTaps []string).
   Daemon.Open enumerates VMs, collects taps from handles, hands the
   slice in.

rebuildDNS stays on *Daemon as the orchestrator — it filters by
vm-alive (a VMService concern handles will move to in phase 4) then
calls HostNetwork.replaceDNS with the already-filtered map.

Capability hooks continue to take *Daemon; they now use it as a
facade to reach services (d.net.ensureNAT, d.hostNet().*). Planned
CapabilityHost interface extraction is orthogonal, left for later.

Tests: dns_routing_test.go + fastpath_test.go + nat_test.go +
snapshot_test.go + open_close_test.go were touched to construct
HostNetwork literals where they exercise its methods directly, or
route through d.hostNet() where they exercise the Daemon entry
points.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 20:11:46 -03:00
eba9a553bf
daemon: use exact-name lookup for VM-create uniqueness
reserveVM's duplicate-name guard routed through Daemon.FindVM, which
falls back to prefix-matching on both ids and names when no exact
match is found. That turns the uniqueness check into a correctness
bug: a brand-new VM name can be rejected because it happens to
prefix an existing VM's id, or an existing VM's name. So `vm create
--name beta` fails when `beta-sandbox` already exists.

Swap in a dedicated store.GetVMByName that does a literal `WHERE
name = ?` lookup, and use it from reserveVM. FindVM keeps its
prefix-matching behaviour for user-facing lookup paths (`vm ssh
<partial>`, `vm stop <partial>`) where "did you mean" semantics
are the feature.

Tests:
 - TestReserveVMAllowsNameThatPrefixesExistingVM — seeds a VM whose
   id + name both start with "longname", then reserves two new VMs
   named "longname" and "longname-sandbox". Both must succeed.
   Under the old FindVM-based check, both would fail.
 - TestReserveVMRejectsExactDuplicateName — actual collisions are
   still rejected after the swap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 14:00:33 -03:00
108f7a0600
ssh-config: make the ssh <name>.vm shortcut opt-in
Before this change, every daemon.Open() wrote a Host *.vm stanza into
~/.ssh/config in a marker-fenced block. That's a real footgun for users
who manage their SSH config declaratively (chezmoi, dotfiles, NixOS):
banger was mutating host state outside its own directory on every
daemon start, easy to miss and hard to audit.

New contract: the daemon only ever writes its own ssh_config file at
~/.config/banger/ssh_config. ~/.ssh/config is untouched unless the user
opts in. `banger vm ssh <name>` still works out of the box — the
shortcut only matters for plain `ssh sandbox.vm` from any terminal.

The opt-in surface is `banger ssh-config`:

  banger ssh-config              # prints path + include-line +
                                 # install/uninstall hints
  banger ssh-config --install    # adds `Include <bangerConfig>` to
                                 # ~/.ssh/config inside a marker-fenced
                                 # block; idempotent; migrates any
                                 # legacy inline Host *.vm block from
                                 # pre-opt-in builds
  banger ssh-config --uninstall  # removes the new Include block AND
                                 # any legacy inline block

Doctor gains a gentle warn-level note when banger's ssh_config exists
but the user hasn't wired it in — not a fail, since the shortcut is
convenience and `banger vm ssh` covers the essential case.

Tests cover: daemon writes banger file and does NOT touch ~/.ssh/config,
Install adds the block, Install is idempotent, Install migrates the
legacy inline block cleanly (removing it, preserving unrelated
entries, adding the new Include block), Uninstall removes both marker
variants, Uninstall is a no-op when ~/.ssh/config is absent, and
UserSSHIncludeInstalled detects both marker shapes.

README reframes the feature as optional convenience.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:57:26 -03:00
99d0811097
daemon: shrink createVMMu + imageOpsMu to reservation/publication windows
Before: createVMMu was held across the whole of CreateVM — including
image resolution (which could fire a full auto-pull) and startVMLocked
(boot of multiple seconds). imageOpsMu was held across the whole of
PullImage/RegisterImage/PromoteImage/DeleteImage, so any slow OCI pull,
bundle download, or file copy blocked every other image mutation and
every other VM create that needed to auto-pull. The async create API
bought nothing if all creates serialised on the same mutex.

CreateVM is now three phases:

 1. Validate + resolve image (possibly auto-pulling). No global lock.
 2. reserveVM: take createVMMu only long enough to re-check the name
    is free, allocate the next guest IP, and UpsertVM the "created"
    row. Milliseconds.
 3. startVMLocked: run the full boot flow under the per-VM lock only.

Parallel creates of different VMs now overlap on image resolution +
boot; they contend only across the reservation claim.

For the image surface a new publishImage helper isolates the commit
atom (recheck name free, atomic rename stagingDir→finalDir, UpsertImage)
under imageOpsMu. pullFromBundle + pullFromOCI do their network fetch
+ ext4 build + ownership fixup + agent injection outside the lock;
Register moves validation + kernel resolution outside; Promote moves
file copy + SSH-key seeding outside; Delete keeps a brief lock over
the lookup + reference check + store delete and does file cleanup
unlocked.

Two concurrency tests assert the new behaviour:
 - TestPullImageDoesNotSerialiseOnDifferentNames fails the old code
   (second pull blocks on imageOpsMu and never reaches the body).
 - TestPullImageRejectsNameClashAtPublish confirms the publish-window
   recheck is what enforces name uniqueness now that the body runs
   unlocked — exactly one winner.

ARCHITECTURE.md updated to describe the new scope explicitly instead
of calling the locks "narrow".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:44:22 -03:00
58464ac28c
docs + doctor: be honest about amd64-only support
The README sold the product as "Linux with /dev/kvm"; the deeper docs
admit that the Makefile pins companion builds to GOARCH=amd64, the
kernel catalog ships only x86_64 entries, and OCI import pulls
linux/amd64 layers. arm64 users who show up through the README only
discover that after install fails in non-obvious ways.

Two surface-level fixes:

- README requirements list leads with "x86_64 / amd64 Linux — arm64 is
  not supported today", with a short note on the three places that
  assumption lives so users understand it's not a last-mile gap.
- `banger doctor` now runs an architecture check that passes on amd64
  and FAILS (not warns) on anything else, referencing the three
  downstream assumptions. Hard-fail rather than warn so a user on an
  arm64 machine doesn't waste time chasing unrelated preflight items.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:03:50 -03:00
e69810610a
daemon: correct ARCHITECTURE doc to match actual package shape + lock scope
Two promises the doc was making that the code doesn't keep:

1. "Helpers moved out so the package stays focused on orchestration."
   The package still has ~29 files and ~130 func (d *Daemon) methods
   wiring VM lifecycle, image management, host networking, background
   reconciliation, and JSON-RPC dispatch. Calling it "just orchestration"
   sets readers up for surprise. Rewrite the subpackages preamble to
   say so, and flag the service split as a post-v0.1.0 project.

2. "vmLocks[id] is held only across short synchronous state validation
   and DB mutations." That's what workspace.prepare does; regular
   lifecycle ops (start/stop/delete/set) go through withVMLockByRef
   and hold the lock across the whole callback body, which for `start`
   means preflight + bridge + firecracker spawn + post-boot wiring.
   Rewrite the vmLocks bullet and the lock-ordering section to say
   that explicitly, so readers don't build "surely my long flow under
   the lock can't be what the doc means" reasoning on top of a false
   premise.

Doc-only change. Code behaviour is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:02:36 -03:00
34dd7644d8
store: introduce versioned migrations with ordered runner + atomic apply
The old migrate() helper only knew how to re-run a fixed slab of CREATE
TABLE IF NOT EXISTS plus per-column ensureColumnExists calls. That worked
while every schema change was a benign additive column; it falls apart
as soon as we need a data backfill, an index, a rename, or anything that
has to happen exactly once in a known order.

Replaces it with a schema_migrations table + ordered []migration slice.
Each migration has a unique id, a human-readable name, and a func(*Tx)
body; the runner opens a transaction per migration so DDL and any data
changes either both land and get recorded or both roll back together,
leaving the DB in a state where retrying on next Open() reapplies from
the same point.

Migration 1 ("baseline") collapses the current schema into one entry:
fresh databases apply it in one shot; existing dev databases see
idempotent `CREATE TABLE IF NOT EXISTS` + `ALTER TABLE … ADD COLUMN`
statements that succeed as no-ops, and the only net effect is the
schema_migrations row that brings them into the versioned system.

Tests cover fresh apply, idempotent re-open, skipping already-applied
ids, rollback on body error (the transient table the migration created
must not survive), and duplicate-id rejection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 12:59:42 -03:00
b930c51990
runtime sockets: close the local-user race window around control-plane creation
Previously the daemon socket, per-VM firecracker API socket, and vsock
socket were transiently world-exposed on hosts without XDG_RUNTIME_DIR:
the runtime directory landed in /tmp at 0755, Firecracker ran with
umask 000 (mode 0666 sockets), and only a follow-up chown/chmod in
EnsureSocketAccess tightened them. A local attacker could race into
bangerd.sock or the firecracker API socket during that window.

Three changes:

- internal/paths/paths.go: RuntimeDir is now created (and re-chmod'd if
  stale) at 0700 unconditionally. When XDG_RUNTIME_DIR is unset and we
  fall back to /tmp/banger-runtime-<uid>, Ensure() now verifies the
  parent dir is owned by the current uid and 0700 mode — refusing to
  place sockets inside a directory someone else created. Symlink swaps
  rejected via Lstat.

- internal/firecracker/client.go: launch firecracker with umask 077
  instead of umask 000 so the API socket is mode 0600 from birth. The
  chown in fcproc.EnsureSocketAccess still transfers ownership from
  root to the invoking user afterwards.

- internal/daemon/fcproc/fcproc.go: EnsureSocketDir now creates (and
  re-chmod's) the runtime socket directory at 0700.

Tests cover the tightening path — an existing 0755 RuntimeDir is
re-chmod'd on Ensure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 12:53:47 -03:00
2b6437d1b4
remove vm session feature
Cuts the daemon-managed guest-session machinery (start/list/show/
logs/stop/kill/attach/send). The feature shipped aimed at agent-
orchestration workflows (programmatic stdin piping into a long-lived
guest process) that aren't driving any concrete user today, and the
~2.3K LOC of daemon surface area — attach bridge, FIFO keepalive,
controller registry, sessionstream framing, SQLite persistence — was
locking in an API we'd have to keep through v0.1.0.

Anything session-flavoured that people actually need today can be
done with `vm ssh + tmux` or `vm run -- cmd`.

Deleted:
- internal/cli/commands_vm_session.go
- internal/daemon/{guest_sessions,session_lifecycle,session_attach,session_stream,session_controller}.go
- internal/daemon/session/ (guest-session helpers package)
- internal/sessionstream/ (framing package)
- internal/daemon/guest_sessions_test.go
- internal/store/guest_session_test.go
- GuestSession* types from internal/{api,model}
- Store UpsertGuestSession/GetGuestSession/ListGuestSessionsByVM/DeleteGuestSession + scanner helpers
- guest.session.* RPC dispatch entries
- 5 CLI session tests, 2 completion tests, 2 printer tests

Extracted:
- ShellQuote + FormatStepError lifted to internal/daemon/workspace/util.go
  (only non-session consumer); workspace package now self-contained
- internal/daemon/guest_ssh.go keeps guestSSHClient + dialGuest +
  waitForGuestSSH — still used by workspace prepare/export
- internal/daemon/fake_firecracker_test.go preserves the test helper
  that used to live in guest_sessions_test.go

Store schema: CREATE TABLE guest_sessions and its column migrations
removed. Existing dev DBs keep an orphan table (harmless, pre-v0.1.0).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 12:47:58 -03:00
c42fcbe012
cli + daemon: move test seams off package globals onto injected structs
CLI: introduce internal/cli.deps which owns every RPC/SSH/host-command
seam the tree used to reach through mutable package vars. Command
builders, orchestrators, and the completion helpers become methods on
*deps. Tests construct their own deps per case, so fakes no longer leak
across cases and tests are free to run in parallel.

Daemon: move workspaceInspectRepoFunc + workspaceImportFunc onto the
Daemon struct (workspaceInspectRepo / workspaceImport), mirroring the
existing guestWaitForSSH / guestDial pattern. Workspace-prepare tests
drop t.Parallel() guards now that they no longer mutate process-wide
state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 19:03:55 -03:00
d38f580e00
doctor: surface state store open failure as failing check
Previously store.Open errors were silently swallowed, so `banger
doctor` could report green while the default-image check (and any
other store-dependent diagnostic) was silently skipped because
d.store was nil.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:35:27 -03:00
3f6ecb4376
cli: split banger.go god file into focused files
Pure code motion — banger.go 3508→240 LOC, same-package
decomposition keeps all identifiers visible without export changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-19 17:34:32 -03:00
3a5f4cd40d
cli: delete vm run's dead import path + duplicated git inspection
The CLI carried a full second copy of the workspace import
implementation that `vm run` never actually used:

  - importVMRunRepoToGuest (no callers — the live flow calls the
    daemon's PrepareVMWorkspace RPC instead)
  - prepareVMRunRepoCopy, vmRunCheckoutCommit, vmRunCheckoutScript,
    gitFileURL, runHostCommand (all reachable only from the dead
    importVMRunRepoToGuest)

Plus a duplicated repo-inspection surface that shadowed the
daemon's:

  - inspectVMRunRepo ran every git query the daemon re-ran during
    workspace.prepare (HEAD, branch, identity, origin, overlay list)
  - gitOutput / gitTrimmedOutput / gitResolvedConfigValue /
    parseNullSeparatedOutput / listSubmodules / listOverlayPaths /
    resolveVMRunSourcePath — all identical to the exported
    workspace.* versions
  - vmRunRepoSpec — same fields as workspace.RepoSpec

Replaced with a single minimal preflight:

  func vmRunPreflightRepo(ctx, rawPath) (absPath, err error)

The preflight only checks what the user can fix locally before
banger creates a VM (path exists, sits in a non-bare git repo, no
submodules). The daemon's workspace.prepare RPC does the full
inspection — and returns RepoRoot + RepoName in the response, which
the CLI now threads into the tooling harness instead of computing
them a second time.

Signature changes:

  runVMRun(ctx, ..., *vmRunRepo, ...)   // was: *vmRunRepoSpec
  startVMRunToolingHarness(ctx, client, repoRoot, repoName, progress)
                                        // was: (ctx, client, spec, progress)
  vmRunToolingHarnessScript(plan)       // was: (spec, plan)
  vmRunToolingHarnessLaunchScript(repoName)  // was: (spec)

Tests: the CLI-side git-inspection tests are replaced by a single
TestVMRunPreflightRejectsSubmodules that exercises the preflight.
Everything else (tooling harness script, progress renderer, SSH args,
runVMRun flows) keeps working. The shallow-copy / checkout-script
tests are gone — that code now lives only in
internal/daemon/workspace and is tested there.

Also fixed a latent bug the refactor exposed: vm run's --from flag
defaults to "HEAD", which the daemon reads as "from without branch"
and rejects. CLI now scrubs fromRef when branchName is empty.

Live verified: `banger vm run --name X . -- cmd` boots, workspace
materialises at /root/repo with matching HEAD, exit code propagates.
2026-04-19 17:01:26 -03:00
ae14b9499d
ssh: trust-on-first-use host key pinning everywhere
Guest host-key verification was off in all three SSH paths:

  * Go SSH (internal/guest/ssh.go) used ssh.InsecureIgnoreHostKey
  * `banger vm ssh` passed StrictHostKeyChecking=no
    + UserKnownHostsFile=/dev/null
  * `~/.ssh/config` Host *.vm shipped the same posture into the
    user's global config

Now each path verifies against a banger-owned known_hosts file at
`~/.local/state/banger/ssh/known_hosts` with TOFU semantics:

  * First dial to a VM pins the key.
  * Subsequent dials require an exact match. A mismatch fails with
    an explicit "possible MITM" error.
  * `vm delete` removes the entries so a future VM reusing the IP
    or name re-pins cleanly.
  * The user's `~/.ssh/known_hosts` is untouched.

Changes:

  internal/guest/known_hosts.go (new) — OpenSSH-compatible parser,
    TOFUHostKeyCallback, RemoveKnownHosts. Process-wide mutex
    around the file.
  internal/guest/ssh.go — Dial and WaitForSSH grew a knownHostsPath
    parameter threaded through the callback. Empty path keeps the
    insecure callback (tests + throwaway tools only; documented).
  internal/daemon/{guest_sessions,session_attach,session_lifecycle,
    session_stream}.go — call sites pass d.layout.KnownHostsPath.
  internal/daemon/ssh_client_config.go — the ~/.ssh/config Host *.vm
    block now points at banger's known_hosts and uses
    StrictHostKeyChecking=accept-new. Missing path → fail closed.
  internal/daemon/vm_lifecycle.go — deleteVMLocked drops known_hosts
    entries for the VM's IP and DNS name via removeVMKnownHosts.
  internal/cli/banger.go — sshCommandArgs swaps StrictHostKeyChecking
    no + /dev/null for banger's file + accept-new. Path resolution
    failure falls through to StrictHostKeyChecking=yes.
  internal/paths/paths.go — Layout gains SSHDir + KnownHostsPath;
    Ensure creates SSHDir at 0700.

Tests (internal/guest/known_hosts_test.go): pin on first use, accept
matching key on second dial, reject mismatch, empty path skips
checking, RemoveKnownHosts drops the entry, re-pin works after
remove. Existing daemon + cli tests updated to assert the new
posture and regression-guard against the old flags.

Live verified: vm run writes the pin to banger's known_hosts at 0600
inside a 0700 dir; banger vm ssh + ssh root@<vm>.vm both succeed
using the pin; vm delete clears it.
2026-04-19 16:46:03 -03:00
a59958d4f5
daemon: roll back host state on any Open() failure
Open() touched several pieces of host state before hitting the step
that returned the error:

  * SQLite handle (store.Open)
  * managed SSH client config block (ensureVMSSHClientConfig)
  * vm-DNS UDP listener goroutine (startVMDNS)
  * systemd-resolved per-interface routing (ensureVMDNSResolverRouting)

The only deferred cleanup guarded stopVMDNS. A reconcile() or
initializeTapPool() failure therefore left the listener running, the
resolver wiring in place, and the SQLite handle open. A subsequent
startup attempt ran into "port 42069 already in use" or silently
published stale state.

Fix: once `d` exists, defer `d.Close()` on `err != nil`. Close is
idempotent (sync.Once) and every teardown step (listener close, DNS
listener close, resolver revert, session registry close, store close)
is nil-guarded, so calling it on a daemon that never got past the
first startup step is safe.

Tests (internal/daemon/open_close_test.go):

  - TestCloseOnPartiallyInitialisedDaemon: Close survives a daemon
    with only store + closing channel, and with a vmDNS listener but
    nothing else. Catches regressions where a teardown step forgets
    to nil-check.
  - TestCloseIdempotentUnderConcurrency: 5 goroutines racing on
    Close() never panic (sync.Once + close(d.closing) survive).
  - TestOpenFailureRunsCloseCleanup: structural check that the
    `defer cleanup() if err != nil` pattern actually fires.

Live: `banger daemon stop` cleanly, `banger vm ls` restarts daemon
without a residual listener on port 42069.
2026-04-19 16:36:29 -03:00
d1b9a8c102
remove experimental web UI
The web UI shipped as "experimental" and was never finished — no nav
off the dashboard, no live updates, no settled design, never a
supported surface. It was opt-in by default already; leaving the code
in the tree for v0.1.0 only invited "does this work?" questions and
kept HostSummary/BangerSummary/SudoStatus types on the public RPC
surface that nothing else uses.

Removed:

  internal/webui/                         (all Go + templates + assets)
  internal/daemon/web.go                  (server start / Layout / Config / ListVMs / ListImages)
  internal/daemon/dashboard.go            (DashboardSummary aggregator)

Simplified:

  internal/api/types.go                   drop WebURL on PingResult, drop
                                          HostSummary / SudoStatus / BangerSummary /
                                          DashboardSummary / DashboardSummaryResult
  internal/model/types.go                 drop DaemonConfig.WebListenAddr
  internal/config/config.go               drop web_listen_addr from fileConfig + Load
  internal/daemon/daemon.go               drop webListener / webServer / webURL fields +
                                          startWebServer() call + ping WebURL population
  internal/cli/banger.go                  `daemon status` output no longer branches on web
  internal/daemon/{doc.go,ARCHITECTURE.md} drop web UI sections
  README.md                               drop web_listen_addr config bullet + security paragraph

Tests updated to reflect the new shape. Coverage 57.3 -> 58.9% (the
webui package was largely untested; its removal lifts the ratio
without moving the numerator). `banger daemon status` output and
--help are web-free. Lint + full suite green.
2026-04-19 14:28:08 -03:00
687fcf0b59
vm state: split transient kernel/process handles off the durable schema
Separates what a VM IS (durable intent + identity + deterministic
derived paths — `VMRuntime`) from what is CURRENTLY TRUE about it
(firecracker PID, tap device, loop devices, dm-snapshot target — new
`VMHandles`). The durable state lives in the SQLite `vms` row; the
transient state lives in an in-memory cache on the daemon plus a
per-VM `handles.json` scratch file inside VMDir, rebuilt at startup
from OS inspection. Nothing kernel-level rides the SQLite schema
anymore.

Why:

  Persisting ephemeral process handles to SQLite forced reconcile to
  treat "running with a stale PID" as a first-class case and mix it
  with real state transitions. The schema described what we last
  observed, not what the VM is. Every time the observation model
  shifted (tap pool, DM naming, pgrep fallback) the reconcile logic
  grew a new branch. Splitting lets each layer own what it's good at:
  durable records describe intent, in-memory cache + scratch file
  describe momentary reality.

Shape:

  - `model.VMHandles` = PID, TapDevice, BaseLoop, COWLoop, DMName,
    DMDev. Never in SQLite.
  - `VMRuntime` keeps: State, GuestIP, APISockPath, VSockPath,
    VSockCID, LogPath, MetricsPath, DNSName, VMDir, SystemOverlay,
    WorkDiskPath, LastError. All durable or deterministic.
  - `handleCache` on `*Daemon` — mutex-guarded map + scratch-file
    plumbing (`writeHandlesFile` / `readHandlesFile` /
    `rediscoverHandles`). See `internal/daemon/vm_handles.go`.
  - `d.vmAlive(vm)` replaces the 20+ inline
    `vm.State==Running && ProcessRunning(vm.Runtime.PID, apiSock)`
    spreads. Single source of truth for liveness.
  - Startup reconcile: per running VM, load the scratch file, pgrep
    the api sock, either keep (cache seeded from scratch) or demote
    to stopped (scratch handles passed to cleanupRuntime first so DM
    / loops / tap actually get torn down).

Verification:

  - `go test ./...` green.
  - Live: `banger vm run --name handles-test -- cat /etc/hostname`
    starts; `handles.json` appears in VMDir with the expected PID,
    tap, loops, DM.
  - `kill -9 $(pgrep bangerd)` while the VM is running, re-invoke the
    CLI, daemon auto-starts, reconcile recognises the VM as alive,
    `banger vm ssh` still connects, `banger vm delete` cleans up.

Tests added:

  - vm_handles_test.go: scratch-file roundtrip, missing/corrupt file
    behaviour, cache concurrency, rediscoverHandles prefers pgrep
    over scratch, returns scratch contents even when process is
    dead (so cleanup can tear down kernel state).
  - vm_test.go: reconcile test rewritten to exercise the new flow
    (write scratch → reconcile reads it → verifies process is gone →
    issues dmsetup/losetup teardown).

ARCHITECTURE.md updated; `handles` added to Daemon field docs.
2026-04-19 14:18:13 -03:00
2e6e64bc04
guest sshd: drop DEBUG3 + StrictModes no; normalise /root perms
Previously /etc/ssh/sshd_config.d/99-banger.conf landed with:

  LogLevel DEBUG3
  PermitRootLogin yes
  PubkeyAuthentication yes
  AuthorizedKeysFile /root/.ssh/authorized_keys
  StrictModes no

DEBUG3 was debug leftover that floods journald in normal use.
StrictModes no was a workaround for /root perm drift on the work
disk — the real fix is to make those perms correct at provisioning
time.

New drop-in:

  PermitRootLogin prohibit-password
  PubkeyAuthentication yes
  PasswordAuthentication no
  KbdInteractiveAuthentication no
  AuthorizedKeysFile /root/.ssh/authorized_keys

prohibit-password blocks password root login even if PasswordAuth
gets flipped on elsewhere; KbdInteractiveAuth no closes the last
interactive fallback; StrictModes is now on (sshd's default).

normaliseHomeDirPerms chown/chmods /root to 0755 root:root at every
work-disk mount (ensureAuthorizedKeyOnWorkDisk,
seedAuthorizedKeyOnExt4Image); the .ssh dir also explicitly
chown'd root:root. Verified end-to-end against a real VM:
`sshd -T` reports strictmodes yes and all five directives match.

Regression test (sshd_config_test.go) pins the allow-list and the
deny-list (DEBUG3, StrictModes no, bare `PermitRootLogin yes`) so
the next accidental reintroduction fails fast.

README's Security section updated to reflect the new posture.
2026-04-19 13:40:40 -03:00
6cd52d12f4
workspace prepare: release VM mutex before guest I/O
Previously withVMLockByRef held the per-VM mutex across InspectRepo,
waitForGuestSSH, dialGuest, ImportRepoToGuest (the tar stream!), and
the readonly chmod. A large repo could block `vm stop` / `vm delete`
/ `vm restart` on the same VM for however long the import took.

Split into two phases:

  1. VM mutex held briefly to validate state (running + PID alive)
     and snapshot the fields needed for SSH (guest IP, api sock).
  2. VM mutex released. Acquire workspaceLocks[id] — a separate
     per-VM mutex scoped to workspace.prepare / workspace.export —
     for the guest I/O phase.

Lifecycle ops (stop/delete/restart/set) only take vmLocks, so they
no longer queue behind a slow import. Two concurrent prepares on the
same VM still serialise via workspaceLocks so tar streams don't
interleave. ExportVMWorkspace also acquires workspaceLocks to avoid
snapshotting a half-streamed import.

Two regression tests (sequential — they swap package-level seams):

  ReleasesVMLockDuringGuestIO: stall the import fake, assert the VM
  mutex is acquirable from another goroutine during the stall.

  SerialisesConcurrentPreparesOnSameVM: 3 concurrent prepares, assert
  Import is only ever invoked 1-at-a-time per VM.

ARCHITECTURE.md documents the split + updated lock ordering.
2026-04-19 13:32:42 -03:00
99de42385f
workspace export: stop mutating the guest repo index
Previously `banger vm workspace export` ran `git add -A` against the
guest's real `.git/index`, so the observation step left staged
changes behind that users never asked for. Reconnecting later (ssh,
another export) surfaced them and looked like phantom work.

Route `git add -A` through a throwaway index file instead:

  tmp_idx=$(mktemp ...)
  trap 'rm -f "$tmp_idx"' EXIT
  git read-tree <ref> --index-output="$tmp_idx"
  GIT_INDEX_FILE="$tmp_idx" git add -A
  GIT_INDEX_FILE="$tmp_idx" git diff --cached <ref> --binary|--name-only

The real .git/index, working tree, and refs stay exactly as the user
left them. Same diff content — commits past <ref>, uncommitted edits,
and untracked files (minus .gitignore) all captured.

Regression test locks the invariant: every export script must route
add -A through GIT_INDEX_FILE and clean the temp index on exit. CLI
help text updated to say "non-mutating".
2026-04-19 13:20:56 -03:00
21b74639d8
vm defaults: host-aware sizing + spec line on spawn + doctor check
Replaces the static model.Default* constants that drove --vcpu / --memory
/ --disk-size with a three-layer resolver:

  1. [vm_defaults] in ~/.config/banger/config.toml (if set)
  2. host-derived heuristics (cpus/4 capped at 4; ram/8 capped at 8 GiB)
  3. baked-in constants (floor)

Visibility:

- Every `vm run` / `vm create` prints a `spec:` line before progress
  begins: `spec: 4 vcpu · 8192 MiB · 8G disk`. Matches the VM that
  actually gets created because the CLI is now the single source of
  truth — it resolves, populates the flag defaults, and forwards the
  explicit values to the daemon.
- `banger doctor` adds a "vm defaults" check showing per-field
  provenance (config|auto|builtin) and the config file path for
  overrides.
- `--help` shows the resolved defaults (e.g. `--vcpu int (default 4)`
  on an 8-core host).

No `banger config init` command, no first-run side effects, no writes
to the user's filesystem behind their back. Users who want explicit
control set the keys; everyone else gets sensible numbers that track
their hardware.
2026-04-19 13:06:51 -03:00
78ff482bfa
release prep: opt-in web UI, make uninstall, fix stale kernel-catalog docs
- WebListenAddr default is now "" (empty). The experimental web UI was
  running on 127.0.0.1:7777 by default, which surprises users who never
  opted in. Users who want it set `web_listen_addr = "127.0.0.1:7777"`
  in config.toml.
- `make uninstall` stops the daemon (if any) and removes the installed
  binaries. Preserves user data on disk but prints the paths so `rm -rf`
  can follow for a full purge. Documented in README next to install.
- docs/kernel-catalog.md: replace the `void-6.12` and `alpine-3.23`
  examples (never published) with `generic-6.12` (the only cataloged
  kernel today). Updates the versioning-convention example too.
2026-04-19 12:43:58 -03:00
221fb03d68
cli QoL: vm prune, list→ls aliases, delete→rm aliases
- `banger vm prune` sweeps every non-running VM (stopped, created,
  error) with an interactive confirmation; -f/--force skips the prompt.
  Partial failures report which VM failed and exit non-zero.
- list commands gain `ls` alias: vm list already had it; added to image
  list, kernel list, and vm session list.
- delete commands gain `rm` alias: vm delete and image delete. kernel
  rm already aliased delete/remove.

Uses new test seams (vmListFunc) plus the existing vmDeleteFunc so
prune unit-tests without touching the daemon socket.
2026-04-19 12:17:46 -03:00
e3eaa0c797
cli: shell completion via cobra + dynamic resource name lookups
Re-enable cobra's default `completion` subcommand (`banger completion
bash|zsh|fish|powershell`). Plus live resource-name suggestions that
hit the running daemon via the same RPC the real commands use:

  vm start/stop/restart/delete/kill/set         → completeVMNames (variadic)
  vm ssh/show/logs/stats/ports/...              → completeVMNameOnlyAtPos0
  vm session list/start                         → completeVMNameOnlyAtPos0
  vm session show/logs/stop/kill/attach/send    → completeSessionNames (vm + session)
  image show/delete/promote                     → completeImageNameOnlyAtPos0
  kernel show/rm                                → completeKernelNameOnlyAtPos0
  vm run/create --image, image pull/register --kernel-ref → flag-value completion

Design notes in internal/cli/completion.go: completers never auto-start
the daemon (ping-check, bail with NoFileComp on miss), so tab-completion
stays a zero-cost probe. Variadic completers exclude already-entered
args to avoid duplicate suggestions.

README: install recipes for bash / zsh / fish.
2026-04-19 12:12:40 -03:00
346eaba673
coverage: medium batch — hostnat runner, store guest-sessions, daemon helpers
Reuses existing fixtures (CommandRunner fakes, SQLite tempfile store,
pure-Go seams). No new infra needed.

  hostnat                  50% -> 98%   (iptables orchestration via fake runner)
  store                    78% -> 91%   (guest_sessions CRUD roundtrip)
  daemon/session           57% -> 95%   (script gen, state parse, snapshot apply)
  daemon/opstate           67% -> 100%  (Registry Insert/Get/Prune)
  daemon (firstNonEmpty)   slight bump

Total 54.0% -> 56.5%.
2026-04-18 18:03:37 -03:00
f8979de58a
coverage: easy-wins batch across cli, system, paths, vmdns, toolingplan
Pure-Go tests for formatters, layout resolution, and validators — no
fixtures, no external processes. Targets previously-zero functions the
triage scan flagged as low-hanging fruit.

  cli          55% -> 65%
  paths        64% -> 91%
  system       65% -> 75%
  vmdns        72% -> 86%
  toolingplan  73% -> 78%

Total 52.6% -> 54.0%.
2026-04-18 17:57:05 -03:00
a3cc296523
guest: tests for fingerprint, shellQuote, tar-entries edge cases, nil receivers
Pure-Go additions (no SSH server fixture): AuthorizedPublicKeyFingerprint,
shellQuote escaping, writeTarEntriesArchive error paths (.., ., missing,
duplicates, blank entries) and symlink handling, StreamSession/Client
nil-receiver safety, WaitForSSH context cancellation.

internal/guest coverage 17.8% -> 47.6%. Total 52.1% -> 52.6%. The
remaining uncovered paths need a real in-process SSH server; skip.
2026-04-18 17:47:24 -03:00
18bf89eae9
coverage: make targets + close zero-cov gaps (namegen, sessionstream)
Adds `make coverage` (per-package + total via -coverpkg=./...),
`make coverage-html`, and `make coverage-total` (CI-friendly). Wires
coverage.out/coverage.html through `make clean` and .gitignore.

Closes the two easy zero-coverage packages: namegen (77.8%) and
sessionstream (93.5%). Total statement coverage 51.7% -> 52.1%.
2026-04-18 17:44:37 -03:00
2584f94828
image/kernel pull: heartbeat dots so slow pulls look alive
Bundle downloads can take 20–60s on a typical connection and the
CLI was going silent between "resolving daemon" and the final image
summary. Users wondered whether banger had wedged.

New `withHeartbeat` helper wraps an RPC call with a dot-every-2s
ticker on stderr. No-op when stderr isn't a terminal, so piped or
scripted invocations stay quiet. Wired into `image pull` and `kernel
pull`, the two commands that actually download bytes.

Example:

    $ banger image pull debian-bookworm
    [image pull] ..........
    id  name             managed  ...

Tests cover the non-TTY short-circuit and error propagation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 17:08:30 -03:00
b5c13e3938
Remove opencode package + vm acp command (dead code)
The `internal/opencode` package and the `opencodeCapability` that
consumed it were hard-wired to wait for opencode on guest port 4096
when an image shipped an initrd. After the prune commits (void /
alpine / customize.sh / image build all removed), nothing banger
produces today carries an initrd, so the capability's wait path was
unreachable: every startup short-circuited to the "direct-boot, skip
opencode" branch.

Same logic for `banger vm acp`: it SSHes to `opencode acp --cwd
<path>`, a binary the golden image no longer ships. Users who run
their own image with opencode can still invoke
`ssh vm -- opencode acp --cwd /root/repo` directly — no banger
scaffolding required.

Removed:
- internal/opencode/ (whole package, 255 LOC incl. tests)
- internal/daemon/opencode.go (opencodeCapability)
- cli `vm acp` command + its helpers (runVMACP, sshACPCommandArgs,
  vmACPRemoteCommand) + their tests
- The opencodeCapability{} entry in registeredCapabilities() plus
  the test that pinned its presence
- `wait_opencode` progress-stage label from the vm-create renderer
- Stale mentions in daemon/doc.go, README, and webui test fixtures

~480 lines gone, 12 added. `banger/internal` is now 25 packages
instead of 26.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 16:54:37 -03:00
0933deaeb1
file_sync: config-driven replacement for hardcoded auth sync
Replace the three hardcoded host→guest credential syncs (opencode,
claude, pi) with a generic `[[file_sync]]` config list. Default is
empty — users opt in to exactly what they want synced, with no
surprise about which tools banger "supports".

```toml
[[file_sync]]
host = "~/.local/share/opencode/auth.json"
guest = "~/.local/share/opencode/auth.json"

[[file_sync]]
host = "~/.aws"          # directories are copied recursively
guest = "~/.aws"

[[file_sync]]
host = "~/bin/my-script"
guest = "~/bin/my-script"
mode = "0755"            # optional; default 0600 for files
```

Semantics:
- Host `~/...` expands against the host user's $HOME. Absolute host
  paths are used as-is.
- Guest must live under `~/` or `/root/...` — banger's work disk is
  mounted at /root in the guest, so that's the syncable namespace.
  Anything outside is rejected at config load.
- Validation at config load: reject empty paths, relative paths,
  `..` traversal, `~user/...`, malformed mode strings. Errors name
  the offending entry index.
- Missing host paths are a soft skip with a warn log (existing
  behaviour). Other errors (read, mkdir, install) abort VM create.
- File entries: `install -o 0 -g 0 -m <mode>` (default 0600).
- Directory entries: walked in Go; each source file is installed
  with its own source permissions preserved. The entry's `mode` is
  ignored for directories.

Removed (all dead after this):
- `ensureOpencodeAuthOnWorkDisk`, `ensureClaudeAuthOnWorkDisk`,
  `ensurePiAuthOnWorkDisk`, the shared `ensureAuthFileOnWorkDisk`,
  their `warn*Skipped` helpers, `resolveHost{Opencode,Claude,Pi}AuthPath`,
  and the work-disk relative-path + default display-path constants.
- The capability hook registering the three syncs now calls the
  generic `runFileSync` once.

Seven tests exercising the old codepath deleted; six new tests cover
the new runFileSync (no-op on empty config, file copy, custom mode,
missing-host-skip, overwrite, recursive directory). Config-layer
test adds happy-path parsing and a case-per-shape table of invalid
entries (empty, relative host, guest outside /root, '..' traversal,
`~user`, bad mode).

README updated: replaces the "Credential sync" section with a
"File sync" section showing the new config shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 16:40:11 -03:00
843314be5e
vm_authsync: s/repairing/provisioning/ in SSH work-disk stage
The "repairing SSH access on work disk" stage detail sounded
remedial, like something had gone wrong. It's just writing banger's
SSH key to /root/.ssh/authorized_keys on the work disk for the first
time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 16:29:18 -03:00
cdd857b288
vm run --rm: suppress the still-running reminder
The deferred --rm delete fires AFTER runSSHSession returns, but
runSSHSession prints "vm X is still running (stop with ...)" before
returning. Net effect: the user sees the reminder, then the VM gets
deleted behind it — misleading.

Thread a skipReminder bool into runSSHSession. `vm run` passes the
same value as removeOnExit; other callers (`vm ssh`) pass false.
Reinforced by a new assertion in the --rm happy-path test that the
reminder string never appears in stderr.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 16:10:29 -03:00
b33f24865c
vm run --rm: ephemeral sandboxes
New `--rm` flag deletes the VM once the ssh session or `-- cmd`
exits, making `vm run` one-shot. Exit code from command mode still
propagates correctly.

Semantics:
- Create fails → no VM to delete, nothing to do.
- SSH-wait timeout → VM intentionally kept alive so `vm logs <name>`
  shows why; the timeout error already pointed users at that. Even
  with --rm, this path skips delete — a wedged sshd is exactly when
  you want post-mortem access.
- Session/command ends (any exit code, any reason) → VM is deleted
  via `vm.delete` RPC. Uses a fresh 10s context so Ctrl-C during the
  session doesn't abort the cleanup.

New vmDeleteFunc seam at the top of banger.go alongside the other
RPC seams. Two tests cover the happy path (session ends cleanly →
delete fires with correct ref) and the skip-on-timeout path (ssh
wait errors → delete does NOT fire).

README updated with an ephemeral example and a note about the
timeout-skip behaviour.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 16:06:46 -03:00
3aa64a63c1
vm run: bound the ssh wait and give a useful error on timeout
Before: `guestWaitForSSHFunc` loops forever bounded only by context
cancellation, so if sshd fails to start in the guest `vm run` hangs
indefinitely — which burned a long debugging session during the
golden-image bring-up.

After: the ssh wait gets its own 90s deadline. On guest-side timeout
the error names the VM, explains sshd is the likely suspect, points
at `banger vm logs <name>` for the console output, and notes the VM
is still alive for inspection (or `vm delete` to clean up). Parent
context cancellation (Ctrl-C, caller timeout) still surfaces as-is
without the hint.

`vmRunSSHTimeout` is a var rather than a const so tests can shrink
it; the new TestRunVMRunSSHTimeoutReturnsActionableError sets it to
50ms and asserts the error message contains the actionable bits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 15:59:27 -03:00
ac7974f5b9
Remove image build --from-image; doctor treats catalog images as OK
The `image build` flow spun up a transient Firecracker VM, SSHed in,
and ran a large bash provisioning script to derive a new managed
image from an existing one. It overlapped heavily with the golden-
image Dockerfile flow (same mise/docker/tmux/opencode install logic
duplicated in Go as `imagemgr.BuildProvisionScript`) and had far more
machinery: async op state, RPC begin/status/cancel, webui form +
operation page, preflight checks, API types, tests. For custom
images, writing a Dockerfile is simpler and more reproducible.

Removed end-to-end:
- CLI `image build` subcommand + `absolutizeImageBuildPaths`.
- Daemon: BuildImage method, imagebuild.go (transient-VM orchestration),
  image_build_ops.go (async begin/status/cancel), imagemgr/build.go
  (the 247-line provisioning script generator and all its append*
  helpers), validateImageBuildPrereqs + addImageBuildPrereqs.
- RPC dispatches for image.build / .begin / .status / .cancel.
- opstate registry `imageBuildOps`, daemon seam `imageBuild`,
  background pruner call.
- API types: ImageBuildParams, ImageBuildOperation, ImageBuildBeginResult,
  ImageBuildStatusParams, ImageBuildStatusResult; model type
  ImageBuildRequest.
- Web UI: Backend interface methods, handlers, form, routes, template
  branches (images.html build form, operation.html build branch,
  dashboard.html Build button).
- Tests that directly exercised BuildImage.

Doctor polish (task C):
- Drop the "image build" preflight section entirely (its raison d'être
  is gone).
- Default-image check now accepts "not local but in imagecat" as OK:
  vm create auto-pulls on first use. Only flag when the image is
  neither locally registered nor in the catalog.

Net: 24 files touched, 1,373 lines deleted, 25 added.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 15:54:29 -03:00
6083e2dde5
Prune legacy void/alpine + customize.sh flows
The golden-image Dockerfile + catalog pipeline replaces the entire
manual rootfs-build stack. With that shipped, the per-distro shell
flows are dead code.

Removed:
- scripts/customize.sh, scripts/interactive.sh, scripts/verify.sh
- scripts/make-rootfs{,-void,-alpine}.sh
- scripts/register-{void,alpine}-image.sh
- scripts/make-{void,alpine}-kernel.sh
- internal/imagepreset/ (only consumer was `banger internal packages`,
  which fed customize.sh)
- examples/{void,alpine}.config.toml
- Makefile targets: rootfs, rootfs-void, rootfs-alpine, void-kernel,
  alpine-kernel, void-register, alpine-register, void-vm, alpine-vm,
  verify-void, verify-alpine, plus the ALPINE_RELEASE / *_IMAGE_NAME
  / *_VM_NAME variables

The void-6.12 kernel catalog entry is also gone — golden image pairs
with generic-6.12 and nothing else in the catalog depended on it.

Consolidated: imagemgr now holds the small DebianBasePackages list +
package-hash helper inline, so the `image build --from-image` flow
(still supported) no longer pulls from a separate imagepreset package.

Net: 3,815 lines deleted, 59 added. No runtime functionality removed
beyond the `banger internal packages` subcommand (hidden, used only
by the deleted customize.sh).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 15:39:53 -03:00
75baf2e415
publish-golden-image: content-addressed tarball names
Embed the sha256 prefix in the uploaded filename so every rebuild
lives at a unique URL. Cloudflare's edge cache (and any similar CDN
in front of R2) can never serve stale bytes for the URL the catalog
points at. The R2 console offers no per-URL purge for this bucket
layout, so making the URL itself content-addressed is the only
durable fix.

Also republishes the debian-bookworm catalog entry with the new
filename.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 15:26:57 -03:00
e0894376ea
vm create: auto-pull image and kernel from catalogs if missing
One-command sandbox: `banger vm run` on a fresh host now Just Works.
No prior `banger image pull` or `banger kernel pull` needed.

Changes:

- Default `default_image_name` flips from "default" to "debian-bookworm"
  so the golden image is the implicit target when `--image` is omitted.
- `CreateVM` resolves the image via a new `findOrAutoPullImage`: try
  the local store first, and on miss fall back to the embedded imagecat
  catalog + auto-pull. Emits a vm-create progress stage so the user
  sees "pulling from image catalog" in the create output.
- `resolveKernelInputs` gains context + the same pattern via
  `readOrAutoPullKernel`: try the local kernelcat, and on miss look up
  the embedded kernelcat and auto-pull. Fires whenever a bundle's
  manifest references a kernel the user hasn't pulled yet, not just
  during image pull — any CreateVM with an image that needs a kernel
  not yet local will resolve it.
- `--image` help text updated on both `vm run` and `vm create`.

Six tests cover local-hit-no-pull, auto-pull-on-miss, not-in-catalog
error propagation, and a non-ENOENT kernel read error does NOT trigger
a misleading "not in catalog" claim.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 15:10:26 -03:00
81a27d6648
imagecat: publish debian-bookworm bundle with boot fixes
End-to-end verified:
  banger image pull debian-bookworm
  banger vm run --image debian-bookworm --name goldenvm
boots through multi-user.target, sshd starts, and vm run drops into
an interactive ssh session.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 14:59:01 -03:00
66838bb135
make-bundle: strip /.dockerenv so systemd doesn't misdetect virt
`docker create` drops /.dockerenv into the container's writable layer,
and `docker export` includes it in the tar. When systemd later boots
that rootfs it finds /.dockerenv and flags virtualization=docker,
which disables a bunch of udev device-unit behaviour (device units
never become active, mount units waiting on them hang forever).
Strip /.dockerenv (and /run/.containerenv for podman symmetry) from
the staging tree after FlattenTar and before BuildExt4 so systemd
correctly detects virtualization=kvm.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 14:58:42 -03:00
ed4117d926
imagepull/BuildExt4: omit positional fs-size; rely on file truncation
mkfs.ext4's positional fs-size is documented in 1 KiB units (not the
filesystem's 4 KiB block size), so passing sizeBytes/4096 made
filesystems 1/4 the intended size. A 4 GiB request became a 1 GiB
ext4 in a 4 GiB file, packed to 0 free blocks — VM create then failed
with 'Could not allocate block' when patchRootOverlay tried to write
guest config.

The file is truncated to the target size before mkfs runs; without
the positional arg, mkfs uses the whole device.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 14:58:42 -03:00
b2dcdf9757
vm_lifecycle: drop systemd.mask=dev-{ttyS0,vdb}.device
Both masks were added when the direct-boot path first landed for
container rootfses that didn't have anything mounted on /dev/vdb. The
golden image (and any pulled OCI image running under banger's
patchRootOverlay) has an /etc/fstab entry mounting /dev/vdb at /root —
masking dev-vdb.device makes systemd wait forever for a unit that can
never become active, and the work-disk mount never completes. dev-ttyS0
is a real serial console the image needs too. Drop both.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 14:58:42 -03:00
ab5627aec2
imagecat: publish debian-bookworm golden image
First entry in the image catalog. Verified end-to-end:
  - https://images.thaloco.com/debian-bookworm-x86_64.tar.zst reachable
  - sha256 071495e6... matches
  - bundle unpacks to rootfs.ext4 (4 GiB) + manifest.json with the
    expected name/distro/arch/kernel_ref.

publish-golden-image.sh tweaks:
  - default RCLONE_REMOTE from 'r2' to 'banger-images' (matches the
    rclone config actually in use here).
  - rclone copyto now passes --s3-no-check-bucket and --no-check-dest
    so scoped R2 tokens without HeadBucket/HeadObject permission
    still upload cleanly.

To use: restart bangerd so it picks up the new embedded catalog,
then `banger image pull debian-bookworm`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:25:42 -03:00
5bdc9985c2
image pull: dispatch to imagecat bundle path before OCI
PullImage now checks the embedded imagecat catalog first. If the
ref matches a catalog entry, it takes the bundle path:

  1. Fetch the .tar.zst bundle into a staging dir (rootfs.ext4 +
     manifest.json).
  2. Strip manifest.json (staging-only metadata).
  3. Stage kernel/initrd/modules alongside rootfs.ext4.
  4. Publish the staging dir and upsert the image row.

Bundle rootfs is already flattened + ownership-fixed + agent-
injected at build time, so the daemon-side work is strictly I/O —
no flatten, no mkfs, no debugfs.

Kernel resolution in the bundle path: --kernel-ref > entry.kernel_ref
> --kernel/--initrd/--modules.

If the ref doesn't match a catalog entry, PullImage falls through
to the existing OCI path unchanged (extracted into pullFromOCI).

New test seam: d.bundleFetch. Six unit tests cover happy path,
--kernel-ref override, existing-name rejection, kernel-required
error, fetch-failure cleanup, and the catalog → OCI fallthrough.

CLI help updated: image pull now documents both forms and takes
<name-or-oci-ref> instead of requiring an OCI ref.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 15:43:33 -03:00
d22d05555c
scripts: bundle-based golden image pipeline
Replaces the OCI-push flow with a bundle-based one that mirrors the
kernel catalog (publish-kernel.sh / kernelcat).

- scripts/make-golden-bundle.sh: docker build → docker create → docker
  export | banger internal make-bundle → .tar.zst. Defaults target
  debian-bookworm / generic-6.12 / x86_64; pinned --size 4G to leave
  headroom for first-boot installs and in-VM apt use.
- scripts/publish-golden-image.sh: rewritten to call make-golden-bundle,
  rclone upload to R2 (banger-images bucket, images.thaloco.com), and
  jq-patch internal/imagecat/catalog.json with URL / sha256 / size.
  --skip-upload stops after bundle build and copies to dist/.

make-bundle default ext4 sizing also bumped from +25% to +50% headroom
(mkfs.ext4 needs room for inode tables, block-group metadata, journal,
and the default 5% reserved-blocks margin). The old 25% was too tight
for the ~950 MB golden rootfs and aborted with "Could not allocate
block".

End-to-end smoke (local): golden Dockerfile → 286 MB tar.zst bundle
with correct manifest, valid ext4, and all banger units + vsock agent
present.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 15:38:04 -03:00
a7d1a49aca
cli: restrict ExitCodeError unwrap to the CLI's own type
main.go previously unwrapped *any* error implementing `ExitCode() int`
into the process exit status, which matched *exec.ExitError too. So
whenever a CLI command ran a subprocess (mkfs.ext4, debugfs, ssh to a
daemon preflight, etc.) and that subprocess failed, the CLI would
silently exit with the subprocess's code — no error message printed.
Surfaced while bringing up `banger internal make-bundle`: mkfs.ext4
was failing on an undersized ext4 and the user saw only `EXIT=1`.

Fix: export the type as `cli.ExitCodeError` and unwrap against the
concrete type in main.go. The `ExitCode()` method is gone — only the
explicit wrap at the `vm run` command-mode call site produces this
error now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 15:37:47 -03:00
bb95a0a273
banger internal make-bundle: build image bundles from flat rootfs tars
New hidden subcommand that turns a `docker export`-style rootfs tar
into a banger bundle (`rootfs.ext4` + `manifest.json`, tar+zstd):

  1. FlattenTar (new in imagepull) extracts the stream into a staging
     dir while capturing per-file uid/gid/mode into a Metadata record.
  2. imagepull.BuildExt4 produces the ext4 via `mkfs.ext4 -d`.
  3. imagepull.ApplyOwnership re-applies the captured metadata with
     `debugfs sif` so setuid/root-owned files keep their identity.
  4. imagepull.InjectGuestAgents drops the vsock agent + network
     bootstrap + first-boot service into the ext4.
  5. manifest.json is written with name/distro/arch/kernel_ref.
  6. Both files are packaged as .tar.zst with max compression.

Flags: --rootfs-tar (file or '-' for stdin), --name, --distro, --arch,
--kernel-ref, --description, --size, --out. Stdout prints bundle path,
sha256, and size so callers can patch the catalog.

Unit tests cover flag registration, required-arg validation, the
bundle tar round-trip, sha256HexFile, and dirSize. An end-to-end test
runs the full pipeline against a synthesized tiny rootfs tar; skips
gracefully when mkfs.ext4 / debugfs / companion binaries are missing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 15:17:50 -03:00
3d9ae624b1
imagecat: catalog + fetch for banger image bundles
New package mirroring `kernelcat`: catalog + SHA256-verified HTTP
fetch of `.tar.zst` bundles that contain rootfs.ext4 + manifest.json.
Mounted empty (version:1, entries:[]) so nothing is pullable via the
bundle path yet; wiring into `banger image pull` lands in a later
phase.

- catalog.go: Catalog/CatEntry, LoadEmbedded, ParseCatalog, Lookup,
  ValidateName.
- fetch.go: Fetch(ctx, client, destDir, entry) downloads the bundle,
  verifies sha256, extracts exactly rootfs.ext4 and manifest.json
  into destDir, returns the parsed manifest. Rejects unexpected tar
  entries, unsafe paths, non-regular files, and cleans up partial
  writes on failure.
- Thirteen unit tests (happy path + every failure mode).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 15:11:52 -03:00
feb679a301
vm run redesign: one command, three modes
`vm run` now covers bare sandbox (no args), workspace sandbox (path),
and workspace+command (path -- cmd) in a single entry point. Replaces
the old print-next-steps-and-exit behaviour: bare and workspace modes
drop into interactive ssh, command mode execs via ssh and propagates
the remote exit code through banger's own exit status.

- path argument is optional; --branch / --from still require a path.
- workspace prep and mise tooling bootstrap only run when a path is
  given; command mode skips the bootstrap.
- remote command exit status is wrapped as exitCodeError so main() can
  propagate it instead of collapsing every failure to 1.
- README: promote vm run with three-mode examples; demote vm create
  to a scripting primitive.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 14:00:45 -03:00