Previously the daemon socket, per-VM firecracker API socket, and vsock
socket were transiently world-exposed on hosts without XDG_RUNTIME_DIR:
the runtime directory landed in /tmp at 0755, Firecracker ran with
umask 000 (mode 0666 sockets), and only a follow-up chown/chmod in
EnsureSocketAccess tightened them. A local attacker could race into
bangerd.sock or the firecracker API socket during that window.
Three changes:
- internal/paths/paths.go: RuntimeDir is now created (and re-chmod'd if
stale) at 0700 unconditionally. When XDG_RUNTIME_DIR is unset and we
fall back to /tmp/banger-runtime-<uid>, Ensure() now verifies the
parent dir is owned by the current uid and 0700 mode — refusing to
place sockets inside a directory someone else created. Symlink swaps
are rejected via Lstat.
- internal/firecracker/client.go: launch firecracker with umask 077
instead of umask 000 so the API socket is mode 0600 from birth. The
chown in fcproc.EnsureSocketAccess still transfers ownership from
root to the invoking user afterwards.
- internal/daemon/fcproc/fcproc.go: EnsureSocketDir now creates (and
re-chmod's) the runtime socket directory at 0700.
Tests cover the tightening path — an existing 0755 RuntimeDir is
re-chmod'd on Ensure.
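A minimal sketch of the kind of check described for the /tmp fallback,
assuming Linux (syscall.Stat_t). checkDirMeta and verifyPrivateDir are
hypothetical names for illustration, not the functions in
internal/paths:

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// checkDirMeta holds the policy: a real directory (not a symlink),
// owned by wantUID, with mode exactly 0700.
func checkDirMeta(isSymlink, isDir bool, perm uint32, ownerUID, wantUID int) error {
	switch {
	case isSymlink:
		return fmt.Errorf("refusing symlink")
	case !isDir:
		return fmt.Errorf("not a directory")
	case ownerUID != wantUID:
		return fmt.Errorf("owned by uid %d, want %d", ownerUID, wantUID)
	case perm != 0o700:
		return fmt.Errorf("mode %04o, want 0700", perm)
	}
	return nil
}

// verifyPrivateDir inspects path with Lstat, which does not follow
// symlinks, so a swapped-in link is seen as a link and rejected.
func verifyPrivateDir(path string) error {
	info, err := os.Lstat(path)
	if err != nil {
		return err
	}
	st, ok := info.Sys().(*syscall.Stat_t)
	if !ok {
		return fmt.Errorf("no ownership info for %s", path)
	}
	return checkDirMeta(info.Mode()&os.ModeSymlink != 0, info.IsDir(),
		uint32(info.Mode().Perm()), int(st.Uid), os.Getuid())
}

func main() {
	dir, err := os.MkdirTemp("", "banger-runtime-")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	fmt.Println("fresh dir:", verifyPrivateDir(dir))
	_ = os.Chmod(dir, 0o755)
	fmt.Println("after chmod 0755:", verifyPrivateDir(dir))
}
```

Keeping the policy in a pure function separate from the Lstat call
makes the owner/mode/symlink cases testable without touching /tmp.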
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cuts the daemon-managed guest-session machinery (start/list/show/
logs/stop/kill/attach/send). The feature targeted agent-orchestration
workflows (programmatic stdin piping into a long-lived guest process)
that aren't driving any concrete user today, and the ~2.3K LOC of
daemon surface area (attach bridge, FIFO keepalive, controller
registry, sessionstream framing, SQLite persistence) was locking in
an API we'd have to keep through v0.1.0.
Anything session-flavoured that people actually need today can be
done with `vm ssh + tmux` or `vm run -- cmd`.
Deleted:
- internal/cli/commands_vm_session.go
- internal/daemon/{guest_sessions,session_lifecycle,session_attach,session_stream,session_controller}.go
- internal/daemon/session/ (guest-session helpers package)
- internal/sessionstream/ (framing package)
- internal/daemon/guest_sessions_test.go
- internal/store/guest_session_test.go
- GuestSession* types from internal/{api,model}
- Store UpsertGuestSession/GetGuestSession/ListGuestSessionsByVM/DeleteGuestSession + scanner helpers
- guest.session.* RPC dispatch entries
- 5 CLI session tests, 2 completion tests, 2 printer tests
Extracted:
- ShellQuote + FormatStepError lifted to internal/daemon/workspace/util.go
(only non-session consumer); workspace package now self-contained
- internal/daemon/guest_ssh.go keeps guestSSHClient + dialGuest +
waitForGuestSSH — still used by workspace prepare/export
- internal/daemon/fake_firecracker_test.go preserves the test helper
that used to live in guest_sessions_test.go
Store schema: CREATE TABLE guest_sessions and its column migrations
removed. Existing dev DBs keep an orphan table (harmless, pre-v0.1.0).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CLI: introduce internal/cli.deps which owns every RPC/SSH/host-command
seam the tree used to reach through mutable package vars. Command
builders, orchestrators, and the completion helpers become methods on
*deps. Tests construct their own deps per case, so fakes no longer leak
across cases and tests are free to run in parallel.
Daemon: move workspaceInspectRepoFunc + workspaceImportFunc onto the
Daemon struct (workspaceInspectRepo / workspaceImport), mirroring the
existing guestWaitForSSH / guestDial pattern. Workspace-prepare tests
drop t.Parallel() guards now that they no longer mutate process-wide
state.
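The seam pattern, as a rough sketch: a struct of function fields that
each test builds for itself, instead of package vars that tests
overwrite. The field names, pruneAll, and newFakeDeps below are
hypothetical and much smaller than the real internal/cli.deps:

```go
package main

import (
	"errors"
	"fmt"
)

// deps is a cut-down, hypothetical stand-in for a per-invocation
// dependency struct. Each field is a seam a test can replace.
type deps struct {
	vmList   func() ([]string, error)
	vmDelete func(name string) error
}

// pruneAll deletes every listed VM and returns the names it deleted.
// It only reaches the outside world through d, never package state.
func (d *deps) pruneAll() ([]string, error) {
	names, err := d.vmList()
	if err != nil {
		return nil, fmt.Errorf("list vms: %w", err)
	}
	var deleted []string
	for _, n := range names {
		if err := d.vmDelete(n); err != nil {
			return deleted, fmt.Errorf("delete %s: %w", n, err)
		}
		deleted = append(deleted, n)
	}
	return deleted, nil
}

// newFakeDeps builds an isolated fake; two tests can each hold their
// own without seeing the other's state.
func newFakeDeps(names []string, failOn string) *deps {
	return &deps{
		vmList: func() ([]string, error) { return names, nil },
		vmDelete: func(name string) error {
			if name == failOn {
				return errors.New("simulated delete failure")
			}
			return nil
		},
	}
}

func main() {
	d := newFakeDeps([]string{"a", "b"}, "")
	fmt.Println(d.pruneAll())
}
```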
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously store.Open errors were silently swallowed, so `banger
doctor` could report green while the default-image check (and any
other store-dependent diagnostic) was silently skipped because
d.store was nil.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pure code motion — banger.go 3508→240 LOC, same-package
decomposition keeps all identifiers visible without export changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The CLI carried a full second copy of the workspace import
implementation that `vm run` never actually used:
- importVMRunRepoToGuest (no callers — the live flow calls the
daemon's PrepareVMWorkspace RPC instead)
- prepareVMRunRepoCopy, vmRunCheckoutCommit, vmRunCheckoutScript,
gitFileURL, runHostCommand (all reachable only from the dead
importVMRunRepoToGuest)
Plus a duplicated repo-inspection surface that shadowed the
daemon's:
- inspectVMRunRepo ran every git query the daemon re-ran during
workspace.prepare (HEAD, branch, identity, origin, overlay list)
- gitOutput / gitTrimmedOutput / gitResolvedConfigValue /
parseNullSeparatedOutput / listSubmodules / listOverlayPaths /
resolveVMRunSourcePath — all identical to the exported
workspace.* versions
- vmRunRepoSpec — same fields as workspace.RepoSpec
Replaced with a single minimal preflight:
func vmRunPreflightRepo(ctx, rawPath) (absPath, err error)
The preflight only checks what the user can fix locally before
banger creates a VM (path exists, sits in a non-bare git repo, no
submodules). The daemon's workspace.prepare RPC does the full
inspection — and returns RepoRoot + RepoName in the response, which
the CLI now threads into the tooling harness instead of computing
them a second time.
Signature changes:
runVMRun(ctx, ..., *vmRunRepo, ...) // was: *vmRunRepoSpec
startVMRunToolingHarness(ctx, client, repoRoot, repoName, progress)
// was: (ctx, client, spec, progress)
vmRunToolingHarnessScript(plan) // was: (spec, plan)
vmRunToolingHarnessLaunchScript(repoName) // was: (spec)
Tests: the CLI-side git-inspection tests are replaced by a single
TestVMRunPreflightRejectsSubmodules that exercises the preflight.
Everything else (tooling harness script, progress renderer, SSH args,
runVMRun flows) keeps working. The shallow-copy / checkout-script
tests are gone — that code now lives only in
internal/daemon/workspace and is tested there.
Also fixed a latent bug the refactor exposed: vm run's --from flag
defaults to "HEAD", which the daemon reads as "from without branch"
and rejects. CLI now scrubs fromRef when branchName is empty.
Live verified: `banger vm run --name X . -- cmd` boots, workspace
materialises at /root/repo with matching HEAD, exit code propagates.
Guest host-key verification was off in all three SSH paths:
* Go SSH (internal/guest/ssh.go) used ssh.InsecureIgnoreHostKey
* `banger vm ssh` passed StrictHostKeyChecking=no
+ UserKnownHostsFile=/dev/null
* `~/.ssh/config` Host *.vm shipped the same posture into the
user's global config
Now each path verifies against a banger-owned known_hosts file at
`~/.local/state/banger/ssh/known_hosts` with TOFU semantics:
* First dial to a VM pins the key.
* Subsequent dials require an exact match. A mismatch fails with
an explicit "possible MITM" error.
* `vm delete` removes the entries so a future VM reusing the IP
or name re-pins cleanly.
* The user's `~/.ssh/known_hosts` is untouched.
Changes:
internal/guest/known_hosts.go (new) — OpenSSH-compatible parser,
TOFUHostKeyCallback, RemoveKnownHosts. Process-wide mutex
around the file.
internal/guest/ssh.go — Dial and WaitForSSH grew a knownHostsPath
parameter threaded through the callback. Empty path keeps the
insecure callback (tests + throwaway tools only; documented).
internal/daemon/{guest_sessions,session_attach,session_lifecycle,
session_stream}.go — call sites pass d.layout.KnownHostsPath.
internal/daemon/ssh_client_config.go — the ~/.ssh/config Host *.vm
block now points at banger's known_hosts and uses
StrictHostKeyChecking=accept-new. Missing path → fail closed.
internal/daemon/vm_lifecycle.go — deleteVMLocked drops known_hosts
entries for the VM's IP and DNS name via removeVMKnownHosts.
internal/cli/banger.go — sshCommandArgs swaps StrictHostKeyChecking
no + /dev/null for banger's file + accept-new. Path resolution
failure falls through to StrictHostKeyChecking=yes.
internal/paths/paths.go — Layout gains SSHDir + KnownHostsPath;
Ensure creates SSHDir at 0700.
Tests (internal/guest/known_hosts_test.go): pin on first use, accept
matching key on second dial, reject mismatch, empty path skips
checking, RemoveKnownHosts drops the entry, re-pin works after
remove. Existing daemon + cli tests updated to assert the new
posture and regression-guard against the old flags.
Live verified: vm run writes the pin to banger's known_hosts at 0600
inside a 0700 dir; banger vm ssh + ssh root@<vm>.vm both succeed
using the pin; vm delete clears it.
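The TOFU semantics above, reduced to an in-memory sketch. This is not
the OpenSSH-compatible file parser in internal/guest/known_hosts.go:
pinStore, check, and remove are hypothetical names, and a plain
fingerprint string stands in for the real public key:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errHostKeyMismatch = errors.New("host key mismatch (possible MITM)")

// pinStore maps a host (IP or DNS name) to the key fingerprint seen on
// the first dial. A mutex guards the map, mirroring the process-wide
// mutex around the known_hosts file.
type pinStore struct {
	mu   sync.Mutex
	pins map[string]string
}

func newPinStore() *pinStore {
	return &pinStore{pins: make(map[string]string)}
}

// check pins on first use and demands an exact match afterwards.
func (s *pinStore) check(host, fingerprint string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	pinned, ok := s.pins[host]
	if !ok {
		s.pins[host] = fingerprint
		return nil
	}
	if pinned != fingerprint {
		return fmt.Errorf("%s: %w", host, errHostKeyMismatch)
	}
	return nil
}

// remove drops pins so a later VM reusing the name or IP re-pins.
func (s *pinStore) remove(hosts ...string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, h := range hosts {
		delete(s.pins, h)
	}
}

func main() {
	s := newPinStore()
	fmt.Println("first dial:", s.check("devbox.vm", "SHA256:aaa"))
	fmt.Println("same key:", s.check("devbox.vm", "SHA256:aaa"))
	fmt.Println("new key:", s.check("devbox.vm", "SHA256:bbb"))
	s.remove("devbox.vm")
	fmt.Println("after remove:", s.check("devbox.vm", "SHA256:bbb"))
}
```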
Open() touched several pieces of host state before hitting the step
that returned the error:
* SQLite handle (store.Open)
* managed SSH client config block (ensureVMSSHClientConfig)
* vm-DNS UDP listener goroutine (startVMDNS)
* systemd-resolved per-interface routing (ensureVMDNSResolverRouting)
The only deferred cleanup guarded stopVMDNS. A reconcile() or
initializeTapPool() failure therefore left the listener running, the
resolver wiring in place, and the SQLite handle open. A subsequent
startup attempt ran into "port 42069 already in use" or silently
published stale state.
Fix: once `d` exists, defer `d.Close()` on `err != nil`. Close is
idempotent (sync.Once) and every teardown step (listener close, DNS
listener close, resolver revert, session registry close, store close)
is nil-guarded, so calling it on a daemon that never got past the
first startup step is safe.
Tests (internal/daemon/open_close_test.go):
- TestCloseOnPartiallyInitialisedDaemon: Close survives a daemon
with only store + closing channel, and with a vmDNS listener but
nothing else. Catches regressions where a teardown step forgets
to nil-check.
- TestCloseIdempotentUnderConcurrency: 5 goroutines racing on
Close() never panic (sync.Once + close(d.closing) survive).
- TestOpenFailureRunsCloseCleanup: structural check that the
`defer cleanup() if err != nil` pattern actually fires.
Live: `banger daemon stop` exits cleanly, and `banger vm ls` restarts
the daemon without a residual listener on port 42069.
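The shape of the fix, as a sketch with simulated startup steps. The
daemon struct, fakeStore, and the failAt knob are hypothetical; the
parts taken from the description are the deferred Close on error, the
sync.Once, and the nil-guarded teardown:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type fakeStore struct{ open bool }

type daemon struct {
	closeOnce sync.Once
	closing   chan struct{}
	store     *fakeStore // nil until the store step ran
	dnsUp     bool       // false until the DNS step ran
}

// Close is idempotent and tolerates a partially initialised daemon:
// every teardown step checks that its resource exists first.
func (d *daemon) Close() {
	d.closeOnce.Do(func() {
		if d.closing != nil {
			close(d.closing)
		}
		if d.dnsUp {
			d.dnsUp = false
		}
		if d.store != nil {
			d.store.open = false
		}
	})
}

// open runs the startup steps in order. Once d exists, any error
// return triggers d.Close() through the deferred check on err.
func open(store *fakeStore, failAt string) (_ *daemon, err error) {
	d := &daemon{closing: make(chan struct{})}
	defer func() {
		if err != nil {
			d.Close()
		}
	}()
	steps := []struct {
		name string
		run  func()
	}{
		{"store", func() { store.open = true; d.store = store }},
		{"dns", func() { d.dnsUp = true }},
		{"reconcile", func() {}},
	}
	for _, s := range steps {
		if s.name == failAt {
			return nil, errors.New(s.name + ": simulated failure")
		}
		s.run()
	}
	return d, nil
}

func main() {
	s := &fakeStore{}
	_, err := open(s, "reconcile")
	fmt.Println("err:", err, "| store left open:", s.open)
}
```

The deferred closure captures the local d rather than the return
value, so `return nil, err` still tears down what was started.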
The web UI shipped as "experimental" and was never finished — no nav
off the dashboard, no live updates, no settled design, never a
supported surface. It was already opt-in; leaving the code
in the tree for v0.1.0 only invited "does this work?" questions and
kept HostSummary/BangerSummary/SudoStatus types on the public RPC
surface that nothing else uses.
Removed:
internal/webui/ (all Go + templates + assets)
internal/daemon/web.go (server start / Layout / Config / ListVMs / ListImages)
internal/daemon/dashboard.go (DashboardSummary aggregator)
Simplified:
internal/api/types.go                     drop WebURL on PingResult, drop
                                          HostSummary / SudoStatus / BangerSummary /
                                          DashboardSummary / DashboardSummaryResult
internal/model/types.go                   drop DaemonConfig.WebListenAddr
internal/config/config.go                 drop web_listen_addr from fileConfig + Load
internal/daemon/daemon.go                 drop webListener / webServer / webURL fields +
                                          startWebServer() call + ping WebURL population
internal/cli/banger.go                    `daemon status` output no longer branches on web
internal/daemon/{doc.go,ARCHITECTURE.md}  drop web UI sections
README.md                                 drop web_listen_addr config bullet + security paragraph
Tests updated to reflect the new shape. Coverage 57.3 -> 58.9% (the
webui package was largely untested; its removal lifts the ratio
without moving the numerator). `banger daemon status` output and
--help are web-free. Lint + full suite green.
Separates what a VM IS (durable intent + identity + deterministic
derived paths — `VMRuntime`) from what is CURRENTLY TRUE about it
(firecracker PID, tap device, loop devices, dm-snapshot target — new
`VMHandles`). The durable state lives in the SQLite `vms` row; the
transient state lives in an in-memory cache on the daemon plus a
per-VM `handles.json` scratch file inside VMDir, rebuilt at startup
from OS inspection. Nothing kernel-level rides the SQLite schema
anymore.
Why:
Persisting ephemeral process handles to SQLite forced reconcile to
treat "running with a stale PID" as a first-class case and mix it
with real state transitions. The schema described what we last
observed, not what the VM is. Every time the observation model
shifted (tap pool, DM naming, pgrep fallback) the reconcile logic
grew a new branch. Splitting lets each layer own what it's good at:
durable records describe intent, in-memory cache + scratch file
describe momentary reality.
Shape:
- `model.VMHandles` = PID, TapDevice, BaseLoop, COWLoop, DMName,
DMDev. Never in SQLite.
- `VMRuntime` keeps: State, GuestIP, APISockPath, VSockPath,
VSockCID, LogPath, MetricsPath, DNSName, VMDir, SystemOverlay,
WorkDiskPath, LastError. All durable or deterministic.
- `handleCache` on `*Daemon` — mutex-guarded map + scratch-file
plumbing (`writeHandlesFile` / `readHandlesFile` /
`rediscoverHandles`). See `internal/daemon/vm_handles.go`.
- `d.vmAlive(vm)` replaces the 20+ inline
`vm.State==Running && ProcessRunning(vm.Runtime.PID, apiSock)`
spreads. Single source of truth for liveness.
- Startup reconcile: per running VM, load the scratch file, pgrep
the api sock, either keep (cache seeded from scratch) or demote
to stopped (scratch handles passed to cleanupRuntime first so DM
/ loops / tap actually get torn down).
Verification:
- `go test ./...` green.
- Live: `banger vm run --name handles-test -- cat /etc/hostname`
starts; `handles.json` appears in VMDir with the expected PID,
tap, loops, DM.
- `kill -9 $(pgrep bangerd)` while the VM is running, re-invoke the
CLI, daemon auto-starts, reconcile recognises the VM as alive,
`banger vm ssh` still connects, `banger vm delete` cleans up.
Tests added:
- vm_handles_test.go: scratch-file roundtrip, missing/corrupt file
behaviour, cache concurrency, rediscoverHandles prefers pgrep
over scratch, returns scratch contents even when process is
dead (so cleanup can tear down kernel state).
- vm_test.go: reconcile test rewritten to exercise the new flow
(write scratch → reconcile reads it → verifies process is gone →
issues dmsetup/losetup teardown).
ARCHITECTURE.md updated; `handles` added to Daemon field docs.
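A sketch of the scratch-file roundtrip. The field names follow the
`model.VMHandles` list above; the JSON tags, the string types for the
device fields, and the encode/decode helper names are assumptions, not
the actual `writeHandlesFile` / `readHandlesFile` code:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// vmHandles mirrors the transient fields listed above. None of these
// would be stored in SQLite.
type vmHandles struct {
	PID       int    `json:"pid"`
	TapDevice string `json:"tap_device"`
	BaseLoop  string `json:"base_loop"`
	COWLoop   string `json:"cow_loop"`
	DMName    string `json:"dm_name"`
	DMDev     string `json:"dm_dev"`
}

// encodeHandles produces the bytes that would land in handles.json.
func encodeHandles(h vmHandles) ([]byte, error) {
	return json.MarshalIndent(h, "", "  ")
}

// decodeHandles parses a scratch file; a corrupt file is an error the
// caller can treat as "no handles known".
func decodeHandles(data []byte) (vmHandles, error) {
	var h vmHandles
	if err := json.Unmarshal(data, &h); err != nil {
		return vmHandles{}, fmt.Errorf("corrupt handles file: %w", err)
	}
	return h, nil
}

func main() {
	in := vmHandles{PID: 4242, TapDevice: "tap0", BaseLoop: "/dev/loop0",
		COWLoop: "/dev/loop1", DMName: "vm-snap", DMDev: "/dev/mapper/vm-snap"}
	data, err := encodeHandles(in)
	if err != nil {
		panic(err)
	}
	out, err := decodeHandles(data)
	fmt.Println(string(data))
	fmt.Println("roundtrip equal:", in == out, "err:", err)
}
```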
Previously /etc/ssh/sshd_config.d/99-banger.conf landed with:
    LogLevel DEBUG3
    PermitRootLogin yes
    PubkeyAuthentication yes
    AuthorizedKeysFile /root/.ssh/authorized_keys
    StrictModes no
DEBUG3 was a debugging leftover that floods journald in normal use.
StrictModes no was a workaround for /root perm drift on the work
disk — the real fix is to make those perms correct at provisioning
time.
New drop-in:
    PermitRootLogin prohibit-password
    PubkeyAuthentication yes
    PasswordAuthentication no
    KbdInteractiveAuthentication no
    AuthorizedKeysFile /root/.ssh/authorized_keys
prohibit-password blocks password root login even if PasswordAuth
gets flipped on elsewhere; KbdInteractiveAuth no closes the last
interactive fallback; StrictModes is now on (sshd's default).
normaliseHomeDirPerms chown/chmods /root to 0755 root:root at every
work-disk mount (ensureAuthorizedKeyOnWorkDisk,
seedAuthorizedKeyOnExt4Image); the .ssh dir is also explicitly
chown'd root:root. Verified end-to-end against a real VM:
`sshd -T` reports strictmodes yes and all five directives match.
Regression test (sshd_config_test.go) pins the allow-list and the
deny-list (DEBUG3, StrictModes no, bare `PermitRootLogin yes`) so
the next accidental reintroduction fails fast.
README's Security section updated to reflect the new posture.
Previously withVMLockByRef held the per-VM mutex across InspectRepo,
waitForGuestSSH, dialGuest, ImportRepoToGuest (the tar stream!), and
the readonly chmod. A large repo could block `vm stop` / `vm delete`
/ `vm restart` on the same VM for however long the import took.
Split into two phases:
1. VM mutex held briefly to validate state (running + PID alive)
and snapshot the fields needed for SSH (guest IP, api sock).
2. VM mutex released. Acquire workspaceLocks[id] — a separate
per-VM mutex scoped to workspace.prepare / workspace.export —
for the guest I/O phase.
Lifecycle ops (stop/delete/restart/set) only take vmLocks, so they
no longer queue behind a slow import. Two concurrent prepares on the
same VM still serialise via workspaceLocks so tar streams don't
interleave. ExportVMWorkspace also acquires workspaceLocks to avoid
snapshotting a half-streamed import.
Two regression tests (sequential — they swap package-level seams):
ReleasesVMLockDuringGuestIO: stall the import fake, assert the VM
mutex is acquirable from another goroutine during the stall.
SerialisesConcurrentPreparesOnSameVM: 3 concurrent prepares, assert
Import is only ever invoked 1-at-a-time per VM.
ARCHITECTURE.md documents the split + updated lock ordering.
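The two-phase locking, sketched with plain mutex maps. lockSet and
prepareWorkspace are hypothetical names, and the validation and guest
I/O are reduced to a comment and a callback. TryLock needs Go 1.18+:

```go
package main

import (
	"fmt"
	"sync"
)

// lockSet hands out one mutex per VM id, creating it on first use.
type lockSet struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func (s *lockSet) get(id string) *sync.Mutex {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.locks == nil {
		s.locks = make(map[string]*sync.Mutex)
	}
	l, ok := s.locks[id]
	if !ok {
		l = &sync.Mutex{}
		s.locks[id] = l
	}
	return l
}

type daemon struct {
	vmLocks        lockSet // taken by lifecycle ops
	workspaceLocks lockSet // taken by workspace prepare/export
}

func (d *daemon) prepareWorkspace(id string, guestIO func() error) error {
	// Phase 1: hold the VM mutex only long enough to validate state
	// and snapshot what the SSH phase needs.
	vm := d.vmLocks.get(id)
	vm.Lock()
	// (validate running state, copy guest IP and api sock path here)
	vm.Unlock()

	// Phase 2: guest I/O runs under the separate workspace mutex, so
	// concurrent prepares serialise but lifecycle ops do not queue.
	ws := d.workspaceLocks.get(id)
	ws.Lock()
	defer ws.Unlock()
	return guestIO()
}

func main() {
	d := &daemon{}
	_ = d.prepareWorkspace("vm1", func() error {
		l := d.vmLocks.get("vm1")
		free := l.TryLock()
		if free {
			l.Unlock()
		}
		fmt.Println("vm lock free during guest I/O:", free)
		return nil
	})
}
```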
Previously `banger vm workspace export` ran `git add -A` against the
guest's real `.git/index`, so the observation step left staged
changes behind that users never asked for. Reconnecting later (ssh,
another export) surfaced them and looked like phantom work.
Route `git add -A` through a throwaway index file instead:
    tmp_idx=$(mktemp ...)
    trap 'rm -f "$tmp_idx"' EXIT
    git read-tree <ref> --index-output="$tmp_idx"
    GIT_INDEX_FILE="$tmp_idx" git add -A
    GIT_INDEX_FILE="$tmp_idx" git diff --cached <ref> --binary|--name-only
The real .git/index, working tree, and refs stay exactly as the user
left them. Same diff content — commits past <ref>, uncommitted edits,
and untracked files (minus .gitignore) all captured.
Regression test locks the invariant: every export script must route
add -A through GIT_INDEX_FILE and clean the temp index on exit. CLI
help text updated to say "non-mutating".
Replaces the static model.Default* constants that drove --vcpu / --memory
/ --disk-size with a three-layer resolver:
1. [vm_defaults] in ~/.config/banger/config.toml (if set)
2. host-derived heuristics (cpus/4 capped at 4; ram/8 capped at 8 GiB)
3. baked-in constants (floor)
Visibility:
- Every `vm run` / `vm create` prints a `spec:` line before progress
begins: `spec: 4 vcpu · 8192 MiB · 8G disk`. Matches the VM that
actually gets created because the CLI is now the single source of
truth — it resolves, populates the flag defaults, and forwards the
explicit values to the daemon.
- `banger doctor` adds a "vm defaults" check showing per-field
provenance (config|auto|builtin) and the config file path for
overrides.
- `--help` shows the resolved defaults (e.g. `--vcpu int (default 4)`
on an 8-core host).
No `banger config init` command, no first-run side effects, no writes
to the user's filesystem behind their back. Users who want explicit
control set the keys; everyone else gets sensible numbers that track
their hardware.
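One way the three layers could compose, as a sketch. The caps come
from the list above; the function names, the provenance strings as Go
values, the built-in value of 2, and the reading of "floor" as "the
built-in wins when the heuristic yields less" are all assumptions:

```go
package main

import "fmt"

// resolved carries a value plus where it came from.
type resolved struct {
	value  int
	source string // "config", "auto", or "builtin"
}

// resolve picks config if set, else the host-derived value if it is
// at least the built-in, else the built-in.
func resolve(configured, derived, builtin int) resolved {
	switch {
	case configured > 0:
		return resolved{configured, "config"}
	case derived >= builtin:
		return resolved{derived, "auto"}
	default:
		return resolved{builtin, "builtin"}
	}
}

// deriveVCPU is cpus/4 capped at 4.
func deriveVCPU(hostCPUs int) int {
	v := hostCPUs / 4
	if v > 4 {
		v = 4
	}
	return v
}

// deriveMemoryMiB is ram/8 capped at 8 GiB, in MiB.
func deriveMemoryMiB(hostMiB int) int {
	v := hostMiB / 8
	if v > 8192 {
		v = 8192
	}
	return v
}

func main() {
	const builtinVCPU = 2 // hypothetical built-in default
	for _, cpus := range []int{4, 16, 64} {
		r := resolve(0, deriveVCPU(cpus), builtinVCPU)
		fmt.Printf("%d host cpus: %d vcpu (%s)\n", cpus, r.value, r.source)
	}
	r := resolve(3, deriveVCPU(64), builtinVCPU)
	fmt.Printf("config override: %d vcpu (%s)\n", r.value, r.source)
}
```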
- WebListenAddr default is now "" (empty). The experimental web UI was
running on 127.0.0.1:7777 by default, which surprises users who never
opted in. Users who want it set `web_listen_addr = "127.0.0.1:7777"`
in config.toml.
- `make uninstall` stops the daemon (if any) and removes the installed
binaries. Preserves user data on disk but prints the paths so `rm -rf`
can follow for a full purge. Documented in README next to install.
- docs/kernel-catalog.md: replace the `void-6.12` and `alpine-3.23`
examples (never published) with `generic-6.12` (the only cataloged
kernel today). Updates the versioning-convention example too.
- `banger vm prune` sweeps every non-running VM (stopped, created,
error) with an interactive confirmation; -f/--force skips the prompt.
Partial failures report which VM failed and exit non-zero.
- list commands gain `ls` alias: vm list already had it; added to image
list, kernel list, and vm session list.
- delete commands gain `rm` alias: vm delete and image delete. kernel
rm already aliased delete/remove.
Prune uses a new test seam (vmListFunc) plus the existing vmDeleteFunc,
so it can be unit-tested without touching the daemon socket.
Re-enable cobra's default `completion` subcommand (`banger completion
bash|zsh|fish|powershell`). Plus live resource-name suggestions that
hit the running daemon via the same RPC the real commands use:
vm start/stop/restart/delete/kill/set                    → completeVMNames (variadic)
vm ssh/show/logs/stats/ports/...                         → completeVMNameOnlyAtPos0
vm session list/start                                    → completeVMNameOnlyAtPos0
vm session show/logs/stop/kill/attach/send               → completeSessionNames (vm + session)
image show/delete/promote                                → completeImageNameOnlyAtPos0
kernel show/rm                                           → completeKernelNameOnlyAtPos0
vm run/create --image, image pull/register --kernel-ref  → flag-value completion
Design notes in internal/cli/completion.go: completers never auto-start
the daemon (ping-check, bail with NoFileComp on miss), so tab-completion
stays a zero-cost probe. Variadic completers exclude already-entered
args to avoid duplicate suggestions.
README: install recipes for bash / zsh / fish.
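The two design notes (bail when the daemon is down, skip names already
on the command line) can be sketched without cobra. suggestVMNames and
excludeEntered are hypothetical helpers, not the functions in
internal/cli/completion.go:

```go
package main

import "fmt"

// excludeEntered drops candidates the user has already typed, so a
// variadic command does not suggest the same name twice.
func excludeEntered(candidates, entered []string) []string {
	seen := make(map[string]bool, len(entered))
	for _, e := range entered {
		seen[e] = true
	}
	out := make([]string, 0, len(candidates))
	for _, c := range candidates {
		if !seen[c] {
			out = append(out, c)
		}
	}
	return out
}

// suggestVMNames never starts the daemon: when the ping check failed
// it returns no suggestions and does not call list at all.
func suggestVMNames(daemonUp bool, list func() []string, entered []string) []string {
	if !daemonUp {
		return nil
	}
	return excludeEntered(list(), entered)
}

func main() {
	list := func() []string { return []string{"api", "db", "web"} }
	fmt.Println(suggestVMNames(true, list, []string{"db"}))
	fmt.Println(suggestVMNames(false, list, nil))
}
```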
Adds `make coverage` (per-package + total via -coverpkg=./...),
`make coverage-html`, and `make coverage-total` (CI-friendly). Wires
coverage.out/coverage.html through `make clean` and .gitignore.
Closes the two easy zero-coverage packages: namegen (77.8%) and
sessionstream (93.5%). Total statement coverage 51.7% -> 52.1%.
Adds docs/dns-routing.md covering how `<vm>.vm` resolution works:
auto-configuration on systemd-resolved hosts (what the daemon
already does), and per-resolver recipes for dnsmasq /
NetworkManager+dnsmasq / /etc/resolv.conf / macOS `/etc/resolver/`
/ WSL. Plus verification via `dig @127.0.0.1 -p 42069` and
troubleshooting for the common failure modes.
README reshape: lead with the three things a common user needs —
quick start, what `vm run` does, where to put hostnames + image +
config — and push the rest to docs. `vm create` / OCI `image pull`
/ `image register` / workspace-and-session primitives are all still
documented, just under docs/advanced.md where they're not in the
first-time reader's way. Web UI and unnecessary implementation
notes dropped; the "further reading" section at the bottom
enumerates the five docs pages so nothing becomes hard to find.
README shrinks from 208 → 158 lines.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bundle downloads can take 20–60s on a typical connection and the
CLI was going silent between "resolving daemon" and the final image
summary. Users wondered whether banger had wedged.
New `withHeartbeat` helper wraps an RPC call with a dot-every-2s
ticker on stderr. No-op when stderr isn't a terminal, so piped or
scripted invocations stay quiet. Wired into `image pull` and `kernel
pull`, the two commands that actually download bytes.
Example:
$ banger image pull debian-bookworm
[image pull] ..........
id name managed ...
Tests cover the non-TTY short-circuit and error propagation.
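A sketch of how such a helper could look. The signature, the explicit
isTTY parameter, and the countWriter test double are assumptions; the
real `withHeartbeat` may detect the terminal itself:

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"os"
	"time"
)

// heartbeatInterval is a var so a test or demo can shrink it.
var heartbeatInterval = 2 * time.Second

var errDemo = errors.New("rpc failed")

// countWriter counts bytes written; handy for asserting silence.
type countWriter struct{ n int }

func (c *countWriter) Write(p []byte) (int, error) {
	c.n += len(p)
	return len(p), nil
}

// withHeartbeat runs fn and prints a dot per interval while it runs.
// When the output is not a terminal it calls fn and writes nothing.
func withHeartbeat(w io.Writer, isTTY bool, label string, fn func() error) error {
	if !isTTY {
		return fn()
	}
	fmt.Fprintf(w, "[%s] ", label)
	done := make(chan struct{})
	finished := make(chan struct{})
	go func() {
		defer close(finished)
		t := time.NewTicker(heartbeatInterval)
		defer t.Stop()
		for {
			select {
			case <-t.C:
				fmt.Fprint(w, ".")
			case <-done:
				return
			}
		}
	}()
	err := fn()
	close(done)
	<-finished // ticker goroutine has stopped writing
	fmt.Fprintln(w)
	return err
}

func main() {
	heartbeatInterval = 10 * time.Millisecond // shrink for the demo
	err := withHeartbeat(os.Stderr, true, "image pull", func() error {
		time.Sleep(35 * time.Millisecond)
		return nil
	})
	fmt.Println("err:", err)
}
```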
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `internal/opencode` package and the `opencodeCapability` that
consumed it were hard-wired to wait for opencode on guest port 4096
when an image shipped an initrd. After the prune commits (void /
alpine / customize.sh / image build all removed), nothing banger
produces today carries an initrd, so the capability's wait path was
unreachable: every startup short-circuited to the "direct-boot, skip
opencode" branch.
Same logic for `banger vm acp`: it runs `opencode acp --cwd <path>`
over SSH, and the golden image no longer ships that binary. Users who run
their own image with opencode can still invoke
`ssh vm -- opencode acp --cwd /root/repo` directly — no banger
scaffolding required.
Removed:
- internal/opencode/ (whole package, 255 LOC incl. tests)
- internal/daemon/opencode.go (opencodeCapability)
- cli `vm acp` command + its helpers (runVMACP, sshACPCommandArgs,
vmACPRemoteCommand) + their tests
- The opencodeCapability{} entry in registeredCapabilities() plus
the test that pinned its presence
- `wait_opencode` progress-stage label from the vm-create renderer
- Stale mentions in daemon/doc.go, README, and webui test fixtures
~480 lines gone, 12 added. `banger/internal` is now 25 packages
instead of 26.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the three hardcoded host→guest credential syncs (opencode,
claude, pi) with a generic `[[file_sync]]` config list. Default is
empty — users opt in to exactly what they want synced, with no
surprise about which tools banger "supports".
```toml
[[file_sync]]
host = "~/.local/share/opencode/auth.json"
guest = "~/.local/share/opencode/auth.json"
[[file_sync]]
host = "~/.aws" # directories are copied recursively
guest = "~/.aws"
[[file_sync]]
host = "~/bin/my-script"
guest = "~/bin/my-script"
mode = "0755" # optional; default 0600 for files
```
Semantics:
- Host `~/...` expands against the host user's $HOME. Absolute host
paths are used as-is.
- Guest must live under `~/` or `/root/...` — banger's work disk is
mounted at /root in the guest, so that's the syncable namespace.
Anything outside is rejected at config load.
- Validation at config load: reject empty paths, relative paths,
`..` traversal, `~user/...`, malformed mode strings. Errors name
the offending entry index.
- Missing host paths are a soft skip with a warn log (existing
behaviour). Other errors (read, mkdir, install) abort VM create.
- File entries: `install -o 0 -g 0 -m <mode>` (default 0600).
- Directory entries: walked in Go; each source file is installed
with its own source permissions preserved. The entry's `mode` is
ignored for directories.
Removed (all dead after this):
- `ensureOpencodeAuthOnWorkDisk`, `ensureClaudeAuthOnWorkDisk`,
`ensurePiAuthOnWorkDisk`, the shared `ensureAuthFileOnWorkDisk`,
their `warn*Skipped` helpers, `resolveHost{Opencode,Claude,Pi}AuthPath`,
and the work-disk relative-path + default display-path constants.
- The capability hook registering the three syncs now calls the
generic `runFileSync` once.
Seven tests exercising the old codepath deleted; six new tests cover
the new runFileSync (no-op on empty config, file copy, custom mode,
missing-host-skip, overwrite, recursive directory). Config-layer
test adds happy-path parsing and a case-per-shape table of invalid
entries (empty, relative host, guest outside /root, '..' traversal,
`~user`, bad mode).
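A sketch of the guest-path and mode checks from the Semantics list.
validateGuestPath and parseMode are hypothetical names, the error
wording is made up, and the 0o7777 upper bound on modes is an
assumption about what "malformed" covers:

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// validateGuestPath accepts only paths under ~/ or /root/ and rejects
// empty paths, ~user forms, and any ".." segment.
func validateGuestPath(p string) error {
	if p == "" {
		return errors.New("guest path is empty")
	}
	if strings.HasPrefix(p, "~") && !strings.HasPrefix(p, "~/") {
		return fmt.Errorf("guest path %q: ~user expansion is not supported", p)
	}
	for _, seg := range strings.Split(p, "/") {
		if seg == ".." {
			return fmt.Errorf("guest path %q: '..' traversal", p)
		}
	}
	if !strings.HasPrefix(p, "~/") && !strings.HasPrefix(p, "/root/") {
		return fmt.Errorf("guest path %q: must live under ~/ or /root/", p)
	}
	return nil
}

// parseMode reads an octal mode string; empty means the 0600 default.
func parseMode(s string) (uint32, error) {
	if s == "" {
		return 0o600, nil
	}
	m, err := strconv.ParseUint(s, 8, 32)
	if err != nil || m > 0o7777 {
		return 0, fmt.Errorf("malformed mode %q", s)
	}
	return uint32(m), nil
}

func main() {
	for _, p := range []string{"~/.aws", "/root/bin/x", "/etc/passwd", "~/../etc", "~bob/x"} {
		fmt.Printf("%-14q -> %v\n", p, validateGuestPath(p))
	}
	m, err := parseMode("0755")
	fmt.Printf("mode %04o err=%v\n", m, err)
}
```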
README updated: replaces the "Credential sync" section with a
"File sync" section showing the new config shape.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "repairing SSH access on work disk" stage detail sounded
remedial, like something had gone wrong. It's just writing banger's
SSH key to /root/.ssh/authorized_keys on the work disk for the first
time.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The deferred --rm delete fires AFTER runSSHSession returns, but
runSSHSession prints "vm X is still running (stop with ...)" before
returning. Net effect: the user sees the reminder, then the VM gets
deleted behind it — misleading.
Thread a skipReminder bool into runSSHSession. `vm run` passes the
same value as removeOnExit; other callers (`vm ssh`) pass false.
Reinforced by a new assertion in the --rm happy-path test that the
reminder string never appears in stderr.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New `--rm` flag deletes the VM once the ssh session or `-- cmd`
exits, making `vm run` one-shot. Exit code from command mode still
propagates correctly.
Semantics:
- Create fails → no VM to delete, nothing to do.
- SSH-wait timeout → VM intentionally kept alive so `vm logs <name>`
shows why; the timeout error already pointed users at that. Even
with --rm, this path skips delete — a wedged sshd is exactly when
you want post-mortem access.
- Session/command ends (any exit code, any reason) → VM is deleted
via `vm.delete` RPC. Uses a fresh 10s context so Ctrl-C during the
session doesn't abort the cleanup.
New vmDeleteFunc seam at the top of banger.go alongside the other
RPC seams. Two tests cover the happy path (session ends cleanly →
delete fires with correct ref) and the skip-on-timeout path (ssh
wait errors → delete does NOT fire).
README updated with an ephemeral example and a note about the
timeout-skip behaviour.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Before: `guestWaitForSSHFunc` loops forever bounded only by context
cancellation, so if sshd fails to start in the guest `vm run` hangs
indefinitely — which burned a long debugging session during the
golden-image bring-up.
After: the ssh wait gets its own 90s deadline. On guest-side timeout
the error names the VM, explains sshd is the likely suspect, points
at `banger vm logs <name>` for the console output, and notes the VM
is still alive for inspection (or `vm delete` to clean up). Parent
context cancellation (Ctrl-C, caller timeout) still surfaces as-is
without the hint.
`vmRunSSHTimeout` is a var rather than a const so tests can shrink
it; the new TestRunVMRunSSHTimeoutReturnsActionableError sets it to
50ms and asserts the error message contains the actionable bits.
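A sketch of the deadline split: the wait gets its own timeout, and the
hint is attached only when that timeout (not the parent context) ended
the wait. waitForSSH, sshWaitTimeout, blockUntilDone, and demo are
hypothetical names, and the error text is paraphrased:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
	"time"
)

// sshWaitTimeout is a var rather than a const so tests can shrink it.
var sshWaitTimeout = 90 * time.Second

func waitForSSH(parent context.Context, vmName string, wait func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(parent, sshWaitTimeout)
	defer cancel()
	err := wait(ctx)
	if err == nil {
		return nil
	}
	// Ctrl-C or a caller deadline: surface as-is, without the hint.
	if parent.Err() != nil {
		return err
	}
	if errors.Is(err, context.DeadlineExceeded) {
		return fmt.Errorf("vm %s: ssh did not come up within %s (sshd is the likely suspect); "+
			"run `banger vm logs %s` for console output; the vm is still running: %w",
			vmName, sshWaitTimeout, vmName, err)
	}
	return err
}

// blockUntilDone simulates a guest whose sshd never starts.
func blockUntilDone(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// demo reports whether the returned error carries the hint.
func demo(cancelParent bool) (bool, error) {
	sshWaitTimeout = 20 * time.Millisecond
	parent, cancel := context.WithCancel(context.Background())
	defer cancel()
	if cancelParent {
		cancel()
	}
	err := waitForSSH(parent, "devbox", blockUntilDone)
	return err != nil && strings.Contains(err.Error(), "banger vm logs"), err
}

func main() {
	hint, err := demo(false)
	fmt.Println("guest timeout: hint =", hint, "| err =", err)
	hint, err = demo(true)
	fmt.Println("parent cancel: hint =", hint, "| err =", err)
}
```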
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `image build` flow spun up a transient Firecracker VM, SSHed in,
and ran a large bash provisioning script to derive a new managed
image from an existing one. It overlapped heavily with the golden-
image Dockerfile flow (same mise/docker/tmux/opencode install logic
duplicated in Go as `imagemgr.BuildProvisionScript`) and had far more
machinery: async op state, RPC begin/status/cancel, webui form +
operation page, preflight checks, API types, tests. For custom
images, writing a Dockerfile is simpler and more reproducible.
Removed end-to-end:
- CLI `image build` subcommand + `absolutizeImageBuildPaths`.
- Daemon: BuildImage method, imagebuild.go (transient-VM orchestration),
image_build_ops.go (async begin/status/cancel), imagemgr/build.go
(the 247-line provisioning script generator and all its append*
helpers), validateImageBuildPrereqs + addImageBuildPrereqs.
- RPC dispatches for image.build / .begin / .status / .cancel.
- opstate registry `imageBuildOps`, daemon seam `imageBuild`,
background pruner call.
- API types: ImageBuildParams, ImageBuildOperation, ImageBuildBeginResult,
ImageBuildStatusParams, ImageBuildStatusResult; model type
ImageBuildRequest.
- Web UI: Backend interface methods, handlers, form, routes, template
branches (images.html build form, operation.html build branch,
dashboard.html Build button).
- Tests that directly exercised BuildImage.
Doctor polish (task C):
- Drop the "image build" preflight section entirely (its raison d'être
is gone).
- Default-image check now accepts "not local but in imagecat" as OK:
vm create auto-pulls on first use. Only flag when the image is
neither locally registered nor in the catalog.
Net: 24 files touched, 1,373 lines deleted, 25 added.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Accidentally staged into the prior prune commit by `git add -A`.
It's a local scratch file the maintainer keeps in the repo root.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The golden-image Dockerfile + catalog pipeline replaces the entire
manual rootfs-build stack. With that shipped, the per-distro shell
flows are dead code.
Removed:
- scripts/customize.sh, scripts/interactive.sh, scripts/verify.sh
- scripts/make-rootfs{,-void,-alpine}.sh
- scripts/register-{void,alpine}-image.sh
- scripts/make-{void,alpine}-kernel.sh
- internal/imagepreset/ (only consumer was `banger internal packages`,
which fed customize.sh)
- examples/{void,alpine}.config.toml
- Makefile targets: rootfs, rootfs-void, rootfs-alpine, void-kernel,
alpine-kernel, void-register, alpine-register, void-vm, alpine-vm,
verify-void, verify-alpine, plus the ALPINE_RELEASE / *_IMAGE_NAME
/ *_VM_NAME variables
The void-6.12 kernel catalog entry is also gone — golden image pairs
with generic-6.12 and nothing else in the catalog depended on it.
Consolidated: imagemgr now holds the small DebianBasePackages list +
package-hash helper inline, so the `image build --from-image` flow
(still supported) no longer pulls from a separate imagepreset package.
Net: 3,815 lines deleted, 59 added. No runtime functionality removed
beyond the `banger internal packages` subcommand (hidden, used only
by the deleted customize.sh).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lead the README with `banger vm run` (one command, auto-pull default
image + kernel from the catalogs) and move `image register` / `image
build` / OCI-pull to a "power-user flows" section. Golden-image
content from customize.sh moves to the golden-image Dockerfile story.
New `docs/image-catalog.md` mirrors `docs/kernel-catalog.md` — the
bundle format, content-addressed filenames, publish flow, trust
model, R2 hosting. Cross-links with oci-import.md.
`docs/oci-import.md` refactored to document the OCI-pull path as the
fallthrough for arbitrary registry refs (it's the secondary path now
that the catalog covers the headline debian-bookworm case). Phase A
caveats removed — ownership fixup, agent injection, and first-boot
sshd install all landed.
AGENTS.md: promotes `vm run` as the smoke-test primitive, notes the
default-image auto-pull behaviour, and points at both catalog docs.
README shrinks 330 → 198 lines, mostly by removing the experimental
void/alpine sections (those flows still work as advanced scripts but
the README no longer advertises them).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Embed the sha256 prefix in the uploaded filename so every rebuild
lives at a unique URL. Cloudflare's edge cache (and any similar CDN
in front of R2) can never serve stale bytes for the URL the catalog
points at. The R2 console offers no per-URL purge for this bucket
layout, so making the URL itself content-addressed is the only
durable fix.
Also republishes the debian-bookworm catalog entry with the new
filename.
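A minimal sketch of the naming idea, assuming the uploaded object is
named `<base>-<sha256 prefix>.<ext>`. The helper name, the prefix
length, and the exact layout are illustrative assumptions, not the
publish script's actual format:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// contentAddressedName is a hypothetical helper: it returns
// "<base>-<first n hex chars of sha256(data)>.<ext>", so any change in
// the bytes yields a different filename and therefore a different URL.
func contentAddressedName(base, ext string, data []byte, n int) string {
	sum := sha256.Sum256(data)
	hexSum := hex.EncodeToString(sum[:])
	if n < 0 {
		n = 0
	}
	if n > len(hexSum) {
		n = len(hexSum)
	}
	return fmt.Sprintf("%s-%s.%s", base, hexSum[:n], ext)
}

func main() {
	// Two different payloads stand in for two rebuilds of the bundle.
	fmt.Println(contentAddressedName("debian-bookworm-x86_64", "tar.zst", []byte("rebuild 1"), 12))
	fmt.Println(contentAddressedName("debian-bookworm-x86_64", "tar.zst", []byte("rebuild 2"), 12))
}
```

Because the catalog entry points at the hashed name, a CDN that still
caches an older object cannot shadow the new one.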
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
One-command sandbox: `banger vm run` on a fresh host now Just Works.
No prior `banger image pull` or `banger kernel pull` needed.
Changes:
- Default `default_image_name` flips from "default" to "debian-bookworm"
so the golden image is the implicit target when `--image` is omitted.
- `CreateVM` resolves the image via a new `findOrAutoPullImage`: try
the local store first, and on miss fall back to the embedded imagecat
catalog + auto-pull. Emits a vm-create progress stage so the user
sees "pulling from image catalog" in the create output.
- `resolveKernelInputs` gains context + the same pattern via
`readOrAutoPullKernel`: try the local kernelcat, and on miss look up
the embedded kernelcat and auto-pull. Fires whenever a bundle's
manifest references a kernel the user hasn't pulled yet, not just
during image pull — any CreateVM with an image that needs a kernel
not yet local will resolve it.
- `--image` help text updated on both `vm run` and `vm create`.
Six tests cover local-hit-no-pull, auto-pull-on-miss, not-in-catalog
error propagation, and the case where a non-ENOENT kernel read error
must NOT trigger a misleading "not in catalog" claim.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end verified:
  banger image pull debian-bookworm
  banger vm run --image debian-bookworm --name goldenvm
boots through multi-user.target, sshd starts, and vm run drops into
an interactive ssh session.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three fixes discovered during end-to-end boot testing on Firecracker:
- Install udev + dbus alongside systemd. Both are Recommends of the
systemd package, skipped by --no-install-recommends. Without udev,
systemd never activates device units (dev-vdb.device stays inactive
even after the kernel enumerates /dev/vdb) and the work-disk mount
hangs forever. dbus is required by a growing set of services
(logind, systemd-resolved shim, etc.).
- Ship /usr/lib/tmpfiles.d/sshd.conf creating /run/sshd. Debian's
openssh-server package doesn't ship one, and ssh.service's own
RuntimeDirectory=sshd fires too late for the ExecStartPre config
check, which blows up with 'Missing privilege separation directory'.
The tmpfiles entry runs in systemd-tmpfiles-setup.service well
before ssh.service starts.
- Rewrite the ssh.service drop-in to reset the main unit's
ExecStartPre list. Debian ships `sshd -t` as ExecStartPre #1; that
fails without host keys and terminates the service before our
`ssh-keygen -A` fires. Reset + re-add in the correct order: mkdir,
keygen, then the test.
Also set StandardOutput/StandardError=journal+console on ssh.service
so future sshd failures surface in the firecracker console log too,
not only in the (unreachable) guest journal.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`docker create` drops /.dockerenv into the container's writable layer,
and `docker export` includes it in the tar. When systemd later boots
that rootfs it finds /.dockerenv and flags virtualization=docker,
which disables a bunch of udev device-unit behaviour (device units
never become active, mount units waiting on them hang forever).
Strip /.dockerenv (and /run/.containerenv for podman symmetry) from
the staging tree after FlattenTar and before BuildExt4 so systemd
correctly detects virtualization=kvm.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mkfs.ext4's positional fs-size is documented in 1 KiB units (not the
filesystem's 4 KiB block size), so passing sizeBytes/4096 made
filesystems 1/4 the intended size. A 4 GiB request became a 1 GiB
ext4 in a 4 GiB file, packed to 0 free blocks — VM create then failed
with 'Could not allocate block' when patchRootOverlay tried to write
guest config.
The file is truncated to the target size before mkfs runs; without
the positional arg, mkfs uses the whole device.
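The unit mismatch, worked through for the 4 GiB case. This only
illustrates the arithmetic; the fix in this commit drops the
positional argument rather than converting it, and `mkfsSizeArgKiB`
is a hypothetical helper showing what a correct conversion would be:

```go
package main

import "fmt"

const (
	kib = 1 << 10
	gib = 1 << 30
)

// mkfsSizeArgKiB converts a byte count to the unsuffixed fs-size
// value, which mkfs.ext4 reads in 1 KiB units when -b is not given.
func mkfsSizeArgKiB(sizeBytes int64) int64 {
	return sizeBytes / kib
}

func main() {
	want := int64(4 * gib)

	// Old behaviour: divide by the 4 KiB block size, but mkfs reads
	// the result as a count of 1 KiB units.
	buggyArg := want / 4096
	fmt.Printf("buggy arg:   %d -> %d bytes\n", buggyArg, buggyArg*kib)

	// A correct conversion, had the argument been kept.
	goodArg := mkfsSizeArgKiB(want)
	fmt.Printf("correct arg: %d -> %d bytes\n", goodArg, goodArg*kib)
}
```

The buggy argument is 1048576, which mkfs reads as 1 GiB; the correct
one would be 4194304.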
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both masks were added when the direct-boot path first landed for
container rootfses that didn't have anything mounted on /dev/vdb. The
golden image (and any pulled OCI image running under banger's
patchRootOverlay) has an /etc/fstab entry mounting /dev/vdb at /root —
masking dev-vdb.device makes systemd wait forever for a unit that can
never become active, and the work-disk mount never completes. dev-ttyS0
is a real serial console the image needs too. Drop both.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First entry in the image catalog. Verified end-to-end:
- https://images.thaloco.com/debian-bookworm-x86_64.tar.zst reachable
- sha256 071495e6... matches
- bundle unpacks to rootfs.ext4 (4 GiB) + manifest.json with the
expected name/distro/arch/kernel_ref.
publish-golden-image.sh tweaks:
- default RCLONE_REMOTE from 'r2' to 'banger-images' (matches the
rclone config actually in use here).
- rclone copyto now passes --s3-no-check-bucket and --no-check-dest
so scoped R2 tokens without HeadBucket/HeadObject permission
still upload cleanly.
To use: restart bangerd so it picks up the new embedded catalog,
then `banger image pull debian-bookworm`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PullImage now checks the embedded imagecat catalog first. If the
ref matches a catalog entry, it takes the bundle path:
1. Fetch the .tar.zst bundle into a staging dir (rootfs.ext4 +
manifest.json).
2. Strip manifest.json (staging-only metadata).
3. Stage kernel/initrd/modules alongside rootfs.ext4.
4. Publish the staging dir and upsert the image row.
Bundle rootfs is already flattened + ownership-fixed + agent-
injected at build time, so the daemon-side work is strictly I/O —
no flatten, no mkfs, no debugfs.
Kernel resolution in the bundle path: --kernel-ref > entry.kernel_ref
> --kernel/--initrd/--modules.
If the ref doesn't match a catalog entry, PullImage falls through
to the existing OCI path unchanged (extracted into pullFromOCI).
New test seam: d.bundleFetch. Six unit tests cover happy path,
--kernel-ref override, existing-name rejection, kernel-required
error, fetch-failure cleanup, and the catalog → OCI fallthrough.
CLI help updated: image pull now documents both forms and takes
<name-or-oci-ref> instead of requiring an OCI ref.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the OCI-push flow with a bundle-based one that mirrors the
kernel catalog (publish-kernel.sh / kernelcat).
- scripts/make-golden-bundle.sh: docker build → docker create → docker
export | banger internal make-bundle → .tar.zst. Defaults target
debian-bookworm / generic-6.12 / x86_64; pinned --size 4G to leave
headroom for first-boot installs and in-VM apt use.
- scripts/publish-golden-image.sh: rewritten to call make-golden-bundle,
rclone upload to R2 (banger-images bucket, images.thaloco.com), and
jq-patch internal/imagecat/catalog.json with URL / sha256 / size.
--skip-upload stops after bundle build and copies to dist/.
make-bundle default ext4 sizing also bumped from +25% to +50% headroom
(mkfs.ext4 needs room for inode tables, block-group metadata, journal,
and the default 5% reserved-blocks margin). The old 25% was too tight
for the ~950 MB golden rootfs and aborted with "Could not allocate
block".
End-to-end smoke (local): golden Dockerfile → 286 MB tar.zst bundle
with correct manifest, valid ext4, and all banger units + vsock agent
present.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
main.go previously unwrapped *any* error implementing `ExitCode() int`
into the process exit status, which matched *exec.ExitError too. So
whenever a CLI command ran a subprocess (mkfs.ext4, debugfs, ssh to a
daemon preflight, etc.) and that subprocess failed, the CLI would
silently exit with the subprocess's code — no error message printed.
Surfaced while bringing up `banger internal make-bundle`: mkfs.ext4
was failing on an undersized ext4 and the user saw only `EXIT=1`.
Fix: export the type as `cli.ExitCodeError` and unwrap against the
concrete type in main.go. The `ExitCode()` method is gone — only the
explicit wrap at the `vm run` command-mode call site produces this
error now.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New hidden subcommand that turns a `docker export`-style rootfs tar
into a banger bundle (`rootfs.ext4` + `manifest.json`, tar+zstd):
1. FlattenTar (new in imagepull) extracts the stream into a staging
dir while capturing per-file uid/gid/mode into a Metadata record.
2. imagepull.BuildExt4 produces the ext4 via `mkfs.ext4 -d`.
3. imagepull.ApplyOwnership re-applies the captured metadata with
`debugfs sif` so setuid/root-owned files keep their identity.
4. imagepull.InjectGuestAgents drops the vsock agent + network
bootstrap + first-boot service into the ext4.
5. manifest.json is written with name/distro/arch/kernel_ref.
6. Both files are packaged as .tar.zst with max compression.
Flags: --rootfs-tar (file or '-' for stdin), --name, --distro, --arch,
--kernel-ref, --description, --size, --out. Stdout prints bundle path,
sha256, and size so callers can patch the catalog.
Unit tests cover flag registration, required-arg validation, the
bundle tar round-trip, sha256HexFile, and dirSize. An end-to-end test
runs the full pipeline against a synthesized tiny rootfs tar; skips
gracefully when mkfs.ext4 / debugfs / companion binaries are missing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New package mirroring `kernelcat`: catalog + SHA256-verified HTTP
fetch of `.tar.zst` bundles that contain rootfs.ext4 + manifest.json.
The embedded catalog ships empty (version:1, entries:[]), so nothing
is pullable via the bundle path yet; wiring into `banger image pull`
lands in a later phase.
- catalog.go: Catalog/CatEntry, LoadEmbedded, ParseCatalog, Lookup,
ValidateName.
- fetch.go: Fetch(ctx, client, destDir, entry) downloads the bundle,
verifies sha256, extracts exactly rootfs.ext4 and manifest.json
into destDir, returns the parsed manifest. Rejects unexpected tar
entries, unsafe paths, non-regular files, and cleans up partial
writes on failure.
- Thirteen unit tests (happy path + every failure mode).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Debian bookworm with two clearly-labeled sections:
- ESSENTIAL: systemd, openssh-server, ca-certificates, curl, iproute2.
- OPINION: git, jq, ripgrep, fd, build-essential, shellcheck, mise,
Docker CE (+ Compose v2 + buildx), tmux, htop, and friends.
Per-VM identity stripped at build time: /etc/machine-id cleared,
SSH host keys removed with a ssh.service drop-in that runs
`ssh-keygen -A` on first start so each VM gets a unique set.
The script is a parameterized wrapper around `docker build`; it also
supports `--push` to an OCI registry, which will be removed once the
bundle pipeline is in place.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`vm run` now covers bare sandbox (no args), workspace sandbox (path),
and workspace+command (path -- cmd) in a single entry point. Replaces
the old print-next-steps-and-exit behaviour: bare and workspace modes
drop into interactive ssh, command mode execs via ssh and propagates
the remote exit code through banger's own exit status.
- path argument is optional; --branch / --from still require a path.
- workspace prep and mise tooling bootstrap only run when a path is
given; command mode skips the bootstrap.
- remote command exit status is wrapped as exitCodeError so main() can
propagate it instead of collapsing every failure to 1.
- README: promote vm run with three-mode examples; demote vm create
to a scripting primitive.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the full arc: banger kernel pull + image pull + vm create + vm ssh
now works end-to-end against docker.io/library/debian:bookworm with zero
manual image building.
Generic kernel:
- New scripts/make-generic-kernel.sh builds vmlinux from upstream
kernel.org sources using Firecracker's official minimal config
(configs/firecracker-x86_64-6.1.config). All critical drivers
(virtio_blk, virtio_net, ext4, vsock) compiled in — no modules,
no initramfs needed.
- Published as generic-6.12 in the catalog (kernels.thaloco.com).
- catalog.json updated with the new entry.
Direct-boot init= override (vm_lifecycle.go):
- For images without an initrd (direct-boot / OCI-pulled), banger now
passes init=/usr/local/libexec/banger-first-boot on the kernel
cmdline. The script runs as PID 1, mounts /proc /sys /dev /run,
checks for systemd — if present execs it immediately; if not
(container images), installs systemd-sysv + openssh-server via the
guest's package manager, then execs systemd.
- Also passes kernel-level ip= parameter via BuildBootArgsWithKernelIP
so the kernel configures the network interface before init runs
(container images don't ship iproute2, so the userspace bootstrap
script can't call ip(8)).
- Masks dev-ttyS0.device and dev-vdb.device systemd units that
otherwise wait 90s for udev events that never fire in Firecracker
guests started from container rootfses.
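For reference, the kernel's ip= parameter has the documented form
ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf>.
A minimal sketch of building it, with a hypothetical helper name and
example addresses; this is not BuildBootArgsWithKernelIP's actual
signature:

```go
package main

import "fmt"

// kernelIPArg (hypothetical) formats a static ip= boot argument.
// The server-ip field is left empty and autoconf is "off", so the
// kernel configures the interface without DHCP/BOOTP.
func kernelIPArg(clientIP, gatewayIP, netmask, hostname, device string) string {
	return fmt.Sprintf("ip=%s::%s:%s:%s:%s:off", clientIP, gatewayIP, netmask, hostname, device)
}

func main() {
	// Example values only; the real addresses come from banger's
	// per-VM network allocation.
	fmt.Println(kernelIPArg("172.16.0.2", "172.16.0.1", "255.255.255.252", "testbox", "eth0"))
}
```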
first-boot.sh rewritten as universal init wrapper:
- Works as PID 1 (mounts essential filesystems) OR as a systemd
oneshot (existing behavior).
- Installs both systemd-sysv AND openssh-server (container images
have neither).
- Dispatch updated: debian, alpine, fedora, arch, opensuse families
+ ID_LIKE fallback. All tests updated.
Opencode capability skip for direct-boot images:
- The opencode readiness check (WaitReady on vsock port 4096) now
returns nil for images without an initrd, since pulled container
images don't ship the opencode service. Without this, the VM
would be marked as error for lacking an opinionated add-on.
Docs: README and kernel-catalog.md updated to recommend generic-6.12
as the default kernel for OCI-pulled images. AGENTS.md notes the new
build script.
Verified live:
- banger kernel pull generic-6.12
- banger image pull docker.io/library/debian:bookworm --kernel-ref generic-6.12
- banger vm create --image debian-bookworm --name testbox --nat
- banger vm ssh testbox -- "id; uname -r; systemctl is-active banger-vsock-agent"
→ uid=0(root), kernel 6.12.8, Debian bookworm, vsock-agent active,
sshd running, SSH working.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
docs/oci-import.md: removed the "Phase A acquisition-only" framing
and the bootability-gap warnings. Expanded architecture section
with ApplyOwnership + InjectGuestAgents. Added a "guest-side boot
sequence" diagram-in-prose showing network → first-boot → vsock-
agent unit ordering. Added a "how to add distro support" section
pointing at the ID-case dispatch in first-boot.sh.
README.md: replaced the experimental-caveat block with an honest
"boots as a banger VM directly, no image build step required"
description. Pointer to the docs for distro support details.
Tech-debt list trimmed — ownership fixup and first-boot install
are no longer planned work, they shipped. What remains: private-
registry auth (authn.DefaultKeychain), cache eviction, first-boot
timeout UX (retry still works but could be smoother with a
FirstBootPending flag), non-systemd distros.
All 20 packages green. make lint clean.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>