banger

Author	SHA1	Message	Date
Thales Maciel	362009d747	daemon split (1/5): extract HostNetwork service First phase of splitting the daemon god-struct into focused services with explicit ownership. HostNetwork now owns everything host-networking: the TAP interface pool (initializeTapPool / ensureTapPool / acquireTap / releaseTap / createTap), bridge + socket dir setup, firecracker process primitives (find/resolve/kill/wait/ensureSocketAccess/sendCtrlAltDel), DM snapshot lifecycle, NAT rule enforcement, guest DNS server lifecycle + routing setup, and the vsock-agent readiness probe. That's 7 files whose receivers flipped from Daemon to HostNetwork, plus a new host_network.go that declares the struct, its hostNetworkDeps, and the factored firecracker + DNS helpers that used to live in vm.go. Daemon gives up the tapPool and vmDNS fields entirely; they're now HostNetwork's business. Construction goes through newHostNetwork in Daemon.Open with an explicit dependency bag (runner, logger, config, layout, closing). A lazy-init hostNet() helper on Daemon supports test literals that don't wire net explicitly — production always populates it eagerly. Signature tightenings where the old receiver reached into VM-service state: - ensureNAT(ctx, vm, enable) → ensureNAT(ctx, guestIP, tap, enable). Callers resolve tap from the handle cache themselves. - initializeTapPool(ctx) → initializeTapPool(usedTaps []string). Daemon.Open enumerates VMs, collects taps from handles, hands the slice in. rebuildDNS stays on Daemon as the orchestrator — it filters by vm-alive (a VMService concern handles will move to in phase 4) then calls HostNetwork.replaceDNS with the already-filtered map. Capability hooks continue to take Daemon; they now use it as a facade to reach services (d.net.ensureNAT, d.hostNet().). Planned CapabilityHost interface extraction is orthogonal, left for later. 
Tests: dns_routing_test.go + fastpath_test.go + nat_test.go + snapshot_test.go + open_close_test.go were touched to construct HostNetwork literals where they exercise its methods directly, or route through d.hostNet() where they exercise the Daemon entry points. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 20:11:46 -03:00
Thales Maciel	eba9a553bf	daemon: use exact-name lookup for VM-create uniqueness reserveVM's duplicate-name guard routed through Daemon.FindVM, which falls back to prefix-matching on both ids and names when no exact match is found. That turns the uniqueness check into a correctness bug: a brand-new VM name can be rejected because it happens to prefix an existing VM's id, or an existing VM's name. So `vm create --name beta` fails when `beta-sandbox` already exists. Swap in a dedicated store.GetVMByName that does a literal `WHERE name = ?` lookup, and use it from reserveVM. FindVM keeps its prefix-matching behaviour for user-facing lookup paths (`vm ssh <partial>`, `vm stop <partial>`) where "did you mean" semantics are the feature. Tests: - TestReserveVMAllowsNameThatPrefixesExistingVM — seeds a VM whose id + name both start with "longname", then reserves two new VMs named "longname" and "longname-sandbox". Both must succeed. Under the old FindVM-based check, both would fail. - TestReserveVMRejectsExactDuplicateName — actual collisions are still rejected after the swap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 14:00:33 -03:00
Thales Maciel	108f7a0600	ssh-config: make the `ssh <name>.vm` shortcut opt-in Before this change, every daemon.Open() wrote a Host *.vm stanza into ~/.ssh/config in a marker-fenced block. That's a real footgun for users who manage their SSH config declaratively (chezmoi, dotfiles, NixOS): banger was mutating host state outside its own directory on every daemon start, easy to miss and hard to audit. New contract: the daemon only ever writes its own ssh_config file at ~/.config/banger/ssh_config. ~/.ssh/config is untouched unless the user opts in. `banger vm ssh <name>` still works out of the box; the shortcut only matters for plain `ssh sandbox.vm` from any terminal. The opt-in surface is `banger ssh-config`: `banger ssh-config` prints path + include-line + install/uninstall hints; `banger ssh-config --install` adds `Include <bangerConfig>` to ~/.ssh/config inside a marker-fenced block (idempotent; migrates any legacy inline Host *.vm block from pre-opt-in builds); `banger ssh-config --uninstall` removes the new Include block AND any legacy inline block. Doctor gains a gentle warn-level note when banger's ssh_config exists but the user hasn't wired it in; this is not a fail, since the shortcut is convenience and `banger vm ssh` covers the essential case. Tests cover: daemon writes banger file and does NOT touch ~/.ssh/config, Install adds the block, Install is idempotent, Install migrates the legacy inline block cleanly (removing it, preserving unrelated entries, adding the new Include block), Uninstall removes both marker variants, Uninstall is a no-op when ~/.ssh/config is absent, and UserSSHIncludeInstalled detects both marker shapes. README reframes the feature as optional convenience. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 13:57:26 -03:00
Thales Maciel	99d0811097	daemon: shrink createVMMu + imageOpsMu to reservation/publication windows Before: createVMMu was held across the whole of CreateVM — including image resolution (which could fire a full auto-pull) and startVMLocked (boot of multiple seconds). imageOpsMu was held across the whole of PullImage/RegisterImage/PromoteImage/DeleteImage, so any slow OCI pull, bundle download, or file copy blocked every other image mutation and every other VM create that needed to auto-pull. The async create API bought nothing if all creates serialised on the same mutex. CreateVM is now three phases: 1. Validate + resolve image (possibly auto-pulling). No global lock. 2. reserveVM: take createVMMu only long enough to re-check the name is free, allocate the next guest IP, and UpsertVM the "created" row. Milliseconds. 3. startVMLocked: run the full boot flow under the per-VM lock only. Parallel creates of different VMs now overlap on image resolution + boot; they contend only across the reservation claim. For the image surface a new publishImage helper isolates the commit atom (recheck name free, atomic rename stagingDir→finalDir, UpsertImage) under imageOpsMu. pullFromBundle + pullFromOCI do their network fetch + ext4 build + ownership fixup + agent injection outside the lock; Register moves validation + kernel resolution outside; Promote moves file copy + SSH-key seeding outside; Delete keeps a brief lock over the lookup + reference check + store delete and does file cleanup unlocked. Two concurrency tests assert the new behaviour: - TestPullImageDoesNotSerialiseOnDifferentNames fails the old code (second pull blocks on imageOpsMu and never reaches the body). - TestPullImageRejectsNameClashAtPublish confirms the publish-window recheck is what enforces name uniqueness now that the body runs unlocked — exactly one winner. ARCHITECTURE.md updated to describe the new scope explicitly instead of calling the locks "narrow". 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 13:44:22 -03:00
Thales Maciel	58464ac28c	docs + doctor: be honest about amd64-only support The README sold the product as "Linux with /dev/kvm"; the deeper docs admit that the Makefile pins companion builds to GOARCH=amd64, the kernel catalog ships only x86_64 entries, and OCI import pulls linux/amd64 layers. arm64 users who show up through the README only discover that after install fails in non-obvious ways. Two surface-level fixes: - README requirements list leads with "x86_64 / amd64 Linux — arm64 is not supported today", with a short note on the three places that assumption lives so users understand it's not a last-mile gap. - `banger doctor` now runs an architecture check that passes on amd64 and FAILS (not warns) on anything else, referencing the three downstream assumptions. Hard-fail rather than warn so a user on an arm64 machine doesn't waste time chasing unrelated preflight items. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 13:03:50 -03:00
Thales Maciel	e69810610a	daemon: correct ARCHITECTURE doc to match actual package shape + lock scope Two promises the doc was making that the code doesn't keep: 1. "Helpers moved out so the package stays focused on orchestration." The package still has ~29 files and ~130 func (d *Daemon) methods wiring VM lifecycle, image management, host networking, background reconciliation, and JSON-RPC dispatch. Calling it "just orchestration" sets readers up for surprise. Rewrite the subpackages preamble to say so, and flag the service split as a post-v0.1.0 project. 2. "vmLocks[id] is held only across short synchronous state validation and DB mutations." That's what workspace.prepare does; regular lifecycle ops (start/stop/delete/set) go through withVMLockByRef and hold the lock across the whole callback body, which for `start` means preflight + bridge + firecracker spawn + post-boot wiring. Rewrite the vmLocks bullet and the lock-ordering section to say that explicitly, so readers don't build "surely my long flow under the lock can't be what the doc means" reasoning on top of a false premise. Doc-only change. Code behaviour is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 13:02:36 -03:00
Thales Maciel	34dd7644d8	store: introduce versioned migrations with ordered runner + atomic apply The old migrate() helper only knew how to re-run a fixed slab of CREATE TABLE IF NOT EXISTS plus per-column ensureColumnExists calls. That worked while every schema change was a benign additive column; it falls apart as soon as we need a data backfill, an index, a rename, or anything that has to happen exactly once in a known order. Replaces it with a schema_migrations table + ordered []migration slice. Each migration has a unique id, a human-readable name, and a func(*Tx) body; the runner opens a transaction per migration so DDL and any data changes either both land and get recorded or both roll back together, leaving the DB in a state where retrying on next Open() reapplies from the same point. Migration 1 ("baseline") collapses the current schema into one entry: fresh databases apply it in one shot; existing dev databases see idempotent `CREATE TABLE IF NOT EXISTS` + `ALTER TABLE … ADD COLUMN` statements that succeed as no-ops, and the only net effect is the schema_migrations row that brings them into the versioned system. Tests cover fresh apply, idempotent re-open, skipping already-applied ids, rollback on body error (the transient table the migration created must not survive), and duplicate-id rejection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 12:59:42 -03:00
Thales Maciel	b930c51990	runtime sockets: close the local-user race window around control-plane creation Previously the daemon socket, per-VM firecracker API socket, and vsock socket were transiently world-exposed on hosts without XDG_RUNTIME_DIR: the runtime directory landed in /tmp at 0755, Firecracker ran with umask 000 (mode 0666 sockets), and only a follow-up chown/chmod in EnsureSocketAccess tightened them. A local attacker could race into bangerd.sock or the firecracker API socket during that window. Three changes: - internal/paths/paths.go: RuntimeDir is now created (and re-chmod'd if stale) at 0700 unconditionally. When XDG_RUNTIME_DIR is unset and we fall back to /tmp/banger-runtime-<uid>, Ensure() now verifies the parent dir is owned by the current uid and 0700 mode — refusing to place sockets inside a directory someone else created. Symlink swaps rejected via Lstat. - internal/firecracker/client.go: launch firecracker with umask 077 instead of umask 000 so the API socket is mode 0600 from birth. The chown in fcproc.EnsureSocketAccess still transfers ownership from root to the invoking user afterwards. - internal/daemon/fcproc/fcproc.go: EnsureSocketDir now creates (and re-chmod's) the runtime socket directory at 0700. Tests cover the tightening path — an existing 0755 RuntimeDir is re-chmod'd on Ensure. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 12:53:47 -03:00
Thales Maciel	2b6437d1b4	remove vm session feature Cuts the daemon-managed guest-session machinery (start/list/show/logs/stop/kill/attach/send). The feature was aimed at agent-orchestration workflows (programmatic stdin piping into a long-lived guest process) that aren't driving any concrete user today, and the ~2.3K LOC of daemon surface area (attach bridge, FIFO keepalive, controller registry, sessionstream framing, SQLite persistence) was locking in an API we'd have to keep through v0.1.0. Anything session-flavoured that people actually need today can be done with `vm ssh + tmux` or `vm run -- cmd`. Deleted: - internal/cli/commands_vm_session.go - internal/daemon/{guest_sessions,session_lifecycle,session_attach,session_stream,session_controller}.go - internal/daemon/session/ (guest-session helpers package) - internal/sessionstream/ (framing package) - internal/daemon/guest_sessions_test.go - internal/store/guest_session_test.go - GuestSession* types from internal/{api,model} - Store UpsertGuestSession/GetGuestSession/ListGuestSessionsByVM/DeleteGuestSession + scanner helpers - guest.session.* RPC dispatch entries - 5 CLI session tests, 2 completion tests, 2 printer tests Extracted: - ShellQuote + FormatStepError lifted to internal/daemon/workspace/util.go (only non-session consumer); workspace package now self-contained - internal/daemon/guest_ssh.go keeps guestSSHClient + dialGuest + waitForGuestSSH, which are still used by workspace prepare/export - internal/daemon/fake_firecracker_test.go preserves the test helper that used to live in guest_sessions_test.go Store schema: CREATE TABLE guest_sessions and its column migrations removed. Existing dev DBs keep an orphan table (harmless, pre-v0.1.0). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 12:47:58 -03:00
Thales Maciel	c42fcbe012	cli + daemon: move test seams off package globals onto injected structs CLI: introduce internal/cli.deps which owns every RPC/SSH/host-command seam the tree used to reach through mutable package vars. Command builders, orchestrators, and the completion helpers become methods on *deps. Tests construct their own deps per case, so fakes no longer leak across cases and tests are free to run in parallel. Daemon: move workspaceInspectRepoFunc + workspaceImportFunc onto the Daemon struct (workspaceInspectRepo / workspaceImport), mirroring the existing guestWaitForSSH / guestDial pattern. Workspace-prepare tests drop t.Parallel() guards now that they no longer mutate process-wide state. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 19:03:55 -03:00
Thales Maciel	d38f580e00	doctor: surface state store open failure as failing check Previously store.Open errors were silently swallowed, so `banger doctor` could report green while the default-image check (and any other store-dependent diagnostic) was silently skipped because d.store was nil. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 17:35:27 -03:00
Thales Maciel	3f6ecb4376	cli: split banger.go god file into focused files Pure code motion — banger.go 3508→240 LOC, same-package decomposition keeps all identifiers visible without export changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-19 17:34:32 -03:00
Thales Maciel	3a5f4cd40d	cli: delete vm run's dead import path + duplicated git inspection The CLI carried a full second copy of the workspace import implementation that `vm run` never actually used: - importVMRunRepoToGuest (no callers — the live flow calls the daemon's PrepareVMWorkspace RPC instead) - prepareVMRunRepoCopy, vmRunCheckoutCommit, vmRunCheckoutScript, gitFileURL, runHostCommand (all reachable only from the dead importVMRunRepoToGuest) Plus a duplicated repo-inspection surface that shadowed the daemon's: - inspectVMRunRepo ran every git query the daemon re-ran during workspace.prepare (HEAD, branch, identity, origin, overlay list) - gitOutput / gitTrimmedOutput / gitResolvedConfigValue / parseNullSeparatedOutput / listSubmodules / listOverlayPaths / resolveVMRunSourcePath — all identical to the exported workspace.* versions - vmRunRepoSpec — same fields as workspace.RepoSpec Replaced with a single minimal preflight: func vmRunPreflightRepo(ctx, rawPath) (absPath, err error) The preflight only checks what the user can fix locally before banger creates a VM (path exists, sits in a non-bare git repo, no submodules). The daemon's workspace.prepare RPC does the full inspection — and returns RepoRoot + RepoName in the response, which the CLI now threads into the tooling harness instead of computing them a second time. Signature changes: runVMRun(ctx, ..., vmRunRepo, ...) // was: vmRunRepoSpec startVMRunToolingHarness(ctx, client, repoRoot, repoName, progress) // was: (ctx, client, spec, progress) vmRunToolingHarnessScript(plan) // was: (spec, plan) vmRunToolingHarnessLaunchScript(repoName) // was: (spec) Tests: the CLI-side git-inspection tests are replaced by a single TestVMRunPreflightRejectsSubmodules that exercises the preflight. Everything else (tooling harness script, progress renderer, SSH args, runVMRun flows) keeps working. 
The shallow-copy / checkout-script tests are gone — that code now lives only in internal/daemon/workspace and is tested there. Also fixed a latent bug the refactor exposed: vm run's --from flag defaults to "HEAD", which the daemon reads as "from without branch" and rejects. CLI now scrubs fromRef when branchName is empty. Live verified: `banger vm run --name X . -- cmd` boots, workspace materialises at /root/repo with matching HEAD, exit code propagates.	2026-04-19 17:01:26 -03:00
Thales Maciel	ae14b9499d	ssh: trust-on-first-use host key pinning everywhere Guest host-key verification was off in all three SSH paths: * Go SSH (internal/guest/ssh.go) used ssh.InsecureIgnoreHostKey * `banger vm ssh` passed StrictHostKeyChecking=no + UserKnownHostsFile=/dev/null * `~/.ssh/config` Host *.vm shipped the same posture into the user's global config Now each path verifies against a banger-owned known_hosts file at `~/.local/state/banger/ssh/known_hosts` with TOFU semantics: * First dial to a VM pins the key. * Subsequent dials require an exact match. A mismatch fails with an explicit "possible MITM" error. * `vm delete` removes the entries so a future VM reusing the IP or name re-pins cleanly. * The user's `~/.ssh/known_hosts` is untouched. Changes: internal/guest/known_hosts.go (new): OpenSSH-compatible parser, TOFUHostKeyCallback, RemoveKnownHosts. Process-wide mutex around the file. internal/guest/ssh.go: Dial and WaitForSSH grew a knownHostsPath parameter threaded through the callback. Empty path keeps the insecure callback (tests + throwaway tools only; documented). internal/daemon/{guest_sessions,session_attach,session_lifecycle,session_stream}.go: call sites pass d.layout.KnownHostsPath. internal/daemon/ssh_client_config.go: the ~/.ssh/config Host *.vm block now points at banger's known_hosts and uses StrictHostKeyChecking=accept-new. Missing path → fail closed. internal/daemon/vm_lifecycle.go: deleteVMLocked drops known_hosts entries for the VM's IP and DNS name via removeVMKnownHosts. internal/cli/banger.go: sshCommandArgs swaps StrictHostKeyChecking no + /dev/null for banger's file + accept-new. Path resolution failure falls through to StrictHostKeyChecking=yes. internal/paths/paths.go: Layout gains SSHDir + KnownHostsPath; Ensure creates SSHDir at 0700.
Tests (internal/guest/known_hosts_test.go): pin on first use, accept matching key on second dial, reject mismatch, empty path skips checking, RemoveKnownHosts drops the entry, re-pin works after remove. Existing daemon + cli tests updated to assert the new posture and regression-guard against the old flags. Live verified: vm run writes the pin to banger's known_hosts at 0600 inside a 0700 dir; banger vm ssh + ssh root@<vm>.vm both succeed using the pin; vm delete clears it.	2026-04-19 16:46:03 -03:00
Thales Maciel	a59958d4f5	daemon: roll back host state on any Open() failure Open() touched several pieces of host state before hitting the step that returned the error: * SQLite handle (store.Open) * managed SSH client config block (ensureVMSSHClientConfig) * vm-DNS UDP listener goroutine (startVMDNS) * systemd-resolved per-interface routing (ensureVMDNSResolverRouting) The only deferred cleanup guarded stopVMDNS. A reconcile() or initializeTapPool() failure therefore left the listener running, the resolver wiring in place, and the SQLite handle open. A subsequent startup attempt ran into "port 42069 already in use" or silently published stale state. Fix: once `d` exists, defer `d.Close()` on `err != nil`. Close is idempotent (sync.Once) and every teardown step (listener close, DNS listener close, resolver revert, session registry close, store close) is nil-guarded, so calling it on a daemon that never got past the first startup step is safe. Tests (internal/daemon/open_close_test.go): - TestCloseOnPartiallyInitialisedDaemon: Close survives a daemon with only store + closing channel, and with a vmDNS listener but nothing else. Catches regressions where a teardown step forgets to nil-check. - TestCloseIdempotentUnderConcurrency: 5 goroutines racing on Close() never panic (sync.Once + close(d.closing) survive). - TestOpenFailureRunsCloseCleanup: structural check that the `defer cleanup() if err != nil` pattern actually fires. Live: `banger daemon stop` cleanly, `banger vm ls` restarts daemon without a residual listener on port 42069.	2026-04-19 16:36:29 -03:00
Thales Maciel	d1b9a8c102	remove experimental web UI The web UI shipped as "experimental" and was never finished — no nav off the dashboard, no live updates, no settled design, never a supported surface. It was opt-in by default already; leaving the code in the tree for v0.1.0 only invited "does this work?" questions and kept HostSummary/BangerSummary/SudoStatus types on the public RPC surface that nothing else uses. Removed: internal/webui/ (all Go + templates + assets) internal/daemon/web.go (server start / Layout / Config / ListVMs / ListImages) internal/daemon/dashboard.go (DashboardSummary aggregator) Simplified: internal/api/types.go drop WebURL on PingResult, drop HostSummary / SudoStatus / BangerSummary / DashboardSummary / DashboardSummaryResult internal/model/types.go drop DaemonConfig.WebListenAddr internal/config/config.go drop web_listen_addr from fileConfig + Load internal/daemon/daemon.go drop webListener / webServer / webURL fields + startWebServer() call + ping WebURL population internal/cli/banger.go `daemon status` output no longer branches on web internal/daemon/{doc.go,ARCHITECTURE.md} drop web UI sections README.md drop web_listen_addr config bullet + security paragraph Tests updated to reflect the new shape. Coverage 57.3 -> 58.9% (the webui package was largely untested; its removal lifts the ratio without moving the numerator). `banger daemon status` output and --help are web-free. Lint + full suite green.	2026-04-19 14:28:08 -03:00
Thales Maciel	687fcf0b59	vm state: split transient kernel/process handles off the durable schema Separates what a VM IS (durable intent + identity + deterministic derived paths — `VMRuntime`) from what is CURRENTLY TRUE about it (firecracker PID, tap device, loop devices, dm-snapshot target — new `VMHandles`). The durable state lives in the SQLite `vms` row; the transient state lives in an in-memory cache on the daemon plus a per-VM `handles.json` scratch file inside VMDir, rebuilt at startup from OS inspection. Nothing kernel-level rides the SQLite schema anymore. Why: Persisting ephemeral process handles to SQLite forced reconcile to treat "running with a stale PID" as a first-class case and mix it with real state transitions. The schema described what we last observed, not what the VM is. Every time the observation model shifted (tap pool, DM naming, pgrep fallback) the reconcile logic grew a new branch. Splitting lets each layer own what it's good at: durable records describe intent, in-memory cache + scratch file describe momentary reality. Shape: - `model.VMHandles` = PID, TapDevice, BaseLoop, COWLoop, DMName, DMDev. Never in SQLite. - `VMRuntime` keeps: State, GuestIP, APISockPath, VSockPath, VSockCID, LogPath, MetricsPath, DNSName, VMDir, SystemOverlay, WorkDiskPath, LastError. All durable or deterministic. - `handleCache` on `*Daemon` — mutex-guarded map + scratch-file plumbing (`writeHandlesFile` / `readHandlesFile` / `rediscoverHandles`). See `internal/daemon/vm_handles.go`. - `d.vmAlive(vm)` replaces the 20+ inline `vm.State==Running && ProcessRunning(vm.Runtime.PID, apiSock)` spreads. Single source of truth for liveness. - Startup reconcile: per running VM, load the scratch file, pgrep the api sock, either keep (cache seeded from scratch) or demote to stopped (scratch handles passed to cleanupRuntime first so DM / loops / tap actually get torn down). Verification: - `go test ./...` green. 
- Live: `banger vm run --name handles-test -- cat /etc/hostname` starts; `handles.json` appears in VMDir with the expected PID, tap, loops, DM. - `kill -9 $(pgrep bangerd)` while the VM is running, re-invoke the CLI, daemon auto-starts, reconcile recognises the VM as alive, `banger vm ssh` still connects, `banger vm delete` cleans up. Tests added: - vm_handles_test.go: scratch-file roundtrip, missing/corrupt file behaviour, cache concurrency, rediscoverHandles prefers pgrep over scratch, returns scratch contents even when process is dead (so cleanup can tear down kernel state). - vm_test.go: reconcile test rewritten to exercise the new flow (write scratch → reconcile reads it → verifies process is gone → issues dmsetup/losetup teardown). ARCHITECTURE.md updated; `handles` added to Daemon field docs.	2026-04-19 14:18:13 -03:00
Thales Maciel	2e6e64bc04	guest sshd: drop DEBUG3 + StrictModes no; normalise /root perms Previously /etc/ssh/sshd_config.d/99-banger.conf landed with: LogLevel DEBUG3 PermitRootLogin yes PubkeyAuthentication yes AuthorizedKeysFile /root/.ssh/authorized_keys StrictModes no DEBUG3 was debug leftover that floods journald in normal use. StrictModes no was a workaround for /root perm drift on the work disk — the real fix is to make those perms correct at provisioning time. New drop-in: PermitRootLogin prohibit-password PubkeyAuthentication yes PasswordAuthentication no KbdInteractiveAuthentication no AuthorizedKeysFile /root/.ssh/authorized_keys prohibit-password blocks password root login even if PasswordAuth gets flipped on elsewhere; KbdInteractiveAuth no closes the last interactive fallback; StrictModes is now on (sshd's default). normaliseHomeDirPerms chown/chmods /root to 0755 root:root at every work-disk mount (ensureAuthorizedKeyOnWorkDisk, seedAuthorizedKeyOnExt4Image); the .ssh dir also explicitly chown'd root:root. Verified end-to-end against a real VM: `sshd -T` reports strictmodes yes and all five directives match. Regression test (sshd_config_test.go) pins the allow-list and the deny-list (DEBUG3, StrictModes no, bare `PermitRootLogin yes`) so the next accidental reintroduction fails fast. README's Security section updated to reflect the new posture.	2026-04-19 13:40:40 -03:00
Thales Maciel	6cd52d12f4	workspace prepare: release VM mutex before guest I/O Previously withVMLockByRef held the per-VM mutex across InspectRepo, waitForGuestSSH, dialGuest, ImportRepoToGuest (the tar stream!), and the readonly chmod. A large repo could block `vm stop` / `vm delete` / `vm restart` on the same VM for however long the import took. Split into two phases: 1. VM mutex held briefly to validate state (running + PID alive) and snapshot the fields needed for SSH (guest IP, api sock). 2. VM mutex released. Acquire workspaceLocks[id] — a separate per-VM mutex scoped to workspace.prepare / workspace.export — for the guest I/O phase. Lifecycle ops (stop/delete/restart/set) only take vmLocks, so they no longer queue behind a slow import. Two concurrent prepares on the same VM still serialise via workspaceLocks so tar streams don't interleave. ExportVMWorkspace also acquires workspaceLocks to avoid snapshotting a half-streamed import. Two regression tests (sequential — they swap package-level seams): ReleasesVMLockDuringGuestIO: stall the import fake, assert the VM mutex is acquirable from another goroutine during the stall. SerialisesConcurrentPreparesOnSameVM: 3 concurrent prepares, assert Import is only ever invoked 1-at-a-time per VM. ARCHITECTURE.md documents the split + updated lock ordering.	2026-04-19 13:32:42 -03:00
Thales Maciel	99de42385f	workspace export: stop mutating the guest repo index Previously `banger vm workspace export` ran `git add -A` against the guest's real `.git/index`, so the observation step left staged changes behind that users never asked for. Reconnecting later (ssh, another export) surfaced them and looked like phantom work. Route `git add -A` through a throwaway index file instead: tmp_idx=$(mktemp ...); trap 'rm -f "$tmp_idx"' EXIT; git read-tree <ref> --index-output="$tmp_idx"; GIT_INDEX_FILE="$tmp_idx" git add -A; GIT_INDEX_FILE="$tmp_idx" git diff --cached <ref> --binary|--name-only. The real .git/index, working tree, and refs stay exactly as the user left them. Same diff content: commits past <ref>, uncommitted edits, and untracked files (minus .gitignore) all captured. Regression test locks the invariant: every export script must route add -A through GIT_INDEX_FILE and clean the temp index on exit. CLI help text updated to say "non-mutating".	2026-04-19 13:20:56 -03:00
Thales Maciel	21b74639d8	vm defaults: host-aware sizing + spec line on spawn + doctor check Replaces the static model.Default* constants that drove --vcpu / --memory / --disk-size with a three-layer resolver: 1. [vm_defaults] in ~/.config/banger/config.toml (if set) 2. host-derived heuristics (cpus/4 capped at 4; ram/8 capped at 8 GiB) 3. baked-in constants (floor) Visibility: - Every `vm run` / `vm create` prints a `spec:` line before progress begins: `spec: 4 vcpu · 8192 MiB · 8G disk`. Matches the VM that actually gets created because the CLI is now the single source of truth: it resolves, populates the flag defaults, and forwards the explicit values to the daemon. - `banger doctor` adds a "vm defaults" check showing per-field provenance (config|auto|builtin) and the config file path for overrides. - `--help` shows the resolved defaults (e.g. `--vcpu int (default 4)` on an 8-core host). No `banger config init` command, no first-run side effects, no writes to the user's filesystem behind their back. Users who want explicit control set the keys; everyone else gets sensible numbers that track their hardware.	2026-04-19 13:06:51 -03:00
Thales Maciel	78ff482bfa	release prep: opt-in web UI, make uninstall, fix stale kernel-catalog docs - WebListenAddr default is now "" (empty). The experimental web UI was running on 127.0.0.1:7777 by default, which surprises users who never opted in. Users who want it set `web_listen_addr = "127.0.0.1:7777"` in config.toml. - `make uninstall` stops the daemon (if any) and removes the installed binaries. Preserves user data on disk but prints the paths so `rm -rf` can follow for a full purge. Documented in README next to install. - docs/kernel-catalog.md: replace the `void-6.12` and `alpine-3.23` examples (never published) with `generic-6.12` (the only cataloged kernel today). Updates the versioning-convention example too.	2026-04-19 12:43:58 -03:00
Thales Maciel	221fb03d68	cli QoL: vm prune, list→ls aliases, delete→rm aliases - `banger vm prune` sweeps every non-running VM (stopped, created, error) with an interactive confirmation; -f/--force skips the prompt. Partial failures report which VM failed and exit non-zero. - list commands gain `ls` alias: vm list already had it; added to image list, kernel list, and vm session list. - delete commands gain `rm` alias: vm delete and image delete. kernel rm already aliased delete/remove. Uses new test seams (vmListFunc) plus the existing vmDeleteFunc so prune unit-tests without touching the daemon socket.	2026-04-19 12:17:46 -03:00
Thales Maciel	e3eaa0c797	cli: shell completion via cobra + dynamic resource name lookups Re-enable cobra's default `completion` subcommand (`banger completion bash|zsh|fish|powershell`). Plus live resource-name suggestions that hit the running daemon via the same RPC the real commands use: vm start/stop/restart/delete/kill/set → completeVMNames (variadic) vm ssh/show/logs/stats/ports/... → completeVMNameOnlyAtPos0 vm session list/start → completeVMNameOnlyAtPos0 vm session show/logs/stop/kill/attach/send → completeSessionNames (vm + session) image show/delete/promote → completeImageNameOnlyAtPos0 kernel show/rm → completeKernelNameOnlyAtPos0 vm run/create --image, image pull/register --kernel-ref → flag-value completion Design notes in internal/cli/completion.go: completers never auto-start the daemon (ping-check, bail with NoFileComp on miss), so tab-completion stays a zero-cost probe. Variadic completers exclude already-entered args to avoid duplicate suggestions. README: install recipes for bash / zsh / fish.	2026-04-19 12:12:40 -03:00
Thales Maciel	346eaba673	coverage: medium batch — hostnat runner, store guest-sessions, daemon helpers Reuses existing fixtures (CommandRunner fakes, SQLite tempfile store, pure-Go seams). No new infra needed. hostnat 50% -> 98% (iptables orchestration via fake runner) store 78% -> 91% (guest_sessions CRUD roundtrip) daemon/session 57% -> 95% (script gen, state parse, snapshot apply) daemon/opstate 67% -> 100% (Registry Insert/Get/Prune) daemon (firstNonEmpty) slight bump Total 54.0% -> 56.5%.	2026-04-18 18:03:37 -03:00
Thales Maciel	f8979de58a	coverage: easy-wins batch across cli, system, paths, vmdns, toolingplan Pure-Go tests for formatters, layout resolution, and validators — no fixtures, no external processes. Targets previously-zero functions the triage scan flagged as low-hanging fruit. cli 55% -> 65% paths 64% -> 91% system 65% -> 75% vmdns 72% -> 86% toolingplan 73% -> 78% Total 52.6% -> 54.0%.	2026-04-18 17:57:05 -03:00
Thales Maciel	a3cc296523	guest: tests for fingerprint, shellQuote, tar-entries edge cases, nil receivers Pure-Go additions (no SSH server fixture): AuthorizedPublicKeyFingerprint, shellQuote escaping, writeTarEntriesArchive error paths (.., ., missing, duplicates, blank entries) and symlink handling, StreamSession/Client nil-receiver safety, WaitForSSH context cancellation. internal/guest coverage 17.8% -> 47.6%. Total 52.1% -> 52.6%. The remaining uncovered paths need a real in-process SSH server; skip.	2026-04-18 17:47:24 -03:00
Thales Maciel	18bf89eae9	coverage: make targets + close zero-cov gaps (namegen, sessionstream) Adds `make coverage` (per-package + total via -coverpkg=./...), `make coverage-html`, and `make coverage-total` (CI-friendly). Wires coverage.out/coverage.html through `make clean` and .gitignore. Closes the two easy zero-coverage packages: namegen (77.8%) and sessionstream (93.5%). Total statement coverage 51.7% -> 52.1%.	2026-04-18 17:44:37 -03:00
Thales Maciel	2584f94828	image/kernel pull: heartbeat dots so slow pulls look alive Bundle downloads can take 20–60s on a typical connection and the CLI was going silent between "resolving daemon" and the final image summary. Users wondered whether banger had wedged. New `withHeartbeat` helper wraps an RPC call with a dot-every-2s ticker on stderr. No-op when stderr isn't a terminal, so piped or scripted invocations stay quiet. Wired into `image pull` and `kernel pull`, the two commands that actually download bytes. Example: $ banger image pull debian-bookworm [image pull] .......... id name managed ... Tests cover the non-TTY short-circuit and error propagation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 17:08:30 -03:00
Thales Maciel	b5c13e3938	Remove opencode package + vm acp command (dead code) The `internal/opencode` package and the `opencodeCapability` that consumed it were hard-wired to wait for opencode on guest port 4096 when an image shipped an initrd. After the prune commits (void / alpine / customize.sh / image build all removed), nothing banger produces today carries an initrd, so the capability's wait path was unreachable: every startup short-circuited to the "direct-boot, skip opencode" branch. Same logic for `banger vm acp`: it SSHes to `opencode acp --cwd <path>`, a binary the golden image no longer ships. Users who run their own image with opencode can still invoke `ssh vm -- opencode acp --cwd /root/repo` directly — no banger scaffolding required. Removed: - internal/opencode/ (whole package, 255 LOC incl. tests) - internal/daemon/opencode.go (opencodeCapability) - cli `vm acp` command + its helpers (runVMACP, sshACPCommandArgs, vmACPRemoteCommand) + their tests - The opencodeCapability{} entry in registeredCapabilities() plus the test that pinned its presence - `wait_opencode` progress-stage label from the vm-create renderer - Stale mentions in daemon/doc.go, README, and webui test fixtures ~480 lines gone, 12 added. `banger/internal` is now 25 packages instead of 26. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 16:54:37 -03:00
Thales Maciel	0933deaeb1	file_sync: config-driven replacement for hardcoded auth sync Replace the three hardcoded host→guest credential syncs (opencode, claude, pi) with a generic `[[file_sync]]` config list. Default is empty — users opt in to exactly what they want synced, with no surprise about which tools banger "supports". ```toml [[file_sync]] host = "~/.local/share/opencode/auth.json" guest = "~/.local/share/opencode/auth.json" [[file_sync]] host = "~/.aws" # directories are copied recursively guest = "~/.aws" [[file_sync]] host = "~/bin/my-script" guest = "~/bin/my-script" mode = "0755" # optional; default 0600 for files ``` Semantics: - Host `~/...` expands against the host user's $HOME. Absolute host paths are used as-is. - Guest must live under `~/` or `/root/...` — banger's work disk is mounted at /root in the guest, so that's the syncable namespace. Anything outside is rejected at config load. - Validation at config load: reject empty paths, relative paths, `..` traversal, `~user/...`, malformed mode strings. Errors name the offending entry index. - Missing host paths are a soft skip with a warn log (existing behaviour). Other errors (read, mkdir, install) abort VM create. - File entries: `install -o 0 -g 0 -m <mode>` (default 0600). - Directory entries: walked in Go; each source file is installed with its own source permissions preserved. The entry's `mode` is ignored for directories. Removed (all dead after this): - `ensureOpencodeAuthOnWorkDisk`, `ensureClaudeAuthOnWorkDisk`, `ensurePiAuthOnWorkDisk`, the shared `ensureAuthFileOnWorkDisk`, their `warn*Skipped` helpers, `resolveHost{Opencode,Claude,Pi}AuthPath`, and the work-disk relative-path + default display-path constants. - The capability hook registering the three syncs now calls the generic `runFileSync` once. 
Seven tests exercising the old codepath deleted; six new tests cover the new runFileSync (no-op on empty config, file copy, custom mode, missing-host-skip, overwrite, recursive directory). Config-layer test adds happy-path parsing and a case-per-shape table of invalid entries (empty, relative host, guest outside /root, '..' traversal, `~user`, bad mode). README updated: replaces the "Credential sync" section with a "File sync" section showing the new config shape. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 16:40:11 -03:00
Thales Maciel	843314be5e	vm_authsync: s/repairing/provisioning/ in SSH work-disk stage The "repairing SSH access on work disk" stage detail sounded remedial, like something had gone wrong. It's just writing banger's SSH key to /root/.ssh/authorized_keys on the work disk for the first time. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 16:29:18 -03:00
Thales Maciel	cdd857b288	vm run --rm: suppress the still-running reminder The deferred --rm delete fires AFTER runSSHSession returns, but runSSHSession prints "vm X is still running (stop with ...)" before returning. Net effect: the user sees the reminder, then the VM gets deleted behind it — misleading. Thread a skipReminder bool into runSSHSession. `vm run` passes the same value as removeOnExit; other callers (`vm ssh`) pass false. Reinforced by a new assertion in the --rm happy-path test that the reminder string never appears in stderr. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 16:10:29 -03:00
Thales Maciel	b33f24865c	vm run --rm: ephemeral sandboxes New `--rm` flag deletes the VM once the ssh session or `-- cmd` exits, making `vm run` one-shot. Exit code from command mode still propagates correctly. Semantics: - Create fails → no VM to delete, nothing to do. - SSH-wait timeout → VM intentionally kept alive so `vm logs <name>` shows why; the timeout error already pointed users at that. Even with --rm, this path skips delete — a wedged sshd is exactly when you want post-mortem access. - Session/command ends (any exit code, any reason) → VM is deleted via `vm.delete` RPC. Uses a fresh 10s context so Ctrl-C during the session doesn't abort the cleanup. New vmDeleteFunc seam at the top of banger.go alongside the other RPC seams. Two tests cover the happy path (session ends cleanly → delete fires with correct ref) and the skip-on-timeout path (ssh wait errors → delete does NOT fire). README updated with an ephemeral example and a note about the timeout-skip behaviour. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 16:06:46 -03:00
Thales Maciel	3aa64a63c1	vm run: bound the ssh wait and give a useful error on timeout Before: `guestWaitForSSHFunc` loops forever bounded only by context cancellation, so if sshd fails to start in the guest `vm run` hangs indefinitely — which burned a long debugging session during the golden-image bring-up. After: the ssh wait gets its own 90s deadline. On guest-side timeout the error names the VM, explains sshd is the likely suspect, points at `banger vm logs <name>` for the console output, and notes the VM is still alive for inspection (or `vm delete` to clean up). Parent context cancellation (Ctrl-C, caller timeout) still surfaces as-is without the hint. `vmRunSSHTimeout` is a var rather than a const so tests can shrink it; the new TestRunVMRunSSHTimeoutReturnsActionableError sets it to 50ms and asserts the error message contains the actionable bits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 15:59:27 -03:00
Thales Maciel	ac7974f5b9	Remove image build --from-image; doctor treats catalog images as OK The `image build` flow spun up a transient Firecracker VM, SSHed in, and ran a large bash provisioning script to derive a new managed image from an existing one. It overlapped heavily with the golden- image Dockerfile flow (same mise/docker/tmux/opencode install logic duplicated in Go as `imagemgr.BuildProvisionScript`) and had far more machinery: async op state, RPC begin/status/cancel, webui form + operation page, preflight checks, API types, tests. For custom images, writing a Dockerfile is simpler and more reproducible. Removed end-to-end: - CLI `image build` subcommand + `absolutizeImageBuildPaths`. - Daemon: BuildImage method, imagebuild.go (transient-VM orchestration), image_build_ops.go (async begin/status/cancel), imagemgr/build.go (the 247-line provisioning script generator and all its append* helpers), validateImageBuildPrereqs + addImageBuildPrereqs. - RPC dispatches for image.build / .begin / .status / .cancel. - opstate registry `imageBuildOps`, daemon seam `imageBuild`, background pruner call. - API types: ImageBuildParams, ImageBuildOperation, ImageBuildBeginResult, ImageBuildStatusParams, ImageBuildStatusResult; model type ImageBuildRequest. - Web UI: Backend interface methods, handlers, form, routes, template branches (images.html build form, operation.html build branch, dashboard.html Build button). - Tests that directly exercised BuildImage. Doctor polish (task C): - Drop the "image build" preflight section entirely (its raison d'être is gone). - Default-image check now accepts "not local but in imagecat" as OK: vm create auto-pulls on first use. Only flag when the image is neither locally registered nor in the catalog. Net: 24 files touched, 1,373 lines deleted, 25 added. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 15:54:29 -03:00
Thales Maciel	6083e2dde5	Prune legacy void/alpine + customize.sh flows The golden-image Dockerfile + catalog pipeline replaces the entire manual rootfs-build stack. With that shipped, the per-distro shell flows are dead code. Removed: - scripts/customize.sh, scripts/interactive.sh, scripts/verify.sh - scripts/make-rootfs{,-void,-alpine}.sh - scripts/register-{void,alpine}-image.sh - scripts/make-{void,alpine}-kernel.sh - internal/imagepreset/ (only consumer was `banger internal packages`, which fed customize.sh) - examples/{void,alpine}.config.toml - Makefile targets: rootfs, rootfs-void, rootfs-alpine, void-kernel, alpine-kernel, void-register, alpine-register, void-vm, alpine-vm, verify-void, verify-alpine, plus the ALPINE_RELEASE / _IMAGE_NAME / _VM_NAME variables The void-6.12 kernel catalog entry is also gone — golden image pairs with generic-6.12 and nothing else in the catalog depended on it. Consolidated: imagemgr now holds the small DebianBasePackages list + package-hash helper inline, so the `image build --from-image` flow (still supported) no longer pulls from a separate imagepreset package. Net: 3,815 lines deleted, 59 added. No runtime functionality removed beyond the `banger internal packages` subcommand (hidden, used only by the deleted customize.sh). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 15:39:53 -03:00
Thales Maciel	75baf2e415	publish-golden-image: content-addressed tarball names Embed the sha256 prefix in the uploaded filename so every rebuild lives at a unique URL. Cloudflare's edge cache (and any similar CDN in front of R2) can never serve stale bytes for the URL the catalog points at. The R2 console offers no per-URL purge for this bucket layout, so making the URL itself content-addressed is the only durable fix. Also republishes the debian-bookworm catalog entry with the new filename. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 15:26:57 -03:00
Thales Maciel	e0894376ea	vm create: auto-pull image and kernel from catalogs if missing One-command sandbox: `banger vm run` on a fresh host now Just Works. No prior `banger image pull` or `banger kernel pull` needed. Changes: - Default `default_image_name` flips from "default" to "debian-bookworm" so the golden image is the implicit target when `--image` is omitted. - `CreateVM` resolves the image via a new `findOrAutoPullImage`: try the local store first, and on miss fall back to the embedded imagecat catalog + auto-pull. Emits a vm-create progress stage so the user sees "pulling from image catalog" in the create output. - `resolveKernelInputs` gains context + the same pattern via `readOrAutoPullKernel`: try the local kernelcat, and on miss look up the embedded kernelcat and auto-pull. Fires whenever a bundle's manifest references a kernel the user hasn't pulled yet, not just during image pull — any CreateVM with an image that needs a kernel not yet local will resolve it. - `--image` help text updated on both `vm run` and `vm create`. Six tests cover local-hit-no-pull, auto-pull-on-miss, not-in-catalog error propagation, and a non-ENOENT kernel read error does NOT trigger a misleading "not in catalog" claim. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 15:10:26 -03:00
Thales Maciel	81a27d6648	imagecat: publish debian-bookworm bundle with boot fixes End-to-end verified: banger image pull debian-bookworm banger vm run --image debian-bookworm --name goldenvm boots through multi-user.target, sshd starts, and vm run drops into an interactive ssh session. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 14:59:01 -03:00
Thales Maciel	66838bb135	make-bundle: strip /.dockerenv so systemd doesn't misdetect virt `docker create` drops /.dockerenv into the container's writable layer, and `docker export` includes it in the tar. When systemd later boots that rootfs it finds /.dockerenv and flags virtualization=docker, which disables a bunch of udev device-unit behaviour (device units never become active, mount units waiting on them hang forever). Strip /.dockerenv (and /run/.containerenv for podman symmetry) from the staging tree after FlattenTar and before BuildExt4 so systemd correctly detects virtualization=kvm. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 14:58:42 -03:00
Thales Maciel	ed4117d926	imagepull/BuildExt4: omit positional fs-size; rely on file truncation mkfs.ext4's positional fs-size is documented in 1 KiB units (not the filesystem's 4 KiB block size), so passing sizeBytes/4096 made filesystems 1/4 the intended size. A 4 GiB request became a 1 GiB ext4 in a 4 GiB file, packed to 0 free blocks — VM create then failed with 'Could not allocate block' when patchRootOverlay tried to write guest config. The file is truncated to the target size before mkfs runs; without the positional arg, mkfs uses the whole device. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 14:58:42 -03:00
Thales Maciel	b2dcdf9757	vm_lifecycle: drop systemd.mask=dev-{ttyS0,vdb}.device Both masks were added when the direct-boot path first landed for container rootfses that didn't have anything mounted on /dev/vdb. The golden image (and any pulled OCI image running under banger's patchRootOverlay) has an /etc/fstab entry mounting /dev/vdb at /root — masking dev-vdb.device makes systemd wait forever for a unit that can never become active, and the work-disk mount never completes. dev-ttyS0 is a real serial console the image needs too. Drop both. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 14:58:42 -03:00
Thales Maciel	ab5627aec2	imagecat: publish debian-bookworm golden image First entry in the image catalog. Verified end-to-end: - https://images.thaloco.com/debian-bookworm-x86_64.tar.zst reachable - sha256 071495e6... matches - bundle unpacks to rootfs.ext4 (4 GiB) + manifest.json with the expected name/distro/arch/kernel_ref. publish-golden-image.sh tweaks: - default RCLONE_REMOTE from 'r2' to 'banger-images' (matches the rclone config actually in use here). - rclone copyto now passes --s3-no-check-bucket and --no-check-dest so scoped R2 tokens without HeadBucket/HeadObject permission still upload cleanly. To use: restart bangerd so it picks up the new embedded catalog, then `banger image pull debian-bookworm`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 13:25:42 -03:00
Thales Maciel	5bdc9985c2	image pull: dispatch to imagecat bundle path before OCI PullImage now checks the embedded imagecat catalog first. If the ref matches a catalog entry, it takes the bundle path: 1. Fetch the .tar.zst bundle into a staging dir (rootfs.ext4 + manifest.json). 2. Strip manifest.json (staging-only metadata). 3. Stage kernel/initrd/modules alongside rootfs.ext4. 4. Publish the staging dir and upsert the image row. Bundle rootfs is already flattened + ownership-fixed + agent- injected at build time, so the daemon-side work is strictly I/O — no flatten, no mkfs, no debugfs. Kernel resolution in the bundle path: --kernel-ref > entry.kernel_ref > --kernel/--initrd/--modules. If the ref doesn't match a catalog entry, PullImage falls through to the existing OCI path unchanged (extracted into pullFromOCI). New test seam: d.bundleFetch. Six unit tests cover happy path, --kernel-ref override, existing-name rejection, kernel-required error, fetch-failure cleanup, and the catalog → OCI fallthrough. CLI help updated: image pull now documents both forms and takes <name-or-oci-ref> instead of requiring an OCI ref. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-17 15:43:33 -03:00
Thales Maciel	d22d05555c	scripts: bundle-based golden image pipeline Replaces the OCI-push flow with a bundle-based one that mirrors the kernel catalog (publish-kernel.sh / kernelcat). - scripts/make-golden-bundle.sh: docker build → docker create → docker export | banger internal make-bundle → .tar.zst. Defaults target debian-bookworm / generic-6.12 / x86_64; pinned --size 4G to leave headroom for first-boot installs and in-VM apt use. - scripts/publish-golden-image.sh: rewritten to call make-golden-bundle, rclone upload to R2 (banger-images bucket, images.thaloco.com), and jq-patch internal/imagecat/catalog.json with URL / sha256 / size. --skip-upload stops after bundle build and copies to dist/. make-bundle default ext4 sizing also bumped from +25% to +50% headroom (mkfs.ext4 needs room for inode tables, block-group metadata, journal, and the default 5% reserved-blocks margin). The old 25% was too tight for the ~950 MB golden rootfs and aborted with "Could not allocate block". End-to-end smoke (local): golden Dockerfile → 286 MB tar.zst bundle with correct manifest, valid ext4, and all banger units + vsock agent present. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-17 15:37:47 -03:00
Thales Maciel	a7d1a49aca	cli: restrict ExitCodeError unwrap to the CLI's own type main.go previously unwrapped any error implementing `ExitCode() int` into the process exit status, which matched *exec.ExitError too. So whenever a CLI command ran a subprocess (mkfs.ext4, debugfs, ssh to a daemon preflight, etc.) and that subprocess failed, the CLI would silently exit with the subprocess's code — no error message printed. Surfaced while bringing up `banger internal make-bundle`: mkfs.ext4 was failing on an undersized ext4 and the user saw only `EXIT=1`. Fix: export the type as `cli.ExitCodeError` and unwrap against the concrete type in main.go. The `ExitCode()` method is gone — only the explicit wrap at the `vm run` command-mode call site produces this error now. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-17 15:37:47 -03:00
Thales Maciel	bb95a0a273	banger internal make-bundle: build image bundles from flat rootfs tars New hidden subcommand that turns a `docker export`-style rootfs tar into a banger bundle (`rootfs.ext4` + `manifest.json`, tar+zstd): 1. FlattenTar (new in imagepull) extracts the stream into a staging dir while capturing per-file uid/gid/mode into a Metadata record. 2. imagepull.BuildExt4 produces the ext4 via `mkfs.ext4 -d`. 3. imagepull.ApplyOwnership re-applies the captured metadata with `debugfs sif` so setuid/root-owned files keep their identity. 4. imagepull.InjectGuestAgents drops the vsock agent + network bootstrap + first-boot service into the ext4. 5. manifest.json is written with name/distro/arch/kernel_ref. 6. Both files are packaged as .tar.zst with max compression. Flags: --rootfs-tar (file or '-' for stdin), --name, --distro, --arch, --kernel-ref, --description, --size, --out. Stdout prints bundle path, sha256, and size so callers can patch the catalog. Unit tests cover flag registration, required-arg validation, the bundle tar round-trip, sha256HexFile, and dirSize. An end-to-end test runs the full pipeline against a synthesized tiny rootfs tar; skips gracefully when mkfs.ext4 / debugfs / companion binaries are missing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-17 15:17:50 -03:00
Thales Maciel	3d9ae624b1	imagecat: catalog + fetch for banger image bundles New package mirroring `kernelcat`: catalog + SHA256-verified HTTP fetch of `.tar.zst` bundles that contain rootfs.ext4 + manifest.json. Mounted empty (version:1, entries:[]) so nothing is pullable via the bundle path yet; wiring into `banger image pull` lands in a later phase. - catalog.go: Catalog/CatEntry, LoadEmbedded, ParseCatalog, Lookup, ValidateName. - fetch.go: Fetch(ctx, client, destDir, entry) downloads the bundle, verifies sha256, extracts exactly rootfs.ext4 and manifest.json into destDir, returns the parsed manifest. Rejects unexpected tar entries, unsafe paths, non-regular files, and cleans up partial writes on failure. - Thirteen unit tests (happy path + every failure mode). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-17 15:11:52 -03:00
Thales Maciel	feb679a301	vm run redesign: one command, three modes `vm run` now covers bare sandbox (no args), workspace sandbox (path), and workspace+command (path -- cmd) in a single entry point. Replaces the old print-next-steps-and-exit behaviour: bare and workspace modes drop into interactive ssh, command mode execs via ssh and propagates the remote exit code through banger's own exit status. - path argument is optional; --branch / --from still require a path. - workspace prep and mise tooling bootstrap only run when a path is given; command mode skips the bootstrap. - remote command exit status is wrapped as exitCodeError so main() can propagate it instead of collapsing every failure to 1. - README: promote vm run with three-mode examples; demote vm create to a scripting primitive. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-17 14:00:45 -03:00


238 commits