banger

Author	SHA1	Message	Date
Thales Maciel	05439d2325	daemon: cut vm stop latency Three changes to stopVMLocked, biggest win first: - Skip waitForExit on the SSH-success path. sync inside the guest already flushed root.ext4, so cleanupRuntime's SIGKILL is safe immediately. Saves up to gracefulShutdownWait (10s) per stop. - Drop the SendCtrlAltDel + 10s wait fallback when SSH is unreachable. On Debian, ctrl+alt+del routes to reboot.target so FC never exits on it — the wait was pure latency. - Shrink the SSH dial timeout 5s → 2s. A reachable guest dials in single-digit milliseconds; if it doesn't, fail fast and SIGKILL. Worst-case (broken SSH) goes ~15s → ~2s + cleanup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 17:51:22 -03:00
Thales Maciel	c352aba50a	daemon: parallelize tap-pool warmup Pool warmup ran createTap calls sequentially (one per loop iteration), so warming N taps cold took N times the per-tap cost. Each releaseTap also fired its own ensureTapPool goroutine, racing on n.tapPool.next. Reserve a batch of names under the lock, then run up to maxConcurrentTapWarmup createTap RPCs in parallel — root helper already handles each connection in its own goroutine, so multiple in-flight priv.create_tap requests don't contend at the wire level. Add a warming flag to dedupe concurrent ensureTapPool invocations triggered by parallel releases. Bail-on-first-error semantics preserved: if every goroutine in a batch fails (e.g. host out of taps, kernel limit), the loop exits rather than burning monotonic indices forever. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 15:54:07 -03:00
Thales Maciel	71e073ac49	fix: land .hushlogin on work disk so vm run is quiet The work disk mounts at /root, so the .hushlogin written to the rootfs overlay was shadowed and never reached the guest — pam_motd kept printing the Debian banner on `banger vm run`. Move the write to the work disk root inode (= /root in the guest) and run it from PrepareHost so existing VMs pick it up on next start. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 14:39:46 -03:00
Thales Maciel	9ed44bfd75	port smoke to go	2026-05-01 19:34:44 -03:00
Thales Maciel	b0a9d64f4a	fix: drop /root/repo fallback in vm exec for unbound VMs vm exec defaulted execGuestPath to /root/repo whenever the VM had no recorded workspace, so running it against a plain VM (one that never had vm workspace prepare / vm run ./repo) blew up with 'cd: /root/repo: No such file or directory' — surfaced via the login shell's mise activate hook because bash -lc sources profile.d before the explicit cd. Now auto-cd only fires when --guest-path is passed or the VM actually has a workspace recorded; otherwise the command runs from root's home. Mise wrapping unchanged — without a .mise.toml it's a no-op. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 17:06:46 -03:00
Thales Maciel	9400bab6fd	fix: accept host:port in validateResolverAddr; release v0.1.8 The root helper's resolver-address validator only accepted bare IPs, so `resolvectl dns <bridge> 127.0.0.1:42069` — banger's own auto-wire call to point systemd-resolved at the in-process DNS server — was rejected before it ever reached resolvectl. The auto-wire is best-effort and only logs a warning on failure, so .vm resolution silently broke on the NSS path: dig @127.0.0.1 worked, curl <vm>.vm didn't. Validator now allows both bare IPs and IP:port (matching what `resolvectl dns` itself accepts), with new test coverage for the port'd form. Existing installs need a one-time `sudo banger system restart` after updating to v0.1.8 so the daemon re-runs the auto-wire with the fixed validator. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 15:42:11 -03:00
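A validator that accepts both forms could look like the sketch below. It borrows the name `validateResolverAddr` from the commit, but the body is an assumption built on the standard `net/netip` package, not the root helper's actual code.

```go
package main

import (
	"fmt"
	"net/netip"
)

// validateResolverAddr accepts a bare IP ("127.0.0.1") or an IP:port pair
// ("127.0.0.1:42069"). Hostnames are rejected in both forms.
func validateResolverAddr(s string) error {
	if _, err := netip.ParseAddr(s); err == nil {
		return nil
	}
	if _, err := netip.ParseAddrPort(s); err == nil {
		return nil
	}
	return fmt.Errorf("resolver address %q is neither an IP nor IP:port", s)
}

func main() {
	for _, addr := range []string{"127.0.0.1", "127.0.0.1:42069", "localhost:53"} {
		fmt.Printf("%-18s -> %v\n", addr, validateResolverAddr(addr))
	}
}
```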
Thales Maciel	aaf49fc1b1	vm run: add -d/--detach + transparent tooling bootstrap The mise tooling bootstrap was failing silently when --nat wasn't set: the VM came up, the user landed in ssh, and tools were missing with no obvious cause. Two coupled fixes: * `-d`/`--detach`: create + prep + bootstrap, exit without attaching to ssh. Reconnect later with `banger vm ssh <name>`. Rejects the ambiguous combos `-d --rm` and `-d -- <cmd>`. * NAT precondition: when the workspace has a .mise.toml or .tool-versions, vm run now refuses before VM creation if --nat isn't set. Error message points at --nat or --no-bootstrap. * `--no-bootstrap`: explicit opt-out for users who want a vanilla VM with their workspace and no tooling install. Detached bootstrap runs synchronously (foreground tee'd to the log file) so the CLI only returns once installs finish. Interactive mode keeps today's nohup'd background behaviour so the ssh session starts promptly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 14:51:16 -03:00
Thales Maciel	9b5cbed32d	doctor: collapse healthy output to one line, add --verbose A healthy host triggered ~20 PASS rows with details — too noisy for the common case. Default now prints only fail/warn rows plus a summary footer; an all-pass run collapses to a single line. Pass --verbose / -v for the full per-check output. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 14:18:09 -03:00
Thales Maciel	09a3ef812f	style: gofmt internal/firecracker/client.go Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 14:18:04 -03:00
Thales Maciel	02a1472dd4	test: cover absolutizePaths, lastID, runCheckMigrations Adds focused unit tests for previously-uncovered cli helpers: - TestAbsolutizePaths covers the path-vararg helper's empty, absolute, and relative branches; complements the existing TestAbsolutizeImageRegisterPaths. - TestLastID is table-driven across nil/empty/sorted/unsorted/duplicates/negative inputs. - TestRunCheckMigrations* exercises every Compatibility branch (compatible / migrations needed / incompatible / inspect error) by stubbing bangerdExit and pointing the layout at a temp-dir DB seeded directly with the schema_migrations table. - TestNewBangerdCommandSubcommands pins the flag set against accidental drift. Lifts internal/cli coverage from 71% to 76% combined. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 12:08:19 -03:00
Thales Maciel	2606bfbabb	update: VMs survive `banger update` and rollback Three load-bearing fixes that together let `banger update` (and its auto-rollback path) restart the helper + daemon without killing every running VM. New smoke scenarios prove the property end-to-end. Bug fixes: 1. Disable the firecracker SDK's signal-forwarding goroutine. The default ForwardSignals = [SIGINT, SIGQUIT, SIGTERM, SIGHUP, SIGABRT] installs a handler in the helper that propagates the helper's SIGTERM (sent by systemd on `systemctl stop bangerd-root.service`) to every running firecracker child. Set ForwardSignals to an empty (non-nil) slice so setupSignals short-circuits at len()==0. 2. Add SendSIGKILL=no to bangerd-root.service. KillMode=process limits the initial SIGTERM to the helper main, but systemd still SIGKILLs leftover cgroup processes during the FinalKillSignal stage unless SendSIGKILL=no. 3. Route restart-helper / restart-daemon / wait-daemon-ready failures through rollbackAndRestart instead of rollbackAndWrap. rollbackAndWrap restored .previous binaries but didn't re-restart the failed unit, leaving the helper dead with the rolled-back binary on disk after a failed update. Testing infrastructure (production binaries unaffected): - Hidden --manifest-url and --pubkey-file flags on `banger update` let the smoke harness redirect the updater at locally-built release artefacts. Marked Hidden in cobra; not advertised in --help. - FetchManifestFrom / VerifyBlobSignatureWithKey / FetchAndVerifySignatureWithKey export the existing logic against caller-supplied URL / pubkey. The default entry points still call them with the embedded canonical values.
Smoke scenarios: - update_check: --check against fake manifest reports update available - update_to_unknown: --to v9.9.9 fails before any host mutation - update_no_root: refuses without sudo, install untouched - update_dry_run: stages + verifies, no swap, version unchanged - update_keeps_vm_alive: real swap to v0.smoke.0; same VM (same boot_id) answers SSH after the daemon restart - update_rollback_keeps_vm_alive: v0.smoke.broken-bangerd ships a bangerd that passes --check-migrations but exits 1 as the daemon. The post-swap `systemctl restart bangerd` fails, rollbackAndRestart fires, the .previous binaries are restored and re-restarted; the same VM still answers SSH afterwards - daemon_admin (separate prep): covers `banger daemon socket`, `bangerd --check-migrations --system`, `sudo banger daemon stop` The smoke release builder generates a fresh ECDSA P-256 keypair with openssl, signs SHA256SUMS cosign-compatibly, and serves artefacts from a backgrounded python http.server. verify_smoke_check_test.go pins the openssl/cosign signature equivalence so the smoke release builder can't silently drift. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 12:08:08 -03:00
Thales Maciel	7e528f30b3	test: add installmeta tests	2026-04-30 10:49:22 -03:00
Thales Maciel	1be90a7af5	Preserve runtime dir across restart so reconcile re-finds VMs v0.1.4 fixed the binary-level reconcile path for jailer'd VMs but left a hole at the systemd layer: bangerd.service and bangerd-root.service both defaulted to RuntimeDirectoryPreserve=no, so /run/banger was wiped on every daemon stop. The api-sock symlinks the helper creates for live VMs (`/run/banger/fc-<id>.sock` → `<chroot>/firecracker.socket`) went with it, and findByJailerPidfile — which derives the chroot from the symlink target — couldn't resolve them. Reconcile then fell through to "stale_vm" and tore down the surviving FC's dm-snapshot. Add RuntimeDirectoryPreserve=yes to both unit templates so the symlinks survive the restart window. Live-verified end-to-end on the dev host: started a VM under v0.1.5, restarted helper + daemon, confirmed the FC PID was unchanged and `banger vm ssh` returned the same boot_id pre and post. Daemon-lifecycle tests updated to assert the new directive is present in both rendered units so future regressions show up at test time. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 17:17:25 -03:00
Thales Maciel	cec7291184	Survive `banger update` with running VMs Two coupled fixes that together make the daemon-restart path of `banger update` non-destructive for running guests: 1. Unit templates set `KillMode=process` on bangerd.service and bangerd-root.service. The default control-group behaviour sent SIGKILL to every process in the cgroup on stop/restart — including jailer-spawned firecracker children, since fork/exec doesn't escape a systemd cgroup. With process mode only the unit's main PID is signalled; FC children stay alive in the (unowned) cgroup until the new helper instance starts up and re-claims them. 2. `fcproc.FindPID` falls back to the jailer-written pidfile at `<chroot>/firecracker.pid` (sibling of the api-sock target) when `pgrep -n -f <api-sock>` doesn't find a match. pgrep can't see jailer'd FCs because their cmdline only carries the chroot-relative `--api-sock /firecracker.socket`, not the host-side path. The pidfile is jailer's actual record of the post-exec FC PID, so reconcile can verify the surviving process is the right one (comm == "firecracker") and re-seed handles.json without tearing down the VM's dm-snapshot. Verified live on the dev host: started a VM, restarted the helper unit, restarted the daemon unit, and confirmed the FC PID was unchanged, vm list still showed the guest as running, and `banger vm ssh` returned the same boot_id pre and post restart. The systemd journal now reports "firecracker remains running after unit stopped" and "Found left-over process X (firecracker) in control group while starting unit. Ignoring." — exactly the shape `KillMode=process` is supposed to produce. Tests cover both the parser (parseVersionOutput from the v0.1.2 fix) and the new pidfile lookup: happy path, missing pidfile, stale pid, wrong comm, garbage content, non-symlink api-sock, whitespace tolerance. 
CHANGELOG corrects v0.1.0's misleading "daemon restarts do not interrupt running guests" line and documents the unit-refresh caveat: existing v0.1.0–v0.1.3 installs need a one-time `sudo banger system install` after updating to v0.1.4 to pick up the new KillMode directive (`banger update` swaps binaries, not unit files). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 17:09:15 -03:00
Thales Maciel	d867d61eb3	update: refresh install.toml commit + built_at from new binary After `banger update` swaps binaries, install.toml needs to reflect the just-installed identity. The previous code passed buildinfo.Current().{Commit,BuiltAt} into installmeta.UpdateBuildInfo — but buildinfo.Current() in the running CLI is the OLD pre-swap binary's identity (we're it), not the staged one. install.toml's version field got refreshed to target.Version while commit and built_at stayed pinned at the previous release. `banger doctor` compares the running CLI's three fields against install.toml's three fields and so raised a false-positive drift warning on every update. Fix: after the swap, exec /usr/local/bin/banger version, parse the three-line output, and write all three fields to install.toml. If the exec fails for any reason we fall back to the old behaviour (version + stale commit/built_at) with a warning, since install.toml drift is a doctor warning not a broken host — same posture as before for the failure path. The parser is split out (parseVersionOutput) and table-tested: happy path, whitespace-tolerance, missing-field rejection, empty input rejection, ignoring unrelated lines. Caught by running v0.1.0 → v0.1.1 live as the first end-to-end smoke test of the self-update flow, which was the whole point of that exercise. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 14:38:59 -03:00
Thales Maciel	b7c9661c99	updater: embed real cosign public key for v0.1.0 release signing The placeholder in BangerReleasePublicKey is replaced with the production cosign public key (P-256 ECDSA). The matching private key is stored offline by the maintainer; this is the public half that every banger CLI baked from this commit forward will use to verify SHA256SUMS signatures. cosign.pub is also committed at the repo root so external auditors can re-verify a release without parsing the Go source. The placeholder-refuses test now swaps the embedded key for a synthetic placeholder for the duration of the test, since the default value is no longer a placeholder. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 12:50:52 -03:00
Thales Maciel	fae28e3d8b	update: docs + publish script for the self-update feature README gets a top-level Updating section; docs/privileges.md gains a step-by-step trust-model writeup of `banger update`. The new scripts/publish-banger-release.sh drives the manual release cut: build, tar, sha256sum, cosign sign-blob, verify against the embedded public key, jq-merge into manifest.json, rclone upload to the R2 bucket. Refuses outright if the embedded key is still the placeholder so we can't accidentally publish an unverifiable release. Also folds in gofmt drift accumulated across the updater package and a few sibling files. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 12:43:46 -03:00
Thales Maciel	8ed351ea47	updater: cosign-blob signature verification on SHA256SUMS Closes the v0.1.0 cosign requirement. Every banger update download now goes through ECDSA-P256 verification before any binary is trusted: SHA256SUMS.sig is fetched, base64-decoded, and verified against the embedded BangerReleasePublicKey. * BangerReleasePublicKey: PEM-encoded ECDSA public key embedded at compile time. The current value is a sentinel PLACEHOLDER — the maintainer must replace it with the output of `cosign generate-key-pair`'s cosign.pub before cutting v0.1.0, and re-cut. Until they do, every `banger update` refuses with ErrSignatureRequired ("the maintainer must replace it and re-cut a release before update can proceed"). Loud refusal beats silent acceptance. * VerifyBlobSignature: parses the embedded public key, base64-decodes the signature, computes SHA256(body), runs ecdsa.VerifyASN1. cosign sign-blob produces the format VerifyASN1 verifies natively (ASN.1-DER encoded ECDSA over a SHA256 digest), so no third-party crypto deps needed. * FetchAndVerifySignature: pulls the signature URL from the release manifest entry, fetches it (1 KiB cap), and verifies against sumsBody. Refuses outright when sha256sums_sig_url is empty — v0.1.0 contract requires every release to be signed, and an unsigned release is a manifest publishing bug we'd rather catch loudly than silently accept. * Wired into banger update: sumsBody captured from DownloadRelease, immediately fed into FetchAndVerifySignature. A failed verification removes the staged tarball before returning so it can't be reused. * BangerReleasePublicKey is var (not const) only to support tests that swap in a generated keypair; production sets it at compile time and never mutates it.
Tests: placeholder-key path returns ErrSignatureRequired; happy path with a fresh in-test ECDSA keypair verifies a real sign-then-verify; tampered body, wrong key, and three malformed signature shapes (not-base64, empty, garbage-DER) all reject. Maintainer-cut workflow documented in BangerReleasePublicKey's comment: cosign generate-key-pair → paste cosign.pub into the constant → at release time, cosign sign-blob --key cosign.key SHA256SUMS > SHA256SUMS.sig and publish. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 12:37:53 -03:00
Thales Maciel	92ca1aa96f	cli: add `banger update` command Wires updater + the existing system-install helpers into a single operator-facing flow: 1. FetchManifest, resolve target release (default: latest_stable; override with --to vX.Y.Z). 2. --check exits with a one-line "up to date" / "update available". Same as `banger update --check` style for tools polling on a timer. 3. requireRoot beyond this point — we're about to write /usr/local/bin and talk to systemctl. 4. daemon.operations.list → refuse if any operation isn't Done. --force overrides; per the v0.1.0 plan there's no drain wait. 5. PrepareCleanStaging + DownloadRelease + StageTarball into /var/cache/banger/updates/. 6. Sanity-run the staged binaries: `banger --version` must mention the expected version; `bangerd --check-migrations --system` must exit 0 (compatible) or 1 (will auto-migrate). Exit 2 (incompatible) aborts before the swap. 7. --dry-run stops here with a one-line plan, leaves staging. 8. Swap (vsock → bangerd → banger) → restart bangerd-root then bangerd → waitForDaemonReady on the system socket. 9. Run `banger doctor` against the JUST-INSTALLED CLI binary (not d.doctor in-process — we want to exercise the new binary end-to-end). FAIL triggers auto-rollback: restore .previous backups, restart services, surface the original failure with "(rolled back to previous install)". 10. UpdateBuildInfo on /etc/banger/install.toml. CleanupBackups. Wipe staging dir. rollbackAndWrap / rollbackAndRestart split: the former is for failures BEFORE the systemctl restart (old binaries are still on disk under .previous; the OLD daemon is still running because the restart never happened). The latter is for failures AFTER, where rollback ALSO needs another systemctl restart so the OLD versions take over again. If even rollback's restart fails, we surface everything we know — the install is broken and the operator gets the breadcrumbs to fix it manually. 
Existing TestNewBangerCommandHasExpectedSubcommands updated to include "update" in the expected ordering. Live exercise against the empty bucket today errors as expected: $ banger update --check banger: discover: fetch manifest: HTTP 404 Not Found # exit 1 once the user publishes the first manifest the same command will report "up to date" or "update available". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 12:35:04 -03:00
Thales Maciel	91af367208	updater: download/stage/swap/rollback flow steps The pure-logic core of `banger update`. No CLI yet; this commit ships the steps the next commit's command will orchestrate. * download.go — DownloadRelease fetches SHA256SUMS, parses it, looks up the tarball's basename, then streams the tarball through download.FetchVerified so the hash is checked on the fly. Returns the SHA256SUMS bytes alongside so a future cosign-verification step can validate them against an embedded public key before trusting the hashes inside. Also: fetchBounded for small bounded GETs (manifest, sums file, future signature), DefaultStagingDir, EnsureStagingDir, PrepareCleanStaging. * stage.go — StageTarball reads gzip+tar, validates the entry set is exactly {banger, bangerd, banger-vsock-agent} (no extras, no missing, no path traversal, no non-regular files), extracts at mode 0755 regardless of what the tarball claims. StagedRelease records the resulting paths. * swap.go — InstallTargets pins the canonical install paths (/usr/local/bin/banger, /usr/local/bin/bangerd, /usr/local/lib/banger/banger-vsock-agent). Swap orders the three replacements vsock → bangerd → banger so the most impactful binary (the CLI) goes last; each step uses system.AtomicReplace and accumulates a SwapResult so partial failures can be rolled back cleanly. Rollback unwinds in reverse, joining errors so a half-rolled-back state surfaces enough info for an operator to fix manually. CleanupBackups removes the .previous trail after `banger doctor` confirms the new install is healthy. * installmeta.UpdateBuildInfo — small helper that refreshes Version/Commit/BuiltAt on /etc/banger/install.toml without re-running the full system install. Preserves OwnerUser/UID/GID/Home and the original InstalledAt timestamp. Tests: stage rejects extra entries / missing entries / path traversal / non-regular files; happy-path stages all three at 0755 with correct contents.
Swap+Rollback covers the all-three-succeed path (then verifies .previous backups exist + rollback restores old contents) AND the partial-failure path (third swap blocked by a non-dir parent → SwappedTargets = 2 → rollback unwinds those two cleanly). DownloadRelease covers happy path, tarball-not-in-SHA256SUMS, and propagated sha256 mismatch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 12:30:22 -03:00
Thales Maciel	fb6d2b1dae	updater: manifest + SHA256SUMS parsing scaffolding First slice of the `banger update` package. No CLI yet — this just defines the wire shape and parsers the rest of the flow will plug into. * internal/updater/manifest.go — Manifest / Release types, ManifestSchemaVersion = 1, the hardcoded URL https://releases.thaloco.com/banger/manifest.json (var instead of const so tests can point at httptest), and FetchManifest / ParseManifest / Manifest.LookupRelease / Manifest.Latest. The manifest only references URLs (tarball, SHA256SUMS, optional signature); actual binary hashes come from SHA256SUMS itself, so manifest tampering can't substitute a hash for a known-good tarball. SchemaVersion gates forward-compat: a CLI that doesn't know its server's schema_version refuses to update rather than guessing. * internal/updater/sha256sums.go — ParseSHA256Sums tolerates both GNU `<digest> <file>` (with optional `*` binary prefix) and BSD `SHA256 (file) = <digest>` formats. Comments and blank lines are skipped; malformed lines that LOOK like entries are rejected (silent skipping is the wrong failure mode for a security-relevant input). Digests are lowercased so the caller can `==`-compare without worrying about case. Caps: 1 MiB on the manifest body, 16 KiB on SHA256SUMS, 256 MiB on release tarballs. Generous-but-bounded; bumping requires a code change so a server-side mistake can't fill the disk. Tests: ParseManifest happy path, schema-version-too-new rejection, five malformed-input cases. ParseSHA256Sums covers GNU + BSD + star-prefix + comments-and-blanks, six malformed-input rejections, case-insensitive digest normalisation. FetchManifest end-to-end via httptest. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 12:24:36 -03:00
Thales Maciel	abd5d6f5ab	download: shared FetchVerified helper for capped + hashed downloads imagecat.Fetch and kernelcat.Fetch each implement the same pattern: HTTP GET with a Content-Length pre-check, an io.LimitReader cap on the body, on-the-fly sha256 hashing, and refusal on either the cap trip or a hash mismatch. The about-to-arrive `banger update` flow makes a third caller, which is the right number to factor. * internal/download.FetchVerified(ctx, client, url, expectedSHA256, maxBytes, dstPath): streams the body to dstPath through a sha256 hasher, capped at maxBytes+1 bytes so an oversize body is detected before the hash check fires. On any failure (HTTP error, ContentLength > cap, body exceeds cap, write error, hash mismatch) the partial file is removed before returning so callers don't have to disambiguate "did we leave bytes on disk?". Imagecat and kernelcat are NOT migrated to this helper in this commit — they each have their own destination-dir layout and post-verify decompress/extract steps that don't fit a one-size helper. Lift them later if it stays clean; for now the helper is sized for the updater's "fetch tarball + sha256SUMS" need. Tests cover happy path, hash mismatch, advertised Content-Length over cap, lying server (chunked, no Content-Length, but oversize body), HTTP non-2xx, and the two arg-validation rejections (empty expected hash, non-positive maxBytes). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 18:44:27 -03:00
Thales Maciel	fa3a7a3e31	system: add AtomicReplace + Rollback for binary swap Prerequisite for `banger update`'s swap step. The updater renames a staged binary into place and needs (a) atomicity per file (no half-written bytes for a process that's about to systemctl restart into the new binary) and (b) a backup it can restore from when post-restart doctor reports FAIL. * AtomicReplace(newSrc, dst, suffixPrevious): if dst exists, move it to dst+suffixPrevious. Then os.Rename newSrc → dst. Atomic on a single fs (the only case relevant to the updater — everything is staged under /var/cache/banger and then renamed into /usr/local/bin, but those should be on the same fs in a typical install). On rename failure, restore the backup so we don't leave the caller without their binary. * AtomicReplaceRollback(dst, suffixPrevious): symmetric inverse. Removes dst, renames dst+suffixPrevious back to dst. Tolerant of a missing backup (fresh-install case) so the updater can call it unconditionally on failure paths without tracking backup state itself. * Refuses an empty suffix at compile-time-style guard: an empty suffix would silently no-op the backup AND break rollback. Six tests cover: happy path, fresh install (no prior dst), stale .previous from a half-finished prior run, empty-suffix rejection, rollback restores, rollback tolerant of no-backup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 18:43:04 -03:00
Thales Maciel	ec6fc9d185	store,bangerd: add --check-migrations flag for pre-swap schema check Prerequisite for `banger update`. Before swapping a staged binary into place, the updater needs to confirm the new bangerd recognises the running install's DB schema. Without this, an operator could end up with a service that won't open its store after the binary swap + restart. * store.InspectSchemaState(path): opens the DB read-only (reusing OpenReadOnly's mode=ro DSN), reads the schema_migrations table, and classifies the relationship between applied and known IDs: SchemaCompatible (lockstep), SchemaMigrationsNeeded (binary newer, will auto-migrate on first Open), or SchemaIncompatible (DB has applied IDs the binary doesn't know about). Missing schema_migrations table is treated as "all migrations pending" rather than an error — matches the fresh-install case. * bangerd --check-migrations: opens the configured DB read-only, prints a one-line classification, and exits 0/1/2. The exit code is the contract: 0 — compatible 1 — migrations needed (binary newer; safe to swap) 2 — incompatible (binary older than DB; abort the swap) Honours --system to pick between system StateDir and user mode. * bangerdExit indirection so future tests can capture the exit code without terminating the test process. Production points at os.Exit. Tests cover the four classifications: compatible (fully migrated DB), migrations-needed (only baseline applied), incompatible (synthetic id=99 inserted), and missing-table (fresh DB). Live exercise on this dev host returned `migrations needed: pending [3] (binary will apply on first Open)` and exit 1, matching the contract. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 18:41:31 -03:00
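The three-way classification and its exit-code contract from ec6fc9d185 reduce to a comparison of two ID sets. The sketch below is a hypothetical model of that logic with invented names (`schemaState`, `classify`); it does not open a database the way `store.InspectSchemaState` does.

```go
package main

import "fmt"

// schemaState doubles as the process exit code described in the commit.
type schemaState int

const (
	schemaCompatible       schemaState = 0 // applied and known IDs match
	schemaMigrationsNeeded schemaState = 1 // binary knows IDs the DB lacks
	schemaIncompatible     schemaState = 2 // DB has IDs the binary lacks
)

// classify compares migration IDs applied in the DB with the IDs the binary
// knows about. An empty applied list models a missing schema_migrations table.
func classify(applied, known []int) schemaState {
	knownSet := make(map[int]bool, len(known))
	for _, id := range known {
		knownSet[id] = true
	}
	appliedSet := make(map[int]bool, len(applied))
	for _, id := range applied {
		if !knownSet[id] {
			return schemaIncompatible
		}
		appliedSet[id] = true
	}
	for _, id := range known {
		if !appliedSet[id] {
			return schemaMigrationsNeeded
		}
	}
	return schemaCompatible
}

func main() {
	known := []int{1, 2, 3}
	fmt.Println(classify([]int{1, 2, 3}, known)) // lockstep
	fmt.Println(classify([]int{1, 2}, known))    // binary newer than DB
	fmt.Println(classify([]int{1, 2, 99}, known)) // DB newer than binary
}
```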
Thales Maciel	3c0af3a2de	opstate,daemon: list in-flight operations via daemon.operations.list Prerequisite for `banger update`'s preflight, which refuses to swap binaries while anything is in flight. Today's opstate.Registry exposes Insert/Get/Prune but no iteration; without a snapshot accessor the update flow can't tell whether a vm.create is mid-prepare-work-disk. * opstate.Registry.List(): returns a freshly-allocated snapshot of every entry. Mutating the slice doesn't poison the registry. Pinned by tests covering the snapshot semantics and the empty case. * api.OperationSummary / OperationsListResult: a public-shape record per op. Today the Kind is always "vm.create" — the field exists so future async kinds (image.pull, kernel.pull) plug in without an API change. * Daemon.ListOperations + daemon.operations.list RPC: walks vmService.createOps and emits OperationSummary entries. Done ops are included in the snapshot; the update preflight filters by Done itself. * dispatch_test's documented-methods list updated. No behaviour change for existing flows; this is a read-only addition. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 18:14:57 -03:00
Thales Maciel	775525b592	cli,doctor: --version flag + CLI/install drift check Two pre-release polish items on the version-display surface. * --version on both binaries: cobra's Version field on the banger and bangerd roots renders a one-line summary (banger v0.1.0 (commit abcd1234, built 2026-04-28T20:45:50Z)). The SetVersionTemplate override drops cobra's "{{.Name}} version" prefix — our string is already a complete sentence. The multi-line `banger version` subcommand is unchanged for callers that want the full SHA / built_at on separate lines. * Doctor "banger version" row: prints the running CLI's version + short commit + built-at, plus what /etc/banger/install.toml recorded at install time. Disagreement is the most common version-skew pitfall (stale CLI against fresh daemon, or vice versa) and a one-line warn is friendlier than tracking that down from a launch failure. Drift detection is suppressed when either side is dev/unknown (untagged build) — comparing a dev CLI against a tagged install is the developer-machine case, not a real problem. formatVersionLine is in internal/cli (banger.go) and reused by bangerd.go via a strings.Replace because bangerd's version line should say "bangerd" not "banger". Slightly tilt-feeling but cheaper than parameterising the helper for one caller. Tests: TestVersionsDriftToleratesDevAndUnknown pins the four branches (match, version diff, commit diff, dev-suppression). The existing version-format test already runs through formatVersionLine indirectly. Live exercise: $ banger --version banger dev (commit `1c1ca7d6`, built 2026-04-28T20:52:33Z) $ bangerd --version bangerd dev (commit `1c1ca7d6`, built 2026-04-28T20:52:33Z) $ banger doctor | head ... PASS banger version - CLI dev (commit `1c1ca7d6`, built 2026-04-28T20:52:33Z) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 17:53:32 -03:00
Thales Maciel	1c1ca7d6a4	doctor: pin firecracker version range, distro-aware install hint Pre-release polish: be explicit about which firecracker versions banger has been validated against, and give users a one-line install suggestion when the binary is missing rather than the previous generic "install firecracker or set firecracker_bin". internal/firecracker/version.go (new): * MinSupportedVersion = "1.5.0" — the floor banger refuses to launch below. Bumping this is a deliberate decision, paired with whatever helper feature started requiring the newer firecracker. * KnownTestedVersion = "1.14.1" — what banger's smoke suite actually runs against today. * SemVer + Compare + ParseVersionOutput, table-tested. The parser tolerates the trailing "exiting successfully" log line that firecracker tacks onto --version; only the canonical "Firecracker vX.Y.Z" line matters. * QueryVersion shells `<bin> --version` through a CommandRunner-shaped interface; doesn't import internal/system to keep the firecracker package leaf-clean. internal/daemon/doctor.go: * New addFirecrackerVersionCheck replaces the previous bare RequireExecutable preflight for firecracker. Three outcomes: PASS within [Min, Tested], WARN above Tested (newer firecracker usually works but is outside the tested window), FAIL below Min or when the binary is missing. * On missing binary, surfaces a distro-aware install command via parseOSReleaseIDs(/etc/os-release) → guessFirecrackerInstallCommand. Pinned suggestions for debian (apt), arch/manjaro (paru), and nixos (nix-env). Other distros get only the upstream Releases URL — guessing wrong sends users on a wild goose chase. * runtimeChecks no longer includes the firecracker preflight; the new check subsumes it. README.md: * Requirements line now spells out the tested-against version (v1.14.1) and the supported floor (≥ v1.5.0), and points at `banger doctor` for the version check + install hint. 
Tests: ParseVersionOutput across canonical/prerelease/garbage inputs, SemVer.Compare across major/minor/patch boundaries, MustParseSemVer panics on malformed inputs. Doctor-side: PASS on tested version, FAIL below Min, WARN above Tested, FAIL with upstream URL when missing, install-hint dispatch table covering debian/ubuntu (via ID_LIKE)/arch/manjaro/nixos/fedora-fallback/missing-os-release. The renamed TestDoctorReport_MissingFirecrackerFails... now asserts against the new check name. Live `banger doctor` reports "v1.14.1 at /usr/bin/firecracker (within tested range; min v1.5.0, tested v1.14.1)" against the smoke host. Smoke bare_run still green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 17:47:42 -03:00
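The version gate described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in internal/firecracker/version.go: the `SemVer` field layout, the regular expression, the `classify` helper, and the sample `--version` text are all assumptions made for the example; only the 1.5.0 floor, the 1.14.1 tested version, and the PASS/WARN/FAIL outcomes come from the commit message.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// SemVer is a hypothetical three-part version; the real type may differ.
type SemVer struct{ Major, Minor, Patch int }

// versionLine matches only the canonical "Firecracker vX.Y.Z" line, so any
// extra log lines in the output are ignored.
var versionLine = regexp.MustCompile(`(?m)^Firecracker v(\d+)\.(\d+)\.(\d+)`)

// ParseVersionOutput extracts the version from `firecracker --version` output.
func ParseVersionOutput(out string) (SemVer, error) {
	m := versionLine.FindStringSubmatch(out)
	if m == nil {
		return SemVer{}, fmt.Errorf("no Firecracker version line in %q", out)
	}
	var parts [3]int
	for i := range parts {
		n, err := strconv.Atoi(m[i+1])
		if err != nil {
			return SemVer{}, err
		}
		parts[i] = n
	}
	return SemVer{parts[0], parts[1], parts[2]}, nil
}

// Compare returns -1, 0, or 1 when a is below, equal to, or above b.
func (a SemVer) Compare(b SemVer) int {
	pairs := [3][2]int{{a.Major, b.Major}, {a.Minor, b.Minor}, {a.Patch, b.Patch}}
	for _, p := range pairs {
		if p[0] < p[1] {
			return -1
		}
		if p[0] > p[1] {
			return 1
		}
	}
	return 0
}

// classify maps a version onto the three doctor outcomes (assumed helper).
func classify(v, floor, tested SemVer) string {
	switch {
	case v.Compare(floor) < 0:
		return "FAIL"
	case v.Compare(tested) > 0:
		return "WARN"
	default:
		return "PASS"
	}
}

func main() {
	floor, tested := SemVer{1, 5, 0}, SemVer{1, 14, 1}
	// Illustrative output only; the real trailing log line looks different.
	v, err := ParseVersionOutput("Firecracker v1.14.1\n(log) exiting successfully\n")
	if err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println(classify(v, floor, tested))
}
```

Anchoring the pattern to the start of a line is what lets the parser skip unrelated log output without special-casing it.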
Thales Maciel	f7a6832ebf	Merge model,cli,docs polish for v0.1.0 # Conflicts: # internal/cli/commands_image.go	2026-04-28 17:36:47 -03:00
Thales Maciel	d0997fd3b5	model,cli,docs: medium-effort polish for v0.1.0 * model.ParseSize / FormatSizeBytes: pinned with table tests in internal/model/types_test.go (TestParseSize 22 cases, TestFormatSizeBytes 11 cases, TestParseSizeFormatRoundTrip 7 boundaries). Fixed the long-suffix regression: "4GiB", "512MiB", "4KiB" now parse correctly (parser strips trailing IB before inspecting the unit byte). Pinned current behaviour for no-suffix input ("1024" treated as MiB) and FormatSizeBytes(0). commands_image.go --size flag-help updated to show 4GiB now that the parser accepts it. * vm ports --json: matches the JSON-vs-table inconsistency between vm stats (always JSON) and vm ports (always table). --json on vm ports flips to the same printJSON path as vm stats. Default table output unchanged. Other vm subcommands (show, stats, logs, health, ping) didn't fit the identical pattern; left alone. * docs/oci-import.md architecture section moved to a new docs/oci-import-internals.md (precedent: internal/daemon/ARCHITECTURE.md). User-facing oci-import.md keeps a one-line pointer for advanced reading. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 17:36:03 -03:00
Thales Maciel	003b0488ce	cli,docs: trivial polish for v0.1.0 A pre-release audit collected ~12 trivial-effort UX and code-hygiene items. Rolling them up here so the v0.1.0 commit log isn't littered with one-line tweaks. CLI help / completion: * commands_image.go: drop dangling reference to a `banger image catalog` subcommand that doesn't exist; replace with a pointer to `banger image list`. * commands_image.go: --size flag example was "4GiB" but the parser rejects that suffix. Change example to "4G". (Parser-side fix is in a separate concern.) * commands_image.go + completion.go: image pull now wires a catalog completer (falls back to local image names since there's no image-catalog RPC yet); image show / delete / promote already completed local names. * commands_kernel.go + completion.go: kernel pull now wires a new completeKernelCatalogNameOnlyAtPos0 backed by the kernel.catalog RPC, so tab-complete suggests pullable kernels. * commands_vm.go: vm stats and vm set now have Long + Example blocks (peers all do); --from flag description updated to spell out the relationship to --branch. README: * Define "golden image" inline at first use. * Add a one-line Requirements block above Quick Start so users hit the firecracker / KVM dependency before `make build`. Code hygiene: * dashIfEmpty / emptyDash were the same function. Deleted emptyDash, retargeted three call sites. * formatBytes (introduced today in image cache prune) duplicated humanSize. Consolidated to humanSize, now with a space ("1.2 GiB" not "1.2GiB"). formatters_test.go expectations updated. Logging chattiness: * "operation started" (logger.go), "daemon request canceled" (daemon.go), and "helper rpc completed" (roothelper.go) all fired at INFO per RPC. Downgraded to DEBUG so routine shell completions don't spam syslog. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 17:31:54 -03:00
Thales Maciel	4d8dca6b72	image: add `banger image cache prune` for OCI cache cleanup OCI layer blobs accumulate forever — every pull writes layers to ~/.cache/banger/oci/blobs/sha256/<hex> via go-containerregistry's filesystem cache, and nothing ever evicts them. The cache is purely a re-pull-avoidance (every flattened image is independent of the blobs that sourced it), so it's a perfect candidate for an opt-in operator-driven prune. New surface: * api: ImageCachePruneParams{DryRun}, ImageCachePruneResult {BytesFreed, BlobsFreed, DryRun, CacheDir}. * daemon: ImageService.PruneOCICache walks layout.OCICacheDir for a (bytes, blobs) tally, then — outside dry-run — atomically renames the cache aside, recreates it empty, and rm -rf's the aside dir. The rename-then-rm avoids leaving the cache in a half-removed state if a pull starts mid-prune (the in-flight pull's open files survive the rename via standard Linux semantics; it just sees a fresh empty cache afterwards). Missing cache dir is treated as zero — fresh installs that have never pulled an OCI image don't error. * dispatch: image.cache.prune RPC (paramHandler-wrapped, mirroring every other image RPC). Documented-methods test list updated. * cli: `banger image cache` group with a `prune` subcommand (--dry-run flag). Output is a single line: "freed 1.2 GiB across 47 blob(s) in /var/cache/banger/oci" or "would free …". formatBytes helper for the size pretty-print. docs/oci-import.md: replaced the "Tech debt: cache eviction" bullet with a "Cache lifecycle" section describing the new command and the in-flight-pull caveat. Tests: PruneOCICache covers the happy path (real prune empties the cache, recreates an empty dir, doesn't leak the .pruning- aside), the dry-run path (returns size, leaves blobs intact), and the fresh-install path (cache dir absent → zero result, no error). Smoke at JOBS=4 still green; live exercise against an empty cache on a system install prints the expected zero summary. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 16:32:57 -03:00
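The rename-then-remove prune described in this commit can be sketched as below. It is an assumption-laden illustration, not ImageService.PruneOCICache itself: the `pruneCache` and `demo` names, the aside-directory naming, and the directory mode are made up for the example.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// pruneCache tallies the regular files under dir. Unless dryRun is set it then
// renames dir aside, recreates it empty, and removes the aside copy, so the
// cache path never sits in a half-removed state.
func pruneCache(dir string, dryRun bool) (int64, int, error) {
	var size int64
	var blobs int
	err := filepath.WalkDir(dir, func(_ string, d fs.DirEntry, walkErr error) error {
		if walkErr != nil {
			return walkErr
		}
		if !d.Type().IsRegular() {
			return nil
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		size += info.Size()
		blobs++
		return nil
	})
	if errors.Is(err, fs.ErrNotExist) {
		return 0, 0, nil // fresh install: no cache yet, nothing to free
	}
	if err != nil {
		return 0, 0, err
	}
	if dryRun {
		return size, blobs, nil
	}
	// Hypothetical aside name; the real suffix is not shown in the log.
	aside := fmt.Sprintf("%s.pruning-%d", dir, os.Getpid())
	if err := os.Rename(dir, aside); err != nil {
		return 0, 0, err
	}
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return 0, 0, err
	}
	return size, blobs, os.RemoveAll(aside)
}

// demo seeds a throwaway cache with one 5-byte blob and prunes it.
func demo() (int64, int, error) {
	root, err := os.MkdirTemp("", "oci-cache-sketch-")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(root)
	dir := filepath.Join(root, "blobs")
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return 0, 0, err
	}
	if err := os.WriteFile(filepath.Join(dir, "layer"), []byte("hello"), 0o644); err != nil {
		return 0, 0, err
	}
	return pruneCache(dir, false)
}

func main() {
	size, blobs, err := demo()
	fmt.Println(size, blobs, err)
}
```

Because the rename is a single directory-entry swap, a pull that already holds open files keeps working against the aside copy until it closes them.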
Thales Maciel	182bccf8af	roothelper: pin bridge name + IP + CIDR to a banger-managed shape priv.ensure_bridge / priv.create_tap accepted the daemon's network config triple (BridgeName, BridgeIP, CIDR) and forwarded it straight to `ip link` / `ip addr` / `ip link set master`. Argv-style exec ruled out shell injection, but the kernel happily honours those commands against any iface a compromised owner-uid daemon names — including eth0/docker0/lo. Concretely: * priv.ensure_bridge could `ip link set <iface> up` against any host interface and `ip addr add` arbitrary IP/CIDR to it. * priv.create_tap could `ip link set <new-tap> master <iface>`, bridging the per-VM tap into the host's primary LAN so the guest sees host-local broadcast traffic. * priv.sync_resolver_routing / priv.clear_resolver_routing only enforced "name shaped like a Linux iface" — no banger constraint. New validators (single chokepoint via validateNetworkConfig): * validateBangerBridgeName: name must equal "br-fc" or start with "br-fc-". Stops a compromised daemon from naming any host iface in these RPCs. Users with a custom bridge keep the prefix. * validateCIDRPrefix: numeric in [8, 32]. Wider prefixes would silently widen the bridge subnet beyond what the daemon intends. * validateNetworkConfig bundles bridge-name + validateIPv4 + validateCIDRPrefix so every helper RPC that takes the triple stays in lockstep. Wired into methodEnsureBridge, methodCreateTap, and the resolver-routing pair (replacing the older validateLinuxIfaceName-only check with the stricter banger-bridge check). docs/privileges.md updated: the helper-RPC table rows now spell out the banger-managed bridge constraint, and the trust list includes the new validators. Tests: TestValidateBangerBridgeName (default + suffixed accepted, host ifaces / wrong prefix / oversized rejected), TestValidateCIDRPrefix (boundary + non-numeric + IPv6-style 64 rejected), TestValidateNetworkConfig (happy path + each-field-bad cases). 
Smoke at JOBS=4 still green — banger's defaults sail through the new gate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 16:19:28 -03:00
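A rough sketch of the three validators named in this commit is shown below. The function names follow the commit message, but the signatures, the string-typed prefix, the error texts, and the 15-character length cap (borrowed from the Linux interface-name limit mentioned elsewhere in this log) are assumptions.

```go
package main

import (
	"fmt"
	"net/netip"
	"strconv"
	"strings"
)

// validateBangerBridgeName accepts only "br-fc" or names starting with "br-fc-".
func validateBangerBridgeName(name string) error {
	if name != "br-fc" && !strings.HasPrefix(name, "br-fc-") {
		return fmt.Errorf("bridge %q is not banger-managed", name)
	}
	if len(name) > 15 { // assumed cap: Linux interface names are at most 15 chars
		return fmt.Errorf("bridge %q is too long", name)
	}
	return nil
}

// validateCIDRPrefix requires a numeric prefix length in [8, 32].
func validateCIDRPrefix(prefix string) error {
	n, err := strconv.Atoi(prefix)
	if err != nil || n < 8 || n > 32 {
		return fmt.Errorf("CIDR prefix %q must be a number in [8, 32]", prefix)
	}
	return nil
}

// validateIPv4 requires a parsable IPv4 address.
func validateIPv4(addr string) error {
	a, err := netip.ParseAddr(addr)
	if err != nil || !a.Is4() {
		return fmt.Errorf("%q is not an IPv4 address", addr)
	}
	return nil
}

// validateNetworkConfig is the single chokepoint for the (bridge, IP, CIDR) triple.
func validateNetworkConfig(bridge, ip, prefix string) error {
	if err := validateBangerBridgeName(bridge); err != nil {
		return err
	}
	if err := validateIPv4(ip); err != nil {
		return err
	}
	return validateCIDRPrefix(prefix)
}

func main() {
	fmt.Println(validateNetworkConfig("br-fc", "172.16.0.1", "24"))
	fmt.Println(validateNetworkConfig("eth0", "172.16.0.1", "24"))
}
```

Routing every RPC through one bundling function keeps the three checks from drifting apart as new helper methods are added.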
Thales Maciel	4004ce2e7e	imagecat,kernelcat: bound staged download, hash before extract Both Fetch flows previously streamed resp.Body straight into zstd → tar → on-disk extractor with the SHA256 check tacked on at the END. A bad mirror or an attacker that's compromised the catalog host could ship a multi-gigabyte tarball, watch banger expand it to disk, and only THEN see the helpful "sha256 mismatch" message — having already filled the host filesystem. Reorder the operations: stage the compressed tarball to a temp file under the destination directory through an io.LimitReader (cap +1 bytes), hash on the way in, refuse to decompress if either the cap trips or the SHA mismatches. Worst-case disk use is bounded by the cap, not by the source. Cap is exposed as a package var (MaxFetchedBundleBytes, MaxFetchedKernelBytes) so callers can tune per-deployment and tests can squeeze it down to provoke the rejection. Default 8 GiB — generous enough for a 4 GiB rootfs (which compresses to ~1-2 GiB), tight enough to make a "fill the host disk" attack expensive. The temp file lives in the destination dir so extraction stays on the same filesystem and we don't pay for cross-FS rename. defer os.Remove cleans up; the existing per-package cleanup() handler still removes any partial extraction on hash mismatch / extraction failure. Tests: each package gets a TestFetchRejectsOversizedTarballBeforeExtraction that sets the cap to 64 bytes, points Fetch at a multi-KB tarball, and asserts (a) error mentions "cap", (b) destination dir is left clean (no leaked rootfs / manifest / kernel tree). All existing tests still pass — happy path, hash mismatch, missing files, path traversal, HTTP error, etc. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 16:09:55 -03:00
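The stage-then-verify order described in this commit could look roughly like the sketch below. `stageBounded`, `hashHex`, and `checkPayload` are hypothetical names, and the temp-file pattern and error wording are assumptions; only the io.LimitReader with cap + 1 bytes, the hash-on-the-way-in step, and the refusal before extraction come from the commit message.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"strings"
)

// stageBounded copies src into a temp file under dir, reading at most
// maxBytes+1 bytes and hashing as it goes. The staged path is returned only
// when the size cap held and the SHA256 matched; otherwise the file is removed.
func stageBounded(src io.Reader, dir string, maxBytes int64, wantSHA256 string) (string, error) {
	tmp, err := os.CreateTemp(dir, "fetch-*.tmp")
	if err != nil {
		return "", err
	}
	defer tmp.Close()

	h := sha256.New()
	// Reading one byte past the cap is how an oversized body is detected.
	n, err := io.Copy(io.MultiWriter(tmp, h), io.LimitReader(src, maxBytes+1))
	if err == nil && n > maxBytes {
		err = fmt.Errorf("download exceeds cap of %d bytes", maxBytes)
	}
	if err == nil {
		if got := hex.EncodeToString(h.Sum(nil)); got != wantSHA256 {
			err = fmt.Errorf("sha256 mismatch: got %s, want %s", got, wantSHA256)
		}
	}
	if err != nil {
		os.Remove(tmp.Name())
		return "", err
	}
	return tmp.Name(), nil
}

// hashHex returns the hex SHA256 of s (demo helper).
func hashHex(s string) string {
	sum := sha256.Sum256([]byte(s))
	return hex.EncodeToString(sum[:])
}

// checkPayload stages an in-memory payload and cleans up again (demo helper).
func checkPayload(payload string, maxBytes int64, wantSHA256 string) error {
	path, err := stageBounded(strings.NewReader(payload), os.TempDir(), maxBytes, wantSHA256)
	if err != nil {
		return err
	}
	return os.Remove(path)
}

func main() {
	fmt.Println(checkPayload("hello", 64, hashHex("hello")))
	fmt.Println(checkPayload("a much longer payload", 4, hashHex("a much longer payload")))
}
```

Only after `stageBounded` succeeds would a caller hand the staged file to the decompressor, so worst-case disk use stays at the cap plus one byte.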
Thales Maciel	3805b093b4	roothelper: tie kill/signal authorization to banger-launched firecracker validateFirecrackerPID was a substring check on /proc/<pid>/cmdline: "contains 'firecracker'". Good enough to refuse init/sshd/the test binary, but on a shared host where multiple users run firecracker the helper would happily SIGKILL someone else's VM. The owner-UID daemon could weaponise the helper as an arbitrary "kill any firecracker on this box" primitive. Replace the substring gate with two stronger acceptance modes: * Cgroup match (the supported path): /proc/<pid>/cgroup contains bangerd-root.service. systemd assigns every direct child of the helper unit into that cgroup at fork; the kernel keeps it there for the process's lifetime, so no daemon-UID code can forge it. Other users' firecracker processes live in different cgroups (user@<uid>.service, foreign service slices) and fail this check. Also robust across helper restarts: KillMode=control-group on the unit kills children when the service goes down, so an "orphan banger firecracker in some other cgroup" is rare by construction. * --api-sock fallback: cmdline carries `--api-sock <path>` with the path under banger's RuntimeDir. Covers the legacy direct (no-jailer) launch path, and gives daemon reconcile a way to clean up the rare orphan that lands outside the service cgroup after a hard helper crash. Tried /proc/<pid>/root first — pivot_root semantics make jailer'd firecracker read its root as "/" from any namespace, so the symlink is useless as a banger-managed fingerprint. Cgroup is the right signal. Also added a signal allowlist: priv.signal_process now rejects anything outside {TERM, KILL, INT, HUP, QUIT, USR1, USR2, ABRT} (case-insensitive, with or without SIG prefix). STOP/CONT, real-time signals, and numeric forms are refused — the helper running as root must not be a generic "send arbitrary signal to my pid" primitive. priv.kill_process is unaffected (it always sends KILL). 
Tests: validateSignalName covers allowlist + numeric/STOP/RTMIN rejection; extractFirecrackerAPISock pins the three flag forms (--api-sock VAL, --api-sock=VAL, -a VAL); pathIsUnder gets a small table; existing TestValidateFirecrackerPID still rejects PID 0, PID 1, and the test process itself. Doctor's non-system-mode test gained a t.TempDir-backed install path so it stops being environment-dependent on machines that happen to have /etc/banger/install.toml. Smoke at JOBS=4 still green — every banger-launched firecracker sails through the cgroup match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 16:00:41 -03:00
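The signal allowlist from this commit can be illustrated with a small sketch. The set of names comes from the commit message; the function signature, the normalisation order, and the error text are assumptions rather than the helper's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// allowedSignals is the allowlist quoted in the commit message.
var allowedSignals = map[string]bool{
	"TERM": true, "KILL": true, "INT": true, "HUP": true,
	"QUIT": true, "USR1": true, "USR2": true, "ABRT": true,
}

// validateSignalName normalises case and an optional SIG prefix, then checks
// the allowlist. Numeric forms, STOP/CONT, and real-time signals fall through
// to the rejection branch because they are simply not in the map.
func validateSignalName(name string) (string, error) {
	n := strings.TrimPrefix(strings.ToUpper(strings.TrimSpace(name)), "SIG")
	if !allowedSignals[n] {
		return "", fmt.Errorf("signal %q is not allowed", name)
	}
	return n, nil
}

func main() {
	for _, s := range []string{"sigterm", "USR1", "SIGSTOP", "9", "RTMIN"} {
		n, err := validateSignalName(s)
		fmt.Printf("%q -> %q, err=%v\n", s, n, err)
	}
}
```

An allowlist fails closed: any signal name added to the platform later is rejected until someone deliberately adds it to the map.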
Thales Maciel	4a56e6c7d6	roothelper: walk validateManagedPath components, reject symlinks validateManagedPath was textual-only: filepath.Clean + dest-prefix match. That stopped `..` escapes but not the symlink-bypass attack that motivated this fix — a daemon-UID attacker can write into StateDir/RuntimeDir (it's their UID), so they can plant `<StateDir>/redirect -> /etc` and any helper RPC that then operates on `<StateDir>/redirect/...` resolves through the symlink at the kernel and lands at /etc/... on the host. Concretely the leaks this closed: * priv.create_dm_snapshot: rootfs/cow paths fed to losetup — losetup follows the symlink and attaches a host block device. * priv.launch_firecracker: kernel/initrd paths hard-linked into the chroot via `ln -f` — link(2) on Linux follows source symlinks, hard-linking host files into the jail. * priv.read_ext4_file / priv.write_ext4_files: image paths fed to debugfs / e2cp as root. * validateLaunchDrivePath: drive paths mknod'd or hard-linked. * validateJailerOpts: chroot base. Fix: after the existing prefix match, walk every component below the matched root with Lstat. Any existing symlink — leaf or intermediate — fails the validator. ENOENT is tolerated because several callers pass paths firecracker/the helper materialise later (sockets, log files, kernel hard-link targets); whoever materialises them goes through the same validation when the helper-side primitive runs. Subsumes most of validateNotSymlink's coverage but the explicit call sites (methodEnsureSocketAccess, methodCleanupJailerChroot) keep their belt-and-braces check — those paths must EXIST and not be symlinks, which validateNotSymlink enforces strictly while the broadened validateManagedPath tolerates ENOENT. Race-free in practice: helper RPCs are short and the validator fires on the same kernel state the next syscall sees. 
The helper loop processes RPCs serially per-connection, and the validator plus the syscall both run as root within microseconds of each other. Four new tests cover symlink leaf, symlink intermediate, missing leaf (must pass), and the plain happy path. Smoke at JOBS=4 still green — every legitimate daemon-supplied path passes the walk. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 15:26:56 -03:00
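The component walk described in this commit might be sketched as follows. This is an illustration under assumptions, not the helper's validateManagedPath: the two-argument signature, the use of filepath.Rel for the prefix match, and the `demo` helper are invented for the example.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

// validateManagedPath checks that path stays under root textually, then walks
// every component below root with Lstat and rejects any existing symlink.
// A missing component is tolerated: nothing below it can exist yet.
func validateManagedPath(root, path string) error {
	root, path = filepath.Clean(root), filepath.Clean(path)
	rel, err := filepath.Rel(root, path)
	if err != nil || rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return fmt.Errorf("%s is outside %s", path, root)
	}
	if rel == "." {
		return nil
	}
	cur := root
	for _, part := range strings.Split(rel, string(filepath.Separator)) {
		cur = filepath.Join(cur, part)
		info, err := os.Lstat(cur)
		if errors.Is(err, fs.ErrNotExist) {
			return nil
		}
		if err != nil {
			return err
		}
		if info.Mode()&fs.ModeSymlink != 0 {
			return fmt.Errorf("%s is a symlink", cur)
		}
	}
	return nil
}

// demo builds a throwaway root with a real directory and a planted symlink,
// then validates a plain path, a path through the symlink, and a missing leaf.
func demo() (plain, viaLink, missing error) {
	root, err := os.MkdirTemp("", "managed-path-sketch-")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(root)
	if err := os.Mkdir(filepath.Join(root, "vms"), 0o755); err != nil {
		panic(err)
	}
	if err := os.Symlink("/etc", filepath.Join(root, "redirect")); err != nil {
		panic(err)
	}
	plain = validateManagedPath(root, filepath.Join(root, "vms"))
	viaLink = validateManagedPath(root, filepath.Join(root, "redirect", "passwd"))
	missing = validateManagedPath(root, filepath.Join(root, "vms", "api.sock"))
	return plain, viaLink, missing
}

func main() {
	fmt.Println(demo())
}
```

Lstat is the important call here: unlike Stat it reports on the link itself, so the walk sees the planted `redirect` entry instead of silently following it.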
Thales Maciel	0a079277ef	imagepull: reject symlink ancestors during OCI flatten safeJoin previously did textual cleaning + dest-prefix check only. That's enough to catch `../escape`, but not the symlink-ancestor attack: a malicious OCI layer plants `etc -> /tmp/probe`, a later layer writes/deletes/hardlinks against `etc/anything`, and the kernel silently dereferences the symlink so the operation lands at `/tmp/probe/anything` on the host. The daemon runs flatten as the owner UID, so anywhere that UID can write becomes a write target; anywhere it can delete (e.g. its own home) becomes a delete target. Whiteouts and hardlinks make this worse — a whiteout for `etc/.wh.victim` would `RemoveAll` the host file `/tmp/probe/victim`, and a TypeLink would expose host files inside the extracted rootfs. safeJoin now Lstat-walks every intermediate component of the joined path against the already-extracted tree, refusing if any ancestor is a symlink. Walking is race-free against the extraction loop because we process tar entries serially. Leaf components stay caller-owned (TypeSymlink writes legitimately want a symlink leaf; TypeReg RemoveAll's any prior leaf before opening; etc.). Three new tests pin the protection: write through a symlinked ancestor, whiteout through a symlinked ancestor, and hardlink target through a symlinked ancestor — each must fail and leave the host probe path untouched. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 15:20:46 -03:00
Thales Maciel	8bfa525568	test: cover imagemgr + dmsnap helpers Both packages had zero tests before this change. The helpers in them are pure (imagemgr) or scripted-runner-friendly (dmsnap), so they're cheap to pin and worth catching regressions on. imagemgr/paths_test.go: * DebianBasePackages returns a defensive copy (mutating the result can't poison subsequent calls — important because hashPackages digests this list). * BuildMetadataPackages stays in lockstep with DebianBasePackages. * hashPackages is order-sensitive and includes a trailing newline in its canonical join (regression guard for any future "sort the list before hashing" temptation that would invalidate every on-disk hash). * StageOptionalArtifactPath returns "" for empty/whitespace input and joins by name otherwise. * WritePackagesMetadata writes <rootfs>.packages.sha256 with the expected hash, no-ops on empty rootfs path or empty package list. * DebianBasePackages contains the small critical-package floor (ca-certificates, curl, git) so a future apt-list trim can't silently drop them. dmsnap/dmsnap_test.go: * Create runs losetup base, losetup cow, blockdev getsz, dmsetup create in that order, with a snapshot table referencing the loops in (base, cow) order — a swap would corrupt every VM. * Create's failure path unwinds with losetup -d on cow then base. * Cleanup tears down dmsetup before losetup (otherwise dmsetup sees EBUSY against vanished backing devices). * Cleanup falls back to DMDev when DMName is empty. * Cleanup tolerates "No such device" on losetup -d (idempotent re-run after a partial cleanup). * Cleanup surfaces non-missing losetup errors (the tolerance is narrow on purpose). * Remove returns nil on a missing target and surfaces non-retryable errors immediately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 15:13:49 -03:00
Thales Maciel	6b4e1922b0	model: gofmt VMRecord struct alignment Stats and Workspace fields landed in `6b543cb` with column alignment that gofmt wants to pull tighter; rerun gofmt so the new pre-commit hook's `gofmt -l` gate passes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 15:08:12 -03:00
Thales Maciel	3e6d0cee89	doctor: surface security-posture drift in `banger doctor` `docs/privileges.md` now documents what the install promises (helper + daemon services active, sockets at 0600 ownerUID, units carrying the hardening directives, firecracker root-owned + non-writable). Doctor verifies the running install matches: drift between the doc and the filesystem would silently weaken the trust model otherwise. In system mode (install.toml present): * helper service / owner daemon service: `systemctl is-active`. * helper socket / daemon socket: stat-and-compare mode + uid against the registered owner. * helper unit hardening / daemon unit hardening: scan the rendered unit for NoNewPrivileges, ProtectSystem=strict, ProtectHome (=yes for the helper, =read-only for the daemon), RestrictSUIDSGID, LockPersonality, and the helper's CapabilityBoundingSet line. The daemon unit also pins User=<registered owner>. * firecracker binary ownership: regular file, not a symlink, mode not group/world writable, executable, owned by uid 0 — same constraints validateRootExecutable enforces at launch, surfaced once at doctor time so a misconfigured binary fails fast with a clearer error than the helper's open-time rejection. In non-system mode (no /etc/banger/install.toml) doctor emits a single WARN row pointing at docs/privileges.md > 'Running outside the system install'. A PASS would imply guarantees the install isn't actually providing. Tests cover both branches: the non-system warn pins its message substrings; system-mode pins that every check name shows up; and the helpers (socket-perms, unit-hardening, executable-ownership) have direct table-style negative tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 14:58:34 -03:00
Thales Maciel	853249dec2	roothelper: tighten input validation across privileged RPCs Defence-in-depth pass over every helper method that touches the host as root. Each fix narrows what a compromised owner-uid daemon could ask the helper to do; many close concrete file-ownership and DoS primitives that the previous validators didn't reach. Path / identifier validation: * priv.fsck_snapshot now requires /dev/mapper/fc-rootfs-* (was "is the string non-empty"). e2fsck -fy on /dev/sda1 was the motivating exploit. * priv.kill_process and priv.signal_process now read /proc/<pid>/cmdline and require a "firecracker" substring before sending the signal. Killing arbitrary host PIDs (sshd, init, …) is no longer a one-RPC primitive. * priv.read_ext4_file and priv.write_ext4_files now require the image path to live under StateDir or be /dev/mapper/fc-rootfs-*. * priv.cleanup_dm_snapshot validates every non-empty Handles field: DM name fc-rootfs-*, DM device /dev/mapper/fc-rootfs-*, loops /dev/loopN. * priv.remove_dm_snapshot accepts only fc-rootfs-* names or /dev/mapper/fc-rootfs-* paths. * priv.ensure_nat now requires a parsable IPv4 address and a banger-prefixed tap. * priv.sync_resolver_routing and priv.clear_resolver_routing now require a Linux iface-name-shaped bridge name (1–15 chars, no whitespace/'/'/':') and, for sync, a parsable resolver address. Symlink defence: * priv.ensure_socket_access now validates the socket path is under RuntimeDir and not a symlink. The fcproc layer's chown/chmod moves to unix.Open(O_PATH|O_NOFOLLOW) + Fchownat(AT_EMPTY_PATH) + Fchmodat via /proc/self/fd, so even a swap of the leaf into a symlink between validation and the syscall is refused. The local-priv (non-root) fallback uses `chown -h`. * priv.cleanup_jailer_chroot rejects symlinks at both the leaf (os.Lstat) and intermediate path components (filepath.EvalSymlinks + clean-equality). 
The umount sweep was rewritten from shell `umount --recursive --lazy` to direct unix.Unmount(MNT_DETACH | UMOUNT_NOFOLLOW) per child mount, deepest-first; the findmnt guard remains as the rm-rf safety net. Local-priv mode falls back to `sudo umount --lazy`. Binary validation: * validateRootExecutable now opens with O_PATH|O_NOFOLLOW and Fstats through the resulting fd. Rejects path-level symlinks and narrows the TOCTOU window between validation and the SDK's exec to fork+exec time on a healthy host. Daemon socket: * The owner daemon now reads SO_PEERCRED on every accepted connection and refuses any UID that isn't 0 or the registered owner. Filesystem perms (0600 + ownerUID) already enforced this; the check is belt-and-braces in case the socket FD is ever leaked to a non-owner process. Docs: * docs/privileges.md walked end-to-end. Each helper RPC's Validation gate row reflects what the code actually enforces. New section "Running outside the system install" calls out the looser dev-mode trust model (NOPASSWD sudoers, helper hardening bypassed) so users don't deploy that path on shared hosts. Trust list updated to include every new validator. Tests added: validators (DM-loop, DM-remove-target, DM-handles, ext4-image-path, iface-name, IPv4, resolver-addr, not-symlink, firecracker-PID, root-executable variants), the daemon's authorize path (non-unix conn rejection + unix conn happy path), the umount2 ordering contract (deepest-first + --lazy on the sudo branch), and positive/negative cases for the chown-no-follow fallback. Verified end-to-end via `make smoke JOBS=4` on a KVM host. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 14:39:41 -03:00
Thales Maciel	6b543cb17f	firecracker: adopt firecracker-jailer for VM launch (Phase B) Each VM's firecracker now runs inside a per-VM chroot dropped to the registered owner UID via firecracker-jailer. Closes the broad ambient-sudo escalation surface that survived Phase A: the helper still needs caps for tap/bridge/dm/loop/iptables, but the VMM itself no longer runs as root in the host root filesystem. The host helper stages each chroot up front: hard-links the kernel and (optional) initrd, mknods block-device drives + /dev/vhost-vsock, copies in the firecracker binary (jailer opens it O_RDWR so a ro bind fails with EROFS), and bind-mounts /usr/lib + /lib trees read-only so the dynamic linker can resolve. Self-binds the chroot first so the findmnt-guarded cleanup can recurse safely. AF_UNIX sun_path is 108 bytes; the chroot path easily blows past that. Daemon-side launch pre-symlinks the short request socket path to the long chroot socket before Machine.Start so the SDK's poll/connect sees the short path while the kernel resolves to the chroot socket. --new-pid-ns is intentionally disabled — jailer's PID-namespace fork makes the SDK see the parent exit and tear the API socket down too early. CapabilityBoundingSet for the helper expands to add CAP_FOWNER, CAP_KILL, CAP_MKNOD, CAP_SETGID, CAP_SETUID, CAP_SYS_CHROOT alongside the existing CAP_CHOWN/CAP_DAC_OVERRIDE/CAP_NET_ADMIN/CAP_NET_RAW/CAP_SYS_ADMIN. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-28 14:38:07 -03:00
Thales Maciel	d73efe6fbc	firecracker: drop sudo sh -c, race chown against SDK probe in Go Replace the shell-string launcher in buildProcessRunner with a direct exec.Command. The previous sh -c wrapper relied on shellQuote escaping for every MachineConfig field that flowed into the launch script; any future field that ever carried an attacker-controlled value would have become RCE-as-root. The new path passes binary path and flags as separate argv entries, so there is no shell to interpret anything. The wrapper also did two things the shell can no longer do for us: 1. umask 077 — moved to syscall.Umask in cmd/bangerd/main.go so every firecracker child (and any other file the daemon creates) inherits 0600 by default. Single-user dev sandbox state should be private. 2. chown_watcher — the SDK's HTTP probe inside Machine.Start connects to the API socket the moment it appears. Under sudo the socket is created root-owned and the daemon's connect(2) gets EACCES, so the post-Start EnsureSocketAccess never runs. The shell papered over this with a backgrounded chown loop. Replaced by fcproc.EnsureSocketAccessForAsync: same race-window guarantee, in pure Go, kicked off in LaunchFirecracker right before Start and awaited right after. Tests updated: shell-substring assertions replaced with cmd-arg assertions, plus a new fcproc test pinning the async chown sequence. Smoke (full systemd two-service install + KVM scenarios) passes.	2026-04-27 20:14:01 -03:00
Thales Maciel	c4e1cb5953	daemon: tighten concurrency around pulls, cleanup, and handle persistence Four targeted fixes from a race-condition audit of the daemon package. None change behaviour on the happy path; each closes a window where a concurrent or interrupted RPC could strand state on the host. - KernelDelete now holds the same per-name lock as KernelPull / readOrAutoPullKernel. Without it, a delete racing a concurrent pull could remove files mid-write or land between the pull's manifest write and its first use. - cleanupRuntime no longer early-returns on an inner waitForExit failure; DM snapshot, capability, and tap teardown always run and every error is folded into the returned errors.Join. EBUSY against a still-alive firecracker is benign and surfaces in the joined error rather than stranding kernel state across daemon restarts. - Per-name image / kernel pull locks switch from *sync.Mutex to a 1-buffered chan struct{}. Acquire is a select on ctx.Done(), so a peer waiting behind a pull whose RPC was cancelled can bail out instead of blocking forever on a pull nobody is consuming. - setVMHandles writes the per-VM scratch file before updating the in-memory cache. A daemon crash between the two now leaves disk ahead of memory (recoverable: reconcile re-seeds the cache from the file on next start) rather than memory ahead of disk (lost handles → stranded DM/loops/tap). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 19:32:43 -03:00
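The switch from *sync.Mutex to a 1-buffered channel, as described in this commit, can be sketched like this. The `nameLocks` type and the `acquireWithin` demo helper are hypothetical; the commit only states that acquire is a select on ctx.Done().

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// nameLocks hands out one 1-buffered channel per name. Holding the lock means
// having put a token into that channel.
type nameLocks struct {
	mu    sync.Mutex
	locks map[string]chan struct{}
}

func (l *nameLocks) get(name string) chan struct{} {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.locks == nil {
		l.locks = make(map[string]chan struct{})
	}
	ch, ok := l.locks[name]
	if !ok {
		ch = make(chan struct{}, 1)
		l.locks[name] = ch
	}
	return ch
}

// Acquire blocks until the per-name lock is free or ctx is done. On success it
// returns a release function; on cancellation it returns ctx.Err().
func (l *nameLocks) Acquire(ctx context.Context, name string) (func(), error) {
	ch := l.get(name)
	select {
	case ch <- struct{}{}:
		return func() { <-ch }, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

// acquireWithin tries to take the lock, giving up after ms milliseconds
// (demo helper).
func acquireWithin(l *nameLocks, name string, ms int) (func(), error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(ms)*time.Millisecond)
	defer cancel()
	return l.Acquire(ctx, name)
}

func main() {
	var l nameLocks
	release, err := acquireWithin(&l, "kernel-a", 50)
	fmt.Println("first:", err)
	_, err = acquireWithin(&l, "kernel-a", 50)
	fmt.Println("second while held:", err)
	release()
	_, err = acquireWithin(&l, "kernel-a", 50)
	fmt.Println("third after release:", err)
}
```

A plain mutex has no way to abandon a pending Lock call, which is why a waiter behind a cancelled pull could previously block forever; the channel form lets the waiter's own context end the wait.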
Thales Maciel	72882e45d7	daemon: serialise concurrent image/kernel pulls + atomic-rename seed refresh Three concurrency bugs surfaced by `make smoke JOBS=4` that all stem from `vm.create` paths assuming single-caller semantics: 1. Kernel auto-pull manifest race. Parallel `vm.create` calls that each need to auto-pull the same kernel ref both run kernelcat.Fetch in parallel against the same /var/lib/banger/kernels/<name>/. Fetch writes manifest.json non-atomically (truncate + write); the peer reads it back mid-write and trips "parse manifest for X: unexpected end of JSON input". Fix: per-name `sync.Mutex` map on `ImageService` (kernelPullLock). `KernelPull` and `readOrAutoPullKernel` both acquire it and re-check `kernelcat.ReadLocal` after the lock so a peer who finished while we waited is treated as success — `readOrAutoPullKernel` does NOT call `s.KernelPull` because that path errors with "already pulled" on a peer-success, which would be wrong for auto-pull. Different kernels stay parallel. 2. Image auto-pull race. Same shape as the kernel race but on the image side: parallel `vm.create` calls both run pullFromBundle / pullFromOCI for the missing image (each ~minutes of OCI fetch + ext4 build). The publishImage atom under imageOpsMu only protects the rename + UpsertImage commit, so the loser does all the work only to fail at the recheck with "image already exists". Fix: per-name `sync.Mutex` map on `ImageService` (imagePullLock). `findOrAutoPullImage` acquires it, re-checks FindImage, and only then calls PullImage. Loser short-circuits with the freshly-published image instead of redoing minutes of work. PullImage's own publishImage recheck stays as defense-in-depth for callers that bypass the auto-pull path. 3. Work-seed refresh race. When the host's SSH key has rotated since an image was last refreshed, `ensureAuthorizedKeyOnWorkDisk` triggers `refreshManagedWorkSeedFingerprint`, which rewrote the shared work-seed.ext4 in place via e2rm + e2cp. 
Peer `vm.create` calls doing parallel `MaterializeWorkDisk` rdumps observed a torn ext4 image — "Superblock checksum does not match superblock". Fix: stage the rewrite on a sibling tmpfile (`<seed>.refresh.<pid>-<ns>.tmp`) and atomic-rename. Concurrent readers either have the file open (kernel keeps the pre-rename inode alive) or open after the rename (see the new inode) — never observe a partial state. Two parallel refreshes are idempotent (same daemon, same SSH key) so unique tmp names are enough; whichever rename lands last wins, with identical content. UpsertImage runs after the rename so the recorded fingerprint always matches what's on disk. Plus one smoke harness fix: reclassify `vm_prune` from `pure` to `global`. `vm prune -f` removes ALL stopped VMs system-wide, not just the ones the scenario created — so a parallel peer scenario that happens to have its VM in `created`/`stopped` momentarily gets wiped. Moving prune to the post-pool serial phase keeps it from racing with in-flight scenarios. After all four fixes, `make smoke JOBS=4` passes 21/21 in 174s (serial baseline 141s; the small overhead is the buffered-output and `wait -n` semaphore cost — well worth the parallelism for fast-iter work on a 32-core box). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 17:24:11 -03:00
Thales Maciel	c9358ab390	daemon: sync guest over ssh before stop to preserve workspace writes VM stop has been quietly losing data freshly written via `vm workspace prepare`: stop+start of a workspace-prepared VM would come back with /root/repo wiped on the work disk. Root cause is firecracker + Debian's systemd defaults. FC's SendCtrlAltDel (the only "graceful shutdown" action FC exposes) just delivers the keystroke; what the guest does with it is its choice. Debian routes ctrl-alt-del.target -> reboot.target, so the guest reboots, FC stays alive, the daemon's 10s wait_for_exit window expires, and the SIGKILL fallback drops anything still in FC's userspace I/O path. For an idle VM that's invisible. For one that just took 100s of small writes through a workspace prepare, it's data loss. Fix is to dial the guest over SSH inside StopVM and run `sync; systemctl --no-block poweroff || /sbin/poweroff -f &` before the existing SendCtrlAltDel path. The synchronous `sync` is the load-bearing piece — it blocks until every dirty page hits virtio-blk and lands in the on-host root.ext4. Whether poweroff completes before SIGKILL fires is incidental; sync has already run. SSH unreachable falls back to the old SendCtrlAltDel behaviour so a broken-network guest can't make stop hang. Bounded by a 5s SSH-dial timeout so a half-broken guest can't extend the overall stop window past gracefulShutdownWait. Also adds two smoke scenarios: - `workspace + stop/start`: prepare -> stop -> start -> assert marker survives. This is the regression that caught the bug. - `vm exec`: end-to-end coverage for `d59425a` — auto-cd into the prepared workspace, exit-code propagation, dirty-host warning, --auto-prepare resync, refusal on stopped VM. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 15:41:32 -03:00
Thales Maciel	d59425adb9	feat(vm): add vm exec command with workspace dirty detection Introduces three interconnected features for persistent VM workflows: 1. `banger vm exec <vm> -- <cmd>`: runs a command in the prepared workspace, automatically cd-ing into the guest path and wrapping via `mise exec --` so mise-managed tools are on PATH. Falls back to a plain exec when mise isn't available. Exit code propagates verbatim. 2. Workspace persistence: workspace.prepare now stores the guest path, host source path, and HEAD commit into a new `workspace_json` column on the vms table (migration 3). This state survives daemon restarts and informs both dirty-checking and auto-prepare. 3. Dirty detection: `vm exec` compares the stored HEAD commit against the current host repo HEAD. When stale it warns and, with --auto-prepare, re-syncs the workspace before running. Also: - WORKSPACE column added to `banger ps` / `vm list` - `banger vm` quick reference updated with `vm exec` entry	2026-04-26 23:53:45 -03:00
Thales Maciel	c8637b0fe4	daemon: auto-trust mise configs on workspace prepare vm run ./repo (and the explicit vm workspace prepare) imports the host user's own checkout. Any .mise.toml that lands in the guest would otherwise prompt on the first guest command — 'mise trust: hash mismatch, run "mise trust"' — and stall what should be a zero-friction sandbox launch. The repo just came from the host, the guest is single-tenant root@<vm>.vm, the user already trusts this checkout: auto-trust is the right default here. After workspaceImportHook succeeds, run `if command -v mise >/dev/null 2>&1; then mise trust --quiet --all <guest_path> || true; fi` inside the guest. Best effort: a missing mise binary, a non-zero exit, or a no-op trust all log at debug only and never fail prepare. The path is shell-quoted via ws.ShellQuote so guest paths with spaces or quotes don't break the argument. Tests pin the script shape (command -v guard + --quiet --all flag + trailing `|| true`) and assert the script actually fires after a successful import. A path with an apostrophe round-trips via ws.ShellQuote without truncation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 23:08:41 -03:00
Thales Maciel	fa4292756d	daemon: surface previously-swallowed errors at warn Three recovery-path errors were silently dropped: - vm_lifecycle.go startVMLocked persisted the VMStateError record with `_ = s.store.UpsertVM(...)`. If the persist failed the user saw the original start error but operators had no way to find out the store had also drifted out of sync. - vm_lifecycle.go deleteVMLocked killed the firecracker process with `_ = s.net.killVMProcess(...)`. cleanupRuntime tears it down regardless, so the explicit kill is best-effort, but a permission-denied / EPERM was still worth logging. - capabilities.go cleanupPreparedCapabilities collected per-cap errors with errors.Join. Callers get the aggregated value but couldn't tell which capability failed when more than one did. All three now log Warn before the original behaviour continues. The aggregate return value, control flow, and user-visible error strings are unchanged — this is purely a "less silence in the journal" pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 22:30:51 -03:00
Thales Maciel	71a332a6a1	cli: maturity polish — color, error translation, tabwriter consistency Adds three small but high-leverage presentation tweaks for v0.1: 1. internal/cli/style is a new ~70 LOC package with Pass/Fail/Warn/Dim/Bold helpers. Each is TTY-gated and obeys NO_COLOR. No external dep. Wired into the doctor PASS/FAIL/WARN status, the "banger:" error prefix on stderr, and the dim 'ready in <elapsed>' line. 2. internal/cli/errors translates rpc.ErrorResponse into user-facing text. operation_failed becomes invisible (the message wins); not_found, already_exists, bad_request, bad_version, unauthorized, unknown_method get short labels; unknown codes pass through. The daemon-attached op_id lands in dim parens — paste into journalctl --grep to find the daemon log line that produced the failure. 3. Tabwriter config converges on (0, 8, 2, ' ', 0) across every list/table command. The vm prune confirmation table picked up the right config; system install + system status switched from bare "key: value\n" lines to tabular form. printVMSpecLine drops its Unicode middle dot for an ASCII '|' so terminals without UTF-8 render cleanly. Tests cover translateRPCError for every code, style helpers no-op on non-TTY and under NO_COLOR. Smoke status greps switch from "key: value" to "key value" to match the new format. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 22:27:07 -03:00
Thales Maciel	e47b8146dc	daemon: thread per-RPC op_id end-to-end Today there's no way to correlate a CLI failure with a daemon log line. operationLog records relative timing but no id, two concurrent vm.start calls log indistinguishably, and the async vmCreateOperationState.ID is user-facing yet never reaches the journal. The root helper logs plain text to stderr while bangerd logs JSON, so a merged journalctl is hard to grep across the trust-boundary split. Mint a per-RPC op id at dispatch entry, store it on context, and include it as an "op_id" attr on every operationLog record. The id is stamped onto every error response (including the early short-circuit paths bad_version and unknown_method). rpc.Call forwards the context op id on requests so a daemon RPC and the helper RPCs it triggers all share one id. The helper now logs JSON to match bangerd, adopts the inbound id, and emits a single "helper rpc completed" / "helper rpc failed" line per call so operators can see at a glance how long each privileged op took. vmCreateOperationState.ID is now the same id dispatch generated for vm.create.begin — one identifier between client status polls, daemon logs, and helper logs. The wire format gains two optional fields: rpc.Request.OpID and rpc.ErrorResponse.OpID, both omitempty so older peers (and the opposite direction) ignore them. ErrorResponse.Error() now appends "(op-XXXXXX)" to its string form when set; existing callers that just print err.Error() get the id for free. Tests cover: dispatch stamps op_id on unknown_method, bad_version, and handler-returned errors; rpc.Call exposes the typed *ErrorResponse via errors.As so the CLI can read code/op_id; ctx op_id is forwarded to the server in the request envelope. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 22:13:44 -03:00


243 commits