banger

Author	SHA1	Message	Date
Thales Maciel	235758e5b2	workspace: drop --readonly flag — advisory only against root guests --readonly ran `chmod -R a-w` over the workspace after copying, but every banger guest boots as root, and root bypasses DAC mode checks. So a user running `vm workspace prepare ... --readonly` got the mode bits set to 0444 but `echo x >> file` in the guest still succeeded. The flag promised enforcement it couldn't deliver. The feature also doesn't match the product model: workspaces are prepared precisely so the guest CAN edit them, and `workspace export` exists to pull those edits back as a patch. A "read-only workspace" contradicts that loop. Removed: - CLI flag `--readonly` on `vm workspace prepare` - api.VMWorkspacePrepareParams.ReadOnly field - model.WorkspacePrepareResult.ReadOnly field - daemon chmod dispatch in prepareVMWorkspaceGuestIO - smoke scenario pinning the (advisory) mode-bit behavior - misleading "exportbox-readonly" VM name in an unrelated export test (the test is about not mutating the real git index; renamed to exportbox-noindex-mutation) If real enforcement becomes a user need later, the right primitive is `chattr +i` (immutable bit — root CAN'T write) or a ro bind-mount. Reintroducing a new flag is cheaper than debugging what the current one actually guarantees. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 13:04:33 -03:00
Thales Maciel	bafe816fc7	smoke: cover the gaps — NAT, vm ports/restart/kill/prune, workspace variants, ssh-config Audit of banger's advertised CLI surface vs. what smoke was exercising turned up several gaps where a regression would have shipped silently. New scenarios: - NAT: asserts the per-VM POSTROUTING MASQUERADE rule is installed with --nat (scoped to the guest /32), idempotent across stop/start, and torn down on delete. End-to-end curl tests don't work here because the bridge IP and uplink IP both belong to the host — a guest reaching the uplink lands on host-local loopback whether MASQUERADE is set up or not — so the test pins the iptables rule itself. Skipped if passwordless `sudo iptables` isn't available. - vm ports: sshd :22 must be visible with the <name>.vm endpoint (not localhost, not the raw guest IP — the daemon prefers the DNS record when one exists). - vm restart: dedicated verb, not a stop+start alias. Asserts a fresh boot_id to prove the kernel actually recycled. - vm kill --signal KILL: forceful termination path (distinct from `vm stop`'s graceful Ctrl-Alt-Del). Post-kill state must be 'stopped' (not 'error') and the dm-snapshot must be cleaned up. - vm prune -f: batch delete of non-running VMs while preserving any that are still running. Regression guard for the case where prune could wipe a live session. - workspace prepare --readonly: mode bits on /root/repo/<file> must drop all write bits. Enforcement is advisory against a root guest, so the test asserts the bits, not EACCES. - workspace prepare --mode full_copy: alternate transfer path (tarred into rootfs, no overlay) still lands the repo contents at /root/repo. - workspace export --base-commit: guest-side commits captured in the patch when the pre-commit SHA is pinned. The feature's whole reason for existing; it had zero coverage. Includes a control assertion that the plain (no --base-commit) export does NOT see the committed file. - ssh-config --install / --uninstall: HOME-isolated to a smoke tempdir so we don't touch the invoking user's ~/.ssh/config. Seeds a pre-existing config to catch any regression where install clobbers instead of appending. Asserts idempotency (second install doesn't duplicate the Include line) and clean round-trip (uninstall leaves the user's own content intact). Coverage deltas from smoke (vs the last run): internal/hostnat 14.1% → 64.1% (+50pp — NAT rule dance) internal/daemon/opstate 56.2% → 87.5% (+31pp) internal/daemon 43.4% → 49.4% (+6pp) internal/cli 36.1% → 40.4% (+4pp) internal/daemon/workspace 64.1% → 67.5% (+3pp) Scenario count: 12 → 21. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 12:50:39 -03:00
Thales Maciel	b4afe13b2a	daemon: fix vm start (on a stopped VM) + regression coverage Two defects compounded to make `vm create X` → `vm stop X` → `vm start X` → `vm ssh X` fail with `not_running: vm X is not running` even though `vm show` reports `state=running`. 1. firecracker-go-sdk's startVMM spawns a goroutine that SIGTERMs firecracker when the ctx passed to Machine.Start cancels — and retains that ctx for the lifetime of the VMM, not just the boot phase. Our Machine.Start wrapper was plumbing the caller's ctx through, which on `vm.start` is the RPC request ctx. daemon.go's handleConn cancels reqCtx via `defer cancel()` right after writing the response. Net effect: firecracker is killed ~150ms after the `vm start` RPC "completes", invisibly, and the next `vm ssh` sees a dead PID. `vm.create` side-stepped the bug because BeginVMCreate detaches to context.Background() before calling startVMLocked; `vm.start` used the RPC ctx directly. Fix: Machine.Start now passes context.Background() to the SDK. We own firecracker lifecycle explicitly (StopVM / KillVM / cleanupRuntime), so ctx-driven cancellation here was never actually wired into anything useful. 2. With (1) fixed, the same scenario exposed a second defect: patchRootOverlay's e2cp/e2rm refuses to touch the dm-snapshot with "Inode bitmap checksum does not match bitmap" on a restart, because the COW holds stale free-block/free-inode counters from the previous guest boot. Kernel ext4 is fine with this; e2fsprogs is not. Fix: run `e2fsck -fy` on the snapshot between the dm_snapshot and patch_root_overlay stages. Idempotent on a fresh snapshot, reconciles the bitmaps on a reused COW. Regression coverage: - scripts/repro-restart-bug.sh — minimal create→stop→start→ssh reproducer with rich on-failure diagnostics (daemon log trace, firecracker.log tail, handles.json, pgrep-by-apiSock, apiSock stat). Exits non-zero if the bug returns. - scripts/smoke.sh — lifecycle scenario (create/ssh/stop/start/ ssh/delete) and vm-set scenario (--vcpu 2 → stop → set --vcpu 4 → start → assert nproc=4). Both were pulled when the bug was first found; now restored. Supporting: - internal/system/system.ExitCode — extracts exec.ExitError's code without forcing callers to import os/exec. Needed by the e2fsck caller (policy test pins os/exec to the shell-out packages). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 12:01:46 -03:00
Thales Maciel	e94e7c4dcc	smoke: workspace export scenario + smoke-fresh target + fix the export bug it caught The export round-trip (`vm create` → `workspace prepare` → guest edit → `workspace export`) exposed a reproducible failure on Debian bookworm guests: `git read-tree HEAD --index-output=/tmp/...` returns exit 128 "unable to write new index file" when the target lives on tmpfs while `.git` is on the workspace overlay. Move the temp index into `$(git rev-parse --git-dir)` so it shares a filesystem with `.git/index` and the lockfile + rename + hardlink dance git does internally works. Alongside: - new workspace-export smoke scenario that would have caught this at the boundary between daemon and guest git - `make smoke-fresh` = `smoke-clean && smoke` for release-time runs that want first-install paths (migrations, image pull) stamped into the coverage report Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 11:34:55 -03:00
Thales Maciel	672d7151e9	smoke: five more scenarios + fix exit-code propagation bug the new ones caught Five new smoke scenarios layered on top of the existing bare + workspace vm-run tests: - exit-code propagation: `sh -c 'exit 42'` must rc=42 - workspace dry-run: --dry-run lists tracked files without a VM - workspace --include-untracked: opt-in ships files outside the git index (regression guard on the security-default flip from review 4) - concurrent vm runs: two --rm invocations in parallel both succeed (stresses per-VM locks, createVMMu reservation window, tap pool) - invalid spec rejection: --vcpu 0 must fail with no VM row left behind (the "cleanup on partial failure" path the review flagged) The exit-code scenario caught a real bug on first run: `banger vm run --rm -- sh -c 'exit 42'` returned rc=0, not 42. Root cause in internal/cli/ssh.go's sshCommandArgs: extra args were appended to the ssh argv verbatim, relying on ssh(1)'s implicit space-join to deliver the remote command. That works for single tokens (echo hello) but re-tokenises multi-word commands on the remote side: `ssh host sh -c 'exit 42'` becomes remote `sh -c exit 42`, where `42` is $0 for the already-completed `exit`, and the exit code the user asked for is lost. Fix: shell-quote every element of extra (`'sh'` `'-c'` `'exit 42'`) and join them into a single trailing argv entry. ssh's space-join then produces exactly the command the user typed on the remote shell. TestSSHCommandArgs was updated to pin the quoting; the existing TestRunVMRunCommandModePropagatesExitCode test needed a one-word quote tweak (`false` → `'false'`). Smoke run after fix passes all seven scenarios in ~2 min on warm state. cmd/banger coverage jumped to 100% (the invalid-spec scenario hits the error-reporting path that wasn't covered before). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 19:37:07 -03:00
Thales Maciel	5f81332b0a	make smoke: end-to-end boot suite with coverage from real VM runs The unit + integration tests can't cross machine.Start — the SDK boundary would need a fake firecracker that reimplements the control-plane HTTP API, and the ongoing maintenance cost of keeping that fake honest with upstream kills the value. Instead, add a pre-release smoke target that drives REAL Firecracker + real KVM, captures coverage from the -cover-instrumented binaries, and surfaces per-package deltas so regressions in the boot path don't ship silently. scripts/smoke.sh: - Isolated XDG_{CONFIG,STATE,CACHE,RUNTIME} so the smoke run can't touch real user state (state/cache persist under build/smoke/xdg for fast reruns; runtime is mktemp'd fresh per-run because sockets can't be reused) - Preflight: `banger doctor` must pass; UDP :42069 must be free (otherwise the user's real daemon is up and the smoke daemon can't bind its DNS listener — fail with an actionable message) - Scenario 1 — bare: `banger vm run --rm -- echo smoke-bare-ok` exercises create → start → socket ownership chown → machine.Start → SDK waitForSocket race → vsock agent readiness → guest SSH wait → exec → cleanup → delete - Scenario 2 — workspace: creates a throwaway git repo, runs `banger vm run --rm <repo> -- cat /root/repo/smoke-file.txt`, verifies the tracked file reached the guest (exercises workDisk capability PrepareHost + workspace.prepare) - `banger daemon stop` at the end so instrumented binaries flush GOCOVERDIR pods before the script exits Makefile additions: - smoke-build: builds banger/bangerd under build/smoke/bin/ with `go build -cover` - smoke: runs the script with GOCOVERDIR set, reports per-package coverage via `go tool covdata percent` - smoke-coverage-html: textfmt + go tool cover for a browsable report - smoke-clean: nukes build/smoke/ including the persisted XDG state Bonus fix uncovered during the first smoke run: doctor treated a missing state.db as a FAIL ("out of memory" from SQLite SQLITE_CANTOPEN), which red-flagged every fresh install. Split the store check: DB file absent → PASS with "will be created on first daemon start" detail; DB present but unreadable → FAIL as before. New TestDoctorReport_StoreMissingSurfacesAsPassForFreshInstall pins the behaviour. Concrete coverage delta from the first successful smoke run (compared to `make coverage-total`'s unit-test-only 37.8%): internal/firecracker 43.6% → 75.0% internal/daemon/workspace 33.8% → 60.8% internal/store 40.1% → 56.3% internal/guest 63.7% → 57.4% (different mix: smoke exercises real SSH; unit tests cover more error branches) The packages the review flagged are the ones that moved most — which is the point. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 18:59:57 -03:00

6 commits