banger

Author	SHA1	Message	Date
Thales Maciel	366e1560c9	daemon: replace RPC switch with generic method-to-handler table The dispatch method was a single ~240-line switch of 34 cases, each following the same pattern: decode params into some type P, call a service method returning (R, error), wrap R in a result struct and either marshalResultOrError-encode or return a raw rpc.NewError. Adding a method was a 4-line ceremony per site, and grepping for "methods banger speaks" meant reading the full switch. New shape, in internal/daemon/dispatch.go: - handler is the uniform `func(ctx, d, req) rpc.Response` type every method dispatches through. - paramHandler[P, R] is the generic wrapper that absorbs 28 of the 34 cases (decode, call, marshal). No reflection — P and R are deduced from the service-call literal, so each map entry is a one-liner referencing a small adapter func. - noParamHandler[R] is the decode-free variant for 6 methods that don't carry params. - rpcHandlers is the single source of truth for which methods exist and which adapter they dispatch to. - Four specials (ping, shutdown, vm.logs, vm.ssh) stay as named `handler`-typed functions: ping/shutdown encode with raw rpc.NewResult, vm.logs/vm.ssh need pre-service validation to emit distinct error codes (not_found, not_running) that the generic wrapper maps uniformly to operation_failed. Daemon.dispatch shrinks from a 240-line switch to 11 lines: version check, test-only handler short-circuit, table lookup, invoke-or-unknown. Tests: - TestRPCHandlersMatchDocumentedMethods — keyset guard. Adding or removing a method without updating the expected slice is a red flag the test surfaces. - TestRPCHandlersAllNonNil — catches nil-function registrations. All pre-existing dispatch tests (param decode, error codes, etc.) keep passing unchanged — the handler contract for any given method is byte-identical from the RPC client's perspective. Smoke (all 21 scenarios) exercises every code path end-to-end. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 15:40:08 -03:00
Thales Maciel	11a33604c0	daemon: extract startVMLocked into step runner with per-step rollback startVMLocked was a ~260-line method running 18 sequential phases with one lumped error path: on any failure, cleanupOnErr called cleanupRuntime — a catch-all teardown that didn't distinguish "this phase acquired resources we should undo" from "this phase is idempotent." The blast radius was the entire VM lifecycle. Every tweak to boot, NAT, disk, or auth-sync orchestration had to reason about a closure that could fire at any of 18 points. This commit extracts the phases into a data-driven pipeline: - startContext threads the mutable state (vm, live, apiSock, dmName, tapName, etc.) through every step by pointer so step bodies mutate in place without returning copies. - startStep carries the op.stage name, optional vmCreateStage progress ping, optional log attrs, a run closure, and an optional undo closure. - runStartSteps walks steps in order, appends the failing step to the rollback set (so partial-acquire failures like machine.Start's post-spawn HTTP config get their undo fired), then iterates the rollback set in reverse and joins errors via errors.Join. Each phase that acquires a resource now owns its own undo: system_overlay removes a file it created, dm_snapshot cleans up the loop + DM handles it set, prepare_host_features delegates to capHooks.cleanupState, tap releases via releaseTap, metrics_file removes the file, firecracker_launch kills the spawned PID and drops the sockets, post_start_features calls capHooks.cleanupState again (capability Cleanup hooks are idempotent — safe to call whether PostStart reached every cap or not). The 11 phases with no teardown obligation leave `undo` nil and the driver silently skips them on rollback. cleanupRuntime is retired from the start-failure path. It stays intact for reconcile, stopVMLocked, killVMLocked, deleteVMLocked, stopStaleVMs — the crash-recovery / lifecycle-teardown contract those paths rely on is unchanged. startVMLocked shrinks from ~225 lines of sequential-phase code plus a cleanupOnErr closure to ~45 lines: compute derived paths, build the step list, drive it, persist ERROR state on failure. Stage names preserved 1:1 so existing log grep + the async-create progress stream stay compatible. Tests: - TestRunStartSteps_RollsBackInReverseOnFailure — the contract is pinned: succeeded-before-failing run, all their undos in reverse, failing step's undo also fires, original err still visible via errors.Is. - TestRunStartSteps_SkipsNilUndos — optional-undo contract. - TestRunStartSteps_JoinsRollbackErrors — undo failures don't hide the root cause. - TestRunStartSteps_HappyPathNoRollback — success path never fires any undo. Smoke: all 21 scenarios pass, including the start-path ones (bare vm run, workspace vm run, vm restart, vm lifecycle, vm set reconfig) that exercise real firecracker boots end-to-end. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 15:34:34 -03:00
Thales Maciel	2ebd2b64bb	imagepull: update stale package + BuildExt4 docs The package doc in internal/imagepull/imagepull.go still described a two-step Pull + Flatten + BuildExt4 pipeline and warned that the resulting image was "suitable as input to `image build` but not directly bootable" because ownership preservation was deferred. That's been wrong for a while: ApplyOwnership (internal/imagepull/ownership.go) restores tar-header uid/gid/mode via a debugfs set_inode_field batch, and InjectGuestAgents (internal/imagepull/inject.go) writes banger's guest-side assets into the image. `image pull` now produces a directly bootable rootfs end-to-end. Updated: - imagepull.go package doc — describes the full Pull → Flatten → BuildExt4 → ApplyOwnership → InjectGuestAgents pipeline and drops the "Phase A limitations" list that spoke of deferred ownership. - ext4.go BuildExt4 doc — notes that the filesystem is root-owned via `-E root_owner=0:0` and points at ApplyOwnership as the step that handles per-file ownership, instead of the previous "see the package doc for the implications" handwave. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 14:34:25 -03:00
Thales Maciel	5eceebe49f	daemon: persist tap device on VM.Runtime so NAT teardown survives handle-cache loss Cleanup identity for kernel objects was split across two sources of truth: vm.Runtime (DB-backed, durable) held paths and the guest IP, but the TAP name lived only in the in-process handle cache + the best-effort handles.json scratch file next to the VM dir. Every other cleanup-identifying datum has a fallback — firecracker PID can be rediscovered via `pgrep -f <apiSock>`, loops via losetup, dm name from the deterministic ShortID(vm.ID). The tap is the one truly cache-only datum (allocated from a pool, not derivable). That made NAT teardown fragile: - daemon crash between `acquireTap` and the handles.json write - handles.json corrupt on the next daemon start - partial cleanup that already zeroed the cache In any of those cases natCapability.Cleanup short-circuited ("skipping nat cleanup without runtime network handles") and the per-VM POSTROUTING MASQUERADE + the two FORWARD rules keyed off the tap would leak. The VM row in the DB still existed, so a retry couldn't close the loop — the tap name was simply gone. Fix: mirror TapDevice onto model.VMRuntime (serialised via the existing runtime_json column, omitempty so existing rows upgrade cleanly). Set it in startVMLocked right next to the s.setVMHandles call that seeds the in-memory cache; clear it at every post-cleanup reset site (stop normal path + stop stale branch, kill normal path + kill stale branch, cleanupOnErr in start, reconcile's stale-vm branch, the stats poller's auto-stop path). Fallbacks now cascade: - natCapability.Cleanup: handles cache → Runtime.TapDevice - cleanupRuntime (releaseTap): handles cache → Runtime.TapDevice Both surfaces refuse gracefully (old behaviour) only when neither source has a value, which really does mean "no tap was ever allocated for this VM" rather than "we lost track of it." Test: TestNATCapabilityCleanup_FallsBackToRuntimeTapDevice clears the handle cache, sets vm.Runtime.TapDevice, and asserts Cleanup reaches the runner — the exact scenario the review flagged as a plausible leak and the exact code path that now guarantees it doesn't. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 14:21:13 -03:00
Thales Maciel	1850904d9c	file_sync: skip nested symlinks during recursive copy A user who sets `[[file_sync]] host = "~/.aws"` (per the README's own example) can unintentionally copy files from outside that directory if .aws contains symlinks. copyHostDir used os.Stat during recursion, which transparently follows: a symlink to a credential dir elsewhere would be recursed into, materialising unrelated secrets inside the guest. For credential trees that's an avoidable sprawl vector. Switched copyHostDir's per-entry probe from os.Stat to os.Lstat and added a default skip-with-warning branch for ModeSymlink. Files and dirs at the SAME level copy as before; symlinks (both file and directory flavours) surface a "file_sync skipped symlink (would escape the requested tree)" warn log and are otherwise omitted. Top-level entry paths still follow — the Stat in runFileSync is unchanged. The user explicitly named that path, so resolving "~/.aws" through a symlink out of $HOME is on them. Tests: - TestRunFileSyncSkipsNestedSymlinks — builds a synced dir with both a file symlink and a directory symlink pointing outside the tree; asserts real files copy, symlinks do not materialise anywhere in the guest mount, and each skipped symlink surfaces a warn log entry. README updated with a one-line note about the skip behaviour so users know to expect it rather than chasing "why didn't my file show up." Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 14:11:58 -03:00
Thales Maciel	caa6a2b996	model: validate VM names as DNS labels at CLI + daemon A VM name flows into five places that all have narrower grammars than "arbitrary string": - the guest's /etc/hostname (vm_disk.patchRootOverlay) - the guest's /etc/hosts (same) - the <name>.vm DNS record (vmdns.RecordName) - the kernel command line (system.BuildBootArgs*) - VM-dir file-path fragments (layout.VMsDir/<id>, etc.) Nothing in the chain was validating the input. A name with whitespace, newline, dot, slash, colon, or = would produce broken hostnames, weird DNS labels, smuggled kernel cmdline tokens, or (in the worst case) surprising traversal through the on-disk layout. Not host shell injection — we already avoid shelling out with the raw name — but a real correctness and supportability bug. New: model.ValidateVMName. Rules: - 1..63 chars (DNS label max per RFC 1123; also a comfortable /etc/hostname cap) - lowercase ASCII letters, digits, '-' only - no leading or trailing '-' - no normalization — the name is the user-visible identifier (store key, `ssh <name>.vm`, `vm show`); silently rewriting "MyVM" → "myvm" would hand the user back something different than they typed Called from two places: - internal/cli/commands_vm.go vmCreateParamsFromFlags — rejects bad `--name` values before any RPC. Empty name still passes through so the daemon can generate one. - internal/daemon/vm_create.go reserveVM — defense in depth for any non-CLI RPC caller (SDK, direct JSON over the socket). Tests: - internal/model/vm_name_test.go — exhaustive character-class matrix (space, newline, tab, dot, slash, colon, equals, quote, control chars, unicode letters, uppercase, leading/trailing hyphen, over-length, max-length-exact, digits-only). - internal/cli TestVMCreateParamsFromFlagsRejectsInvalidName — CLI wire-through + empty-name passthrough. - internal/daemon TestReserveVMRejectsInvalidName — daemon defense-in-depth (including `box/../evil` path-traversal). - scripts/smoke.sh — end-to-end rejection + no-leaked-row assertion. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 14:06:40 -03:00
Thales Maciel	700a1e6e60	cleanup: drop pre-v0.1 migration scaffolding + legacy-behavior refs banger hasn't shipped a public release — every "legacy", "pre-opt-in", "previously", "migration note", "no longer" reference in the tree is pinning against a state no real user's install has ever been in. That scaffolding has weight: it's a coordinate system future readers have to decode, and it keeps dead code alive. Removed (code): - internal/daemon/ssh_client_config.go - vmSSHConfigIncludeBegin / vmSSHConfigIncludeEnd constants and every `removeManagedBlock(existing, vm...)` call they enabled (legacy inline `Host .vm` block scrub) - cleanupLegacySSHConfigDir (+ its caller in syncVMSSHClientConfig) — wiped a pre-opt-in sibling file under $ConfigDir/ssh - sameDirOrParent + resolvePathForComparison — only ever used by cleanupLegacySSHConfigDir - the "also check legacy marker" fallback in UserSSHIncludeInstalled / UninstallUserSSHInclude - internal/store/migrations.go - migrateDropDeadImageColumns (migration 2) + its slice entry - dropColumnIfExists (orphaned after the above) - addColumnIfMissing + the whole "columns added across the pre- versioning lifetime" block at the end of migrateBaseline — subsumed into the baseline CREATE TABLE - `packages_path TEXT` column on the images table (the throwaway migration 2 dropped it, but there was never any reader) - internal/daemon/vm.go - vmDNSRecordName local wrapper — was justified as "avoid pulling vmdns into every file"; three of four callers already imported vmdns directly, so inline the one stray call - internal/cli/cli_test.go - TestLegacyRemovedCommandIsRejected (`tui` subcommand never shipped) Removed / simplified (tests): - ssh_client_config_test.go: dropped TestSameDirOrParentHandlesSymlinks, TestSyncVMSSHClientConfigPreservesUserKeyInLegacyDir, TestSyncVMSSHClientConfigNarrowsCleanupToLegacyFile, TestSyncVMSSHClientConfigLeavesUnexpectedLegacyContents, TestInstallUserSSHIncludeMigratesLegacyInlineBlock, plus the "legacy posture" regression strings in the remaining happy-path test; TestUninstallUserSSHIncludeRemovesBothMarkerBlocks collapsed to a single-block test - migrations_test.go: dropped TestMigrateDropDeadImageColumns_AcrossInstallPaths, TestDropColumnIfExistsIsIdempotent; TestOpenReadOnlyDoesNotRunMigrations simplified to test against the baseline marker Removed (docs): - README.md "Migration note.*" blockquote about the SSH-key path move - docs/advanced.md parenthetical "(the old behaviour)" Reworded (comments): - Dropped "Previously this file also contained LogLevel DEBUG3..." history from vm_disk.go's sshdGuestConfig doc - Dropped "Call sites that previously read vm.Runtime.{PID,...}" from vm_handles.go; now documents the current contract - Dropped "Pre-v0.1 the defaults are" scaffolding in doctor_test.go - Dropped "no longer does its own git inspection" phrasing in vm_run.go - Dropped the "(also cleans up legacy inline block from pre-opt-in builds)" aside on the `ssh-config` CLI docstring - Renamed test var `legacyKey` → `existingKey` in vm_test.go; its purpose was "pre-existing authorized_keys line," not banger-legacy Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 13:56:32 -03:00
Thales Maciel	5791466498	make: coverage-combined — merge unit-test and smoke covdata Unit tests and the smoke suite cover different halves of the codebase: unit for pure-Go branching (error paths, parsers, handler wiring); smoke for the sudo / firecracker / dm-snap / real-KVM paths unit tests physically can't reach. Separate reports each tell half the story. `make coverage-combined` runs the unit suite with `-test.gocoverdir` pointed at a fresh binary-format dir, then merges it with the existing smoke covdata via `go tool covdata merge`. Modes must match; smoke uses the default 'set', so the unit run aligns by NOT passing -covermode=atomic. Output matches the existing `make coverage` layout (per-package list + total) so the two targets read the same in CI. `make coverage-combined-html` also emits an HTML report at build/combined.cover.html for clicking through the uncovered lines that neither suite touches. Combined total right now: 72.7% (vs 37.7% unit-only / 49% daemon via smoke). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 13:17:17 -03:00
Thales Maciel	235758e5b2	workspace: drop --readonly flag — advisory only against root guests --readonly ran `chmod -R a-w` over the workspace after copying, but every banger guest boots as root, and root bypasses DAC mode checks. So a user running `vm workspace prepare ... --readonly` got the mode bits set to 0444 but `echo x >> file` in the guest still succeeded. The flag promised enforcement it couldn't deliver. The feature also doesn't match the product model: workspaces are prepared precisely so the guest CAN edit them, and `workspace export` exists to pull those edits back as a patch. A "read-only workspace" contradicts that loop. Removed: - CLI flag `--readonly` on `vm workspace prepare` - api.VMWorkspacePrepareParams.ReadOnly field - model.WorkspacePrepareResult.ReadOnly field - daemon chmod dispatch in prepareVMWorkspaceGuestIO - smoke scenario pinning the (advisory) mode-bit behavior - misleading "exportbox-readonly" VM name in an unrelated export test (the test is about not mutating the real git index; renamed to exportbox-noindex-mutation) If real enforcement becomes a user need later, the right primitive is `chattr +i` (immutable bit — root CAN'T write) or a ro bind-mount. Reintroducing a new flag is cheaper than debugging what the current one actually guarantees. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 13:04:33 -03:00
Thales Maciel	bafe816fc7	smoke: cover the gaps — NAT, vm ports/restart/kill/prune, workspace variants, ssh-config Audit of banger's advertised CLI surface vs. what smoke was exercising turned up several gaps where a regression would have shipped silently. New scenarios: - NAT: asserts the per-VM POSTROUTING MASQUERADE rule is installed with --nat (scoped to the guest /32), idempotent across stop/start, and torn down on delete. End-to-end curl tests don't work here because the bridge IP and uplink IP both belong to the host — a guest reaching the uplink lands on host-local loopback whether MASQUERADE is set up or not — so the test pins the iptables rule itself. Skipped if passwordless `sudo iptables` isn't available. - vm ports: sshd :22 must be visible with the <name>.vm endpoint (not localhost, not the raw guest IP — the daemon prefers the DNS record when one exists). - vm restart: dedicated verb, not a stop+start alias. Asserts a fresh boot_id to prove the kernel actually recycled. - vm kill --signal KILL: forceful termination path (distinct from `vm stop`'s graceful Ctrl-Alt-Del). Post-kill state must be 'stopped' (not 'error') and the dm-snapshot must be cleaned up. - vm prune -f: batch delete of non-running VMs while preserving any that are still running. Regression guard for the case where prune could wipe a live session. - workspace prepare --readonly: mode bits on /root/repo/<file> must drop all write bits. Enforcement is advisory against a root guest, so the test asserts the bits, not EACCES. - workspace prepare --mode full_copy: alternate transfer path (tarred into rootfs, no overlay) still lands the repo contents at /root/repo. - workspace export --base-commit: guest-side commits captured in the patch when the pre-commit SHA is pinned. The feature's whole reason for existing; it had zero coverage. Includes a control assertion that the plain (no --base-commit) export does NOT see the committed file. - ssh-config --install / --uninstall: HOME-isolated to a smoke tempdir so we don't touch the invoking user's ~/.ssh/config. Seeds a pre-existing config to catch any regression where install clobbers instead of appending. Asserts idempotency (second install doesn't duplicate the Include line) and clean round-trip (uninstall leaves the user's own content intact). Coverage deltas from smoke (vs the last run): internal/hostnat 14.1% → 64.1% (+50pp — NAT rule dance) internal/daemon/opstate 56.2% → 87.5% (+31pp) internal/daemon 43.4% → 49.4% (+6pp) internal/cli 36.1% → 40.4% (+4pp) internal/daemon/workspace 64.1% → 67.5% (+3pp) Scenario count: 12 → 21. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 12:50:39 -03:00
Thales Maciel	b4afe13b2a	daemon: fix vm start (on a stopped VM) + regression coverage Two defects compounded to make `vm create X` → `vm stop X` → `vm start X` → `vm ssh X` fail with `not_running: vm X is not running` even though `vm show` reports `state=running`. 1. firecracker-go-sdk's startVMM spawns a goroutine that SIGTERMs firecracker when the ctx passed to Machine.Start cancels — and retains that ctx for the lifetime of the VMM, not just the boot phase. Our Machine.Start wrapper was plumbing the caller's ctx through, which on `vm.start` is the RPC request ctx. daemon.go's handleConn cancels reqCtx via `defer cancel()` right after writing the response. Net effect: firecracker is killed ~150ms after the `vm start` RPC "completes", invisibly, and the next `vm ssh` sees a dead PID. `vm.create` side-stepped the bug because BeginVMCreate detaches to context.Background() before calling startVMLocked; `vm.start` used the RPC ctx directly. Fix: Machine.Start now passes context.Background() to the SDK. We own firecracker lifecycle explicitly (StopVM / KillVM / cleanupRuntime), so ctx-driven cancellation here was never actually wired into anything useful. 2. With (1) fixed, the same scenario exposed a second defect: patchRootOverlay's e2cp/e2rm refuses to touch the dm-snapshot with "Inode bitmap checksum does not match bitmap" on a restart, because the COW holds stale free-block/free-inode counters from the previous guest boot. Kernel ext4 is fine with this; e2fsprogs is not. Fix: run `e2fsck -fy` on the snapshot between the dm_snapshot and patch_root_overlay stages. Idempotent on a fresh snapshot, reconciles the bitmaps on a reused COW. Regression coverage: - scripts/repro-restart-bug.sh — minimal create→stop→start→ssh reproducer with rich on-failure diagnostics (daemon log trace, firecracker.log tail, handles.json, pgrep-by-apiSock, apiSock stat). Exits non-zero if the bug returns. - scripts/smoke.sh — lifecycle scenario (create/ssh/stop/start/ ssh/delete) and vm-set scenario (--vcpu 2 → stop → set --vcpu 4 → start → assert nproc=4). Both were pulled when the bug was first found; now restored. Supporting: - internal/system/system.ExitCode — extracts exec.ExitError's code without forcing callers to import os/exec. Needed by the e2fsck caller (policy test pins os/exec to the shell-out packages). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 12:01:46 -03:00
Thales Maciel	e94e7c4dcc	smoke: workspace export scenario + smoke-fresh target + fix the export bug it caught The export round-trip (`vm create` → `workspace prepare` → guest edit → `workspace export`) exposed a reproducible failure on Debian bookworm guests: `git read-tree HEAD --index-output=/tmp/...` returns exit 128 "unable to write new index file" when the target lives on tmpfs while `.git` is on the workspace overlay. Move the temp index into `$(git rev-parse --git-dir)` so it shares a filesystem with `.git/index` and the lockfile + rename + hardlink dance git does internally works. Alongside: - new workspace-export smoke scenario that would have caught this at the boundary between daemon and guest git - `make smoke-fresh` = `smoke-clean && smoke` for release-time runs that want first-install paths (migrations, image pull) stamped into the coverage report Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 11:34:55 -03:00
Thales Maciel	672d7151e9	smoke: five more scenarios + fix exit-code propagation bug the new ones caught Five new smoke scenarios layered on top of the existing bare + workspace vm-run tests: - exit-code propagation: `sh -c 'exit 42'` must rc=42 - workspace dry-run: --dry-run lists tracked files without a VM - workspace --include-untracked: opt-in ships files outside the git index (regression guard on the security-default flip from review 4) - concurrent vm runs: two --rm invocations in parallel both succeed (stresses per-VM locks, createVMMu reservation window, tap pool) - invalid spec rejection: --vcpu 0 must fail with no VM row left behind (the "cleanup on partial failure" path the review flagged) The exit-code scenario caught a real bug on first run: `banger vm run --rm -- sh -c 'exit 42'` returned rc=0, not 42. Root cause in internal/cli/ssh.go's sshCommandArgs: extra args were appended to the ssh argv verbatim, relying on ssh(1)'s implicit space-join to deliver the remote command. That works for single tokens (echo hello) but re-tokenises multi-word commands on the remote side: `ssh host sh -c 'exit 42'` becomes remote `sh -c exit 42`, where `42` is $0 for the already-completed `exit`, and the exit code the user asked for is lost. Fix: shell-quote every element of extra (`'sh'` `'-c'` `'exit 42'`) and join them into a single trailing argv entry. ssh's space-join then produces exactly the command the user typed on the remote shell. TestSSHCommandArgs was updated to pin the quoting; the existing TestRunVMRunCommandModePropagatesExitCode test needed a one-word quote tweak (`false` → `'false'`). Smoke run after fix passes all seven scenarios in ~2 min on warm state. cmd/banger coverage jumped to 100% (the invalid-spec scenario hits the error-reporting path that wasn't covered before). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 19:37:07 -03:00
Thales Maciel	5f81332b0a	make smoke: end-to-end boot suite with coverage from real VM runs The unit + integration tests can't cross machine.Start — the SDK boundary would need a fake firecracker that reimplements the control-plane HTTP API, and the ongoing maintenance cost of keeping that fake honest with upstream kills the value. Instead, add a pre-release smoke target that drives REAL Firecracker + real KVM, captures coverage from the -cover-instrumented binaries, and surfaces per-package deltas so regressions in the boot path don't ship silently. scripts/smoke.sh: - Isolated XDG_{CONFIG,STATE,CACHE,RUNTIME} so the smoke run can't touch real user state (state/cache persist under build/smoke/xdg for fast reruns; runtime is mktemp'd fresh per-run because sockets can't be reused) - Preflight: `banger doctor` must pass; UDP :42069 must be free (otherwise the user's real daemon is up and the smoke daemon can't bind its DNS listener — fail with an actionable message) - Scenario 1 — bare: `banger vm run --rm -- echo smoke-bare-ok` exercises create → start → socket ownership chown → machine.Start → SDK waitForSocket race → vsock agent readiness → guest SSH wait → exec → cleanup → delete - Scenario 2 — workspace: creates a throwaway git repo, runs `banger vm run --rm <repo> -- cat /root/repo/smoke-file.txt`, verifies the tracked file reached the guest (exercises workDisk capability PrepareHost + workspace.prepare) - `banger daemon stop` at the end so instrumented binaries flush GOCOVERDIR pods before the script exits Makefile additions: - smoke-build: builds banger/bangerd under build/smoke/bin/ with `go build -cover` - smoke: runs the script with GOCOVERDIR set, reports per-package coverage via `go tool covdata percent` - smoke-coverage-html: textfmt + go tool cover for a browsable report - smoke-clean: nukes build/smoke/ including the persisted XDG state Bonus fix uncovered during the first smoke run: doctor treated a missing state.db as a FAIL ("out of memory" from SQLite SQLITE_CANTOPEN), which red-flagged every fresh install. Split the store check: DB file absent → PASS with "will be created on first daemon start" detail; DB present but unreadable → FAIL as before. New TestDoctorReport_StoreMissingSurfacesAsPassForFreshInstall pins the behaviour. Concrete coverage delta from the first successful smoke run (compared to `make coverage-total`'s unit-test-only 37.8%): internal/firecracker 43.6% → 75.0% internal/daemon/workspace 33.8% → 60.8% internal/store 40.1% → 56.3% internal/guest 63.7% → 57.4% (different mix: smoke exercises real SSH; unit tests cover more error branches) The packages the review flagged are the ones that moved most — which is the point. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 18:59:57 -03:00
Thales Maciel	52612b516b	README: describe the SSH-key migration as a VM restart, not workspace sync The migration note said existing VMs needed a "fresh workspace sync" to pick up a new host SSH key fingerprint. That's wrong: workspace.prepare (vm workspace prepare) touches the git checkout, not authorized_keys. The authorized_keys rewrite happens on the vm start path — specifically in workDiskCapability.PrepareHost calling WorkspaceService.ensureAuthorizedKeyOnWorkDisk, which runs during start, not during an explicit workspace sync. Rewrite the note to name the actual recovery action: stop-and-start (or vm restart). Leave the --rm caveat — those flows always boot fresh and don't carry the problem. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 18:06:52 -03:00
Thales Maciel	e2ac70631b	test: end-to-end VMService lifecycle flow harness Uses the newTestDaemon harness added earlier in this series to drive VMService.CreateVM → FindVM → (duplicate CreateVM reject) → DeleteVM through the real code path. Pins contracts the single-responsibility tests can't: - image resolution finds a locally-upserted row - reserveVM allocates an IP, mkdirs VMDir, persists the Stopped row - FindVM round-trips the record - duplicate-name CreateVM fails without persisting a second row - DeleteVM runs cleanupRuntime with zero handles (no sudo calls needed), removes the store row, removes the VMDir Plus TestVMCreateWithUnknownImageFails for the error branch: if CreateVM can't resolve an image, it must error before mutating any state. Scope: everything except firecracker boot. NoStart: true skips machine.Start, which is the upstream SDK boundary we can't cross without a real firecracker binary. The integration we GET exercises name/IP reservation, per-VM lock lifecycle, store round-trip, VMDir lifecycle, and the never-started delete path — all of which were only indirectly covered by unit tests before. -race clean across the two tests; nothing touches goroutines but the harness does initialize the tap pool background goroutine on wireServices, which the race detector validated. Coverage is flat at the global level (37.8% → 37.8%) because this slice tests integration of already-covered units, not new branches. That's expected — the slice's value is as a regression bedrock for future refactors, not a line-count bump. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 17:55:04 -03:00
Thales Maciel	7820960706	store: edge-path tests for migrations and Open Three gaps from the coverage plan, none of which were covered before. internal/store/migrations_test.go: TestRunMigrationsIgnoresUnknownAppliedIDs — simulates a DB written by a newer banger opened by an older one: schema_migrations carries an id (9001) the current binary doesn't know about. The runner must leave the alien row alone AND still apply its own known migrations. Without this, forward-then-backward upgrades or running two daemon versions against the same state dir would either fail or start destructively reinterpreting rows. TestDropColumnIfExistsIsIdempotent — pins the "run twice, no harm" property. A daemon restart after migration 2 succeeded on a fresh install must not fail because the column is already gone. dropColumnIfExists is what makes that idempotent. internal/store/store_test.go: TestOpenRejectsCorruptDB — writes garbage to state.db, Open must error cleanly (not panic, not silently overwrite). Also verifies the garbage bytes are untouched so the operator can hand the file to a recovery tool. TestOpenReadOnlyRejectsMissingDB — the doctor path must not silently create an empty DB when none exists; that would make "no VMs yet" and "your state is missing" indistinguishable. Package function coverage nudged 39.1% → 40.1%. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 17:51:52 -03:00
Thales Maciel	2f3db9b104	fcproc: targeted tests for waitForPath + EnsureSocketAccess error paths Every non-happy branch in fcproc was zero-covered before this. Given that EnsureSocketAccess gates the firecracker control plane on the daemon's ability to chown the API + vsock sockets off root, those failure paths are exactly the ones we need pinned. New file internal/daemon/fcproc/fcproc_test.go adds a local scripted Runner (fcproc is a leaf package — can't pull the daemon's scriptedRunner in) and six tests: waitForPath: - TestWaitForPathReturnsDeadlineExceededWhenSocketNeverAppears — timeout branch wraps context.DeadlineExceeded with the label, and waits at least one poll tick before giving up - TestWaitForPathReturnsOnceSocketAppears — happy path with a mid-wait file creation via goroutine - TestWaitForPathRespectsContextCancellation — ctx.Done() beats the poll interval so a cancelled request doesn't stall EnsureSocketAccess: - TestEnsureSocketAccessChownFailureBubbles — chown error surfaces untouched; chmod not attempted when chown fails - TestEnsureSocketAccessChmodFailureBubbles — chmod error surfaces after chown succeeds - TestEnsureSocketAccessTimesOutBeforeTouchingRunner — ordering contract: no sudo calls when the socket never materialises Package function coverage moved 55.2% → 62.1%. Integration-level chown-race test was considered (run a real shell that exercises buildProcessRunner's script with a fake firecracker binary) but skipped — requires `sudo -n` in the test env and makes CI fragile. The socket-ownership regression this slice is meant to guard against is covered at the unit level here; the manual-smoke in the plan's verification section remains the end-to-end check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 17:49:42 -03:00
Thales Maciel	cef9bf92a5	ssh-config: harden sameDirOrParent against symlinks + add edge tests The symlink test in this commit catches a real bug: sameDirOrParent used filepath.Abs for both sides of the "is the key inside the legacy dir?" check, but filepath.Abs doesn't resolve symlinks. A user whose ssh_key_path pointed into ConfigDir/ssh via a symlinked spelling (e.g. ConfigDir itself is a symlink, or the user maintains an alias tree) would have their key silently deleted by the legacy-dir scrub — the gate thought the key lived elsewhere because the two spellings didn't match lexically. Fix: resolvePathForComparison tries filepath.EvalSymlinks first, falls back to filepath.Abs when the path doesn't exist yet (new install, pre-first-Open). Both sides of the sameDirOrParent comparison now use this helper, so a symlinked key + canonical dir (or the reverse) lands in the same physical path before the Rel check. Tests added in this commit: internal/daemon/ssh_client_config_test.go TestSameDirOrParentHandlesSymlinks — symlinked-key + canonical-dir and the reverse are both reported "inside"; unrelated paths stay out. Skips if the filesystem doesn't support symlinks. internal/config/config_test.go TestLoadNormalizesAbsoluteSSHKeyPath — trailing slash, duplicate slashes, dot segments all collapse via filepath.Clean, so two spellings of the same path compare equal downstream. TestEnsureDefaultSSHKeyRejectsCorruptExistingFile — regression guard against a future "regenerate if invalid" patch that would silently nuke a real user key. TestResolveSSHKeyPathRejectsEmptySSHDirAndStateDir — pins the absolute-path guard that stops a bad layout from scribbling into cwd (this was the test that caught the stray internal/config/ssh/ a few commits back). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 17:48:06 -03:00
Thales Maciel	b2756f5e7e	test: add newTestDaemon harness + options No Daemon test in this package has a shared constructor. Every file re-derives the same pattern — &Daemon{...}, wireServices(d), maybe override a field — which means new lifecycle / integration tests spend half their length standing up infrastructure instead of exercising behaviour. Consolidate into internal/daemon/daemon_testing_test.go: func newTestDaemon(t testing.T, opts ...testDaemonOption) Daemon Defaults: tempdir layout (distinct StateDir/ConfigDir/SSHDir/...), fresh store.Store with migrations auto-run, permissiveRunner, io.Discard logger, empty vmCaps (so default workDisk/dns/nat capabilities don't fire real side effects in tests that just want to exercise VMService plumbing). Options so far: - withRunner(system.CommandRunner) - withConfig(model.DaemonConfig) - withStore(store.Store) - withLogger(slog.Logger) - withLayout(paths.Layout) - withVMCaps(caps ...vmCapability) - withVsockHostDevice(string) withVMCaps tracks a vmCapsSet flag so tests that explicitly pass no caps (i.e. the default) still get the empty-slice behaviour — the reset after wireServices only fires when the caller didn't opt in. That keeps wireServices's production semantics unchanged: if you construct a real Daemon without pre-populating vmCaps, you still get the default three. Two smoke tests pin: - zero-option call wires every service, gives an empty-vmCaps daemon with the default vsock device, store non-nil - each option actually lands on the resulting Daemon (guards against silent rename) Existing tests unchanged — this is purely additive. Later slices (Firecracker error-path tests, store migration edges, lifecycle flow harness) will adopt the helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 17:45:43 -03:00
Thales Maciel	e2885060dc	README: add ssh_key_path migration note + document normalization rules Docs pointing at the new state-dir default were updated in `b1fbf69`; what was still missing is the migration guidance the review asked for. Add a short note under the ssh_key_path bullet covering: - what moved (~/.config/banger/ssh/id_ed25519 → ~/.local/state/banger/ssh/id_ed25519) - that users with the old path hardcoded in config.toml are safe (the narrowed legacy-dir cleanup preserves the enclosing dir when ssh_key_path points inside it) - that unsetting the key and letting banger manage the new default is also fine — the only caveat is existing VMs need a stop-and-start to re-sync authorized_keys Also document the new normalization rules (~/ expansion, absolute required) on the ssh_key_path bullet itself so users know what's accepted before they hit a load error. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 17:14:00 -03:00
Thales Maciel	617008e8f1	config: normalize ssh_key_path — expand ~/, reject non-absolute Bug: resolveSSHKeyPath returned a configured ssh_key_path verbatim. That meant: - ssh_key_path = "~/.ssh/id_ed25519" kept the literal "~" — downstream readers (internal/guest/ssh.go, internal/daemon/image_seed.go, internal/daemon/vm_authsync.go, internal/cli/ssh.go) do raw os.ReadFile on the path and fail at runtime with a path that looks fine but isn't. - ssh_key_path = "id_ed25519" (relative) silently worked or didn't depending on the daemon's cwd — the daemon process's cwd is not the user's shell cwd, so behavior was non-obvious. Fix: add normalizeSSHKeyPath() run over configured values. It: - expands "~/..." against $HOME - rejects bare "~" (ambiguous) - rejects "~user/..." (we don't do user-tilde) - rejects relative paths outright - returns filepath.Clean'd absolute paths Tests cover the accepting case (home-anchored expansion) and every rejection branch via a table-driven subtests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 17:00:34 -03:00
Thales Maciel	b1fbf695ca	ssh-config: narrow the legacy-dir cleanup so it can't delete a user key Bug: syncVMSSHClientConfig did os.RemoveAll on $ConfigDir/ssh every daemon Open. The intent was to migrate off the pre-opt-in layout, where banger used to write $ConfigDir/ssh/ssh_config. But a user who sets ssh_key_path = "~/.config/banger/ssh/id_ed25519" in config.toml has their key live exactly in that dir — and the scrub deletes it along with every other file in the tree. This is the same class of bug that cost the default key until `ebe6517` moved it to StateDir, but that fix was scoped to the default path. A configured ssh_key_path pointed under the legacy dir still dies. Fix: replace os.RemoveAll with a narrow two-step cleanup: 1. Skip the cleanup entirely when the configured ssh_key_path resolves under the legacy dir. A user who pointed banger at a key there must keep the enclosing directory. 2. Otherwise, os.Remove the specific legacy file ($ConfigDir/ssh/ ssh_config) and then os.Remove the directory. The second os.Remove fails with ENOTEMPTY if the dir still holds anything (e.g. a user-managed sibling file we don't own). Both errors are swallowed — this is best-effort migration, not a hard failure. Tests pin all three paths: user key under legacy dir survives, legacy dir empties and is removed when the user moved on, and a user-managed sibling file in the legacy dir is preserved. Also fix stale doc claims in README.md and AGENTS.md — both still pointed at the old ~/.config/banger/ssh/id_ed25519 default, which moved to ~/.local/state/banger/ssh/id_ed25519 in `ebe6517`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 16:31:07 -03:00
Thales Maciel	fba30f26d4	firecracker: chown API + vsock sockets inside the sudo shell Bug: Firecracker creates its API and vsock sockets as root:root 0700 (enforced by the intentional umask 077 in buildProcessRunner). The daemon, running as the invoking user, then can't connect(2) to either — AF_UNIX connect needs write permission on the socket file and 0700 root-owned leaves thales without any. firecracker-go-sdk's Machine.Start() blocks on waitForSocket, which probes the socket with both os.Stat (succeeds — parent dir is the user's XDG_RUNTIME_DIR) and an HTTP GET over the socket (fails — EACCES on connect). The SDK loops for 3 seconds then fails with "Firecracker did not create API socket ... context deadline exceeded". The daemon's EnsureSocketAccess chown was meant to fix permissions, but it runs after Machine.Start returns — and Start never returns because it's still looping on the SDK's probe. Chicken-and-egg. Fix: inside the sudo'd shell that launches firecracker, spawn a background subshell that polls for each expected socket (API + vsock, when configured) and chowns it to $SUDO_UID:$SUDO_GID as soon as it appears. The background polling is bounded at 1s (20 × 50ms) so a broken firecracker invocation doesn't leak a waiting shell. Post-fix: socket appears root-owned 0600 briefly, is chowned to the invoking user within ~50ms, SDK's HTTP probe succeeds, Machine.Start returns normally. EnsureSocketAccess's later chmod 600 remains the belt-and-braces guarantee on final mode. Verified: manual repro of the shell script produces a socket owned by thales:thales that a non-root python socket.connect() accepts. Without the fix the same setup gives "PermissionError: [Errno 13] Permission denied". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 16:09:02 -03:00
Thales Maciel	60f90eb8be	config: harden resolveSSHKeyPath against relative paths + drop stray test keys Prior commit's test (TestLoadDefaultsResolveFirecrackerAndGenerateSSHKey) was contained, but every OTHER config test called Load(paths.Layout{ConfigDir: ...}) without SSHDir or StateDir set. Under the new code path that meant resolveSSHKeyPath produced a relative target ("ssh/id_ed25519") which go test happily wrote against the package's own source directory — and I caught that in the commit after the fact, in the form of internal/config/ssh/ showing up as tracked files. Two changes: - resolveSSHKeyPath now errors if the resolved path is not absolute. paths.Resolve always produces an absolute SSHDir in production; this just stops a fumbled layout from silently scribbling into cwd. - Every existing config test that was relying on the old ConfigDir/ssh path gets an explicit SSHDir: t.TempDir() added, restoring the key-generation surface under tempdir isolation. Delete internal/config/ssh/ — those files were test-generated by the prior commit and have no business in the repo. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 14:31:11 -03:00
Thales Maciel	ebe651762f	config: put the default SSH key under the state dir, not ConfigDir/ssh Bug: every daemon Open deleted the freshly-generated default SSH key before returning, so the next VM create failed reading it. Sequence: 1. Open → config.Load → resolveSSHKeyPath generates ~/.config/banger/ssh/id_ed25519 2. Open → ensureVMSSHClientConfig → syncVMSSHClientConfig scrubs ~/.config/banger/ssh entirely as a migration step for the pre-opt-in layout (commit `108f7a0`) The scrub was added for a file that used to live at ConfigDir/ssh/ssh_config, but it os.RemoveAll'd the whole ConfigDir/ssh dir — including the id_ed25519 the key generator had just put there. Fix: point the default key at layout.SSHDir (a StateDir-rooted path that paths.Ensure already creates). The scrub can keep cleaning up ConfigDir/ssh because nothing banger writes under it anymore. Users whose ssh_key_path is explicitly set in config.toml are unaffected — configured wins. Users on the default path will get a fresh key at StateDir/ssh/id_ed25519 on their next daemon Open; existing VMs' authorized_keys re-sync on next start/create through ensureAuthorizedKeyOnWorkDisk, so no manual intervention is needed beyond restarting the daemon. Regression test pins the new placement and asserts the legacy path stays empty. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 14:29:34 -03:00
Thales Maciel	80ae4d6667	docs: resync package docs, AGENTS, and kernel-catalog with current code Four drift fixes from a doc sweep. internal/daemon/doc.go Replace the capability-hook description that still said "Hook methods take Daemon; VMService reaches them through a capabilityHooks seam." Current reality: every capability is a plain struct carrying its own service pointers (workDiskCapability{vm,ws,store}, dnsCapability{net}, natCapability{vm,net,logger}); wireServices builds the default list; no hook reaches Daemon. internal/daemon/ARCHITECTURE.md The VMService field list still claimed guestWaitForSSH and guestDial were "per-instance fields." Those were deleted as refactor residue. Update the note to say the seams live on *Daemon (reached by WorkspaceService via closures wired at construction) and document the vsockHostDevice field that replaced the old package-global vsockHostDevicePath. AGENTS.md Drop the "experimental web UI" mention (removed) and the `session` subpackage (removed). Mention banger-vsock-agent as the third cmd/ binary while we're here — AGENTS hadn't listed it. docs/kernel-catalog.md The trust-model section still read as if upstream kernel sources were fetched by HTTPS alone. Add a paragraph covering the PGP verification make-generic-kernel.sh now does against the detached .tar.sign and the three kernel.org release signing keys. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 13:01:11 -03:00
Thales Maciel	88bc466d58	tests: targeted coverage for doctor, workspace rejections, and nat capability Three thematic test files pinning behavior surfaces that had none before, following the review's recommendation to plug concrete error/cleanup branches rather than chase a coverage percentage. doctor_test.go Covers Daemon.doctorReport end-to-end with a permissive runner + fake executables on PATH. Pins: store error surfaces as fail, store success as pass, missing firecracker kills the host-runtime check, the three default capability feature checks (work disk, vm dns, nat) are emitted, vm-defaults is always-pass with provenance. Previously 0% — now the Doctor() command's contract with the CLI is under guard. workspace_rejection_test.go Covers the four early-exit branches of PrepareVMWorkspace that the existing happy-path + lock-release tests never hit: malformed mode, --from without --branch, VM not running, VM not found. Each one returns before any SSH I/O, so the fake-firecracker infra the happy-path test needs is unnecessary — a bare wired daemon with a stored VMRecord suffices. nat_capability_test.go Covers natCapability.ApplyConfigChange (unchanged flag → no-op, VM not alive → no-op, toggle on live VM → runner reached) and natCapability.Cleanup (NAT disabled → no-op, runtime handles missing → defensive no-op, full wiring → ensureNAT(false)). A countingRunner + startFakeFirecracker fixture stands in for the real host plumbing, with waitForVMAlive polling past the exec -a race window that startFakeFirecracker exposes on loaded CI boxes. make coverage-total 37.8% → 38.6%. The number isn't the point — these tests exist so the next refactor in this area has to break an explicit assertion to drift. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 12:58:12 -03:00
Thales Maciel	3aec590deb	vmservice: delete dead guestWaitForSSH + guestDial seams VMService carried guestWaitForSSH and guestDial fields + matching constructor args ever since the daemon split, but nothing on VMService ever read them. The live guest-SSH path runs on Daemon (d.waitForGuestSSH / d.dialGuest in guest_ssh.go); WorkspaceService reaches those through closures wired in wireServices. So the VMService copies were refactor residue: they made the service look more decoupled than it actually is, and any future test that stubbed VMService.guestDial would be stubbing nothing. Delete the fields, the deps entries, the newVMService assignments, and the wireServices passes. Real seams on Daemon are unchanged — those are the ones tests (e.g. workspace_test.go) already set directly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 12:45:27 -03:00
Thales Maciel	bbd187391e	noteUntrackedSkipped: fix subdir underreport + be best-effort everywhere Two bugs in the untracked-skipped warning, both surfaced by review. 1. Wrong scope for subdir inputs. The helper accepted any path the caller had (sourcePath, which may be a user-supplied subdirectory) and ran `git -C <path> ls-files --others --exclude-standard`. Git scopes that output to the cwd, so pointing `vm run ./repo/sub` at a subdir silently underreported untracked files living elsewhere in the repo — exactly the files the warning exists to surface. Fix: resolve sourcePath to the repo root inside the helper via `rev-parse --show-toplevel` before counting. 2. Inconsistent failure handling. The comment said the helper should be silent when the count can't be determined; the body returned the error. vm_run.go treated the error as non-fatal (logged a warning, continued); workspace prepare and --dry-run aborted the whole operation on the same helper failure. A courtesy notice shouldn't kill the operation. Fix: make the helper best-effort in signature and body — no error return, swallows rev-parse + count failures, emits nothing when there's nothing to say. All three callers lose their error branches. Regression tests: - subdir input reports the root-level untracked file (the bug case) - non-repo path produces silence, not a fatal error - inspector whose runner errors on every call produces silence Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 12:42:33 -03:00
Thales Maciel	ecb18ce6ca	seams: move the last four package globals onto instance fields Three test seams were still package-level mutable vars, which tests had to swap before use. That's the classic path to flaky parallel tests — two goroutines fighting over the same global fake. Push each down to the struct that owns the behaviour. internal/daemon/dns_routing.go lookupExecutableFunc + vmDNSAddrFunc → fields on HostNetwork, defaulted at newHostNetwork time. dns_routing_test builds HostNetwork{..., lookupExecutable: stub, vmDNSAddr: stub} inline, no more t.Cleanup dance around package-level vars. internal/daemon/preflight.go + doctor.go vsockHostDevicePath (mutable string) → vsockHostDevice field on VMService, defaulted via defaultVsockHostDevice constant in newVMService. Preflight reads s.vsockHostDevice; doctor reads d.vm.vsockHostDevice. Logger test sets d.vm.vsockHostDevice = tmp after wireServices. internal/daemon/workspace/workspace.go HostCommandOutputFunc → Inspector struct with a Runner field. Every git-using helper (GitOutput, GitTrimmedOutput, GitResolvedConfigValue, RunHostCommand, ListSubmodules, ListOverlayPaths, CountUntrackedPaths, InspectRepo, ImportRepoToGuest, PrepareRepoCopy) is now a method on Inspector. NewInspector() wraps the real host runner for production; WorkspaceService holds one via repoInspector, CLI deps holds one too. cli_test.go's submodule-rejection test builds its own Inspector with a scripted Runner instead of patching a global. Pure helpers (FinalizeScript, ResolveSourcePath, ParsePrepareMode, ShellQuote, FormatStepError, GitFileURL, ParseNullSeparatedOutput) stay free functions since they don't touch the host. Sentinel: grep for HostCommandOutputFunc, lookupExecutableFunc, vmDNSAddrFunc, vsockHostDevicePath is now empty across internal/. make lint test green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 12:07:14 -03:00
Thales Maciel	2685bc73f8	doctor: open the state DB read-only so inspection never mutates it `banger doctor` used to call store.Open, which unconditionally runs migrations on the way up. Diagnostics mutating persistent state is a surprise — particularly now that migration 2 drops a column, so a plain `doctor` invocation against an old DB would silently schema- evolve it. Add store.OpenReadOnly: separate DSN builder with mode=ro and a minimal pragma set (foreign_keys, busy_timeout — no journal_mode=WAL, no wal_autocheckpoint), skips runMigrations, and pings on open so a missing DB fails up front rather than at first query. doctor.go now uses OpenReadOnly; the existing storeErr fallback path surfaces any failure as a failing check, unchanged. Tests pin two invariants: - OpenReadOnly against a DB whose migration 2 marker was removed and packages_path re-added must leave both alone (i.e. no drift is applied behind the user's back). - Any write attempted through the read-only handle is rejected at the driver layer (belt-and-braces for future refactors). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 11:05:23 -03:00
Thales Maciel	129475be20	config + store: remove dead knobs and stale schema Three drift items surfaced in review, each dead on arrival and each worth trusting a little more at v0.1.0. config: drop MetricsPollInterval. The field was parsed from TOML (metrics_poll_interval), stored on DaemonConfig, and ignored by every consumer — only StatsPollInterval drives the background poll loop. Users setting it in config.toml saw zero effect. Removed from the TOML surface, the model constant, and the config test. daemon: delete ensureDefaultImage. No callers, body was `_ = ctx; return nil`. Dead since whatever flow used to call it got removed. store: drop packages_path from the images table. The column was carried by the baseline migration but never referenced by UpsertImage (no INSERT / UPDATE mention) or any Go model field — a ghost from a build pipeline that no longer exists. Added migration id=2 (drop_dead_image_columns) with an idempotent dropColumnIfExists helper: fresh installs run baseline (creates the column) + 2 (drops it); legacy DBs where the column was never added get a no-op. Updated the direct-INSERT SQL in TestGetImageRejectsMalformedTimestamp to drop the column reference, and added a migration test covering both install paths (fresh + legacy). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-22 10:54:01 -03:00
Thales Maciel	2a7f55f028	vm run: ship tracked files only by default; add --include-untracked + --dry-run Workspace-mode vm run and vm workspace prepare used to copy both tracked AND untracked non-ignored files into the guest. That silently catches local .env files, scratch notes, credentials, and any other working-tree state a developer hasn't explicitly gitignored — a real data-exposure footgun given the golden image ships Docker and the usual dev tooling. Flip the default to tracked-only. Users who actually want the fuller set opt in with --include-untracked (documented in both commands' help). Gitignored files are still always excluded regardless of the flag. Add --dry-run to both vm run and vm workspace prepare. Dry-run inspects the repo CLI-side (no VM created, no daemon RPC needed since the daemon is always local and the inspection is a pure git read), prints the exact file list + mode, and exits. A byte-level preview of what would land in the guest. When running real (non-dry) and untracked files exist in the repo but are being skipped under the new default, print a one-line notice pointing to --include-untracked so users aren't surprised when the guest is missing something they expected. Signature changes: - ListOverlayPaths takes an includeUntracked bool (tracked always; untracked gated by flag). - InspectRepo takes the same flag and passes it through. - VMWorkspacePrepareParams gains IncludeUntracked. - WorkspaceService.workspaceInspectRepo seam signature widened to match (4 callers in tests updated). New workspace package tests cover both modes and verify that gitignored files never leak regardless of the flag. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 19:53:17 -03:00
Thales Maciel	25a1466947	supply chain: verify signatures and pins across image + kernel builds Three independent hardenings, addressing a review finding that the kernel and image build pipelines were relying on HTTPS alone for artifact integrity. scripts/make-generic-kernel.sh - Fetch the detached PGP signature (linux-<ver>.tar.sign) alongside the tarball and verify it with gpg before extraction. An isolated $GNUPGHOME under the tempdir keeps the kernel signers out of the invoking user's keyring. - Import the three kernel.org release signing keys (Greg KH / Linus / Sasha Levin) from keyserver.ubuntu.com, falling back to keys.openpgp.org. Ubuntu comes first because keys.openpgp.org strips unverified UIDs on upload, leaving gpg with UID-less keys it refuses to trust. - Require VALIDSIG (cryptographic proof) rather than GOODSIG (printed even for expired keys) before proceeding. Verified end-to-end against a clean tarball (accepts) and a byte-flipped tampered copy (rejects with BADSIG). - gpg + gpgv + xz added to the required-tools check. images/golden/Dockerfile - Pin Docker's apt signing key by fingerprint. After downloading /etc/apt/keyrings/docker.asc we gpg --show-keys --with-colons it, extract the fpr, and compare against the expected 9DC858229FC7DD38854AE2D88D81803C0EBFCD88. A tampered or swapped key aborts the build before any apt repo metadata is fetched. - Replace `curl https://mise.run \| sh` with a pinned GitHub release binary (mise v2026.4.18, linux-x64) verified against its published sha256. Refuses to build on unknown architectures rather than silently installing a binary we have no hash for. - Add gnupg to the ESSENTIAL apt-get install so the fingerprint check has gpg available. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 19:38:13 -03:00
Thales Maciel	011b59a72f	daemon split (8/8): document capability decoupling + wireServices Update ARCHITECTURE.md's Composition section to reflect the finished split: capabilities carry explicit service-pointer fields, nothing reaches Daemon at dispatch time, and wireServices(d) is the single entry point that builds services + capabilities eagerly (from Open in production, from tests after constructing &Daemon{...} literals). Removes the paragraph admitting capability→Daemon coupling and the lazy-init getters justification, neither of which applies anymore. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 15:59:39 -03:00
Thales Maciel	9c73155e17	daemon split (7/n): narrow capability interfaces, wire deps at construction Stop passing Daemon into capability hooks. Each capability implementation is now a struct with explicit service-pointer fields populated at wireServices time; the six dynamic-dispatch interfaces (AddStartPreflight, PrepareHost, PostStart, Cleanup, ApplyConfigChange, AddDoctorChecks) no longer have a Daemon parameter. Capability methods reach their dependencies through struct fields, not through d.vm / d.ws / d.net. - workDiskCapability carries {vm, ws, store, defaultImageName} - dnsCapability carries {net} - natCapability carries {vm, net, logger} Daemon.defaultCapabilities() builds the production list from the already-constructed services and is called from wireServices so d.vmCaps is populated eagerly. Tests that preinstall d.vmCaps with stubs still work — wireServices only overwrites an empty slice. registeredCapabilities() is gone (every dispatch loop now reads d.vmCaps directly). capabilities_test.go's testCapability fake drops Daemon from its method set to match the new interfaces. This finishes the daemon service split: capability implementations no longer reach through the composition root, there's no path back to Daemon from any service or capability, and test construction goes through one explicit wireServices call instead of lazy getters. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 15:59:09 -03:00
Thales Maciel	16702bd5e1	daemon split (6/n): extract wireServices + drop lazy service getters Factor the service + capability wiring out of Daemon.Open() into wireServices(d), an idempotent helper that constructs HostNetwork, ImageService, WorkspaceService, and VMService from whatever infrastructure (runner, store, config, layout, logger, closing) is already set on d. Open() calls it once after filling the composition root; tests that build &Daemon{...} literals call it to get a working service graph, preinstalling stubs on the fields they want to fake. Drops the four lazy-init getters on *Daemon — d.hostNet(), d.imageSvc(), d.workspaceSvc(), d.vmSvc() — whose sole purpose was keeping test literals working. Every production call site now reads d.net / d.img / d.ws / d.vm directly; the services are guaranteed non-nil once Open returns. No behavior change. Mechanical: all existing `d.xxxSvc()` calls (production + tests) rewritten to field access; each `d := &Daemon{...}` in tests gets a trailing wireServices(d) so the literal + wiring are side-by-side. Tests that override a pre-built service (e.g. d.img = &ImageService{ bundleFetch: stub}) now set the override before wireServices so the replacement propagates into VMService's peer pointer. Also nil-guards HostNetwork.stopVMDNS and d.store in Close() so partially-initialised daemons (pre-reconcile open failure) still tear down cleanly — same contract the old lazy getters provided. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 15:55:28 -03:00
Thales Maciel	0cfd8a5451	daemon split (5/5): document the service composition Phase 5 of the daemon god-struct refactor. Code motion landed in phases 1-4; this commit retells the architecture so the docs match the structure. ARCHITECTURE.md loses the "deferred v0.2 project" hedge about splitting services. The Composition section now describes the four services (HostNetwork, ImageService, WorkspaceService, VMService) that own behaviour, the consumer-defined seam pattern for cross-service calls, and the lazy-init getter pattern that keeps existing test literals compiling. doc.go inventories which methods live on which service, and the lock-ordering section gains the service prefixes (e.g. VMService.vmLocks instead of bare vmLocks) so readers don't have to guess which type owns which mutex. No code changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 20:58:53 -03:00
Thales Maciel	466a7c30c4	daemon split (4/5): extract VMService service Phase 4 of the daemon god-struct refactor. VM lifecycle, create-op registry, handle cache, disk provisioning, stats polling, ports query, and the per-VM lock set all move off Daemon onto VMService. Daemon keeps thin forwarders only for FindVM / TouchVM (dispatch surface) and is otherwise out of VM lifecycle. Lazy-init via d.vmSvc() mirrors the earlier services so test literals like \`&Daemon{store: db, runner: r}\` still get a functional service without spelling one out. Three small cleanups along the way: preflight helpers (validateStartPrereqs / addBaseStartPrereqs / addBaseStartCommandPrereqs / validateWorkDiskResizePrereqs) move with the VM methods that call them. * cleanupRuntime / rebuildDNS move to VMService, with HostNetwork primitives (findFirecrackerPID, cleanupDMSnapshot, killVMProcess, releaseTap, waitForExit, sendCtrlAltDel) reached through s.net instead of the hostNet() facade. vsockAgentBinary becomes a package-level function so both Daemon (doctor) and VMService (preflight) call one entry point instead of each owning a forwarder method. WorkspaceService's peer deps switch from eager method values to closures — vmSvc() constructs VMService with WorkspaceService as a peer, so resolving d.vmSvc().FindVM at construction time recursed through workspaceSvc() → vmSvc(). Closures defer the lookup to call time. Pure code motion: build + unit tests green, lint clean. No RPC surface or lock-ordering changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 20:57:05 -03:00
Thales Maciel	c0d456e734	daemon split (3/5): extract WorkspaceService service Third phase of splitting the daemon god-struct. WorkspaceService now owns workspace.prepare / workspace.export plus the ssh-key + git-identity + arbitrary-file sync that runs as part of VM start's prepare_work_disk capability hook. workspaceLocks (the per-VM tar serialisation set) lives on the service. workspace.go and vm_authsync.go flipped receivers from Daemon to *WorkspaceService. The workspaceInspectRepo / workspaceImport test seams moved onto the service as fields. Peer-service dependencies go through narrow function-typed fields: vmResolver, aliveChecker, waitGuestSSH, dialGuest, imageResolver, imageWorkSeed, withVMLockByRef, beginOperation. WorkspaceService never touches VMService / HostNetwork / ImageService directly — only the exact operations the Daemon hands it at construction. Daemon lazy-init helper workspaceSvc() mirrors the Phase 1/2 pattern. Test literals still write `&Daemon{store: db, runner: r}` and get a wired workspace service for free. Tests that override the inspect/import seams (workspace_test.go, ~4 sites) assign them on d.workspaceSvc() instead of on the daemon literal. Dispatch in daemon.go: vm.workspace.prepare and vm.workspace.export now forward one-liners to d.workspaceSvc(). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 20:42:31 -03:00
Thales Maciel	d7614a3b2b	daemon split (2/5): extract ImageService service Second phase of splitting the daemon god-struct. ImageService now owns all image + kernel registry operations: register/promote/delete/pull for images (bundle + OCI paths), the six kernel commands, and the shared SSH-key/work-seed injection helpers. imageOpsMu (the publication-window lock) lives on the service; so do the three OCI pull test seams pullAndFlatten / finalizePulledRootfs / bundleFetch. The four files images.go, images_pull.go, image_seed.go, kernels.go flipped their receivers from Daemon to ImageService. FindImage moved with the service. Daemon keeps a thin FindImage forwarder so callers reading the dispatch code see the obvious facade and tests that pre-date the split still compile. flattenNestedWorkHome — called from image_seed.go, vm_authsync.go, and vm_disk.go across future service boundaries — became a package-level helper taking a CommandRunner explicitly. Daemon keeps a deprecated forwarder for now; the other services will use the package form. Lazy-init helper imageSvc() on Daemon mirrors hostNet() from Phase 1, so test literals like &Daemon{store: db, runner: r, ...} that don't spell out an ImageService still get a working one. Tests that override the image test seams (autopull_test, concurrency_test, images_pull_test, images_pull_bundle_test) now assign d.img = &ImageService{...seams...}; the two-statement pattern matches what Phase 1 established for HostNetwork. Dispatch in daemon.go is cleaner now: every image/kernel RPC handler is a single-liner forwarding to d.imageSvc().. Phase 5 will do the same for VM lifecycle. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 20:30:32 -03:00
Thales Maciel	362009d747	daemon split (1/5): extract HostNetwork service First phase of splitting the daemon god-struct into focused services with explicit ownership. HostNetwork now owns everything host-networking: the TAP interface pool (initializeTapPool / ensureTapPool / acquireTap / releaseTap / createTap), bridge + socket dir setup, firecracker process primitives (find/resolve/kill/wait/ensureSocketAccess/sendCtrlAltDel), DM snapshot lifecycle, NAT rule enforcement, guest DNS server lifecycle + routing setup, and the vsock-agent readiness probe. That's 7 files whose receivers flipped from Daemon to HostNetwork, plus a new host_network.go that declares the struct, its hostNetworkDeps, and the factored firecracker + DNS helpers that used to live in vm.go. Daemon gives up the tapPool and vmDNS fields entirely; they're now HostNetwork's business. Construction goes through newHostNetwork in Daemon.Open with an explicit dependency bag (runner, logger, config, layout, closing). A lazy-init hostNet() helper on Daemon supports test literals that don't wire net explicitly — production always populates it eagerly. Signature tightenings where the old receiver reached into VM-service state: - ensureNAT(ctx, vm, enable) → ensureNAT(ctx, guestIP, tap, enable). Callers resolve tap from the handle cache themselves. - initializeTapPool(ctx) → initializeTapPool(usedTaps []string). Daemon.Open enumerates VMs, collects taps from handles, hands the slice in. rebuildDNS stays on Daemon as the orchestrator — it filters by vm-alive (a VMService concern handles will move to in phase 4) then calls HostNetwork.replaceDNS with the already-filtered map. Capability hooks continue to take Daemon; they now use it as a facade to reach services (d.net.ensureNAT, d.hostNet().). Planned CapabilityHost interface extraction is orthogonal, left for later. Tests: dns_routing_test.go + fastpath_test.go + nat_test.go + snapshot_test.go + open_close_test.go were touched to construct HostNetwork literals where they exercise its methods directly, or route through d.hostNet() where they exercise the Daemon entry points. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 20:11:46 -03:00
Thales Maciel	eba9a553bf	daemon: use exact-name lookup for VM-create uniqueness reserveVM's duplicate-name guard routed through Daemon.FindVM, which falls back to prefix-matching on both ids and names when no exact match is found. That turns the uniqueness check into a correctness bug: a brand-new VM name can be rejected because it happens to prefix an existing VM's id, or an existing VM's name. So `vm create --name beta` fails when `beta-sandbox` already exists. Swap in a dedicated store.GetVMByName that does a literal `WHERE name = ?` lookup, and use it from reserveVM. FindVM keeps its prefix-matching behaviour for user-facing lookup paths (`vm ssh <partial>`, `vm stop <partial>`) where "did you mean" semantics are the feature. Tests: - TestReserveVMAllowsNameThatPrefixesExistingVM — seeds a VM whose id + name both start with "longname", then reserves two new VMs named "longname" and "longname-sandbox". Both must succeed. Under the old FindVM-based check, both would fail. - TestReserveVMRejectsExactDuplicateName — actual collisions are still rejected after the swap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 14:00:33 -03:00
Thales Maciel	108f7a0600	ssh-config: make the `ssh <name>.vm` shortcut opt-in Before this change, every daemon.Open() wrote a Host .vm stanza into ~/.ssh/config in a marker-fenced block. That's a real footgun for users who manage their SSH config declaratively (chezmoi, dotfiles, NixOS): banger was mutating host state outside its own directory on every daemon start, easy to miss and hard to audit. New contract: the daemon only ever writes its own ssh_config file at ~/.config/banger/ssh_config. ~/.ssh/config is untouched unless the user opts in. `banger vm ssh <name>` still works out of the box — the shortcut only matters for plain `ssh sandbox.vm` from any terminal. The opt-in surface is `banger ssh-config`: banger ssh-config # prints path + include-line + # install/uninstall hints banger ssh-config --install # adds `Include <bangerConfig>` to # ~/.ssh/config inside a marker-fenced # block; idempotent; migrates any # legacy inline Host .vm block from # pre-opt-in builds banger ssh-config --uninstall # removes the new Include block AND # any legacy inline block Doctor gains a gentle warn-level note when banger's ssh_config exists but the user hasn't wired it in — not a fail, since the shortcut is convenience and `banger vm ssh` covers the essential case. Tests cover: daemon writes banger file and does NOT touch ~/.ssh/config, Install adds the block, Install is idempotent, Install migrates the legacy inline block cleanly (removing it, preserving unrelated entries, adding the new Include block), Uninstall removes both marker variants, Uninstall is a no-op when ~/.ssh/config is absent, and UserSSHIncludeInstalled detects both marker shapes. README reframes the feature as optional convenience. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 13:57:26 -03:00
Thales Maciel	99d0811097	daemon: shrink createVMMu + imageOpsMu to reservation/publication windows Before: createVMMu was held across the whole of CreateVM — including image resolution (which could fire a full auto-pull) and startVMLocked (boot of multiple seconds). imageOpsMu was held across the whole of PullImage/RegisterImage/PromoteImage/DeleteImage, so any slow OCI pull, bundle download, or file copy blocked every other image mutation and every other VM create that needed to auto-pull. The async create API bought nothing if all creates serialised on the same mutex. CreateVM is now three phases: 1. Validate + resolve image (possibly auto-pulling). No global lock. 2. reserveVM: take createVMMu only long enough to re-check the name is free, allocate the next guest IP, and UpsertVM the "created" row. Milliseconds. 3. startVMLocked: run the full boot flow under the per-VM lock only. Parallel creates of different VMs now overlap on image resolution + boot; they contend only across the reservation claim. For the image surface a new publishImage helper isolates the commit atom (recheck name free, atomic rename stagingDir→finalDir, UpsertImage) under imageOpsMu. pullFromBundle + pullFromOCI do their network fetch + ext4 build + ownership fixup + agent injection outside the lock; Register moves validation + kernel resolution outside; Promote moves file copy + SSH-key seeding outside; Delete keeps a brief lock over the lookup + reference check + store delete and does file cleanup unlocked. Two concurrency tests assert the new behaviour: - TestPullImageDoesNotSerialiseOnDifferentNames fails the old code (second pull blocks on imageOpsMu and never reaches the body). - TestPullImageRejectsNameClashAtPublish confirms the publish-window recheck is what enforces name uniqueness now that the body runs unlocked — exactly one winner. ARCHITECTURE.md updated to describe the new scope explicitly instead of calling the locks "narrow". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 13:44:22 -03:00
Thales Maciel	afe91e805a	drop unused bench-create script + Makefile target The script carried a python3 dep for one json.dumps on a VM name that's always alphanumeric-plus-dashes anyway, it was never wired into CI or docs, and `time banger vm create` covers the same need ad hoc when anyone wants to measure create latency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 13:33:09 -03:00
Thales Maciel	58464ac28c	docs + doctor: be honest about amd64-only support The README sold the product as "Linux with /dev/kvm"; the deeper docs admit that the Makefile pins companion builds to GOARCH=amd64, the kernel catalog ships only x86_64 entries, and OCI import pulls linux/amd64 layers. arm64 users who show up through the README only discover that after install fails in non-obvious ways. Two surface-level fixes: - README requirements list leads with "x86_64 / amd64 Linux — arm64 is not supported today", with a short note on the three places that assumption lives so users understand it's not a last-mile gap. - `banger doctor` now runs an architecture check that passes on amd64 and FAILS (not warns) on anything else, referencing the three downstream assumptions. Hard-fail rather than warn so a user on an arm64 machine doesn't waste time chasing unrelated preflight items. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 13:03:50 -03:00
Thales Maciel	e69810610a	daemon: correct ARCHITECTURE doc to match actual package shape + lock scope Two promises the doc was making that the code doesn't keep: 1. "Helpers moved out so the package stays focused on orchestration." The package still has ~29 files and ~130 func (d *Daemon) methods wiring VM lifecycle, image management, host networking, background reconciliation, and JSON-RPC dispatch. Calling it "just orchestration" sets readers up for surprise. Rewrite the subpackages preamble to say so, and flag the service split as a post-v0.1.0 project. 2. "vmLocks[id] is held only across short synchronous state validation and DB mutations." That's what workspace.prepare does; regular lifecycle ops (start/stop/delete/set) go through withVMLockByRef and hold the lock across the whole callback body, which for `start` means preflight + bridge + firecracker spawn + post-boot wiring. Rewrite the vmLocks bullet and the lock-ordering section to say that explicitly, so readers don't build "surely my long flow under the lock can't be what the doc means" reasoning on top of a false premise. Doc-only change. Code behaviour is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 13:02:36 -03:00
Thales Maciel	34dd7644d8	store: introduce versioned migrations with ordered runner + atomic apply The old migrate() helper only knew how to re-run a fixed slab of CREATE TABLE IF NOT EXISTS plus per-column ensureColumnExists calls. That worked while every schema change was a benign additive column; it falls apart as soon as we need a data backfill, an index, a rename, or anything that has to happen exactly once in a known order. Replaces it with a schema_migrations table + ordered []migration slice. Each migration has a unique id, a human-readable name, and a func(*Tx) body; the runner opens a transaction per migration so DDL and any data changes either both land and get recorded or both roll back together, leaving the DB in a state where retrying on next Open() reapplies from the same point. Migration 1 ("baseline") collapses the current schema into one entry: fresh databases apply it in one shot; existing dev databases see idempotent `CREATE TABLE IF NOT EXISTS` + `ALTER TABLE … ADD COLUMN` statements that succeed as no-ops, and the only net effect is the schema_migrations row that brings them into the versioned system. Tests cover fresh apply, idempotent re-open, skipping already-applied ids, rollback on body error (the transient table the migration created must not survive), and duplicate-id rejection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 12:59:42 -03:00

1 2 3 4 5 ...

326 commits