Three changes to stopVMLocked, biggest win first:
- Skip waitForExit on the SSH-success path. sync inside the guest
already flushed root.ext4, so cleanupRuntime's SIGKILL is safe
immediately. Saves up to gracefulShutdownWait (10s) per stop.
- Drop the SendCtrlAltDel + 10s wait fallback when SSH is
unreachable. On Debian, ctrl+alt+del routes to reboot.target so
FC never exits on it — the wait was pure latency.
- Shrink the SSH dial timeout 5s → 2s. A reachable guest dials in
single-digit milliseconds; if it doesn't, fail fast and SIGKILL.
Worst-case (broken SSH) goes ~15s → ~2s + cleanup.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
VM stop has been quietly losing data freshly written via
`vm workspace prepare`: stop+start of a workspace-prepared VM would
come back with /root/repo wiped on the work disk.
Root cause is firecracker + Debian's systemd defaults. FC's
SendCtrlAltDel (the only "graceful shutdown" action FC exposes) just
delivers the keystroke; what the guest does with it is its choice.
Debian routes ctrl-alt-del.target -> reboot.target, so the guest
reboots, FC stays alive, the daemon's 10s wait_for_exit window
expires, and the SIGKILL fallback drops anything still in FC's
userspace I/O path. For an idle VM that's invisible. For one that
just took hundreds of small writes through a workspace prepare, it's
data loss.
Fix is to dial the guest over SSH inside StopVM and run
`sync; systemctl --no-block poweroff || /sbin/poweroff -f &` before
the existing SendCtrlAltDel path. The synchronous `sync` is the
load-bearing piece — it blocks until every dirty page hits virtio-blk
and lands in the on-host root.ext4. Whether poweroff completes
before SIGKILL fires is incidental; sync has already run. SSH
unreachable falls back to the old SendCtrlAltDel behaviour so a
broken-network guest can't make stop hang.
Bounded by a 5s SSH-dial timeout so a half-broken guest can't extend
the overall stop window past gracefulShutdownWait.
Also adds two smoke scenarios:
- `workspace + stop/start`: prepare -> stop -> start -> assert
marker survives. This is the regression that caught the bug.
- `vm exec`: end-to-end coverage for d59425a — auto-cd into the
prepared workspace, exit-code propagation, dirty-host warning,
--auto-prepare resync, refusal on stopped VM.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three recovery-path errors were silently dropped:
- vm_lifecycle.go startVMLocked persisted the VMStateError record
with `_ = s.store.UpsertVM(...)`. If the persist failed, the user
saw the original start error, but operators had no way to find
out that the store had also drifted out of sync.
- vm_lifecycle.go deleteVMLocked killed the firecracker process
with `_ = s.net.killVMProcess(...)`. cleanupRuntime tears it
down regardless, so the explicit kill is best-effort, but a
permission-denied / EPERM was still worth logging.
- capabilities.go cleanupPreparedCapabilities collected per-cap
errors with errors.Join. Callers got the aggregated value but
couldn't tell which capability failed when more than one did.
All three now log Warn before the original behaviour continues.
The aggregate return value, control flow, and user-visible error
strings are unchanged — this is purely a "less silence in the
journal" pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Today there's no way to correlate a CLI failure with a daemon log
line. operationLog records relative timing but no id, two concurrent
vm.start calls log indistinguishably, and the async
vmCreateOperationState.ID is user-facing yet never reaches the
journal. The root helper logs plain text to stderr while bangerd
logs JSON, so a merged journalctl is hard to grep across the
trust-boundary split.
Mint a per-RPC op id at dispatch entry, store it on context, and
include it as an "op_id" attr on every operationLog record. The
id is stamped onto every error response (including the early
short-circuit paths bad_version and unknown_method). rpc.Call
forwards the context op id on requests so a daemon RPC and the
helper RPCs it triggers all share one id. The helper now logs
JSON to match bangerd, adopts the inbound id, and emits a single
"helper rpc completed" / "helper rpc failed" line per call so
operators can see at a glance how long each privileged op took.
vmCreateOperationState.ID is now the same id dispatch generated
for vm.create.begin — one identifier between client status polls,
daemon logs, and helper logs.
The wire format gains two optional fields: rpc.Request.OpID and
rpc.ErrorResponse.OpID, both omitempty so older peers (and the
opposite direction) ignore them. ErrorResponse.Error() now appends
"(op-XXXXXX)" to its string form when set; existing callers that
just print err.Error() get the id for free.
Tests cover: dispatch stamps op_id on unknown_method, bad_version,
and handler-returned errors; rpc.Call exposes the typed
*ErrorResponse via errors.As so the CLI can read code/op_id; ctx
op_id is forwarded to the server in the request envelope.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move the supported systemd path to two services: an owner-user bangerd for
orchestration and a narrow root helper for bridge/tap, NAT/resolver, dm/loop,
and Firecracker ownership. This removes repeated sudo from daily vm and image
flows without leaving the general daemon running as root.
Add install metadata, system install/status/restart/uninstall commands, and a
system-owned runtime layout. Keep user SSH/config material in the owner home,
lock file_sync to the owner home, and move daemon known_hosts handling out of
the old root-owned control path.
Route privileged lifecycle steps through typed privilegedOps calls, harden the
two systemd units, and rewrite smoke plus docs around the supported service
model.
Verified with make build, make test, make lint, and make smoke on the
supported systemd host path.
Preserve cleanup after daemon restarts and harden OCI and tar imports
against filenames that debugfs cannot encode safely.
Mirror tap, loop, and dm teardown identity onto VM.Runtime, teach
cleanup and reconcile to fall back to those persisted fields when
handles.json is missing or corrupt, and clear the recovery state on
stop, error, and delete paths.
Reject debugfs-hostile entry names during flattening and in
ApplyOwnership itself, then add regression coverage for corrupt
handles.json recovery and unsafe import paths.
Verified with targeted go tests, make lint-go, make lint-shell, and
make build.
startVMLocked was a ~260-line method running 18 sequential phases
with one lumped error path: on any failure, cleanupOnErr called
cleanupRuntime — a catch-all teardown that didn't distinguish
"this phase acquired resources we should undo" from "this phase is
idempotent." The blast radius was the entire VM lifecycle. Every
tweak to boot, NAT, disk, or auth-sync orchestration had to reason
about a closure that could fire at any of 18 points.
This commit extracts the phases into a data-driven pipeline:
- startContext threads the mutable state (vm, live, apiSock,
dmName, tapName, etc.) through every step by pointer so step
bodies mutate in place without returning copies.
- startStep carries the op.stage name, optional vmCreateStage
progress ping, optional log attrs, a run closure, and an
optional undo closure.
- runStartSteps walks steps in order, appends the failing step
to the rollback set (so partial-acquire failures like
machine.Start's post-spawn HTTP config get their undo fired),
then iterates the rollback set in reverse and joins errors
via errors.Join.
Each phase that acquires a resource now owns its own undo:
system_overlay removes a file it created, dm_snapshot cleans up
the loop + DM handles it set, prepare_host_features delegates to
capHooks.cleanupState, tap releases via releaseTap, metrics_file
removes the file, firecracker_launch kills the spawned PID and
drops the sockets, post_start_features calls capHooks.cleanupState
again (capability Cleanup hooks are idempotent — safe to call
whether PostStart reached every cap or not). The 11 phases with
no teardown obligation leave `undo` nil and the driver silently
skips them on rollback.
cleanupRuntime is retired from the start-failure path. It stays
intact for reconcile, stopVMLocked, killVMLocked, deleteVMLocked,
stopStaleVMs — the crash-recovery / lifecycle-teardown contract
those paths rely on is unchanged.
startVMLocked shrinks from ~225 lines of sequential-phase code
plus a cleanupOnErr closure to ~45 lines: compute derived paths,
build the step list, drive it, persist ERROR state on failure.
Stage names preserved 1:1 so existing log grep + the async-create
progress stream stay compatible.
Tests:
- TestRunStartSteps_RollsBackInReverseOnFailure — the contract
is pinned: succeeded-before-failing run, all their undos in
reverse, failing step's undo also fires, original err still
visible via errors.Is.
- TestRunStartSteps_SkipsNilUndos — optional-undo contract.
- TestRunStartSteps_JoinsRollbackErrors — undo failures don't
hide the root cause.
- TestRunStartSteps_HappyPathNoRollback — success path never
fires any undo.
Smoke: all 21 scenarios pass, including the start-path ones
(bare vm run, workspace vm run, vm restart, vm lifecycle, vm set
reconfig) that exercise real firecracker boots end-to-end.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cleanup identity for kernel objects was split across two sources of
truth: vm.Runtime (DB-backed, durable) held paths and the guest IP,
but the TAP name lived only in the in-process handle cache + the
best-effort handles.json scratch file next to the VM dir. Every
other cleanup-identifying datum has a fallback — firecracker PID
can be rediscovered via `pgrep -f <apiSock>`, loops via losetup, dm
name from the deterministic ShortID(vm.ID). The tap is the one
truly cache-only datum (allocated from a pool, not derivable).
That made NAT teardown fragile:
- daemon crash between `acquireTap` and the handles.json write
- handles.json corrupt on the next daemon start
- partial cleanup that already zeroed the cache
In any of those cases natCapability.Cleanup short-circuited
("skipping nat cleanup without runtime network handles") and the
per-VM POSTROUTING MASQUERADE + the two FORWARD rules keyed off
the tap would leak. The VM row in the DB still existed, so a retry
couldn't close the loop — the tap name was simply gone.
Fix: mirror TapDevice onto model.VMRuntime (serialised via the
existing runtime_json column, omitempty so existing rows upgrade
cleanly). Set it in startVMLocked right next to the
s.setVMHandles call that seeds the in-memory cache; clear it at
every post-cleanup reset site (stop normal path + stop stale
branch, kill normal path + kill stale branch, cleanupOnErr in
start, reconcile's stale-vm branch, the stats poller's auto-stop
path).
Fallbacks now cascade:
- natCapability.Cleanup: handles cache → Runtime.TapDevice
- cleanupRuntime (releaseTap): handles cache → Runtime.TapDevice
Both surfaces refuse gracefully (old behaviour) only when neither
source has a value, which really does mean "no tap was ever
allocated for this VM" rather than "we lost track of it."
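The cascade reduces to a resolver like this (types are reduced stand-ins for the daemon's real ones): prefer the in-memory cache, fall back to the durable field, and only report "never allocated" when both are empty.

```go
package main

import "fmt"

// vmRuntime stands in for model.VMRuntime; only the mirrored field matters here.
type vmRuntime struct{ TapDevice string }

// resolveTap: handles cache first, durable Runtime.TapDevice second.
// ok=false now genuinely means "no tap was ever allocated for this VM".
func resolveTap(cached string, rt vmRuntime) (string, bool) {
	if cached != "" {
		return cached, true
	}
	if rt.TapDevice != "" {
		return rt.TapDevice, true
	}
	return "", false
}

func main() {
	// Cache lost (crash / corrupt handles.json), durable mirror survives.
	tap, ok := resolveTap("", vmRuntime{TapDevice: "tap3"})
	fmt.Println(tap, ok)
}
```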
Test: TestNATCapabilityCleanup_FallsBackToRuntimeTapDevice clears
the handle cache, sets vm.Runtime.TapDevice, and asserts Cleanup
reaches the runner — the exact scenario the review flagged as a
plausible leak and the exact code path that now guarantees it
doesn't.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two defects compounded to make `vm create X` → `vm stop X` → `vm start X`
→ `vm ssh X` fail with `not_running: vm X is not running` even though
`vm show` reports `state=running`.
1. firecracker-go-sdk's startVMM spawns a goroutine that SIGTERMs
firecracker when the ctx passed to Machine.Start cancels — and
retains that ctx for the lifetime of the VMM, not just the boot
phase. Our Machine.Start wrapper was plumbing the caller's ctx
through, which on `vm.start` is the RPC request ctx. daemon.go's
handleConn cancels reqCtx via `defer cancel()` right after
writing the response. Net effect: firecracker is killed ~150ms
after the `vm start` RPC "completes", invisibly, and the next
`vm ssh` sees a dead PID. `vm.create` side-stepped the bug
because BeginVMCreate detaches to context.Background() before
calling startVMLocked; `vm.start` used the RPC ctx directly.
Fix: Machine.Start now passes context.Background() to the SDK.
We own firecracker lifecycle explicitly (StopVM / KillVM /
cleanupRuntime), so ctx-driven cancellation here was never
actually wired into anything useful.
2. With (1) fixed, the same scenario exposed a second defect:
patchRootOverlay's e2cp/e2rm refuses to touch the dm-snapshot
with "Inode bitmap checksum does not match bitmap" on a restart,
because the COW holds stale free-block/free-inode counters from
the previous guest boot. Kernel ext4 is fine with this; e2fsprogs
is not. Fix: run `e2fsck -fy` on the snapshot between the
dm_snapshot and patch_root_overlay stages. Idempotent on a fresh
snapshot, reconciles the bitmaps on a reused COW.
Regression coverage:
- scripts/repro-restart-bug.sh — minimal create→stop→start→ssh
reproducer with rich on-failure diagnostics (daemon log trace,
firecracker.log tail, handles.json, pgrep-by-apiSock, apiSock
stat). Exits non-zero if the bug returns.
- scripts/smoke.sh — lifecycle scenario (create/ssh/stop/start/
ssh/delete) and vm-set scenario (--vcpu 2 → stop → set --vcpu 4
→ start → assert nproc=4). Both were pulled when the bug was
first found; now restored.
Supporting:
- internal/system/system.ExitCode — extracts exec.ExitError's
code without forcing callers to import os/exec. Needed by the
e2fsck caller (policy test pins os/exec to the shell-out
packages).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 4 of the daemon god-struct refactor. VM lifecycle, create-op
registry, handle cache, disk provisioning, stats polling, ports
query, and the per-VM lock set all move off *Daemon onto *VMService.
Daemon keeps thin forwarders only for FindVM / TouchVM (dispatch
surface) and is otherwise out of VM lifecycle. Lazy-init via
d.vmSvc() mirrors the earlier services so test literals like
`&Daemon{store: db, runner: r}` still get a functional service
without spelling one out.
Three small cleanups along the way:
* preflight helpers (validateStartPrereqs / addBaseStartPrereqs
/ addBaseStartCommandPrereqs / validateWorkDiskResizePrereqs)
move with the VM methods that call them.
* cleanupRuntime / rebuildDNS move to *VMService, with
HostNetwork primitives (findFirecrackerPID, cleanupDMSnapshot,
killVMProcess, releaseTap, waitForExit, sendCtrlAltDel)
reached through s.net instead of the hostNet() facade.
* vsockAgentBinary becomes a package-level function so both
*Daemon (doctor) and *VMService (preflight) call one entry
point instead of each owning a forwarder method.
WorkspaceService's peer deps switch from eager method values to
closures — vmSvc() constructs VMService with WorkspaceService as a
peer, so resolving d.vmSvc().FindVM at construction time recursed
through workspaceSvc() → vmSvc(). Closures defer the lookup to call
time.
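The recursion and its fix, reduced (types and the "vm:" tag are illustrative stand-ins): capturing d.vmSvc().FindVM as an eager method value forces vmSvc() to run while workspaceSvc() is still constructing, which recurses; a closure defers the lookup to call time, when both lazy inits have settled.

```go
package main

import "fmt"

type daemon struct {
	vm *vmService
	ws *workspaceService
}

type vmService struct{ findVM func(string) string }
type workspaceService struct{ findVM func(string) string }

func (d *daemon) vmSvc() *vmService {
	if d.vm == nil {
		d.vm = &vmService{findVM: func(name string) string { return "vm:" + name }}
		_ = d.workspaceSvc() // vmSvc constructs its peer, like the real code
	}
	return d.vm
}

func (d *daemon) workspaceSvc() *workspaceService {
	if d.ws == nil {
		d.ws = &workspaceService{
			// Closure defers the d.vmSvc().findVM lookup to call time.
			// The eager form `findVM: d.vmSvc().findVM` would recurse:
			// workspaceSvc -> vmSvc -> workspaceSvc, with d.ws still nil.
			findVM: func(name string) string { return d.vmSvc().findVM(name) },
		}
	}
	return d.ws
}

func main() {
	d := &daemon{}
	fmt.Println(d.workspaceSvc().findVM("demo"))
}
```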
Pure code motion: build + unit tests green, lint clean. No RPC
surface or lock-ordering changes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First phase of splitting the daemon god-struct into focused services
with explicit ownership.
HostNetwork now owns everything host-networking: the TAP interface
pool (initializeTapPool / ensureTapPool / acquireTap / releaseTap /
createTap), bridge + socket dir setup, firecracker process primitives
(find/resolve/kill/wait/ensureSocketAccess/sendCtrlAltDel), DM
snapshot lifecycle, NAT rule enforcement, guest DNS server lifecycle
+ routing setup, and the vsock-agent readiness probe. That's 7 files
whose receivers flipped from *Daemon to *HostNetwork, plus a new
host_network.go that declares the struct, its hostNetworkDeps, and
the factored firecracker + DNS helpers that used to live in vm.go.
Daemon gives up the tapPool and vmDNS fields entirely; they're now
HostNetwork's business. Construction goes through newHostNetwork in
Daemon.Open with an explicit dependency bag (runner, logger, config,
layout, closing). A lazy-init hostNet() helper on Daemon supports
test literals that don't wire net explicitly — production always
populates it eagerly.
Signature tightenings where the old receiver reached into VM-service
state:
- ensureNAT(ctx, vm, enable) → ensureNAT(ctx, guestIP, tap, enable).
Callers resolve tap from the handle cache themselves.
- initializeTapPool(ctx) → initializeTapPool(usedTaps []string).
Daemon.Open enumerates VMs, collects taps from handles, hands the
slice in.
rebuildDNS stays on *Daemon as the orchestrator — it filters by
vm-alive (a VMService concern that handles will move to in phase 4)
calls HostNetwork.replaceDNS with the already-filtered map.
Capability hooks continue to take *Daemon; they now use it as a
facade to reach services (d.net.ensureNAT, d.hostNet().*). Planned
CapabilityHost interface extraction is orthogonal, left for later.
Tests: dns_routing_test.go + fastpath_test.go + nat_test.go +
snapshot_test.go + open_close_test.go were touched to construct
HostNetwork literals where they exercise its methods directly, or
route through d.hostNet() where they exercise the Daemon entry
points.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Guest host-key verification was off in all three SSH paths:
* Go SSH (internal/guest/ssh.go) used ssh.InsecureIgnoreHostKey
* `banger vm ssh` passed StrictHostKeyChecking=no
+ UserKnownHostsFile=/dev/null
* `~/.ssh/config` Host *.vm shipped the same posture into the
user's global config
Now each path verifies against a banger-owned known_hosts file at
`~/.local/state/banger/ssh/known_hosts` with TOFU semantics:
* First dial to a VM pins the key.
* Subsequent dials require an exact match. A mismatch fails with
an explicit "possible MITM" error.
* `vm delete` removes the entries so a future VM reusing the IP
or name re-pins cleanly.
* The user's `~/.ssh/known_hosts` is untouched.
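The TOFU semantics reduce to a map (the real implementation parses OpenSSH known_hosts lines and holds a process-wide mutex around the file; this sketch keeps only the pin/match/remove logic):

```go
package main

import (
	"errors"
	"fmt"
)

// tofuStore maps host -> pinned key fingerprint.
type tofuStore map[string]string

// verify pins on first sight, demands an exact match afterwards.
func (s tofuStore) verify(host, key string) error {
	pinned, ok := s[host]
	if !ok {
		s[host] = key // trust on first use
		return nil
	}
	if pinned != key {
		return errors.New("host key mismatch: possible MITM")
	}
	return nil
}

func main() {
	s := tofuStore{}
	fmt.Println(s.verify("testbox.vm", "SHA256:aaaa")) // first dial: pins
	fmt.Println(s.verify("testbox.vm", "SHA256:aaaa")) // match: passes
	fmt.Println(s.verify("testbox.vm", "SHA256:bbbb")) // mismatch: rejects
	delete(s, "testbox.vm")                            // vm delete clears the pin
	fmt.Println(s.verify("testbox.vm", "SHA256:bbbb")) // re-pins cleanly
}
```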
Changes:
internal/guest/known_hosts.go (new) — OpenSSH-compatible parser,
TOFUHostKeyCallback, RemoveKnownHosts. Process-wide mutex
around the file.
internal/guest/ssh.go — Dial and WaitForSSH grew a knownHostsPath
parameter threaded through the callback. Empty path keeps the
insecure callback (tests + throwaway tools only; documented).
internal/daemon/{guest_sessions,session_attach,session_lifecycle,
session_stream}.go — call sites pass d.layout.KnownHostsPath.
internal/daemon/ssh_client_config.go — the ~/.ssh/config Host *.vm
block now points at banger's known_hosts and uses
StrictHostKeyChecking=accept-new. Missing path → fail closed.
internal/daemon/vm_lifecycle.go — deleteVMLocked drops known_hosts
entries for the VM's IP and DNS name via removeVMKnownHosts.
internal/cli/banger.go — sshCommandArgs swaps StrictHostKeyChecking
no + /dev/null for banger's file + accept-new. Path resolution
failure falls through to StrictHostKeyChecking=yes.
internal/paths/paths.go — Layout gains SSHDir + KnownHostsPath;
Ensure creates SSHDir at 0700.
Tests (internal/guest/known_hosts_test.go): pin on first use, accept
matching key on second dial, reject mismatch, empty path skips
checking, RemoveKnownHosts drops the entry, re-pin works after
remove. Existing daemon + cli tests updated to assert the new
posture and regression-guard against the old flags.
Live verified: vm run writes the pin to banger's known_hosts at 0600
inside a 0700 dir; banger vm ssh + ssh root@<vm>.vm both succeed
using the pin; vm delete clears it.
Separates what a VM IS (durable intent + identity + deterministic
derived paths — `VMRuntime`) from what is CURRENTLY TRUE about it
(firecracker PID, tap device, loop devices, dm-snapshot target — new
`VMHandles`). The durable state lives in the SQLite `vms` row; the
transient state lives in an in-memory cache on the daemon plus a
per-VM `handles.json` scratch file inside VMDir, rebuilt at startup
from OS inspection. Nothing kernel-level rides the SQLite schema
anymore.
Why:
Persisting ephemeral process handles to SQLite forced reconcile to
treat "running with a stale PID" as a first-class case and mix it
with real state transitions. The schema described what we last
observed, not what the VM is. Every time the observation model
shifted (tap pool, DM naming, pgrep fallback) the reconcile logic
grew a new branch. Splitting lets each layer own what it's good at:
durable records describe intent, in-memory cache + scratch file
describe momentary reality.
Shape:
- `model.VMHandles` = PID, TapDevice, BaseLoop, COWLoop, DMName,
DMDev. Never in SQLite.
- `VMRuntime` keeps: State, GuestIP, APISockPath, VSockPath,
VSockCID, LogPath, MetricsPath, DNSName, VMDir, SystemOverlay,
WorkDiskPath, LastError. All durable or deterministic.
- `handleCache` on `*Daemon` — mutex-guarded map + scratch-file
plumbing (`writeHandlesFile` / `readHandlesFile` /
`rediscoverHandles`). See `internal/daemon/vm_handles.go`.
- `d.vmAlive(vm)` replaces the 20+ inline
  `vm.State==Running && ProcessRunning(vm.Runtime.PID, apiSock)`
  checks scattered across the daemon. Single source of truth for
  liveness.
- Startup reconcile: per running VM, load the scratch file, pgrep
the api sock, either keep (cache seeded from scratch) or demote
to stopped (scratch handles passed to cleanupRuntime first so DM
/ loops / tap actually get torn down).
Verification:
- `go test ./...` green.
- Live: `banger vm run --name handles-test -- cat /etc/hostname`
starts; `handles.json` appears in VMDir with the expected PID,
tap, loops, DM.
- `kill -9 $(pgrep bangerd)` while the VM is running, re-invoke the
CLI, daemon auto-starts, reconcile recognises the VM as alive,
`banger vm ssh` still connects, `banger vm delete` cleans up.
Tests added:
- vm_handles_test.go: scratch-file roundtrip, missing/corrupt file
behaviour, cache concurrency, rediscoverHandles prefers pgrep
over scratch, returns scratch contents even when process is
dead (so cleanup can tear down kernel state).
- vm_test.go: reconcile test rewritten to exercise the new flow
(write scratch → reconcile reads it → verifies process is gone →
issues dmsetup/losetup teardown).
ARCHITECTURE.md updated; `handles` added to Daemon field docs.
Both masks were added when the direct-boot path first landed for
container rootfses that didn't have anything mounted on /dev/vdb. The
golden image (and any pulled OCI image running under banger's
patchRootOverlay) has an /etc/fstab entry mounting /dev/vdb at /root —
masking dev-vdb.device makes systemd wait forever for a unit that can
never become active, and the work-disk mount never completes. dev-ttyS0
is a real serial console the image needs too. Drop both.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the full arc: banger kernel pull + image pull + vm create + vm ssh
now works end-to-end against docker.io/library/debian:bookworm with zero
manual image building.
Generic kernel:
- New scripts/make-generic-kernel.sh builds vmlinux from upstream
kernel.org sources using Firecracker's official minimal config
(configs/firecracker-x86_64-6.1.config). All critical drivers
(virtio_blk, virtio_net, ext4, vsock) compiled in — no modules,
no initramfs needed.
- Published as generic-6.12 in the catalog (kernels.thaloco.com).
- catalog.json updated with the new entry.
Direct-boot init= override (vm_lifecycle.go):
- For images without an initrd (direct-boot / OCI-pulled), banger now
passes init=/usr/local/libexec/banger-first-boot on the kernel
cmdline. The script runs as PID 1, mounts /proc /sys /dev /run,
checks for systemd — if present execs it immediately; if not
(container images), installs systemd-sysv + openssh-server via the
guest's package manager, then execs systemd.
- Also passes kernel-level ip= parameter via BuildBootArgsWithKernelIP
so the kernel configures the network interface before init runs
(container images don't ship iproute2, so the userspace bootstrap
script can't call ip(8)).
- Masks dev-ttyS0.device and dev-vdb.device systemd units that
otherwise wait 90s for udev events that never fire in Firecracker
guests started from container rootfses.
first-boot.sh rewritten as universal init wrapper:
- Works as PID 1 (mounts essential filesystems) OR as a systemd
oneshot (existing behavior).
- Installs both systemd-sysv AND openssh-server (container images
have neither).
- Dispatch updated: debian, alpine, fedora, arch, opensuse families
+ ID_LIKE fallback. All tests updated.
Opencode capability skip for direct-boot images:
- The opencode readiness check (WaitReady on vsock port 4096) now
returns nil for images without an initrd, since pulled container
images don't ship the opencode service. Without this, the VM
would be marked as error for lacking an opinionated add-on.
Docs: README and kernel-catalog.md updated to recommend generic-6.12
as the default kernel for OCI-pulled images. AGENTS.md notes the new
build script.
Verified live:
- banger kernel pull generic-6.12
- banger image pull docker.io/library/debian:bookworm --kernel-ref generic-6.12
- banger vm create --image debian-bookworm --name testbox --nat
- banger vm ssh testbox -- "id; uname -r; systemctl is-active banger-vsock-agent"
→ uid=0(root), kernel 6.12.8, Debian bookworm, vsock-agent active,
sshd running, SSH working.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
vm.go (1529 LOC) splits into vm_create, vm_lifecycle, vm_set, vm_stats,
vm_disk, vm_authsync; firecracker/DNS/helpers stay in vm.go.
guest_sessions.go (1266 LOC) splits into session_controller,
session_lifecycle, session_attach, session_stream; scripts and helpers
stay in guest_sessions.go.
Mechanical move only. No behavior change. Adds doc.go and
ARCHITECTURE.md capturing subsystem map and current lock ordering as
the baseline for the upcoming subsystem extraction.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>