banger

Author	SHA1	Message	Date
Thales Maciel	e47b8146dc	daemon: thread per-RPC op_id end-to-end Today there's no way to correlate a CLI failure with a daemon log line. operationLog records relative timing but no id, two concurrent vm.start calls log indistinguishably, and the async vmCreateOperationState.ID is user-facing yet never reaches the journal. The root helper logs plain text to stderr while bangerd logs JSON, so a merged journalctl is hard to grep across the trust-boundary split. Mint a per-RPC op id at dispatch entry, store it on context, and include it as an "op_id" attr on every operationLog record. The id is stamped onto every error response (including the early short-circuit paths bad_version and unknown_method). rpc.Call forwards the context op id on requests so a daemon RPC and the helper RPCs it triggers all share one id. The helper now logs JSON to match bangerd, adopts the inbound id, and emits a single "helper rpc completed" / "helper rpc failed" line per call so operators can see at a glance how long each privileged op took. vmCreateOperationState.ID is now the same id dispatch generated for vm.create.begin — one identifier between client status polls, daemon logs, and helper logs. The wire format gains two optional fields: rpc.Request.OpID and rpc.ErrorResponse.OpID, both omitempty so older peers (and the opposite direction) ignore them. ErrorResponse.Error() now appends "(op-XXXXXX)" to its string form when set; existing callers that just print err.Error() get the id for free. Tests cover: dispatch stamps op_id on unknown_method, bad_version, and handler-returned errors; rpc.Call exposes the typed *ErrorResponse via errors.As so the CLI can read code/op_id; ctx op_id is forwarded to the server in the request envelope. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-26 22:13:44 -03:00
Thales Maciel	687fcf0b59	vm state: split transient kernel/process handles off the durable schema Separates what a VM IS (durable intent + identity + deterministic derived paths — `VMRuntime`) from what is CURRENTLY TRUE about it (firecracker PID, tap device, loop devices, dm-snapshot target — new `VMHandles`). The durable state lives in the SQLite `vms` row; the transient state lives in an in-memory cache on the daemon plus a per-VM `handles.json` scratch file inside VMDir, rebuilt at startup from OS inspection. Nothing kernel-level rides the SQLite schema anymore. Why: Persisting ephemeral process handles to SQLite forced reconcile to treat "running with a stale PID" as a first-class case and mix it with real state transitions. The schema described what we last observed, not what the VM is. Every time the observation model shifted (tap pool, DM naming, pgrep fallback) the reconcile logic grew a new branch. Splitting lets each layer own what it's good at: durable records describe intent, in-memory cache + scratch file describe momentary reality. Shape: - `model.VMHandles` = PID, TapDevice, BaseLoop, COWLoop, DMName, DMDev. Never in SQLite. - `VMRuntime` keeps: State, GuestIP, APISockPath, VSockPath, VSockCID, LogPath, MetricsPath, DNSName, VMDir, SystemOverlay, WorkDiskPath, LastError. All durable or deterministic. - `handleCache` on `*Daemon` — mutex-guarded map + scratch-file plumbing (`writeHandlesFile` / `readHandlesFile` / `rediscoverHandles`). See `internal/daemon/vm_handles.go`. - `d.vmAlive(vm)` replaces the 20+ inline `vm.State==Running && ProcessRunning(vm.Runtime.PID, apiSock)` spreads. Single source of truth for liveness. - Startup reconcile: per running VM, load the scratch file, pgrep the api sock, either keep (cache seeded from scratch) or demote to stopped (scratch handles passed to cleanupRuntime first so DM / loops / tap actually get torn down). Verification: - `go test ./...` green. - Live: `banger vm run --name handles-test -- cat /etc/hostname` starts; `handles.json` appears in VMDir with the expected PID, tap, loops, DM. - `kill -9 $(pgrep bangerd)` while the VM is running, re-invoke the CLI, daemon auto-starts, reconcile recognises the VM as alive, `banger vm ssh` still connects, `banger vm delete` cleans up. Tests added: - vm_handles_test.go: scratch-file roundtrip, missing/corrupt file behaviour, cache concurrency, rediscoverHandles prefers pgrep over scratch, returns scratch contents even when process is dead (so cleanup can tear down kernel state). - vm_test.go: reconcile test rewritten to exercise the new flow (write scratch → reconcile reads it → verifies process is gone → issues dmsetup/losetup teardown). ARCHITECTURE.md updated; `handles` added to Daemon field docs.	2026-04-19 14:18:13 -03:00
Thales Maciel	c8d9a122f9	Speed up VM create with work seeds Beat VM create wall time without changing VM semantics. Generate a work-seed ext4 sidecar during image builds and rootfs rebuilds, then clone and resize that seed for each new VM instead of rebuilding /root from scratch. Plumb the new seed artifact through config, runtime metadata, store state, runtime-bundle defaults, doctor checks, and default-image reconciliation so older images still fall back cleanly. Add a daemon TAP pool to keep idle bridge-attached devices warm, expose stage timing in lifecycle logs, add a create/SSH benchmark script plus Make target, and teach verify.sh that tap-pool-* devices are reusable capacity rather than cleanup leaks. Validated with go test ./..., make build, ./verify.sh, and make bench-create ARGS="--runs 2".	2026-03-18 21:22:12 -03:00
Thales Maciel	644e60d739	Add structured daemon lifecycle logs VM start, image build, and network/setup failures were hard to diagnose because bangerd emitted almost no lifecycle logs and the Firecracker SDK logger was discarded. This adds a daemon-wide JSON logger with configurable log level so failures leave breadcrumbs instead of only side effects. Log the main daemon and VM lifecycle stages, preserve raw Firecracker and image-build helper output in dedicated files, and include those log paths in daemon status and returned errors. Bridge SDK logrus output into the daemon logger at debug level so low-level Firecracker diagnostics are available without making normal info logs unreadable. Validation: go test ./... and make build. Left unrelated worktree changes out of this commit, including internal/api/types.go, the deleted shell scripts, and my-rootfs.ext4.	2026-03-16 16:16:28 -03:00

4 commits