Commit graph

12 commits

Author SHA1 Message Date
b4afe13b2a
daemon: fix vm start (on a stopped VM) + regression coverage
Two defects compounded to make `vm create X` → `vm stop X` → `vm start X`
→ `vm ssh X` fail with `not_running: vm X is not running` even though
`vm show` reports `state=running`.

1. firecracker-go-sdk's startVMM spawns a goroutine that SIGTERMs
   firecracker when the ctx passed to Machine.Start cancels — and
   retains that ctx for the lifetime of the VMM, not just the boot
   phase. Our Machine.Start wrapper was plumbing the caller's ctx
   through, which on `vm.start` is the RPC request ctx. daemon.go's
   handleConn cancels reqCtx via `defer cancel()` right after
   writing the response. Net effect: firecracker is killed ~150ms
   after the `vm start` RPC "completes", invisibly, and the next
   `vm ssh` sees a dead PID. `vm.create` side-stepped the bug
   because BeginVMCreate detaches to context.Background() before
   calling startVMLocked; `vm.start` used the RPC ctx directly.
   Fix: Machine.Start now passes context.Background() to the SDK.
   We own firecracker lifecycle explicitly (StopVM / KillVM /
   cleanupRuntime), so ctx-driven cancellation here was never
   actually wired into anything useful.

2. With (1) fixed, the same scenario exposed a second defect:
   patchRootOverlay's e2cp/e2rm refuses to touch the dm-snapshot
   with "Inode bitmap checksum does not match bitmap" on a restart,
   because the COW holds stale free-block/free-inode counters from
   the previous guest boot. Kernel ext4 is fine with this; e2fsprogs
   is not. Fix: run `e2fsck -fy` on the snapshot between the
   dm_snapshot and patch_root_overlay stages. Idempotent on a fresh
   snapshot, reconciles the bitmaps on a reused COW.

Regression coverage:
  - scripts/repro-restart-bug.sh — minimal create→stop→start→ssh
    reproducer with rich on-failure diagnostics (daemon log trace,
    firecracker.log tail, handles.json, pgrep-by-apiSock, apiSock
    stat). Exits non-zero if the bug returns.
  - scripts/smoke.sh — lifecycle scenario (create/ssh/stop/start/
    ssh/delete) and vm-set scenario (--vcpu 2 → stop → set --vcpu 4
    → start → assert nproc=4). Both were pulled when the bug was
    first found; now restored.

Supporting:
  - internal/system/system.ExitCode — extracts exec.ExitError's
    code without forcing callers to import os/exec. Needed by the
    e2fsck caller (policy test pins os/exec to the shell-out
    packages).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 12:01:46 -03:00
fba30f26d4
firecracker: chown API + vsock sockets inside the sudo shell
Bug: Firecracker creates its API and vsock sockets as root:root 0700
(enforced by the intentional umask 077 in buildProcessRunner). The
daemon, running as the invoking user, then can't connect(2) to
either — AF_UNIX connect needs write permission on the socket file
and 0700 root-owned leaves thales without any.

firecracker-go-sdk's Machine.Start() blocks on waitForSocket, which
probes the socket with both os.Stat (succeeds — parent dir is the
user's XDG_RUNTIME_DIR) and an HTTP GET over the socket (fails —
EACCES on connect). The SDK loops for 3 seconds then fails with
"Firecracker did not create API socket ... context deadline exceeded".

The daemon's EnsureSocketAccess chown was meant to fix permissions,
but it runs *after* Machine.Start returns — and Start never returns
because it's still looping on the SDK's probe. Chicken-and-egg.

Fix: inside the sudo'd shell that launches firecracker, spawn a
background subshell that polls for each expected socket (API + vsock,
when configured) and chowns it to $SUDO_UID:$SUDO_GID as soon as it
appears. The background polling is bounded at 1s (20 × 50ms) so a
broken firecracker invocation doesn't leak a waiting shell.

Post-fix: socket appears root-owned 0600 briefly, is chowned to the
invoking user within ~50ms, SDK's HTTP probe succeeds, Machine.Start
returns normally. EnsureSocketAccess's later chmod 600 remains the
belt-and-braces guarantee on final mode.

Verified: manual repro of the shell script produces a socket owned
by thales:thales that a non-root python socket.connect() accepts.
Without the fix the same setup gives "PermissionError: [Errno 13]
Permission denied".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 16:09:02 -03:00
b930c51990
runtime sockets: close the local-user race window around control-plane creation
Previously the daemon socket, per-VM firecracker API socket, and vsock
socket were transiently world-exposed on hosts without XDG_RUNTIME_DIR:
the runtime directory landed in /tmp at 0755, Firecracker ran with
umask 000 (mode 0666 sockets), and only a follow-up chown/chmod in
EnsureSocketAccess tightened them. A local attacker could race into
bangerd.sock or the firecracker API socket during that window.

Three changes:

- internal/paths/paths.go: RuntimeDir is now created (and re-chmod'd if
  stale) at 0700 unconditionally. When XDG_RUNTIME_DIR is unset and we
  fall back to /tmp/banger-runtime-<uid>, Ensure() now verifies the
  parent dir is owned by the current uid and 0700 mode — refusing to
  place sockets inside a directory someone else created. Symlink swaps
  rejected via Lstat.

- internal/firecracker/client.go: launch firecracker with umask 077
  instead of umask 000 so the API socket is mode 0600 from birth. The
  chown in fcproc.EnsureSocketAccess still transfers ownership from
  root to the invoking user afterwards.

- internal/daemon/fcproc/fcproc.go: EnsureSocketDir now creates (and
  re-chmod's) the runtime socket directory at 0700.

Tests cover the tightening path — an existing 0755 RuntimeDir is
re-chmod'd on Ensure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 12:53:47 -03:00
3ed78fdcfc
Add experimental Void guest workflow and vsock agent
Make iterating on a Firecracker-friendly Void guest practical without replacing the Debian default image path.

Add local Void rootfs build/register/verify plumbing, a language-agnostic dev package baseline, and guest SSH/work-disk hardening so new images use the runtime bundle key, keep a normal root bash environment, and repair stale nested /root layouts on restart.

Replace the guest PING/PONG responder with an HTTP /healthz agent over vsock, rename the runtime bundle and config surface from ping helper to agent while still accepting the legacy keys, and route the post-SSH reminder through the new vm.health path.

Validated with GOCACHE=/tmp/banger-gocache go test ./..., make build, bash -n customize.sh make-rootfs-void.sh, and git diff --check.
2026-03-19 14:51:25 -03:00
08ef706e3f
Add vsock-backed SSH session reminders
Remind users when a VM is still running after 	hanger vm ssh exits instead of silently dropping them back to the host shell.\n\nAttach a Firecracker vsock device to each VM, persist the host vsock path/CID,\nadd a new guest-side banger-vsock-pingd responder to the runtime bundle and both\nimage-build paths, and expose a vm.ping RPC that the CLI and TUI call after SSH\nreturns. Doctor and start/build preflight now validate the helper plus\n/dev/vhost-vsock so the feature fails early and clearly.\n\nValidated with go mod tidy, bash -n customize.sh, git diff --check, make build,\nand GOCACHE=/tmp/banger-gocache go test ./... outside the sandbox because the\ndaemon tests need real Unix/UDP sockets. Rebuild the image/rootfs used for new\nVMs so the guest ping service is present.
2026-03-18 20:14:51 -03:00
4930d82cb9
Refactor VM lifecycle around capabilities
Make host-integrated VM features fit a standard Go extension path instead of adding more one-off branches through vm.go. This is the enabling refactor for future work like shared mounts, not the /work feature itself.

Add a daemon capability pipeline plus a structured guest-config builder, then move the existing /root work-disk mount, built-in DNS, and NAT wiring onto those hooks. Generalize Firecracker drive config at the same time so later storage features can extend machine setup without another hardcoded path.

Add banger doctor on top of the shared readiness checks, update the docs to describe the new architecture, and cover the new seams with guest-config, capability, report, CLI, and full go test verification. Also verify make build and a real ./banger doctor run on the host.
2026-03-18 19:28:26 -03:00
787b234029
Fix VM startup regressions after shell-out cleanup
The shell-out reduction pass introduced two linked startup regressions in the hot path for vm create.

Make flattenNestedWorkHome repair the temporary nested /root tree without trying to read a root-owned 0700 directory as the calling user: chmod the scratch directory under sudo, then copy each child entry individually before removing it. Add a regression test for that overlap/permission case.

Restore the Firecracker launch wrapper that sets umask 000 before exec. Firecracker was creating the API socket, but the SDK could not use it during machine.Start after the direct sudo launch, so vm create timed out waiting on a socket that already existed.

Validated with go test ./... and make build.
2026-03-18 12:18:34 -03:00
942d242c03
Move avoidable daemon shell-outs into Go
Reduce the control plane's dependency on helper scripts while keeping the hard Linux integration points in the approved shell-out layer.

Replace the bash-driven image build path with a native Go builder that clones and optionally resizes the rootfs, boots a temporary Firecracker VM, provisions the guest over SSH, installs packages and modules, and preserves the package-manifest sidecar.

Also replace a few small convenience shell-outs with Go helpers: read process stats from /proc, use os.Truncate for ext4 image growth, add file-clone and normalized-line helpers, drop the sh -c work-disk flattening path, and launch Firecracker via a direct sudo command.

Add tests for the new SSH/archive and system helpers, plus a policy test that keeps os/exec imports confined to cli/firecracker/system. Update the docs to describe customize.sh as a manual helper rather than the daemon's image-build backend.

Validated with go mod tidy, go test ./..., and make build.
2026-03-17 17:13:07 -03:00
60294e8c90
Fix VM lifecycle issues behind verify.sh
Make the Firecracker and bangerd processes outlive short-lived CLI request contexts so vm create no longer kills the VMM or daemon as soon as the RPC returns.

Fix fresh-VM SSH by flattening the seeded /root work disk when the copied home tree lands under a nested root/ directory, and write a guest sshd override to keep root pubkey auth explicit while debugging.

Harden teardown and smoke diagnostics: verify.sh now reports early Firecracker exit and delete failures directly, while dm snapshot cleanup tolerates already-gone handles and retries busy mapper removal long enough for Firecracker to release the device.

Validation: go test ./..., make build, bash -n verify.sh, direct SSH against a fresh VM, and a live ./verify.sh run that now completes with [verify] ok.
2026-03-17 14:43:09 -03:00
644e60d739
Add structured daemon lifecycle logs
VM start, image build, and network/setup failures were hard to diagnose because bangerd emitted almost no lifecycle logs and the Firecracker SDK logger was discarded. This adds a daemon-wide JSON logger with configurable log level so failures leave breadcrumbs instead of only side effects.

Log the main daemon and VM lifecycle stages, preserve raw Firecracker and image-build helper output in dedicated files, and include those log paths in daemon status and returned errors. Bridge SDK logrus output into the daemon logger at debug level so low-level Firecracker diagnostics are available without making normal info logs unreadable.

Validation: go test ./... and make build. Left unrelated worktree changes out of this commit, including internal/api/types.go, the deleted shell scripts, and my-rootfs.ext4.
2026-03-16 16:16:28 -03:00
2539800f5c
Use Firecracker SDK in daemon
Replace the daemon's hand-rolled Firecracker process/socket client with the official firecracker-go-sdk while keeping the existing VM lifecycle and host-side disk and TAP setup intact.

Build machine configs through the SDK, launch Firecracker through a sudo process runner, resolve the real VM PID after startup, and use the SDK client for Ctrl-Alt-Del instead of raw REST calls. Drop the unused cached Firecracker state and add focused adapter tests for config and process-runner wiring.

Validated with go mod tidy, go test ./..., and make build. A live KVM/Firecracker smoke boot was not run in this environment.
2026-03-16 13:26:41 -03:00
ea72ea26fe
Add Go daemon-driven VM control plane
Replace the shell-only user workflow with `banger` and `bangerd`: Cobra commands, XDG/SQLite-backed state, managed VM and image lifecycle, and a Bubble Tea TUI for browsing and operating VMs.\n\nKeep Firecracker orchestration behind the daemon so VM specs become persistent objects, and add repo entrypoints for building, installing, and documenting the new flow while still delegating rootfs customization to the existing shell tooling.\n\nHarden the control plane around real usage by reclaiming Firecracker API sockets for the user, restarting stale daemons after rebuilds, and returning the correct `vm.create` payload so the CLI and TUI creation flow work reliably.\n\nValidation: `go test ./...`, `make build`, and a host-side smoke test with `./banger vm create --name codex-smoke`.
2026-03-16 12:52:54 -03:00