Separates what a VM IS (durable intent + identity + deterministic
derived paths — `VMRuntime`) from what is CURRENTLY TRUE about it
(firecracker PID, tap device, loop devices, dm-snapshot target — new
`VMHandles`). The durable state lives in the SQLite `vms` row; the
transient state lives in an in-memory cache on the daemon plus a
per-VM `handles.json` scratch file inside VMDir, rebuilt at startup
from OS inspection. Nothing kernel-level rides the SQLite schema
anymore.
Why:
Persisting ephemeral process handles to SQLite forced reconcile to
treat "running with a stale PID" as a first-class case and mix it
with real state transitions. The schema described what we last
observed, not what the VM is. Every time the observation model
shifted (tap pool, DM naming, pgrep fallback) the reconcile logic
grew a new branch. Splitting lets each layer own what it's good at:
durable records describe intent, in-memory cache + scratch file
describe momentary reality.
Shape:
- `model.VMHandles` = PID, TapDevice, BaseLoop, COWLoop, DMName,
DMDev. Never in SQLite.
- `VMRuntime` keeps: State, GuestIP, APISockPath, VSockPath,
VSockCID, LogPath, MetricsPath, DNSName, VMDir, SystemOverlay,
WorkDiskPath, LastError. All durable or deterministic.
- `handleCache` on `*Daemon` — mutex-guarded map + scratch-file
plumbing (`writeHandlesFile` / `readHandlesFile` /
`rediscoverHandles`). See `internal/daemon/vm_handles.go`.
- `d.vmAlive(vm)` replaces the 20+ inline
`vm.State==Running && ProcessRunning(vm.Runtime.PID, apiSock)`
spreads. Single source of truth for liveness.
- Startup reconcile: per running VM, load the scratch file, pgrep
the api sock, either keep (cache seeded from scratch) or demote
to stopped (scratch handles passed to cleanupRuntime first so DM
/ loops / tap actually get torn down).
Verification:
- `go test ./...` green.
- Live: `banger vm run --name handles-test -- cat /etc/hostname`
starts; `handles.json` appears in VMDir with the expected PID,
tap, loops, DM.
- `kill -9 $(pgrep bangerd)` while the VM is running, re-invoke the
CLI, daemon auto-starts, reconcile recognises the VM as alive,
`banger vm ssh` still connects, `banger vm delete` cleans up.
Tests added:
- vm_handles_test.go: scratch-file roundtrip, missing/corrupt file
behaviour, cache concurrency, rediscoverHandles prefers pgrep
over scratch, returns scratch contents even when process is
dead (so cleanup can tear down kernel state).
- vm_test.go: reconcile test rewritten to exercise the new flow
(write scratch → reconcile reads it → verifies process is gone →
issues dmsetup/losetup teardown).
ARCHITECTURE.md updated; `handles` added to Daemon field docs.
Previously /etc/ssh/sshd_config.d/99-banger.conf landed with:
LogLevel DEBUG3
PermitRootLogin yes
PubkeyAuthentication yes
AuthorizedKeysFile /root/.ssh/authorized_keys
StrictModes no
DEBUG3 was debug leftover that floods journald in normal use.
StrictModes no was a workaround for /root perm drift on the work
disk — the real fix is to make those perms correct at provisioning
time.
New drop-in:
PermitRootLogin prohibit-password
PubkeyAuthentication yes
PasswordAuthentication no
KbdInteractiveAuthentication no
AuthorizedKeysFile /root/.ssh/authorized_keys
prohibit-password blocks password root login even if PasswordAuth
gets flipped on elsewhere; KbdInteractiveAuth no closes the last
interactive fallback; StrictModes is now on (sshd's default).
normaliseHomeDirPerms chown/chmods /root to 0755 root:root at every
work-disk mount (ensureAuthorizedKeyOnWorkDisk,
seedAuthorizedKeyOnExt4Image); the .ssh dir also explicitly
chown'd root:root. Verified end-to-end against a real VM:
`sshd -T` reports strictmodes yes and all five directives match.
Regression test (sshd_config_test.go) pins the allow-list and the
deny-list (DEBUG3, StrictModes no, bare `PermitRootLogin yes`) so
the next accidental reintroduction fails fast.
README's Security section updated to reflect the new posture.
vm.go (1529 LOC) splits into vm_create, vm_lifecycle, vm_set, vm_stats,
vm_disk, vm_authsync; firecracker/DNS/helpers stay in vm.go.
guest_sessions.go (1266 LOC) splits into session_controller,
session_lifecycle, session_attach, session_stream; scripts and helpers
stay in guest_sessions.go.
Mechanical move only. No behavior change. Adds doc.go and
ARCHITECTURE.md capturing subsystem map and current lock ordering as
the baseline for the upcoming subsystem extraction.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>