banger/internal/daemon
Thales Maciel 72882e45d7
daemon: serialise concurrent image/kernel pulls + atomic-rename seed refresh
Three concurrency bugs surfaced by `make smoke JOBS=4` that all stem
from `vm.create` paths assuming single-caller semantics:

1. **Kernel auto-pull manifest race.** Parallel `vm.create` calls that
   each need to auto-pull the same kernel ref all run `kernelcat.Fetch`
   concurrently against the same /var/lib/banger/kernels/<name>/. Fetch
   writes manifest.json non-atomically (truncate + write); a peer
   reads it back mid-write and trips
   "parse manifest for X: unexpected end of JSON input".

   Fix: per-name `sync.Mutex` map on `ImageService` (kernelPullLock).
   `KernelPull` and `readOrAutoPullKernel` both acquire it and re-check
   `kernelcat.ReadLocal` after the lock so a peer who finished while we
   waited is treated as success — `readOrAutoPullKernel` does NOT call
   `s.KernelPull` because that path errors with "already pulled" on a
   peer-success, which would be wrong for auto-pull. Different kernels
   stay parallel.

2. **Image auto-pull race.** Same shape as the kernel race but on the
   image side: parallel `vm.create` calls both run pullFromBundle /
   pullFromOCI for the missing image (each ~minutes of OCI fetch +
   ext4 build). The `publishImage` atomic section under `imageOpsMu`
   only protects the rename + UpsertImage commit, so the loser does all
   the work only to fail at the recheck with "image already exists".

   Fix: per-name `sync.Mutex` map on `ImageService` (imagePullLock).
   `findOrAutoPullImage` acquires it, re-checks FindImage, and only
   then calls PullImage. Loser short-circuits with the
   freshly-published image instead of redoing minutes of work.
   PullImage's own publishImage recheck stays as defense-in-depth
   for callers that bypass the auto-pull path.

3. **Work-seed refresh race.** When the host's SSH key has rotated
   since an image was last refreshed, `ensureAuthorizedKeyOnWorkDisk`
   triggers `refreshManagedWorkSeedFingerprint`, which rewrote the
   shared work-seed.ext4 in place via e2rm + e2cp. Peer `vm.create`
   calls doing parallel `MaterializeWorkDisk` rdumps observed a torn
   ext4 image — "Superblock checksum does not match superblock".

   Fix: stage the rewrite on a sibling tmpfile (`<seed>.refresh.<pid>-<ns>.tmp`)
   and atomic-rename. Concurrent readers either have the file open
   (kernel keeps the pre-rename inode alive) or open after the rename
   (see the new inode) — never observe a partial state. Two parallel
   refreshes are idempotent (same daemon, same SSH key) so unique tmp
   names are enough; whichever rename lands last wins, with identical
   content. UpsertImage runs after the rename so the recorded
   fingerprint always matches what's on disk.

Plus one smoke harness fix: reclassify `vm_prune` from `pure` to
`global`. `vm prune -f` removes ALL stopped VMs system-wide, not just
the ones the scenario created — so a parallel peer scenario that
happens to have its VM in `created`/`stopped` momentarily gets wiped.
Moving prune to the post-pool serial phase keeps it from racing with
in-flight scenarios.

After all four fixes, `make smoke JOBS=4` passes 21/21 in 174s
(serial baseline 141s; the small overhead is the buffered-output and
`wait -n` semaphore cost — well worth the parallelism for fast-iter
work on a 32-core box).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 17:24:11 -03:00
| Name | Last commit | Date |
| --- | --- | --- |
| dmsnap | Extract opstate and dmsnap into subpackages | 2026-04-15 16:02:43 -03:00 |
| fcproc | daemon: split owner daemon from root helper | 2026-04-26 12:43:17 -03:00 |
| imagemgr | images: remove the docker field | 2026-04-26 20:28:40 -03:00 |
| opstate | coverage: medium batch — hostnat runner, store guest-sessions, daemon helpers | 2026-04-18 18:03:37 -03:00 |
| workspace | seams: move the last four package globals onto instance fields | 2026-04-22 12:07:14 -03:00 |
| ARCHITECTURE.md | daemon: split owner daemon from root helper | 2026-04-26 12:43:17 -03:00 |
| autopull_test.go | daemon: build a work-seed during image pull, refresh doctor check | 2026-04-23 20:24:10 -03:00 |
| capabilities.go | daemon: surface previously-swallowed errors at warn | 2026-04-26 22:30:51 -03:00 |
| capabilities_test.go | daemon: doctor passes vm dns when banger itself owns the port | 2026-04-26 18:57:27 -03:00 |
| concurrency_test.go | daemon: build a work-seed during image pull, refresh doctor check | 2026-04-23 20:24:10 -03:00 |
| daemon.go | daemon: thread per-RPC op_id end-to-end | 2026-04-26 22:13:44 -03:00 |
| daemon_test.go | daemon: split owner daemon from root helper | 2026-04-26 12:43:17 -03:00 |
| daemon_testing_test.go | test: add newTestDaemon harness + options | 2026-04-22 17:45:43 -03:00 |
| dispatch.go | daemon: extract StatsService sibling; shrink VMService's surface | 2026-04-23 15:46:59 -03:00 |
| dispatch_test.go | daemon: thread per-RPC op_id end-to-end | 2026-04-26 22:13:44 -03:00 |
| dns_routing.go | daemon: split owner daemon from root helper | 2026-04-26 12:43:17 -03:00 |
| dns_routing_test.go | seams: move the last four package globals onto instance fields | 2026-04-22 12:07:14 -03:00 |
| doc.go | daemon: split owner daemon from root helper | 2026-04-26 12:43:17 -03:00 |
| doctor.go | daemon: split owner daemon from root helper | 2026-04-26 12:43:17 -03:00 |
| doctor_test.go | cleanup: drop pre-v0.1 migration scaffolding + legacy-behavior refs | 2026-04-23 13:56:32 -03:00 |
| fake_firecracker_test.go | remove vm session feature | 2026-04-20 12:47:58 -03:00 |
| fastpath_test.go | daemon: build the work disk fresh instead of cloning the seed file | 2026-04-26 20:42:10 -03:00 |
| guest_ssh.go | remove vm session feature | 2026-04-20 12:47:58 -03:00 |
| host_network.go | daemon: split owner daemon from root helper | 2026-04-26 12:43:17 -03:00 |
| image_seed.go | daemon: serialise concurrent image/kernel pulls + atomic-rename seed refresh | 2026-04-27 17:24:11 -03:00 |
| image_service.go | daemon: serialise concurrent image/kernel pulls + atomic-rename seed refresh | 2026-04-27 17:24:11 -03:00 |
| images.go | daemon: serialise concurrent image/kernel pulls + atomic-rename seed refresh | 2026-04-27 17:24:11 -03:00 |
| images_helpers_test.go | coverage: medium batch — hostnat runner, store guest-sessions, daemon helpers | 2026-04-18 18:03:37 -03:00 |
| images_pull.go | daemon: build a work-seed during image pull, refresh doctor check | 2026-04-23 20:24:10 -03:00 |
| images_pull_bundle_test.go | daemon: build a work-seed during image pull, refresh doctor check | 2026-04-23 20:24:10 -03:00 |
| images_pull_test.go | daemon: build a work-seed during image pull, refresh doctor check | 2026-04-23 20:24:10 -03:00 |
| kernels.go | daemon: serialise concurrent image/kernel pulls + atomic-rename seed refresh | 2026-04-27 17:24:11 -03:00 |
| kernels_test.go | daemon split (6/n): extract wireServices + drop lazy service getters | 2026-04-21 15:55:28 -03:00 |
| lifecycle_flow_test.go | test: end-to-end VMService lifecycle flow harness | 2026-04-22 17:55:04 -03:00 |
| logger.go | daemon: thread per-RPC op_id end-to-end | 2026-04-26 22:13:44 -03:00 |
| logger_test.go | seams: move the last four package globals onto instance fields | 2026-04-22 12:07:14 -03:00 |
| nat.go | daemon: split owner daemon from root helper | 2026-04-26 12:43:17 -03:00 |
| nat_capability_test.go | daemon: persist tap device on VM.Runtime so NAT teardown survives handle-cache loss | 2026-04-23 14:21:13 -03:00 |
| nat_test.go | vm state: split transient kernel/process handles off the durable schema | 2026-04-19 14:18:13 -03:00 |
| open_close_test.go | daemon: split owner daemon from root helper | 2026-04-26 12:43:17 -03:00 |
| preflight.go | daemon: split owner daemon from root helper | 2026-04-26 12:43:17 -03:00 |
| privileged_ops.go | daemon: split owner daemon from root helper | 2026-04-26 12:43:17 -03:00 |
| runtime_assets.go | daemon split (4/5): extract *VMService service | 2026-04-20 20:57:05 -03:00 |
| snapshot.go | daemon: split owner daemon from root helper | 2026-04-26 12:43:17 -03:00 |
| snapshot_test.go | daemon split (6/n): extract wireServices + drop lazy service getters | 2026-04-21 15:55:28 -03:00 |
| ssh_client_config.go | daemon: split owner daemon from root helper | 2026-04-26 12:43:17 -03:00 |
| ssh_client_config_test.go | daemon: split owner daemon from root helper | 2026-04-26 12:43:17 -03:00 |
| sshd_config_test.go | daemon: delete flattenNestedWorkHome and normaliseHomeDirPerms | 2026-04-23 18:33:06 -03:00 |
| stats_service.go | daemon: thread per-RPC op_id end-to-end | 2026-04-26 22:13:44 -03:00 |
| stats_service_test.go | daemon: extract StatsService sibling; shrink VMService's surface | 2026-04-23 15:46:59 -03:00 |
| tap_pool.go | daemon: split owner daemon from root helper | 2026-04-26 12:43:17 -03:00 |
| vm.go | daemon: persist teardown fallbacks and reject unsafe import paths | 2026-04-23 16:21:59 -03:00 |
| vm_authsync.go | daemon: split owner daemon from root helper | 2026-04-26 12:43:17 -03:00 |
| vm_create.go | daemon: serialise concurrent image/kernel pulls + atomic-rename seed refresh | 2026-04-27 17:24:11 -03:00 |
| vm_create_ops.go | daemon: thread per-RPC op_id end-to-end | 2026-04-26 22:13:44 -03:00 |
| vm_create_test.go | model: validate VM names as DNS labels at CLI + daemon | 2026-04-23 14:06:40 -03:00 |
| vm_disk.go | system: mkfs work disks with lazy_itable_init + lazy_journal_init | 2026-04-26 21:32:57 -03:00 |
| vm_handles.go | daemon: split owner daemon from root helper | 2026-04-26 12:43:17 -03:00 |
| vm_handles_test.go | daemon: persist teardown fallbacks and reject unsafe import paths | 2026-04-23 16:21:59 -03:00 |
| vm_lifecycle.go | daemon: sync guest over ssh before stop to preserve workspace writes | 2026-04-27 15:41:32 -03:00 |
| vm_lifecycle_steps.go | daemon: skip fsck_snapshot on freshly-created system overlays | 2026-04-26 21:37:14 -03:00 |
| vm_lifecycle_steps_test.go | daemon: extract startVMLocked into step runner with per-step rollback | 2026-04-23 15:34:34 -03:00 |
| vm_locks.go | Move subsystem state/locks off Daemon into owning types | 2026-04-15 15:58:33 -03:00 |
| vm_service.go | daemon: thread per-RPC op_id end-to-end | 2026-04-26 22:13:44 -03:00 |
| vm_set.go | daemon: thread per-RPC op_id end-to-end | 2026-04-26 22:13:44 -03:00 |
| vm_test.go | daemon: split owner daemon from root helper | 2026-04-26 12:43:17 -03:00 |
| workspace.go | feat(vm): add vm exec command with workspace dirty detection | 2026-04-26 23:53:45 -03:00 |
| workspace_rejection_test.go | tests: targeted coverage for doctor, workspace rejections, and nat capability | 2026-04-22 12:58:12 -03:00 |
| workspace_service.go | daemon: thread per-RPC op_id end-to-end | 2026-04-26 22:13:44 -03:00 |
| workspace_test.go | feat(vm): add vm exec command with workspace dirty detection | 2026-04-26 23:53:45 -03:00 |