banger

Thales Maciel 72882e45d7 daemon: serialise concurrent image/kernel pulls + atomic-rename seed refresh

Three concurrency bugs surfaced by `make smoke JOBS=4` that all stem from `vm.create` paths assuming single-caller semantics:

1. Kernel auto-pull manifest race. Parallel `vm.create` calls that each need to auto-pull the same kernel ref both run kernelcat.Fetch in parallel against the same /var/lib/banger/kernels/<name>/. Fetch writes manifest.json non-atomically (truncate + write); the peer reads it back mid-write and trips "parse manifest for X: unexpected end of JSON input".

   Fix: per-name `sync.Mutex` map on `ImageService` (kernelPullLock). `KernelPull` and `readOrAutoPullKernel` both acquire it and re-check `kernelcat.ReadLocal` after the lock so a peer who finished while we waited is treated as success. `readOrAutoPullKernel` does NOT call `s.KernelPull` because that path errors with "already pulled" on a peer-success, which would be wrong for auto-pull. Different kernels stay parallel.

2. Image auto-pull race. Same shape as the kernel race but on the image side: parallel `vm.create` calls both run pullFromBundle / pullFromOCI for the missing image (each ~minutes of OCI fetch + ext4 build). The publishImage atom under imageOpsMu only protects the rename + UpsertImage commit, so the loser does all the work only to fail at the recheck with "image already exists".

   Fix: per-name `sync.Mutex` map on `ImageService` (imagePullLock). `findOrAutoPullImage` acquires it, re-checks FindImage, and only then calls PullImage. Loser short-circuits with the freshly-published image instead of redoing minutes of work. PullImage's own publishImage recheck stays as defense-in-depth for callers that bypass the auto-pull path.

3. Work-seed refresh race. When the host's SSH key has rotated since an image was last refreshed, `ensureAuthorizedKeyOnWorkDisk` triggers `refreshManagedWorkSeedFingerprint`, which rewrote the shared work-seed.ext4 in place via e2rm + e2cp. Peer `vm.create` calls doing parallel `MaterializeWorkDisk` rdumps observed a torn ext4 image: "Superblock checksum does not match superblock".

   Fix: stage the rewrite on a sibling tmpfile (`<seed>.refresh.<pid>-<ns>.tmp`) and atomic-rename. Concurrent readers either have the file open (kernel keeps the pre-rename inode alive) or open after the rename (see the new inode); they never observe a partial state. Two parallel refreshes are idempotent (same daemon, same SSH key) so unique tmp names are enough; whichever rename lands last wins, with identical content. UpsertImage runs after the rename so the recorded fingerprint always matches what's on disk.

Plus one smoke harness fix: reclassify `vm_prune` from `pure` to `global`. `vm prune -f` removes ALL stopped VMs system-wide, not just the ones the scenario created, so a parallel peer scenario that happens to have its VM in `created`/`stopped` momentarily gets wiped. Moving prune to the post-pool serial phase keeps it from racing with in-flight scenarios.

After all four fixes, `make smoke JOBS=4` passes 21/21 in 174s (serial baseline 141s; the small overhead is the buffered-output and `wait -n` semaphore cost, well worth the parallelism for fast-iter work on a 32-core box).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
		2026-04-27 17:24:11 -03:00
..
dmsnap	Extract opstate and dmsnap into subpackages	2026-04-15 16:02:43 -03:00
fcproc	daemon: split owner daemon from root helper	2026-04-26 12:43:17 -03:00
imagemgr	images: remove the docker field	2026-04-26 20:28:40 -03:00
opstate	coverage: medium batch — hostnat runner, store guest-sessions, daemon helpers	2026-04-18 18:03:37 -03:00
workspace	seams: move the last four package globals onto instance fields	2026-04-22 12:07:14 -03:00
ARCHITECTURE.md	daemon: split owner daemon from root helper	2026-04-26 12:43:17 -03:00
autopull_test.go	daemon: build a work-seed during image pull, refresh doctor check	2026-04-23 20:24:10 -03:00
capabilities.go	daemon: surface previously-swallowed errors at warn	2026-04-26 22:30:51 -03:00
capabilities_test.go	daemon: doctor passes vm dns when banger itself owns the port	2026-04-26 18:57:27 -03:00
concurrency_test.go	daemon: build a work-seed during image pull, refresh doctor check	2026-04-23 20:24:10 -03:00
daemon.go	daemon: thread per-RPC op_id end-to-end	2026-04-26 22:13:44 -03:00
daemon_test.go	daemon: split owner daemon from root helper	2026-04-26 12:43:17 -03:00
daemon_testing_test.go	test: add newTestDaemon harness + options	2026-04-22 17:45:43 -03:00
dispatch.go	daemon: extract StatsService sibling; shrink VMService's surface	2026-04-23 15:46:59 -03:00
dispatch_test.go	daemon: thread per-RPC op_id end-to-end	2026-04-26 22:13:44 -03:00
dns_routing.go	daemon: split owner daemon from root helper	2026-04-26 12:43:17 -03:00
dns_routing_test.go	seams: move the last four package globals onto instance fields	2026-04-22 12:07:14 -03:00
doc.go	daemon: split owner daemon from root helper	2026-04-26 12:43:17 -03:00
doctor.go	daemon: split owner daemon from root helper	2026-04-26 12:43:17 -03:00
doctor_test.go	cleanup: drop pre-v0.1 migration scaffolding + legacy-behavior refs	2026-04-23 13:56:32 -03:00
fake_firecracker_test.go	remove vm session feature	2026-04-20 12:47:58 -03:00
fastpath_test.go	daemon: build the work disk fresh instead of cloning the seed file	2026-04-26 20:42:10 -03:00
guest_ssh.go	remove vm session feature	2026-04-20 12:47:58 -03:00
host_network.go	daemon: split owner daemon from root helper	2026-04-26 12:43:17 -03:00
image_seed.go	daemon: serialise concurrent image/kernel pulls + atomic-rename seed refresh	2026-04-27 17:24:11 -03:00
image_service.go	daemon: serialise concurrent image/kernel pulls + atomic-rename seed refresh	2026-04-27 17:24:11 -03:00
images.go	daemon: serialise concurrent image/kernel pulls + atomic-rename seed refresh	2026-04-27 17:24:11 -03:00
images_helpers_test.go	coverage: medium batch — hostnat runner, store guest-sessions, daemon helpers	2026-04-18 18:03:37 -03:00
images_pull.go	daemon: build a work-seed during image pull, refresh doctor check	2026-04-23 20:24:10 -03:00
images_pull_bundle_test.go	daemon: build a work-seed during image pull, refresh doctor check	2026-04-23 20:24:10 -03:00
images_pull_test.go	daemon: build a work-seed during image pull, refresh doctor check	2026-04-23 20:24:10 -03:00
kernels.go	daemon: serialise concurrent image/kernel pulls + atomic-rename seed refresh	2026-04-27 17:24:11 -03:00
kernels_test.go	daemon split (6/n): extract wireServices + drop lazy service getters	2026-04-21 15:55:28 -03:00
lifecycle_flow_test.go	test: end-to-end VMService lifecycle flow harness	2026-04-22 17:55:04 -03:00
logger.go	daemon: thread per-RPC op_id end-to-end	2026-04-26 22:13:44 -03:00
logger_test.go	seams: move the last four package globals onto instance fields	2026-04-22 12:07:14 -03:00
nat.go	daemon: split owner daemon from root helper	2026-04-26 12:43:17 -03:00
nat_capability_test.go	daemon: persist tap device on VM.Runtime so NAT teardown survives handle-cache loss	2026-04-23 14:21:13 -03:00
nat_test.go	vm state: split transient kernel/process handles off the durable schema	2026-04-19 14:18:13 -03:00
open_close_test.go	daemon: split owner daemon from root helper	2026-04-26 12:43:17 -03:00
preflight.go	daemon: split owner daemon from root helper	2026-04-26 12:43:17 -03:00
privileged_ops.go	daemon: split owner daemon from root helper	2026-04-26 12:43:17 -03:00
runtime_assets.go	daemon split (4/5): extract *VMService service	2026-04-20 20:57:05 -03:00
snapshot.go	daemon: split owner daemon from root helper	2026-04-26 12:43:17 -03:00
snapshot_test.go	daemon split (6/n): extract wireServices + drop lazy service getters	2026-04-21 15:55:28 -03:00
ssh_client_config.go	daemon: split owner daemon from root helper	2026-04-26 12:43:17 -03:00
ssh_client_config_test.go	daemon: split owner daemon from root helper	2026-04-26 12:43:17 -03:00
sshd_config_test.go	daemon: delete flattenNestedWorkHome and normaliseHomeDirPerms	2026-04-23 18:33:06 -03:00
stats_service.go	daemon: thread per-RPC op_id end-to-end	2026-04-26 22:13:44 -03:00
stats_service_test.go	daemon: extract StatsService sibling; shrink VMService's surface	2026-04-23 15:46:59 -03:00
tap_pool.go	daemon: split owner daemon from root helper	2026-04-26 12:43:17 -03:00
vm.go	daemon: persist teardown fallbacks and reject unsafe import paths	2026-04-23 16:21:59 -03:00
vm_authsync.go	daemon: split owner daemon from root helper	2026-04-26 12:43:17 -03:00
vm_create.go	daemon: serialise concurrent image/kernel pulls + atomic-rename seed refresh	2026-04-27 17:24:11 -03:00
vm_create_ops.go	daemon: thread per-RPC op_id end-to-end	2026-04-26 22:13:44 -03:00
vm_create_test.go	model: validate VM names as DNS labels at CLI + daemon	2026-04-23 14:06:40 -03:00
vm_disk.go	system: mkfs work disks with lazy_itable_init + lazy_journal_init	2026-04-26 21:32:57 -03:00
vm_handles.go	daemon: split owner daemon from root helper	2026-04-26 12:43:17 -03:00
vm_handles_test.go	daemon: persist teardown fallbacks and reject unsafe import paths	2026-04-23 16:21:59 -03:00
vm_lifecycle.go	daemon: sync guest over ssh before stop to preserve workspace writes	2026-04-27 15:41:32 -03:00
vm_lifecycle_steps.go	daemon: skip fsck_snapshot on freshly-created system overlays	2026-04-26 21:37:14 -03:00
vm_lifecycle_steps_test.go	daemon: extract startVMLocked into step runner with per-step rollback	2026-04-23 15:34:34 -03:00
vm_locks.go	Move subsystem state/locks off Daemon into owning types	2026-04-15 15:58:33 -03:00
vm_service.go	daemon: thread per-RPC op_id end-to-end	2026-04-26 22:13:44 -03:00
vm_set.go	daemon: thread per-RPC op_id end-to-end	2026-04-26 22:13:44 -03:00
vm_test.go	daemon: split owner daemon from root helper	2026-04-26 12:43:17 -03:00
workspace.go	feat(vm): add vm exec command with workspace dirty detection	2026-04-26 23:53:45 -03:00
workspace_rejection_test.go	tests: targeted coverage for doctor, workspace rejections, and nat capability	2026-04-22 12:58:12 -03:00
workspace_service.go	daemon: thread per-RPC op_id end-to-end	2026-04-26 22:13:44 -03:00
workspace_test.go	feat(vm): add vm exec command with workspace dirty detection	2026-04-26 23:53:45 -03:00