daemon: serialise concurrent image/kernel pulls + atomic-rename seed refresh

Three concurrency bugs surfaced by `make smoke JOBS=4`, all stemming from `vm.create` paths that assume single-caller semantics:

1. **Kernel auto-pull manifest race.** Parallel `vm.create` calls that each need to auto-pull the same kernel ref run kernelcat.Fetch concurrently against the same /var/lib/banger/kernels/<name>/. Fetch writes manifest.json non-atomically (truncate + write); the peer reads it back mid-write and trips "parse manifest for X: unexpected end of JSON input".

   Fix: per-name `sync.Mutex` map on `ImageService` (kernelPullLock). `KernelPull` and `readOrAutoPullKernel` both acquire it and re-check `kernelcat.ReadLocal` after the lock, so a peer who finished while we waited is treated as success. `readOrAutoPullKernel` does NOT call `s.KernelPull`, because that path errors with "already pulled" on a peer-success, which would be wrong for auto-pull. Different kernels stay parallel.

2. **Image auto-pull race.** Same shape as the kernel race but on the image side: parallel `vm.create` calls both run pullFromBundle / pullFromOCI for the missing image (each ~minutes of OCI fetch + ext4 build). The publishImage atom under imageOpsMu only protects the rename + UpsertImage commit, so the loser does all the work only to fail at the recheck with "image already exists".

   Fix: per-name `sync.Mutex` map on `ImageService` (imagePullLock). `findOrAutoPullImage` acquires it, re-checks FindImage, and only then calls PullImage. The loser short-circuits with the freshly-published image instead of redoing minutes of work. PullImage's own publishImage recheck stays as defense-in-depth for callers that bypass the auto-pull path.

3. **Work-seed refresh race.** When the host's SSH key has rotated since an image was last refreshed, `ensureAuthorizedKeyOnWorkDisk` triggers `refreshManagedWorkSeedFingerprint`, which rewrote the shared work-seed.ext4 in place via e2rm + e2cp. Peer `vm.create` calls doing parallel `MaterializeWorkDisk` rdumps observed a torn ext4 image: "Superblock checksum does not match superblock".

   Fix: stage the rewrite on a sibling tmpfile (`<seed>.refresh.<pid>-<ns>.tmp`) and atomic-rename. Concurrent readers either have the file open (the kernel keeps the pre-rename inode alive) or open it after the rename (and see the new inode); they never observe a partial state. Two parallel refreshes are idempotent (same daemon, same SSH key), so unique tmp names are enough; whichever rename lands last wins, with identical content. UpsertImage runs after the rename, so the recorded fingerprint always matches what's on disk.

Plus one smoke harness fix: reclassify `vm_prune` from `pure` to `global`. `vm prune -f` removes ALL stopped VMs system-wide, not just the ones the scenario created, so a parallel peer scenario that happens to have its VM in `created`/`stopped` momentarily gets wiped. Moving prune to the post-pool serial phase keeps it from racing with in-flight scenarios.

After all four fixes, `make smoke JOBS=4` passes 21/21 in 174s (serial baseline 141s; the small overhead is the buffered-output and `wait -n` semaphore cost, well worth the parallelism for fast-iter work on a 32-core box).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parent 115eec8576
commit 72882e45d7
6 changed files with 162 additions and 13 deletions

@@ -38,6 +38,29 @@ type ImageService struct {
 	// internal/daemon/ARCHITECTURE.md.
 	imageOpsMu sync.Mutex

+	// kernelPullLocksMu guards the kernelPullLocks map itself. Per-name
+	// mutexes inside the map serialise concurrent pulls of the same
+	// kernel ref. Without this, two parallel `vm run` callers that
+	// auto-pull the same kernel race on
+	// /var/lib/banger/kernels/<name>/manifest.json: one is mid-write
+	// from kernelcat.Fetch's WriteLocal while the other is reading it
+	// back, yielding "unexpected end of JSON input". The map keeps
+	// pulls of *different* kernels parallel.
+	kernelPullLocksMu sync.Mutex
+	kernelPullLocks   map[string]*sync.Mutex
+
+	// imagePullLocksMu / imagePullLocks: same per-name pattern for
+	// image auto-pulls. Without this, parallel `vm.create` callers
+	// resolving a missing image both run the full OCI fetch + ext4
+	// build (each ~minutes), and the loser hits the "image already
+	// exists" recheck inside publishImage and fails after doing all
+	// the work for nothing. Locking around the FindImage-recheck +
+	// PullImage section means only one caller does the heavy work
+	// per image name; peers see the freshly-published image on the
+	// post-lock recheck.
+	imagePullLocksMu sync.Mutex
+	imagePullLocks   map[string]*sync.Mutex
+
 	// Test seams; nil → real implementation.
 	pullAndFlatten       func(ctx context.Context, ref, cacheDir, destDir string) (imagepull.Metadata, error)
 	finalizePulledRootfs func(ctx context.Context, ext4File string, meta imagepull.Metadata) error

@@ -73,6 +96,41 @@ func newImageService(deps imageServiceDeps) *ImageService {
 	}
 }

+// kernelPullLock returns the per-name mutex used to serialise kernel
+// pulls of `name`. The map entry is created on first access and lives
+// for the daemon's lifetime — kernels rarely churn and keeping the
+// entry around saves the allocation and the second-acquire path stays
+// branchless. Callers Lock() / Unlock() the returned mutex directly.
+func (s *ImageService) kernelPullLock(name string) *sync.Mutex {
+	s.kernelPullLocksMu.Lock()
+	defer s.kernelPullLocksMu.Unlock()
+	if s.kernelPullLocks == nil {
+		s.kernelPullLocks = make(map[string]*sync.Mutex)
+	}
+	m, ok := s.kernelPullLocks[name]
+	if !ok {
+		m = &sync.Mutex{}
+		s.kernelPullLocks[name] = m
+	}
+	return m
+}
+
+// imagePullLock is the image-name peer of kernelPullLock; same lifetime
+// and zero-allocation properties on the second-acquire path.
+func (s *ImageService) imagePullLock(name string) *sync.Mutex {
+	s.imagePullLocksMu.Lock()
+	defer s.imagePullLocksMu.Unlock()
+	if s.imagePullLocks == nil {
+		s.imagePullLocks = make(map[string]*sync.Mutex)
+	}
+	m, ok := s.imagePullLocks[name]
+	if !ok {
+		m = &sync.Mutex{}
+		s.imagePullLocks[name] = m
+	}
+	return m
+}
+
 // FindImage is the service-owned lookup helper. It falls back from
 // exact-name → exact-id → prefix match, matching the historical
 // daemon.FindImage behaviour. Kept on ImageService because image