Commit graph

296 commits

a0b5c7fa3c
CHANGELOG: v0.1.1 release notes
Captures the install.sh + BANGER_INSTALL_NONINTERACTIVE changes
that landed in 1004331 and 3c29af5. v0.1.1 is being cut now to
exercise the self-update path against a real released second
version — `banger update` has never run live before, only against
unit-test fixtures, so this release doubles as the smoke test of
the update flow.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 14:33:12 -03:00
1004331c14
install.sh: drop --user, add BANGER_INSTALL_NONINTERACTIVE env var
Surveyed the install scripts of comparable systemd-installing tools
(Docker, k3s, Tailscale, Ollama, Determinate Systems Nix, flyctl):
none of the daemon installers offer a --user staging mode, because
the resulting install isn't useful — banger inherits that. The
"--user just stages binaries you can't actually use yet" UX was a
trap; remove it before users hit it.

In its place, adopt the cross-tool convention for non-interactive
runs: the BANGER_INSTALL_NONINTERACTIVE=1 env var is friendlier
through a curl|bash pipe than `bash -s -- --yes` because the env
var can sit on the same line:

  curl -fsSL ...install.sh | env BANGER_INSTALL_NONINTERACTIVE=1 bash

The --yes flag still works for direct script invocation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 14:15:36 -03:00
3c29af55a2
Add curl|bash installer + wire upload into publish script
scripts/install.sh is the one-command installer end users run as

  curl -fsSL https://releases.thaloco.com/banger/install.sh | bash

Design choices:

* Runs as the invoking user. All network work + signature verification
  happens unprivileged; sudo is only re-execed for the actual install
  step that writes to /usr/local and creates systemd units.
* Right before the sudo prompt, the script prints a plain-language
  summary of exactly what's about to happen — the file paths it will
  create and a one-line "why sudo" — so the user authorises a known
  scope rather than the whole pipeline. Detail link in the docs.
* Uses openssl (universally available) for signature verification, not
  cosign. cosign is needed only by the *signer*, never the verifier.
* No jq dependency. The latest_stable field is extracted from the
  manifest with grep+sed, since the manifest shape is well-defined and
  we control it.
* /dev/tty fallback for the confirmation prompt so it works through
  the curl|bash pipe.
* --yes for non-interactive CI use, --user for installing into
  ~/.local/bin without touching system paths, --version vX.Y.Z to pin.

publish-banger-release.sh now uploads install.sh to the bucket root
on every publish, so the curl URL is stable but the script logic
matches the latest verified release. It also runs a key-drift check:
if scripts/install.sh's embedded cosign public key differs from the
one in internal/updater/verify_signature.go, publishing aborts. The
two copies must stay in sync or one of them ends up rejecting every
release.

README's Quick start now leads with the installer one-liner and
documents the audit-first variant alongside it; building from source
moves below.

Smoke-tested end to end against the live bucket with --user mode:
manifest fetch → tarball download → cosign signature verify → hash
verify → extract → install. The installed binary reports v0.1.0 at
commit 6fdebd9, matching the published artifact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 14:06:34 -03:00
d1c4619a01
Add CHANGELOG.md with v0.1.0 release notes
First-release changelog following the Keep a Changelog + SemVer
convention. The v0.1.0 section groups by capability area (sandbox
VMs, images, kernels, host networking, system install, self-update,
trust model, CLI surface) rather than by package, so it reads as
release notes for users deciding whether to install rather than as
a commit log. Includes a Compatibility section calling out the
informal vsock-protocol stability promise (stable across patches,
not minors) and the forward-only schema policy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 13:45:44 -03:00
6fdebd929e
publish-script: split RCLONE_BUCKET out of BUCKET_PATH
The previous form passed rclone paths like releases:banger/v0.1.0/,
which rclone parses as bucket=banger, key=v0.1.0/... — wrong, because
the actual R2 bucket is named "releases" (BUCKET_PATH was meant as
an in-bucket key prefix only). Uploads 403'd because the token has
no view of a bucket called "banger".

Introduce RCLONE_BUCKET as a separate env var (default: "releases")
and route every rclone copy through ${RCLONE_REMOTE}:${RCLONE_BUCKET}/${BUCKET_PATH}.
The public URLs in the manifest stay unchanged: BASE_URL is the
bucket's public custom domain, so the bucket name is implicit there.

The defaults now resolve to the live setup:
  rclone target:  releases:releases/banger/<version>/<file>
  public URL:     https://releases.thaloco.com/banger/<version>/<file>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 13:35:53 -03:00
12f7a92bb4
publish-script: don't clobber COSIGN_PASSWORD with empty default
The previous form did

  COSIGN_PASSWORD="${COSIGN_PASSWORD:-}" cosign sign-blob ...

which set COSIGN_PASSWORD to "" when the caller hadn't exported one.
cosign sees an explicit empty password and tries to decrypt with
it instead of prompting interactively, so any real password-protected
offline key fails with "decryption failed".

Drop the prefix entirely. If COSIGN_PASSWORD is already in env, it
gets inherited normally; if it isn't, cosign prompts on the terminal
— which is the right UX for a maintainer running the publish script
locally with the offline private key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 13:27:23 -03:00
3d748b87c8
publish-script: fix pubkey extraction and cosign v3 compatibility
Two bugs found while dry-running the publish flow end-to-end:

1. The awk pipeline that pulled BangerReleasePublicKey out of
   verify_signature.go didn't strip Go's raw-string-literal wrapping
   (`var ... = ` + backtick on the BEGIN line, trailing backtick on
   the END line). The "verify against embedded pub key" step thus
   compared sigs against a malformed PEM. Replaced with a sed pair
   that yields a clean PEM block byte-identical to cosign.pub.

2. cosign v3.x defaults sign-blob to a new bundle format and
   pushes signatures to Rekor; both are incompatible with banger's
   "embedded pub key, raw ASN.1 DER signature" trust model.
   Add --use-signing-config=false / --tlog-upload=false /
   --new-bundle-format=false to opt out, and --insecure-ignore-tlog
   on verify-blob. These flags also work on cosign v2.x, so the
   script is forward- and backward-compatible across the v2→v3
   boundary.

Validated by an end-to-end dry-run on this machine: built binaries,
tarred, sha256summed, cosign-signed, verified against the embedded
pub key, then re-verified through internal/updater's
crypto/ecdsa.VerifyASN1 path — all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 13:23:09 -03:00
b7c9661c99
updater: embed real cosign public key for v0.1.0 release signing
The placeholder in BangerReleasePublicKey is replaced with the
production cosign public key (P-256 ECDSA). The matching private
key is stored offline by the maintainer; this is the public half
that every banger CLI baked from this commit forward will use to
verify SHA256SUMS signatures.

cosign.pub is also committed at the repo root so external auditors
can re-verify a release without parsing the Go source.

The placeholder-refuses test now swaps the embedded key for a
synthetic placeholder for the duration of the test, since the
default value is no longer a placeholder.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 12:50:52 -03:00
fae28e3d8b
update: docs + publish script for the self-update feature
README gets a top-level Updating section; docs/privileges.md gains
a step-by-step trust-model writeup of `banger update`. The new
scripts/publish-banger-release.sh drives the manual release cut:
build, tar, sha256sum, cosign sign-blob, verify against the embedded
public key, jq-merge into manifest.json, rclone upload to the R2
bucket. Refuses outright if the embedded key is still the placeholder
so we can't accidentally publish an unverifiable release. Also folds
in gofmt drift accumulated across the updater package and a few
sibling files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 12:43:46 -03:00
8ed351ea47
updater: cosign-blob signature verification on SHA256SUMS
Closes the v0.1.0 cosign requirement. Every banger update download
now goes through ECDSA-P256 verification before any binary is
trusted: SHA256SUMS.sig is fetched, base64-decoded, and verified
against the embedded BangerReleasePublicKey.

  * BangerReleasePublicKey: PEM-encoded ECDSA public key embedded
    at compile time. The current value is a sentinel PLACEHOLDER —
    the maintainer must replace it with the output of
    `cosign generate-key-pair`'s cosign.pub before cutting v0.1.0,
    and re-cut. Until they do, every `banger update` refuses with
    ErrSignatureRequired ("the maintainer must replace it and
    re-cut a release before update can proceed"). Loud refusal
    beats silent acceptance.
  * VerifyBlobSignature: parses the embedded public key,
    base64-decodes the signature, computes SHA256(body), and runs
    ecdsa.VerifyASN1. cosign sign-blob produces the format that
    VerifyASN1 verifies natively (ASN.1-DER encoded ECDSA over
    a SHA256 digest), so no third-party crypto deps are needed.
  * FetchAndVerifySignature: pulls the signature URL from the
    release manifest entry, fetches it (1 KiB cap), and verifies
    against sumsBody. Refuses outright when sha256sums_sig_url is
    empty — v0.1.0 contract requires every release to be signed,
    and an unsigned release is a manifest publishing bug we'd
    rather catch loudly than silently accept.
  * Wired into banger update: sumsBody captured from
    DownloadRelease, immediately fed into FetchAndVerifySignature.
    A failed verification removes the staged tarball before
    returning so it can't be reused.
  * BangerReleasePublicKey is var (not const) only to support tests
    that swap in a generated keypair; production sets it at compile
    time and never mutates it.
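
A minimal sketch of the verification path described above, assuming a PEM-encoded PKIX public key and a base64 signature over SHA256(body). The names verifyBlob and signForDemo are hypothetical stand-ins, not the actual internal/updater API; the keypair is generated in-process purely for illustration.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/sha256"
	"crypto/x509"
	"encoding/base64"
	"encoding/pem"
	"errors"
	"fmt"
	"strings"
)

// verifyBlob is a hypothetical stand-in for VerifyBlobSignature: it checks a
// base64-encoded ASN.1-DER ECDSA signature over SHA256(body).
func verifyBlob(pubPEM, body []byte, sigB64 string) error {
	block, _ := pem.Decode(pubPEM)
	if block == nil {
		return errors.New("public key is not valid PEM")
	}
	parsed, err := x509.ParsePKIXPublicKey(block.Bytes)
	if err != nil {
		return fmt.Errorf("parse public key: %w", err)
	}
	pub, ok := parsed.(*ecdsa.PublicKey)
	if !ok {
		return errors.New("public key is not ECDSA")
	}
	sig, err := base64.StdEncoding.DecodeString(strings.TrimSpace(sigB64))
	if err != nil {
		return fmt.Errorf("decode signature: %w", err)
	}
	digest := sha256.Sum256(body)
	if !ecdsa.VerifyASN1(pub, digest[:], sig) {
		return errors.New("signature does not match body")
	}
	return nil
}

// signForDemo generates a throwaway P-256 keypair and signs body with it.
// It exists only so the sketch can run without an external signer.
func signForDemo(body []byte) (pubPEM []byte, sigB64 string, err error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, "", err
	}
	der, err := x509.MarshalPKIXPublicKey(&key.PublicKey)
	if err != nil {
		return nil, "", err
	}
	pubPEM = pem.EncodeToMemory(&pem.Block{Type: "PUBLIC KEY", Bytes: der})
	digest := sha256.Sum256(body)
	sig, err := ecdsa.SignASN1(rand.Reader, key, digest[:])
	if err != nil {
		return nil, "", err
	}
	return pubPEM, base64.StdEncoding.EncodeToString(sig), nil
}

func main() {
	body := []byte("example SHA256SUMS body\n")
	pub, sig, err := signForDemo(body)
	if err != nil {
		panic(err)
	}
	fmt.Println("untouched body:", verifyBlob(pub, body, sig))
	fmt.Println("tampered body: ", verifyBlob(pub, append(body, 'x'), sig))
}
```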

Tests: placeholder-key path returns ErrSignatureRequired; happy
path with a fresh in-test ECDSA keypair verifies a real
sign-then-verify; tampered body, wrong key, and three malformed
signature shapes (not-base64, empty, garbage-DER) all reject.

Maintainer-cut workflow documented in BangerReleasePublicKey's
comment: cosign generate-key-pair → paste cosign.pub into the
constant → at release time, cosign sign-blob --key cosign.key
SHA256SUMS > SHA256SUMS.sig and publish.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 12:37:53 -03:00
92ca1aa96f
cli: add banger update command
Wires updater + the existing system-install helpers into a single
operator-facing flow:

  1. FetchManifest, resolve target release (default: latest_stable;
     override with --to vX.Y.Z).
  2. --check exits with a one-line "up to date" / "update available".
     This suits tools that poll for updates on a timer.
  3. requireRoot beyond this point — we're about to write
     /usr/local/bin and talk to systemctl.
  4. daemon.operations.list → refuse if any operation isn't Done.
     --force overrides; per the v0.1.0 plan there's no drain wait.
  5. PrepareCleanStaging + DownloadRelease + StageTarball into
     /var/cache/banger/updates/.
  6. Sanity-run the staged binaries: `banger --version` must mention
     the expected version; `bangerd --check-migrations --system`
     must exit 0 (compatible) or 1 (will auto-migrate). Exit 2
     (incompatible) aborts before the swap.
  7. --dry-run stops here with a one-line plan, leaves staging.
  8. Swap (vsock → bangerd → banger) → restart bangerd-root then
     bangerd → waitForDaemonReady on the system socket.
  9. Run `banger doctor` against the JUST-INSTALLED CLI binary
     (not d.doctor in-process — we want to exercise the new binary
     end-to-end). FAIL triggers auto-rollback: restore .previous
     backups, restart services, surface the original failure with
     "(rolled back to previous install)".
  10. UpdateBuildInfo on /etc/banger/install.toml. CleanupBackups.
     Wipe staging dir.

rollbackAndWrap / rollbackAndRestart split: the former is for
failures BEFORE the systemctl restart (old binaries are still on
disk under .previous; the OLD daemon is still running because the
restart never happened). The latter is for failures AFTER, where
rollback ALSO needs another systemctl restart so the OLD versions
take over again. If even rollback's restart fails, we surface
everything we know — the install is broken and the operator gets
the breadcrumbs to fix it manually.

Existing TestNewBangerCommandHasExpectedSubcommands updated to
include "update" in the expected ordering.

Live exercise against the empty bucket today errors as expected:
  $ banger update --check
  banger: discover: fetch manifest: HTTP 404 Not Found  # exit 1

Once the user publishes the first manifest the same command will
report "up to date" or "update available".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 12:35:04 -03:00
91af367208
updater: download/stage/swap/rollback flow steps
The pure-logic core of `banger update`. No CLI yet; this commit
ships the steps the next commit's command will orchestrate.

  * download.go — DownloadRelease fetches SHA256SUMS, parses it,
    looks up the tarball's basename, then streams the tarball
    through download.FetchVerified so the hash is checked on the
    fly. Returns the SHA256SUMS bytes alongside so a future
    cosign-verification step can validate them against an embedded
    public key before trusting the hashes inside.
    Also: fetchBounded for small bounded GETs (manifest, sums file,
    future signature), DefaultStagingDir, EnsureStagingDir,
    PrepareCleanStaging.
  * stage.go — StageTarball reads gzip+tar, validates the entry
    set is exactly {banger, bangerd, banger-vsock-agent} (no
    extras, no missing, no path traversal, no non-regular files),
    extracts at mode 0755 regardless of what the tarball claims.
    StagedRelease records the resulting paths.
  * swap.go — InstallTargets pins the canonical install paths
    (/usr/local/bin/banger, /usr/local/bin/bangerd,
    /usr/local/lib/banger/banger-vsock-agent). Swap orders the
    three replacements vsock → bangerd → banger so the most
    impactful binary (the CLI) goes last; each step uses
    system.AtomicReplace and accumulates a SwapResult so partial
    failures can be rolled back cleanly. Rollback unwinds in
    reverse, joining errors so a half-rolled-back state surfaces
    enough info for an operator to fix manually. CleanupBackups
    removes the .previous trail after `banger doctor` confirms
    the new install is healthy.
  * installmeta.UpdateBuildInfo — small helper that refreshes
    Version/Commit/BuiltAt on /etc/banger/install.toml without
    re-running the full system install. Preserves OwnerUser/UID/
    GID/Home and the original InstalledAt timestamp.

Tests: stage rejects extra entries / missing entries / path
traversal / non-regular files; happy-path stages all three at 0755
with correct contents. Swap+Rollback covers the all-three-succeed
path (then verifies .previous backups exist + rollback restores
old contents) AND the partial-failure path (third swap blocked by
a non-dir parent → SwappedTargets = 2 → rollback unwinds those
two cleanly). DownloadRelease covers happy path, tarball-not-in-
SHA256SUMS, and propagated sha256 mismatch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 12:30:22 -03:00
fb6d2b1dae
updater: manifest + SHA256SUMS parsing scaffolding
First slice of the `banger update` package. No CLI yet — this just
defines the wire shape and parsers the rest of the flow will plug
into.

  * internal/updater/manifest.go — Manifest / Release types,
    ManifestSchemaVersion = 1, the hardcoded URL
    https://releases.thaloco.com/banger/manifest.json (var instead
    of const so tests can point at httptest), and FetchManifest /
    ParseManifest / Manifest.LookupRelease / Manifest.Latest.
    The manifest only references URLs (tarball, SHA256SUMS, optional
    signature); actual binary hashes come from SHA256SUMS itself,
    so manifest tampering can't substitute a hash for a known-good
    tarball.
    SchemaVersion gates forward-compat: a CLI that doesn't know its
    server's schema_version refuses to update rather than guessing.
  * internal/updater/sha256sums.go — ParseSHA256Sums tolerates both
    GNU `<digest>  <file>` (with optional `*` binary prefix) and
    BSD `SHA256 (file) = <digest>` formats. Comments and blank
    lines are skipped; malformed lines that LOOK like entries are
    rejected (silent skipping is the wrong failure mode for a
    security-relevant input). Digests are lowercased so the caller
    can `==`-compare without worrying about case.

Caps: 1 MiB on the manifest body, 16 KiB on SHA256SUMS, 256 MiB on
release tarballs. Generous-but-bounded; bumping requires a code
change so a server-side mistake can't fill the disk.

Tests: ParseManifest happy path, schema-version-too-new rejection,
five malformed-input cases. ParseSHA256Sums covers GNU + BSD +
star-prefix + comments-and-blanks, six malformed-input rejections,
case-insensitive digest normalisation. FetchManifest end-to-end via
httptest.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 12:24:36 -03:00
abd5d6f5ab
download: shared FetchVerified helper for capped + hashed downloads
imagecat.Fetch and kernelcat.Fetch each implement the same pattern:
HTTP GET with a Content-Length pre-check, an io.LimitReader cap on
the body, on-the-fly sha256 hashing, and refusal on either the cap
trip or a hash mismatch. The about-to-arrive `banger update` flow
makes a third caller, which is the right number to factor.

  * internal/download.FetchVerified(ctx, client, url, expectedSHA256,
    maxBytes, dstPath): streams the body to dstPath through a
    sha256 hasher, capped at maxBytes+1 bytes so an oversize body
    is detected before the hash check fires. On any failure
    (HTTP error, ContentLength > cap, body exceeds cap, write
    error, hash mismatch) the partial file is removed before
    returning so callers don't have to disambiguate "did we leave
    bytes on disk?".

Imagecat and kernelcat are NOT migrated to this helper in this
commit — they each have their own destination-dir layout and
post-verify decompress/extract steps that don't fit a one-size
helper. Lift them later if it stays clean; for now the helper
is sized for the updater's "fetch tarball + SHA256SUMS" need.

Tests cover happy path, hash mismatch, advertised Content-Length
over cap, lying server (chunked, no Content-Length, but oversize
body), HTTP non-2xx, and the two arg-validation rejections (empty
expected hash, non-positive maxBytes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 18:44:27 -03:00
fa3a7a3e31
system: add AtomicReplace + Rollback for binary swap
Prerequisite for `banger update`'s swap step. The updater renames a
staged binary into place and needs (a) atomicity per file (no
half-written bytes for a process that's about to systemctl restart
into the new binary) and (b) a backup it can restore from when
post-restart doctor reports FAIL.

  * AtomicReplace(newSrc, dst, suffixPrevious): if dst exists,
    move it to dst+suffixPrevious. Then os.Rename newSrc → dst.
    Atomic on a single fs (the only case relevant to the updater —
    everything is staged under /var/cache/banger and then renamed
    into /usr/local/bin, but those should be on the same fs in a
    typical install). On rename failure, restore the backup so we
    don't leave the caller without their binary.
  * AtomicReplaceRollback(dst, suffixPrevious): symmetric inverse.
    Removes dst, renames dst+suffixPrevious back to dst. Tolerant
    of a missing backup (fresh-install case) so the updater can
    call it unconditionally on failure paths without tracking
    backup state itself.
  * Refuses an empty suffix via a compile-time-style guard: an empty
    suffix would silently no-op the backup AND break rollback.

Six tests cover: happy path, fresh install (no prior dst), stale
.previous from a half-finished prior run, empty-suffix rejection,
rollback restores, rollback tolerant of no-backup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 18:43:04 -03:00
ec6fc9d185
store,bangerd: add --check-migrations flag for pre-swap schema check
Prerequisite for `banger update`. Before swapping a staged binary
into place, the updater needs to confirm the new bangerd recognises
the running install's DB schema. Without this, an operator could end
up with a service that won't open its store after the binary swap +
restart.

  * store.InspectSchemaState(path): opens the DB read-only (reusing
    OpenReadOnly's mode=ro DSN), reads the schema_migrations table,
    and classifies the relationship between applied and known IDs:
    SchemaCompatible (lockstep), SchemaMigrationsNeeded (binary
    newer, will auto-migrate on first Open), or SchemaIncompatible
    (DB has applied IDs the binary doesn't know about).
    Missing schema_migrations table is treated as "all migrations
    pending" rather than an error — matches the fresh-install case.
  * bangerd --check-migrations: opens the configured DB read-only,
    prints a one-line classification, and exits 0/1/2. The exit
    code is the contract:
        0 — compatible
        1 — migrations needed (binary newer; safe to swap)
        2 — incompatible (binary older than DB; abort the swap)
    Honours --system to pick between system StateDir and user mode.
  * bangerdExit indirection so future tests can capture the exit
    code without terminating the test process. Production points
    at os.Exit.
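
The three-way classification and its exit-code contract can be sketched as below. Integer migration IDs and the classify helper are assumptions made for illustration; only the three state names and the 0/1/2 meanings come from the text above.

```go
package main

import "fmt"

// SchemaState values double as the exit codes in this sketch:
// 0 compatible, 1 migrations needed, 2 incompatible.
type SchemaState int

const (
	SchemaCompatible       SchemaState = iota // applied == known
	SchemaMigrationsNeeded                    // binary knows IDs the DB lacks
	SchemaIncompatible                        // DB has IDs the binary lacks
)

// classify compares the IDs recorded in the DB (applied) with the IDs the
// binary ships (known). An empty applied list models a missing
// schema_migrations table: every known migration is pending.
func classify(applied, known []int) SchemaState {
	knownSet := make(map[int]bool, len(known))
	for _, id := range known {
		knownSet[id] = true
	}
	appliedSet := make(map[int]bool, len(applied))
	for _, id := range applied {
		if !knownSet[id] {
			return SchemaIncompatible
		}
		appliedSet[id] = true
	}
	for _, id := range known {
		if !appliedSet[id] {
			return SchemaMigrationsNeeded
		}
	}
	return SchemaCompatible
}

func main() {
	known := []int{1, 2, 3}
	fmt.Println("lockstep:    ", int(classify([]int{1, 2, 3}, known)))
	fmt.Println("binary newer:", int(classify([]int{1, 2}, known)))
	fmt.Println("DB newer:    ", int(classify([]int{1, 2, 3, 99}, known)))
}
```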

Tests cover four cases: compatible (fully migrated
DB), migrations-needed (only baseline applied), incompatible
(synthetic id=99 inserted), and missing-table (fresh DB). Live
exercise on this dev host returned `migrations needed: pending [3]
(binary will apply on first Open)` and exit 1, matching the
contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 18:41:31 -03:00
3c0af3a2de
opstate,daemon: list in-flight operations via daemon.operations.list
Prerequisite for `banger update`'s preflight, which refuses to swap
binaries while anything is in flight. Today's opstate.Registry
exposes Insert/Get/Prune but no iteration; without a snapshot
accessor the update flow can't tell whether a vm.create is
mid-prepare-work-disk.

  * opstate.Registry.List(): returns a freshly-allocated snapshot
    of every entry. Mutating the slice doesn't poison the
    registry. Pinned by tests covering the snapshot semantics
    and the empty case.
  * api.OperationSummary / OperationsListResult: a public-shape
    record per op. Today the Kind is always "vm.create" — the
    field exists so future async kinds (image.pull, kernel.pull)
    plug in without an API change.
  * Daemon.ListOperations + daemon.operations.list RPC:
    walks vmService.createOps and emits OperationSummary entries.
    Done ops are included in the snapshot; the update preflight
    filters by Done itself.
  * dispatch_test's documented-methods list updated.
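
The snapshot semantics can be illustrated with a small value-typed registry. This is a simplified sketch: the Op type, NewRegistry, and inFlight are hypothetical, and the real opstate.Registry may well store richer or pointer-typed entries.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Op is a simplified, hypothetical operation record.
type Op struct {
	ID   string
	Kind string
	Done bool
}

// Registry guards a map of operations with a mutex.
type Registry struct {
	mu  sync.Mutex
	ops map[string]Op
}

func NewRegistry() *Registry {
	return &Registry{ops: make(map[string]Op)}
}

func (r *Registry) Insert(op Op) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.ops[op.ID] = op
}

// List returns a freshly allocated copy of every entry, sorted by ID so the
// output is deterministic. Mutating the result does not touch the registry.
func (r *Registry) List() []Op {
	r.mu.Lock()
	defer r.mu.Unlock()
	out := make([]Op, 0, len(r.ops))
	for _, op := range r.ops {
		out = append(out, op)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].ID < out[j].ID })
	return out
}

// inFlight mirrors the caller-side filter: the snapshot includes Done ops,
// and the preflight keeps only those that are not Done.
func inFlight(ops []Op) []Op {
	var busy []Op
	for _, op := range ops {
		if !op.Done {
			busy = append(busy, op)
		}
	}
	return busy
}

func main() {
	r := NewRegistry()
	r.Insert(Op{ID: "op-1", Kind: "vm.create", Done: true})
	r.Insert(Op{ID: "op-2", Kind: "vm.create"})
	fmt.Println("in flight:", inFlight(r.List()))
}
```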

No behaviour change for existing flows; this is a read-only
addition.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 18:14:57 -03:00
775525b592
cli,doctor: --version flag + CLI/install drift check
Two pre-release polish items on the version-display surface.

  * --version on both binaries: cobra's Version field on the banger
    and bangerd roots renders a one-line summary (banger v0.1.0
    (commit abcd1234, built 2026-04-28T20:45:50Z)). The
    SetVersionTemplate override drops cobra's "{{.Name}} version"
    prefix — our string is already a complete sentence. The
    multi-line `banger version` subcommand is unchanged for callers
    that want the full SHA / built_at on separate lines.
  * Doctor "banger version" row: prints the running CLI's version +
    short commit + built-at, plus what /etc/banger/install.toml
    recorded at install time. Disagreement is the most common
    version-skew pitfall (stale CLI against fresh daemon, or vice
    versa) and a one-line warn is friendlier than tracking that down
    from a launch failure.
    Drift detection is suppressed when either side is dev/unknown
    (untagged build) — comparing a dev CLI against a tagged install
    is the developer-machine case, not a real problem.
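
A sketch of the drift rule, assuming the untagged sentinels are the literal strings "dev" and "unknown" (an empty version is treated the same way here). BuildInfo and versionsDrift are illustrative names rather than the doctor's actual types.

```go
package main

import "fmt"

// BuildInfo is a hypothetical pair of the fields being compared.
type BuildInfo struct {
	Version string
	Commit  string
}

// isUntagged reports whether a version string marks a dev/unknown build.
func isUntagged(version string) bool {
	return version == "" || version == "dev" || version == "unknown"
}

// versionsDrift reports a mismatch between the running CLI and the recorded
// install, but stays quiet when either side is an untagged build.
func versionsDrift(cli, installed BuildInfo) bool {
	if isUntagged(cli.Version) || isUntagged(installed.Version) {
		return false
	}
	return cli.Version != installed.Version || cli.Commit != installed.Commit
}

func main() {
	installed := BuildInfo{Version: "v0.1.0", Commit: "6fdebd9"}
	fmt.Println(versionsDrift(BuildInfo{Version: "v0.1.1", Commit: "a0b5c7f"}, installed))
	fmt.Println(versionsDrift(BuildInfo{Version: "dev", Commit: "1c1ca7d"}, installed))
}
```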

formatVersionLine is in internal/cli (banger.go) and reused by
bangerd.go via a strings.Replace because bangerd's version line
should say "bangerd" not "banger". Slightly hacky-feeling but cheaper
than parameterising the helper for one caller.

Tests: TestVersionsDriftToleratesDevAndUnknown pins the four
branches (match, version diff, commit diff, dev-suppression). The
existing version-format test already runs through formatVersionLine
indirectly.

Live exercise:
  $ banger --version
  banger dev (commit 1c1ca7d6, built 2026-04-28T20:52:33Z)
  $ bangerd --version
  bangerd dev (commit 1c1ca7d6, built 2026-04-28T20:52:33Z)
  $ banger doctor | head
  ...
  PASS	banger version
    - CLI dev (commit 1c1ca7d6, built 2026-04-28T20:52:33Z)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 17:53:32 -03:00
1c1ca7d6a4
doctor: pin firecracker version range, distro-aware install hint
Pre-release polish: be explicit about which firecracker versions
banger has been validated against, and give users a one-line install
suggestion when the binary is missing rather than the previous
generic "install firecracker or set firecracker_bin".

internal/firecracker/version.go (new):
  * MinSupportedVersion = "1.5.0" — the floor banger refuses to
    launch below. Bumping this is a deliberate decision, paired
    with whatever helper feature started requiring the newer
    firecracker.
  * KnownTestedVersion = "1.14.1" — what banger's smoke suite
    actually runs against today.
  * SemVer + Compare + ParseVersionOutput, table-tested. The parser
    tolerates the trailing "exiting successfully" log line that
    firecracker tacks onto --version; only the canonical
    "Firecracker vX.Y.Z" line matters.
  * QueryVersion shells `<bin> --version` through a CommandRunner-
    shaped interface; doesn't import internal/system to keep the
    firecracker package leaf-clean.

internal/daemon/doctor.go:
  * New addFirecrackerVersionCheck replaces the previous bare
    RequireExecutable preflight for firecracker. Three outcomes:
    PASS within [Min, Tested], WARN above Tested (newer firecracker
    usually works but is outside the tested window), FAIL below Min
    or when the binary is missing.
  * On missing binary, surfaces a distro-aware install command via
    parseOSReleaseIDs(/etc/os-release) → guessFirecrackerInstallCommand.
    Pinned suggestions for debian (apt), arch/manjaro
    (paru), and nixos (nix-env). Other distros get only the upstream
    Releases URL — guessing wrong sends users on a wild goose chase.
  * runtimeChecks no longer includes the firecracker preflight; the
    new check subsumes it.
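
The parse, compare, and PASS/WARN/FAIL steps above might be sketched as follows. Function names are lowercased stand-ins, dropping any prerelease suffix before comparison is an assumption, and the sample --version output in main is illustrative rather than captured from a real firecracker binary.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

type SemVer struct{ Major, Minor, Patch int }

// parseSemVer accepts "1.14.1" or "v1.14.1"; any "-pre" or "+build" suffix
// is dropped (an assumption of this sketch).
func parseSemVer(s string) (SemVer, error) {
	s = strings.TrimPrefix(strings.TrimSpace(s), "v")
	if i := strings.IndexAny(s, "-+"); i >= 0 {
		s = s[:i]
	}
	parts := strings.Split(s, ".")
	if len(parts) != 3 {
		return SemVer{}, fmt.Errorf("malformed version %q", s)
	}
	var nums [3]int
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return SemVer{}, fmt.Errorf("malformed version %q", s)
		}
		nums[i] = n
	}
	return SemVer{nums[0], nums[1], nums[2]}, nil
}

// Compare returns -1, 0, or 1.
func (a SemVer) Compare(b SemVer) int {
	pairs := [3][2]int{{a.Major, b.Major}, {a.Minor, b.Minor}, {a.Patch, b.Patch}}
	for _, p := range pairs {
		if p[0] < p[1] {
			return -1
		}
		if p[0] > p[1] {
			return 1
		}
	}
	return 0
}

// parseVersionOutput looks only for the "Firecracker vX.Y.Z" line and
// ignores any trailing log lines.
func parseVersionOutput(out string) (SemVer, error) {
	for _, line := range strings.Split(out, "\n") {
		line = strings.TrimSpace(line)
		if strings.HasPrefix(line, "Firecracker v") {
			return parseSemVer(strings.TrimPrefix(line, "Firecracker "))
		}
	}
	return SemVer{}, fmt.Errorf("no version line found")
}

// classifyVersion maps a version to the three doctor outcomes.
func classifyVersion(v, min, tested SemVer) string {
	switch {
	case v.Compare(min) < 0:
		return "FAIL"
	case v.Compare(tested) > 0:
		return "WARN"
	default:
		return "PASS"
	}
}

func main() {
	min, tested := SemVer{1, 5, 0}, SemVer{1, 14, 1}
	out := "Firecracker v1.14.1\n\nsome trailing log line: exiting successfully\n"
	v, err := parseVersionOutput(out)
	fmt.Println(v, err, classifyVersion(v, min, tested))
}
```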

README.md:
  * Requirements line now spells out the tested-against version
    (v1.14.1) and the supported floor (≥ v1.5.0), and points at
    `banger doctor` for the version check + install hint.

Tests: ParseVersionOutput across canonical/prerelease/garbage inputs,
SemVer.Compare across major/minor/patch boundaries, MustParseSemVer
panics on malformed inputs. Doctor-side: PASS on tested version,
FAIL below Min, WARN above Tested, FAIL with upstream URL when
missing, install-hint dispatch table covering debian/ubuntu (via
ID_LIKE)/arch/manjaro/nixos/fedora-fallback/missing-os-release.
The renamed TestDoctorReport_MissingFirecrackerFails... now asserts
against the new check name. Live `banger doctor` reports
"v1.14.1 at /usr/bin/firecracker (within tested range; min v1.5.0,
tested v1.14.1)" against the smoke host.

Smoke bare_run still green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 17:47:42 -03:00
f7a6832ebf
Merge model,cli,docs polish for v0.1.0
# Conflicts:
#	internal/cli/commands_image.go
2026-04-28 17:36:47 -03:00
d0997fd3b5
model,cli,docs: medium-effort polish for v0.1.0
  * model.ParseSize / FormatSizeBytes: pinned with table tests in
    internal/model/types_test.go (TestParseSize 22 cases,
    TestFormatSizeBytes 11 cases, TestParseSizeFormatRoundTrip 7
    boundaries). Fixed the long-suffix regression: "4GiB", "512MiB",
    "4KiB" now parse correctly (parser strips trailing IB before
    inspecting the unit byte). Pinned current behaviour for
    no-suffix input ("1024" treated as MiB) and FormatSizeBytes(0).
    commands_image.go --size flag-help updated to show 4GiB now
    that the parser accepts it.
  * vm ports --json: resolves the JSON-vs-table inconsistency between
    vm stats (always JSON) and vm ports (always table). --json on
    vm ports flips to the same printJSON path as vm stats. Default
    table output unchanged. Other vm subcommands (show, stats,
    logs, health, ping) didn't fit the identical pattern; left
    alone.
  * docs/oci-import.md architecture section moved to a new
    docs/oci-import-internals.md (precedent: internal/daemon/
    ARCHITECTURE.md). User-facing oci-import.md keeps a one-line
    pointer for advanced reading.
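
To make the suffix handling in the first bullet concrete, here is a loose sketch of a size parser with the described behaviour: long suffixes such as "4GiB" are reduced to their unit letter, and a bare number is read as MiB. Returning bytes, case-insensitivity, and the name parseSize are assumptions of this sketch, not confirmed details of model.ParseSize; overflow checks are omitted.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseSize returns a byte count for inputs like "4G", "4GiB", "512MiB",
// "4KiB", or "1024" (bare numbers are treated as MiB).
func parseSize(s string) (int64, error) {
	u := strings.ToUpper(strings.TrimSpace(s))
	hadLongSuffix := strings.HasSuffix(u, "IB")
	u = strings.TrimSuffix(u, "IB")
	if u == "" {
		return 0, fmt.Errorf("empty size %q", s)
	}
	mult := int64(1) << 20 // default unit: MiB
	switch u[len(u)-1] {
	case 'K':
		mult, u = 1<<10, u[:len(u)-1]
	case 'M':
		mult, u = 1<<20, u[:len(u)-1]
	case 'G':
		mult, u = 1<<30, u[:len(u)-1]
	default:
		if hadLongSuffix {
			return 0, fmt.Errorf("size %q has no unit before its suffix", s)
		}
	}
	n, err := strconv.ParseInt(u, 10, 64)
	if err != nil || n < 0 {
		return 0, fmt.Errorf("malformed size %q", s)
	}
	return n * mult, nil
}

func main() {
	for _, in := range []string{"4G", "4GiB", "512MiB", "4KiB", "1024", "lots"} {
		n, err := parseSize(in)
		fmt.Println(in, n, err)
	}
}
```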

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 17:36:03 -03:00
26dbf0f221
Merge cli,docs polish for v0.1.0
Brings in commit 003b048 from the agent-2 worktree:
  - CLI help text + completers (image pull / kernel pull / vm stats /
    vm set / --from flag).
  - README golden-image definition + Requirements block above
    Quick Start.
  - Code hygiene: drop emptyDash; consolidate formatBytes into
    humanSize.
  - Logger downgrades for per-RPC INFO chatter.
2026-04-28 17:35:06 -03:00
003b0488ce
cli,docs: trivial polish for v0.1.0
A pre-release audit collected ~12 trivial-effort UX and code-hygiene
items. Rolling them up here so the v0.1.0 commit log isn't littered
with one-line tweaks.

CLI help / completion:
  * commands_image.go: drop dangling reference to a `banger image
    catalog` subcommand that doesn't exist; replace with a pointer
    to `banger image list`.
  * commands_image.go: --size flag example was "4GiB" but the parser
    rejects that suffix. Change example to "4G". (Parser-side fix
    is in a separate concern.)
  * commands_image.go + completion.go: image pull now wires a
    catalog completer (falls back to local image names since there's
    no image-catalog RPC yet); image show / delete / promote already
    completed local names.
  * commands_kernel.go + completion.go: kernel pull now wires a new
    completeKernelCatalogNameOnlyAtPos0 backed by the kernel.catalog
    RPC, so tab-complete suggests pullable kernels.
  * commands_vm.go: vm stats and vm set now have Long + Example
    blocks (peers all do); --from flag description updated to spell
    out the relationship to --branch.

README:
  * Define "golden image" inline at first use.
  * Add a one-line Requirements block above Quick Start so users
    hit the firecracker / KVM dependency before `make build`.

Code hygiene:
  * dashIfEmpty / emptyDash were the same function. Deleted
    emptyDash, retargeted three call sites.
  * formatBytes (introduced today in image cache prune) duplicated
    humanSize. Consolidated to humanSize, now with a space ("1.2
    GiB" not "1.2GiB"). formatters_test.go expectations updated.

Logging chattiness:
  * "operation started" (logger.go), "daemon request canceled"
    (daemon.go), and "helper rpc completed" (roothelper.go) all
    fired at INFO per RPC. Downgraded to DEBUG so routine shell
    completions don't spam syslog.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 17:31:54 -03:00
33639efe0c
docs: fix three security-sensitive doc/code mismatches
A pre-release audit caught three places where the docs misrepresent
the trust model. Each is a claim users would read while auditing
banger and reach the wrong conclusion.

  * docs/privileges.md:140, 194 — bridge default was documented as
    "banger0" but the code default (model.DefaultBridgeName) is
    "br-fc". A user following the manual-removal recipe would `ip
    link del banger0` against a non-existent interface.
  * docs/privileges.md:192 — uninstall recipe said "stop your VMs
    first via `banger vm stop --all`". That flag doesn't exist; vm
    stop is a per-name action. Replaced with the actual options:
    `banger vm prune` (bulk) or per-VM `banger vm stop <name>`.
  * docs/privileges.md:255 and README.md:78-79 — helper unit's
    CapabilityBoundingSet was listed as 5 caps; the actual set in
    commands_system.go:370 is 11 (we added FOWNER/KILL/MKNOD/SETGID/
    SETUID/SYS_CHROOT during Phase B and never updated the docs).
    Updated both lists; the "what's NOT included" rationale stays
    accurate against the new positive list.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 17:30:58 -03:00
4d8dca6b72
image: add banger image cache prune for OCI cache cleanup
OCI layer blobs accumulate forever — every pull writes layers to
~/.cache/banger/oci/blobs/sha256/<hex> via go-containerregistry's
filesystem cache, and nothing ever evicts them. The cache exists purely
to avoid re-pulls (every flattened image is independent of the
blobs that sourced it), so it's a good candidate for an opt-in,
operator-driven prune.

New surface:
  * api: ImageCachePruneParams{DryRun}, ImageCachePruneResult
    {BytesFreed, BlobsFreed, DryRun, CacheDir}.
  * daemon: ImageService.PruneOCICache walks layout.OCICacheDir for
    a (bytes, blobs) tally, then — outside dry-run — atomically
    renames the cache aside, recreates it empty, and rm -rf's the
    aside dir. The rename-then-rm avoids leaving the cache in a
    half-removed state if a pull starts mid-prune (the in-flight
    pull's open files survive the rename via standard Linux
    semantics; it just sees a fresh empty cache afterwards). Missing
    cache dir is treated as zero — fresh installs that have never
    pulled an OCI image don't error.
  * dispatch: image.cache.prune RPC (paramHandler-wrapped, mirroring
    every other image RPC). Documented-methods test list updated.
  * cli: `banger image cache` group with a `prune` subcommand
    (--dry-run flag). Output is a single line: "freed 1.2 GiB
    across 47 blob(s) in /var/cache/banger/oci" or "would free …".
    formatBytes helper for the size pretty-print.

docs/oci-import.md: replaced the "Tech debt: cache eviction" bullet
with a "Cache lifecycle" section describing the new command and
the in-flight-pull caveat.

Tests: PruneOCICache covers the happy path (real prune empties the
cache, recreates an empty dir, doesn't leak the .pruning- aside),
the dry-run path (returns size, leaves blobs intact), and the
fresh-install path (cache dir absent → zero result, no error).
Smoke at JOBS=4 still green; live exercise against an empty cache
on a system install prints the expected zero summary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 16:32:57 -03:00
182bccf8af
roothelper: pin bridge name + IP + CIDR to a banger-managed shape
priv.ensure_bridge / priv.create_tap accepted the daemon's network
config triple (BridgeName, BridgeIP, CIDR) and forwarded it straight
to `ip link` / `ip addr` / `ip link set master`. Argv-style exec
ruled out shell injection, but the kernel happily honours those
commands against any iface a compromised owner-uid daemon names —
including eth0/docker0/lo. Concretely:

  * priv.ensure_bridge could `ip link set <iface> up` against any
    host interface and `ip addr add` arbitrary IP/CIDR to it.
  * priv.create_tap could `ip link set <new-tap> master <iface>`,
    bridging the per-VM tap into the host's primary LAN so the
    guest sees host-local broadcast traffic.
  * priv.sync_resolver_routing / priv.clear_resolver_routing only
    enforced "name shaped like a Linux iface" — no banger constraint.

New validators (single chokepoint via validateNetworkConfig):
  * validateBangerBridgeName: name must equal "br-fc" or start with
    "br-fc-". Stops a compromised daemon from naming any host iface
    in these RPCs. Users with a custom bridge keep the prefix.
  * validateCIDRPrefix: numeric in [8, 32]. Prefix lengths below 8
    would silently widen the bridge subnet beyond what the daemon
    intends.
  * validateNetworkConfig bundles bridge-name + validateIPv4 +
    validateCIDRPrefix so every helper RPC that takes the triple
    stays in lockstep.
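
A rough sketch of what the two new validators might look like, assuming the prefix arrives as a string (the tests mention non-numeric rejection) and that "oversized" refers to the 15-character Linux interface-name limit. The function bodies are illustrative guesses, not the helper's actual implementation.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// defaultBridge is the default bridge name quoted in the commit message.
const defaultBridge = "br-fc"

// validateBangerBridgeName accepts "br-fc" or any name starting with
// "br-fc-", capped at 15 characters (assumed length limit).
func validateBangerBridgeName(name string) error {
	if len(name) > 15 {
		return fmt.Errorf("bridge name %q is longer than 15 characters", name)
	}
	if name == defaultBridge || strings.HasPrefix(name, defaultBridge+"-") {
		return nil
	}
	return fmt.Errorf("bridge name %q is not banger-managed", name)
}

// validateCIDRPrefix accepts a decimal prefix length in [8, 32].
func validateCIDRPrefix(prefix string) error {
	n, err := strconv.Atoi(prefix)
	if err != nil {
		return fmt.Errorf("prefix %q is not numeric", prefix)
	}
	if n < 8 || n > 32 {
		return fmt.Errorf("prefix /%d is outside [8, 32]", n)
	}
	return nil
}

func main() {
	for _, name := range []string{"br-fc", "br-fc-lab", "eth0", "docker0"} {
		fmt.Println(name, validateBangerBridgeName(name))
	}
	for _, p := range []string{"24", "7", "64", "abc"} {
		fmt.Println(p, validateCIDRPrefix(p))
	}
}
```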

Wired into methodEnsureBridge, methodCreateTap, and the resolver-
routing pair (replacing the older validateLinuxIfaceName-only check
with the stricter banger-bridge check).

docs/privileges.md updated: the helper-RPC table rows now spell out
the banger-managed bridge constraint, and the trust list includes
the new validators.

Tests: TestValidateBangerBridgeName (default + suffixed accepted,
host ifaces / wrong prefix / oversized rejected),
TestValidateCIDRPrefix (boundary + non-numeric + IPv6-style 64 rejected),
TestValidateNetworkConfig (happy path + each-field-bad cases).
Smoke at JOBS=4 still green — banger's defaults sail through the
new gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 16:19:28 -03:00
4004ce2e7e
imagecat,kernelcat: bound staged download, hash before extract
Both Fetch flows previously streamed resp.Body straight into
zstd → tar → on-disk extractor with the SHA256 check tacked on at
the END. A bad mirror or an attacker that's compromised the catalog
host could ship a multi-gigabyte tarball, watch banger expand it to
disk, and only THEN see the helpful "sha256 mismatch" message —
having already filled the host filesystem.

Reorder the operations: stage the compressed tarball to a temp file
under the destination directory through an io.LimitReader (cap +1
bytes), hash on the way in, refuse to decompress if either the cap
trips or the SHA mismatches. Worst-case disk use is bounded by the
cap, not by the source.

Cap is exposed as a package var (MaxFetchedBundleBytes,
MaxFetchedKernelBytes) so callers can tune per-deployment and tests
can squeeze it down to provoke the rejection. Default 8 GiB —
generous enough for a 4 GiB rootfs (which compresses to ~1-2 GiB),
tight enough to make a "fill the host disk" attack expensive.

The temp file lives in the destination dir so extraction stays on
the same filesystem and we don't pay for cross-FS rename. defer
os.Remove cleans up; the existing per-package cleanup() handler
still removes any partial extraction on hash mismatch / extraction
failure.

Tests: each package gets a
TestFetchRejectsOversizedTarballBeforeExtraction that sets the cap to
64 bytes, points Fetch at a multi-KB
tarball, and asserts (a) error mentions "cap", (b) destination dir
is left clean (no leaked rootfs / manifest / kernel tree). All
existing tests still pass — happy path, hash mismatch, missing
files, path traversal, HTTP error, etc.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 16:09:55 -03:00
3805b093b4
roothelper: tie kill/signal authorization to banger-launched firecracker
validateFirecrackerPID was a substring check on /proc/<pid>/cmdline:
"contains 'firecracker'". Good enough to refuse init/sshd/the test
binary, but on a shared host where multiple users run firecracker
the helper would happily SIGKILL someone else's VM. The owner-UID
daemon could weaponise the helper as an arbitrary "kill any
firecracker on this box" primitive.

Replace the substring gate with two stronger acceptance modes:

  * Cgroup match (the supported path): /proc/<pid>/cgroup contains
    bangerd-root.service. systemd assigns every direct child of the
    helper unit into that cgroup at fork; the kernel keeps it there
    for the process's lifetime, so no daemon-UID code can forge it.
    Other users' firecracker processes live in different cgroups
    (user@<uid>.service, foreign service slices) and fail this
    check. Also robust across helper restarts: KillMode=control-group
    on the unit kills children when the service goes down, so an
    "orphan banger firecracker in some other cgroup" is rare by
    construction.

  * --api-sock fallback: cmdline carries `--api-sock <path>` with
    the path under banger's RuntimeDir. Covers the legacy direct
    (no-jailer) launch path, and gives daemon reconcile a way to
    clean up the rare orphan that lands outside the service cgroup
    after a hard helper crash.

Tried /proc/<pid>/root first — pivot_root semantics make jailer'd
firecracker read its root as "/" from any namespace, so the symlink
is useless as a banger-managed fingerprint. Cgroup is the right
signal.
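
As an illustration of the cgroup match, the sketch below parses the contents of a /proc/<pid>/cgroup file and looks for the helper unit as a path component. `cgroupHasUnit` is a hypothetical name, and matching on whole path components is an assumption that is slightly stricter than the "contains" wording above.

```go
package main

import (
	"fmt"
	"strings"
)

// cgroupHasUnit reports whether any line of a /proc/<pid>/cgroup dump
// (format "hierarchy-ID:controllers:path") names unit as a path component.
func cgroupHasUnit(cgroupData, unit string) bool {
	for _, line := range strings.Split(cgroupData, "\n") {
		parts := strings.SplitN(line, ":", 3)
		if len(parts) != 3 {
			continue // blank or malformed line
		}
		for _, comp := range strings.Split(parts[2], "/") {
			if comp == unit {
				return true
			}
		}
	}
	return false
}

func main() {
	ours := "0::/system.slice/bangerd-root.service\n"
	theirs := "0::/user.slice/user-1000.slice/user@1000.service/app.slice\n"
	fmt.Println(cgroupHasUnit(ours, "bangerd-root.service"))
	fmt.Println(cgroupHasUnit(theirs, "bangerd-root.service"))
}
```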

Also added a signal allowlist: priv.signal_process now rejects
anything outside {TERM, KILL, INT, HUP, QUIT, USR1, USR2, ABRT}
(case-insensitive, with or without SIG prefix). STOP/CONT, real-time
signals, and numeric forms are refused — the helper running as root
must not be a generic "send arbitrary signal to my pid" primitive.
priv.kill_process is unaffected (it always sends KILL).
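
A minimal sketch of such an allowlist, assuming names are normalised by upper-casing and stripping an optional SIG prefix. The function name matches the one mentioned in the tests, but the body is an illustrative guess.

```go
package main

import (
	"fmt"
	"strings"
)

// allowedSignals mirrors the set listed in the commit message.
var allowedSignals = map[string]struct{}{
	"TERM": {}, "KILL": {}, "INT": {}, "HUP": {},
	"QUIT": {}, "USR1": {}, "USR2": {}, "ABRT": {},
}

// validateSignalName accepts allowlisted names case-insensitively, with
// or without a "SIG" prefix. Numeric forms never match the map.
func validateSignalName(name string) error {
	s := strings.TrimPrefix(strings.ToUpper(strings.TrimSpace(name)), "SIG")
	if _, ok := allowedSignals[s]; !ok {
		return fmt.Errorf("signal %q is not allowed", name)
	}
	return nil
}

func main() {
	for _, name := range []string{"TERM", "sigkill", "STOP", "9", "RTMIN"} {
		fmt.Println(name, validateSignalName(name))
	}
}
```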

Tests: validateSignalName covers allowlist + numeric/STOP/RTMIN
rejection; extractFirecrackerAPISock pins the three flag forms
(--api-sock VAL, --api-sock=VAL, -a VAL); pathIsUnder gets a small
table; existing TestValidateFirecrackerPID still rejects PID 0,
PID 1, and the test process itself. Doctor's non-system-mode test
gained a t.TempDir-backed install path so it stops being
environment-dependent on machines that happen to have
/etc/banger/install.toml.

Smoke at JOBS=4 still green — every banger-launched firecracker
sails through the cgroup match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 16:00:41 -03:00
4a56e6c7d6
roothelper: walk validateManagedPath components, reject symlinks
validateManagedPath was textual-only: filepath.Clean + dest-prefix
match. That stopped `..` escapes but not the symlink-bypass attack
that motivated this fix — a daemon-UID attacker can write into
StateDir/RuntimeDir (it's their UID), so they can plant
`<StateDir>/redirect -> /etc` and any helper RPC that then operates
on `<StateDir>/redirect/...` resolves through the symlink at the
kernel and lands at /etc/... on the host.

Concretely the leaks this closed:
  * priv.create_dm_snapshot: rootfs/cow paths fed to losetup —
    losetup follows the symlink and attaches a host block device.
  * priv.launch_firecracker: kernel/initrd paths hard-linked into
    the chroot via `ln -f`. link(2) resolves symlinked directories
    in the source path, hard-linking host files into the jail.
  * priv.read_ext4_file / priv.write_ext4_files: image paths fed
    to debugfs / e2cp as root.
  * validateLaunchDrivePath: drive paths mknod'd or hard-linked.
  * validateJailerOpts: chroot base.

Fix: after the existing prefix match, walk every component below
the matched root with Lstat. Any existing symlink — leaf or
intermediate — fails the validator. ENOENT is tolerated because
several callers pass paths firecracker/the helper materialise
later (sockets, log files, kernel hard-link targets); whoever
materialises them goes through the same validation when the
helper-side primitive runs.

Subsumes most of validateNotSymlink's coverage but the explicit
call sites (methodEnsureSocketAccess, methodCleanupJailerChroot)
keep their belt-and-braces check — those paths must EXIST and
not be symlinks, which validateNotSymlink enforces strictly while
the broadened validateManagedPath tolerates ENOENT.

Race-free in practice: helper RPCs are short and the validator
fires on the same kernel state the next syscall sees. The helper
loop processes RPCs serially per-connection, and the validator
plus the syscall both run as root within microseconds of each
other.

Four new tests cover symlink leaf, symlink intermediate, missing
leaf (must pass), and the plain happy path. Smoke at JOBS=4 still
green — every legitimate daemon-supplied path passes the walk.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 15:26:56 -03:00
0a079277ef
imagepull: reject symlink ancestors during OCI flatten
safeJoin previously did textual cleaning + dest-prefix check only.
That's enough to catch `../escape`, but not the symlink-ancestor
attack: a malicious OCI layer plants `etc -> /tmp/probe`, a later
layer writes/deletes/hardlinks against `etc/anything`, and the kernel
silently dereferences the symlink so the operation lands at
`/tmp/probe/anything` on the host.

The daemon runs flatten as the owner UID, so anywhere that UID can
write becomes a write target; anywhere it can delete (e.g. its own
home) becomes a delete target. Whiteouts and hardlinks make this
worse — a whiteout for `etc/.wh.victim` would `RemoveAll` the host
file `/tmp/probe/victim`, and a TypeLink would expose host files
inside the extracted rootfs.

safeJoin now Lstat-walks every intermediate component of the joined
path against the already-extracted tree, refusing if any ancestor is
a symlink. Walking is race-free against the extraction loop because
we process tar entries serially. Leaf components stay caller-owned
(TypeSymlink writes legitimately want a symlink leaf; TypeReg
RemoveAll's any prior leaf before opening; etc.).

Three new tests pin the protection: write through a symlinked
ancestor, whiteout through a symlinked ancestor, and hardlink target
through a symlinked ancestor — each must fail and leave the host
probe path untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 15:20:46 -03:00
8bfa525568
test: cover imagemgr + dmsnap helpers
Both packages had zero tests before this change. The helpers in them
are pure (imagemgr) or scripted-runner-friendly (dmsnap), so they're
cheap to pin and worth catching regressions on.

imagemgr/paths_test.go:
  * DebianBasePackages returns a defensive copy (mutating the result
    can't poison subsequent calls — important because hashPackages
    digests this list).
  * BuildMetadataPackages stays in lockstep with DebianBasePackages.
  * hashPackages is order-sensitive and includes a trailing newline
    in its canonical join (regression guard for any future "sort the
    list before hashing" temptation that would invalidate every
    on-disk hash).
  * StageOptionalArtifactPath returns "" for empty/whitespace input
    and joins by name otherwise.
  * WritePackagesMetadata writes <rootfs>.packages.sha256 with the
    expected hash, no-ops on empty rootfs path or empty package list.
  * DebianBasePackages contains the small critical-package floor
    (ca-certificates, curl, git) so a future apt-list trim can't
    silently drop them.

dmsnap/dmsnap_test.go:
  * Create runs losetup base, losetup cow, blockdev getsz, dmsetup
    create in that order, with a snapshot table referencing the loops
    in (base, cow) order — a swap would corrupt every VM.
  * Create's failure path unwinds with losetup -d on cow then base.
  * Cleanup tears down dmsetup before losetup (otherwise dmsetup sees
    EBUSY against vanished backing devices).
  * Cleanup falls back to DMDev when DMName is empty.
  * Cleanup tolerates "No such device" on losetup -d (idempotent
    re-run after a partial cleanup).
  * Cleanup surfaces non-missing losetup errors (the tolerance is
    narrow on purpose).
  * Remove returns nil on a missing target and surfaces non-retryable
    errors immediately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 15:13:49 -03:00
45826f0db0
docs: add config.md reference for the daemon TOML schema
README previously punted on the config schema with a "full key list in
internal/config/config.go" pointer. New docs/config.md walks every
TOML key the daemon reads — top-level, [vm_defaults], [[file_sync]] —
with type, default, and a one-sentence description per row, plus a
copy-pasteable example at the bottom.

Sourced 1:1 from internal/config/config.go's fileConfig (and the
defaults in load() + internal/model/types.go), so it stays accurate
as long as those structs are the schema source of truth.

README's existing config section now points at docs/config.md, and
the "Further reading" list gets it as the first bullet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 15:11:18 -03:00
7d7c15a370
docs: fix config-file path in privileges.md
The filesystem-mutations table referred to `~/.config/banger/banger.toml`,
but the daemon reads `~/.config/banger/config.toml` (per
internal/config/config.go and README.md). Bring privileges.md in line.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 15:11:06 -03:00
0c77b042ed
build: add pre-commit hook gating lint + test + build
`.githooks/pre-commit` runs `make lint test build` on every commit,
catching unformatted Go (`gofmt -l`), `go vet` regressions, shellcheck
errors on scripts/, broken unit tests, and broken builds before they
reach the index. Activate per-clone with `make install-hooks`, which
points `core.hooksPath` at `.githooks/`. Bypass for in-flight WIP
commits with `git commit --no-verify`.

The hook directory is tracked in git (unlike .git/hooks/) so a clone
+ `make install-hooks` is enough to opt in; no per-machine
hand-installation. .PHONY and the help line both list the new
target.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 15:08:41 -03:00
6b4e1922b0
model: gofmt VMRecord struct alignment
Stats and Workspace fields landed in 6b543cb with column alignment
that gofmt wants to pull tighter; rerun gofmt so the new pre-commit
hook's `gofmt -l` gate passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 15:08:12 -03:00
3e6d0cee89
doctor: surface security-posture drift in banger doctor
`docs/privileges.md` now documents what the install promises (helper +
daemon services active, sockets at 0600 ownerUID, units carrying the
hardening directives, firecracker root-owned + non-writable). Doctor
verifies the running install matches: drift between the doc and the
filesystem would silently weaken the trust model otherwise.

In system mode (install.toml present):
  * helper service / owner daemon service: `systemctl is-active`.
  * helper socket / daemon socket: stat-and-compare mode + uid against
    the registered owner.
  * helper unit hardening / daemon unit hardening: scan the rendered
    unit for NoNewPrivileges, ProtectSystem=strict, ProtectHome
    (=yes for the helper, =read-only for the daemon), RestrictSUIDSGID,
    LockPersonality, and the helper's CapabilityBoundingSet line. The
    daemon unit also pins User=<registered owner>.
  * firecracker binary ownership: regular file, not a symlink, mode
    not group/world writable, executable, owned by uid 0 — same
    constraints validateRootExecutable enforces at launch, surfaced
    once at doctor time so a misconfigured binary fails fast with a
    clearer error than the helper's open-time rejection.

In non-system mode (no /etc/banger/install.toml) doctor emits a single
WARN row pointing at docs/privileges.md > 'Running outside the system
install'. A PASS would imply guarantees the install isn't actually
providing.

Tests cover both branches: the non-system warn pins its message
substrings; system-mode pins that every check name shows up; and the
helpers (socket-perms, unit-hardening, executable-ownership) have
direct table-style negative tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 14:58:34 -03:00
853249dec2
roothelper: tighten input validation across privileged RPCs
Defence-in-depth pass over every helper method that touches the host
as root. Each fix narrows what a compromised owner-uid daemon could
ask the helper to do; many close concrete file-ownership and DoS
primitives that the previous validators didn't reach.

Path / identifier validation:
  * priv.fsck_snapshot now requires /dev/mapper/fc-rootfs-* (was
    "is the string non-empty"). e2fsck -fy on /dev/sda1 was the
    motivating exploit.
  * priv.kill_process and priv.signal_process now read
    /proc/<pid>/cmdline and require a "firecracker" substring before
    sending the signal. Killing arbitrary host PIDs (sshd, init, …)
    is no longer a one-RPC primitive.
  * priv.read_ext4_file and priv.write_ext4_files now require the
    image path to live under StateDir or be /dev/mapper/fc-rootfs-*.
  * priv.cleanup_dm_snapshot validates every non-empty Handles field:
    DM name fc-rootfs-*, DM device /dev/mapper/fc-rootfs-*, loops
    /dev/loopN.
  * priv.remove_dm_snapshot accepts only fc-rootfs-* names or
    /dev/mapper/fc-rootfs-* paths.
  * priv.ensure_nat now requires a parsable IPv4 address and a
    banger-prefixed tap.
  * priv.sync_resolver_routing and priv.clear_resolver_routing now
    require a Linux iface-name-shaped bridge name (1–15 chars, no
    whitespace/'/'/':') and, for sync, a parsable resolver address.
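
To make the shape of these gates concrete, here is a small sketch of the device-mapper and loop-device checks. The function names and the exact matching rules (non-empty suffix, no slashes or whitespace, digits-only loop index) are assumptions for illustration. Only the `fc-rootfs-` prefix, `/dev/mapper/`, and `/dev/loopN` shapes come from the commit message.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

const dmPrefix = "fc-rootfs-"

var loopDevRE = regexp.MustCompile(`^/dev/loop[0-9]+$`)

// validateDMName accepts names of the form fc-rootfs-<suffix>.
func validateDMName(name string) error {
	suffix := strings.TrimPrefix(name, dmPrefix)
	if suffix == name || suffix == "" || strings.ContainsAny(suffix, "/ \t\n") {
		return fmt.Errorf("DM name %q is not banger-managed", name)
	}
	return nil
}

// validateDMDevice accepts /dev/mapper/fc-rootfs-<suffix> paths.
func validateDMDevice(path string) error {
	name := strings.TrimPrefix(path, "/dev/mapper/")
	if name == path {
		return fmt.Errorf("%q is not under /dev/mapper", path)
	}
	return validateDMName(name)
}

// validateLoopDevice accepts /dev/loopN with a decimal N.
func validateLoopDevice(path string) error {
	if !loopDevRE.MatchString(path) {
		return fmt.Errorf("%q is not a loop device path", path)
	}
	return nil
}

func main() {
	fmt.Println(validateDMDevice("/dev/mapper/fc-rootfs-vm1"))
	fmt.Println(validateDMDevice("/dev/sda1"))
	fmt.Println(validateLoopDevice("/dev/loop7"))
}
```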

Symlink defence:
  * priv.ensure_socket_access now validates the socket path is under
    RuntimeDir and not a symlink. The fcproc layer's chown/chmod
    moves to unix.Open(O_PATH|O_NOFOLLOW) + Fchownat(AT_EMPTY_PATH)
    + Fchmodat via /proc/self/fd, so even a swap of the leaf into a
    symlink between validation and the syscall is refused. The
    local-priv (non-root) fallback uses `chown -h`.
  * priv.cleanup_jailer_chroot rejects symlinks at both the leaf
    (os.Lstat) and intermediate path components (filepath.EvalSymlinks
    + clean-equality). The umount sweep was rewritten from shell
    `umount --recursive --lazy` to direct unix.Unmount(MNT_DETACH |
    UMOUNT_NOFOLLOW) per child mount, deepest-first; the findmnt
    guard remains as the rm-rf safety net. Local-priv mode falls
    back to `sudo umount --lazy`.

Binary validation:
  * validateRootExecutable now opens with O_PATH|O_NOFOLLOW and
    Fstats through the resulting fd. Rejects path-level symlinks and
    narrows the TOCTOU window between validation and the SDK's exec
    to fork+exec time on a healthy host.

Daemon socket:
  * The owner daemon now reads SO_PEERCRED on every accepted
    connection and refuses any UID that isn't 0 or the registered
    owner. Filesystem perms (0600 + ownerUID) already enforced this;
    the check is belt-and-braces in case the socket FD is ever
    leaked to a non-owner process.

Docs:
  * docs/privileges.md walked end-to-end. Each helper RPC's
    Validation gate row reflects what the code actually enforces.
    New section "Running outside the system install" calls out the
    looser dev-mode trust model (NOPASSWD sudoers, helper hardening
    bypassed) so users don't deploy that path on shared hosts.
    Trust list updated to include every new validator.

Tests added: validators (DM-loop, DM-remove-target, DM-handles,
ext4-image-path, iface-name, IPv4, resolver-addr, not-symlink,
firecracker-PID, root-executable variants), the daemon's authorize
path (non-unix conn rejection + unix conn happy path), the umount2
ordering contract (deepest-first + --lazy on the sudo branch), and
positive/negative cases for the chown-no-follow fallback.

Verified end-to-end via `make smoke JOBS=4` on a KVM host.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 14:39:41 -03:00
6b543cb17f
firecracker: adopt firecracker-jailer for VM launch (Phase B)
Each VM's firecracker now runs inside a per-VM chroot dropped to the
registered owner UID via firecracker-jailer. Closes the broad ambient-
sudo escalation surface that survived Phase A: the helper still needs
caps for tap/bridge/dm/loop/iptables, but the VMM itself no longer
runs as root in the host root filesystem.

The host helper stages each chroot up front: hard-links the kernel
and (optional) initrd, mknods block-device drives + /dev/vhost-vsock,
copies in the firecracker binary (jailer opens it O_RDWR so a ro bind
fails with EROFS), and bind-mounts /usr/lib + /lib trees read-only so
the dynamic linker can resolve. Self-binds the chroot first so the
findmnt-guarded cleanup can recurse safely.

AF_UNIX sun_path is 108 bytes; the chroot path easily blows past that.
Daemon-side launch pre-symlinks the short request socket path to the
long chroot socket before Machine.Start so the SDK's poll/connect
sees the short path while the kernel resolves to the chroot socket.
--new-pid-ns is intentionally disabled — jailer's PID-namespace fork
makes the SDK see the parent exit and tear the API socket down too
early.

CapabilityBoundingSet for the helper expands to add CAP_FOWNER,
CAP_KILL, CAP_MKNOD, CAP_SETGID, CAP_SETUID, CAP_SYS_CHROOT alongside
the existing CAP_CHOWN/CAP_DAC_OVERRIDE/CAP_NET_ADMIN/CAP_NET_RAW/
CAP_SYS_ADMIN.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-28 14:38:07 -03:00
d73efe6fbc
firecracker: drop sudo sh -c, race chown against SDK probe in Go
Replace the shell-string launcher in buildProcessRunner with a direct
exec.Command. The previous sh -c wrapper relied on shellQuote escaping
for every MachineConfig field that flowed into the launch script; any
future field that ever carried an attacker-controlled value would have
become RCE-as-root. The new path passes binary path and flags as
separate argv entries, so there is no shell to interpret anything.

The wrapper also did two things the shell can no longer do for us:

  1. umask 077 — moved to syscall.Umask in cmd/bangerd/main.go so every
     firecracker child (and any other file the daemon creates) inherits
     0600 by default. Single-user dev sandbox state should be private.

  2. chown_watcher — the SDK's HTTP probe inside Machine.Start connects
     to the API socket the moment it appears. Under sudo the socket is
     created root-owned and the daemon's connect(2) gets EACCES, so the
     post-Start EnsureSocketAccess never runs. The shell papered over
     this with a backgrounded chown loop. Replaced by
     fcproc.EnsureSocketAccessForAsync: same race-window guarantee, in
     pure Go, kicked off in LaunchFirecracker right before Start and
     awaited right after.

Tests updated: shell-substring assertions replaced with cmd-arg
assertions, plus a new fcproc test pinning the async chown sequence.
Smoke (full systemd two-service install + KVM scenarios) passes.
2026-04-27 20:14:01 -03:00
c4e1cb5953
daemon: tighten concurrency around pulls, cleanup, and handle persistence
Four targeted fixes from a race-condition audit of the daemon package.
None change behaviour on the happy path; each closes a window where a
concurrent or interrupted RPC could strand state on the host.

  - KernelDelete now holds the same per-name lock as KernelPull /
    readOrAutoPullKernel. Without it, a delete racing a concurrent
    pull could remove files mid-write or land between the pull's
    manifest write and its first use.

  - cleanupRuntime no longer early-returns on an inner waitForExit
    failure; DM snapshot, capability, and tap teardown always run and
    every error is folded into the returned errors.Join. EBUSY against
    a still-alive firecracker is benign and surfaces in the joined
    error rather than stranding kernel state across daemon restarts.

  - Per-name image / kernel pull locks switch from *sync.Mutex to a
    1-buffered chan struct{}. Acquire is a select on ctx.Done(), so a
    peer waiting behind a pull whose RPC was cancelled can bail out
    instead of blocking forever on a pull nobody is consuming.

  - setVMHandles writes the per-VM scratch file before updating the
    in-memory cache. A daemon crash between the two now leaves disk
    ahead of memory (recoverable: reconcile re-seeds the cache from
    the file on next start) rather than memory ahead of disk (lost
    handles → stranded DM/loops/tap).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 19:32:43 -03:00
777b597a1e
smoke: smol VMs by default + JOBS auto-detects nproc
Three quality-of-life improvements now that the daemon-side races
that gated parallel mode are fixed:

1. **Smol VMs by default.** Smoke installs a tuned config.toml at
   /etc/banger/config.toml between `system install` and `system
   restart` so the respawned daemon picks up:
       vcpu = 2
       memory_mib = 1024
       disk_size = "2G"
       system_overlay_size = "2G"
   Smoke scenarios assert behavior, not capacity — they don't need
   4 vCPU / 8 GiB / 8 GiB / 8 GiB. Per-VM RAM cost drops from 8 GiB
   to 1 GiB; nominal disk drops from 16 GiB to 4 GiB (sparse, so
   actual use is small either way, but the new ceiling is gentler
   on hosts that can't overcommit). Scenarios that test
   reconfiguration (vm_set's --vcpu 2 → 4) still pass --vcpu
   explicitly, so this default doesn't perturb their assertions.

2. **JOBS defaults to nproc.** The Makefile resolves JOBS to
   `$(shell nproc)` if unset; the smoke script's existing cap of 8
   keeps the parallel pool sane on bigger hosts. The script always
   passes --jobs N now, so behavior is consistent. Override with
   `make smoke JOBS=1` for a fully serial run.

3. **Help text catches up.** --help no longer flags parallelism as
   experimental (the underlying daemon races are fixed) and now
   describes the small-VM default. `make help` mentions the new
   default and how to override.

Verified: `make smoke` (no JOBS) on a 32-core box auto-runs with
JOBS=8, smol VMs, 21/21 PASS in 172s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 17:36:17 -03:00
72882e45d7
daemon: serialise concurrent image/kernel pulls + atomic-rename seed refresh
Three concurrency bugs surfaced by `make smoke JOBS=4` that all stem
from `vm.create` paths assuming single-caller semantics:

1. **Kernel auto-pull manifest race.** Parallel `vm.create` calls that
   each need to auto-pull the same kernel ref both run kernelcat.Fetch
   in parallel against the same /var/lib/banger/kernels/<name>/. Fetch
   writes manifest.json non-atomically (truncate + write); the peer
   reads it back mid-write and trips
   "parse manifest for X: unexpected end of JSON input".

   Fix: per-name `sync.Mutex` map on `ImageService` (kernelPullLock).
   `KernelPull` and `readOrAutoPullKernel` both acquire it and re-check
   `kernelcat.ReadLocal` after the lock so a peer who finished while we
   waited is treated as success — `readOrAutoPullKernel` does NOT call
   `s.KernelPull` because that path errors with "already pulled" on a
   peer-success, which would be wrong for auto-pull. Different kernels
   stay parallel.

2. **Image auto-pull race.** Same shape as the kernel race but on the
   image side: parallel `vm.create` calls both run pullFromBundle /
   pullFromOCI for the missing image (each ~minutes of OCI fetch +
   ext4 build). The publishImage atom under imageOpsMu only protects
   the rename + UpsertImage commit, so the loser does all the work
   only to fail at the recheck with "image already exists".

   Fix: per-name `sync.Mutex` map on `ImageService` (imagePullLock).
   `findOrAutoPullImage` acquires it, re-checks FindImage, and only
   then calls PullImage. Loser short-circuits with the
   freshly-published image instead of redoing minutes of work.
   PullImage's own publishImage recheck stays as defense-in-depth
   for callers that bypass the auto-pull path.

3. **Work-seed refresh race.** When the host's SSH key has rotated
   since an image was last refreshed, `ensureAuthorizedKeyOnWorkDisk`
   triggers `refreshManagedWorkSeedFingerprint`, which rewrote the
   shared work-seed.ext4 in place via e2rm + e2cp. Peer `vm.create`
   calls doing parallel `MaterializeWorkDisk` rdumps observed a torn
   ext4 image — "Superblock checksum does not match superblock".

   Fix: stage the rewrite on a sibling tmpfile (`<seed>.refresh.<pid>-<ns>.tmp`)
   and atomic-rename. Concurrent readers either have the file open
   (kernel keeps the pre-rename inode alive) or open after the rename
   (see the new inode) — never observe a partial state. Two parallel
   refreshes are idempotent (same daemon, same SSH key) so unique tmp
   names are enough; whichever rename lands last wins, with identical
   content. UpsertImage runs after the rename so the recorded
   fingerprint always matches what's on disk.

Plus one smoke harness fix: reclassify `vm_prune` from `pure` to
`global`. `vm prune -f` removes ALL stopped VMs system-wide, not just
the ones the scenario created — so a parallel peer scenario that
happens to have its VM in `created`/`stopped` momentarily gets wiped.
Moving prune to the post-pool serial phase keeps it from racing with
in-flight scenarios.

After all four fixes, `make smoke JOBS=4` passes 21/21 in 174s
(serial baseline 141s; the small overhead is the buffered-output and
`wait -n` semaphore cost — well worth the parallelism for fast-iter
work on a 32-core box).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 17:24:11 -03:00
115eec8576
smoke: discoverable scenarios + selectable runs + parallel dispatch
`scripts/smoke.sh` was a 600-line linear script: no way to see what it
covers without reading the whole thing, and no way to run a single
scenario when iterating. Every iteration paid the full ~5-10 min suite,
which made fast feedback loops painful enough to avoid the suite.

Refactor into a registry + per-scenario functions:

- Top-of-file SMOKE_SCENARIOS (ordered) + SMOKE_DESCS (one-line desc per
  scenario) + SMOKE_CLASS (pure / repodir / global) drive both listing
  and dispatch. The 21 existing scenario blocks become scenario_<name>
  functions. Bodies are the inline blocks verbatim, modulo the workspace
  fixture move described below.
- New CLI: --list (cheap discovery, no install / no env-vars),
  --scenario NAME (or NAME,NAME,...), --jobs N (parallel dispatch),
  -h / --help.
- New setup_fixtures runs once after the install/doctor/restart preamble
  and produces the throwaway git repo at $repodir that 'repodir'-class
  scenarios consume. Lifted out of scenario_workspace_run so single-
  scenario invocations (e.g. --scenario workspace_dryrun) get the
  fixture even when the scenario that historically built it isn't
  selected.
- Wipe ~/.local/state/banger/ssh/known_hosts in the install preamble.
  `system uninstall --purge` clears /var/lib/banger but the user-side
  known_hosts persists by design — and smoke creates VMs that reuse
  guest IPs (172.16.0.2 etc.) with fresh host keys every run, so a
  leftover entry trips StrictHostKeyChecking and the daemon's wait-
  for-ssh sees only timeouts. This was the real cause of the "guest
  ssh did not come up" flakes that surface across smoke iterations.

Parallel dispatch:

- --jobs N opts into a slot-limited pool: 'pure' scenarios fan out as
  individual jobs; 'repodir' scenarios fuse into a single serial chain
  (since they mutate $repodir in registry order); 'global' scenarios
  run serially after the pool, one at a time.
- Cap is min(N, 8) — each parallel slot runs an 8 GiB VM, so RAM is
  the binding constraint.
- Parallel-mode stdout/stderr per scenario buffer to per-scenario
  logs and emit one PASS/FAIL line on completion; on FAIL the buffer
  is dumped. Serial mode (--jobs 1, the default) keeps stdout
  unbuffered exactly as before.
- Parallelism is documented as experimental in --help: it surfaces
  real daemon-side concurrency bugs (image auto-pull manifest race,
  work-seed-refresh race on the shared work-seed.ext4) that don't
  appear in serial mode and that need their own fix in the daemon.
  Serial (--jobs 1) is the reliable path; --jobs N is for fast-
  iteration dev work where occasional re-runs are acceptable.

Exit codes: 0 ok, 1 assertion failed, 2 usage error (unknown
scenario, missing SCENARIO=), 77 explicit selection skipped (NAT
when sudo iptables is unavailable AND nat is the only selected
scenario; soft-skip otherwise).

Makefile additions:

- `make smoke-list` — cheap discovery, no smoke-build dep, no env vars.
- `make smoke-one SCENARIO=name` — single-scenario run, full preamble.
  MAKECMDGOALS guard catches missing SCENARIO= before any rebuild.
- `make smoke JOBS=N` — passes through to the script's --jobs N.
- Help text covers all three.

Verified: serial full suite passes 21/21 in ~140s on this host;
make smoke-one SCENARIO=workspace_restart runs the recently-added
regression test alone in ~50s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 16:56:57 -03:00
c9358ab390
daemon: sync guest over ssh before stop to preserve workspace writes
VM stop has been quietly losing data freshly written via
`vm workspace prepare`: stop+start of a workspace-prepared VM would
come back with /root/repo wiped on the work disk.

Root cause is firecracker + Debian's systemd defaults. FC's
SendCtrlAltDel (the only "graceful shutdown" action FC exposes) just
delivers the keystroke; what the guest does with it is its choice.
Debian routes ctrl-alt-del.target -> reboot.target, so the guest
reboots, FC stays alive, the daemon's 10s wait_for_exit window
expires, and the SIGKILL fallback drops anything still in FC's
userspace I/O path. For an idle VM that's invisible. For one that
just took hundreds of small writes through a workspace prepare, it's
data loss.

Fix is to dial the guest over SSH inside StopVM and run
`sync; systemctl --no-block poweroff || /sbin/poweroff -f &` before
the existing SendCtrlAltDel path. The synchronous `sync` is the
load-bearing piece — it blocks until every dirty page hits virtio-blk
and lands in the on-host root.ext4. Whether poweroff completes
before SIGKILL fires is incidental; sync has already run. SSH
unreachable falls back to the old SendCtrlAltDel behaviour so a
broken-network guest can't make stop hang.

Bounded by a 5s SSH-dial timeout so a half-broken guest can't extend
the overall stop window past gracefulShutdownWait.

Also adds two smoke scenarios:
- `workspace + stop/start`: prepare -> stop -> start -> assert
  marker survives. This is the regression test that caught the bug.
- `vm exec`: end-to-end coverage for d59425a — auto-cd into the
  prepared workspace, exit-code propagation, dirty-host warning,
  --auto-prepare resync, refusal on stopped VM.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 15:41:32 -03:00
d59425adb9
feat(vm): add vm exec command with workspace dirty detection
Introduces three interconnected features for persistent VM workflows:

1. `banger vm exec <vm> -- <cmd>`: runs a command in the prepared
   workspace, automatically cd-ing into the guest path and wrapping
   via `mise exec --` so mise-managed tools are on PATH. Falls back
   to a plain exec when mise isn't available. Exit code propagates
   verbatim.

2. Workspace persistence: workspace.prepare now stores the guest path,
   host source path, and HEAD commit into a new `workspace_json` column
   on the vms table (migration 3). This state survives daemon restarts
   and informs both dirty-checking and auto-prepare.

3. Dirty detection: `vm exec` compares the stored HEAD commit against
   the current host repo HEAD. When stale it warns and, with
   --auto-prepare, re-syncs the workspace before running.
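
A small sketch of the persisted state and the staleness check. The
struct, its JSON field names, and `checkDirty` are illustrative
assumptions; the message only says the column stores the guest path,
host source path, and HEAD commit.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// workspaceState is a guess at what workspace_json might hold.
type workspaceState struct {
	GuestPath  string `json:"guest_path"`
	SourcePath string `json:"source_path"`
	HeadCommit string `json:"head_commit"`
}

// encodeState serialises the state for storage in a text column.
func encodeState(ws workspaceState) (string, error) {
	b, err := json.Marshal(ws)
	return string(b), err
}

// decodeState parses a stored column value back into the struct.
func decodeState(raw string) (workspaceState, error) {
	var ws workspaceState
	err := json.Unmarshal([]byte(raw), &ws)
	return ws, err
}

// checkDirty compares the stored HEAD with the host's current HEAD.
// stale means a warning is due; resync means the workspace should be
// re-prepared first, which only happens when autoPrepare is set.
func checkDirty(stored workspaceState, hostHead string, autoPrepare bool) (stale, resync bool) {
	stale = stored.HeadCommit != hostHead
	return stale, stale && autoPrepare
}

func main() {
	raw, err := encodeState(workspaceState{
		GuestPath:  "/root/repo",
		SourcePath: "/home/me/repo",
		HeadCommit: "abc123",
	})
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	ws, err := decodeState(raw)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	stale, resync := checkDirty(ws, "def456", true)
	fmt.Println(raw, stale, resync)
}
```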

Also:
- WORKSPACE column added to `banger ps` / `vm list`
- `banger vm` quick reference updated with `vm exec` entry
2026-04-26 23:53:45 -03:00
c8637b0fe4
daemon: auto-trust mise configs on workspace prepare
vm run ./repo (and the explicit vm workspace prepare) imports the
host user's own checkout. Any .mise.toml that lands in the guest
would otherwise prompt on the first guest command — 'mise trust:
hash mismatch, run "mise trust"' — and stall what should be a
zero-friction sandbox launch. The repo just came from the host,
the guest is single-tenant root@<vm>.vm, the user already trusts
this checkout: auto-trust is the right default here.

After workspaceImportHook succeeds, run
  if command -v mise >/dev/null 2>&1; then
    mise trust --quiet --all <guest_path> || true
  fi
inside the guest. Best effort: a missing mise binary, a non-zero
exit, or a no-op trust all log at debug only and never fail
prepare. The path is shell-quoted via ws.ShellQuote so guest
paths with spaces or quotes don't break the argument.

Tests pin the script shape (command -v guard + --quiet --all flag
+ trailing `|| true`) and assert the script actually fires after
a successful import. A path with an apostrophe round-trips via
ws.ShellQuote without truncation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 23:08:41 -03:00
fa4292756d
daemon: surface previously-swallowed errors at warn
Three recovery-path errors were silently dropped:

- vm_lifecycle.go startVMLocked persisted the VMStateError record
  with `_ = s.store.UpsertVM(...)`. If the persist failed the user
  saw the original start error but operators had no way to find
  out the store had also drifted out of sync.
- vm_lifecycle.go deleteVMLocked killed the firecracker process
  with `_ = s.net.killVMProcess(...)`. cleanupRuntime tears it
  down regardless, so the explicit kill is best-effort, but a
  permission-denied / EPERM was still worth logging.
- capabilities.go cleanupPreparedCapabilities collected per-cap
  errors with errors.Join. Callers got the aggregated value but
  couldn't tell which capability failed when more than one did.

All three now log Warn before the original behaviour continues.
The aggregate return value, control flow, and user-visible error
strings are unchanged — this is purely a "less silence in the
journal" pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 22:30:51 -03:00
71a332a6a1
cli: maturity polish — color, error translation, tabwriter consistency
Adds three small but high-leverage presentation tweaks for v0.1:

1. internal/cli/style is a new ~70 LOC package with Pass/Fail/Warn/
   Dim/Bold helpers. Each is TTY-gated and obeys NO_COLOR. No
   external dep. Wired into the doctor PASS/FAIL/WARN status, the
   "banger:" error prefix on stderr, and the dim 'ready in <elapsed>'
   line.
2. internal/cli/errors translates rpc.ErrorResponse into user-facing
   text. operation_failed becomes invisible (the message wins);
   not_found, already_exists, bad_request, bad_version, unauthorized,
   unknown_method get short labels; unknown codes pass through. The
   daemon-attached op_id lands in dim parens — paste into
   journalctl --grep to find the daemon log line that produced the
   failure.
3. Tabwriter config converges on (0, 8, 2, ' ', 0) across every
   list/table command. The vm prune confirmation table picked up the
   right config; system install + system status switched from bare
   "key: value\n" lines to tabular form. printVMSpecLine drops its
   Unicode middle dot for an ASCII '|' so terminals without UTF-8
   render cleanly.
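
As a quick illustration of the shared tabwriter config, the sketch
below renders key/value rows with (0, 8, 2, ' ', 0). `renderKV` and
the row contents are made up for the example.

```go
package main

import (
	"fmt"
	"strings"
	"text/tabwriter"
)

// renderKV lays out key/value rows with minwidth 0, tabwidth 8,
// padding 2, space padding, and no flags.
func renderKV(rows [][2]string) (string, error) {
	var sb strings.Builder
	tw := tabwriter.NewWriter(&sb, 0, 8, 2, ' ', 0)
	for _, r := range rows {
		if _, err := fmt.Fprintf(tw, "%s\t%s\n", r[0], r[1]); err != nil {
			return "", err
		}
	}
	// Flush is required: tabwriter buffers until column widths are known.
	if err := tw.Flush(); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	out, err := renderKV([][2]string{{"state", "running"}, {"pid", "4242"}})
	if err != nil {
		fmt.Println("render failed:", err)
		return
	}
	fmt.Print(out)
}
```

The key column is padded to its widest cell plus two spaces, so the
values line up without any hand-counted spacing.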

Tests cover translateRPCError for every code and check that the
style helpers no-op on non-TTY output and under NO_COLOR. Smoke
status greps switch from "key: value" to "key   value" to match
the new format.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 22:27:07 -03:00
e47b8146dc
daemon: thread per-RPC op_id end-to-end
Today there's no way to correlate a CLI failure with a daemon log
line. operationLog records relative timing but no id, two concurrent
vm.start calls log indistinguishably, and the async
vmCreateOperationState.ID is user-facing yet never reaches the
journal. The root helper logs plain text to stderr while bangerd
logs JSON, so a merged journalctl is hard to grep across the
trust-boundary split.

Mint a per-RPC op id at dispatch entry, store it on context, and
include it as an "op_id" attr on every operationLog record. The
id is stamped onto every error response (including the early
short-circuit paths bad_version and unknown_method). rpc.Call
forwards the context op id on requests so a daemon RPC and the
helper RPCs it triggers all share one id. The helper now logs
JSON to match bangerd, adopts the inbound id, and emits a single
"helper rpc completed" / "helper rpc failed" line per call so
operators can see at a glance how long each privileged op took.

vmCreateOperationState.ID is now the same id dispatch generated
for vm.create.begin — one identifier between client status polls,
daemon logs, and helper logs.

The wire format gains two optional fields: rpc.Request.OpID and
rpc.ErrorResponse.OpID. Both are omitempty, so older peers in either
direction ignore them. ErrorResponse.Error() now appends
"(op-XXXXXX)" to its string form when set; existing callers that
just print err.Error() get the id for free.

Tests cover: dispatch stamps op_id on unknown_method, bad_version,
and handler-returned errors; rpc.Call exposes the typed
*ErrorResponse via errors.As so the CLI can read code/op_id; ctx
op_id is forwarded to the server in the request envelope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 22:13:44 -03:00
b8c48765fb
daemon: skip fsck_snapshot on freshly-created system overlays
The fsck_snapshot lifecycle step exists to repair stale bitmaps in
a COW file reused from a prior aborted start — without it, the
later e2cp/e2rm calls in patch_root_overlay refuse to touch the
snapshot. On a freshly-created COW there are no stale bitmaps to
repair, so e2fsck -fy is pure overhead.

system_overlay already tracks whether it created the file this run
(sc.systemOverlayCreated, used to drive the rollback path). Reuse
that flag to skip e2fsck entirely on the create-fresh path. The
reused-COW path keeps the fsck for safety. Saves a few hundred ms
per VM create — small absolute win on top of the lazy-mkfs change,
but free.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 21:37:14 -03:00