Commit graph

44 commits

Author SHA1 Message Date
f1b17f6f8e
install: surface ssh-config --install in next-steps blurb
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 17:57:26 -03:00
9ed44bfd75
port smoke to go 2026-05-01 19:34:44 -03:00
b9b3505e34
smoke: cover -d/--detach and bootstrap NAT precondition
Two new pure scenarios:

* detach_run: -d --rm and -d -- <cmd> combos rejected before VM
  creation; bare -d leaves the VM running and ssh-able afterward.

* bootstrap_precondition: workspace with a .mise.toml is refused
  without --nat; --no-bootstrap bypasses the precondition and the
  run completes normally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 15:05:27 -03:00
2606bfbabb
update: VMs survive banger update and rollback
Three load-bearing fixes that together let `banger update` (and its
auto-rollback path) restart the helper + daemon without killing
every running VM. New smoke scenarios prove the property end-to-end.

Bug fixes:

1. Disable the firecracker SDK's signal-forwarding goroutine. The
   default ForwardSignals = [SIGINT, SIGQUIT, SIGTERM, SIGHUP,
   SIGABRT] installs a handler in the helper that propagates the
   helper's SIGTERM (sent by systemd on `systemctl stop bangerd-
   root.service`) to every running firecracker child. Set
   ForwardSignals to an empty (non-nil) slice so setupSignals
   short-circuits at len()==0.

2. Add SendSIGKILL=no to bangerd-root.service. KillMode=process
   limits the initial SIGTERM to the helper main, but systemd
   still SIGKILLs leftover cgroup processes during the
   FinalKillSignal stage unless SendSIGKILL=no.

3. Route restart-helper / restart-daemon / wait-daemon-ready
   failures through rollbackAndRestart instead of rollbackAndWrap.
   rollbackAndWrap restored .previous binaries but didn't re-
   restart the failed unit, leaving the helper dead with the
   rolled-back binary on disk after a failed update.

Testing infrastructure (production binaries unaffected):

- Hidden --manifest-url and --pubkey-file flags on `banger update`
  let the smoke harness redirect the updater at locally-built
  release artefacts. Marked Hidden in cobra; not advertised in
  --help.
- FetchManifestFrom / VerifyBlobSignatureWithKey /
  FetchAndVerifySignatureWithKey export the existing logic against
  caller-supplied URL / pubkey. The default entry points still
  call them with the embedded canonical values.

Smoke scenarios:

- update_check: --check against fake manifest reports update
  available
- update_to_unknown: --to v9.9.9 fails before any host mutation
- update_no_root: refuses without sudo, install untouched
- update_dry_run: stages + verifies, no swap, version unchanged
- update_keeps_vm_alive: real swap to v0.smoke.0; same VM (same
  boot_id) answers SSH after the daemon restart
- update_rollback_keeps_vm_alive: v0.smoke.broken-bangerd ships a
  bangerd that passes --check-migrations but exits 1 as the
  daemon. The post-swap `systemctl restart bangerd` fails,
  rollbackAndRestart fires, the .previous binaries are restored
  and re-restarted; the same VM still answers SSH afterwards
- daemon_admin (separate prep): covers `banger daemon socket`,
  `bangerd --check-migrations --system`, `sudo banger daemon
  stop`

The smoke release builder generates a fresh ECDSA P-256 keypair
with openssl, signs SHA256SUMS cosign-compatibly, and serves
artefacts from a backgrounded python http.server.
verify_smoke_check_test.go pins the openssl/cosign signature
equivalence so the smoke release builder can't silently drift.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 12:08:08 -03:00
93ba233a12
simplify post install instructions 2026-04-30 10:41:06 -03:00
596dc67556
install.sh: expand the pre-sudo summary beyond just networking
The previous one-liner ("banger needs permission to manage network
access for the VMs you launch") was honest but understated; banger
also needs sudo for storage (rootfs snapshots, loop devices, image
files), launching/stopping firecracker under jailer isolation, and
installing binaries + systemd units. Spell those out as a short
bulleted list at the moment of decision so the user is authorising
a known scope rather than a euphemism.

Wording stays plain-language — no capability names, no jargon —
since the target audience may not know networking or container
terminology.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 17:25:16 -03:00
1004331c14
install.sh: drop --user, add BANGER_INSTALL_NONINTERACTIVE env var
Surveyed the install scripts of comparable systemd-installing tools
(Docker, k3s, Tailscale, Ollama, Determinate Systems Nix, flyctl):
none of the daemon installers offer a --user staging mode, because
the resulting install isn't useful — banger inherits that. The
"--user just stages binaries you can't actually use yet" UX was a
trap; remove it before users hit it.

In its place, adopt the cross-tool convention for non-interactive
runs: the BANGER_INSTALL_NONINTERACTIVE=1 env var is friendlier
through a curl|bash pipe than `bash -s -- --yes` because the env
var can sit on the same line:

  curl -fsSL ...install.sh | env BANGER_INSTALL_NONINTERACTIVE=1 bash

The --yes flag still works for direct script invocation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 14:15:36 -03:00
3c29af55a2
Add curl|bash installer + wire upload into publish script
scripts/install.sh is the one-command installer end users run as

  curl -fsSL https://releases.thaloco.com/banger/install.sh | bash

Design choices:

* Runs as the invoking user. All network work + signature verification
  happens unprivileged; sudo is only re-execed for the actual install
  step that writes to /usr/local and creates systemd units.
* Right before the sudo prompt, the script prints a plain-language
  summary of exactly what's about to happen — the file paths it will
  create and a one-line "why sudo" — so the user authorises a known
  scope rather than the whole pipeline. Detail link in the docs.
* Uses openssl (universally available) for signature verification, not
  cosign. cosign is needed only by the *signer*, never the verifier.
* No jq dependency. The latest_stable field is extracted from the
  manifest with grep+sed, since the manifest shape is well-defined and
  we control it.
* /dev/tty fallback for the confirmation prompt so it works through
  the curl|bash pipe.
* --yes for non-interactive CI use, --user for installing into
  ~/.local/bin without touching system paths, --version vX.Y.Z to pin.

publish-banger-release.sh now uploads install.sh to the bucket root
on every publish, so the curl URL is stable but the script logic
matches the latest verified release. It also runs a key-drift check:
if scripts/install.sh's embedded cosign public key differs from the
one in internal/updater/verify_signature.go, publishing aborts. The
two copies must stay in sync or one of them ends up rejecting every
release.

README's Quick start now leads with the installer one-liner and
documents the audit-first variant alongside it; building from source
moves below.

Smoke-tested end to end against the live bucket with --user mode:
manifest fetch → tarball download → cosign signature verify → hash
verify → extract → install. The installed binary reports v0.1.0 at
commit 6fdebd9, matching the published artifact.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 14:06:34 -03:00
6fdebd929e
publish-script: split RCLONE_BUCKET out of BUCKET_PATH
The previous form passed rclone paths like releases:banger/v0.1.0/,
which rclone parses as bucket=banger, key=v0.1.0/... — wrong, because
the actual R2 bucket is named "releases" (BUCKET_PATH was meant as
an in-bucket key prefix only). Uploads 403'd because the token has
no view of a bucket called "banger".

Introduce RCLONE_BUCKET as a separate env var (default: "releases")
and route every rclone copy through ${RCLONE_REMOTE}:${RCLONE_BUCKET}/${BUCKET_PATH}.
The public URLs in the manifest stay unchanged: BASE_URL is the
bucket's public custom domain, so the bucket name is implicit there.

The defaults now resolve to the live setup:
  rclone target:  releases:releases/banger/<version>/<file>
  public URL:     https://releases.thaloco.com/banger/<version>/<file>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 13:35:53 -03:00
12f7a92bb4
publish-script: don't clobber COSIGN_PASSWORD with empty default
The previous form did

  COSIGN_PASSWORD="${COSIGN_PASSWORD:-}" cosign sign-blob ...

which set COSIGN_PASSWORD to "" when the caller hadn't exported one.
cosign sees an explicit empty password and tries to decrypt with
it instead of prompting interactively, so any real password-protected
offline key fails with "decryption failed".

Drop the prefix entirely. If COSIGN_PASSWORD is already in env, it
gets inherited normally; if it isn't, cosign prompts on the terminal
— which is the right UX for a maintainer running the publish script
locally with the offline private key.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 13:27:23 -03:00
3d748b87c8
publish-script: fix pubkey extraction and cosign v3 compatibility
Two bugs found while dry-running the publish flow end-to-end:

1. The awk pipeline that pulled BangerReleasePublicKey out of
   verify_signature.go didn't strip Go's raw-string-literal wrapping
   (`var ... = ` + backtick on the BEGIN line, trailing backtick on
   the END line). The "verify against embedded pub key" step thus
   compared sigs against a malformed PEM. Replaced with a sed pair
   that yields a clean PEM block byte-identical to cosign.pub.

2. cosign v3.x defaults sign-blob to a new bundle format and
   pushes signatures to Rekor; both are incompatible with banger's
   "embedded pub key, raw ASN.1 DER signature" trust model.
   Add --use-signing-config=false / --tlog-upload=false /
   --new-bundle-format=false to opt out, and --insecure-ignore-tlog
   on verify-blob. These flags also work on cosign v2.x, so the
   script is forward- and backward-compatible across the v2→v3
   boundary.

Validated by an end-to-end dry-run on this machine: built binaries,
tarred, sha256summed, cosign-signed, verified against the embedded
pub key, then re-verified through internal/updater's
crypto/ecdsa.VerifyASN1 path — all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 13:23:09 -03:00
fae28e3d8b
update: docs + publish script for the self-update feature
README gets a top-level Updating section; docs/privileges.md gains
a step-by-step trust-model writeup of `banger update`. The new
scripts/publish-banger-release.sh drives the manual release cut:
build, tar, sha256sum, cosign sign-blob, verify against the embedded
public key, jq-merge into manifest.json, rclone upload to the R2
bucket. Refuses outright if the embedded key is still the placeholder
so we can't accidentally publish an unverifiable release. Also folds
in gofmt drift accumulated across the updater package and a few
sibling files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 12:43:46 -03:00
777b597a1e
smoke: smol VMs by default + JOBS auto-detects nproc
Three quality-of-life improvements now that the daemon-side races
that gated parallel mode are fixed:

1. **Smol VMs by default.** Smoke installs a tuned config.toml at
   /etc/banger/config.toml between `system install` and `system
   restart` so the respawned daemon picks up:
       vcpu = 2
       memory_mib = 1024
       disk_size = "2G"
       system_overlay_size = "2G"
   Smoke scenarios assert behavior, not capacity — they don't need
   4 vCPU / 8 GiB / 8 GiB / 8 GiB. Per-VM RAM cost drops from 8 GiB
   to 1 GiB; nominal disk drops from 16 GiB to 4 GiB (sparse, so
   actual use is small either way, but the new ceiling is gentler
   on hosts that can't overcommit). Scenarios that test
   reconfiguration (vm_set's --vcpu 2 → 4) still pass --vcpu
   explicitly, so this default doesn't perturb their assertions.

2. **JOBS defaults to nproc.** The Makefile resolves JOBS to
   `$(shell nproc)` if unset; the smoke script's existing cap of 8
   keeps the parallel pool sane on bigger hosts. The script always
   passes --jobs N now, so behavior is consistent. Override with
   `make smoke JOBS=1` for a fully serial run.

3. **Help text catches up.** --help no longer flags parallelism as
   experimental (the underlying daemon races are fixed) and now
   describes the small-VM default. `make help` mentions the new
   default and how to override.

Verified: `make smoke` (no JOBS) on a 32-core box auto-runs with
JOBS=8, smol VMs, 21/21 PASS in 172s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 17:36:17 -03:00
72882e45d7
daemon: serialise concurrent image/kernel pulls + atomic-rename seed refresh
Three concurrency bugs surfaced by `make smoke JOBS=4` that all stem
from `vm.create` paths assuming single-caller semantics:

1. **Kernel auto-pull manifest race.** Parallel `vm.create` calls that
   each need to auto-pull the same kernel ref both run kernelcat.Fetch
   in parallel against the same /var/lib/banger/kernels/<name>/. Fetch
   writes manifest.json non-atomically (truncate + write); the peer
   reads it back mid-write and trips
   "parse manifest for X: unexpected end of JSON input".

   Fix: per-name `sync.Mutex` map on `ImageService` (kernelPullLock).
   `KernelPull` and `readOrAutoPullKernel` both acquire it and re-check
   `kernelcat.ReadLocal` after the lock so a peer who finished while we
   waited is treated as success — `readOrAutoPullKernel` does NOT call
   `s.KernelPull` because that path errors with "already pulled" on a
   peer-success, which would be wrong for auto-pull. Different kernels
   stay parallel.

2. **Image auto-pull race.** Same shape as the kernel race but on the
   image side: parallel `vm.create` calls both run pullFromBundle /
   pullFromOCI for the missing image (each ~minutes of OCI fetch +
   ext4 build). The publishImage atom under imageOpsMu only protects
   the rename + UpsertImage commit, so the loser does all the work
   only to fail at the recheck with "image already exists".

   Fix: per-name `sync.Mutex` map on `ImageService` (imagePullLock).
   `findOrAutoPullImage` acquires it, re-checks FindImage, and only
   then calls PullImage. Loser short-circuits with the
   freshly-published image instead of redoing minutes of work.
   PullImage's own publishImage recheck stays as defense-in-depth
   for callers that bypass the auto-pull path.

3. **Work-seed refresh race.** When the host's SSH key has rotated
   since an image was last refreshed, `ensureAuthorizedKeyOnWorkDisk`
   triggers `refreshManagedWorkSeedFingerprint`, which rewrote the
   shared work-seed.ext4 in place via e2rm + e2cp. Peer `vm.create`
   calls doing parallel `MaterializeWorkDisk` rdumps observed a torn
   ext4 image — "Superblock checksum does not match superblock".

   Fix: stage the rewrite on a sibling tmpfile (`<seed>.refresh.<pid>-<ns>.tmp`)
   and atomic-rename. Concurrent readers either have the file open
   (kernel keeps the pre-rename inode alive) or open after the rename
   (see the new inode) — never observe a partial state. Two parallel
   refreshes are idempotent (same daemon, same SSH key) so unique tmp
   names are enough; whichever rename lands last wins, with identical
   content. UpsertImage runs after the rename so the recorded
   fingerprint always matches what's on disk.

Plus one smoke harness fix: reclassify `vm_prune` from `pure` to
`global`. `vm prune -f` removes ALL stopped VMs system-wide, not just
the ones the scenario created — so a parallel peer scenario that
happens to have its VM in `created`/`stopped` momentarily gets wiped.
Moving prune to the post-pool serial phase keeps it from racing with
in-flight scenarios.

After all four fixes, `make smoke JOBS=4` passes 21/21 in 174s
(serial baseline 141s; the small overhead is the buffered-output and
`wait -n` semaphore cost — well worth the parallelism for fast-iter
work on a 32-core box).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 17:24:11 -03:00
115eec8576
smoke: discoverable scenarios + selectable runs + parallel dispatch
`scripts/smoke.sh` was a 600-line linear script: no way to see what it
covers without reading the whole thing, and no way to run a single
scenario when iterating. Every iteration paid the full ~5-10 min suite,
which made fast feedback loops painful enough to avoid the suite.

Refactor into a registry + per-scenario functions:

- Top-of-file SMOKE_SCENARIOS (ordered) + SMOKE_DESCS (one-line desc per
  scenario) + SMOKE_CLASS (pure / repodir / global) drive both listing
  and dispatch. The 21 existing scenario blocks become scenario_<name>
  functions. Bodies are the inline blocks verbatim, modulo the workspace
  fixture move described below.
- New CLI: --list (cheap discovery, no install / no env-vars),
  --scenario NAME (or NAME,NAME,...), --jobs N (parallel dispatch),
  -h / --help.
- New setup_fixtures runs once after the install/doctor/restart preamble
  and produces the throwaway git repo at $repodir that 'repodir'-class
  scenarios consume. Lifted out of scenario_workspace_run so single-
  scenario invocations (e.g. --scenario workspace_dryrun) get the
  fixture even when the scenario that historically built it isn't
  selected.
- Wipe ~/.local/state/banger/ssh/known_hosts in the install preamble.
  `system uninstall --purge` clears /var/lib/banger but the user-side
  known_hosts persists by design — and smoke creates VMs that reuse
  guest IPs (172.16.0.2 etc.) with fresh host keys every run, so a
  leftover entry trips StrictHostKeyChecking and the daemon's wait-
  for-ssh sees only timeouts. This was the real cause of the "guest
  ssh did not come up" flakes that surface across smoke iterations.

Parallel dispatch:

- --jobs N opts into a slot-limited pool: 'pure' scenarios fan out as
  individual jobs; 'repodir' scenarios fuse into a single serial chain
  (since they mutate $repodir in registry order); 'global' scenarios
  run serially after the pool, one at a time.
- Cap is min(N, 8) — each parallel slot runs an 8 GiB VM, so RAM is
  the binding constraint.
- Parallel-mode stdout/stderr per scenario buffer to per-scenario
  logs and emit one PASS/FAIL line on completion; on FAIL the buffer
  is dumped. Serial mode (--jobs 1, the default) keeps stdout
  unbuffered exactly as before.
- Parallelism is documented as experimental in --help: it surfaces
  real daemon-side concurrency bugs (image auto-pull manifest race,
  work-seed-refresh race on the shared work-seed.ext4) that don't
  appear in serial mode and that need their own fix in the daemon.
  Serial (--jobs 1) is the reliable path; --jobs N is for fast-
  iteration dev work where occasional re-runs are acceptable.

Exit codes: 0 ok, 1 assertion failed, 2 usage error (unknown
scenario, missing SCENARIO=), 77 explicit selection skipped (NAT
when sudo iptables is unavailable AND nat is the only selected
scenario; soft-skip otherwise).

Makefile additions:

- `make smoke-list` — cheap discovery, no smoke-build dep, no env vars.
- `make smoke-one SCENARIO=name` — single-scenario run, full preamble.
  MAKECMDGOALS guard catches missing SCENARIO= before any rebuild.
- `make smoke JOBS=N` — passes through to the script's --jobs N.
- Help text covers all three.

Verified: serial full suite passes 21/21 in ~140s on this host;
make smoke-one SCENARIO=workspace_restart runs the recently-added
regression test alone in ~50s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 16:56:57 -03:00
c9358ab390
daemon: sync guest over ssh before stop to preserve workspace writes
VM stop has been quietly losing data freshly written via
`vm workspace prepare`: stop+start of a workspace-prepared VM would
come back with /root/repo wiped on the work disk.

Root cause is firecracker + Debian's systemd defaults. FC's
SendCtrlAltDel (the only "graceful shutdown" action FC exposes) just
delivers the keystroke; what the guest does with it is its choice.
Debian routes ctrl-alt-del.target -> reboot.target, so the guest
reboots, FC stays alive, the daemon's 10s wait_for_exit window
expires, and the SIGKILL fallback drops anything still in FC's
userspace I/O path. For an idle VM that's invisible. For one that
just took 100s of small writes through a workspace prepare, it's
data loss.

Fix is to dial the guest over SSH inside StopVM and run
`sync; systemctl --no-block poweroff || /sbin/poweroff -f &` before
the existing SendCtrlAltDel path. The synchronous `sync` is the
load-bearing piece — it blocks until every dirty page hits virtio-blk
and lands in the on-host root.ext4. Whether poweroff completes
before SIGKILL fires is incidental; sync has already run. SSH
unreachable falls back to the old SendCtrlAltDel behaviour so a
broken-network guest can't make stop hang.

Bounded by a 5s SSH-dial timeout so a half-broken guest can't extend
the overall stop window past gracefulShutdownWait.

Also adds two smoke scenarios:
- `workspace + stop/start`: prepare -> stop -> start -> assert
  marker survives. This is the regression that caught the bug.
- `vm exec`: end-to-end coverage for d59425a — auto-cd into the
  prepared workspace, exit-code propagation, dirty-host warning,
  --auto-prepare resync, refusal on stopped VM.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 15:41:32 -03:00
71a332a6a1
cli: maturity polish — color, error translation, tabwriter consistency
Adds three small but high-leverage presentation tweaks for v0.1:

1. internal/cli/style is a new ~70 LOC package with Pass/Fail/Warn/
   Dim/Bold helpers. Each is TTY-gated and obeys NO_COLOR. No
   external dep. Wired into the doctor PASS/FAIL/WARN status, the
   "banger:" error prefix on stderr, and the dim 'ready in <elapsed>'
   line.
2. internal/cli/errors translates rpc.ErrorResponse into user-facing
   text. operation_failed becomes invisible (the message wins);
   not_found, already_exists, bad_request, bad_version, unauthorized,
   unknown_method get short labels; unknown codes pass through. The
   daemon-attached op_id lands in dim parens — paste into
   journalctl --grep to find the daemon log line that produced the
   failure.
3. Tabwriter config converges on (0, 8, 2, ' ', 0) across every
   list/table command. The vm prune confirmation table picked up the
   right config; system install + system status switched from bare
   "key: value\n" lines to tabular form. printVMSpecLine drops its
   Unicode middle dot for an ASCII '|' so terminals without UTF-8
   render cleanly.

Tests cover translateRPCError for every code, style helpers no-op
on non-TTY and under NO_COLOR. Smoke status greps switch from
"key: value" to "key   value" to match the new format.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 22:27:07 -03:00
59e48e830b
daemon: split owner daemon from root helper
Move the supported systemd path to two services: an owner-user bangerd for
orchestration and a narrow root helper for bridge/tap, NAT/resolver, dm/loop,
and Firecracker ownership. This removes repeated sudo from daily vm and image
flows without leaving the general daemon running as root.

Add install metadata, system install/status/restart/uninstall commands, and a
system-owned runtime layout. Keep user SSH/config material in the owner home,
lock file_sync to the owner home, and move daemon known_hosts handling out of
the old root-owned control path.

Route privileged lifecycle steps through typed privilegedOps calls, harden the
two systemd units, and rewrite smoke plus docs around the supported service
model.

Verified with make build, make test, make lint, and make smoke on the
supported systemd host path.
2026-04-26 12:43:17 -03:00
caa6a2b996
model: validate VM names as DNS labels at CLI + daemon
A VM name flows into five places that all have narrower grammars
than "arbitrary string":

  - the guest's /etc/hostname  (vm_disk.patchRootOverlay)
  - the guest's /etc/hosts      (same)
  - the <name>.vm DNS record    (vmdns.RecordName)
  - the kernel command line     (system.BuildBootArgs*)
  - VM-dir file-path fragments  (layout.VMsDir/<id>, etc.)

Nothing in the chain was validating the input. A name with
whitespace, newline, dot, slash, colon, or = would produce broken
hostnames, weird DNS labels, smuggled kernel cmdline tokens, or
(in the worst case) surprising traversal through the on-disk
layout. Not host shell injection — we already avoid shelling out
with the raw name — but a real correctness and supportability bug.

New: model.ValidateVMName. Rules:

  - 1..63 chars (DNS label max per RFC 1123; also a comfortable
    /etc/hostname cap)
  - lowercase ASCII letters, digits, '-' only
  - no leading or trailing '-'
  - no normalization — the name is the user-visible identifier
    (store key, `ssh <name>.vm`, `vm show`); silently rewriting
    "MyVM" → "myvm" would hand the user back something different
    than they typed

Called from two places:

  - internal/cli/commands_vm.go vmCreateParamsFromFlags — rejects
    bad `--name` values before any RPC. Empty name still passes
    through so the daemon can generate one.
  - internal/daemon/vm_create.go reserveVM — defense in depth for
    any non-CLI RPC caller (SDK, direct JSON over the socket).

Tests:

  - internal/model/vm_name_test.go — exhaustive character-class
    matrix (space, newline, tab, dot, slash, colon, equals, quote,
    control chars, unicode letters, uppercase, leading/trailing
    hyphen, over-length, max-length-exact, digits-only).
  - internal/cli TestVMCreateParamsFromFlagsRejectsInvalidName —
    CLI wire-through + empty-name passthrough.
  - internal/daemon TestReserveVMRejectsInvalidName — daemon
    defense-in-depth (including `box/../evil` path-traversal).
  - scripts/smoke.sh — end-to-end rejection + no-leaked-row
    assertion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 14:06:40 -03:00
235758e5b2
workspace: drop --readonly flag — advisory only against root guests
--readonly ran `chmod -R a-w` over the workspace after copying, but
every banger guest boots as root, and root bypasses DAC mode checks.
So a user running `vm workspace prepare ... --readonly` got the
mode bits set to 0444 but `echo x >> file` in the guest still
succeeded. The flag promised enforcement it couldn't deliver.

The feature also doesn't match the product model: workspaces are
prepared precisely so the guest CAN edit them, and `workspace
export` exists to pull those edits back as a patch. A
"read-only workspace" contradicts that loop.

Removed:
  - CLI flag `--readonly` on `vm workspace prepare`
  - api.VMWorkspacePrepareParams.ReadOnly field
  - model.WorkspacePrepareResult.ReadOnly field
  - daemon chmod dispatch in prepareVMWorkspaceGuestIO
  - smoke scenario pinning the (advisory) mode-bit behavior
  - misleading "exportbox-readonly" VM name in an unrelated export
    test (the test is about not mutating the real git index;
    renamed to exportbox-noindex-mutation)

If real enforcement becomes a user need later, the right primitive
is `chattr +i` (immutable bit — root CAN'T write) or a ro bind-mount.
Reintroducing a new flag is cheaper than debugging what the current
one actually guarantees.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 13:04:33 -03:00
bafe816fc7
smoke: cover the gaps — NAT, vm ports/restart/kill/prune, workspace variants, ssh-config
Audit of banger's advertised CLI surface vs. what smoke was exercising
turned up several gaps where a regression would have shipped silently.
New scenarios:

- NAT: asserts the per-VM POSTROUTING MASQUERADE rule is installed
  with --nat (scoped to the guest /32), idempotent across stop/start,
  and torn down on delete. End-to-end curl tests don't work here
  because the bridge IP and uplink IP both belong to the host — a
  guest reaching the uplink lands on host-local loopback whether
  MASQUERADE is set up or not — so the test pins the iptables rule
  itself. Skipped if passwordless `sudo iptables` isn't available.

- vm ports: sshd :22 must be visible with the <name>.vm endpoint
  (not localhost, not the raw guest IP — the daemon prefers the
  DNS record when one exists).

- vm restart: dedicated verb, not a stop+start alias. Asserts a
  fresh boot_id to prove the kernel actually recycled.

- vm kill --signal KILL: forceful termination path (distinct from
  `vm stop`'s graceful Ctrl-Alt-Del). Post-kill state must be
  'stopped' (not 'error') and the dm-snapshot must be cleaned up.

- vm prune -f: batch delete of non-running VMs while preserving any
  that are still running. Regression guard for the case where prune
  could wipe a live session.

- workspace prepare --readonly: mode bits on /root/repo/<file>
  must drop all write bits. Enforcement is advisory against a root
  guest, so the test asserts the bits, not EACCES.

- workspace prepare --mode full_copy: alternate transfer path
  (tarred into rootfs, no overlay) still lands the repo contents
  at /root/repo.

- workspace export --base-commit: guest-side commits captured in
  the patch when the pre-commit SHA is pinned. The feature's whole
  reason for existing; it had zero coverage. Includes a control
  assertion that the plain (no --base-commit) export does NOT see
  the committed file.

- ssh-config --install / --uninstall: HOME-isolated to a smoke
  tempdir so we don't touch the invoking user's ~/.ssh/config.
  Seeds a pre-existing config to catch any regression where
  install clobbers instead of appending. Asserts idempotency
  (second install doesn't duplicate the Include line) and clean
  round-trip (uninstall leaves the user's own content intact).

Coverage deltas from smoke (vs the last run):
  internal/hostnat          14.1% → 64.1%  (+50pp — NAT rule dance)
  internal/daemon/opstate   56.2% → 87.5%  (+31pp)
  internal/daemon           43.4% → 49.4%  (+6pp)
  internal/cli              36.1% → 40.4%  (+4pp)
  internal/daemon/workspace 64.1% → 67.5%  (+3pp)

Scenario count: 12 → 21.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 12:50:39 -03:00
b4afe13b2a
daemon: fix vm start (on a stopped VM) + regression coverage
Two defects compounded to make `vm create X` → `vm stop X` → `vm start X`
→ `vm ssh X` fail with `not_running: vm X is not running` even though
`vm show` reports `state=running`.

1. firecracker-go-sdk's startVMM spawns a goroutine that SIGTERMs
   firecracker when the ctx passed to Machine.Start cancels — and
   retains that ctx for the lifetime of the VMM, not just the boot
   phase. Our Machine.Start wrapper was plumbing the caller's ctx
   through, which on `vm.start` is the RPC request ctx. daemon.go's
   handleConn cancels reqCtx via `defer cancel()` right after
   writing the response. Net effect: firecracker is killed ~150ms
   after the `vm start` RPC "completes", invisibly, and the next
   `vm ssh` sees a dead PID. `vm.create` side-stepped the bug
   because BeginVMCreate detaches to context.Background() before
   calling startVMLocked; `vm.start` used the RPC ctx directly.
   Fix: Machine.Start now passes context.Background() to the SDK.
   We own firecracker lifecycle explicitly (StopVM / KillVM /
   cleanupRuntime), so ctx-driven cancellation here was never
   actually wired into anything useful.

2. With (1) fixed, the same scenario exposed a second defect:
   patchRootOverlay's e2cp/e2rm refuses to touch the dm-snapshot
   with "Inode bitmap checksum does not match bitmap" on a restart,
   because the COW holds stale free-block/free-inode counters from
   the previous guest boot. Kernel ext4 is fine with this; e2fsprogs
   is not. Fix: run `e2fsck -fy` on the snapshot between the
   dm_snapshot and patch_root_overlay stages. Idempotent on a fresh
   snapshot, reconciles the bitmaps on a reused COW.

Regression coverage:
  - scripts/repro-restart-bug.sh — minimal create→stop→start→ssh
    reproducer with rich on-failure diagnostics (daemon log trace,
    firecracker.log tail, handles.json, pgrep-by-apiSock, apiSock
    stat). Exits non-zero if the bug returns.
  - scripts/smoke.sh — lifecycle scenario (create/ssh/stop/start/
    ssh/delete) and vm-set scenario (--vcpu 2 → stop → set --vcpu 4
    → start → assert nproc=4). Both were pulled when the bug was
    first found; now restored.

Supporting:
  - internal/system/system.ExitCode — extracts exec.ExitError's
    code without forcing callers to import os/exec. Needed by the
    e2fsck caller (policy test pins os/exec to the shell-out
    packages).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 12:01:46 -03:00
e94e7c4dcc
smoke: workspace export scenario + smoke-fresh target + fix the export bug it caught
The export round-trip (`vm create` → `workspace prepare` → guest edit →
`workspace export`) exposed a reproducible failure on Debian bookworm
guests: `git read-tree HEAD --index-output=/tmp/...` returns exit 128
"unable to write new index file" when the target lives on tmpfs while
`.git` is on the workspace overlay. Move the temp index into
`$(git rev-parse --git-dir)` so it shares a filesystem with `.git/index`
and the lockfile + rename + hardlink dance git does internally works.

Alongside:
- new workspace-export smoke scenario that would have caught this at
  the boundary between daemon and guest git
- `make smoke-fresh` = `smoke-clean && smoke` for release-time runs
  that want first-install paths (migrations, image pull) stamped into
  the coverage report

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 11:34:55 -03:00
672d7151e9
smoke: five more scenarios + fix exit-code propagation bug the new ones caught
Five new smoke scenarios layered on top of the existing bare + workspace
vm-run tests:

  - exit-code propagation: `sh -c 'exit 42'` must rc=42
  - workspace dry-run: --dry-run lists tracked files without a VM
  - workspace --include-untracked: opt-in ships files outside the git
    index (regression guard on the security-default flip from review 4)
  - concurrent vm runs: two --rm invocations in parallel both succeed
    (stresses per-VM locks, createVMMu reservation window, tap pool)
  - invalid spec rejection: --vcpu 0 must fail with no VM row left
    behind (the "cleanup on partial failure" path the review flagged)

The exit-code scenario caught a real bug on first run:

  `banger vm run --rm -- sh -c 'exit 42'` returned rc=0, not 42.

Root cause in internal/cli/ssh.go's sshCommandArgs: extra args were
appended to the ssh argv verbatim, relying on ssh(1)'s implicit
space-join to deliver the remote command. That works for single
tokens (echo hello) but re-tokenises multi-word commands on the
remote side: `ssh host sh -c 'exit 42'` becomes remote
`sh -c exit 42`, where `42` is $0 for the already-completed `exit`,
and the exit code the user asked for is lost.

Fix: shell-quote every element of extra (`'sh'` `'-c'` `'exit 42'`)
and join them into a single trailing argv entry. ssh's space-join
then produces exactly the command the user typed on the remote
shell. TestSSHCommandArgs was updated to pin the quoting; the
existing TestRunVMRunCommandModePropagatesExitCode test needed a
one-word quote tweak (`false` → `'false'`).

Smoke run after fix passes all seven scenarios in ~2 min on warm
state. cmd/banger coverage jumped to 100% (the invalid-spec
scenario hits the error-reporting path that wasn't covered
before).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 19:37:07 -03:00
5f81332b0a
make smoke: end-to-end boot suite with coverage from real VM runs
The unit + integration tests can't cross machine.Start — the SDK
boundary would need a fake firecracker that reimplements the
control-plane HTTP API, and the ongoing maintenance cost of keeping
that fake honest with upstream kills the value. Instead, add a
pre-release smoke target that drives REAL Firecracker + real KVM,
captures coverage from the -cover-instrumented binaries, and
surfaces per-package deltas so regressions in the boot path don't
ship silently.

scripts/smoke.sh:
  - Isolated XDG_{CONFIG,STATE,CACHE,RUNTIME} so the smoke run can't
    touch real user state (state/cache persist under build/smoke/xdg
    for fast reruns; runtime is mktemp'd fresh per-run because
    sockets can't be reused)
  - Preflight: `banger doctor` must pass; UDP :42069 must be free
    (otherwise the user's real daemon is up and the smoke daemon
    can't bind its DNS listener — fail with an actionable message)
  - Scenario 1 — bare: `banger vm run --rm -- echo smoke-bare-ok`
    exercises create → start → socket ownership chown → machine.Start
    → SDK waitForSocket race → vsock agent readiness → guest SSH
    wait → exec → cleanup → delete
  - Scenario 2 — workspace: creates a throwaway git repo, runs
    `banger vm run --rm <repo> -- cat /root/repo/smoke-file.txt`,
    verifies the tracked file reached the guest (exercises
    workDisk capability PrepareHost + workspace.prepare)
  - `banger daemon stop` at the end so instrumented binaries flush
    GOCOVERDIR pods before the script exits

Makefile additions:
  - smoke-build: builds banger/bangerd under build/smoke/bin/ with
    `go build -cover`
  - smoke: runs the script with GOCOVERDIR set, reports per-package
    coverage via `go tool covdata percent`
  - smoke-coverage-html: textfmt + go tool cover for a browsable
    report
  - smoke-clean: nukes build/smoke/ including the persisted XDG
    state

Bonus fix uncovered during the first smoke run: doctor treated a
missing state.db as a FAIL ("out of memory" from SQLite
SQLITE_CANTOPEN), which red-flagged every fresh install. Split
the store check: DB file absent → PASS with "will be created on
first daemon start" detail; DB present but unreadable → FAIL as
before. New TestDoctorReport_StoreMissingSurfacesAsPassForFreshInstall
pins the behaviour.

Concrete coverage delta from the first successful smoke run
(compared to `make coverage-total`'s unit-test-only 37.8%):

  internal/firecracker        43.6% → 75.0%
  internal/daemon/workspace   33.8% → 60.8%
  internal/store              40.1% → 56.3%
  internal/guest              63.7% → 57.4%  (different mix: smoke
                                              exercises real SSH;
                                              unit tests cover more
                                              error branches)

The packages the review flagged are the ones that moved most —
which is the point.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-22 18:59:57 -03:00
25a1466947
supply chain: verify signatures and pins across image + kernel builds
Three independent hardenings, addressing a review finding that the
kernel and image build pipelines were relying on HTTPS alone for
artifact integrity.

scripts/make-generic-kernel.sh
- Fetch the detached PGP signature (linux-<ver>.tar.sign) alongside
  the tarball and verify it with gpg before extraction. An isolated
  $GNUPGHOME under the tempdir keeps the kernel signers out of the
  invoking user's keyring.
- Import the three kernel.org release signing keys (Greg KH / Linus /
  Sasha Levin) from keyserver.ubuntu.com, falling back to
  keys.openpgp.org. Ubuntu comes first because keys.openpgp.org strips
  unverified UIDs on upload, leaving gpg with UID-less keys it
  refuses to trust.
- Require VALIDSIG (cryptographic proof) rather than GOODSIG
  (printed even for expired keys) before proceeding. Verified
  end-to-end against a clean tarball (accepts) and a byte-flipped
  tampered copy (rejects with BADSIG).
- gpg + gpgv + xz added to the required-tools check.

images/golden/Dockerfile
- Pin Docker's apt signing key by fingerprint. After downloading
  /etc/apt/keyrings/docker.asc we gpg --show-keys --with-colons it,
  extract the fpr, and compare against the expected
  9DC858229FC7DD38854AE2D88D81803C0EBFCD88. A tampered or swapped key
  aborts the build before any apt repo metadata is fetched.
- Replace `curl https://mise.run | sh` with a pinned GitHub release
  binary (mise v2026.4.18, linux-x64) verified against its published
  sha256. Refuses to build on unknown architectures rather than
  silently installing a binary we have no hash for.
- Add gnupg to the ESSENTIAL apt-get install so the fingerprint check
  has gpg available.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 19:38:13 -03:00
afe91e805a
drop unused bench-create script + Makefile target
The script carried a python3 dep for one json.dumps on a VM name
that's always alphanumeric-plus-dashes anyway, it was never wired
into CI or docs, and `time banger vm create` covers the same need
ad hoc when anyone wants to measure create latency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 13:33:09 -03:00
6083e2dde5
Prune legacy void/alpine + customize.sh flows
The golden-image Dockerfile + catalog pipeline replaces the entire
manual rootfs-build stack. With that shipped, the per-distro shell
flows are dead code.

Removed:
- scripts/customize.sh, scripts/interactive.sh, scripts/verify.sh
- scripts/make-rootfs{,-void,-alpine}.sh
- scripts/register-{void,alpine}-image.sh
- scripts/make-{void,alpine}-kernel.sh
- internal/imagepreset/ (only consumer was `banger internal packages`,
  which fed customize.sh)
- examples/{void,alpine}.config.toml
- Makefile targets: rootfs, rootfs-void, rootfs-alpine, void-kernel,
  alpine-kernel, void-register, alpine-register, void-vm, alpine-vm,
  verify-void, verify-alpine, plus the ALPINE_RELEASE / *_IMAGE_NAME
  / *_VM_NAME variables

The void-6.12 kernel catalog entry is also gone — golden image pairs
with generic-6.12 and nothing else in the catalog depended on it.

Consolidated: imagemgr now holds the small DebianBasePackages list +
package-hash helper inline, so the `image build --from-image` flow
(still supported) no longer pulls from a separate imagepreset package.

Net: 3,815 lines deleted, 59 added. No runtime functionality removed
beyond the `banger internal packages` subcommand (hidden, used only
by the deleted customize.sh).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 15:39:53 -03:00
75baf2e415
publish-golden-image: content-addressed tarball names
Embed the sha256 prefix in the uploaded filename so every rebuild
lives at a unique URL. Cloudflare's edge cache (and any similar CDN
in front of R2) can never serve stale bytes for the URL the catalog
points at. The R2 console offers no per-URL purge for this bucket
layout, so making the URL itself content-addressed is the only
durable fix.

Also republishes the debian-bookworm catalog entry with the new
filename.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 15:26:57 -03:00
ab5627aec2
imagecat: publish debian-bookworm golden image
First entry in the image catalog. Verified end-to-end:
  - https://images.thaloco.com/debian-bookworm-x86_64.tar.zst reachable
  - sha256 071495e6... matches
  - bundle unpacks to rootfs.ext4 (4 GiB) + manifest.json with the
    expected name/distro/arch/kernel_ref.

publish-golden-image.sh tweaks:
  - default RCLONE_REMOTE from 'r2' to 'banger-images' (matches the
    rclone config actually in use here).
  - rclone copyto now passes --s3-no-check-bucket and --no-check-dest
    so scoped R2 tokens without HeadBucket/HeadObject permission
    still upload cleanly.

To use: restart bangerd so it picks up the new embedded catalog,
then `banger image pull debian-bookworm`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 13:25:42 -03:00
d22d05555c
scripts: bundle-based golden image pipeline
Replaces the OCI-push flow with a bundle-based one that mirrors the
kernel catalog (publish-kernel.sh / kernelcat).

- scripts/make-golden-bundle.sh: docker build → docker create → docker
  export | banger internal make-bundle → .tar.zst. Defaults target
  debian-bookworm / generic-6.12 / x86_64; pinned --size 4G to leave
  headroom for first-boot installs and in-VM apt use.
- scripts/publish-golden-image.sh: rewritten to call make-golden-bundle,
  rclone upload to R2 (banger-images bucket, images.thaloco.com), and
  jq-patch internal/imagecat/catalog.json with URL / sha256 / size.
  --skip-upload stops after bundle build and copies to dist/.

make-bundle default ext4 sizing also bumped from +25% to +50% headroom
(mkfs.ext4 needs room for inode tables, block-group metadata, journal,
and the default 5% reserved-blocks margin). The old 25% was too tight
for the ~950 MB golden rootfs and aborted with "Could not allocate
block".

End-to-end smoke (local): golden Dockerfile → 286 MB tar.zst bundle
with correct manifest, valid ext4, and all banger units + vsock agent
present.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 15:38:04 -03:00
da471b0640
Golden image Dockerfile + local build script
Debian bookworm with two clearly-labeled sections:
- ESSENTIAL: systemd, openssh-server, ca-certificates, curl, iproute2.
- OPINION: git, jq, ripgrep, fd, build-essential, shellcheck, mise,
  Docker CE (+ Compose v2 + buildx), tmux, htop, and friends.

Per-VM identity stripped at build time: /etc/machine-id cleared,
SSH host keys removed with a ssh.service drop-in that runs
`ssh-keygen -A` on first start so each VM gets a unique set.

The script is a parameterized wrapper around `docker build`; it also
supports `--push` to an OCI registry, which will be removed once the
bundle pipeline is in place.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 15:11:40 -03:00
8f4be112c2
Generic kernel + init= boot path for OCI-pulled images
Closes the full arc: banger kernel pull + image pull + vm create + vm ssh
now works end-to-end against docker.io/library/debian:bookworm with zero
manual image building.

Generic kernel:
 - New scripts/make-generic-kernel.sh builds vmlinux from upstream
   kernel.org sources using Firecracker's official minimal config
   (configs/firecracker-x86_64-6.1.config). All critical drivers
   (virtio_blk, virtio_net, ext4, vsock) compiled in — no modules,
   no initramfs needed.
 - Published as generic-6.12 in the catalog (kernels.thaloco.com).
 - catalog.json updated with the new entry.

Direct-boot init= override (vm_lifecycle.go):
 - For images without an initrd (direct-boot / OCI-pulled), banger now
   passes init=/usr/local/libexec/banger-first-boot on the kernel
   cmdline. The script runs as PID 1, mounts /proc /sys /dev /run,
   checks for systemd — if present execs it immediately; if not
   (container images), installs systemd-sysv + openssh-server via the
   guest's package manager, then execs systemd.
 - Also passes kernel-level ip= parameter via BuildBootArgsWithKernelIP
   so the kernel configures the network interface before init runs
   (container images don't ship iproute2, so the userspace bootstrap
   script can't call ip(8)).
 - Masks dev-ttyS0.device and dev-vdb.device systemd units that
   otherwise wait 90s for udev events that never fire in Firecracker
   guests started from container rootfses.

first-boot.sh rewritten as universal init wrapper:
 - Works as PID 1 (mounts essential filesystems) OR as a systemd
   oneshot (existing behavior).
 - Installs both systemd-sysv AND openssh-server (container images
   have neither).
 - Dispatch updated: debian, alpine, fedora, arch, opensuse families
   + ID_LIKE fallback. All tests updated.

Opencode capability skip for direct-boot images:
 - The opencode readiness check (WaitReady on vsock port 4096) now
   returns nil for images without an initrd, since pulled container
   images don't ship the opencode service. Without this, the VM
   would be marked as error for lacking an opinionated add-on.

Docs: README and kernel-catalog.md updated to recommend generic-6.12
as the default kernel for OCI-pulled images. AGENTS.md notes the new
build script.

Verified live:
 - banger kernel pull generic-6.12
 - banger image pull docker.io/library/debian:bookworm --kernel-ref generic-6.12
 - banger vm create --image debian-bookworm --name testbox --nat
 - banger vm ssh testbox -- "id; uname -r; systemctl is-active banger-vsock-agent"
 → uid=0(root), kernel 6.12.8, Debian bookworm, vsock-agent active,
   sshd running, SSH working.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 20:12:56 -03:00
fa95849f5a
Phase 5: kernel catalog publish flow + docs
Manual publish flow for the kernel catalog, designed for the current
no-CI, private-repo state of banger.

scripts/publish-kernel.sh <name>:
 - Reads $BANGER_KERNELS_DIR/<name>/ (the canonical layout produced by
   `banger kernel import`).
 - Pulls distro / arch / kernel_version from the local manifest.
 - Packages vmlinux + optional initrd.img + optional modules/ as
   <name>-<arch>.tar.zst with zstd -19.
 - Computes sha256 + size.
 - rclone copyto -> r2:banger-kernels/<file>.
 - HEAD-checks https://kernels.thaloco.com/<file> to catch
   public-access misconfig before declaring success.
 - jq-patches internal/kernelcat/catalog.json: replaces any prior
   entry with the same name, then sorts entries by name.
 - Prints next-step git+make commands; does not commit or rebuild
   automatically.

Environment overrides RCLONE_REMOTE / RCLONE_BUCKET / BASE_URL /
BANGER_KERNELS_DIR for non-default setups.

docs/kernel-catalog.md covers the architecture (embedded JSON +
external tarballs), end-user flow, the add/update/remove playbook,
naming and tarball-layout conventions, the trust model (sha256 in
embedded catalog catches transport/swap; no signing yet), and where
the bucket lives.

README.md gains a kernel-catalog example next to the existing image
register example. AGENTS.md points at publish-kernel.sh and the docs.

.gitignore now excludes .env so accidental drops of R2 credentials
don't follow into commits.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 15:56:56 -03:00
7192ba24ae
Phase 3: banger kernel import bridges make-*-kernel.sh output
`banger kernel import <name> --from <dir>` copies a staged kernel
bundle into the local catalog. <dir> is the output of
`make void-kernel` or `make alpine-kernel` (build/manual/void-kernel/
or build/manual/alpine-kernel/).

kernelcat.DiscoverPaths locates artifacts under <dir>:
 1. Prefers metadata.json (written by make-void-kernel.sh).
 2. Falls back to globbing: boot/vmlinux-* or vmlinuz-* (Alpine
    fallback), boot/initramfs-*, lib/modules/<latest>.

The daemon's KernelImport copies kernel + optional initrd via
system.CopyFilePreferClone and modules via system.CopyDirContents
(no-sudo mode — catalog lives under ~/.local/state), computes SHA256
over the kernel, and writes the manifest via kernelcat.WriteLocal.

While wiring this up, fixed a latent bug in system.CopyDirContents:
filepath.Join(sourceDir, ".") silently drops the trailing dot, so
`cp -a source source/contents target/` was copying the whole source
directory (including its basename) instead of just its contents.
Replaced the join with a manual "/." suffix. imagemgr.StageBootArtifacts
(the only existing caller) silently benefits.

scripts/register-void-image.sh and scripts/register-alpine-image.sh
are rewritten to use `banger kernel import … && banger image register
--kernel-ref …` instead of the find-and-pass-paths dance. Preserves
the same user-facing commands and env vars.

Tests cover: metadata.json preference, glob fallback, Alpine vmlinuz
fallback, kernel-missing error, round-trip copy into the catalog, and
the --from required flag.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-16 14:53:49 -03:00
797a9de1ce
Install claude and pi through mise
Provisioning was still installing `claude` and `pi` through a separate
npm-global prefix even after the guest images had switched to `mise` for
Node and opencode. That left two competing install paths and made the
runtime layout harder to reason about.

Switch the Debian and Void image setup flows to install `claude` and `pi`
as `mise` npm tools, assert their shims exist after `mise reshim`, and
symlink `node`, `npm`, `opencode`, `claude`, and `pi` directly from the
mise shim directory into `/usr/local/bin`.

Update the imagebuild test expectations and bump the Void rootfs default
size to 4G so the larger default toolset still fits reliably.
2026-04-13 18:29:02 -03:00
37c4c091ec
Add guest sessions and agent VM defaults
Add daemon-backed workspace and guest-session primitives so host
orchestrators can prepare /root/repo, launch long-lived guest commands,
and attach to pipe-mode sessions over the local stdio mux bridge.

Persist richer session metadata and launch diagnostics, preflight guest
cwd/command requirements, make pipe-mode attach rehydratable from guest
state after daemon restart, and allow submodules when workspace prepare
runs in full_copy mode.

At the same time, stop vm run from auto-attaching opencode, make it
print next-step commands instead, and make glibc guest images more
agent-ready by installing node, opencode, claude, and pi while syncing
opencode/claude/pi auth files into work disks on VM start.

Validation:
- GOCACHE=/tmp/banger-gocache go test ./...
- make build
- banger vm workspace prepare --help
- banger vm session --help
- banger vm session start --help
- banger vm session attach --help
2026-04-12 23:48:42 -03:00
497e6dca3d
Rename experimental Void image to void
Replace the old `void-exp` repository defaults with `void` so the Make targets,
registration helper, example config, verification messaging, and sample test
fixtures all line up with the new managed image name.

Keep the scope to repo-facing naming only: config overrides, helper output, and
test fixtures now expect `void`, while runtime compatibility for existing local
`void-exp` VMs remains an operational concern outside this commit.

Validation: go test ./..., make build, and a local `banger vm create --image void`
smoke boot with ssh and opencode ports up.
2026-04-01 20:15:28 -03:00
70bc6d07d0
Fix void-kernel output directory setup
Replace the stale `RUNTIME_DIR` mkdir in the experimental Void kernel helper with
creation of the parent directory for `OUT_DIR`, which is the current
BANGER_MANUAL_DIR/custom --out-dir flow used by the Make target.

This restores `make void-kernel` without requiring an extra environment override.
Validation: make void-kernel ARGS='--out-dir /tmp/banger-void-kernel-verify-$$'.
2026-04-01 19:42:30 -03:00
092d848620
Wait for real guest vsock health before opencode
Make vm create wait for the guest-side vsock /healthz endpoint instead of only waiting for the host socket path, so the wait_vsock_agent stage reflects actual guest readiness.

Start banger-vsock-agent earlier in the Alpine OpenRC graph and report later /ports failures as guest-service waits rather than vsock-agent waits, which makes the progress output match what the guest is really doing.

Validate with go test ./..., a rebuilt managed alpine image, and a fresh vm create --image alpine --name alp --nat that now progresses through wait_vsock_agent -> wait_guest_ready -> wait_opencode -> ready.
2026-03-21 21:14:22 -03:00
a166068fab
Add an experimental Alpine image flow
Stage a complete Alpine x86_64 image stack so \	--image alpineworks like the existing manual Void path instead of relying on Debian-oriented image builds.\n\nAdd make targets plus kernel/rootfs/register helpers that download pinned Alpine artifacts, extract a Firecracker-compatible vmlinux, build a matching mkinitfs initramfs, seed OpenRC services, and register/promote a managed image named alpine.\n\nFold in the bring-up fixes discovered during boot validation: use rootfstype=ext4 in shared boot args, install libgcc/libstdc++ for the opencode binary, and give opencode more time to become ready on cold boots.\n\nValidate with go test ./..., the Alpine helper builds, image promotion, and banger vm create --image alpine --name alp --nat plus guest service and port checks.
2026-03-21 20:25:55 -03:00
572bf32424
Remove runtime-bundle image dependencies
Hard-cut banger away from source-checkout runtime bundles as an implicit source of\nimage and host defaults. Managed images now own their full boot set,\nimage build starts from an existing registered image, and daemon startup\nno longer synthesizes a default image from host paths.\n\nResolve Firecracker from PATH or firecracker_bin, make SSH keys config-owned\nwith an auto-managed XDG default, replace the external name generator and\npackage manifests with Go code, and keep the vsock helper as a companion\nbinary instead of a user-managed runtime asset.\n\nUpdate the manual scripts, web/CLI forms, config surface, and docs around\nthe new build/manual flow and explicit image registration semantics.\n\nValidation: GOCACHE=/tmp/banger-gocache go test ./..., bash -n scripts/*.sh,\nand make build.
2026-03-21 18:34:53 -03:00
01c7cb5e65
Reorganize the source checkout layout
Separate tracked source from generated artifacts so the repo root stops accumulating helper scripts, manifests, and local runtime outputs.

Move manual shell entrypoints under scripts/, manifests under config/, and the Firecracker API reference under docs/reference/. Make build and runtimebundle now target build/bin, build/runtime, and build/dist as the canonical source-checkout paths.

Update runtime discovery, helper scripts, tests, and docs to follow the new layout while keeping legacy source-checkout runtime fallbacks for existing local bundles during migration.

Validated with bash -n on the moved scripts, make build, and GOCACHE=/tmp/banger-gocache go test ./....
2026-03-21 17:22:57 -03:00
c8d9a122f9
Speed up VM create with work seeds
Beat VM create wall time without changing VM semantics.

Generate a work-seed ext4 sidecar during image builds and rootfs rebuilds, then clone and resize that seed for each new VM instead of rebuilding /root from scratch. Plumb the new seed artifact through config, runtime metadata, store state, runtime-bundle defaults, doctor checks, and default-image reconciliation so older images still fall back cleanly.

Add a daemon TAP pool to keep idle bridge-attached devices warm, expose stage timing in lifecycle logs, add a create/SSH benchmark script plus Make target, and teach verify.sh that tap-pool-* devices are reusable capacity rather than cleanup leaks.

Validated with go test ./..., make build, ./verify.sh, and make bench-create ARGS="--runs 2".
2026-03-18 21:22:12 -03:00