daemon: thread per-RPC op_id end-to-end

Today there's no way to correlate a CLI failure with a daemon log
line. operationLog records relative timing but no id, two concurrent
vm.start calls log indistinguishably, and the async
vmCreateOperationState.ID is user-facing yet never reaches the
journal. The root helper logs plain text to stderr while bangerd
logs JSON, so a merged journalctl is hard to grep across the
trust-boundary split.

Mint a per-RPC op id at dispatch entry, store it on context, and
include it as an "op_id" attr on every operationLog record. The
id is stamped onto every error response (including the early
short-circuit paths bad_version and unknown_method). rpc.Call
forwards the context op id on requests so a daemon RPC and the
helper RPCs it triggers all share one id. The helper now logs
JSON to match bangerd, adopts the inbound id, and emits a single
"helper rpc completed" / "helper rpc failed" line per call so
operators can see at a glance how long each privileged op took.
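
The flow described above can be sketched as follows. This is a minimal illustration, not the code from this commit: WithOpID and OpIDFromContext are named in the diff, but their bodies here are assumed, newOpID is a hypothetical stand-in for model.NewOpID (the "op-" plus six hex characters shape is a guess based on the "(op-XXXXXX)" format), and dispatch, demo, and the log/slog usage are illustrative only.

```go
package main

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"log/slog"
	"os"
	"time"
)

// opIDKey is an unexported key type so other packages cannot collide with it.
type opIDKey struct{}

// WithOpID returns a child context carrying id (body assumed, name from the diff).
func WithOpID(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, opIDKey{}, id)
}

// OpIDFromContext returns the op id stored on ctx, or "" when none is set.
func OpIDFromContext(ctx context.Context) string {
	id, _ := ctx.Value(opIDKey{}).(string)
	return id
}

// newOpID is a hypothetical stand-in for model.NewOpID.
func newOpID() (string, error) {
	var b [3]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	return "op-" + hex.EncodeToString(b[:]), nil
}

// dispatch adopts the inbound id when present (the helper case), mints one
// otherwise (the daemon case), runs the handler with the id on its context,
// and emits exactly one structured log line for the call.
func dispatch(ctx context.Context, logger *slog.Logger, method, inbound string,
	handler func(context.Context) error) (string, error) {
	id := inbound
	if id == "" {
		var err error
		if id, err = newOpID(); err != nil {
			return "", err
		}
	}
	start := time.Now()
	err := handler(WithOpID(ctx, id))
	attrs := []any{"op_id", id, "method", method,
		"duration_ms", time.Since(start).Milliseconds()}
	if err != nil {
		logger.Error("rpc failed", append(attrs, "error", err.Error())...)
	} else {
		logger.Info("rpc completed", attrs...)
	}
	return id, err
}

// demo runs one dispatch and reports the id the handler saw and the id returned.
func demo(inbound string) (seen, returned string) {
	logger := slog.New(slog.NewJSONHandler(os.Stderr, nil))
	returned, err := dispatch(context.Background(), logger, "vm.start", inbound,
		func(ctx context.Context) error {
			seen = OpIDFromContext(ctx)
			return nil
		})
	if err != nil {
		panic(err)
	}
	return seen, returned
}

func main() {
	seen, returned := demo("")
	fmt.Printf("handler saw %s, response carries %s\n", seen, returned)
}
```

Passing a non-empty inbound id models the helper side, which adopts the id forwarded by the daemon rather than minting its own.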

vmCreateOperationState.ID is now the same id dispatch generated
for vm.create.begin — one identifier between client status polls,
daemon logs, and helper logs.

The wire format gains two optional fields: rpc.Request.OpID and
rpc.ErrorResponse.OpID. Both are omitempty, so older peers ignore
them and newer peers tolerate their absence. ErrorResponse.Error()
now appends "(op-XXXXXX)" to its string form when the id is set;
existing callers that just print err.Error() get the id for free.
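
A rough sketch of what the optional fields might look like on the wire. Only the OpID field names and the omitempty behaviour come from this commit; the JSON key "op_id", the other struct fields, the exact Error() prefix, and the encode helper are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Request is a hypothetical envelope; only OpID is named in the commit.
type Request struct {
	Method string `json:"method"`
	OpID   string `json:"op_id,omitempty"` // JSON key is an assumption
}

// ErrorResponse is a hypothetical error envelope with the new optional id.
type ErrorResponse struct {
	Code    string `json:"code"`
	Message string `json:"message"`
	OpID    string `json:"op_id,omitempty"`
}

// Error appends the op id in parentheses only when one is set.
func (e *ErrorResponse) Error() string {
	s := e.Code + ": " + e.Message
	if e.OpID != "" {
		s += " (" + e.OpID + ")"
	}
	return s
}

// encode marshals v to a JSON string; these plain structs cannot fail to encode.
func encode(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	// Without an id the field is omitted entirely, so the envelope is unchanged.
	fmt.Println(encode(Request{Method: "vm.start"}))
	fmt.Println(encode(Request{Method: "vm.start", OpID: "op-1a2b3c"}))
	err := &ErrorResponse{Code: "unknown_method", Message: "no such method", OpID: "op-1a2b3c"}
	fmt.Println(err.Error())
}
```

Because encoding/json skips unknown keys when decoding into a struct, a peer built before this change would simply not see op_id.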

Tests cover: dispatch stamps op_id on unknown_method, bad_version,
and handler-returned errors; rpc.Call exposes the typed
*ErrorResponse via errors.As so the CLI can read code/op_id; ctx
op_id is forwarded to the server in the request envelope.
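
On the client side, reading the code and op id through errors.As might look like the sketch below. The call function is a hypothetical stand-in for rpc.Call, and whether the real function wraps the response or returns it directly is not stated here; errors.As handles both cases. The struct fields other than OpID are assumed.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrorResponse mirrors the typed RPC error; fields other than OpID are assumed.
type ErrorResponse struct {
	Code    string
	Message string
	OpID    string
}

func (e *ErrorResponse) Error() string {
	s := e.Code + ": " + e.Message
	if e.OpID != "" {
		s += " (" + e.OpID + ")"
	}
	return s
}

// call is a hypothetical stand-in for rpc.Call that wraps the server's error.
func call(method string) error {
	return fmt.Errorf("rpc %s: %w", method, &ErrorResponse{
		Code:    "unknown_method",
		Message: "no such method",
		OpID:    "op-1a2b3c",
	})
}

// describe extracts the code and op id when err is, or wraps, an *ErrorResponse.
func describe(err error) (code, opID string, ok bool) {
	var resp *ErrorResponse
	if errors.As(err, &resp) {
		return resp.Code, resp.OpID, true
	}
	return "", "", false
}

func main() {
	err := call("vm.bogus")
	if code, opID, ok := describe(err); ok {
		// A CLI could print this hint so the failure can be found in the journal.
		fmt.Printf("failed with %s; search the daemon log for op_id=%s\n", code, opID)
	}
}
```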

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Thales Maciel 2026-04-26 22:13:44 -03:00
parent b8c48765fb
commit e47b8146dc
16 changed files with 333 additions and 44 deletions

@@ -24,10 +24,21 @@ type vmCreateOperationState struct {
 	op api.VMCreateOperation
 }
 
-func newVMCreateOperationState() (*vmCreateOperationState, error) {
-	id, err := model.NewID()
-	if err != nil {
-		return nil, err
+// newVMCreateOperationState constructs the async-progress record for
+// a vm.create.begin RPC. When the caller's context already carries a
+// dispatch-assigned op id (the normal path), we reuse it so the
+// operator-visible status id and the daemon-log op_id are the same
+// string. Otherwise we mint a fresh op id — keeps the same shape on
+// internal call sites that don't go through dispatch (tests, future
+// background creators).
+func newVMCreateOperationState(ctx context.Context) (*vmCreateOperationState, error) {
+	id := OpIDFromContext(ctx)
+	if id == "" {
+		var err error
+		id, err = model.NewOpID()
+		if err != nil {
+			return nil, err
+		}
 	}
 	now := model.Now()
 	return &vmCreateOperationState{
@@ -146,12 +157,16 @@ func (op *vmCreateOperationState) cancelOperation() {
 	}
 }
 
-func (s *VMService) BeginVMCreate(_ context.Context, params api.VMCreateParams) (api.VMCreateOperation, error) {
-	op, err := newVMCreateOperationState()
+func (s *VMService) BeginVMCreate(ctx context.Context, params api.VMCreateParams) (api.VMCreateOperation, error) {
+	op, err := newVMCreateOperationState(ctx)
 	if err != nil {
 		return api.VMCreateOperation{}, err
 	}
-	createCtx, cancel := context.WithCancel(context.Background())
+	// Detach from the caller's deadline (the begin RPC returns
+	// immediately) but preserve the op id so every log line emitted
+	// by the goroutine carries the same identifier the client just
+	// got back.
+	createCtx, cancel := context.WithCancel(WithOpID(context.Background(), op.op.ID))
 	op.setCancel(cancel)
 	s.createOps.Insert(op)
 	go s.runVMCreateOperation(withVMCreateProgress(createCtx, op), op, params)
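
The detach-but-keep-the-id pattern in the hunk above can be tried in isolation. The sketch below is illustrative only: the bodies of WithOpID and OpIDFromContext are assumed, and detach and demo are hypothetical helpers. In the commit the id is taken from op.op.ID rather than re-read from the caller's context, which is the same value on the normal dispatch path.

```go
package main

import (
	"context"
	"fmt"
)

// opIDKey is an unexported context key type (assumed implementation).
type opIDKey struct{}

// WithOpID returns a child context carrying id.
func WithOpID(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, opIDKey{}, id)
}

// OpIDFromContext returns the op id stored on ctx, or "" when none is set.
func OpIDFromContext(ctx context.Context) string {
	id, _ := ctx.Value(opIDKey{}).(string)
	return id
}

// detach builds a cancellable context rooted at context.Background(), so it
// ignores the parent's deadline and cancellation while keeping its op id.
func detach(parent context.Context) (context.Context, context.CancelFunc) {
	return context.WithCancel(WithOpID(context.Background(), OpIDFromContext(parent)))
}

// demo cancels the parent (as happens when the begin RPC returns) and reports
// the detached context's op id plus whether each context has finished.
func demo() (id string, parentDone, detachedDone bool) {
	parent, cancelParent := context.WithCancel(WithOpID(context.Background(), "op-1a2b3c"))
	detached, cancel := detach(parent)
	defer cancel()
	cancelParent()
	return OpIDFromContext(detached), parent.Err() != nil, detached.Err() != nil
}

func main() {
	id, parentDone, detachedDone := demo()
	fmt.Println(id, parentDone, detachedDone)
}
```

Go 1.21 also offers context.WithoutCancel, which keeps every value on the parent; rooting at context.Background() and copying only the op id, as the hunk does, carries over nothing else.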