daemon: thread per-RPC op_id end-to-end

Today there's no way to correlate a CLI failure with a daemon log line. operationLog records relative timing but no id, two concurrent vm.start calls log indistinguishably, and the async vmCreateOperationState.ID is user-facing yet never reaches the journal. The root helper logs plain text to stderr while bangerd logs JSON, so a merged journalctl is hard to grep across the trust-boundary split.

Mint a per-RPC op id at dispatch entry, store it on context, and include it as an "op_id" attr on every operationLog record. The id is stamped onto every error response (including the early short-circuit paths bad_version and unknown_method). rpc.Call forwards the context op id on requests so a daemon RPC and the helper RPCs it triggers all share one id.

The helper now logs JSON to match bangerd, adopts the inbound id, and emits a single "helper rpc completed" / "helper rpc failed" line per call so operators can see at a glance how long each privileged op took.

vmCreateOperationState.ID is now the same id dispatch generated for vm.create.begin: one identifier shared between client status polls, daemon logs, and helper logs.

The wire format gains two optional fields: rpc.Request.OpID and rpc.ErrorResponse.OpID, both omitempty so older peers (and the opposite direction) ignore them. ErrorResponse.Error() now appends "(op-XXXXXX)" to its string form when set; existing callers that just print err.Error() get the id for free.

Tests cover:
  - dispatch stamps op_id on unknown_method, bad_version, and handler-returned errors
  - rpc.Call exposes the typed *ErrorResponse via errors.As so the CLI can read code/op_id
  - ctx op_id is forwarded to the server in the request envelope

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-26 22:13:44 -03:00
commit e47b8146dc
parent b8c48765fb
16 changed files with 333 additions and 44 deletions
--- a/internal/roothelper/roothelper.go
+++ b/internal/roothelper/roothelper.go
@@ -285,7 +285,11 @@ func Open() (*Server, error) {
 	return &Server{
 		meta:   meta,
 		runner: system.NewRunner(),
-		logger: slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{Level: slog.LevelInfo})),
+		// JSON to match bangerd. Mixed text/JSON streams in the
+		// merged journalctl made the daemon side painful to grep;
+		// this aligns the helper so a single greppable shape spans
+		// both units.
+		logger: slog.New(slog.NewJSONHandler(os.Stderr, &slog.HandlerOptions{Level: slog.LevelInfo})),
 	}, nil
 }

@@ -352,7 +356,29 @@ func (s *Server) handleConn(conn net.Conn) {
 		_ = json.NewEncoder(conn).Encode(rpc.NewError("bad_request", err.Error()))
 		return
 	}
-	resp := s.dispatch(context.Background(), req)
+	// Adopt the daemon's op id so a single greppable id covers the
+	// whole call chain (CLI → daemon → helper). Entry log at debug
+	// level keeps production quiet; the completion log fires at
+	// info-on-success / error-on-failure with duration so an
+	// operator can see at a glance how long each privileged op
+	// took.
+	ctx := rpc.WithOpID(context.Background(), req.OpID)
+	start := time.Now()
+	if s.logger != nil {
+		s.logger.Debug("helper rpc", "method", req.Method, "op_id", req.OpID)
+	}
+	resp := s.dispatch(ctx, req)
+	if !resp.OK && resp.Error != nil && resp.Error.OpID == "" && req.OpID != "" {
+		resp.Error.OpID = req.OpID
+	}
+	if s.logger != nil {
+		duration := time.Since(start).Milliseconds()
+		if !resp.OK && resp.Error != nil {
+			s.logger.Error("helper rpc failed", "method", req.Method, "op_id", req.OpID, "duration_ms", duration, "code", resp.Error.Code, "message", resp.Error.Message)
+		} else {
+			s.logger.Info("helper rpc completed", "method", req.Method, "op_id", req.OpID, "duration_ms", duration)
+		}
+	}
 	_ = json.NewEncoder(conn).Encode(resp)
 }