Add tool error detection to telemetry middleware #2092

Deepam02 · 2025-10-04T05:00:24Z

This is a simple solution: we added an mcp.tool.error field to traces and a tool_error status to metrics, both set when MCP tools fail.

Closes #2084


codecov · 2025-10-04T05:05:24Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 48.37%. Comparing base (19a9f7b) to head (4fcc067).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2092      +/-   ##
==========================================
+ Coverage   48.31%   48.37%   +0.05%     
==========================================
  Files         238      238              
  Lines       30034    30067      +33     
==========================================
+ Hits        14512    14545      +33     
  Misses      14423    14423              
  Partials     1099     1099


JAORMX · 2025-10-04T17:43:37Z

@claude please review this

claude · 2025-10-04T17:43:51Z

Claude encountered an error —— View job

Failed with exit code 128

I'll analyze this and get back to you.

Copilot

Pull Request Overview

This PR adds tool error detection to telemetry middleware to track MCP (Model Context Protocol) tool execution errors through observability signals. When MCP tools fail, the system now captures this information in both traces and metrics.

- Adds mcp.tool.error attribute to trace spans when tool execution errors are detected
- Introduces tool_error status category in metrics for failed tool calls
- Implements lightweight error detection by scanning response payloads for "isError":true patterns
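
For illustration, here is a minimal sketch of how such a check could be done by decoding the response instead of matching the substring. It assumes the response is a single JSON-RPC object whose result carries the MCP isError flag; the detectToolError name and the toolCallResponse type are hypothetical and are not taken from this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// toolCallResponse mirrors only the part of a JSON-RPC response that this
// sketch needs. The type is hypothetical and not part of the PR.
type toolCallResponse struct {
	Result *struct {
		IsError bool `json:"isError"`
	} `json:"result"`
}

// detectToolError reports whether body is a JSON-RPC response whose result
// has "isError": true. Malformed or unrelated payloads yield false.
func detectToolError(body []byte) bool {
	var resp toolCallResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return false
	}
	return resp.Result != nil && resp.Result.IsError
}

func main() {
	failed := []byte(`{"jsonrpc":"2.0","id":1,"result":{"content":[{"type":"text","text":"boom"}],"isError":true}}`)
	// The tool output below only quotes the pattern as text, so a substring
	// match would flag it while decoding does not.
	quoted := []byte(`{"jsonrpc":"2.0","id":2,"result":{"content":[{"type":"text","text":"\"isError\":true"}]}}`)
	fmt.Println(detectToolError(failed), detectToolError(quoted))
}
```

Decoding costs more than a byte scan, but it avoids false positives when tool output merely contains the pattern as text.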

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| pkg/telemetry/middleware.go | Core implementation of tool error detection logic and telemetry integration |
| pkg/telemetry/middleware_test.go | Unit tests for error detection function and response writer behavior |
| pkg/telemetry/integration_test.go | Integration test verifying end-to-end tool error detection in telemetry pipeline |


Copilot

Pull Request Overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.


Copilot · 2025-10-04T19:43:09Z

pkg/telemetry/middleware.go

+	// Buffer response data for tool calls to enable proper error detection
+	if rw.isToolCall && !rw.hasToolError {
+		rw.responseBuffer = append(rw.responseBuffer, data...)
+	}


Unbounded memory usage: response buffer grows without limits for tool calls. Consider adding a maximum buffer size to prevent potential memory exhaustion on large responses.
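
One possible shape for such a cap is sketched below. The cappedBuffer type and its limit field are hypothetical and would need to be adapted to the PR's response writer; the right limit value is also an assumption to be tuned against real tool responses.

```go
package main

import "fmt"

// cappedBuffer accumulates response bytes up to limit and drops the rest.
// It is a hypothetical helper, not code from this PR.
type cappedBuffer struct {
	data      []byte
	limit     int
	truncated bool
}

// Write keeps as much of p as still fits under limit. It always reports
// len(p) so the real response write path is never failed by telemetry.
func (b *cappedBuffer) Write(p []byte) (int, error) {
	room := b.limit - len(b.data)
	if room < len(p) {
		b.truncated = true
		if room > 0 {
			b.data = append(b.data, p[:room]...)
		}
		return len(p), nil
	}
	b.data = append(b.data, p...)
	return len(p), nil
}

func main() {
	buf := &cappedBuffer{limit: 8}
	buf.Write([]byte("hello "))
	buf.Write([]byte("world"))
	fmt.Printf("%q %v\n", buf.data, buf.truncated)
}
```

When truncated is set, the buffered bytes are an incomplete JSON document, so a caller would probably skip detection for that response instead of trying to parse it.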

Copilot · 2025-10-04T19:43:09Z

pkg/telemetry/middleware.go

+	if rw.isToolCall && !rw.hasToolError && len(rw.responseBuffer) > 0 {
+		rw.hasToolError = detectMCPToolError(rw.responseBuffer)
+		// Clear buffer to free memory
+		rw.responseBuffer = nil
+	}


Inefficient error detection: detectMCPToolError is called on the complete buffered response, which requires parsing the entire JSON even for large responses. Consider streaming JSON parsing or early detection when isError field is encountered.

ChrisJBurns · 2025-10-04T20:07:29Z

@claude try again and review this

claude · 2025-10-04T20:07:41Z

Claude encountered an error —— View job

Failed with exit code 128

I'll analyze this and get back to you.

ChrisJBurns · 2025-10-06T18:31:46Z

@claude review this one more time

claude · 2025-10-06T18:32:04Z

Claude encountered an error —— View job

Failed with exit code 128

I'll analyze this and get back to you.

ChrisJBurns · 2025-10-06T18:32:40Z

@JAORMX I wonder if Claude can't review forked branches. The following error is in the action log

fatal: couldn't find remote ref feat/telemetry-tool-error-capture
Error: Prepare step failed with error: Failed with exit code 128
Error: Process completed with exit code 1.

ChrisJBurns · 2025-10-06T18:33:58Z

@Deepam02 Thanks for the PR!! 🚀 Would you be able to add some screenshots to the issue so we can verify that the correct outcome has happened? We've got some example deployment yamls in the otel folder that you can use with kind

Deepam02 · 2025-10-07T07:14:55Z

Hey @ChrisJBurns!

I tried following the OTEL deployment examples in examples/otel with Kind, but ran into issues and couldn't figure them out.

However, is this screenshot of the passing unit test enough to demonstrate that the tool error detection middleware is working correctly?

The TestTelemetryIntegration_ToolErrorDetection test passes, proving that MCP responses with "isError": true are properly detected and captured in telemetry.

blkt · 2025-10-07T12:37:20Z

pkg/telemetry/middleware.go

 			ResponseWriter: w,
 			statusCode:     http.StatusOK,
 			bytesWritten:   0,
+			isToolCall:     mcpparser.GetMCPMethod(ctx) == string(mcp.MethodToolsCall),


issue: this assumes that the request payload and the response payload both flow through the same socket, which is not the case for the (legacy) SSE transport. This bug is also present around line 100, where it checks the endpoint path, which the spec does not mandate.

You might want to take a look at pkg/testkit to ease the implementation of deeper tests for both SSE and Streamable HTTP transports.

Deepam02 · 2025-10-07

Thanks @blkt! Good point about the transport assumptions, that makes sense.

Since SSE is marked as legacy, I was thinking of just disabling tool error detection for SSE and keeping it working for streamable-HTTP. It keeps this PR simple.

If you’d prefer full SSE support though, I can refactor it to move the detection to the MCP layer so it works across transports. Happy to go with whatever you think is best.

blkt · 2025-10-08

I don't have strong opinions about this and I definitely do not want to block this, just wanted to raise awareness.
@ChrisJBurns @dmjb I'll let you decide the best way forward, feel free to resolve the conversation.

Add tool error detection to telemetry middleware

4b7ac63

Closes stacklok#2084 Signed-off-by: Deepam02 <[email protected]>

telemetry: fix test parallelism and formatting

51ecc34

Signed-off-by: Deepam02 <[email protected]>

JAORMX requested a review from Copilot October 4, 2025 17:43

Copilot AI reviewed Oct 4, 2025


Improve error detection with JSON parsing and response buffering

4fcc067

Signed-off-by: Deepam02 <[email protected]>

Deepam02 requested a review from Copilot October 4, 2025 19:42

Copilot AI reviewed Oct 4, 2025


blkt requested changes Oct 7, 2025

