-
Notifications
You must be signed in to change notification settings - Fork 123
Add tool error detection to telemetry middleware #2092
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add tool error detection to telemetry middleware #2092
Conversation
Closes stacklok#2084 Signed-off-by: Deepam02 <[email protected]>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #2092 +/- ##
==========================================
+ Coverage 48.31% 48.37% +0.05%
==========================================
Files 238 238
Lines 30034 30067 +33
==========================================
+ Hits 14512 14545 +33
Misses 14423 14423
Partials 1099 1099 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Signed-off-by: Deepam02 <[email protected]>
@claude please review this |
Claude encountered an error —— View job
I'll analyze this and get back to you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds tool error detection to telemetry middleware to track MCP (Model Context Protocol) tool execution errors through observability signals. When MCP tools fail, the system now captures this information in both traces and metrics.
- Adds
mcp.tool.error
attribute to trace spans when tool execution errors are detected - Introduces
tool_error
status category in metrics for failed tool calls - Implements lightweight error detection by scanning response payloads for
"isError":true
patterns
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
File | Description |
---|---|
pkg/telemetry/middleware.go | Core implementation of tool error detection logic and telemetry integration |
pkg/telemetry/middleware_test.go | Unit tests for error detection function and response writer behavior |
pkg/telemetry/integration_test.go | Integration test verifying end-to-end tool error detection in telemetry pipeline |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
Signed-off-by: Deepam02 <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
// Buffer response data for tool calls to enable proper error detection | ||
if rw.isToolCall && !rw.hasToolError { | ||
rw.responseBuffer = append(rw.responseBuffer, data...) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unbounded memory usage: response buffer grows without limits for tool calls. Consider adding a maximum buffer size to prevent potential memory exhaustion on large responses.
Copilot uses AI. Check for mistakes.
if rw.isToolCall && !rw.hasToolError && len(rw.responseBuffer) > 0 { | ||
rw.hasToolError = detectMCPToolError(rw.responseBuffer) | ||
// Clear buffer to free memory | ||
rw.responseBuffer = nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inefficient error detection: detectMCPToolError
is called on the complete buffered response, which requires parsing the entire JSON even for large responses. Consider streaming JSON parsing or early detection when isError
field is encountered.
Copilot uses AI. Check for mistakes.
@claude try again and review this |
Claude encountered an error —— View job
I'll analyze this and get back to you. |
@claude review this one more time |
Claude encountered an error —— View job
I'll analyze this and get back to you. |
@JAORMX I wonder if Claude can't review forked branches. The following error is in the action log
|
@Deepam02 Thanks for the PR!! 🚀 Would you be able to add some screenshots to the issue so we can verify that the correct outcome has happened? We've got some example deployment yamls in the otel folder that you can use with kind |
Hey @ChrisJBurns! I tried following the OTEL deployment examples in However, is this screenshot of the successful passing unit test enough to demonstrate that the tool error detection middleware is working correctly? ![]() The |
ResponseWriter: w, | ||
statusCode: http.StatusOK, | ||
bytesWritten: 0, | ||
isToolCall: mcpparser.GetMCPMethod(ctx) == string(mcp.MethodToolsCall), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue: this assumes that request payload and response payload both flow through the same socket, which is not the case for (legacy) SSE transport. This bug is also present around line 100 where it checks the endpoint path, but the spec does not mandate it.
You might want to a look at pkg/testkit
to ease the implementation of deeper tests for both SSE and Streamable HTTP transports.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @blkt! Good point about the transport assumptions, that makes sense.
Since SSE is marked as legacy, I was thinking of just disabling tool error detection for SSE and keeping it working for streamable-HTTP. It keeps this PR simple.
If you’d prefer full SSE support though, I can refactor it to move the detection to the MCP layer so it works across transports. Happy to go with whatever you think is best.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have strong opinions about this and I definitely do not want to block this, just wanted to raise awareness.
@ChrisJBurns @dmjb I'll let you decide the best way forward, feel free to resolve the conversation.
it is a simple solution
we added
mcp.tool.error
field to traces andtool_error
status to metrics when MCP tools fail.Closes #2084