Skip to content

Conversation

Deepam02
Copy link
Contributor

@Deepam02 Deepam02 commented Oct 4, 2025

it is a simple solution
we added mcp.tool.error field to traces and tool_error status to metrics when MCP tools fail.

Closes #2084

Copy link

codecov bot commented Oct 4, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 48.37%. Comparing base (19a9f7b) to head (4fcc067).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2092      +/-   ##
==========================================
+ Coverage   48.31%   48.37%   +0.05%     
==========================================
  Files         238      238              
  Lines       30034    30067      +33     
==========================================
+ Hits        14512    14545      +33     
  Misses      14423    14423              
  Partials     1099     1099              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@JAORMX JAORMX requested a review from Copilot October 4, 2025 17:43
@JAORMX
Copy link
Collaborator

JAORMX commented Oct 4, 2025

@claude please review this

Copy link
Contributor

claude bot commented Oct 4, 2025

Claude encountered an error —— View job

Failed with exit code 128

I'll analyze this and get back to you.

Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds tool error detection to telemetry middleware to track MCP (Model Context Protocol) tool execution errors through observability signals. When MCP tools fail, the system now captures this information in both traces and metrics.

  • Adds mcp.tool.error attribute to trace spans when tool execution errors are detected
  • Introduces tool_error status category in metrics for failed tool calls
  • Implements lightweight error detection by scanning response payloads for "isError":true patterns

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
pkg/telemetry/middleware.go Core implementation of tool error detection logic and telemetry integration
pkg/telemetry/middleware_test.go Unit tests for error detection function and response writer behavior
pkg/telemetry/integration_test.go Integration test verifying end-to-end tool error detection in telemetry pipeline

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@Deepam02 Deepam02 requested a review from Copilot October 4, 2025 19:42
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.


Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment on lines +448 to +451
// Buffer response data for tool calls to enable proper error detection
if rw.isToolCall && !rw.hasToolError {
rw.responseBuffer = append(rw.responseBuffer, data...)
}
Copy link
Preview

Copilot AI Oct 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unbounded memory usage: response buffer grows without limits for tool calls. Consider adding a maximum buffer size to prevent potential memory exhaustion on large responses.

Copilot uses AI. Check for mistakes.

Comment on lines +459 to +463
if rw.isToolCall && !rw.hasToolError && len(rw.responseBuffer) > 0 {
rw.hasToolError = detectMCPToolError(rw.responseBuffer)
// Clear buffer to free memory
rw.responseBuffer = nil
}
Copy link
Preview

Copilot AI Oct 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inefficient error detection: detectMCPToolError is called on the complete buffered response, which requires parsing the entire JSON even for large responses. Consider streaming JSON parsing or early detection when isError field is encountered.

Copilot uses AI. Check for mistakes.

@ChrisJBurns
Copy link
Collaborator

@claude try again and review this

Copy link
Contributor

claude bot commented Oct 4, 2025

Claude encountered an error —— View job

Failed with exit code 128

I'll analyze this and get back to you.

@ChrisJBurns
Copy link
Collaborator

@claude review this one more time

Copy link
Contributor

claude bot commented Oct 6, 2025

Claude encountered an error —— View job

Failed with exit code 128

I'll analyze this and get back to you.

@ChrisJBurns
Copy link
Collaborator

@JAORMX I wonder if Claude can't review forked branches. The following error is in the action log

fatal: couldn't find remote ref feat/telemetry-tool-error-capture
Error: Prepare step failed with error: Failed with exit code 128
Error: Process completed with exit code 1.

@ChrisJBurns
Copy link
Collaborator

ChrisJBurns commented Oct 6, 2025

@Deepam02 Thanks for the PR!! 🚀 Would you be able to add some screenshots to the issue so we can verify that the correct outcome has happened? We've got some example deployment yamls in the otel folder that you can use with kind

@Deepam02
Copy link
Contributor Author

Deepam02 commented Oct 7, 2025

Hey @ChrisJBurns!

I tried following the OTEL deployment examples in examples/otel with Kind, but ran into issues and coudn't figure it out

However, is this screenshot of the successful passing unit test enough to demonstrate that the tool error detection middleware is working correctly?

image

The TestTelemetryIntegration_ToolErrorDetection test passes, proving that MCP responses with "isError": true are properly detected and captured in telemetry.

ResponseWriter: w,
statusCode: http.StatusOK,
bytesWritten: 0,
isToolCall: mcpparser.GetMCPMethod(ctx) == string(mcp.MethodToolsCall),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: this assumes that request payload and response payload both flow through the same socket, which is not the case for (legacy) SSE transport. This bug is also present around line 100 where it checks the endpoint path, but the spec does not mandate it.

You might want to a look at pkg/testkit to ease the implementation of deeper tests for both SSE and Streamable HTTP transports.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @blkt! Good point about the transport assumptions, that makes sense.

Since SSE is marked as legacy, I was thinking of just disabling tool error detection for SSE and keeping it working for streamable-HTTP. It keeps this PR simple.

If you’d prefer full SSE support though, I can refactor it to move the detection to the MCP layer so it works across transports. Happy to go with whatever you think is best.

Copy link
Contributor

@blkt blkt Oct 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have strong opinions about this and I definitely do not want to block this, just wanted to raise awareness.
@ChrisJBurns @dmjb I'll let you decide the best way forward, feel free to resolve the conversation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

FR: Capture tool execution errors in OTel traces and metrics
4 participants