iamemilio
Contributor

@iamemilio iamemilio commented Oct 7, 2025

What does this PR do?

Removes the broken tracing middleware from llama stack core. This middleware duplicates what OTel already does for FastAPI by default, and it breaks tracing by handling W3C trace headers incorrectly.
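
For context, the W3C Trace Context `traceparent` header has four dash-separated lowercase-hex fields: version, trace ID, parent span ID, and flags. Below is a minimal sketch of validating one with only the Python standard library. `parse_traceparent` and `TraceParent` are hypothetical names for illustration; this is not the removed middleware and not OpenTelemetry's own propagator, and it only accepts the four-field layout used by version `00`.

```python
import re
from typing import NamedTuple, Optional

# Four-field layout: 2 hex version, 32 hex trace-id, 16 hex parent-id, 2 hex flags.
_TRACEPARENT_RE = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version "ff" and all-zero trace or parent IDs are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # The lowest flag bit marks the trace as sampled.
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))
```

A server that rejects or rewrites a valid header here would start a new trace instead of joining the caller's, which is one way header handling can break distributed tracing.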

Test Plan

Telemetry is currently not working. I attempted to run this by hand, but this is not the only change needed to make telemetry work in this project, so traces did not show up. More changes will follow to address this.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 7, 2025
@iamemilio iamemilio force-pushed the remove_broken_middleware branch 2 times, most recently from 674f63e to 88c1687 Compare October 7, 2025 22:59
@iamemilio iamemilio force-pushed the remove_broken_middleware branch from 441a0bf to 556bdd5 Compare October 8, 2025 13:47
@iamemilio iamemilio force-pushed the remove_broken_middleware branch from 556bdd5 to b3b9c93 Compare October 8, 2025 13:51
Collaborator

@leseb leseb left a comment

Ok, we want to remove the broken tracing middleware. Can you clarify what we should replace it with? Can you explain how you intend to split your work and the PRs that will follow?

Thanks!

@iamemilio
Contributor Author

iamemilio commented Oct 8, 2025

> Ok, we want to remove the broken tracing middleware. Can you clarify what we should replace it with? Can you explain how you intend to split your work and the PRs that will follow?
>
> Thanks!

@leseb Thanks for the review!

Yeah, I am trying to find a way to make this change that makes sense, but it's kind of a headache. The middleware we have right now interferes with other tracing. Would you prefer that I just replace all the tracing at once?

I did a lot of testing and discovered that we can use the auto instrumentation, but we need to do it programmatically due to a known quirk of using OTel with Uvicorn. This would mean that we need telemetry installed and enabled by default, but we can disable it with environment variables. I am beginning to stage those WIP changes here: #3733
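
A rough sketch of what "programmatic, on by default, off via environment variable" could look like is below. It assumes the `opentelemetry-instrumentation-fastapi` package is installed and uses the standard `OTEL_SDK_DISABLED` variable as the off switch; both the choice of variable and the helper names `telemetry_disabled` and `instrument_app` are assumptions for illustration, not what #3733 implements.

```python
import os
from typing import Any, Mapping


def telemetry_disabled(env: Mapping[str, str] = os.environ) -> bool:
    """True when OTEL_SDK_DISABLED is set to "true" (case-insensitive)."""
    return env.get("OTEL_SDK_DISABLED", "").strip().lower() == "true"


def instrument_app(app: Any, env: Mapping[str, str] = os.environ) -> bool:
    """Instrument a FastAPI app in code; return True if instrumentation ran."""
    if telemetry_disabled(env):
        return False
    # Imported lazily so this module still loads when telemetry is turned off.
    from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

    FastAPIInstrumentor.instrument_app(app)
    return True
```

The server would call `instrument_app(app)` once, right after creating the FastAPI application object and before Uvicorn starts serving it.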

How do we feel about this design pattern? I made this comment in the community Discord as well; I am happy to link you to it.

My goal with this PR is to make the telemetry we have work well enough. Then we can migrate services to the new pattern we want one service at a time. Once that is done, we can deprecate the telemetry API.

Once we merge this and I finish implementing what is in the next PR, I can file tickets upstream for each place we capture custom instrumentation, and let you all help me with the migration. It's also an opportunity to scrutinize what we capture, to make sure the custom info makes sense and isn't duplicated elsewhere.

# 2. If it has no parent (implicit root span from FastAPI instrumentation)
is_root_span = span.attributes.get(LOCAL_ROOT_SPAN_MARKER) or parent_span_id is None
root_span_id_value = span_id if is_root_span else None
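
To make the rule in this hunk easier to reason about, here is a standalone sketch of the same check. `resolve_root_span_id` is a hypothetical wrapper, and the value given to `LOCAL_ROOT_SPAN_MARKER` is a placeholder; the real constant is defined in the llama stack codebase.

```python
from typing import Any, Mapping, Optional

# Placeholder value for illustration only.
LOCAL_ROOT_SPAN_MARKER = "local_root_span_marker"


def resolve_root_span_id(
    span_id: str,
    parent_span_id: Optional[str],
    attributes: Mapping[str, Any],
) -> Optional[str]:
    """Return span_id when the span counts as a root, else None."""
    # A span is a root if it carries the marker or has no parent at all.
    is_root_span = (
        bool(attributes.get(LOCAL_ROOT_SPAN_MARKER)) or parent_span_id is None
    )
    return span_id if is_root_span else None
```

Under this rule, a server span whose parent comes from an incoming `traceparent` header has a non-`None` parent, so it would not be treated as a root unless the marker attribute is also set. That case may be worth covering in a test.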

Contributor Author

@iamemilio iamemilio Oct 8, 2025

@ehhuang take a look at this. I was able to get the integration test to work by doing this, but I am not 100% sure it's right. I'd appreciate it if you could confirm.

Contributor

I'm not sure either. Can we just kill this SQLite span processor altogether and add tests analogous to those in test_*_telemetry but against OTel?

@iamemilio iamemilio force-pushed the remove_broken_middleware branch from 48e23c7 to 6d92d69 Compare October 8, 2025 14:47
@iamemilio iamemilio force-pushed the remove_broken_middleware branch from 6d92d69 to f051458 Compare October 8, 2025 14:51
Contributor

@cdoern cdoern left a comment

This is looking good, but +1 on leaving out unrelated changes.

@iamemilio
Contributor Author

iamemilio commented Oct 8, 2025

[Screenshot, 2025-10-08 12:54 PM]

Here is an example distributed trace produced with the changes in this PR. The client was also instrumented and sent a chat completion request to llama stack.

telemetry config:

  telemetry:
  - provider_id: meta-reference
    provider_type: inline::meta-reference
    config:
      service_name: llama-stack-server
      otel_exporter_otlp_endpoint: http://localhost:4318
      sinks:
        - console
        - otel_metric
        - otel_trace

@iamemilio iamemilio requested a review from leseb October 8, 2025 17:41
@iamemilio iamemilio requested a review from cdoern October 8, 2025 17:41
@ehhuang
Contributor

ehhuang commented Oct 8, 2025

[image]

Did we lose some spans with these changes?
