[Core,Frontend,Doc] Trace v1 cuda start up with opentelemetry (vllm-project#19318) #20229

Open

ibl-g wants to merge 3 commits into base: main

Conversation

@ibl-g ibl-g commented Jun 29, 2025

Purpose

This PR is in response to #19318 and:

  • Adds MVP-level OpenTelemetry tracing with a starting set of spans covering GPU/CUDA start-up for the openai.api_server entrypoint.

  • Tracing is on by default if opentelemetry is installed and a trace endpoint is configured via environment variable. This is default OpenTelemetry behaviour, but it differs from v0 request tracing, which requires the CLI arg --otlp-traces-endpoint.

  • Keeps opentelemetry an optional dependency. No-op trace providers/spans are used if the opentelemetry packages are not available, similar to how OpenTelemetry behaves when no trace provider is configured or tracing is disabled.

  • Forwards trace context between the API server/AsyncLLM process and the engine core process so that all spans are grouped into a single trace view.

  • Adds a new pattern of per-module trace "scopes" (OpenTelemetry terminology), similar to logging loggers; see the sketch below this list. This is a common OpenTelemetry pattern but differs from v0 request tracing, which exports a single span at the end of a request based on data collected by vLLM over time.
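
A minimal sketch of the per-module scope pattern and the no-op fallback, assuming only the standard OpenTelemetry Python API (the attribute name and fallback classes are illustrative, not necessarily the PR's actual code):

```python
from contextlib import contextmanager

try:
    # opentelemetry-api already returns a no-op tracer when no provider is
    # configured, so spans cost almost nothing unless an exporter is set up.
    from opentelemetry import trace

    # Per-module scope, analogous to logging.getLogger(__name__).
    tracer = trace.get_tracer(__name__)
except ImportError:
    # Fallback for when the opentelemetry packages are not installed at all.
    class _NoopSpan:
        def set_attribute(self, key, value):
            pass

    class _NoopTracer:
        @contextmanager
        def start_as_current_span(self, name, **kwargs):
            yield _NoopSpan()

    tracer = _NoopTracer()

# Illustrative usage; the attribute name here is hypothetical.
with tracer.start_as_current_span("vllm.engine_core.model_runner.load_model") as span:
    span.set_attribute("vllm.model", "example-model")
    ...  # load the model here
```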

This PR is intended to be a starting point for iteration. We'll want to add coverage for other hardware and entrypoints and iterate on the set of spans and their attributes.

Test Plan

Unit tests verifying that the API server exports trace spans via gRPC, similar to v0 request tracing (see the sketch below).
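
For reference, a minimal sketch of the kind of fake gRPC trace collector such a test can stand up (the FakeTraceService name and port are illustrative and mirror the v0 tracing test utilities; the actual test may differ):

```python
import threading
from concurrent import futures

import grpc
from opentelemetry.proto.collector.trace.v1.trace_service_pb2 import (
    ExportTraceServiceResponse)
from opentelemetry.proto.collector.trace.v1.trace_service_pb2_grpc import (
    TraceServiceServicer, add_TraceServiceServicer_to_server)


class FakeTraceService(TraceServiceServicer):
    """Captures whatever the OTLP gRPC exporter sends."""

    def __init__(self):
        self.request = None
        self.evt = threading.Event()

    def Export(self, request, context):
        self.request = request
        self.evt.set()
        return ExportTraceServiceResponse()


server = grpc.server(futures.ThreadPoolExecutor(max_workers=1))
service = FakeTraceService()
add_TraceServiceServicer_to_server(service, server)
server.add_insecure_port("localhost:4317")
server.start()

# Start the API server with OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=localhost:4317,
# then wait on service.evt and assert on the spans in service.request.
```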

We may want to expand the test to also cover:

  • custom no-op tracing behaviour (would require otel libraries to not be installed)
  • otel no-op tracing behaviour (by setting OTEL_SDK_DISABLED)
  • other entrypoints, notably llm.py

We could also share more of the test utilities with v0 request tracing; currently there is some duplication.

Happy to do this and more testing. I mostly wanted to get the PR in motion for early feedback.

Test Result

The new test passes. I've not yet been able to run the full test suite locally. If this is not done automatically on the PR, I'll continue investigating my environment setup to resolve the missing imports that are causing test failures.

(Optional) Documentation Update

I've added example documentation under "others"; let me know if you'd prefer it under "online serving".

Example screenshots from Jaeger

(Two Jaeger screenshots attached: 2025-06-27 21:26:48 and 21:26:34.)

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the documentation, frontend, and v1 labels on Jun 29, 2025

@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @ibl-g, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces OpenTelemetry tracing for the vLLM startup process, specifically focusing on GPU/CUDA initialization within the openai.api_server entrypoint. It establishes a new, more granular tracing pattern using per-module scopes and ensures trace context propagation across different processes, allowing for a unified view of the startup sequence. The implementation maintains OpenTelemetry as an optional dependency, enabling tracing by default if the necessary environment variables are configured.

Highlights

  • OpenTelemetry Integration: Introduced OpenTelemetry tracing capabilities for vLLM's startup process, specifically focusing on GPU/CUDA initialization within the openai.api_server entrypoint. This allows for detailed visibility into cold start phases.
  • Trace Context Propagation: Implemented mechanisms to propagate trace context between the API server process and the engine core process. This ensures that all related startup spans are grouped into a single, unified trace view, simplifying debugging and performance analysis (a minimal sketch follows this list).
  • Flexible Tracing Activation: Tracing is now enabled by default if OpenTelemetry packages are installed and a trace endpoint is configured via environment variables. OpenTelemetry remains an optional dependency, gracefully falling back to no-op tracing if not available.
  • Granular Tracing Scopes: Adopted a new pattern of per-module trace 'scopes' (similar to logging loggers) to provide more granular and organized tracing, allowing for detailed instrumentation of specific components like tokenizer initialization, model loading, and KV cache setup.
  • New Startup Spans: Added a comprehensive set of startup-related spans, including vllm.startup, vllm.python_imports, vllm.asyncllm, vllm.asyncllm.tokenizer, vllm.model_registry.inspect_model, vllm.engine_core, vllm.engine_core_client, vllm.engine_core.kv_cache, vllm.engine_core.model_executor, vllm.engine_core.model_runner.load_model, vllm.engine_core.model_runner.profile_run, vllm.api_server.init_app_state, vllm.engine_core.torch_compile, and vllm.engine_core.model_runner.model_capture, each capturing relevant attributes.
  • Documentation and Testing: Added new documentation (examples/others/tracing_vllm_startup.md) detailing how to use the new tracing features, along with a new unit test (tests/tracing/test_startup_tracing.py) to validate that the API server correctly exports trace spans via gRPC.
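
A minimal sketch of that cross-process context propagation, using only the standard opentelemetry.propagate API (the dict carrier and how it is transported between processes are assumptions; the PR may carry the context differently):

```python
from opentelemetry import propagate, trace

tracer = trace.get_tracer(__name__)

# In the API server / AsyncLLM process: serialize the current trace context
# into a plain dict (W3C traceparent headers by default) and hand it to the
# engine core process alongside the startup message.
carrier: dict[str, str] = {}
propagate.inject(carrier)

# In the engine core process: restore the parent context so its startup spans
# attach to the same trace as the API server's spans.
parent_ctx = propagate.extract(carrier)
with tracer.start_as_current_span("vllm.engine_core", context=parent_ctx):
    ...  # engine core startup work
```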

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces OpenTelemetry tracing for vLLM startup, which is a valuable addition for observability and performance analysis. The implementation is well-structured, making tracing an optional dependency and providing a clear pattern for adding more spans. The context propagation between processes is also handled correctly.

I've identified a couple of medium-severity issues. One is in the documentation and could lead to misconfiguration by users. The other is in the new test, which could make it brittle to future changes. After addressing these points, the PR will be in great shape.

3. Configure [OpenTelemetry environment variables](https://opentelemetry.io/docs/specs/otel/configuration/sdk-environment-variables/) for vLLM

```
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=grpc://localhost:4317
```

Severity: medium

The grpc:// scheme is not standard for OTEL_EXPORTER_OTLP_TRACES_ENDPOINT when using gRPC and might not be handled correctly by the OpenTelemetry library, leading to connection errors. The protocol is determined by OTEL_EXPORTER_OTLP_TRACES_PROTOCOL (which defaults to grpc), and security by OTEL_EXPORTER_OTLP_TRACES_INSECURE. It's better to provide just the host and port as the test does.

Suggested change:
-export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=grpc://localhost:4317
+export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=localhost:4317
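
For context, the equivalent programmatic setup with the OTLP gRPC exporter looks roughly like the following (a sketch, not part of this PR; vLLM would normally rely on the environment variables instead):

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Equivalent to OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=localhost:4317 with
# OTEL_EXPORTER_OTLP_TRACES_INSECURE=true; the protocol defaults to gRPC,
# so no scheme is needed on the endpoint.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True)))
trace.set_tracer_provider(provider)
```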

for span in scope.spans:
    spans[span.name] = span

assert len(spans) == 12, (f"Expected 12 spans but got {len(spans)}.")

Severity: medium

This assertion for an exact number of spans is brittle and likely to break as new spans are added or existing ones are refactored. The subsequent assertion in the test, assert expected_spans <= found_spans, is more robust, as it checks for a minimum set of required spans while allowing future additions. Removing this exact-count line will make the test more maintainable.
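
A sketch of what that more robust check could look like (span names taken from the list above; variable names are assumptions based on the comment):

```python
# Span names the test insists on; spans added later won't break the check.
expected_spans = {
    "vllm.startup",
    "vllm.engine_core",
    "vllm.engine_core.model_runner.load_model",
}
found_spans = set(spans)  # `spans` is the {name: span} dict built above

assert expected_spans <= found_spans, (
    f"Missing expected spans: {expected_spans - found_spans}")
```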

@ibl-g
Author

ibl-g commented Jun 29, 2025

My apologies, running pre-commit locally did not reveal the issues noted by your neat automation. I'll get those addressed.

Labels: documentation, frontend, v1
Projects: None yet
1 participant