[Core,Frontend,Doc] Trace v1 cuda start up with opentelemetry (vllm-project#19318) #20229
Conversation
Signed-off-by: ibl <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
Summary of Changes
Hello @ibl-g, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces OpenTelemetry tracing for the vLLM startup process, specifically focusing on GPU/CUDA initialization within the `openai.api_server` entrypoint. It establishes a new, more granular tracing pattern using per-module scopes and ensures trace context propagation across different processes, allowing for a unified view of the startup sequence. The implementation maintains OpenTelemetry as an optional dependency, enabling tracing by default if the necessary environment variables are configured.
Highlights
- OpenTelemetry Integration: Introduced OpenTelemetry tracing capabilities for vLLM's startup process, specifically focusing on GPU/CUDA initialization within the `openai.api_server` entrypoint. This allows for detailed visibility into cold start phases.
- Trace Context Propagation: Implemented mechanisms to propagate trace context between the API server process and the engine core process. This ensures that all related startup spans are grouped into a single, unified trace view, simplifying debugging and performance analysis.
- Flexible Tracing Activation: Tracing is now enabled by default if OpenTelemetry packages are installed and a trace endpoint is configured via environment variables. OpenTelemetry remains an optional dependency, gracefully falling back to no-op tracing if not available.
- Granular Tracing Scopes: Adopted a new pattern of per-module trace 'scopes' (similar to logging loggers) to provide more granular and organized tracing, allowing for detailed instrumentation of specific components like tokenizer initialization, model loading, and KV cache setup (a minimal sketch of this pattern follows this list).
- New Startup Spans: Added a comprehensive set of startup-related spans, including `vllm.startup`, `vllm.python_imports`, `vllm.asyncllm`, `vllm.asyncllm.tokenizer`, `vllm.model_registry.inspect_model`, `vllm.engine_core`, `vllm.engine_core_client`, `vllm.engine_core.kv_cache`, `vllm.engine_core.model_executor`, `vllm.engine_core.model_runner.load_model`, `vllm.engine_core.model_runner.profile_run`, `vllm.api_server.init_app_state`, `vllm.engine_core.torch_compile`, and `vllm.engine_core.model_runner.model_capture`, each capturing relevant attributes.
- Documentation and Testing: Added new documentation (`examples/others/tracing_vllm_startup.md`) detailing how to use the new tracing features, along with a new unit test (`tests/tracing/test_startup_tracing.py`) to validate that the API server correctly exports trace spans via gRPC.
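As a minimal sketch of the per-module scope pattern referenced in the Granular Tracing Scopes item above, the standard OpenTelemetry API looks roughly like this; the span name comes from the PR's span list, while the function and attribute names are hypothetical, not the PR's actual code:

```python
# Illustrative sketch of a per-module trace "scope" (instrumentation scope),
# analogous to logging.getLogger(__name__). Not the PR's actual helper code.
from opentelemetry import trace

# One tracer per module, named after the module, like a logger.
tracer = trace.get_tracer(__name__)


def initialize_kv_cache(num_gpu_blocks: int) -> None:
    # With no SDK/exporter configured this yields a no-op span, matching the
    # optional-dependency fallback described above.
    with tracer.start_as_current_span("vllm.engine_core.kv_cache") as span:
        span.set_attribute("num_gpu_blocks", num_gpu_blocks)  # hypothetical attribute
        ...  # actual KV cache initialization would go here
```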
Code Review
This pull request introduces OpenTelemetry tracing for vLLM startup, which is a valuable addition for observability and performance analysis. The implementation is well-structured, making tracing an optional dependency and providing a clear pattern for adding more spans. The context propagation between processes is also handled correctly.
I've identified a couple of medium-severity issues. One is in the documentation and could lead to misconfiguration by users. The other is in the new test, which could make it brittle to future changes. After addressing these points, the PR will be in great shape.
3. Configure [OpenTelemetry environment variables](https://opentelemetry.io/docs/specs/otel/configuration/sdk-environment-variables/) for vLLM

```
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=grpc://localhost:4317
```
The `grpc://` scheme is not standard for `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` when using gRPC and might not be handled correctly by the OpenTelemetry library, leading to connection errors. The protocol is determined by `OTEL_EXPORTER_OTLP_TRACES_PROTOCOL` (which defaults to `grpc`), and security by `OTEL_EXPORTER_OTLP_TRACES_INSECURE`. It's better to provide just the host and port, as the test does.
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=grpc://localhost:4317
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=localhost:4317
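For reference, a rough programmatic equivalent of the suggested configuration (a sketch only, not code from this PR), showing how the gRPC exporter takes a bare host:port with an explicit insecure flag:

```python
# Sketch: programmatic equivalent of
#   OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=localhost:4317
#   OTEL_EXPORTER_OTLP_TRACES_INSECURE=true
# Assumes opentelemetry-sdk and opentelemetry-exporter-otlp-proto-grpc are installed.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# No URL scheme is needed for the gRPC exporter; just host and port.
exporter = OTLPSpanExporter(endpoint="localhost:4317", insecure=True)

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
```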
for span in scope.spans:
    spans[span.name] = span

assert len(spans) == 12, (f"Expected 12 spans but got {len(spans)}.")
This assertion on an exact number of spans is brittle and likely to break as new spans are added or existing ones are refactored. The following assertion, `assert expected_spans <= found_spans`, is more robust, since it checks for a minimum set of required spans while allowing for future additions. Removing this line will make the test more maintainable.
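For context, a rough sketch of the subset-style assertion the comment refers to; the span names are taken from the PR's span list, and the variable names are assumptions rather than the test's exact code:

```python
# Subset assertion sketch: require a minimum set of spans, tolerate additions.
expected_spans = {
    "vllm.startup",
    "vllm.engine_core",
    "vllm.engine_core.model_runner.load_model",
}
found_spans = set(spans)  # `spans` is the name -> span dict built above

# Passes as long as every required span was exported, even if new
# startup spans are added later.
assert expected_spans <= found_spans, (
    f"Missing spans: {expected_spans - found_spans}")
```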
My apologies; running pre-commit locally did not reveal the issues noted by your neat automation. I'll get those addressed.
Purpose
The PR is in response to #19318 and:

- Adds MVP-level OpenTelemetry tracing of a starting set of spans covering GPU/CUDA start-up for the `openai.api_server` entrypoint.
- Turns tracing on by default if OpenTelemetry is installed and a trace endpoint is configured via environment variables. This is the default OpenTelemetry behaviour but differs from v0 request tracing, which requires the CLI arg `--otlp-traces-endpoint`.
- Keeps OpenTelemetry an optional dependency. It uses no-op trace providers/spans if the OpenTelemetry packages are not available, similar to how OpenTelemetry behaves when no trace provider is configured or tracing is disabled.
- Forwards trace context between the API server/AsyncLLM process and the engine core process so that all spans are grouped together into a single trace view (a rough sketch of the propagation mechanism follows below).
- Adds a new pattern of per-module trace "scopes" (OpenTelemetry terminology), similar to logging loggers. This is a common OpenTelemetry pattern but differs from v0 request tracing, which exports a single span at the end of a request based on data collected by vLLM over time.
This PR is intended to be a starting point for iteration. We'll want to add coverage for other hardware and entrypoints and iterate on the set of spans and their attributes.
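For the cross-process grouping described in the bullet on forwarding trace context, the standard OpenTelemetry propagation API looks roughly like the sketch below; the carrier handoff and function names are assumptions about the mechanism, not the PR's exact code:

```python
# Sketch of W3C trace-context propagation between two processes. The carrier
# dict stands in for whatever IPC the API server uses to hand data to the
# engine core process (an assumption here).
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer(__name__)


# In the API server / AsyncLLM process:
def make_engine_core_args() -> dict:
    carrier: dict[str, str] = {}
    inject(carrier)  # writes "traceparent"/"tracestate" when a span is active
    return {"trace_headers": carrier}


# In the engine core process:
def engine_core_startup(trace_headers: dict[str, str]) -> None:
    parent_ctx = extract(trace_headers)
    with tracer.start_as_current_span("vllm.engine_core", context=parent_ctx):
        ...  # KV cache, model executor, etc. spans become children of this span
```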
Test Plan
Unit tests verifying that the API server exports trace spans via gRPC, similar to v0 request tracing.
We may want to expand the test to also cover `llm.py`, and perhaps share more of the test utilities with the v0 request tracing; currently there's some duplication.
Happy to do this and more testing. I mostly wanted to get the PR in motion for early feedback.
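As a rough illustration of the gRPC export check described above, a fake OTLP collector can be stood up in-process, in the spirit of the v0 request-tracing tests; this is only a sketch of the general approach using the opentelemetry-proto package, not the PR's actual test code:

```python
# Sketch of a fake OTLP gRPC collector for a test; not the PR's test code.
from concurrent import futures

import grpc
from opentelemetry.proto.collector.trace.v1.trace_service_pb2 import (
    ExportTraceServiceResponse)
from opentelemetry.proto.collector.trace.v1.trace_service_pb2_grpc import (
    TraceServiceServicer, add_TraceServiceServicer_to_server)


class FakeTraceService(TraceServiceServicer):
    """Records every ExportTraceServiceRequest the server under test sends."""

    def __init__(self):
        self.requests = []

    def Export(self, request, context):
        self.requests.append(request)
        return ExportTraceServiceResponse()


def start_fake_collector(port: int = 4317):
    # The test would point OTEL_EXPORTER_OTLP_TRACES_ENDPOINT at this port,
    # start the API server, then assert on the collected span names.
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=1))
    service = FakeTraceService()
    add_TraceServiceServicer_to_server(service, server)
    server.add_insecure_port(f"localhost:{port}")
    server.start()
    return server, service
```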
Test Result
The new test passes. I've not yet been able to run the full suite of tests locally. If this isn't done automatically on the PR, I'll continue investigating my environment setup to resolve the missing imports that are causing test failures.
(Optional) Documentation Update
I've added example documentation under "other"; let me know if you prefer it under "online serving".
Example screenshots from Jaeger