fix: container/Dockerfile.trtllm - use pytorch 2.8.0a0+5228986c39.nv25.5 #2579
Conversation
Walkthrough

System-wide Python installation replaces the virtualenv across the build and runtime stages in container/Dockerfile.trtllm. The CUDA toolkit assets copied into the runtime image are expanded. Triton, cuda-python, TensorRT-LLM wheels, internal wheels, tests, and benchmarks are installed via uv pip with --system and --break-system-packages.
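For orientation, a minimal sketch of the system-wide install pattern the walkthrough describes, assuming uv is already on the PATH; the wheel path in the second step is a placeholder, not the one used in the Dockerfile:

```dockerfile
# Sketch of the system-wide pattern: install straight into the image's
# system Python instead of a virtualenv.
RUN --mount=type=bind,source=./container/deps/requirements.txt,target=/tmp/requirements.txt \
    uv pip install --system --break-system-packages --requirement /tmp/requirements.txt

# Placeholder wheel path, for illustration only.
RUN uv pip install --system --break-system-packages /tmp/wheels/tensorrt_llm-*.whl
```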
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
container/Dockerfile.trtllm (2)
Lines 418-426: Potential CUDA toolchain/library version skew between build and runtime images.

You're copying cudafe++/ptxas/fatbinary, headers, nvvm, and libcudart.so* from the build stage into the runtime. The build base (${BASE_IMAGE_TAG}=25.05) likely carries a different CUDA than the runtime (${RUNTIME_IMAGE_TAG}=12.9.0). Mixing toolchain and runtime libs can produce subtle compilation/runtime faults (e.g., PTX incompatibilities, mismatched libcudart SONAMEs).

Recommendations (pick one):

- Align CUDA across stages: use a build image that matches 12.9 (or switch the runtime to match the build).
- Don't copy libcudart.so*; rely on the runtime image's CUDA runtime:

  ```dockerfile
  # Remove this copy:
  # COPY --from=build /usr/local/cuda/lib64/libcudart.so* /usr/local/cuda/lib64/
  ```

- Install cuda-toolkit-12-9 directly in the runtime (devel layer or apt) instead of copying binaries piecemeal; see the sketch after this list.
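A rough sketch of that last option, assuming an Ubuntu-based runtime image with NVIDIA's CUDA apt repository already configured in the base image:

```dockerfile
# Sketch: install a matching CUDA toolkit in the runtime stage instead of
# copying individual toolchain binaries from the build stage.
# Assumes the NVIDIA CUDA apt repository is present in the base image.
RUN apt-get update \
    && apt-get install -y --no-install-recommends cuda-toolkit-12-9 \
    && rm -rf /var/lib/apt/lists/*
```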
Lines 351-354: Drop the VIRTUAL_ENV/venv PATH manipulation in the runtime, or actually create and use the venv.

These env vars suggest a venv, but you've moved to system-wide installs and commented out the venv creation. Keep things unambiguous.

Minimal fix (outside the selected range):

```dockerfile
# Remove:
ENV VIRTUAL_ENV=/opt/dynamo/venv
ENV PATH="${VIRTUAL_ENV}/bin:${PATH}"
# Keep PATH explicit if needed:
ENV PATH=/opt/hpcx/ompi/bin:/usr/local/bin/etcd/:/usr/local/cuda/nvvm/bin:$PATH
```
🧹 Nitpick comments (3)
container/Dockerfile.trtllm (3)
Lines 410-417: System-level installs: add a uv cache and verify interpreter resolution.

Switching to uv pip --system is fine in containers, but two nits:

- Add a build cache mount to speed up repeated image builds.
- Given VIRTUAL_ENV is still set (lines 351-354), please verify uv picks the system Python, not some unexpected interpreter.

Suggested tweaks:

```diff
 RUN --mount=type=bind,source=./container/deps/requirements.txt,target=/tmp/requirements.txt \
-    uv pip install --system --break-system-packages --requirement /tmp/requirements.txt
+    --mount=type=cache,target=/root/.cache/uv \
+    uv pip install --system --break-system-packages --requirement /tmp/requirements.txt
 RUN --mount=type=bind,source=./container/deps/requirements.test.txt,target=/tmp/requirements.txt \
-    uv pip install --system --break-system-packages --requirement /tmp/requirements.txt
+    --mount=type=cache,target=/root/.cache/uv \
+    uv pip install --system --break-system-packages --requirement /tmp/requirements.txt
```

If helpful, I can also switch these to uv pip sync with a constraints file to keep resolution stable across builds; a rough sketch follows below.
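For the constraints-file idea, a hedged sketch; the constraints file path is hypothetical and is not something this PR adds:

```dockerfile
# Sketch: constrain transitive dependencies so repeated builds resolve the
# same versions. ./container/deps/constraints.txt is a hypothetical file.
RUN --mount=type=bind,source=./container/deps/requirements.txt,target=/tmp/requirements.txt \
    --mount=type=bind,source=./container/deps/constraints.txt,target=/tmp/constraints.txt \
    --mount=type=cache,target=/root/.cache/uv \
    uv pip install --system --break-system-packages \
        --constraint /tmp/constraints.txt \
        --requirement /tmp/requirements.txt
```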
Line 497: Avoid installing benchmarks into the system site in the runtime image.

This bloats the production image and risks dependency drift. Gate it behind a build arg so CI images can opt in.

```diff
-RUN uv pip install --system --break-system-packages /workspace/benchmarks
+ARG INSTALL_BENCHMARKS=0
+RUN if [ "$INSTALL_BENCHMARKS" = "1" ]; then \
+        uv pip install --system --break-system-packages /workspace/benchmarks; \
+    fi
```
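If that gate is adopted, a CI build could opt in at build time (the arg name is taken from the suggestion above):

```bash
# Opt into installing benchmarks only for CI/test images.
docker build -f container/Dockerfile.trtllm --build-arg INSTALL_BENCHMARKS=1 .
```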
Lines 481-489: Note on index selection and dependency pinning.

Thanks for documenting uv's --extra-index-url precedence. For reproducibility, consider pinning TensorRT-LLM to an exact version and adding a constraints file for transitive deps (e.g., cuda-python, packaging, jinja2) to ensure stable rebuilds.
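A sketch of what pinning might look like; the version string and index URL below are illustrative assumptions, not values taken from this PR:

```dockerfile
# Sketch only: the exact version and the extra index URL are assumptions
# for illustration, not the values this Dockerfile uses.
RUN --mount=type=cache,target=/root/.cache/uv \
    uv pip install --system --break-system-packages \
        --extra-index-url https://pypi.nvidia.com \
        "tensorrt-llm==1.0.0"
```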
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)

- container/Dockerfile.trtllm (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: ptarasiewiczNV
PR: ai-dynamo/dynamo#2027
File: container/deps/vllm/install_vllm.sh:0-0
Timestamp: 2025-07-22T10:22:28.972Z
Learning: The `--torch-backend=auto` flag works with vLLM installations via uv pip install, even though it's not a standard pip option. This flag is processed by vLLM's build system during installation to automatically match PyTorch distribution with container CUDA versions.
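Per that learning, the flag rides along on an ordinary uv pip install of vLLM; a minimal sketch, with the version left unpinned for illustration:

```bash
# Per the learning above, --torch-backend=auto is honored during the vLLM
# install and matches the PyTorch distribution to the container's CUDA.
uv pip install --torch-backend=auto vllm
```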
Learnt from: nnshah1
PR: ai-dynamo/dynamo#2489
File: container/deps/vllm/install_vllm.sh:151-152
Timestamp: 2025-08-18T16:52:15.659Z
Learning: The VLLM_PRECOMPILED_WHEEL_LOCATION environment variable, when exported, automatically triggers vLLM's build system to use the precompiled wheel instead of building from source, even when using standard `uv pip install .` commands in container/deps/vllm/install_vllm.sh.
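As described there, exporting the variable before the standard install is enough to switch vLLM to the precompiled wheel; the wheel URL below is a placeholder:

```bash
# Placeholder URL; the real location is set in container/deps/vllm/install_vllm.sh.
export VLLM_PRECOMPILED_WHEEL_LOCATION="https://example.com/vllm-0.0.0-cp312-cp312-linux_x86_64.whl"
# vLLM's build system detects the variable and uses the precompiled wheel
# instead of compiling from source.
uv pip install .
```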
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (1)
container/Dockerfile.trtllm (1)
Lines 405-406: I've initiated inspection of container/Dockerfile.trtllm around the virtualenv and PATH settings. Once the outputs are available, I'll verify whether the review comment's concerns hold and then update the comment accordingly.
/ok to test 4e51c49
…alling requirements.txt
/ok to test 77ace97
/ok to test d3ae1d4
/ok to test 625a490
@dmitry-tokarev-nv Why can't we use uv for running on B200?
Co-authored-by: Misha Chornyi <[email protected]> Signed-off-by: Dmitry Tokarev <[email protected]>
@tanmayv25 It didn't work with torch 2.8.0a0+5228986c39.nv25.5, which comes from the PyTorch container, and uv didn't recognize it. 2.8.0a0+5228986c39.nv25.5 is needed to work on B200.
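For context, the build string shipped by the base container can be checked directly; the image tag below is an assumption based on the 25.05 base referenced in this PR, not a value read from the Dockerfile:

```bash
# Sketch: confirm which PyTorch build the NGC base image ships.
# nvcr.io/nvidia/pytorch:25.05-py3 is an assumed tag.
docker run --rm nvcr.io/nvidia/pytorch:25.05-py3 \
    python -c "import torch; print(torch.__version__)"
# Expected to print a vendor build such as 2.8.0a0+5228986c39.nv25.5
```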
/ok to test 4f2dfef
Comments are non-critical.
…5.5 (#2579) Signed-off-by: Dmitry Tokarev <[email protected]> Co-authored-by: Misha Chornyi <[email protected]>
…5.5 (#2579) Signed-off-by: Dmitry Tokarev <[email protected]> Co-authored-by: Misha Chornyi <[email protected]> Signed-off-by: Jason Zhou <[email protected]>
…5.5 (#2579) Signed-off-by: Dmitry Tokarev <[email protected]> Co-authored-by: Misha Chornyi <[email protected]> Signed-off-by: Krishnan Prashanth <[email protected]>