
Conversation

@dmitry-tokarev-nv (Contributor) commented Aug 20, 2025

Overview:

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features

    • Expanded CUDA toolkit available in the runtime image (ptxas, cudafe++, fatbinary, headers, nvvm, libcudart), enabling broader GPU tooling and on-container compilation workflows.
  • Chores

    • Consolidated all Python dependency installs to system-level across build and runtime images.
    • Removed virtual environment setup to simplify the runtime environment.
    • Standardized installation of CUDA/TRT components and benchmarks using system-wide installs for consistency and reliability.
    • Aligned build and runtime images to ensure consistent dependency management and improved CUDA compatibility.

copy-pr-bot bot commented Aug 20, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai bot (Contributor) commented Aug 20, 2025

Walkthrough

System-wide Python installation replaces virtualenv across build and runtime stages in container/Dockerfile.trtllm. CUDA toolkit assets copied into the runtime image are expanded. Triton, cuda-python, TensorRT-LLM wheels, internal wheels, tests, and benchmarks are installed via uv pip with --system and --break-system-packages.

Changes

Cohort / File(s): Docker build/runtime install model (container/Dockerfile.trtllm)
Summary of changes: Commented out uv venv creation/activation; switched all uv pip installs (requirements, tests, internal wheels, benchmarks, Triton, cuda-python, TRT-LLM) to system-level with --system --break-system-packages; added runtime COPYs for additional CUDA tools/headers/libs (cudafe++, ptxas, fatbinary, include, libcudart, nvvm).
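
For orientation, the resulting pattern looks roughly like the sketch below (illustrative only; the stage name, exact paths, and the full file list are assumptions and may not match container/Dockerfile.trtllm verbatim):

# Sketch of the new install model (stage name "build" and paths are assumed).
# Python dependencies go into the system site-packages instead of a venv:
RUN --mount=type=bind,source=./container/deps/requirements.txt,target=/tmp/requirements.txt \
    uv pip install --system --break-system-packages --requirement /tmp/requirements.txt

# Extra CUDA toolchain pieces copied from the build stage into the runtime image:
COPY --from=build /usr/local/cuda/bin/cudafe++  /usr/local/cuda/bin/
COPY --from=build /usr/local/cuda/bin/ptxas     /usr/local/cuda/bin/
COPY --from=build /usr/local/cuda/bin/fatbinary /usr/local/cuda/bin/
COPY --from=build /usr/local/cuda/include       /usr/local/cuda/include
COPY --from=build /usr/local/cuda/nvvm          /usr/local/cuda/nvvm
COPY --from=build /usr/local/cuda/lib64/libcudart.so* /usr/local/cuda/lib64/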

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Poem

A rabbit hops through Docker’s field,
No venv burrows—system’s sealed.
CUDA carrots, bright and new,
ptxas, cudafe++, in view.
Triton streams along the way,
Wheels roll smooth—uv leads the day.
Ship the crate, then bound away! 🥕🐇


coderabbitai bot (Contributor) left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
container/Dockerfile.trtllm (2)

418-426: Potential CUDA toolchain/library version skew between build and runtime images.

You’re copying cudafe++/ptxas/fatbinary, headers, nvvm, and libcudart.so* from the build stage into the runtime. The build base (${BASE_IMAGE_TAG}=25.05) likely carries a different CUDA than the runtime (${RUNTIME_IMAGE_TAG}=12.9.0). Mixing toolchain and runtime libs can produce subtle compilation/runtime faults (e.g., PTX incompatibilities, mismatched libcudart SONAMEs).

Recommendations (pick one):

  • Align CUDA across stages: use a build image that matches 12.9 (or switch runtime to match build).
  • Don’t copy libcudart.so*; rely on the runtime image’s CUDA runtime:
# Remove this copy:
# COPY --from=build /usr/local/cuda/lib64/libcudart.so* /usr/local/cuda/lib64/
  • Install cuda-toolkit-12-9 directly in runtime (devel layer or apt) instead of copying binaries piecemeal.
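
If the last option is chosen, a minimal sketch (assuming the runtime base image has NVIDIA's CUDA apt repository configured and that 12.9 is the version to match) could look like:

# Sketch only: install a matching toolkit in the runtime stage instead of copying binaries.
RUN apt-get update \
    && apt-get install -y --no-install-recommends cuda-toolkit-12-9 \
    && rm -rf /var/lib/apt/lists/*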

351-354: Drop VIRTUAL_ENV/venv PATH manipulation in runtime or actually create/use it.

These envs suggest a venv, but you’ve moved to system-wide installs and commented out venv creation. Keep things unambiguous.

Minimal fix (outside selected range):

# Remove:
ENV VIRTUAL_ENV=/opt/dynamo/venv
ENV PATH="${VIRTUAL_ENV}/bin:${PATH}"

# Keep PATH explicit if needed:
ENV PATH=/opt/hpcx/ompi/bin:/usr/local/bin/etcd/:/usr/local/cuda/nvvm/bin:$PATH
🧹 Nitpick comments (3)
container/Dockerfile.trtllm (3)

410-417: System-level installs: add uv cache and verify interpreter resolution.

Switching to uv pip --system is fine in containers, but two nits:

  • Add a build cache mount to speed up repeated image builds.
  • Given VIRTUAL_ENV is still set (Lines 351–354), please verify uv picks system Python, not some unexpected interpreter.

Suggested tweaks:

 RUN --mount=type=bind,source=./container/deps/requirements.txt,target=/tmp/requirements.txt \
-    uv pip install --system --break-system-packages --requirement /tmp/requirements.txt
+    --mount=type=cache,target=/root/.cache/uv \
+    uv pip install --system --break-system-packages --requirement /tmp/requirements.txt

 RUN --mount=type=bind,source=./container/deps/requirements.test.txt,target=/tmp/requirements.txt \
-    uv pip install --system --break-system-packages --requirement /tmp/requirements.txt
+    --mount=type=cache,target=/root/.cache/uv \
+    uv pip install --system --break-system-packages --requirement /tmp/requirements.txt

If helpful, I can also switch these to uv pip sync with a constraints file to keep resolution stable across builds.
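
For illustration, a sync-based variant might look like this (a sketch assuming requirements.txt is fully pinned; flag support should be verified against the uv version in the image):

RUN --mount=type=bind,source=./container/deps/requirements.txt,target=/tmp/requirements.txt \
    --mount=type=cache,target=/root/.cache/uv \
    uv pip sync --system --break-system-packages /tmp/requirements.txt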


497-497: Avoid installing benchmarks into the system site in the runtime image.

This bloats the production image and risks dependency drift. Gate it behind a build-arg so CI images can opt-in.

- RUN uv pip install --system --break-system-packages /workspace/benchmarks
+ ARG INSTALL_BENCHMARKS=0
+ RUN if [ "$INSTALL_BENCHMARKS" = "1" ]; then \
+       uv pip install --system --break-system-packages /workspace/benchmarks; \
+     fi
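
With that gate in place, CI images can opt in at build time, for example (illustrative invocation; the image tag is a placeholder):

docker build -f container/Dockerfile.trtllm --build-arg INSTALL_BENCHMARKS=1 -t dynamo-trtllm:ci .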

481-489: Note on index selection and dependency pinning.

Thanks for documenting uv’s --extra-index-url precedence. For reproducibility, consider pinning TENSORRT-LLM to an exact version and adding a constraints file for transitive deps (e.g., cuda-python, packaging, jinja2) to ensure stable rebuilds.
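
A hedged sketch of what that could look like (the constraints file path, the ARG name, and the pinned packages are placeholders, not the project's actual pins):

# container/deps/constraints.txt would pin transitive deps, e.g. cuda-python, packaging, jinja2.
ARG TENSORRT_LLM_VERSION
RUN --mount=type=bind,source=./container/deps/constraints.txt,target=/tmp/constraints.txt \
    uv pip install --system --break-system-packages \
        --constraint /tmp/constraints.txt \
        "tensorrt-llm==${TENSORRT_LLM_VERSION}"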

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between bc290e7 and 4e51c49.

📒 Files selected for processing (1)
  • container/Dockerfile.trtllm (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: ptarasiewiczNV
PR: ai-dynamo/dynamo#2027
File: container/deps/vllm/install_vllm.sh:0-0
Timestamp: 2025-07-22T10:22:28.972Z
Learning: The `--torch-backend=auto` flag works with vLLM installations via uv pip install, even though it's not a standard pip option. This flag is processed by vLLM's build system during installation to automatically match PyTorch distribution with container CUDA versions.
Learnt from: nnshah1
PR: ai-dynamo/dynamo#2489
File: container/deps/vllm/install_vllm.sh:151-152
Timestamp: 2025-08-18T16:52:15.659Z
Learning: The VLLM_PRECOMPILED_WHEEL_LOCATION environment variable, when exported, automatically triggers vLLM's build system to use the precompiled wheel instead of building from source, even when using standard `uv pip install .` commands in container/deps/vllm/install_vllm.sh.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (1)
container/Dockerfile.trtllm (1)

405-406: I've initiated inspection of container/Dockerfile.trtllm around the virtualenv and PATH settings. Once the outputs are available, I'll verify whether the review comment's concerns hold and then update the comment accordingly.

@dmitry-tokarev-nv (Contributor Author)

/ok to test 4e51c49

@dmitry-tokarev-nv (Contributor Author)

/ok to test 77ace97

@pull-request-size bot added size/M and removed size/S labels Aug 21, 2025
@dmitry-tokarev-nv (Contributor Author)

/ok to test d3ae1d4

@dmitry-tokarev-nv changed the title from "fix: container/Dockerfile.trtllm - install pip libs on system level" to "fix: container/Dockerfile.trtllm - use pytorch 2.8.0a0+5228986c39.nv25.5" Aug 21, 2025
@dmitry-tokarev-nv (Contributor Author)

/ok to test 625a490

@tanmayv25 (Contributor)

@dmitry-tokarev-nv Why can't we use uv for running on B200?

dmitry-tokarev-nv and others added 5 commits August 26, 2025 21:50
Co-authored-by: Misha Chornyi <[email protected]>
Signed-off-by: Dmitry Tokarev <[email protected]>
Co-authored-by: Misha Chornyi <[email protected]>
Signed-off-by: Dmitry Tokarev <[email protected]>
Co-authored-by: Misha Chornyi <[email protected]>
Signed-off-by: Dmitry Tokarev <[email protected]>
Co-authored-by: Misha Chornyi <[email protected]>
Signed-off-by: Dmitry Tokarev <[email protected]>
@dmitry-tokarev-nv (Contributor Author)

@dmitry-tokarev-nv Why can't we use uv for running on B200?

@tanmayv25 it didn't work with torch 2.8.0a0+5228986c39.nv25.5, which comes from the PyTorch container, and uv didn't recognize that version. 2.8.0a0+5228986c39.nv25.5 is needed to work on B200.
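
For reference, a quick way to see which torch build the base container ships (shell sketch; the exact version string depends on the base image tag):

# Inside the NGC PyTorch 25.05 container:
python3 -c "import torch; print(torch.__version__)"
# expected output (approximately): 2.8.0a0+5228986c39.nv25.5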

@dmitry-tokarev-nv (Contributor Author)

/ok to test 4f2dfef

@dmitry-tokarev-nv dismissed nv-kmcgill53's stale review August 27, 2025 02:37

Comments are non-critical.

@dmitry-tokarev-nv merged commit 73f60fe into main Aug 27, 2025
10 of 11 checks passed
@dmitry-tokarev-nv deleted the dtokarev-fix-trtllm-runtime-libs branch August 27, 2025 02:37
dmitry-tokarev-nv added a commit that referenced this pull request Aug 27, 2025
krishung5 pushed a commit that referenced this pull request Aug 27, 2025
jasonqinzhou pushed a commit that referenced this pull request Aug 30, 2025
…5.5 (#2579)

Signed-off-by: Dmitry Tokarev <[email protected]>
Co-authored-by: Misha Chornyi <[email protected]>
Signed-off-by: Jason Zhou <[email protected]>
KrishnanPrash pushed a commit that referenced this pull request Sep 2, 2025
…5.5 (#2579)

Signed-off-by: Dmitry Tokarev <[email protected]>
Co-authored-by: Misha Chornyi <[email protected]>
Signed-off-by: Krishnan Prashanth <[email protected]>