
Conversation

venkywonka
Collaborator

@venkywonka commented Aug 18, 2025

Summary by CodeRabbit

  • New Features
    • Automatically derives LoRA count from model labels and sets max_loras and max_cpu_loras accordingly.
  • Bug Fixes
    • Prevents runtime failures when LoRA receives unsupported input types by auto-casting to FP16/BF16, with a debug log for visibility.
  • Tests
    • Perf configuration updated to reflect dynamic LoRA count and new presets, improving flexibility for PyTorch model runs.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
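For example, a typical invocation that runs only one documented stage with fail-fast disabled (stage name taken from the examples above):

/bot run --stage-list "A10-PyTorch-1" --disable-fail-fast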

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.

@venkywonka requested a review from a team as a code owner on August 18, 2025 23:19
Contributor

coderabbitai bot commented Aug 18, 2025

📝 Walkthrough


Adds input casting to FP16/BF16 in the LoRA forward path so that grouped GEMM receives supported types, with debug logging. Updates the PyTorch perf test config to derive LoRA counts from model labels, propagate max_loras/max_cpu_loras, and apply special settings (targets, mapping, rank) for phi_4_multimodal_instruct.

Changes

  • LoRA runtime casting (tensorrt_llm/_torch/peft/lora/layer.py): Import logger. In LoraLayer.forward, when LoRA is active, cast non-FP16/BF16 inputs to BF16 if CUDA supports it, else FP16; log a debug message; pass the cast tensor to lora_grouped_gemm. Prevents FP8 inputs from reaching the op.
  • Perf test LoRA config derivation (tests/integration/defs/perf/pytorch_model_config.py): Parse lora_count from model_label (fallback 1). Set lora_config.max_loras and max_cpu_loras to lora_count. For labels including phi_4_multimodal_instruct, add lora_target_modules and trtllm_modules_to_hf_modules, and set max_lora_rank=320; otherwise keep rank 64. Merge via base_config.update.
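To make the perf-config change concrete, here is a minimal, self-contained sketch of the derivation described above (a dict-shaped config and the phi_4 module list are assumptions for illustration; the actual file structure may differ):

def derive_lora_settings(model_label: str) -> dict:
    # Illustrative sketch of the summary above, not the exact file contents.
    lora_count = 1
    for part in model_label.split('-'):
        if part.startswith('loras:'):
            lora_count = max(1, int(part.split(':', 1)[1]))
            break
    lora_config = {
        'max_loras': lora_count,
        'max_cpu_loras': lora_count,
        'max_lora_rank': 64,
    }
    if 'phi_4_multimodal_instruct' in model_label:
        # Placeholder module names; the real lora_target_modules list and
        # trtllm_modules_to_hf_modules mapping are model-specific.
        lora_config['lora_target_modules'] = ['attn_q', 'attn_k', 'attn_v']
        lora_config['max_lora_rank'] = 320
    return {'lora_config': lora_config}

# Example: a label requesting 8 adapters.
assert derive_lora_settings('llama-loras:8')['lora_config']['max_loras'] == 8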

Sequence Diagram(s)

sequenceDiagram
  participant UserCode as Caller
  participant LoraLayer as LoraLayer.forward
  participant GEMM as lora_grouped_gemm

  UserCode->>LoraLayer: forward(input, ...)
  alt LoRA active
    alt input dtype not FP16/BF16
      LoraLayer->>LoraLayer: check CUDA bf16 support
      LoraLayer->>LoraLayer: cast to bf16 or fp16
      LoraLayer->>GEMM: grouped_gemm(cast_input, ...)
    else input is FP16/BF16
      LoraLayer->>GEMM: grouped_gemm(input, ...)
    end
  else LoRA inactive
    LoraLayer->>LoraLayer: bypass LoRA path
  end
  LoraLayer-->>UserCode: output
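A condensed Python sketch of the guard shown in the diagram (illustrative only; the actual logic lives in LoraLayer.forward and appears in the review diff below):

import torch

def cast_for_lora_gemm(x: torch.Tensor) -> torch.Tensor:
    # lora_grouped_gemm accepts only FP16/BF16, so anything else
    # (e.g. FP8 activations) is cast before the call.
    if x.dtype not in (torch.float16, torch.bfloat16):
        target_dtype = (torch.bfloat16
                        if torch.cuda.is_bf16_supported() else torch.float16)
        x = x.to(target_dtype).contiguous()
    return x  # the real layer then passes x to lora_grouped_gemm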

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • shaharmor98
  • Wanli-Jiang
  • tijyojwad
  • 2ez4bz


@coderabbitai bot changed the title from "[5464088] @coderabbitai title" to "[5464088] fix Cast LoRA forward inputs to FP16 and BF16; update perf test config" on Aug 18, 2025
@venkywonka changed the title to "[5464088] [fix] Cast LoRA forward inputs to FP16 and BF16; update perf test config" on Aug 18, 2025
@venkywonka changed the title to "[5464088] [fix] dequantize fp8 activation input to lora forward; update perf test config" on Aug 18, 2025
@venkywonka changed the title to "[https://nvbugs/5464088] [fix] dequantize fp8 activation input to lora forward; update perf test config" on Aug 18, 2025
Contributor

@coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
tensorrt_llm/_torch/peft/lora/layer.py (2)

124-132: Good FP8 guard; consider preserving original dtype by casting outputs back

The guard avoids runtime failures by ensuring the custom op sees FP16/BF16. However, if upstream produced FP8 (or any non-FP16/BF16), returning LoRA outputs in FP16/BF16 may introduce a dtype mismatch for downstream fusion/addition logic expecting the original dtype.

Recommend preserving the original dtype for output by:

  • Capturing the original input dtype and whether a recast is needed.
  • Casting the outputs back to the original dtype just before returning.

This keeps the LoRA op constraints and preserves end-to-end dtype invariants.

Apply this diff in the guard to track original dtype and recast intent:

-                if x.dtype not in (torch.float16, torch.bfloat16):
-                    target_dtype = torch.bfloat16 if torch.cuda.is_bf16_supported(
-                    ) else torch.float16
-                    logger.debug(
-                        f"lora_grouped_gemm supports only FP16/BF16. Casting input from {x.dtype} to {target_dtype}."
-                    )
-                    x = x.to(target_dtype).contiguous()
+                orig_dtype = x.dtype
+                need_recast = orig_dtype not in (torch.float16, torch.bfloat16)
+                if need_recast:
+                    target_dtype = (
+                        torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
+                    )
+                    logger.debug(
+                        f"lora_grouped_gemm supports only FP16/BF16. Casting input from {orig_dtype} to {target_dtype}."
+                    )
+                    x = x.to(target_dtype).contiguous()

And update the return sites to cast back when needed (outside the selected range, shown as a Python snippet for clarity):

# Around lines 146-166:
if isinstance(lora_outputs, torch.Tensor):
    return lora_outputs.to(orig_dtype) if need_recast else lora_outputs
else:
    lora_output = []
    for module_idx in self.lora_module_types:
        if int(module_idx) in active_lora_module_ids:
            lora_output.append(lora_outputs.pop(0))
        else:
            lora_output.append(
                torch.zeros(
                    list(x.shape[:-1]) + [
                        self.output_hidden_sizes[self.lora_module_types.index(module_idx)]
                    ],
                    dtype=x.dtype,
                    device=x.device,
                )
            )
    out = torch.cat(lora_output, dim=-1)
    return out.to(orig_dtype) if need_recast else out

Verification suggestion: Please confirm that the downstream addition path expects the LoRA contribution in the original activation dtype (e.g., FP8). If it instead expects BF16/FP16, skip the “cast back” step to avoid redundant conversions.


1-1: Missing NVIDIA copyright header

This file is missing the standard NVIDIA copyright/SPDX header mandated by the repo guidelines for source files. Please add it.

Example header (adjust year if needed):

# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
tests/integration/defs/perf/pytorch_model_config.py (1)

184-190: Make loras:X parsing case-insensitive and resilient to bad inputs

Great idea to derive lora_count from the model label. To harden it:

  • Parse case-insensitively (some labels vary in case).
  • Guard against non-integer values to avoid ValueError on int().

Apply this diff:

-        lora_count = 1
-        for part in model_label.split('-'):
-            if part.startswith('loras:'):
-                lora_count = max(1, int(part.split(':', 1)[1]))
-                break
+        lora_count = 1
+        for part in model_label.lower().split('-'):
+            if part.startswith('loras:'):
+                try:
+                    lora_count = max(1, int(part.split(':', 1)[1]))
+                except ValueError:
+                    lora_count = 1
+                break
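For reference, the hardened parsing above would behave as follows (labels are hypothetical; shown as a standalone helper for clarity):

def parse_lora_count(model_label: str) -> int:
    # Case-insensitive and tolerant of non-integer values, per the suggestion above.
    lora_count = 1
    for part in model_label.lower().split('-'):
        if part.startswith('loras:'):
            try:
                lora_count = max(1, int(part.split(':', 1)[1]))
            except ValueError:
                lora_count = 1
            break
    return lora_count

assert parse_lora_count('llama_v3-LoRAs:4') == 4      # case-insensitive match
assert parse_lora_count('llama_v3-loras:oops') == 1   # bad value falls back to 1
assert parse_lora_count('llama_v3') == 1              # no loras tag present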
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 6fda8dd and a93c35f.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/peft/lora/layer.py (2 hunks)
  • tests/integration/defs/perf/pytorch_model_config.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/peft/lora/layer.py
  • tests/integration/defs/perf/pytorch_model_config.py
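A tiny illustration of the naming and initialization rules listed above (hypothetical module, not from the PR):

G_REQUEST_COUNT = 0           # global: G_ prefix + UPPER_SNAKE_CASE
DEFAULT_TIMEOUT_S = 30        # constant: UPPER_SNAKE_CASE


class RetryPolicy:            # class name: PascalCase
    """Simple retry policy (Google-style docstring)."""

    def __init__(self, max_retries: int = 3):
        self.max_retries = max_retries  # externally visible member set in __init__


def compute_latency_stats(samples):   # function and locals: snake_case
    ordered = sorted(samples)
    k_99th_percentile = ordered[int(0.99 * (len(ordered) - 1))]  # numeric-leading name gets k_ prefix
    return k_99th_percentile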
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/peft/lora/layer.py
  • tests/integration/defs/perf/pytorch_model_config.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tensorrt_llm/_torch/peft/lora/layer.py (1)

6-7: LGTM on logger import

Importing the package logger is appropriate for the new debug message. No issues spotted.

tests/integration/defs/perf/pytorch_model_config.py (1)

194-197: max_loras/max_cpu_loras are fully consumed in the pipeline

  • tensorrt_llm/serve/scripts/benchmark_dataset.py: defines CLI options max_loras and max_cpu_loras, so config keys map directly to function parameters.
  • tensorrt_llm/llmapi/llm.py (lines 864–869): reads both fields to set peft_cache_config_model.num_device_module_layer and num_host_module_layer.
  • tensorrt_llm/_torch/pyexecutor/_util.py (lines 488–492): applies them identically when building the PyTorch executor’s PEFT cache.

No silent no-ops remain—both settings drive cache sizing as intended.

@litaotju enabled auto-merge (squash) on August 19, 2025 02:26
@venkywonka
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15700 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15700 [ run ] completed with state FAILURE
/LLM/release-1.0/L0_MergeRequest_PR pipeline #200 completed with status: 'FAILURE'

@venkywonka
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15719 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15719 [ run ] completed with state FAILURE
/LLM/release-1.0/L0_MergeRequest_PR pipeline #206 completed with status: 'FAILURE'

@venkywonka force-pushed the lora-bugfixes branch 2 times, most recently from 5818853 to cbb4eb5, on August 19, 2025 19:10
@venkywonka changed the title to "[https://nvbugs/5464088] [fix] Guard against fp8 activations in lora forward; update perf test config" on Aug 19, 2025
@venkywonka
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15819 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15876 [ run ] completed with state SUCCESS
/LLM/release-1.0/L0_MergeRequest_PR pipeline #229 completed with status: 'SUCCESS'

- Added logging for dtype casting in LoraLayer to ensure compatibility with FP16/BF16.
- Updated model configuration to derive the number of LoRA adapters from the model label, improving flexibility in adapter management.

Signed-off-by: Venky Ganesh <[email protected]>
- Modified _apply_activation method to accept a for_lora flag, allowing for specific handling of activation during LoRA operations.
- Updated the call to _apply_activation in GatedMLP to pass the for_lora argument, ensuring correct behavior in LoRA scenarios.
- Removed unnecessary dtype casting checks in LoraLayer, simplifying the code.

Signed-off-by: Venky Ganesh <[email protected]>
Signed-off-by: Venky Ganesh <[email protected]>
@venkywonka
Collaborator Author

/bot run --post-merge

@tensorrt-cicd
Collaborator

PR_Github #15948 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15948 [ run ] completed with state SUCCESS
/LLM/release-1.0/L0_MergeRequest_PR pipeline #241 completed with status: 'FAILURE'

@venkywonka
Collaborator Author

/bot run --extra-stage "A100X-PyTorch-1"

@tensorrt-cicd
Collaborator

PR_Github #15959 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15959 [ run ] completed with state SUCCESS
/LLM/release-1.0/L0_MergeRequest_PR pipeline #243 completed with status: 'FAILURE'

@venkywonka
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15979 [ run ] triggered by Bot

Collaborator

@shaharmor98 left a comment


LGTM

@tensorrt-cicd
Collaborator

PR_Github #15979 [ run ] completed with state SUCCESS
/LLM/release-1.0/L0_MergeRequest_PR pipeline #245 completed with status: 'SUCCESS'

@litaotju merged commit 9eac744 into NVIDIA:release/1.0 on Aug 21, 2025
4 checks passed
yuanjingx87 pushed a commit that referenced this pull request Aug 28, 2025
…a forward; update perf test config (#7014)

Signed-off-by: Venky Ganesh <[email protected]>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Sep 5, 2025
…a forward; update perf test config (NVIDIA#7014)

Signed-off-by: Venky Ganesh <[email protected]>
Signed-off-by: Wangshanshan <[email protected]>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Sep 5, 2025
…a forward; update perf test config (NVIDIA#7014)

Signed-off-by: Venky Ganesh <[email protected]>
Signed-off-by: Wangshanshan <[email protected]>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Sep 6, 2025
…a forward; update perf test config (NVIDIA#7014)

Signed-off-by: Venky Ganesh <[email protected]>
Signed-off-by: Wangshanshan <[email protected]>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Sep 6, 2025
…a forward; update perf test config (NVIDIA#7014)

Signed-off-by: Venky Ganesh <[email protected]>
Signed-off-by: Wangshanshan <[email protected]>
dominicshanshan pushed a commit to dominicshanshan/TensorRT-LLM that referenced this pull request Sep 7, 2025
…a forward; update perf test config (NVIDIA#7014)

Signed-off-by: Venky Ganesh <[email protected]>
Signed-off-by: Wangshanshan <[email protected]>
pamelap-nvidia pushed a commit to pamelap-nvidia/TensorRT-LLM that referenced this pull request Sep 10, 2025
…a forward; update perf test config (NVIDIA#7014)

Signed-off-by: Venky Ganesh <[email protected]>
farazkh80 pushed a commit to farazkh80/TensorRT-LLM that referenced this pull request Sep 10, 2025
…a forward; update perf test config (NVIDIA#7014)

Signed-off-by: Venky Ganesh <[email protected]>
farazkh80 pushed a commit to farazkh80/TensorRT-LLM that referenced this pull request Sep 14, 2025
…a forward; update perf test config (NVIDIA#7014)

Signed-off-by: Venky Ganesh <[email protected]>
Signed-off-by: Faraz Khoubsirat <[email protected]>