[https://nvbugs/5464088] [fix] Guard against fp8 activations in lora forward; update perf test config #7014
Conversation
📝 Walkthrough

Adds input casting to FP16/BF16 in the LoRA forward path so the grouped GEMM receives supported dtypes, with debug logging. Updates the PyTorch perf test config to derive LoRA counts from model labels, propagate max_loras/max_cpu_loras, and apply special settings (targets, mapping, rank) for phi_4_multimodal_instruct.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant UserCode as Caller
    participant LoraLayer as LoraLayer.forward
    participant GEMM as lora_grouped_gemm
    UserCode->>LoraLayer: forward(input, ...)
    alt LoRA active
        alt input dtype not FP16/BF16
            LoraLayer->>LoraLayer: check CUDA bf16 support
            LoraLayer->>LoraLayer: cast to bf16 or fp16
            LoraLayer->>GEMM: grouped_gemm(cast_input, ...)
        else input is FP16/BF16
            LoraLayer->>GEMM: grouped_gemm(input, ...)
        end
    else LoRA inactive
        LoraLayer->>LoraLayer: bypass LoRA path
    end
    LoraLayer-->>UserCode: output
```
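To make the diagram concrete, here is a minimal, hypothetical sketch of the casting guard it describes; the standalone function name `cast_for_lora_gemm` is illustrative only and is not the actual code in `tensorrt_llm/_torch/peft/lora/layer.py` (the real diff appears in the review comment below).

```python
import torch

# Illustrative sketch only: mirrors the guard discussed in this PR.
# Assumes a CUDA device is present, as in the LoRA forward path.
def cast_for_lora_gemm(x: torch.Tensor) -> torch.Tensor:
    """Cast `x` to FP16/BF16 if the grouped GEMM cannot consume its dtype."""
    if x.dtype in (torch.float16, torch.bfloat16):
        return x
    # Prefer BF16 when the GPU supports it, otherwise fall back to FP16.
    target_dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
    return x.to(target_dtype).contiguous()
```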
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Actionable comments posted: 0
🧹 Nitpick comments (3)
tensorrt_llm/_torch/peft/lora/layer.py (2)
Lines 124-132: Good FP8 guard; consider preserving the original dtype by casting outputs back

The guard avoids runtime failures by ensuring the custom op sees FP16/BF16. However, if upstream produced FP8 (or any non-FP16/BF16 dtype), returning LoRA outputs in FP16/BF16 may introduce a dtype mismatch for downstream fusion/addition logic expecting the original dtype.
Recommend preserving the original dtype for output by:
- Capturing the original input dtype and whether a recast is needed.
- Casting the outputs back to the original dtype just before returning.
This keeps the LoRA op constraints and preserves end-to-end dtype invariants.
Apply this diff in the guard to track original dtype and recast intent:
```diff
-        if x.dtype not in (torch.float16, torch.bfloat16):
-            target_dtype = torch.bfloat16 if torch.cuda.is_bf16_supported(
-            ) else torch.float16
-            logger.debug(
-                f"lora_grouped_gemm supports only FP16/BF16. Casting input from {x.dtype} to {target_dtype}."
-            )
-            x = x.to(target_dtype).contiguous()
+        orig_dtype = x.dtype
+        need_recast = orig_dtype not in (torch.float16, torch.bfloat16)
+        if need_recast:
+            target_dtype = (
+                torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
+            )
+            logger.debug(
+                f"lora_grouped_gemm supports only FP16/BF16. Casting input from {orig_dtype} to {target_dtype}."
+            )
+            x = x.to(target_dtype).contiguous()
```

And update the return sites to cast back when needed (outside the selected range, shown as a Python snippet for clarity):
```python
# Around lines 146-166:
if isinstance(lora_outputs, torch.Tensor):
    return lora_outputs.to(orig_dtype) if need_recast else lora_outputs
else:
    lora_output = []
    for module_idx in self.lora_module_types:
        if int(module_idx) in active_lora_module_ids:
            lora_output.append(lora_outputs.pop(0))
        else:
            lora_output.append(
                torch.zeros(
                    list(x.shape[:-1]) + [
                        self.output_hidden_sizes[
                            self.lora_module_types.index(module_idx)]
                    ],
                    dtype=x.dtype,
                    device=x.device,
                )
            )
    out = torch.cat(lora_output, dim=-1)
    return out.to(orig_dtype) if need_recast else out
```

Verification suggestion: Please confirm that the downstream addition path expects the LoRA contribution in the original activation dtype (e.g., FP8). If it instead expects BF16/FP16, skip the "cast back" step to avoid redundant conversions.
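As a purely hypothetical sanity check (not part of this PR), the intended cast-back contract can be illustrated with a stand-in function; `lora_like` and the scalar multiply below are placeholders for `LoraLayer.forward` and the grouped GEMM:

```python
import torch

def lora_like(x: torch.Tensor) -> torch.Tensor:
    # Hypothetical stand-in: cast in, compute, cast back to the caller's dtype.
    orig_dtype = x.dtype
    need_recast = orig_dtype not in (torch.float16, torch.bfloat16)
    if need_recast:
        target = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
        x = x.to(target).contiguous()
    out = x * 2.0  # placeholder for lora_grouped_gemm
    return out.to(orig_dtype) if need_recast else out

# FP8 activation in, LoRA contribution back in the original dtype.
if torch.cuda.is_available() and hasattr(torch, "float8_e4m3fn"):
    x = torch.randn(4, 8, device="cuda").to(torch.float8_e4m3fn)
    assert lora_like(x).dtype == torch.float8_e4m3fn
```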
Line 1: Missing NVIDIA copyright header

This file is missing the standard NVIDIA copyright/SPDX header mandated by the repo guidelines for source files. Please add it.

Example header (adjust year if needed):

```python
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
```

tests/integration/defs/perf/pytorch_model_config.py (1)
Lines 184-190: Make loras:X parsing case-insensitive and resilient to bad inputs

Great idea to derive lora_count from the model label. To harden it:
- Parse case-insensitively (some labels vary in case).
- Guard against non-integer values to avoid ValueError on int().
Apply this diff:
```diff
-    lora_count = 1
-    for part in model_label.split('-'):
-        if part.startswith('loras:'):
-            lora_count = max(1, int(part.split(':', 1)[1]))
-            break
+    lora_count = 1
+    for part in model_label.lower().split('-'):
+        if part.startswith('loras:'):
+            try:
+                lora_count = max(1, int(part.split(':', 1)[1]))
+            except ValueError:
+                lora_count = 1
+            break
```
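For illustration, the same hardened parsing can also be expressed as a small standalone helper; `parse_lora_count` is a hypothetical name used only in this sketch, not a function in the repo:

```python
def parse_lora_count(model_label: str, default: int = 1) -> int:
    """Extract the LoRA adapter count from a 'loras:N' segment of a model label."""
    for part in model_label.lower().split('-'):
        if part.startswith('loras:'):
            try:
                return max(1, int(part.split(':', 1)[1]))
            except ValueError:
                return default  # non-integer value, e.g. 'loras:auto'
    return default

assert parse_lora_count("llama-bench-LORAS:8") == 8      # case-insensitive
assert parse_lora_count("llama-bench") == 1              # no loras segment
assert parse_lora_count("llama-bench-loras:auto") == 1   # bad value falls back
```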
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- tensorrt_llm/_torch/peft/lora/layer.py (2 hunks)
- tests/integration/defs/perf/pytorch_model_config.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/peft/lora/layer.py
tests/integration/defs/perf/pytorch_model_config.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_torch/peft/lora/layer.py
tests/integration/defs/perf/pytorch_model_config.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tensorrt_llm/_torch/peft/lora/layer.py (1)
Lines 6-7: LGTM on logger import

Importing the package logger is appropriate for the new debug message. No issues spotted.
tests/integration/defs/perf/pytorch_model_config.py (1)
Lines 194-197: max_loras/max_cpu_loras are fully consumed in the pipeline

- tensorrt_llm/serve/scripts/benchmark_dataset.py: defines CLI options max_loras and max_cpu_loras, so config keys map directly to function parameters.
- tensorrt_llm/llmapi/llm.py (lines 864–869): reads both fields to set peft_cache_config_model.num_device_module_layer and num_host_module_layer.
- tensorrt_llm/_torch/pyexecutor/_util.py (lines 488–492): applies them identically when building the PyTorch executor's PEFT cache.
No silent no-ops remain—both settings drive cache sizing as intended.
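For readers unfamiliar with that plumbing, here is a rough, hypothetical sketch of how the two config keys could feed a PEFT cache config; `PeftCacheConfigSketch` and `apply_lora_cache_sizes` are invented names, and the real code in llm.py/_util.py may scale these values by the number of LoRA modules and layers:

```python
from dataclasses import dataclass

@dataclass
class PeftCacheConfigSketch:
    # Field names borrowed from the review comment above; everything else is assumed.
    num_device_module_layer: int = 0
    num_host_module_layer: int = 0

def apply_lora_cache_sizes(cfg: PeftCacheConfigSketch,
                           max_loras: int,
                           max_cpu_loras: int) -> None:
    cfg.num_device_module_layer = max_loras      # GPU-resident adapter slots
    cfg.num_host_module_layer = max_cpu_loras    # host (CPU) staging slots

cfg = PeftCacheConfigSketch()
apply_lora_cache_sizes(cfg, max_loras=8, max_cpu_loras=16)
```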
/bot run |
PR_Github #15700 [ run ] triggered by Bot |
PR_Github #15700 [ run ] completed with state |
/bot run |
PR_Github #15719 [ run ] triggered by Bot |
PR_Github #15719 [ run ] completed with state |
Branch force-pushed from 5818853 to cbb4eb5
/bot run |
PR_Github #15819 [ run ] triggered by Bot |
PR_Github #15876 [ run ] completed with state |
- Added logging for dtype casting in LoraLayer to ensure compatibility with FP16/BF16.
- Updated model configuration to derive the number of LoRA adapters from the model label, improving flexibility in adapter management.

Signed-off-by: Venky Ganesh <[email protected]>
- Modified _apply_activation method to accept a for_lora flag, allowing for specific handling of activation during LoRA operations.
- Updated the call to _apply_activation in GatedMLP to pass the for_lora argument, ensuring correct behavior in LoRA scenarios.
- Removed unnecessary dtype casting checks in LoraLayer, simplifying the code.

Signed-off-by: Venky Ganesh <[email protected]>
Signed-off-by: Venky Ganesh <[email protected]>
Branch force-pushed from 72ed15b to ab30002
/bot run --post-merge |
PR_Github #15948 [ run ] triggered by Bot |
PR_Github #15948 [ run ] completed with state |
/bot run --extra-stage "A100X-PyTorch-1" |
PR_Github #15959 [ run ] triggered by Bot |
PR_Github #15959 [ run ] completed with state |
/bot run |
PR_Github #15979 [ run ] triggered by Bot |
LGTM
PR_Github #15979 [ run ] completed with state |
…a forward; update perf test config (#7014) Signed-off-by: Venky Ganesh <[email protected]>
…a forward; update perf test config (NVIDIA#7014) Signed-off-by: Venky Ganesh <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…a forward; update perf test config (NVIDIA#7014) Signed-off-by: Venky Ganesh <[email protected]>
…a forward; update perf test config (NVIDIA#7014) Signed-off-by: Venky Ganesh <[email protected]> Signed-off-by: Faraz Khoubsirat <[email protected]>
Summary by CodeRabbit
Description
Test Coverage
GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

- --reuse-test (optional)pipeline-id (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- --disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.
- --disable-fail-fast (OPTIONAL): Disable fail fast on build/tests/infra failures.
- --skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- --stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- --gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- --test-backend "pytorch, cpp" (OPTIONAL): Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- --only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- --disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- --add-multi-gpu-test (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- --post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- --detailed-log (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- --debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
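As a concrete usage example from this PR's own history, /bot run --extra-stage "A100X-PyTorch-1" launched the ordinary pre-merge pipeline plus the A100X-PyTorch-1 stage, and /bot run --post-merge ran the L0 post-merge pipeline instead.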
kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.