Support W4A8 method of AngleSlim tool #6857
base: main
Conversation
📝 Walkthrough
Adds an Angelslim quant config loader and wires it into the pretrained config flow; extends HF quant config parsing (new quant methods, kv_cache and activation fields, ignored_quantization handling); adds ActivationScheme and exclude_quantization to QuantConfig; adjusts exclude-module overrides and DeepseekV3 dequant gating; overhauls fused MoE weight loading to accept weight_name and compute per-expert FP8/FP4 scales with SM100 resmoothing.
Sequence Diagram(s)
sequenceDiagram
participant Caller
participant ModelConfig
participant FS as Filesystem
participant Loader
Caller->>ModelConfig: from_pretrained(model_dir)
ModelConfig->>FS: stat angelslim_hf_quant_config.json
alt angelslim exists
ModelConfig->>Loader: load_angelslim_quant_config(file)
Loader-->>ModelConfig: quant_config, layer_quant_config=None
else
ModelConfig->>FS: stat hf_quant_config.json
ModelConfig->>Loader: load_hf_quant_config(file)
Loader-->>ModelConfig: quant_config (includes activation_scheme/exclude_quant_config)
end
ModelConfig-->>Caller: model config with QuantConfig
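A minimal sketch of the file-selection step shown in the diagram above, assuming the loader methods named there (the signatures of load_angelslim_quant_config / load_hf_quant_config are stand-ins, not the real ModelConfig API):

```python
from pathlib import Path


def pick_quant_config_loader(model_dir: str, model_config_cls, moe_backend=None):
    """Prefer the Angelslim config file, then fall back to hf_quant_config.json."""
    model_dir = Path(model_dir)
    angelslim_file = model_dir / "angelslim_hf_quant_config.json"
    hf_file = model_dir / "hf_quant_config.json"

    if angelslim_file.exists():
        # Angelslim path: per the walkthrough, layer_quant_config comes back as None.
        return model_config_cls.load_angelslim_quant_config(
            angelslim_file, model_dir, moe_backend)
    if hf_file.exists():
        return model_config_cls.load_hf_quant_config(hf_file, model_dir, moe_backend)
    # Neither file present: no quantization config to apply.
    return None, None
```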
sequenceDiagram
participant Caller
participant FusedMoE
participant Weights as WeightsDict
participant Backend
Caller->>FusedMoE: load_weights(module, weights, mode, weight_name)
FusedMoE->>Weights: read expert.w1.{weight_name}, expert.w3.{weight_name}, expert.w2.{weight_name}
alt FP8 QDQ path
FusedMoE->>FusedMoE: compute per-expert w1/w3/w2 scales & alphas
FusedMoE->>FusedMoE: normalize & interleave scales into target tensors
opt SM100 resmoothing
FusedMoE->>FusedMoE: resmooth weight_scale_inv → weight keys
end
end
FusedMoE->>Backend: set tensors/scales
FusedMoE-->>Caller: done
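As a rough illustration of the weight_name-keyed lookups in the second diagram (names and shapes are placeholders, not the actual FusedMoE loader):

```python
from typing import Any, Dict, List


def gather_expert_weights(weights: Dict[str, Any],
                          local_expert_ids: List[int],
                          weight_name: str = "weight") -> List[Dict[str, Any]]:
    """Collect w1/w3/w2 tensors for each local expert.

    Passing weight_name="qweight" lets the same path load quantized
    checkpoints, which is the point of the new parameter.
    """
    per_expert = []
    for expert_id in local_expert_ids:
        per_expert.append({
            "w1": weights[f"{expert_id}.w1.{weight_name}"],
            "w3": weights[f"{expert_id}.w3.{weight_name}"],
            "w2": weights[f"{expert_id}.w2.{weight_name}"],
        })
    return per_expert
```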
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Possibly related PRs
Suggested reviewers
Pre-merge checks: ❌ Failed checks (3 warnings)
Actionable comments posted: 5
🔭 Outside diff range comments (2)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)
1361-1368
: Use fully-qualified module name for exclusion check; current code can yield false negatives
is_module_excluded_from_quantization expects the full module path for pattern matching. Passing only names[-1] ("kv_b_proj") can miss exclusions like "model.layers.12.self_attn.kv_b_proj". Also, your new gating to skip dequant when exclude_quant_config is provided makes sense; keep it, but fix the name passed.
Apply this diff to pass the fully-qualified module name:
-        dequant_kv_b_proj = self.model_config.quant_config.is_module_excluded_from_quantization(
-            names[-1]) and self.model_config.quant_config.exclude_quant_config is None
+        dequant_kv_b_proj = (
+            self.model_config.quant_config.is_module_excluded_from_quantization(name)
+            and self.model_config.quant_config.exclude_quant_config is None
+        )
tensorrt_llm/_torch/models/modeling_utils.py (1)
476-488
: Preserve base QuantConfig when overriding for excluded modules; add None-safety
The new per-excluded override only sets quant_algo and activation_scheme and loses other QuantConfig fields (e.g., group_size, clamp_val, has_zero_point). Build the override from the existing QuantConfig to preserve defaults and runtime semantics. Also, guard against a None quant_config.
Apply this diff:
-        quant_algo = None
-        activation_scheme = None
-        exclude_quant_config = quant_config.exclude_quant_config
-        if exclude_quant_config:
-            quant_algo = exclude_quant_config.get("quant_algo", None)
-            activation_scheme = exclude_quant_config.get("activation_scheme", None)
-        new_config = QuantConfig(
-            quant_algo=quant_algo, kv_cache_quant_algo=kv_cache_quant_algo, activation_scheme=activation_scheme)
+        quant_algo = None
+        activation_scheme = None
+        exclude_quant_cfg = getattr(quant_config, "exclude_quant_config", None)
+        if exclude_quant_cfg:
+            quant_algo = exclude_quant_cfg.get("quant_algo")
+            activation_scheme = exclude_quant_cfg.get("activation_scheme")
+        # Preserve all other QuantConfig fields while overriding specific attributes
+        base_config = quant_config or QuantConfig()
+        new_config = dataclass_replace(
+            base_config,
+            quant_algo=quant_algo,
+            activation_scheme=activation_scheme,
+            kv_cache_quant_algo=kv_cache_quant_algo,
+        )
And add this import at the top of the file (outside this hunk):
from dataclasses import replace as dataclass_replace
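For illustration only, a self-contained sketch of why dataclasses.replace keeps the fields that are not overridden (DemoQuantConfig is a stand-in class, not the real QuantConfig):

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class DemoQuantConfig:
    quant_algo: Optional[str] = None
    kv_cache_quant_algo: Optional[str] = None
    activation_scheme: Optional[str] = None
    group_size: int = 128
    has_zero_point: bool = False


base = DemoQuantConfig(quant_algo="FP8", group_size=64, has_zero_point=True)
# Only the named fields change; group_size and has_zero_point carry over from base.
override = replace(base, quant_algo="W4A8_AWQ", activation_scheme="STATIC")
assert override.group_size == 64 and override.has_zero_point is True
```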
🧹 Nitpick comments (3)
tensorrt_llm/quantization/mode.py (1)
463-466
: Enum addition looks good; consider adding a brief docstring for clarity
ActivationScheme is correctly defined as a StrEnum with BaseEnumMeta and is ready for serialization and validation. A short docstring would improve readability.
Apply this diff to add a concise docstring:
 class ActivationScheme(StrEnum, metaclass=BaseEnumMeta):
-    STATIC = auto()
-    DYNAMIC = auto()
+    """Activation quantization scheme."""
+    STATIC = auto()
+    DYNAMIC = auto()
tensorrt_llm/models/modeling_utils.py (1)
143-145
: Tighten docstring wording and wrap long lines (fixes E501 and improves clarity)
The current docstrings have grammar issues and exceed 120 chars. Reword and wrap to meet style and static-analysis guidance.
Apply this diff:
-        exclude_quant_config (Dict, optional): The model of exclude_modules will use exclude_quant_config.
-        activation_scheme (tensorrt_llm.quantization.mode.ActivationScheme, optional): The input of activation quantize scheme.
+        exclude_quant_config (Dict, optional): Per-module quantization overrides applied to modules
+            matched by exclude_modules. Only the provided fields are overridden (e.g., quant_algo,
+            kv_cache_quant_algo, activation_scheme).
+        activation_scheme (tensorrt_llm.quantization.mode.ActivationScheme, optional): Activation
+            quantization scheme (e.g., STATIC or DYNAMIC).
tensorrt_llm/_torch/model_config.py (1)
267-279
: Simplify nested dictionary initialization for exclude_quantization.
The nested ternary operations make the code hard to read and maintain. Consider extracting this into a helper function for better clarity.
Extract the logic into a helper function:
+def _parse_exclude_quantization(json_exclude_quant_configs):
+    if not json_exclude_quant_configs:
+        return None
+
+    result = {}
+    if json_exclude_quant_configs.get('quant_algo'):
+        result['quant_algo'] = QuantAlgo(json_exclude_quant_configs['quant_algo'].upper())
+    else:
+        result['quant_algo'] = None
+
+    if json_exclude_quant_configs.get('kv_cache_quant_algo'):
+        result['kv_cache_quant_algo'] = QuantAlgo(json_exclude_quant_configs['kv_cache_quant_algo'].upper())
+    else:
+        result['kv_cache_quant_algo'] = None
+
+    if json_exclude_quant_configs.get('activation_scheme'):
+        result['activation_scheme'] = ActivationScheme(json_exclude_quant_configs['activation_scheme'].upper())
+    else:
+        result['activation_scheme'] = None
+
+    return result

 json_exclude_quant_configs = json_quant_configs.get('exclude_quantization', None)
-if json_exclude_quant_configs:
-    quant_config.exclude_quant_config = {
-        "quant_algo": QuantAlgo(
-            json_exclude_quant_configs.get('quant_algo', None).upper()
-        ) if json_exclude_quant_configs.get("quant_algo") else None,
-        "kv_cache_quant_algo": QuantAlgo(
-            json_exclude_quant_configs.get("kv_cache_quant_algo").upper()
-        ) if json_exclude_quant_configs.get("kv_cache_quant_algo") else None,
-        "activation_scheme": ActivationScheme(
-            json_exclude_quant_configs.get('activation_scheme', None).upper()
-        ) if json_exclude_quant_configs.get("activation_scheme") else None,
-    }
+quant_config.exclude_quant_config = _parse_exclude_quantization(json_exclude_quant_configs)
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- tensorrt_llm/_torch/model_config.py (4 hunks)
- tensorrt_llm/_torch/models/modeling_deepseekv3.py (1 hunks)
- tensorrt_llm/_torch/models/modeling_utils.py (1 hunks)
- tensorrt_llm/_torch/modules/fused_moe/quantization.py (6 hunks)
- tensorrt_llm/llmapi/llm_utils.py (2 hunks)
- tensorrt_llm/models/modeling_utils.py (3 hunks)
- tensorrt_llm/quantization/mode.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/quantization/mode.py
tensorrt_llm/_torch/models/modeling_deepseekv3.py
tensorrt_llm/_torch/models/modeling_utils.py
tensorrt_llm/models/modeling_utils.py
tensorrt_llm/_torch/model_config.py
tensorrt_llm/llmapi/llm_utils.py
tensorrt_llm/_torch/modules/fused_moe/quantization.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/quantization/mode.py
tensorrt_llm/_torch/models/modeling_deepseekv3.py
tensorrt_llm/_torch/models/modeling_utils.py
tensorrt_llm/models/modeling_utils.py
tensorrt_llm/_torch/model_config.py
tensorrt_llm/llmapi/llm_utils.py
tensorrt_llm/_torch/modules/fused_moe/quantization.py
🧬 Code Graph Analysis (7)
tensorrt_llm/quantization/mode.py (1)
  tensorrt_llm/_utils.py (1)
    BaseEnumMeta (773-780)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)
  tensorrt_llm/llmapi/llm_args.py (2)
    quant_config (2145-2148)
    quant_config (2151-2152)
tensorrt_llm/_torch/models/modeling_utils.py (2)
  tensorrt_llm/models/modeling_utils.py (2)
    quant_algo (550-551)
    QuantConfig (128-271)
  tensorrt_llm/llmapi/llm_args.py (2)
    quant_config (2145-2148)
    quant_config (2151-2152)
tensorrt_llm/models/modeling_utils.py (1)
  tensorrt_llm/quantization/mode.py (2)
    QuantAlgo (23-46)
    ActivationScheme (463-465)
tensorrt_llm/_torch/model_config.py (2)
  tensorrt_llm/quantization/mode.py (2)
    QuantAlgo (23-46)
    ActivationScheme (463-465)
  tensorrt_llm/models/modeling_utils.py (2)
    QuantConfig (128-271)
    quant_algo (550-551)
tensorrt_llm/llmapi/llm_utils.py (3)
  tensorrt_llm/models/modeling_utils.py (3)
    PretrainedConfig (369-570)
    QuantConfig (128-271)
    quant_algo (550-551)
  tensorrt_llm/quantization/mode.py (2)
    QuantAlgo (23-46)
    ActivationScheme (463-465)
  tensorrt_llm/llmapi/llm_args.py (2)
    quant_config (2145-2148)
    quant_config (2151-2152)
tensorrt_llm/_torch/modules/fused_moe/quantization.py (3)
  tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
    MoEWeightLoadingMode (13-15)
  tests/unittest/_torch/modules/test_fused_moe.py (1)
    load_weights (1660-1725)
  tensorrt_llm/_torch/utils.py (1)
    shape (103-104)
🪛 Ruff (0.12.2)
tensorrt_llm/models/modeling_utils.py
143-143: Line too long (127 > 120) (E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (7)
tensorrt_llm/llmapi/llm_utils.py (1)
438-474
: Manual verification needed for kv_cache_quant_method & activation_scheme enum values
It looks like no quantization_config.kv_cache_quant_method or .activation_scheme entries were found in the repo's JSON files; this may simply mean that those fields live in user-supplied Hugging Face configs or are generated at runtime. Before merging the enum-validation refactor, please:
- Confirm which enum members actually appear in your external HF configs or examples (e.g. via your model hubs or deployment manifests).
- Ensure the hardcoded “Expected one of” lists (INT8, FP8, NVFP4 and STATIC, DYNAMIC) cover every real-world value your users might supply (see the sketch after this list).
- Update the error messages or docstrings accordingly if there are additional valid enum values.
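A hedged sketch of the kind of enum validation and error message discussed in the checklist above (the member list is hypothetical; the real QuantAlgo/ActivationScheme sets may differ):

```python
from enum import Enum


class KvCacheQuantMethod(Enum):
    INT8 = "INT8"
    FP8 = "FP8"
    NVFP4 = "NVFP4"


def parse_kv_cache_quant_method(raw: str) -> KvCacheQuantMethod:
    """Map a config string onto the enum, failing with an explicit list of options."""
    try:
        return KvCacheQuantMethod(raw.upper())
    except ValueError:
        expected = ", ".join(member.value for member in KvCacheQuantMethod)
        raise ValueError(
            f"Unsupported kv_cache_quant_method {raw!r}. Expected one of: {expected}."
        ) from None
```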
tensorrt_llm/_torch/model_config.py (2)
18-18
: LGTM! Import addition for ActivationScheme is appropriate.
The addition of ActivationScheme to the import statement is necessary for the new functionality.
326-330
: Good addition of W4A8_AWQ support with proper error handling.
The new case for "w4a8_awq" correctly maps to QuantAlgo.W4A8_AWQ, and the addition of NotImplementedError for unsupported quant methods improves error handling.
tensorrt_llm/_torch/modules/fused_moe/quantization.py (4)
209-210
: Good addition of weight_name parameter for flexible weight loading.
The addition of the weight_name parameter with a default value of "weight" provides backward compatibility while enabling support for different weight types.
218-220
: LGTM! Proper use of weight_name in expert weight keys.
The implementation correctly uses the weight_name parameter to construct the expert weight keys, allowing for flexible weight type selection.
959-963
: Good override maintaining backward compatibility.
The WInt4AFP8FusedMoEMethod.load_weights override with weight_name: str = "qweight" properly handles quantized weights while maintaining the interface contract.
985-991
: Per-expert FP8 scale handling is correct
All of the expected "{expert_id}.w1.weight_scale" and "{expert_id}.w3.weight_scale" keys are present in the tests, and the code's element-wise max → stack → reshape logic aligns with the shape of module.fc31_alpha. No missing keys or shape mismatches were found.
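A minimal sketch of the element-wise max → stack pattern referenced above, with made-up shapes (not the actual fused-MoE implementation):

```python
import torch


def fuse_w3_w1_alpha(w1_scales, w3_scales):
    """Per expert, take the element-wise max of the w1 and w3 weight scales.

    Each argument is a list with one tensor per local expert; stacking the
    maxima yields a (num_experts, ...) tensor that can be reshaped to match
    an fc31 alpha buffer.
    """
    per_expert_max = [torch.maximum(w1, w3) for w1, w3 in zip(w1_scales, w3_scales)]
    return torch.stack(per_expert_max)


# Two experts with scalar per-expert weight scales.
alpha = fuse_w3_w1_alpha([torch.tensor(0.02), torch.tensor(0.05)],
                         [torch.tensor(0.03), torch.tensor(0.01)])
# alpha == tensor([0.0300, 0.0500])
```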
tensorrt_llm/_torch/model_config.py
Outdated
    @staticmethod
    def load_angelslim_quant_config(quant_config_file, model_dir, moe_backend):
        quant_config = QuantConfig()
        layer_quant_config = None

        with open(quant_config_file) as f:
            quant_config_dict = json.load(f)

        json_quant_configs = quant_config_dict['quantization']

        quant_config.quant_algo = QuantAlgo(
            json_quant_configs.get('quant_algo', None).upper()) if json_quant_configs.get("quant_algo") else None
        # fp8_pb_wo from modelopt is the same as FP8_BLOCK_SCALES
        if quant_config.quant_algo == "fp8_pb_wo":
            quant_config.quant_algo = QuantAlgo('FP8_BLOCK_SCALES')

        quant_config.kv_cache_quant_algo = QuantAlgo(
            json_quant_configs.get("kv_cache_quant_algo").upper()
        ) if json_quant_configs.get("kv_cache_quant_algo") else None
        quant_config.group_size = json_quant_configs.get('group_size', None)
        quant_config.exclude_modules = json_quant_configs.get(
            'exclude_modules', None)
        quant_config.activation_scheme = ActivationScheme(
            json_quant_configs.get('activation_scheme', None).upper()
        ) if json_quant_configs.get("activation_scheme") else None

        json_exclude_quant_configs = json_quant_configs.get('exclude_quantization', None)
        if json_exclude_quant_configs:
            quant_config.exclude_quant_config = {
                "quant_algo": QuantAlgo(
                    json_exclude_quant_configs.get('quant_algo', None).upper()
                ) if json_exclude_quant_configs.get("quant_algo") else None,
                "kv_cache_quant_algo": QuantAlgo(
                    json_exclude_quant_configs.get("kv_cache_quant_algo").upper()
                ) if json_exclude_quant_configs.get("kv_cache_quant_algo") else None,
                "activation_scheme": ActivationScheme(
                    json_exclude_quant_configs.get('activation_scheme', None).upper()
                ) if json_exclude_quant_configs.get("activation_scheme") else None,
            }
        return quant_config, layer_quant_config
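For reference, the keys this hunk reads could be exercised with a 'quantization' section shaped like the following (a hypothetical example, not taken from the PR or the tool's documentation):

```python
# Hypothetical "quantization" section of angelslim_hf_quant_config.json.
sample_quantization_section = {
    "quant_algo": "w4a8_awq",
    "kv_cache_quant_algo": "fp8",
    "group_size": 128,
    "exclude_modules": ["lm_head"],
    "activation_scheme": "static",
    "exclude_quantization": {
        "quant_algo": "fp8",
        "kv_cache_quant_algo": None,
        "activation_scheme": "dynamic",
    },
}
```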
Add error handling for malformed JSON config files.
The new load_angelslim_quant_config method assumes the JSON file is well-formed and contains the expected structure. Add error handling for cases where required keys might be missing or the JSON is malformed.
Consider adding validation like this:
 def load_angelslim_quant_config(quant_config_file, model_dir, moe_backend):
     quant_config = QuantConfig()
     layer_quant_config = None
-    with open(quant_config_file) as f:
-        quant_config_dict = json.load(f)
+    try:
+        with open(quant_config_file) as f:
+            quant_config_dict = json.load(f)
+    except (json.JSONDecodeError, IOError) as e:
+        raise ValueError(f"Failed to load angelslim config from {quant_config_file}: {e}")
-    json_quant_configs = quant_config_dict['quantization']
+    json_quant_configs = quant_config_dict.get('quantization', {})
+    if not json_quant_configs:
+        raise ValueError(f"Missing 'quantization' section in {quant_config_file}")
        quant_config.quant_algo = QuantAlgo(
            json_quant_configs.get('quant_algo', None).upper()) if json_quant_configs.get("quant_algo") else None
        # fp8_pb_wo from modelopt is the same as FP8_BLOCK_SCALES
        if quant_config.quant_algo == "fp8_pb_wo":
            quant_config.quant_algo = QuantAlgo('FP8_BLOCK_SCALES')
Logical issue: string comparison against enum value.
Line 254 compares quant_config.quant_algo (which is now a QuantAlgo enum) against the string "fp8_pb_wo". This will always fail because you're comparing an enum to a string.
Fix the comparison:
- quant_config.quant_algo = QuantAlgo(
- json_quant_configs.get('quant_algo', None).upper()) if json_quant_configs.get("quant_algo") else None
- # fp8_pb_wo from modelopt is the same as FP8_BLOCK_SCALES
- if quant_config.quant_algo == "fp8_pb_wo":
- quant_config.quant_algo = QuantAlgo('FP8_BLOCK_SCALES')
+ algo_str = json_quant_configs.get('quant_algo')
+ if algo_str:
+ algo_str = algo_str.upper()
+ # fp8_pb_wo from modelopt is the same as FP8_BLOCK_SCALES
+ if algo_str == "FP8_PB_WO":
+ quant_config.quant_algo = QuantAlgo('FP8_BLOCK_SCALES')
+ else:
+ quant_config.quant_algo = QuantAlgo(algo_str)
+ else:
+ quant_config.quant_algo = None
🤖 Prompt for AI Agents
tensorrt_llm/_torch/model_config.py around lines 251 to 255: the code currently
compares the QuantAlgo enum instance to the string "fp8_pb_wo", which always
fails; change the check to compare the enum's name or value to that string (for
example: if quant_config.quant_algo.name.lower() == "fp8_pb_wo") and then set
quant_config.quant_algo = QuantAlgo("FP8_BLOCK_SCALES") as before.
        elif (quant_config_file := model_dir / 'angelslim_hf_quant_config.json').exists():
            quant_config, layer_quant_config = cls.load_angelslim_quant_config(
                quant_config_file, model_dir, moe_backend)
💡 Verification agent
🧩 Analysis chain
Verify the consistency of the naming convention for config files.
The code now supports both angelslim_hf_quant_config.json and hf_quant_config.json. Ensure this naming is consistent with the tool's documentation and expected usage.
🏁 Script executed:
#!/bin/bash
# Check for any existing references to these config file names in the codebase
echo "Searching for references to angelslim_hf_quant_config.json:"
rg "angelslim_hf_quant_config" --type py -A 2 -B 2
echo -e "\nSearching for references to hf_quant_config.json:"
rg "hf_quant_config\.json" --type py -A 2 -B 2
echo -e "\nChecking for any documentation about these config files:"
fd -e md -e rst | xargs rg -l "angelslim_hf_quant_config|hf_quant_config"
Length of output: 8264
Update documentation to include the new AngelsLim config filename
The code now recognizes both hf_quant_config.json and angelslim_hf_quant_config.json, but our docs and examples only mention the former. Please update all references so users know both are supported:
• docs/source/performance/perf-benchmarking.md
• examples/models/core/deepseek_v3/README.md
– Add a note explaining that if an angelslim_hf_quant_config.json is present in the model directory, it will be loaded via load_angelslim_quant_config()
– Ensure any code snippets or CLI examples show both filenames where applicable
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/model_config.py around lines 438 to 440, the code now
accepts an additional AngelsLim config filename
(angelslim_hf_quant_config.json); update the documentation and example README to
mention both hf_quant_config.json and angelslim_hf_quant_config.json. Edit
docs/source/performance/perf-benchmarking.md and
examples/models/core/deepseek_v3/README.md to add a short note that if
angelslim_hf_quant_config.json exists in the model directory it will be loaded
via load_angelslim_quant_config(), and update any code snippets or CLI examples
to show both filenames where applicable (e.g., list both filenames in examples
and usage text).
+        all_w3_scales = torch.stack(all_w3_scales) / all_w3_w1_scales_fp8_max.unsqueeze(2)
         all_w1_scales = [
-            load_weight_shard(weights[f"{expert_id}.w1.weight_scale_inv"],
+            load_weight_shard(weights[f"{expert_id}.w1.weight_scale.int4"],
                               module.tp_size,
                               module.tp_rank,
                               TensorParallelMode.COLUMN,
                               device=self.device)
             for expert_id in module.initial_local_expert_ids
         ]
+        all_w1_scales = torch.stack(all_w1_scales) / all_w3_w1_scales_fp8_max.unsqueeze(2)
         all_w3_w1_scales = torch.cat(
-            [torch.stack(all_w3_scales),
-             torch.stack(all_w1_scales)], dim=-2)
+            [all_w3_scales,
+             all_w1_scales], dim=-2)
Potential division by zero when scales are zero.
The code divides by all_w3_w1_scales_fp8_max and all_w2_scales_fp8 without checking if these values are zero. This could lead to runtime errors or inf/nan values.
Add validation for zero scales:
all_w3_scales = torch.stack(all_w3_scales) / all_w3_w1_scales_fp8_max.unsqueeze(2)
+if torch.any(all_w3_w1_scales_fp8_max == 0):
+ raise ValueError("Found zero FP8 weight scales, which would cause division by zero")
all_w1_scales = torch.stack(all_w1_scales) / all_w3_w1_scales_fp8_max.unsqueeze(2)
all_w2_scales = torch.stack(all_w2_scales) / all_w2_scales_fp8.unsqueeze(2)
+if torch.any(all_w2_scales_fp8 == 0):
+ raise ValueError("Found zero FP8 weight scales for w2, which would cause division by zero")
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/fused_moe/quantization.py around lines 1002 to
1014, the code divides tensors by all_w3_w1_scales_fp8_max and all_w2_scales_fp8
without guarding against zeros; add a defensive check to avoid divide-by-zero by
replacing zero (or near-zero) elements with a small safe epsilon (e.g., 1e-6) or
use torch.clamp_min / torch.where to ensure the denominators are >= epsilon
before performing the divisions, and optionally log or assert if any
replacements occurred to aid debugging.
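A small standalone sketch of the epsilon-guarded division suggested above (tensor names and shapes are placeholders):

```python
import torch


def safe_rescale(scales: torch.Tensor, fp8_max: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Divide scales by fp8_max while guarding against zero denominators."""
    denom = fp8_max.clamp_min(eps).unsqueeze(-1)
    return scales / denom


scales = torch.rand(2, 4, 8)
fp8_max = torch.tensor([[0.0, 0.5, 1.0, 2.0],
                        [0.1, 0.0, 0.3, 0.7]])
rescaled = safe_rescale(scales, fp8_max)  # finite everywhere, even where fp8_max == 0
```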
         all_w1_scales = [
-            load_weight_shard(weights[f"{expert_id}.w1.weight_scale_inv"],
+            load_weight_shard(weights[f"{expert_id}.w1.weight_scale.int4"],
Inconsistent weight scale key naming: weight_scale.int4 vs weight_scale_inv.
Line 1004 uses f"{expert_id}.w1.weight_scale.int4" while line 995 uses f"{expert_id}.w3.weight_scale_inv". This inconsistency could indicate different weight formats or be a bug.
Please verify if this naming difference is intentional:
#!/bin/bash
# Check for both naming patterns in the codebase
echo "Searching for weight_scale.int4 pattern:"
rg "weight_scale\.int4" --type py -B 2 -A 2
echo -e "\nSearching for weight_scale_inv pattern:"
rg "weight_scale_inv" --type py -B 2 -A 2
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/fused_moe/quantization.py around line 1004 there
is an inconsistent key name: line 1004 uses f"{expert_id}.w1.weight_scale.int4"
while earlier (line ~995) uses f"{expert_id}.w3.weight_scale_inv"; confirm
whether the correct stored key is weight_scale_inv or weight_scale.int4 by
searching the repo for both patterns, then make the keys consistent (prefer
using the canonical weight_scale_inv if other shards use that naming), update
the load_weight_shard call to use the canonical key across all shards, and add a
brief inline comment explaining the chosen convention so future readers know
which format is expected.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/models/modeling_utils.py (1)
468-507
: Rename all remaining exclude_quant_config references to exclude_quantization.
- tensorrt_llm/llmapi/llm_utils.py (around line 469): change quant_config.exclude_quant_config → quant_config.exclude_quantization.
- tensorrt_llm/_torch/model_config.py (around lines 269 & 359): update both the assignment and subsequent checks of quant_config.exclude_quant_config → quant_config.exclude_quantization.
- Re-run rg '\bexclude_quant_config\b' to confirm no leftovers.
♻️ Duplicate comments (2)
tensorrt_llm/_torch/model_config.py (2)
251-256
: Fix enum/string mismatch for quant_algo ("fp8_pb_wo" mapping never triggers).
You construct an enum on Line 251, then compare it to a string on Line 254. This always fails. Normalize the source string first, then map to QuantAlgo.
-        quant_config.quant_algo = QuantAlgo(
-            json_quant_configs.get('quant_algo', None).upper()) if json_quant_configs.get("quant_algo") else None
-        # fp8_pb_wo from modelopt is the same as FP8_BLOCK_SCALES
-        if quant_config.quant_algo == "fp8_pb_wo":
-            quant_config.quant_algo = QuantAlgo('FP8_BLOCK_SCALES')
+        algo_str = json_quant_configs.get('quant_algo')
+        if algo_str:
+            algo_str = algo_str.upper()
+            # fp8_pb_wo from modelopt is the same as FP8_BLOCK_SCALES
+            if algo_str == "FP8_PB_WO":
+                quant_config.quant_algo = QuantAlgo.FP8_BLOCK_SCALES
+            else:
+                quant_config.quant_algo = QuantAlgo(algo_str)
+        else:
+            quant_config.quant_algo = None
246-249
: Add robust error handling for malformed/absent quantization JSON.
Currently assumes the file is valid JSON and contains 'quantization'. Add try/except and schema checks.
-        with open(quant_config_file) as f:
-            quant_config_dict = json.load(f)
+        try:
+            with open(quant_config_file) as f:
+                quant_config_dict = json.load(f)
+        except (json.JSONDecodeError, OSError) as e:
+            raise ValueError(f"Failed to load angelslim config from {quant_config_file}: {e}") from e
+        json_quant_configs = quant_config_dict.get('quantization')
+        if not isinstance(json_quant_configs, dict):
+            raise ValueError(f"Missing or invalid 'quantization' in {quant_config_file}")
-
-        json_quant_configs = quant_config_dict['quantization']
🧹 Nitpick comments (5)
tensorrt_llm/_torch/model_config.py (2)
242-242
: Silence ARG004 or use the params.
model_dir and moe_backend were unused (now moe_backend is used above). If model_dir remains unused, rename it to _model_dir or document the intent.
-    def load_angelslim_quant_config(quant_config_file, model_dir, moe_backend):
+    def load_angelslim_quant_config(quant_config_file, _model_dir, moe_backend):
1-1
: Missing 2025 NVIDIA Apache-2.0 header.
Please prepend the standard header per repo guidelines.
tensorrt_llm/_torch/models/modeling_utils.py (1)
480-493
: Good construction of per-exclusion QuantConfig.
- Honors kv_cache precedence.
- Allows per-excluded-module quant_algo/activation_scheme/group_size.
Minor: if exclude_quantization is None, consider using the base quant_config.group_size to keep consistency.
-        group_size = 128
+        group_size = quant_config.group_size or 128
tensorrt_llm/llmapi/llm_utils.py (2)
1-1
: Missing 2025 NVIDIA Apache-2.0 header.
Please prepend the standard header per repo guidelines.
354-484
: Unify configuration key for ignored modules
Replace ignored_layers in tensorrt_llm/_torch/model_config.py:354 with ignored_modules to match the usage in llmapi/llm_utils.py. No string-literal comparisons against QuantAlgo were found.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- tensorrt_llm/_torch/model_config.py (4 hunks)
- tensorrt_llm/_torch/models/modeling_deepseekv3.py (1 hunks)
- tensorrt_llm/_torch/models/modeling_utils.py (1 hunks)
- tensorrt_llm/llmapi/llm_utils.py (3 hunks)
- tensorrt_llm/models/modeling_utils.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- tensorrt_llm/_torch/models/modeling_deepseekv3.py
- tensorrt_llm/models/modeling_utils.py
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
tensorrt_llm/_torch/model_config.py
tensorrt_llm/llmapi/llm_utils.py
tensorrt_llm/_torch/models/modeling_utils.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
tensorrt_llm/_torch/model_config.py
tensorrt_llm/llmapi/llm_utils.py
tensorrt_llm/_torch/models/modeling_utils.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
tensorrt_llm/_torch/model_config.py
tensorrt_llm/llmapi/llm_utils.py
tensorrt_llm/_torch/models/modeling_utils.py
🧬 Code graph analysis (3)
tensorrt_llm/_torch/model_config.py (2)
  tensorrt_llm/quantization/mode.py (2)
    QuantAlgo (23-46)
    ActivationScheme (463-465)
  tensorrt_llm/models/modeling_utils.py (2)
    QuantConfig (128-272)
    quant_algo (551-552)
tensorrt_llm/llmapi/llm_utils.py (4)
  tensorrt_llm/models/modeling_utils.py (3)
    PretrainedConfig (370-571)
    QuantConfig (128-272)
    quant_algo (551-552)
  tensorrt_llm/quantization/mode.py (2)
    QuantAlgo (23-46)
    ActivationScheme (463-465)
  tensorrt_llm/llmapi/llm_args.py (2)
    quant_config (2145-2148)
    quant_config (2151-2152)
  cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h (1)
    group_size (63-63)
tensorrt_llm/_torch/models/modeling_utils.py (2)
  tensorrt_llm/models/modeling_utils.py (2)
    quant_algo (551-552)
    QuantConfig (128-272)
  tensorrt_llm/llmapi/llm_args.py (2)
    quant_config (2145-2148)
    quant_config (2151-2152)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/model_config.py
242-242: Unused static method argument: model_dir (ARG004)
242-242: Unused static method argument: moe_backend (ARG004)
🔇 Additional comments (5)
tensorrt_llm/_torch/models/modeling_utils.py (1)
28-28
: Import looks correct.
ActivationScheme import aligns with the new QuantConfig fields.
tensorrt_llm/llmapi/llm_utils.py (4)
429-433
: FP8 block-scales parsing LGTM.
Asserts (128,128) and sets group_size accordingly.
444-447
: Support for 'fp8' and 'w4a8_awq' quant_method LGTM.
Sets group_size for AWQ.
452-459
: KV cache and activation scheme mapping LGTM.
Uppercasing to enum names is consistent.
480-482
: Log line LGTM.
Message is informative and low-noise.
        json_exclude_quantization= json_quant_configs.get('exclude_quantization', None)
        if json_exclude_quantization:
            quant_config.exclude_quant_config = {
                "quant_algo": QuantAlgo(
                    json_exclude_quantization.get('quant_algo', None).upper()
                ) if json_exclude_quantization.get("quant_algo") else None,
                "kv_cache_quant_algo": QuantAlgo(
                    json_exclude_quantization.get("kv_cache_quant_algo").upper()
                ) if json_exclude_quantization.get("kv_cache_quant_algo") else None,
                "activation_scheme": ActivationScheme(
                    json_exclude_quantization.get('activation_scheme', None).upper()
                ) if json_exclude_quantization.get("activation_scheme") else None,
                "group_size": json_exclude_quantization.get('group_size', None),
            }
            if quant_config.exclude_quantization["quant_algo"] in [QuantAlgo.FP8_BLOCK_SCALES, QuantAlgo.W4A8_AWQ]:
                if quant_config.exclude_quantization["group_size"] is None:
                    quant_config.exclude_quantization["group_size"] = 128

        if quant_config.quant_algo in [QuantAlgo.FP8_BLOCK_SCALES, QuantAlgo.W4A8_AWQ]:
            if quant_config.group_size is None:
                quant_config.group_size = 128
Typo breaks exclude overrides: use QuantConfig.exclude_quantization consistently and initialize it.
Lines 269 and 356 use exclude_quant_config but QuantConfig defines exclude_quantization. Also, subsequent code (Line 281) expects exclude_quantization. This prevents per-module overrides from being applied and can raise exceptions.
- json_exclude_quantization= json_quant_configs.get('exclude_quantization', None)
+ json_exclude_quantization = json_quant_configs.get('exclude_quantization', None)
if json_exclude_quantization:
- quant_config.exclude_quant_config = {
+ quant_config.exclude_quantization = {
"quant_algo": QuantAlgo(
json_exclude_quantization.get('quant_algo', None).upper()
) if json_exclude_quantization.get("quant_algo") else None,
"kv_cache_quant_algo": QuantAlgo(
json_exclude_quantization.get("kv_cache_quant_algo").upper()
) if json_exclude_quantization.get("kv_cache_quant_algo") else None,
"activation_scheme": ActivationScheme(
json_exclude_quantization.get('activation_scheme', None).upper()
) if json_exclude_quantization.get("activation_scheme") else None,
"group_size": json_exclude_quantization.get('group_size', None),
}
- if quant_config.exclude_quantization["quant_algo"] in [QuantAlgo.FP8_BLOCK_SCALES, QuantAlgo.W4A8_AWQ]:
+ if quant_config.exclude_quantization["quant_algo"] in {QuantAlgo.FP8_BLOCK_SCALES, QuantAlgo.W4A8_AWQ}:
if quant_config.exclude_quantization["group_size"] is None:
quant_config.exclude_quantization["group_size"] = 128
-
- if quant_config.quant_algo in [QuantAlgo.FP8_BLOCK_SCALES, QuantAlgo.W4A8_AWQ]:
+ if quant_config.quant_algo in {QuantAlgo.FP8_BLOCK_SCALES, QuantAlgo.W4A8_AWQ}:
if quant_config.group_size is None:
quant_config.group_size = 128
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/model_config.py around lines 267-287, replace the
misspelled/incorrect property exclude_quant_config with the correct
QuantConfig.exclude_quantization and ensure quant_config.exclude_quantization is
initialized (e.g., to an empty dict or appropriate default) before assigning its
keys; update any other occurrences (e.g., around line 356 mentioned in the
comment) to use exclude_quantization so per-module overrides are applied
consistently and avoid exceptions.
        if quant_config.quant_algo in [QuantAlgo.FP8_BLOCK_SCALES, QuantAlgo.W4A8_AWQ]:
            if quant_config.group_size is None:
                quant_config.group_size = 128

        return quant_config, layer_quant_config
🛠️ Refactor suggestion
Replicate TRTLLM default excludes for FP8_BLOCK_SCALES (parity with modelopt path).
For AngelsLim FP8 block scales, set default exclude_modules when moe_backend == 'TRTLLM', as done in load_modelopt_quant_config.
- return quant_config, layer_quant_config
+ if (moe_backend == 'TRTLLM'
+ and quant_config.quant_algo == QuantAlgo.FP8_BLOCK_SCALES
+ and quant_config.exclude_modules is None):
+ quant_config.exclude_modules = ["*kv_b_proj*", "*k_b_proj*", "*eh_proj"]
+ return quant_config, layer_quant_config
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/model_config.py around lines 285 to 289, when
quant_config.quant_algo == QuantAlgo.FP8_BLOCK_SCALES and moe_backend ==
'TRTLLM' you must set the default quant_config.exclude_modules to the same
TRTLLM default excludes used by load_modelopt_quant_config; update this branch
to assign quant_config.exclude_modules (if None) to the common
TRTLLM_DEFAULT_EXCLUDES (or the exact list used in load_modelopt_quant_config)
so FP8_BLOCK_SCALES follows the same exclude defaults as the modelopt path.
        if quant_config.exclude_modules:
            if hf_quant_config.get("ignored_layers"):
                quant_config.exclude_modules += hf_quant_config.get("ignored_layers")
        else:
            quant_config.exclude_modules = hf_quant_config.get("ignored_layers")
🛠️ Refactor suggestion
Accept both 'ignored_modules' and 'ignored_layers' from HF config.
Two places in the codebase use different keys. Normalize here to avoid silently missing excludes.
- if quant_config.exclude_modules:
- if hf_quant_config.get("ignored_layers"):
- quant_config.exclude_modules += hf_quant_config.get("ignored_layers")
- else:
- quant_config.exclude_modules = hf_quant_config.get("ignored_layers")
+ ignored = hf_quant_config.get("ignored_modules") or hf_quant_config.get("ignored_layers")
+ if ignored:
+ if quant_config.exclude_modules:
+ quant_config.exclude_modules += ignored
+ else:
+ quant_config.exclude_modules = ignored
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/model_config.py around lines 350 to 355, the code only
reads hf_quant_config["ignored_layers"] which misses configs using
"ignored_modules"; update the logic to accept and normalize both keys by
checking for "ignored_modules" first then "ignored_layers" (or merge both if
present), treat missing values as empty lists, ensure
quant_config.exclude_modules is a list before extending/appending, merge the HF
excludes into quant_config.exclude_modules (avoiding None) and optionally
deduplicate the final list.
        # set exclude_quant_config
        hf_ignored_quantization_config = hf_quant_config.get("ignored_quantization_config")
        if hf_ignored_quantization_config:
            quant_config.exclude_quant_config = {
                "kv_cache_quant_algo": QuantAlgo(
                    hf_ignored_quantization_config.get("kv_cache_quant_method").upper()
                ) if hf_ignored_quantization_config.get("kv_cache_quant_method") else None,
                "activation_scheme": ActivationScheme(
                    hf_ignored_quantization_config.get("activation_scheme").upper()
                ) if hf_ignored_quantization_config.get("activation_scheme") else None,
                "group_size": 128,
            }
            if hf_ignored_quantization_config.get(
                    "quant_method") == "fp8" and hf_ignored_quantization_config.get("weight_block_size", []):
                quant_config.exclude_quantization["quant_algo"] = QuantAlgo.FP8_BLOCK_SCALES
                block_size = hf_ignored_quantization_config.get("weight_block_size", [])
                assert tuple(block_size) == (
                    128,
                    128), "FP8_BLOCK_SCALES only supports block_size=(128,128)"
                quant_config.exclude_quantization["group_size"] = block_size[0]
            elif hf_ignored_quantization_config.get("quant_method") == "fp8":
                quant_config.exclude_quantization["quant_algo"] = QuantAlgo.FP8
            elif hf_ignored_quantization_config.get("quant_method") == "w4a8_awq":
                quant_config.exclude_quantization["quant_algo"] = QuantAlgo.W4A8_AWQ
                quant_config.exclude_quantization["group_size"] = hf_ignored_quantization_config.get(
                    "weight_group_size", 128)
            else:
                raise NotImplementedError(f"Unsupported quantization_config.ignored_quantization_config: "
                                          f"{hf_ignored_quantization_config}.")
Fix exclude overrides for HF: wrong attribute name + missing FP8 block handling.
Same naming typo as above; also add FP8_BLOCK_SCALES handling when ignored_quantization_config carries weight_block_size.
- hf_ignored_quantization_config = hf_quant_config.get("ignored_quantization_config")
- if hf_ignored_quantization_config:
- quant_config.exclude_quant_config = {
- "kv_cache_quant_algo": QuantAlgo(
- hf_ignored_quantization_config.get("kv_cache_quant_method").upper()
- ) if hf_ignored_quantization_config.get("kv_cache_quant_method") else None,
- "activation_scheme": ActivationScheme(
- hf_ignored_quantization_config.get("activation_scheme").upper()
- ) if hf_ignored_quantization_config.get("activation_scheme") else None,
- "group_size": 128,
- }
- if hf_ignored_quantization_config.get(
- "quant_method") == "fp8" and hf_ignored_quantization_config.get("weight_block_size", []):
- quant_config.exclude_quantization["quant_algo"] = QuantAlgo.FP8_BLOCK_SCALES
- block_size = hf_ignored_quantization_config.get("weight_block_size", [])
- assert tuple(block_size) == (
- 128,
- 128), "FP8_BLOCK_SCALES only supports block_size=(128,128)"
- quant_config.exclude_quantization["group_size"] = block_size[0]
- elif hf_ignored_quantization_config.get("quant_method") == "fp8":
- quant_config.exclude_quantization["quant_algo"] = QuantAlgo.FP8
- elif hf_ignored_quantization_config.get("quant_method") == "w4a8_awq":
- quant_config.exclude_quantization["quant_algo"] = QuantAlgo.W4A8_AWQ
- quant_config.exclude_quantization["group_size"] = hf_ignored_quantization_config.get(
- "weight_group_size", 128)
- else:
- raise NotImplementedError(f"Unsupported quantization_config.ignored_quantization_config: "
- f"{hf_ignored_quantization_config}.")
+ hf_ignored_quant = hf_quant_config.get("ignored_quantization_config")
+ if hf_ignored_quant:
+ quant_config.exclude_quantization = {
+ "kv_cache_quant_algo": QuantAlgo(hf_ignored_quant["kv_cache_quant_method"].upper())
+ if hf_ignored_quant.get("kv_cache_quant_method") else None,
+ "activation_scheme": ActivationScheme(hf_ignored_quant["activation_scheme"].upper())
+ if hf_ignored_quant.get("activation_scheme") else None,
+ "group_size": 128,
+ }
+ if hf_ignored_quant.get("quant_method") == "fp8" and hf_ignored_quant.get("weight_block_size"):
+ block_size = hf_ignored_quant["weight_block_size"]
+ assert tuple(block_size) == (128, 128), "FP8_BLOCK_SCALES only supports block_size=(128,128)"
+ quant_config.exclude_quantization["quant_algo"] = QuantAlgo.FP8_BLOCK_SCALES
+ quant_config.exclude_quantization["group_size"] = block_size[0]
+ elif hf_ignored_quant.get("quant_method") == "fp8":
+ quant_config.exclude_quantization["quant_algo"] = QuantAlgo.FP8
+ elif hf_ignored_quant.get("quant_method") == "w4a8_awq":
+ quant_config.exclude_quantization["quant_algo"] = QuantAlgo.W4A8_AWQ
+ quant_config.exclude_quantization["group_size"] = hf_ignored_quant.get("weight_group_size", 128)
+ else:
+ raise NotImplementedError(f"Unsupported quantization_config.ignored_quantization_config: {hf_ignored_quant}.")
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/model_config.py around lines 356 to 385, the code sets
quant_config.exclude_quant_config but later references
quant_config.exclude_quantization (typo/inconsistent attribute name) and it also
doesn't correctly handle the FP8 block-size case; change the initial assignment
to use the correct attribute name exclude_quantization (not
exclude_quant_config) and build that dict with consistent keys
("kv_cache_quant_algo", "activation_scheme", "group_size"); then update the FP8
branch so that when hf_ignored_quantization_config["quant_method"] == "fp8" and
weight_block_size is provided you set quant_algo = QuantAlgo.FP8_BLOCK_SCALES,
assert the block_size equals (128,128) and set group_size = block_size[0],
otherwise set quant_algo = QuantAlgo.FP8 for the non-block case; keep the
w4a8_awq branch setting QuantAlgo.W4A8_AWQ and group_size from weight_group_size
as before, and raise NotImplementedError for unsupported configs.
if quant_config.exclude_modules:
if hf_quant_config.get("ignored_modules"):
quant_config.exclude_modules += hf_quant_config.get("ignored_modules")
else:
quant_config.exclude_modules = hf_quant_config.get("ignored_modules")
# set exclude_quant_config
🛠️ Refactor suggestion
Accept both 'ignored_modules' and 'ignored_layers'.
Align with the other loader to avoid missing excludes.
- if quant_config.exclude_modules:
- if hf_quant_config.get("ignored_modules"):
- quant_config.exclude_modules += hf_quant_config.get("ignored_modules")
- else:
- quant_config.exclude_modules = hf_quant_config.get("ignored_modules")
+ ignored = hf_quant_config.get("ignored_modules") or hf_quant_config.get("ignored_layers")
+ if ignored:
+ if quant_config.exclude_modules:
+ quant_config.exclude_modules += ignored
+ else:
+ quant_config.exclude_modules = ignored
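A compact sketch of the merge this suggestion describes, assuming only that quant_config exposes an exclude_modules list; the helper name is hypothetical:

```python
def merge_ignored_modules(quant_config, hf_quant_config: dict) -> None:
    # Accept either key so neither config format loses its excludes.
    ignored = (hf_quant_config.get("ignored_modules")
               or hf_quant_config.get("ignored_layers"))
    if not ignored:
        return
    if quant_config.exclude_modules:
        # Extend instead of overwriting excludes that were already set.
        quant_config.exclude_modules += list(ignored)
    else:
        quant_config.exclude_modules = list(ignored)
```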
🤖 Prompt for AI Agents
In tensorrt_llm/llmapi/llm_utils.py around lines 461 to 466, update the logic
that reads hf_quant_config ignored entries so it accepts both "ignored_modules"
and "ignored_layers" (treating them equivalently), and correctly accumulates
them into quant_config.exclude_modules: read both keys (preferring one if
needed), coerce the result to a list if None, and then either extend the
existing quant_config.exclude_modules or assign a new list (avoiding None
concatenation). Ensure you merge lists rather than overwrite unexpectedly so the
behavior matches the other loader.
hf_ignored_quantization_config = hf_quant_config.get("ignored_quantization_config")
if hf_ignored_quantization_config:
quant_config.exclude_quant_config = {
"quant_algo": QuantAlgo(
hf_ignored_quantization_config.get("quant_method").upper()
) if hf_ignored_quantization_config.get("quant_method") else None,
"kv_cache_quant_algo": QuantAlgo(
hf_ignored_quantization_config.get("kv_cache_quant_method").upper()
) if hf_ignored_quantization_config.get("kv_cache_quant_method") else None,
"activation_scheme": ActivationScheme(
hf_ignored_quantization_config.get("activation_scheme").upper()
) if hf_ignored_quantization_config.get("activation_scheme") else None,
}
Fix exclude overrides: wrong attribute, add FP8 block handling, and group_size.
Use exclude_quantization (not exclude_quant_config) and mirror FP8_BLOCK_SCALES logic.
- hf_ignored_quantization_config = hf_quant_config.get("ignored_quantization_config")
- if hf_ignored_quantization_config:
- quant_config.exclude_quant_config = {
- "quant_algo": QuantAlgo(
- hf_ignored_quantization_config.get("quant_method").upper()
- ) if hf_ignored_quantization_config.get("quant_method") else None,
- "kv_cache_quant_algo": QuantAlgo(
- hf_ignored_quantization_config.get("kv_cache_quant_method").upper()
- ) if hf_ignored_quantization_config.get("kv_cache_quant_method") else None,
- "activation_scheme": ActivationScheme(
- hf_ignored_quantization_config.get("activation_scheme").upper()
- ) if hf_ignored_quantization_config.get("activation_scheme") else None,
- }
+ ignored_q = hf_quant_config.get("ignored_quantization_config")
+ if ignored_q:
+ quant_config.exclude_quantization = {
+ "kv_cache_quant_algo": QuantAlgo(ignored_q["kv_cache_quant_method"].upper())
+ if ignored_q.get("kv_cache_quant_method") else None,
+ "activation_scheme": ActivationScheme(ignored_q["activation_scheme"].upper())
+ if ignored_q.get("activation_scheme") else None,
+ "group_size": 128,
+ }
+ if ignored_q.get("quant_method") == "fp8" and ignored_q.get("weight_block_size"):
+ block_size = ignored_q["weight_block_size"]
+ assert tuple(block_size) == (128, 128), "FP8_BLOCK_SCALES only supports block_size=(128,128)"
+ quant_config.exclude_quantization["quant_algo"] = QuantAlgo.FP8_BLOCK_SCALES
+ quant_config.exclude_quantization["group_size"] = block_size[0]
+ elif ignored_q.get("quant_method") == "fp8":
+ quant_config.exclude_quantization["quant_algo"] = QuantAlgo.FP8
+ elif ignored_q.get("quant_method") == "w4a8_awq":
+ quant_config.exclude_quantization["quant_algo"] = QuantAlgo.W4A8_AWQ
+ quant_config.exclude_quantization["group_size"] = ignored_q.get("weight_group_size", 128)
+ else:
+ raise NotImplementedError(f"Unsupported quantization_config.ignored_quantization_config: {ignored_q}.")
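For concreteness, a hypothetical ignored_quantization_config block and the exclude_quantization it should map to under the suggested branching. The field values are illustrative, not taken from a real checkpoint, and the enum member names in the comments (e.g. ActivationScheme.STATIC) are assumptions about the enums this PR adds.

```python
# Hypothetical input, shaped like hf_quant_config["ignored_quantization_config"]:
ignored_q = {
    "quant_method": "w4a8_awq",
    "kv_cache_quant_method": "fp8",
    "activation_scheme": "static",
    "weight_group_size": 128,
}

# Expected exclude_quantization under the suggested logic:
expected = {
    "quant_algo": "W4A8_AWQ",        # QuantAlgo.W4A8_AWQ
    "kv_cache_quant_algo": "FP8",    # QuantAlgo.FP8
    "activation_scheme": "STATIC",   # assumed ActivationScheme.STATIC
    "group_size": 128,               # from weight_group_size (defaults to 128)
}
```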
🤖 Prompt for AI Agents
In tensorrt_llm/llmapi/llm_utils.py around lines 467 to 479, the code mistakenly
sets quant_config.exclude_quant_config instead of
quant_config.exclude_quantization and omits FP8 block handling and group_size;
change the assignment to quant_config.exclude_quantization, map the keys to the
correct QuantAlgo/ActivationScheme enums as before, and add handling for FP8
block scales by mirroring the existing FP8_BLOCK_SCALES logic (include
fp8_block_scales or equivalent field and group_size parsing from
hf_ignored_quantization_config) so excluded FP8 settings and group_size are
applied consistently with the other FP8 handling code.
1dbc226 to 62774f9 (Compare)
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help]
to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
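For example, an illustrative invocation that disables fail-fast and runs a single test stage (stage name taken from the option examples below) would be:
/bot run --disable-fail-fast --stage-list "A10-PyTorch-1"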
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.