
Conversation


@bppan commented Aug 13, 2025

Summary by CodeRabbit

  • New Features
    • Added support for Angelslim quantization configs when loading pretrained models.
    • Expanded HF quantization config support: w4a8_awq and fp8 methods, kv-cache quantization, activation schemes (STATIC/DYNAMIC), module exclusions, and per-module override configs.
    • Configurable weight name in MoE backends (supports qweight/weight paths) and enriched per-expert FP8/FP4 scale handling, plus SM100-specific resmoothing.
    • Added logging of detected quantization settings.
  • Bug Fixes
    • Corrected dequantization handling for certain DeepSeek V3 weights to respect exclusion settings.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
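
For example, a typical invocation that combines a couple of the flags above might look like this (the stage name is the illustrative one used in this help text):

    /bot run --disable-fail-fast --stage-list "A10-PyTorch-1"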

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

@bppan requested review from a team as code owners on August 13, 2025 08:18

coderabbitai bot commented Aug 13, 2025

📝 Walkthrough


Adds Angelslim quant config loader and wires it into pretrained config flow; extends HF quant config parsing (new quant methods, kv_cache and activation fields, ignored_quantization handling); adds ActivationScheme and exclude_quantization to QuantConfig; adjusts exclude-module overrides and DeepseekV3 dequant gating; overhauls fused MoE weight-loading to accept weight_name and compute per-expert FP8/FP4 scales with SM100 resmoothing.

Changes

  • Quant schema & enums (tensorrt_llm/models/modeling_utils.py, tensorrt_llm/quantization/mode.py):
    Added ActivationScheme enum and extended QuantConfig with activation_scheme and exclude_quantization fields (typing/docs).
  • HF / Angelslim quant config loading (tensorrt_llm/_torch/model_config.py, tensorrt_llm/llmapi/llm_utils.py):
    Added load_angelslim_quant_config static loader; from_pretrained now dispatches to angelslim_hf_quant_config.json when present; HF loader extended to support quant_method "w4a8_awq"/"fp8", kv_cache_quant_method, activation_scheme, merging ignored_modules→exclude_modules, ignored_quantization_config→exclude_quant_config, logging, and NotImplemented guards for unsupported methods.
  • Exclude-module quant overrides (tensorrt_llm/_torch/models/modeling_utils.py):
    apply_quant_config_exclude_modules now builds per-excluded-module QuantConfig using fields from exclude_quantization (quant_algo, activation_scheme, group_size) while preserving kv_cache_quant_algo precedence.
  • DeepseekV3 kv_b_proj gating (tensorrt_llm/_torch/models/modeling_deepseekv3.py):
    Tightened condition for dequantizing kv_b_proj: requires module in exclude_modules AND exclude_quant_config is None.
  • Fused MoE weight loading & FP8/FP4 scales (tensorrt_llm/_torch/modules/fused_moe/quantization.py):
    Introduced weight_name parameter across loaders (defaults preserved); load expert weights by expert.w{1,3,2}.{weight_name}; reworked FP8 QDQ scale/alpha computation to per-expert statistics, interleaving/normalization, and per-SM resmoothing (SM100) of weight_scale_inv; updated backend overrides to accept/forward weight_name.

Sequence Diagram(s)

sequenceDiagram
  participant Caller
  participant ModelConfig
  participant FS as Filesystem
  participant Loader

  Caller->>ModelConfig: from_pretrained(model_dir)
  ModelConfig->>FS: stat angelslim_hf_quant_config.json
  alt angelslim exists
    ModelConfig->>Loader: load_angelslim_quant_config(file)
    Loader-->>ModelConfig: quant_config, layer_quant_config=None
  else
    ModelConfig->>FS: stat hf_quant_config.json
    ModelConfig->>Loader: load_hf_quant_config(file)
    Loader-->>ModelConfig: quant_config (includes activation_scheme/exclude_quant_config)
  end
  ModelConfig-->>Caller: model config with QuantConfig
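
A minimal sketch of the dispatch illustrated above, based on the hunk quoted later in this review (the helper name and the loader signatures are assumptions, not the exact implementation):

from pathlib import Path

def pick_quant_config(loader, model_dir: Path, moe_backend: str):
    # Hypothetical helper: prefer the Angelslim config when present,
    # otherwise fall back to the standard hf_quant_config.json.
    if (quant_config_file := model_dir / 'angelslim_hf_quant_config.json').exists():
        return loader.load_angelslim_quant_config(quant_config_file, model_dir, moe_backend)
    if (quant_config_file := model_dir / 'hf_quant_config.json').exists():
        return loader.load_hf_quant_config(quant_config_file, model_dir, moe_backend)
    return None, None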
sequenceDiagram
  participant Caller
  participant FusedMoE
  participant Weights as WeightsDict
  participant Backend

  Caller->>FusedMoE: load_weights(module, weights, mode, weight_name)
  FusedMoE->>Weights: read expert.w1.{weight_name}, expert.w3.{weight_name}, expert.w2.{weight_name}
  alt FP8 QDQ path
    FusedMoE->>FusedMoE: compute per-expert w1/w3/w2 scales & alphas
    FusedMoE->>FusedMoE: normalize & interleave scales into target tensors
    opt SM100 resmoothing
      FusedMoE->>FusedMoE: resmooth weight_scale_inv → weight keys
    end
  end
  FusedMoE->>Backend: set tensors/scales
  FusedMoE-->>Caller: done

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Suggested reviewers

  • achartier
  • hlu1
  • Tracin
  • Superjomn
  • litaotju
  • yuxianq

Pre-merge checks (3 warnings)

❌ Failed checks (3 warnings)
  • Title Check (⚠️ Warning): The current title “Support W4A8 method of AngleSlim tool” only highlights one quant method and misspells “Angelslim,” while the PR introduces extensive Angelslim HF quant config support, a new loader, FP8 mappings, HF loader enhancements, and broader quantization changes across multiple modules. It is overly narrow and contains a typo, so it does not accurately reflect the main scope of the changes. Resolution: Please update the title to follow the repository template (e.g., “[TRTLLM-1234][feat] Add Angelslim HF quant config loader and extend quantization support”) so that it accurately and concisely summarizes the primary change and includes a valid ticket reference or “[None]” and type tag.
  • Description Check (⚠️ Warning): The PR description is entirely placeholder text and template comments without any actual summary, description of the implementation, or test coverage details. It fails to explain what was changed, why the changes were made, or which tests verify the new logic. Resolution: Please fill out the PR template by adding the @coderabbitai summary or a manual summary, a clear “## Description” section describing the issue and your solution, and a “## Test Coverage” section listing relevant tests or test plans to ensure adequate coverage of the new code paths.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 22.22%, which is insufficient; the required threshold is 80.00%. Resolution: You can run @coderabbitai generate docstrings to improve docstring coverage.




@coderabbitai bot left a comment


Actionable comments posted: 5

🔭 Outside diff range comments (2)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)

1361-1368: Use fully-qualified module name for exclusion check; current code can yield false negatives

is_module_excluded_from_quantization expects the full module path for pattern matching. Passing only names[-1] ("kv_b_proj") can miss exclusions like "model.layers.12.self_attn.kv_b_proj". Also, your new gating to skip dequant when exclude_quant_config is provided makes sense; keep it, but fix the name passed.

Apply this diff to pass the fully-qualified module name:

-                    dequant_kv_b_proj = self.model_config.quant_config.is_module_excluded_from_quantization(
-                        names[-1]) and self.model_config.quant_config.exclude_quant_config is None
+                    dequant_kv_b_proj = (
+                        self.model_config.quant_config.is_module_excluded_from_quantization(name)
+                        and self.model_config.quant_config.exclude_quant_config is None
+                    )
tensorrt_llm/_torch/models/modeling_utils.py (1)

476-488: Preserve base QuantConfig when overriding for excluded modules; add None-safety

The new per-excluded override only sets quant_algo and activation_scheme and loses other QuantConfig fields (e.g., group_size, clamp_val, has_zero_point). Build the override from the existing QuantConfig to preserve defaults and runtime semantics. Also, guard against a None quant_config.

Apply this diff:

-        quant_algo = None
-        activation_scheme = None
-        exclude_quant_config = quant_config.exclude_quant_config
-        if exclude_quant_config:
-            quant_algo = exclude_quant_config.get("quant_algo", None)
-            activation_scheme = exclude_quant_config.get("activation_scheme", None)
-        new_config = QuantConfig(
-            quant_algo=quant_algo, kv_cache_quant_algo=kv_cache_quant_algo, activation_scheme=activation_scheme)
+        quant_algo = None
+        activation_scheme = None
+        exclude_quant_cfg = getattr(quant_config, "exclude_quant_config", None)
+        if exclude_quant_cfg:
+            quant_algo = exclude_quant_cfg.get("quant_algo")
+            activation_scheme = exclude_quant_cfg.get("activation_scheme")
+        # Preserve all other QuantConfig fields while overriding specific attributes
+        base_config = quant_config or QuantConfig()
+        new_config = dataclass_replace(
+            base_config,
+            quant_algo=quant_algo,
+            activation_scheme=activation_scheme,
+            kv_cache_quant_algo=kv_cache_quant_algo,
+        )

And add this import at the top of the file (outside this hunk):

from dataclasses import replace as dataclass_replace
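
For reference, dataclasses.replace copies every field that is not explicitly overridden, which is why the other QuantConfig fields are preserved; a tiny standalone illustration with a toy dataclass (not the real QuantConfig):

from dataclasses import dataclass, replace as dataclass_replace

@dataclass
class Cfg:  # toy stand-in for QuantConfig, for illustration only
    quant_algo: str = None
    group_size: int = 128
    activation_scheme: str = None

base = Cfg(quant_algo="FP8", group_size=64)
override = dataclass_replace(base, quant_algo=None, activation_scheme="STATIC")
assert override.group_size == 64  # fields not named in replace() carry over from base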
🧹 Nitpick comments (3)
tensorrt_llm/quantization/mode.py (1)

463-466: Enum addition looks good; consider adding a brief docstring for clarity

ActivationScheme is correctly defined as a StrEnum with BaseEnumMeta and is ready for serialization and validation. A short docstring would improve readability.

Apply this diff to add a concise docstring:

 class ActivationScheme(StrEnum, metaclass=BaseEnumMeta):
-    STATIC = auto()
-    DYNAMIC = auto()
+    """Activation quantization scheme."""
+    STATIC = auto()
+    DYNAMIC = auto()
tensorrt_llm/models/modeling_utils.py (1)

143-145: Tighten docstring wording and wrap long lines (fixes E501 and improves clarity)

The current docstrings have grammar issues and exceed 120 chars. Reword and wrap to meet style and static-analysis guidance.

Apply this diff:

-        exclude_quant_config  (Dict, optional): The model of exclude_modules will use exclude_quant_config.
-        activation_scheme (tensorrt_llm.quantization.mode.ActivationScheme, optional): The input of activation quantize scheme.
+        exclude_quant_config (Dict, optional): Per‑module quantization overrides applied to modules
+            matched by exclude_modules. Only the provided fields are overridden (e.g., quant_algo,
+            kv_cache_quant_algo, activation_scheme).
+        activation_scheme (tensorrt_llm.quantization.mode.ActivationScheme, optional): Activation
+            quantization scheme (e.g., STATIC or DYNAMIC).
tensorrt_llm/_torch/model_config.py (1)

267-279: Simplify nested dictionary initialization for exclude_quantization.

The nested ternary operations make the code hard to read and maintain. Consider extracting this into a helper function for better clarity.

Extract the logic into a helper function:

+def _parse_exclude_quantization(json_exclude_quant_configs):
+    if not json_exclude_quant_configs:
+        return None
+    
+    result = {}
+    if json_exclude_quant_configs.get('quant_algo'):
+        result['quant_algo'] = QuantAlgo(json_exclude_quant_configs['quant_algo'].upper())
+    else:
+        result['quant_algo'] = None
+    
+    if json_exclude_quant_configs.get('kv_cache_quant_algo'):
+        result['kv_cache_quant_algo'] = QuantAlgo(json_exclude_quant_configs['kv_cache_quant_algo'].upper())
+    else:
+        result['kv_cache_quant_algo'] = None
+    
+    if json_exclude_quant_configs.get('activation_scheme'):
+        result['activation_scheme'] = ActivationScheme(json_exclude_quant_configs['activation_scheme'].upper())
+    else:
+        result['activation_scheme'] = None
+    
+    return result

 json_exclude_quant_configs = json_quant_configs.get('exclude_quantization', None)
-if json_exclude_quant_configs:
-    quant_config.exclude_quant_config = {
-        "quant_algo": QuantAlgo(
-            json_exclude_quant_configs.get('quant_algo', None).upper()
-        ) if json_exclude_quant_configs.get("quant_algo") else None,
-        "kv_cache_quant_algo": QuantAlgo(
-            json_exclude_quant_configs.get("kv_cache_quant_algo").upper()
-        ) if json_exclude_quant_configs.get("kv_cache_quant_algo") else None,
-        "activation_scheme": ActivationScheme(
-            json_exclude_quant_configs.get('activation_scheme', None).upper()
-        ) if json_exclude_quant_configs.get("activation_scheme") else None,
-    }
+quant_config.exclude_quant_config = _parse_exclude_quantization(json_exclude_quant_configs)
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 50e5e72 and 15fbe20.

📒 Files selected for processing (7)
  • tensorrt_llm/_torch/model_config.py (4 hunks)
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_utils.py (1 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/quantization.py (6 hunks)
  • tensorrt_llm/llmapi/llm_utils.py (2 hunks)
  • tensorrt_llm/models/modeling_utils.py (3 hunks)
  • tensorrt_llm/quantization/mode.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/quantization/mode.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • tensorrt_llm/_torch/models/modeling_utils.py
  • tensorrt_llm/models/modeling_utils.py
  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/llmapi/llm_utils.py
  • tensorrt_llm/_torch/modules/fused_moe/quantization.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/quantization/mode.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • tensorrt_llm/_torch/models/modeling_utils.py
  • tensorrt_llm/models/modeling_utils.py
  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/llmapi/llm_utils.py
  • tensorrt_llm/_torch/modules/fused_moe/quantization.py
🧬 Code Graph Analysis (7)
tensorrt_llm/quantization/mode.py (1)
tensorrt_llm/_utils.py (1)
  • BaseEnumMeta (773-780)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)
tensorrt_llm/llmapi/llm_args.py (2)
  • quant_config (2145-2148)
  • quant_config (2151-2152)
tensorrt_llm/_torch/models/modeling_utils.py (2)
tensorrt_llm/models/modeling_utils.py (2)
  • quant_algo (550-551)
  • QuantConfig (128-271)
tensorrt_llm/llmapi/llm_args.py (2)
  • quant_config (2145-2148)
  • quant_config (2151-2152)
tensorrt_llm/models/modeling_utils.py (1)
tensorrt_llm/quantization/mode.py (2)
  • QuantAlgo (23-46)
  • ActivationScheme (463-465)
tensorrt_llm/_torch/model_config.py (2)
tensorrt_llm/quantization/mode.py (2)
  • QuantAlgo (23-46)
  • ActivationScheme (463-465)
tensorrt_llm/models/modeling_utils.py (2)
  • QuantConfig (128-271)
  • quant_algo (550-551)
tensorrt_llm/llmapi/llm_utils.py (3)
tensorrt_llm/models/modeling_utils.py (3)
  • PretrainedConfig (369-570)
  • QuantConfig (128-271)
  • quant_algo (550-551)
tensorrt_llm/quantization/mode.py (2)
  • QuantAlgo (23-46)
  • ActivationScheme (463-465)
tensorrt_llm/llmapi/llm_args.py (2)
  • quant_config (2145-2148)
  • quant_config (2151-2152)
tensorrt_llm/_torch/modules/fused_moe/quantization.py (3)
tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
  • MoEWeightLoadingMode (13-15)
tests/unittest/_torch/modules/test_fused_moe.py (1)
  • load_weights (1660-1725)
tensorrt_llm/_torch/utils.py (1)
  • shape (103-104)
🪛 Ruff (0.12.2)
tensorrt_llm/models/modeling_utils.py

143-143: Line too long (127 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (7)
tensorrt_llm/llmapi/llm_utils.py (1)

438-474: Manual verification needed for kv_cache_quant_method & activation_scheme enum values

It looks like no quantization_config.kv_cache_quant_method or .activation_scheme entries were found in the repo’s JSON files—this may simply mean that those fields live in user‐supplied Hugging Face configs or are generated at runtime. Before merging the enum-validation refactor, please:

  • Confirm which enum members actually appear in your external HF configs or examples (e.g. via your model hubs or deployment manifests).
  • Ensure the hardcoded “Expected one of” lists (INT8, FP8, NVFP4 and STATIC, DYNAMIC) cover every real-world value your users might supply.
  • Update the error messages or docstrings accordingly if there are additional valid enum values.
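
For orientation, the fields under discussion would sit in an HF-style quantization_config roughly like the following (field names are taken from this review; the values and surrounding layout are illustrative assumptions):

quantization_config = {
    "quant_method": "fp8",            # or "w4a8_awq", per the HF loader changes
    "kv_cache_quant_method": "fp8",   # uppercased before enum conversion; expected INT8/FP8/NVFP4
    "activation_scheme": "static",    # uppercased before enum conversion; expected STATIC/DYNAMIC
    "ignored_modules": ["lm_head"],   # merged into QuantConfig.exclude_modules (module name illustrative)
}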
tensorrt_llm/_torch/model_config.py (2)

18-18: LGTM! Import addition for ActivationScheme is appropriate.

The addition of ActivationScheme to the import statement is necessary for the new functionality.


326-330: Good addition of W4A8_AWQ support with proper error handling.

The new case for "w4a8_awq" correctly maps to QuantAlgo.W4A8_AWQ, and the addition of NotImplementedError for unsupported quant methods improves error handling.

tensorrt_llm/_torch/modules/fused_moe/quantization.py (4)

209-210: Good addition of weight_name parameter for flexible weight loading.

The addition of the weight_name parameter with a default value of "weight" provides backward compatibility while enabling support for different weight types.


218-220: LGTM! Proper use of weight_name in expert weight keys.

The implementation correctly uses the weight_name parameter to construct the expert weight keys, allowing for flexible weight type selection.


959-963: Good override maintaining backward compatibility.

The WInt4AFP8FusedMoEMethod.load_weights override with weight_name: str = "qweight" properly handles quantized weights while maintaining the interface contract.


985-991: Per-expert FP8 scale handling is correct
All of the expected "{expert_id}.w1.weight_scale" and "{expert_id}.w3.weight_scale" keys are present in the tests, and the code’s element-wise max → stack → reshape logic aligns with the shape of module.fc31_alpha. No missing keys or shape mismatches were found.
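
A rough, self-contained sketch of the element-wise max and stack pattern described above; the dictionary layout, tensor shapes, and names are assumptions for illustration only:

import torch

num_experts, dim = 4, 8
# Assumed layout: one weight_scale tensor per expert for w1 and w3.
weights = {f"{eid}.w{k}.weight_scale": torch.rand(dim)
           for eid in range(num_experts) for k in (1, 3)}

per_expert_max = torch.stack([
    torch.maximum(weights[f"{eid}.w1.weight_scale"],
                  weights[f"{eid}.w3.weight_scale"])
    for eid in range(num_experts)
])  # shape: (num_experts, dim); reshaped afterwards to match the target tensor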

Comment on lines 241 to 338
@staticmethod
def load_angelslim_quant_config(quant_config_file, model_dir, moe_backend):
    quant_config = QuantConfig()
    layer_quant_config = None

    with open(quant_config_file) as f:
        quant_config_dict = json.load(f)

    json_quant_configs = quant_config_dict['quantization']

    quant_config.quant_algo = QuantAlgo(
        json_quant_configs.get('quant_algo', None).upper()) if json_quant_configs.get("quant_algo") else None
    # fp8_pb_wo from modelopt is the same as FP8_BLOCK_SCALES
    if quant_config.quant_algo == "fp8_pb_wo":
        quant_config.quant_algo = QuantAlgo('FP8_BLOCK_SCALES')

    quant_config.kv_cache_quant_algo = QuantAlgo(
        json_quant_configs.get("kv_cache_quant_algo").upper()
    ) if json_quant_configs.get("kv_cache_quant_algo") else None
    quant_config.group_size = json_quant_configs.get('group_size', None)
    quant_config.exclude_modules = json_quant_configs.get(
        'exclude_modules', None)
    quant_config.activation_scheme = ActivationScheme(
        json_quant_configs.get('activation_scheme', None).upper()
    ) if json_quant_configs.get("activation_scheme") else None

    json_exclude_quant_configs = json_quant_configs.get('exclude_quantization', None)
    if json_exclude_quant_configs:
        quant_config.exclude_quant_config = {
            "quant_algo": QuantAlgo(
                json_exclude_quant_configs.get('quant_algo', None).upper()
            ) if json_exclude_quant_configs.get("quant_algo") else None,
            "kv_cache_quant_algo": QuantAlgo(
                json_exclude_quant_configs.get("kv_cache_quant_algo").upper()
            ) if json_exclude_quant_configs.get("kv_cache_quant_algo") else None,
            "activation_scheme": ActivationScheme(
                json_exclude_quant_configs.get('activation_scheme', None).upper()
            ) if json_exclude_quant_configs.get("activation_scheme") else None,
        }
    return quant_config, layer_quant_config



⚠️ Potential issue

Add error handling for malformed JSON config files.

The new load_angelslim_quant_config method assumes the JSON file is well-formed and contains the expected structure. Add error handling for cases where required keys might be missing or the JSON is malformed.

Consider adding validation like this:

 def load_angelslim_quant_config(quant_config_file, model_dir, moe_backend):
     quant_config = QuantConfig()
     layer_quant_config = None
 
-    with open(quant_config_file) as f:
-        quant_config_dict = json.load(f)
+    try:
+        with open(quant_config_file) as f:
+            quant_config_dict = json.load(f)
+    except (json.JSONDecodeError, IOError) as e:
+        raise ValueError(f"Failed to load angelslim config from {quant_config_file}: {e}")
 
-    json_quant_configs = quant_config_dict['quantization']
+    json_quant_configs = quant_config_dict.get('quantization', {})
+    if not json_quant_configs:
+        raise ValueError(f"Missing 'quantization' section in {quant_config_file}")
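
Separately, for orientation: judging from the keys this loader reads, an angelslim_hf_quant_config.json of roughly the following shape would be accepted (shown here as a Python dict; the keys come from the quoted code, the values are illustrative):

angelslim_hf_quant_config = {
    "quantization": {
        "quant_algo": "w4a8_awq",
        "kv_cache_quant_algo": "fp8",
        "group_size": 128,
        "exclude_modules": ["lm_head"],
        "activation_scheme": "static",
        "exclude_quantization": {
            "quant_algo": "fp8",
            "kv_cache_quant_algo": None,
            "activation_scheme": "dynamic",
        },
    }
}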

Comment on lines +251 to +303
    quant_config.quant_algo = QuantAlgo(
        json_quant_configs.get('quant_algo', None).upper()) if json_quant_configs.get("quant_algo") else None
    # fp8_pb_wo from modelopt is the same as FP8_BLOCK_SCALES
    if quant_config.quant_algo == "fp8_pb_wo":
        quant_config.quant_algo = QuantAlgo('FP8_BLOCK_SCALES')


⚠️ Potential issue

Logical issue: string comparison against enum value.

Line 254 compares quant_config.quant_algo (which is now a QuantAlgo enum) against the string "fp8_pb_wo". This will always fail because you're comparing an enum to a string.

Fix the comparison:

-    quant_config.quant_algo = QuantAlgo(
-        json_quant_configs.get('quant_algo', None).upper()) if json_quant_configs.get("quant_algo") else None
-    # fp8_pb_wo from modelopt is the same as FP8_BLOCK_SCALES
-    if quant_config.quant_algo == "fp8_pb_wo":
-        quant_config.quant_algo = QuantAlgo('FP8_BLOCK_SCALES')
+    algo_str = json_quant_configs.get('quant_algo')
+    if algo_str:
+        algo_str = algo_str.upper()
+        # fp8_pb_wo from modelopt is the same as FP8_BLOCK_SCALES
+        if algo_str == "FP8_PB_WO":
+            quant_config.quant_algo = QuantAlgo('FP8_BLOCK_SCALES')
+        else:
+            quant_config.quant_algo = QuantAlgo(algo_str)
+    else:
+        quant_config.quant_algo = None
🤖 Prompt for AI Agents
tensorrt_llm/_torch/model_config.py around lines 251 to 255: the code currently
compares the QuantAlgo enum instance to the string "fp8_pb_wo", which always
fails; change the check to compare the enum's name or value to that string (for
example: if quant_config.quant_algo.name.lower() == "fp8_pb_wo") and then set
quant_config.quant_algo = QuantAlgo("FP8_BLOCK_SCALES") as before.

Comment on lines +438 to +519
        elif (quant_config_file := model_dir / 'angelslim_hf_quant_config.json').exists():
            quant_config, layer_quant_config = cls.load_angelslim_quant_config(
                quant_config_file, model_dir, moe_backend)


💡 Verification agent

🧩 Analysis chain

Verify the consistency of the naming convention for config files.

The code now supports both angelslim_hf_quant_config.json and hf_quant_config.json. Ensure this naming is consistent with the tool's documentation and expected usage.


🏁 Script executed:

#!/bin/bash
# Check for any existing references to these config file names in the codebase
echo "Searching for references to angelslim_hf_quant_config.json:"
rg "angelslim_hf_quant_config" --type py -A 2 -B 2

echo -e "\nSearching for references to hf_quant_config.json:"
rg "hf_quant_config\.json" --type py -A 2 -B 2

echo -e "\nChecking for any documentation about these config files:"
fd -e md -e rst | xargs rg -l "angelslim_hf_quant_config|hf_quant_config"

Length of output: 8264


Update documentation to include the new Angelslim config filename

The code now recognizes both hf_quant_config.json and angelslim_hf_quant_config.json, but our docs and examples only mention the former. Please update all references so users know both are supported:

• docs/source/performance/perf-benchmarking.md
• examples/models/core/deepseek_v3/README.md

– Add a note explaining that if an angelslim_hf_quant_config.json is present in the model directory, it will be loaded via load_angelslim_quant_config()
– Ensure any code snippets or CLI examples show both filenames where applicable

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/model_config.py around lines 438 to 440, the code now
accepts an additional Angelslim config filename
(angelslim_hf_quant_config.json); update the documentation and example README to
mention both hf_quant_config.json and angelslim_hf_quant_config.json. Edit
docs/source/performance/perf-benchmarking.md and
examples/models/core/deepseek_v3/README.md to add a short note that if
angelslim_hf_quant_config.json exists in the model directory it will be loaded
via load_angelslim_quant_config(), and update any code snippets or CLI examples
to show both filenames where applicable (e.g., list both filenames in examples
and usage text).

Comment on lines 1002 to 1014
 all_w3_scales = torch.stack(all_w3_scales) / all_w3_w1_scales_fp8_max.unsqueeze(2)
 all_w1_scales = [
-    load_weight_shard(weights[f"{expert_id}.w1.weight_scale_inv"],
+    load_weight_shard(weights[f"{expert_id}.w1.weight_scale.int4"],
         module.tp_size,
         module.tp_rank,
         TensorParallelMode.COLUMN,
         device=self.device)
     for expert_id in module.initial_local_expert_ids
 ]
 all_w1_scales = torch.stack(all_w1_scales) / all_w3_w1_scales_fp8_max.unsqueeze(2)
 all_w3_w1_scales = torch.cat(
-    [torch.stack(all_w3_scales),
-     torch.stack(all_w1_scales)], dim=-2)
+    [all_w3_scales,
+     all_w1_scales], dim=-2)


⚠️ Potential issue

Potential division by zero when scales are zero.

The code divides by all_w3_w1_scales_fp8_max and all_w2_scales_fp8 without checking if these values are zero. This could lead to runtime errors or inf/nan values.

Add validation for zero scales:

 all_w3_scales = torch.stack(all_w3_scales) / all_w3_w1_scales_fp8_max.unsqueeze(2)
+if torch.any(all_w3_w1_scales_fp8_max == 0):
+    raise ValueError("Found zero FP8 weight scales, which would cause division by zero")
 
 all_w1_scales = torch.stack(all_w1_scales) / all_w3_w1_scales_fp8_max.unsqueeze(2)
 
 all_w2_scales = torch.stack(all_w2_scales) / all_w2_scales_fp8.unsqueeze(2)
+if torch.any(all_w2_scales_fp8 == 0):
+    raise ValueError("Found zero FP8 weight scales for w2, which would cause division by zero")

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/fused_moe/quantization.py around lines 1002 to
1014, the code divides tensors by all_w3_w1_scales_fp8_max and all_w2_scales_fp8
without guarding against zeros; add a defensive check to avoid divide-by-zero by
replacing zero (or near-zero) elements with a small safe epsilon (e.g., 1e-6) or
use torch.clamp_min / torch.where to ensure the denominators are >= epsilon
before performing the divisions, and optionally log or assert if any
replacements occurred to aid debugging.
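
A minimal sketch of the epsilon-clamping alternative mentioned in the prompt above (the helper name and epsilon value are illustrative, not part of the codebase):

import torch

def safe_div(numer: torch.Tensor, denom: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Clamp near-zero denominators to eps so the division cannot produce inf/nan.
    return numer / torch.clamp_min(denom, eps)

# e.g. all_w3_scales = safe_div(torch.stack(all_w3_scales),
#                               all_w3_w1_scales_fp8_max.unsqueeze(2))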

 all_w1_scales = [
-    load_weight_shard(weights[f"{expert_id}.w1.weight_scale_inv"],
+    load_weight_shard(weights[f"{expert_id}.w1.weight_scale.int4"],


⚠️ Potential issue

Inconsistent weight scale key naming: weight_scale.int4 vs weight_scale_inv.

Line 1004 uses f"{expert_id}.w1.weight_scale.int4" while line 995 uses f"{expert_id}.w3.weight_scale_inv". This inconsistency could indicate different weight formats or be a bug.

Please verify if this naming difference is intentional:

#!/bin/bash
# Check for both naming patterns in the codebase
echo "Searching for weight_scale.int4 pattern:"
rg "weight_scale\.int4" --type py -B 2 -A 2

echo -e "\nSearching for weight_scale_inv pattern:"
rg "weight_scale_inv" --type py -B 2 -A 2
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/fused_moe/quantization.py around line 1004 there
is an inconsistent key name: line 1004 uses f"{expert_id}.w1.weight_scale.int4"
while earlier (line ~995) uses f"{expert_id}.w3.weight_scale_inv"; confirm
whether the correct stored key is weight_scale_inv or weight_scale.int4 by
searching the repo for both patterns, then make the keys consistent (prefer
using the canonical weight_scale_inv if other shards use that naming), update
the load_weight_shard call to use the canonical key across all shards, and add a
brief inline comment explaining the chosen convention so future readers know
which format is expected.

@svc-trtllm-gh-bot added the "Community want to contribute" label (PRs initiated from Community) on Aug 13, 2025

@coderabbitai bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/models/modeling_utils.py (1)

468-507: Rename all remaining exclude_quant_config references to exclude_quantization.

  • tensorrt_llm/llmapi/llm_utils.py (around line 469): change quant_config.exclude_quant_config → quant_config.exclude_quantization.
  • tensorrt_llm/_torch/model_config.py (around lines 269 & 359): update both the assignment and subsequent checks of quant_config.exclude_quant_config → quant_config.exclude_quantization.
  • Re-run rg '\bexclude_quant_config\b' to confirm no leftovers.
♻️ Duplicate comments (2)
tensorrt_llm/_torch/model_config.py (2)

251-256: Fix enum/string mismatch for quant_algo ("fp8_pb_wo" mapping never triggers).

You construct an enum on Line 251, then compare it to a string on Line 254. This always fails. Normalize the source string first, then map to QuantAlgo.

-        quant_config.quant_algo = QuantAlgo(
-            json_quant_configs.get('quant_algo', None).upper()) if json_quant_configs.get("quant_algo") else None
-        # fp8_pb_wo from modelopt is the same as FP8_BLOCK_SCALES
-        if quant_config.quant_algo == "fp8_pb_wo":
-            quant_config.quant_algo = QuantAlgo('FP8_BLOCK_SCALES')
+        algo_str = json_quant_configs.get('quant_algo')
+        if algo_str:
+            algo_str = algo_str.upper()
+            # fp8_pb_wo from modelopt is the same as FP8_BLOCK_SCALES
+            if algo_str == "FP8_PB_WO":
+                quant_config.quant_algo = QuantAlgo.FP8_BLOCK_SCALES
+            else:
+                quant_config.quant_algo = QuantAlgo(algo_str)
+        else:
+            quant_config.quant_algo = None

246-249: Add robust error handling for malformed/absent quantization JSON.

Currently assumes file is valid JSON and contains 'quantization'. Add try/except and schema checks.

-        with open(quant_config_file) as f:
-            quant_config_dict = json.load(f)
+        try:
+            with open(quant_config_file) as f:
+                quant_config_dict = json.load(f)
+        except (json.JSONDecodeError, OSError) as e:
+            raise ValueError(f"Failed to load angelslim config from {quant_config_file}: {e}") from e
+        json_quant_configs = quant_config_dict.get('quantization')
+        if not isinstance(json_quant_configs, dict):
+            raise ValueError(f"Missing or invalid 'quantization' in {quant_config_file}")
-
-        json_quant_configs = quant_config_dict['quantization']
🧹 Nitpick comments (5)
tensorrt_llm/_torch/model_config.py (2)

242-242: Silence ARG004 or use the params.

model_dir and moe_backend were unused (now moe_backend is used above). If model_dir remains unused, prefix to _model_dir or document intent.

-    def load_angelslim_quant_config(quant_config_file, model_dir, moe_backend):
+    def load_angelslim_quant_config(quant_config_file, _model_dir, moe_backend):

1-1: Missing 2025 NVIDIA Apache-2.0 header.

Please prepend the standard header per repo guidelines.

tensorrt_llm/_torch/models/modeling_utils.py (1)

480-493: Good construction of per-exclusion QuantConfig.

  • Honors kv_cache precedence.
  • Allows per-excluded-module quant_algo/activation_scheme/group_size.

Minor: if exclude_quantization is None, consider using the base quant_config.group_size to keep consistency.

-        group_size = 128
+        group_size = quant_config.group_size or 128
tensorrt_llm/llmapi/llm_utils.py (2)

1-1: Missing 2025 NVIDIA Apache-2.0 header.

Please prepend the standard header per repo guidelines.


354-484: Unify configuration key for ignored modules
Replace ignored_layers in tensorrt_llm/_torch/model_config.py:354 with ignored_modules to match the usage in llmapi/llm_utils.py. No string-literal comparisons against QuantAlgo were found.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 15fbe20 and 1dbc226.

📒 Files selected for processing (5)
  • tensorrt_llm/_torch/model_config.py (4 hunks)
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_utils.py (1 hunks)
  • tensorrt_llm/llmapi/llm_utils.py (3 hunks)
  • tensorrt_llm/models/modeling_utils.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • tensorrt_llm/models/modeling_utils.py
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/llmapi/llm_utils.py
  • tensorrt_llm/_torch/models/modeling_utils.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/llmapi/llm_utils.py
  • tensorrt_llm/_torch/models/modeling_utils.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/llmapi/llm_utils.py
  • tensorrt_llm/_torch/models/modeling_utils.py
🧬 Code graph analysis (3)
tensorrt_llm/_torch/model_config.py (2)
tensorrt_llm/quantization/mode.py (2)
  • QuantAlgo (23-46)
  • ActivationScheme (463-465)
tensorrt_llm/models/modeling_utils.py (2)
  • QuantConfig (128-272)
  • quant_algo (551-552)
tensorrt_llm/llmapi/llm_utils.py (4)
tensorrt_llm/models/modeling_utils.py (3)
  • PretrainedConfig (370-571)
  • QuantConfig (128-272)
  • quant_algo (551-552)
tensorrt_llm/quantization/mode.py (2)
  • QuantAlgo (23-46)
  • ActivationScheme (463-465)
tensorrt_llm/llmapi/llm_args.py (2)
  • quant_config (2145-2148)
  • quant_config (2151-2152)
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h (1)
  • group_size (63-63)
tensorrt_llm/_torch/models/modeling_utils.py (2)
tensorrt_llm/models/modeling_utils.py (2)
  • quant_algo (551-552)
  • QuantConfig (128-272)
tensorrt_llm/llmapi/llm_args.py (2)
  • quant_config (2145-2148)
  • quant_config (2151-2152)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/model_config.py

242-242: Unused static method argument: model_dir

(ARG004)


242-242: Unused static method argument: moe_backend

(ARG004)

🔇 Additional comments (5)
tensorrt_llm/_torch/models/modeling_utils.py (1)

28-28: Import looks correct.

ActivationScheme import aligns with the new QuantConfig fields.

tensorrt_llm/llmapi/llm_utils.py (4)

429-433: FP8 block-scales parsing LGTM.

Asserts (128,128) and sets group_size accordingly.


444-447: Support for 'fp8' and 'w4a8_awq' quant_method LGTM.

Sets group_size for AWQ.


452-459: KV cache and activation scheme mapping LGTM.

Uppercasing to enum names is consistent.


480-482: Log line LGTM.

Message is informative and low-noise.

Comment on lines +267 to +335
        json_exclude_quantization= json_quant_configs.get('exclude_quantization', None)
        if json_exclude_quantization:
            quant_config.exclude_quant_config = {
                "quant_algo": QuantAlgo(
                    json_exclude_quantization.get('quant_algo', None).upper()
                ) if json_exclude_quantization.get("quant_algo") else None,
                "kv_cache_quant_algo": QuantAlgo(
                    json_exclude_quantization.get("kv_cache_quant_algo").upper()
                ) if json_exclude_quantization.get("kv_cache_quant_algo") else None,
                "activation_scheme": ActivationScheme(
                    json_exclude_quantization.get('activation_scheme', None).upper()
                ) if json_exclude_quantization.get("activation_scheme") else None,
                "group_size": json_exclude_quantization.get('group_size', None),
            }
            if quant_config.exclude_quantization["quant_algo"] in [QuantAlgo.FP8_BLOCK_SCALES, QuantAlgo.W4A8_AWQ]:
                if quant_config.exclude_quantization["group_size"] is None:
                    quant_config.exclude_quantization["group_size"] = 128

        if quant_config.quant_algo in [QuantAlgo.FP8_BLOCK_SCALES, QuantAlgo.W4A8_AWQ]:
            if quant_config.group_size is None:
                quant_config.group_size = 128


⚠️ Potential issue

Typo breaks exclude overrides: use QuantConfig.exclude_quantization consistently and initialize it.

Lines 269 and 356 use exclude_quant_config but QuantConfig defines exclude_quantization. Also, subsequent code (Line 281) expects exclude_quantization. This prevents per-module overrides from being applied and can raise exceptions.

-        json_exclude_quantization= json_quant_configs.get('exclude_quantization', None)
+        json_exclude_quantization = json_quant_configs.get('exclude_quantization', None)
         if json_exclude_quantization:
-            quant_config.exclude_quant_config = {
+            quant_config.exclude_quantization = {
                 "quant_algo": QuantAlgo(
                     json_exclude_quantization.get('quant_algo', None).upper()
                 ) if json_exclude_quantization.get("quant_algo") else None,
                 "kv_cache_quant_algo": QuantAlgo(
                     json_exclude_quantization.get("kv_cache_quant_algo").upper()
                 ) if json_exclude_quantization.get("kv_cache_quant_algo") else None,
                 "activation_scheme": ActivationScheme(
                     json_exclude_quantization.get('activation_scheme', None).upper()
                 ) if json_exclude_quantization.get("activation_scheme") else None,
                 "group_size": json_exclude_quantization.get('group_size', None),
             }
-            if quant_config.exclude_quantization["quant_algo"] in [QuantAlgo.FP8_BLOCK_SCALES, QuantAlgo.W4A8_AWQ]:
+            if quant_config.exclude_quantization["quant_algo"] in {QuantAlgo.FP8_BLOCK_SCALES, QuantAlgo.W4A8_AWQ}:
                 if quant_config.exclude_quantization["group_size"] is None:
                     quant_config.exclude_quantization["group_size"] = 128
-
-        if quant_config.quant_algo in [QuantAlgo.FP8_BLOCK_SCALES, QuantAlgo.W4A8_AWQ]:
+        if quant_config.quant_algo in {QuantAlgo.FP8_BLOCK_SCALES, QuantAlgo.W4A8_AWQ}:
             if quant_config.group_size is None:
                 quant_config.group_size = 128
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/model_config.py around lines 267-287, replace the
misspelled/incorrect property exclude_quant_config with the correct
QuantConfig.exclude_quantization and ensure quant_config.exclude_quantization is
initialized (e.g., to an empty dict or appropriate default) before assigning its
keys; update any other occurrences (e.g., around line 356 mentioned in the
comment) to use exclude_quantization so per-module overrides are applied
consistently and avoid exceptions.

Comment on lines +285 to +337
        if quant_config.quant_algo in [QuantAlgo.FP8_BLOCK_SCALES, QuantAlgo.W4A8_AWQ]:
            if quant_config.group_size is None:
                quant_config.group_size = 128

        return quant_config, layer_quant_config


🛠️ Refactor suggestion

Replicate TRTLLM default excludes for FP8_BLOCK_SCALES (parity with modelopt path).

For Angelslim FP8 block scales, set default exclude_modules when moe_backend == 'TRTLLM', as done in load_modelopt_quant_config.

-        return quant_config, layer_quant_config
+        if (moe_backend == 'TRTLLM'
+                and quant_config.quant_algo == QuantAlgo.FP8_BLOCK_SCALES
+                and quant_config.exclude_modules is None):
+            quant_config.exclude_modules = ["*kv_b_proj*", "*k_b_proj*", "*eh_proj"]
+        return quant_config, layer_quant_config
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/model_config.py around lines 285 to 289, when
quant_config.quant_algo == QuantAlgo.FP8_BLOCK_SCALES and moe_backend ==
'TRTLLM' you must set the default quant_config.exclude_modules to the same
TRTLLM default excludes used by load_modelopt_quant_config; update this branch
to assign quant_config.exclude_modules (if None) to the common
TRTLLM_DEFAULT_EXCLUDES (or the exact list used in load_modelopt_quant_config)
so FP8_BLOCK_SCALES follows the same exclude defaults as the modelopt path.
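
As a quick illustration of that fallback, a sketch assuming plain-string algo names and moe_backend passed in as a string; the glob list is the one proposed in the diff above, not an invented one.

TRTLLM_DEFAULT_EXCLUDES = ["*kv_b_proj*", "*k_b_proj*", "*eh_proj"]

def apply_trtllm_default_excludes(quant_algo, moe_backend, exclude_modules):
    # Only fill in the defaults when nothing was configured explicitly.
    if (moe_backend == "TRTLLM" and quant_algo == "FP8_BLOCK_SCALES"
            and exclude_modules is None):
        return list(TRTLLM_DEFAULT_EXCLUDES)
    return exclude_modules

print(apply_trtllm_default_excludes("FP8_BLOCK_SCALES", "TRTLLM", None))
# -> ['*kv_b_proj*', '*k_b_proj*', '*eh_proj']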

Comment on lines +350 to +403
        if quant_config.exclude_modules:
            if hf_quant_config.get("ignored_layers"):
                quant_config.exclude_modules += hf_quant_config.get("ignored_layers")
        else:
            quant_config.exclude_modules = hf_quant_config.get("ignored_layers")

🛠️ Refactor suggestion

Accept both 'ignored_modules' and 'ignored_layers' from HF config.

Two places in the codebase use different keys. Normalize here to avoid silently missing excludes.

-        if quant_config.exclude_modules:
-            if hf_quant_config.get("ignored_layers"):
-                quant_config.exclude_modules += hf_quant_config.get("ignored_layers")
-        else:
-            quant_config.exclude_modules = hf_quant_config.get("ignored_layers")
+        ignored = hf_quant_config.get("ignored_modules") or hf_quant_config.get("ignored_layers")
+        if ignored:
+            if quant_config.exclude_modules:
+                quant_config.exclude_modules += ignored
+            else:
+                quant_config.exclude_modules = ignored
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/model_config.py around lines 350 to 355, the code only
reads hf_quant_config["ignored_layers"] which misses configs using
"ignored_modules"; update the logic to accept and normalize both keys by
checking for "ignored_modules" first then "ignored_layers" (or merge both if
present), treat missing values as empty lists, ensure
quant_config.exclude_modules is a list before extending/appending, merge the HF
excludes into quant_config.exclude_modules (avoiding None) and optionally
deduplicate the final list.
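
A standalone sketch of that normalization (a hypothetical helper, not a function in the PR): it treats the two keys as equivalent, coerces missing values to empty lists, merges them into the existing excludes, and deduplicates while preserving order.

def merge_hf_ignored(exclude_modules, hf_quant_config):
    # Accept either HF key; missing values become empty lists.
    ignored = (hf_quant_config.get("ignored_modules") or []) + \
              (hf_quant_config.get("ignored_layers") or [])
    if not ignored:
        return exclude_modules
    merged = list(exclude_modules or []) + ignored
    return list(dict.fromkeys(merged))  # de-duplicate, keep first-occurrence order

print(merge_hf_ignored(["lm_head"], {"ignored_layers": ["lm_head", "model.embed_tokens"]}))
# -> ['lm_head', 'model.embed_tokens']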

Comment on lines +356 to +433
        # set exclude_quant_config
        hf_ignored_quantization_config = hf_quant_config.get("ignored_quantization_config")
        if hf_ignored_quantization_config:
            quant_config.exclude_quant_config = {
                "kv_cache_quant_algo": QuantAlgo(
                    hf_ignored_quantization_config.get("kv_cache_quant_method").upper()
                ) if hf_ignored_quantization_config.get("kv_cache_quant_method") else None,
                "activation_scheme": ActivationScheme(
                    hf_ignored_quantization_config.get("activation_scheme").upper()
                ) if hf_ignored_quantization_config.get("activation_scheme") else None,
                "group_size": 128,
            }
            if hf_ignored_quantization_config.get(
                    "quant_method") == "fp8" and hf_ignored_quantization_config.get("weight_block_size", []):
                quant_config.exclude_quantization["quant_algo"] = QuantAlgo.FP8_BLOCK_SCALES
                block_size = hf_ignored_quantization_config.get("weight_block_size", [])
                assert tuple(block_size) == (
                    128,
                    128), "FP8_BLOCK_SCALES only supports block_size=(128,128)"
                quant_config.exclude_quantization["group_size"] = block_size[0]
            elif hf_ignored_quantization_config.get("quant_method") == "fp8":
                quant_config.exclude_quantization["quant_algo"] = QuantAlgo.FP8
            elif hf_ignored_quantization_config.get("quant_method") == "w4a8_awq":
                quant_config.exclude_quantization["quant_algo"] = QuantAlgo.W4A8_AWQ
                quant_config.exclude_quantization["group_size"] = hf_ignored_quantization_config.get(
                    "weight_group_size", 128)
            else:
                raise NotImplementedError(f"Unsupported quantization_config.ignored_quantization_config: "
                                          f"{hf_ignored_quantization_config}.")

⚠️ Potential issue

Fix exclude overrides for HF: wrong attribute name + missing FP8 block handling.

Same naming typo as above; also add FP8_BLOCK_SCALES handling when ignored_quantization_config carries weight_block_size.

-        hf_ignored_quantization_config = hf_quant_config.get("ignored_quantization_config")
-        if hf_ignored_quantization_config:
-            quant_config.exclude_quant_config = {
-                "kv_cache_quant_algo": QuantAlgo(
-                    hf_ignored_quantization_config.get("kv_cache_quant_method").upper()
-                ) if hf_ignored_quantization_config.get("kv_cache_quant_method") else None,
-                "activation_scheme": ActivationScheme(
-                    hf_ignored_quantization_config.get("activation_scheme").upper()
-                ) if hf_ignored_quantization_config.get("activation_scheme") else None,
-                "group_size": 128,
-            }
-            if hf_ignored_quantization_config.get(
-                    "quant_method") == "fp8" and hf_ignored_quantization_config.get("weight_block_size", []):
-                quant_config.exclude_quantization["quant_algo"] = QuantAlgo.FP8_BLOCK_SCALES
-                block_size = hf_ignored_quantization_config.get("weight_block_size", [])
-                assert tuple(block_size) == (
-                    128,
-                    128), "FP8_BLOCK_SCALES only supports block_size=(128,128)"
-                quant_config.exclude_quantization["group_size"] = block_size[0]
-            elif hf_ignored_quantization_config.get("quant_method") == "fp8":
-                quant_config.exclude_quantization["quant_algo"] = QuantAlgo.FP8
-            elif hf_ignored_quantization_config.get("quant_method") == "w4a8_awq":
-                quant_config.exclude_quantization["quant_algo"] = QuantAlgo.W4A8_AWQ
-                quant_config.exclude_quantization["group_size"] = hf_ignored_quantization_config.get(
-                    "weight_group_size", 128)
-            else:
-                raise NotImplementedError(f"Unsupported quantization_config.ignored_quantization_config: "
-                                          f"{hf_ignored_quantization_config}.")
+        hf_ignored_quant = hf_quant_config.get("ignored_quantization_config")
+        if hf_ignored_quant:
+            quant_config.exclude_quantization = {
+                "kv_cache_quant_algo": QuantAlgo(hf_ignored_quant["kv_cache_quant_method"].upper())
+                if hf_ignored_quant.get("kv_cache_quant_method") else None,
+                "activation_scheme": ActivationScheme(hf_ignored_quant["activation_scheme"].upper())
+                if hf_ignored_quant.get("activation_scheme") else None,
+                "group_size": 128,
+            }
+            if hf_ignored_quant.get("quant_method") == "fp8" and hf_ignored_quant.get("weight_block_size"):
+                block_size = hf_ignored_quant["weight_block_size"]
+                assert tuple(block_size) == (128, 128), "FP8_BLOCK_SCALES only supports block_size=(128,128)"
+                quant_config.exclude_quantization["quant_algo"] = QuantAlgo.FP8_BLOCK_SCALES
+                quant_config.exclude_quantization["group_size"] = block_size[0]
+            elif hf_ignored_quant.get("quant_method") == "fp8":
+                quant_config.exclude_quantization["quant_algo"] = QuantAlgo.FP8
+            elif hf_ignored_quant.get("quant_method") == "w4a8_awq":
+                quant_config.exclude_quantization["quant_algo"] = QuantAlgo.W4A8_AWQ
+                quant_config.exclude_quantization["group_size"] = hf_ignored_quant.get("weight_group_size", 128)
+            else:
+                raise NotImplementedError(f"Unsupported quantization_config.ignored_quantization_config: {hf_ignored_quant}.")
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/model_config.py around lines 356 to 385, the code sets
quant_config.exclude_quant_config but later references
quant_config.exclude_quantization (typo/inconsistent attribute name) and it also
doesn't correctly handle the FP8 block-size case; change the initial assignment
to use the correct attribute name exclude_quantization (not
exclude_quant_config) and build that dict with consistent keys
("kv_cache_quant_algo", "activation_scheme", "group_size"); then update the FP8
branch so that when hf_ignored_quantization_config["quant_method"] == "fp8" and
weight_block_size is provided you set quant_algo = QuantAlgo.FP8_BLOCK_SCALES,
assert the block_size equals (128,128) and set group_size = block_size[0],
otherwise set quant_algo = QuantAlgo.FP8 for the non-block case; keep the
w4a8_awq branch setting QuantAlgo.W4A8_AWQ and group_size from weight_group_size
as before, and raise NotImplementedError for unsupported configs.
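
For reference, a minimal sketch of the quant_method-to-(algo, group_size) mapping the fix above mirrors, with plain strings standing in for QuantAlgo members and the input dict shaped like an HF ignored_quantization_config.

def map_ignored_quant(cfg):
    method = cfg.get("quant_method")
    if method == "fp8" and cfg.get("weight_block_size"):
        block = tuple(cfg["weight_block_size"])
        assert block == (128, 128), "FP8_BLOCK_SCALES only supports block_size=(128,128)"
        return "FP8_BLOCK_SCALES", block[0]
    if method == "fp8":
        return "FP8", None
    if method == "w4a8_awq":
        return "W4A8_AWQ", cfg.get("weight_group_size", 128)
    raise NotImplementedError(f"Unsupported ignored_quantization_config: {cfg}")

print(map_ignored_quant({"quant_method": "fp8", "weight_block_size": [128, 128]}))
# -> ('FP8_BLOCK_SCALES', 128)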

Comment on lines +461 to +466
                if quant_config.exclude_modules:
                    if hf_quant_config.get("ignored_modules"):
                        quant_config.exclude_modules += hf_quant_config.get("ignored_modules")
                else:
                    quant_config.exclude_modules = hf_quant_config.get("ignored_modules")
                # set exclude_quant_config
🛠️ Refactor suggestion

Accept both 'ignored_modules' and 'ignored_layers'.

Align with the other loader to avoid missing excludes.

-                if quant_config.exclude_modules:
-                    if hf_quant_config.get("ignored_modules"):
-                        quant_config.exclude_modules += hf_quant_config.get("ignored_modules")
-                else:
-                    quant_config.exclude_modules = hf_quant_config.get("ignored_modules")
+                ignored = hf_quant_config.get("ignored_modules") or hf_quant_config.get("ignored_layers")
+                if ignored:
+                    if quant_config.exclude_modules:
+                        quant_config.exclude_modules += ignored
+                    else:
+                        quant_config.exclude_modules = ignored
🤖 Prompt for AI Agents
In tensorrt_llm/llmapi/llm_utils.py around lines 461 to 466, update the logic
that reads hf_quant_config ignored entries so it accepts both "ignored_modules"
and "ignored_layers" (treating them equivalently), and correctly accumulates
them into quant_config.exclude_modules: read both keys (preferring one if
needed), coerce the result to a list if None, and then either extend the
existing quant_config.exclude_modules or assign a new list (avoiding None
concatenation). Ensure you merge lists rather than overwrite unexpectedly so the
behavior matches the other loader.

Comment on lines +467 to +479
                hf_ignored_quantization_config = hf_quant_config.get("ignored_quantization_config")
                if hf_ignored_quantization_config:
                    quant_config.exclude_quant_config = {
                        "quant_algo": QuantAlgo(
                            hf_ignored_quantization_config.get("quant_method").upper()
                        ) if hf_ignored_quantization_config.get("quant_method") else None,
                        "kv_cache_quant_algo": QuantAlgo(
                            hf_ignored_quantization_config.get("kv_cache_quant_method").upper()
                        ) if hf_ignored_quantization_config.get("kv_cache_quant_method") else None,
                        "activation_scheme": ActivationScheme(
                            hf_ignored_quantization_config.get("activation_scheme").upper()
                        ) if hf_ignored_quantization_config.get("activation_scheme") else None,
                    }
⚠️ Potential issue

Fix exclude overrides: wrong attribute, add FP8 block handling, and group_size.

Use exclude_quantization (not exclude_quant_config) and mirror FP8_BLOCK_SCALES logic.

-                hf_ignored_quantization_config = hf_quant_config.get("ignored_quantization_config")
-                if hf_ignored_quantization_config:
-                    quant_config.exclude_quant_config = {
-                        "quant_algo": QuantAlgo(
-                            hf_ignored_quantization_config.get("quant_method").upper()
-                        ) if hf_ignored_quantization_config.get("quant_method") else None,
-                        "kv_cache_quant_algo": QuantAlgo(
-                            hf_ignored_quantization_config.get("kv_cache_quant_method").upper()
-                        ) if hf_ignored_quantization_config.get("kv_cache_quant_method") else None,
-                        "activation_scheme": ActivationScheme(
-                            hf_ignored_quantization_config.get("activation_scheme").upper()
-                        ) if hf_ignored_quantization_config.get("activation_scheme") else None,
-                    }
+                ignored_q = hf_quant_config.get("ignored_quantization_config")
+                if ignored_q:
+                    quant_config.exclude_quantization = {
+                        "kv_cache_quant_algo": QuantAlgo(ignored_q["kv_cache_quant_method"].upper())
+                        if ignored_q.get("kv_cache_quant_method") else None,
+                        "activation_scheme": ActivationScheme(ignored_q["activation_scheme"].upper())
+                        if ignored_q.get("activation_scheme") else None,
+                        "group_size": 128,
+                    }
+                    if ignored_q.get("quant_method") == "fp8" and ignored_q.get("weight_block_size"):
+                        block_size = ignored_q["weight_block_size"]
+                        assert tuple(block_size) == (128, 128), "FP8_BLOCK_SCALES only supports block_size=(128,128)"
+                        quant_config.exclude_quantization["quant_algo"] = QuantAlgo.FP8_BLOCK_SCALES
+                        quant_config.exclude_quantization["group_size"] = block_size[0]
+                    elif ignored_q.get("quant_method") == "fp8":
+                        quant_config.exclude_quantization["quant_algo"] = QuantAlgo.FP8
+                    elif ignored_q.get("quant_method") == "w4a8_awq":
+                        quant_config.exclude_quantization["quant_algo"] = QuantAlgo.W4A8_AWQ
+                        quant_config.exclude_quantization["group_size"] = ignored_q.get("weight_group_size", 128)
+                    else:
+                        raise NotImplementedError(f"Unsupported quantization_config.ignored_quantization_config: {ignored_q}.")
🤖 Prompt for AI Agents
In tensorrt_llm/llmapi/llm_utils.py around lines 467 to 479, the code mistakenly
sets quant_config.exclude_quant_config instead of
quant_config.exclude_quantization and omits FP8 block handling and group_size;
change the assignment to quant_config.exclude_quantization, map the keys to the
correct QuantAlgo/ActivationScheme enums as before, and add handling for FP8
block scales by mirroring the existing FP8_BLOCK_SCALES logic (include
fp8_block_scales or equivalent field and group_size parsing from
hf_ignored_quantization_config) so excluded FP8 settings and group_size are
applied consistently with the other FP8 handling code.

@bppan force-pushed the support_angelslim_w4 branch from 1dbc226 to 62774f9 on September 10, 2025 at 10:25