
Conversation


@bppan commented Aug 13, 2025

Summary by CodeRabbit

  • New Features
    • Added support for Angelslim quantization configs when loading pretrained models.
    • Expanded HF quantization config support: w4a8_awq and fp8 methods, kv-cache quantization, activation schemes (STATIC/DYNAMIC), module exclusions, and per-module override configs.
    • Configurable weight name in MoE backends (supports qweight/weight paths) and enriched per-expert FP8/FP4 scale handling, plus SM100-specific resmoothing.
    • Added logging of detected quantization settings.
  • Bug Fixes
    • Corrected dequantization handling for certain DeepSeek V3 weights to respect exclusion settings.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
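
For example, a typical invocation that combines a couple of the flags above might look like this (the stage name is the illustrative one used in this help text):

    /bot run --disable-fail-fast --stage-list "A10-PyTorch-1"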

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

@bppan requested review from a team as code owners on August 13, 2025 08:18

coderabbitai bot commented Aug 13, 2025

📝 Walkthrough


Adds Angelslim quant config loader and wires it into pretrained config flow; extends HF quant config parsing (new quant methods, kv_cache and activation fields, ignored_quantization handling); adds ActivationScheme and exclude_quantization to QuantConfig; adjusts exclude-module overrides and DeepseekV3 dequant gating; overhauls fused MoE weight-loading to accept weight_name and compute per-expert FP8/FP4 scales with SM100 resmoothing.

Changes

  • Quant schema & enums (tensorrt_llm/models/modeling_utils.py, tensorrt_llm/quantization/mode.py):
    Added ActivationScheme enum and extended QuantConfig with activation_scheme and exclude_quantization fields (typing/docs).
  • HF / Angelslim quant config loading (tensorrt_llm/_torch/model_config.py, tensorrt_llm/llmapi/llm_utils.py):
    Added load_angelslim_quant_config static loader; from_pretrained now dispatches to angelslim_hf_quant_config.json when present; HF loader extended to support quant_method "w4a8_awq"/"fp8", kv_cache_quant_method, activation_scheme, merging ignored_modules→exclude_modules, ignored_quantization_config→exclude_quant_config, logging, and NotImplemented guards for unsupported methods.
  • Exclude-module quant overrides (tensorrt_llm/_torch/models/modeling_utils.py):
    apply_quant_config_exclude_modules now builds per-excluded-module QuantConfig using fields from exclude_quantization (quant_algo, activation_scheme, group_size) while preserving kv_cache_quant_algo precedence.
  • DeepseekV3 kv_b_proj gating (tensorrt_llm/_torch/models/modeling_deepseekv3.py):
    Tightened condition for dequantizing kv_b_proj: requires module in exclude_modules AND exclude_quant_config is None.
  • Fused MoE weight loading & FP8/FP4 scales (tensorrt_llm/_torch/modules/fused_moe/quantization.py):
    Introduced weight_name parameter across loaders (defaults preserved); load expert weights by expert.w{1,3,2}.{weight_name}; reworked FP8 QDQ scale/alpha computation to per-expert statistics, interleaving/normalization, and per-SM resmoothing (SM100) of weight_scale_inv; updated backend overrides to accept/forward weight_name.

Sequence Diagram(s)

sequenceDiagram
  participant Caller
  participant ModelConfig
  participant FS as Filesystem
  participant Loader

  Caller->>ModelConfig: from_pretrained(model_dir)
  ModelConfig->>FS: stat angelslim_hf_quant_config.json
  alt angelslim exists
    ModelConfig->>Loader: load_angelslim_quant_config(file)
    Loader-->>ModelConfig: quant_config, layer_quant_config=None
  else
    ModelConfig->>FS: stat hf_quant_config.json
    ModelConfig->>Loader: load_hf_quant_config(file)
    Loader-->>ModelConfig: quant_config (includes activation_scheme/exclude_quant_config)
  end
  ModelConfig-->>Caller: model config with QuantConfig
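
A minimal sketch of the dispatch illustrated above, based on the hunk quoted later in this review (the helper name and the loader signatures are assumptions, not the exact implementation):

from pathlib import Path

def pick_quant_config(loader, model_dir: Path, moe_backend: str):
    # Hypothetical helper: prefer the Angelslim config when present,
    # otherwise fall back to the standard hf_quant_config.json.
    if (quant_config_file := model_dir / 'angelslim_hf_quant_config.json').exists():
        return loader.load_angelslim_quant_config(quant_config_file, model_dir, moe_backend)
    if (quant_config_file := model_dir / 'hf_quant_config.json').exists():
        return loader.load_hf_quant_config(quant_config_file, model_dir, moe_backend)
    return None, None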
sequenceDiagram
  participant Caller
  participant FusedMoE
  participant Weights as WeightsDict
  participant Backend

  Caller->>FusedMoE: load_weights(module, weights, mode, weight_name)
  FusedMoE->>Weights: read expert.w1.{weight_name}, expert.w3.{weight_name}, expert.w2.{weight_name}
  alt FP8 QDQ path
    FusedMoE->>FusedMoE: compute per-expert w1/w3/w2 scales & alphas
    FusedMoE->>FusedMoE: normalize & interleave scales into target tensors
    opt SM100 resmoothing
      FusedMoE->>FusedMoE: resmooth weight_scale_inv → weight keys
    end
  end
  FusedMoE->>Backend: set tensors/scales
  FusedMoE-->>Caller: done

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Suggested reviewers

  • achartier
  • hlu1
  • Tracin
  • Superjomn
  • litaotju
  • yuxianq

Pre-merge checks (3 warnings)

❌ Failed checks (3 warnings)
  • Title Check (⚠️ Warning): The current title “Support W4A8 method of AngleSlim tool” only highlights one quant method and misspells “Angelslim,” while the PR introduces extensive Angelslim HF quant config support, a new loader, FP8 mappings, HF loader enhancements, and broader quantization changes across multiple modules. It is overly narrow and contains a typo, so it does not accurately reflect the main scope of the changes. Resolution: Please update the title to follow the repository template (e.g., “[TRTLLM-1234][feat] Add Angelslim HF quant config loader and extend quantization support”) so that it accurately and concisely summarizes the primary change and includes a valid ticket reference or “[None]” and type tag.
  • Description Check (⚠️ Warning): The PR description is entirely placeholder text and template comments without any actual summary, description of the implementation, or test coverage details. It fails to explain what was changed, why the changes were made, or which tests verify the new logic. Resolution: Please fill out the PR template by adding the @coderabbitai summary or a manual summary, a clear “## Description” section describing the issue and your solution, and a “## Test Coverage” section listing relevant tests or test plans to ensure adequate coverage of the new code paths.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 22.22%, which is insufficient; the required threshold is 80.00%. Resolution: You can run @coderabbitai generate docstrings to improve docstring coverage.




@coderabbitai bot left a comment


Actionable comments posted: 5

🔭 Outside diff range comments (2)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)

1361-1368: Use fully-qualified module name for exclusion check; current code can yield false negatives

is_module_excluded_from_quantization expects the full module path for pattern matching. Passing only names[-1] ("kv_b_proj") can miss exclusions like "model.layers.12.self_attn.kv_b_proj". Also, your new gating to skip dequant when exclude_quant_config is provided makes sense; keep it, but fix the name passed.

Apply this diff to pass the fully-qualified module name:

-                    dequant_kv_b_proj = self.model_config.quant_config.is_module_excluded_from_quantization(
-                        names[-1]) and self.model_config.quant_config.exclude_quant_config is None
+                    dequant_kv_b_proj = (
+                        self.model_config.quant_config.is_module_excluded_from_quantization(name)
+                        and self.model_config.quant_config.exclude_quant_config is None
+                    )
tensorrt_llm/_torch/models/modeling_utils.py (1)

476-488: Preserve base QuantConfig when overriding for excluded modules; add None-safety

The new per-excluded override only sets quant_algo and activation_scheme and loses other QuantConfig fields (e.g., group_size, clamp_val, has_zero_point). Build the override from the existing QuantConfig to preserve defaults and runtime semantics. Also, guard against a None quant_config.

Apply this diff:

-        quant_algo = None
-        activation_scheme = None
-        exclude_quant_config = quant_config.exclude_quant_config
-        if exclude_quant_config:
-            quant_algo = exclude_quant_config.get("quant_algo", None)
-            activation_scheme = exclude_quant_config.get("activation_scheme", None)
-        new_config = QuantConfig(
-            quant_algo=quant_algo, kv_cache_quant_algo=kv_cache_quant_algo, activation_scheme=activation_scheme)
+        quant_algo = None
+        activation_scheme = None
+        exclude_quant_cfg = getattr(quant_config, "exclude_quant_config", None)
+        if exclude_quant_cfg:
+            quant_algo = exclude_quant_cfg.get("quant_algo")
+            activation_scheme = exclude_quant_cfg.get("activation_scheme")
+        # Preserve all other QuantConfig fields while overriding specific attributes
+        base_config = quant_config or QuantConfig()
+        new_config = dataclass_replace(
+            base_config,
+            quant_algo=quant_algo,
+            activation_scheme=activation_scheme,
+            kv_cache_quant_algo=kv_cache_quant_algo,
+        )

And add this import at the top of the file (outside this hunk):

from dataclasses import replace as dataclass_replace
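
For reference, dataclasses.replace copies every field that is not explicitly overridden, which is why the other QuantConfig fields are preserved; a tiny standalone illustration with a toy dataclass (not the real QuantConfig):

from dataclasses import dataclass, replace as dataclass_replace

@dataclass
class Cfg:  # toy stand-in for QuantConfig, for illustration only
    quant_algo: str = None
    group_size: int = 128
    activation_scheme: str = None

base = Cfg(quant_algo="FP8", group_size=64)
override = dataclass_replace(base, quant_algo=None, activation_scheme="STATIC")
assert override.group_size == 64  # fields not named in replace() carry over from base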
🧹 Nitpick comments (3)
tensorrt_llm/quantization/mode.py (1)

463-466: Enum addition looks good; consider adding a brief docstring for clarity

ActivationScheme is correctly defined as a StrEnum with BaseEnumMeta and is ready for serialization and validation. A short docstring would improve readability.

Apply this diff to add a concise docstring:

 class ActivationScheme(StrEnum, metaclass=BaseEnumMeta):
-    STATIC = auto()
-    DYNAMIC = auto()
+    """Activation quantization scheme."""
+    STATIC = auto()
+    DYNAMIC = auto()
tensorrt_llm/models/modeling_utils.py (1)

143-145: Tighten docstring wording and wrap long lines (fixes E501 and improves clarity)

The current docstrings have grammar issues and exceed 120 chars. Reword and wrap to meet style and static-analysis guidance.

Apply this diff:

-        exclude_quant_config  (Dict, optional): The model of exclude_modules will use exclude_quant_config.
-        activation_scheme (tensorrt_llm.quantization.mode.ActivationScheme, optional): The input of activation quantize scheme.
+        exclude_quant_config (Dict, optional): Per‑module quantization overrides applied to modules
+            matched by exclude_modules. Only the provided fields are overridden (e.g., quant_algo,
+            kv_cache_quant_algo, activation_scheme).
+        activation_scheme (tensorrt_llm.quantization.mode.ActivationScheme, optional): Activation
+            quantization scheme (e.g., STATIC or DYNAMIC).
tensorrt_llm/_torch/model_config.py (1)

267-279: Simplify nested dictionary initialization for exclude_quantization.

The nested ternary operations make the code hard to read and maintain. Consider extracting this into a helper function for better clarity.

Extract the logic into a helper function:

+def _parse_exclude_quantization(json_exclude_quant_configs):
+    if not json_exclude_quant_configs:
+        return None
+    
+    result = {}
+    if json_exclude_quant_configs.get('quant_algo'):
+        result['quant_algo'] = QuantAlgo(json_exclude_quant_configs['quant_algo'].upper())
+    else:
+        result['quant_algo'] = None
+    
+    if json_exclude_quant_configs.get('kv_cache_quant_algo'):
+        result['kv_cache_quant_algo'] = QuantAlgo(json_exclude_quant_configs['kv_cache_quant_algo'].upper())
+    else:
+        result['kv_cache_quant_algo'] = None
+    
+    if json_exclude_quant_configs.get('activation_scheme'):
+        result['activation_scheme'] = ActivationScheme(json_exclude_quant_configs['activation_scheme'].upper())
+    else:
+        result['activation_scheme'] = None
+    
+    return result

 json_exclude_quant_configs = json_quant_configs.get('exclude_quantization', None)
-if json_exclude_quant_configs:
-    quant_config.exclude_quant_config = {
-        "quant_algo": QuantAlgo(
-            json_exclude_quant_configs.get('quant_algo', None).upper()
-        ) if json_exclude_quant_configs.get("quant_algo") else None,
-        "kv_cache_quant_algo": QuantAlgo(
-            json_exclude_quant_configs.get("kv_cache_quant_algo").upper()
-        ) if json_exclude_quant_configs.get("kv_cache_quant_algo") else None,
-        "activation_scheme": ActivationScheme(
-            json_exclude_quant_configs.get('activation_scheme', None).upper()
-        ) if json_exclude_quant_configs.get("activation_scheme") else None,
-    }
+quant_config.exclude_quant_config = _parse_exclude_quantization(json_exclude_quant_configs)
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 50e5e72 and 15fbe20.

📒 Files selected for processing (7)
  • tensorrt_llm/_torch/model_config.py (4 hunks)
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_utils.py (1 hunks)
  • tensorrt_llm/_torch/modules/fused_moe/quantization.py (6 hunks)
  • tensorrt_llm/llmapi/llm_utils.py (2 hunks)
  • tensorrt_llm/models/modeling_utils.py (3 hunks)
  • tensorrt_llm/quantization/mode.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/quantization/mode.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • tensorrt_llm/_torch/models/modeling_utils.py
  • tensorrt_llm/models/modeling_utils.py
  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/llmapi/llm_utils.py
  • tensorrt_llm/_torch/modules/fused_moe/quantization.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/quantization/mode.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • tensorrt_llm/_torch/models/modeling_utils.py
  • tensorrt_llm/models/modeling_utils.py
  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/llmapi/llm_utils.py
  • tensorrt_llm/_torch/modules/fused_moe/quantization.py
🧬 Code Graph Analysis (7)
tensorrt_llm/quantization/mode.py (1)
tensorrt_llm/_utils.py (1)
  • BaseEnumMeta (773-780)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)
tensorrt_llm/llmapi/llm_args.py (2)
  • quant_config (2145-2148)
  • quant_config (2151-2152)
tensorrt_llm/_torch/models/modeling_utils.py (2)
tensorrt_llm/models/modeling_utils.py (2)
  • quant_algo (550-551)
  • QuantConfig (128-271)
tensorrt_llm/llmapi/llm_args.py (2)
  • quant_config (2145-2148)
  • quant_config (2151-2152)
tensorrt_llm/models/modeling_utils.py (1)
tensorrt_llm/quantization/mode.py (2)
  • QuantAlgo (23-46)
  • ActivationScheme (463-465)
tensorrt_llm/_torch/model_config.py (2)
tensorrt_llm/quantization/mode.py (2)
  • QuantAlgo (23-46)
  • ActivationScheme (463-465)
tensorrt_llm/models/modeling_utils.py (2)
  • QuantConfig (128-271)
  • quant_algo (550-551)
tensorrt_llm/llmapi/llm_utils.py (3)
tensorrt_llm/models/modeling_utils.py (3)
  • PretrainedConfig (369-570)
  • QuantConfig (128-271)
  • quant_algo (550-551)
tensorrt_llm/quantization/mode.py (2)
  • QuantAlgo (23-46)
  • ActivationScheme (463-465)
tensorrt_llm/llmapi/llm_args.py (2)
  • quant_config (2145-2148)
  • quant_config (2151-2152)
tensorrt_llm/_torch/modules/fused_moe/quantization.py (3)
tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
  • MoEWeightLoadingMode (13-15)
tests/unittest/_torch/modules/test_fused_moe.py (1)
  • load_weights (1660-1725)
tensorrt_llm/_torch/utils.py (1)
  • shape (103-104)
🪛 Ruff (0.12.2)
tensorrt_llm/models/modeling_utils.py

143-143: Line too long (127 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (7)
tensorrt_llm/llmapi/llm_utils.py (1)

438-474: Manual verification needed for kv_cache_quant_method & activation_scheme enum values

It looks like no quantization_config.kv_cache_quant_method or .activation_scheme entries were found in the repo’s JSON files—this may simply mean that those fields live in user‐supplied Hugging Face configs or are generated at runtime. Before merging the enum-validation refactor, please:

  • Confirm which enum members actually appear in your external HF configs or examples (e.g. via your model hubs or deployment manifests).
  • Ensure the hardcoded “Expected one of” lists (INT8, FP8, NVFP4 and STATIC, DYNAMIC) cover every real-world value your users might supply.
  • Update the error messages or docstrings accordingly if there are additional valid enum values.
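
For orientation, the fields under discussion would sit in an HF-style quantization_config roughly like the following (field names are taken from this review; the values and surrounding layout are illustrative assumptions):

quantization_config = {
    "quant_method": "fp8",            # or "w4a8_awq", per the HF loader changes
    "kv_cache_quant_method": "fp8",   # uppercased before enum conversion; expected INT8/FP8/NVFP4
    "activation_scheme": "static",    # uppercased before enum conversion; expected STATIC/DYNAMIC
    "ignored_modules": ["lm_head"],   # merged into QuantConfig.exclude_modules (module name illustrative)
}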
tensorrt_llm/_torch/model_config.py (2)

18-18: LGTM! Import addition for ActivationScheme is appropriate.

The addition of ActivationScheme to the import statement is necessary for the new functionality.


326-330: Good addition of W4A8_AWQ support with proper error handling.

The new case for "w4a8_awq" correctly maps to QuantAlgo.W4A8_AWQ, and the addition of NotImplementedError for unsupported quant methods improves error handling.

tensorrt_llm/_torch/modules/fused_moe/quantization.py (4)

209-210: Good addition of weight_name parameter for flexible weight loading.

The addition of the weight_name parameter with a default value of "weight" provides backward compatibility while enabling support for different weight types.


218-220: LGTM! Proper use of weight_name in expert weight keys.

The implementation correctly uses the weight_name parameter to construct the expert weight keys, allowing for flexible weight type selection.


959-963: Good override maintaining backward compatibility.

The WInt4AFP8FusedMoEMethod.load_weights override with weight_name: str = "qweight" properly handles quantized weights while maintaining the interface contract.


985-991: Per-expert FP8 scale handling is correct
All of the expected "{expert_id}.w1.weight_scale" and "{expert_id}.w3.weight_scale" keys are present in the tests, and the code’s element-wise max → stack → reshape logic aligns with the shape of module.fc31_alpha. No missing keys or shape mismatches were found.
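
A rough, self-contained sketch of the element-wise max and stack pattern described above; the dictionary layout, tensor shapes, and names are assumptions for illustration only:

import torch

num_experts, dim = 4, 8
# Assumed layout: one weight_scale tensor per expert for w1 and w3.
weights = {f"{eid}.w{k}.weight_scale": torch.rand(dim)
           for eid in range(num_experts) for k in (1, 3)}

per_expert_max = torch.stack([
    torch.maximum(weights[f"{eid}.w1.weight_scale"],
                  weights[f"{eid}.w3.weight_scale"])
    for eid in range(num_experts)
])  # shape: (num_experts, dim); reshaped afterwards to match the target tensor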

Comment on lines 241 to 338
@staticmethod
def load_angelslim_quant_config(quant_config_file, model_dir, moe_backend):
    quant_config = QuantConfig()
    layer_quant_config = None

    with open(quant_config_file) as f:
        quant_config_dict = json.load(f)

    json_quant_configs = quant_config_dict['quantization']

    quant_config.quant_algo = QuantAlgo(
        json_quant_configs.get('quant_algo', None).upper()) if json_quant_configs.get("quant_algo") else None
    # fp8_pb_wo from modelopt is the same as FP8_BLOCK_SCALES
    if quant_config.quant_algo == "fp8_pb_wo":
        quant_config.quant_algo = QuantAlgo('FP8_BLOCK_SCALES')

    quant_config.kv_cache_quant_algo = QuantAlgo(
        json_quant_configs.get("kv_cache_quant_algo").upper()
    ) if json_quant_configs.get("kv_cache_quant_algo") else None
    quant_config.group_size = json_quant_configs.get('group_size', None)
    quant_config.exclude_modules = json_quant_configs.get(
        'exclude_modules', None)
    quant_config.activation_scheme = ActivationScheme(
        json_quant_configs.get('activation_scheme', None).upper()
    ) if json_quant_configs.get("activation_scheme") else None

    json_exclude_quant_configs = json_quant_configs.get('exclude_quantization', None)
    if json_exclude_quant_configs:
        quant_config.exclude_quant_config = {
            "quant_algo": QuantAlgo(
                json_exclude_quant_configs.get('quant_algo', None).upper()
            ) if json_exclude_quant_configs.get("quant_algo") else None,
            "kv_cache_quant_algo": QuantAlgo(
                json_exclude_quant_configs.get("kv_cache_quant_algo").upper()
            ) if json_exclude_quant_configs.get("kv_cache_quant_algo") else None,
            "activation_scheme": ActivationScheme(
                json_exclude_quant_configs.get('activation_scheme', None).upper()
            ) if json_exclude_quant_configs.get("activation_scheme") else None,
        }
    return quant_config, layer_quant_config



⚠️ Potential issue

Add error handling for malformed JSON config files.

The new load_angelslim_quant_config method assumes the JSON file is well-formed and contains the expected structure. Add error handling for cases where required keys might be missing or the JSON is malformed.

Consider adding validation like this:

 def load_angelslim_quant_config(quant_config_file, model_dir, moe_backend):
     quant_config = QuantConfig()
     layer_quant_config = None
 
-    with open(quant_config_file) as f:
-        quant_config_dict = json.load(f)
+    try:
+        with open(quant_config_file) as f:
+            quant_config_dict = json.load(f)
+    except (json.JSONDecodeError, IOError) as e:
+        raise ValueError(f"Failed to load angelslim config from {quant_config_file}: {e}")
 
-    json_quant_configs = quant_config_dict['quantization']
+    json_quant_configs = quant_config_dict.get('quantization', {})
+    if not json_quant_configs:
+        raise ValueError(f"Missing 'quantization' section in {quant_config_file}")
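
Separately, for orientation: judging from the keys this loader reads, an angelslim_hf_quant_config.json of roughly the following shape would be accepted (shown here as a Python dict; the keys come from the quoted code, the values are illustrative):

angelslim_hf_quant_config = {
    "quantization": {
        "quant_algo": "w4a8_awq",
        "kv_cache_quant_algo": "fp8",
        "group_size": 128,
        "exclude_modules": ["lm_head"],
        "activation_scheme": "static",
        "exclude_quantization": {
            "quant_algo": "fp8",
            "kv_cache_quant_algo": None,
            "activation_scheme": "dynamic",
        },
    }
}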

Comment on lines +251 to +303
    quant_config.quant_algo = QuantAlgo(
        json_quant_configs.get('quant_algo', None).upper()) if json_quant_configs.get("quant_algo") else None
    # fp8_pb_wo from modelopt is the same as FP8_BLOCK_SCALES
    if quant_config.quant_algo == "fp8_pb_wo":
        quant_config.quant_algo = QuantAlgo('FP8_BLOCK_SCALES')


⚠️ Potential issue

Logical issue: string comparison against enum value.

Line 254 compares quant_config.quant_algo (which is now a QuantAlgo enum) against the string "fp8_pb_wo". This will always fail because you're comparing an enum to a string.

Fix the comparison:

-    quant_config.quant_algo = QuantAlgo(
-        json_quant_configs.get('quant_algo', None).upper()) if json_quant_configs.get("quant_algo") else None
-    # fp8_pb_wo from modelopt is the same as FP8_BLOCK_SCALES
-    if quant_config.quant_algo == "fp8_pb_wo":
-        quant_config.quant_algo = QuantAlgo('FP8_BLOCK_SCALES')
+    algo_str = json_quant_configs.get('quant_algo')
+    if algo_str:
+        algo_str = algo_str.upper()
+        # fp8_pb_wo from modelopt is the same as FP8_BLOCK_SCALES
+        if algo_str == "FP8_PB_WO":
+            quant_config.quant_algo = QuantAlgo('FP8_BLOCK_SCALES')
+        else:
+            quant_config.quant_algo = QuantAlgo(algo_str)
+    else:
+        quant_config.quant_algo = None
🤖 Prompt for AI Agents
tensorrt_llm/_torch/model_config.py around lines 251 to 255: the code currently
compares the QuantAlgo enum instance to the string "fp8_pb_wo", which always
fails; change the check to compare the enum's name or value to that string (for
example: if quant_config.quant_algo.name.lower() == "fp8_pb_wo") and then set
quant_config.quant_algo = QuantAlgo("FP8_BLOCK_SCALES") as before.

Comment on lines +438 to +519
        elif (quant_config_file := model_dir / 'angelslim_hf_quant_config.json').exists():
            quant_config, layer_quant_config = cls.load_angelslim_quant_config(
                quant_config_file, model_dir, moe_backend)


💡 Verification agent

🧩 Analysis chain

Verify the consistency of the naming convention for config files.

The code now supports both angelslim_hf_quant_config.json and hf_quant_config.json. Ensure this naming is consistent with the tool's documentation and expected usage.


🏁 Script executed:

#!/bin/bash
# Check for any existing references to these config file names in the codebase
echo "Searching for references to angelslim_hf_quant_config.json:"
rg "angelslim_hf_quant_config" --type py -A 2 -B 2

echo -e "\nSearching for references to hf_quant_config.json:"
rg "hf_quant_config\.json" --type py -A 2 -B 2

echo -e "\nChecking for any documentation about these config files:"
fd -e md -e rst | xargs rg -l "angelslim_hf_quant_config|hf_quant_config"

Length of output: 8264


Update documentation to include the new Angelslim config filename

The code now recognizes both hf_quant_config.json and angelslim_hf_quant_config.json, but our docs and examples only mention the former. Please update all references so users know both are supported:

• docs/source/performance/perf-benchmarking.md
• examples/models/core/deepseek_v3/README.md

– Add a note explaining that if an angelslim_hf_quant_config.json is present in the model directory, it will be loaded via load_angelslim_quant_config()
– Ensure any code snippets or CLI examples show both filenames where applicable

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/model_config.py around lines 438 to 440, the code now
accepts an additional Angelslim config filename
(angelslim_hf_quant_config.json); update the documentation and example README to
mention both hf_quant_config.json and angelslim_hf_quant_config.json. Edit
docs/source/performance/perf-benchmarking.md and
examples/models/core/deepseek_v3/README.md to add a short note that if
angelslim_hf_quant_config.json exists in the model directory it will be loaded
via load_angelslim_quant_config(), and update any code snippets or CLI examples
to show both filenames where applicable (e.g., list both filenames in examples
and usage text).

Comment on lines 1002 to 1014
 all_w3_scales = torch.stack(all_w3_scales) / all_w3_w1_scales_fp8_max.unsqueeze(2)
 all_w1_scales = [
-    load_weight_shard(weights[f"{expert_id}.w1.weight_scale_inv"],
+    load_weight_shard(weights[f"{expert_id}.w1.weight_scale.int4"],
         module.tp_size,
         module.tp_rank,
         TensorParallelMode.COLUMN,
         device=self.device)
     for expert_id in module.initial_local_expert_ids
 ]
 all_w1_scales = torch.stack(all_w1_scales) / all_w3_w1_scales_fp8_max.unsqueeze(2)
 all_w3_w1_scales = torch.cat(
-    [torch.stack(all_w3_scales),
-     torch.stack(all_w1_scales)], dim=-2)
+    [all_w3_scales,
+     all_w1_scales], dim=-2)


⚠️ Potential issue

Potential division by zero when scales are zero.

The code divides by all_w3_w1_scales_fp8_max and all_w2_scales_fp8 without checking if these values are zero. This could lead to runtime errors or inf/nan values.

Add validation for zero scales:

 all_w3_scales = torch.stack(all_w3_scales) / all_w3_w1_scales_fp8_max.unsqueeze(2)
+if torch.any(all_w3_w1_scales_fp8_max == 0):
+    raise ValueError("Found zero FP8 weight scales, which would cause division by zero")
 
 all_w1_scales = torch.stack(all_w1_scales) / all_w3_w1_scales_fp8_max.unsqueeze(2)
 
 all_w2_scales = torch.stack(all_w2_scales) / all_w2_scales_fp8.unsqueeze(2)
+if torch.any(all_w2_scales_fp8 == 0):
+    raise ValueError("Found zero FP8 weight scales for w2, which would cause division by zero")

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/fused_moe/quantization.py around lines 1002 to
1014, the code divides tensors by all_w3_w1_scales_fp8_max and all_w2_scales_fp8
without guarding against zeros; add a defensive check to avoid divide-by-zero by
replacing zero (or near-zero) elements with a small safe epsilon (e.g., 1e-6) or
use torch.clamp_min / torch.where to ensure the denominators are >= epsilon
before performing the divisions, and optionally log or assert if any
replacements occurred to aid debugging.
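
A minimal sketch of the epsilon-clamping alternative mentioned in the prompt above (the helper name and epsilon value are illustrative, not part of the codebase):

import torch

def safe_div(numer: torch.Tensor, denom: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Clamp near-zero denominators to eps so the division cannot produce inf/nan.
    return numer / torch.clamp_min(denom, eps)

# e.g. all_w3_scales = safe_div(torch.stack(all_w3_scales),
#                               all_w3_w1_scales_fp8_max.unsqueeze(2))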

 all_w1_scales = [
-    load_weight_shard(weights[f"{expert_id}.w1.weight_scale_inv"],
+    load_weight_shard(weights[f"{expert_id}.w1.weight_scale.int4"],


⚠️ Potential issue

Inconsistent weight scale key naming: weight_scale.int4 vs weight_scale_inv.

Line 1004 uses f"{expert_id}.w1.weight_scale.int4" while line 995 uses f"{expert_id}.w3.weight_scale_inv". This inconsistency could indicate different weight formats or be a bug.

Please verify if this naming difference is intentional:

#!/bin/bash
# Check for both naming patterns in the codebase
echo "Searching for weight_scale.int4 pattern:"
rg "weight_scale\.int4" --type py -B 2 -A 2

echo -e "\nSearching for weight_scale_inv pattern:"
rg "weight_scale_inv" --type py -B 2 -A 2
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/modules/fused_moe/quantization.py around line 1004 there
is an inconsistent key name: line 1004 uses f"{expert_id}.w1.weight_scale.int4"
while earlier (line ~995) uses f"{expert_id}.w3.weight_scale_inv"; confirm
whether the correct stored key is weight_scale_inv or weight_scale.int4 by
searching the repo for both patterns, then make the keys consistent (prefer
using the canonical weight_scale_inv if other shards use that naming), update
the load_weight_shard call to use the canonical key across all shards, and add a
brief inline comment explaining the chosen convention so future readers know
which format is expected.

@svc-trtllm-gh-bot added the "Community want to contribute" label (PRs initiated from Community) on Aug 13, 2025

@coderabbitai bot left a comment


Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/models/modeling_utils.py (1)

468-507: Rename all remaining exclude_quant_config references to exclude_quantization.

  • tensorrt_llm/llmapi/llm_utils.py (around line 469): change quant_config.exclude_quant_config → quant_config.exclude_quantization.
  • tensorrt_llm/_torch/model_config.py (around lines 269 & 359): update both the assignment and subsequent checks of quant_config.exclude_quant_config → quant_config.exclude_quantization.
  • Re-run rg '\bexclude_quant_config\b' to confirm no leftovers.
♻️ Duplicate comments (2)
tensorrt_llm/_torch/model_config.py (2)

251-256: Fix enum/string mismatch for quant_algo ("fp8_pb_wo" mapping never triggers).

You construct an enum on Line 251, then compare it to a string on Line 254. This always fails. Normalize the source string first, then map to QuantAlgo.

-        quant_config.quant_algo = QuantAlgo(
-            json_quant_configs.get('quant_algo', None).upper()) if json_quant_configs.get("quant_algo") else None
-        # fp8_pb_wo from modelopt is the same as FP8_BLOCK_SCALES
-        if quant_config.quant_algo == "fp8_pb_wo":
-            quant_config.quant_algo = QuantAlgo('FP8_BLOCK_SCALES')
+        algo_str = json_quant_configs.get('quant_algo')
+        if algo_str:
+            algo_str = algo_str.upper()
+            # fp8_pb_wo from modelopt is the same as FP8_BLOCK_SCALES
+            if algo_str == "FP8_PB_WO":
+                quant_config.quant_algo = QuantAlgo.FP8_BLOCK_SCALES
+            else:
+                quant_config.quant_algo = QuantAlgo(algo_str)
+        else:
+            quant_config.quant_algo = None

246-249: Add robust error handling for malformed/absent quantization JSON.

Currently assumes file is valid JSON and contains 'quantization'. Add try/except and schema checks.

-        with open(quant_config_file) as f:
-            quant_config_dict = json.load(f)
+        try:
+            with open(quant_config_file) as f:
+                quant_config_dict = json.load(f)
+        except (json.JSONDecodeError, OSError) as e:
+            raise ValueError(f"Failed to load angelslim config from {quant_config_file}: {e}") from e
+        json_quant_configs = quant_config_dict.get('quantization')
+        if not isinstance(json_quant_configs, dict):
+            raise ValueError(f"Missing or invalid 'quantization' in {quant_config_file}")
-
-        json_quant_configs = quant_config_dict['quantization']
🧹 Nitpick comments (5)
tensorrt_llm/_torch/model_config.py (2)

242-242: Silence ARG004 or use the params.

model_dir and moe_backend were unused (now moe_backend is used above). If model_dir remains unused, prefix to _model_dir or document intent.

-    def load_angelslim_quant_config(quant_config_file, model_dir, moe_backend):
+    def load_angelslim_quant_config(quant_config_file, _model_dir, moe_backend):

1-1: Missing 2025 NVIDIA Apache-2.0 header.

Please prepend the standard header per repo guidelines.

tensorrt_llm/_torch/models/modeling_utils.py (1)

480-493: Good construction of per-exclusion QuantConfig.

  • Honors kv_cache precedence.
  • Allows per-excluded-module quant_algo/activation_scheme/group_size.

Minor: if exclude_quantization is None, consider using the base quant_config.group_size to keep consistency.

-        group_size = 128
+        group_size = quant_config.group_size or 128
tensorrt_llm/llmapi/llm_utils.py (2)

1-1: Missing 2025 NVIDIA Apache-2.0 header.

Please prepend the standard header per repo guidelines.


354-484: Unify configuration key for ignored modules
Replace ignored_layers in tensorrt_llm/_torch/model_config.py:354 with ignored_modules to match the usage in llmapi/llm_utils.py. No string-literal comparisons against QuantAlgo were found.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 15fbe20 and 1dbc226.

📒 Files selected for processing (5)
  • tensorrt_llm/_torch/model_config.py (4 hunks)
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_utils.py (1 hunks)
  • tensorrt_llm/llmapi/llm_utils.py (3 hunks)
  • tensorrt_llm/models/modeling_utils.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • tensorrt_llm/models/modeling_utils.py
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/llmapi/llm_utils.py
  • tensorrt_llm/_torch/models/modeling_utils.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/llmapi/llm_utils.py
  • tensorrt_llm/_torch/models/modeling_utils.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tensorrt_llm/_torch/model_config.py
  • tensorrt_llm/llmapi/llm_utils.py
  • tensorrt_llm/_torch/models/modeling_utils.py
🧬 Code graph analysis (3)
tensorrt_llm/_torch/model_config.py (2)
tensorrt_llm/quantization/mode.py (2)
  • QuantAlgo (23-46)
  • ActivationScheme (463-465)
tensorrt_llm/models/modeling_utils.py (2)
  • QuantConfig (128-272)
  • quant_algo (551-552)
tensorrt_llm/llmapi/llm_utils.py (4)
tensorrt_llm/models/modeling_utils.py (3)
  • PretrainedConfig (370-571)
  • QuantConfig (128-272)
  • quant_algo (551-552)
tensorrt_llm/quantization/mode.py (2)
  • QuantAlgo (23-46)
  • ActivationScheme (463-465)
tensorrt_llm/llmapi/llm_args.py (2)
  • quant_config (2145-2148)
  • quant_config (2151-2152)
cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.h (1)
  • group_size (63-63)
tensorrt_llm/_torch/models/modeling_utils.py (2)
tensorrt_llm/models/modeling_utils.py (2)
  • quant_algo (551-552)
  • QuantConfig (128-272)
tensorrt_llm/llmapi/llm_args.py (2)
  • quant_config (2145-2148)
  • quant_config (2151-2152)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/model_config.py

242-242: Unused static method argument: model_dir

(ARG004)


242-242: Unused static method argument: moe_backend

(ARG004)

🔇 Additional comments (5)
tensorrt_llm/_torch/models/modeling_utils.py (1)

28-28: Import looks correct.

ActivationScheme import aligns with the new QuantConfig fields.

tensorrt_llm/llmapi/llm_utils.py (4)

429-433: FP8 block-scales parsing LGTM.

Asserts (128,128) and sets group_size accordingly.


444-447: Support for 'fp8' and 'w4a8_awq' quant_method LGTM.

Sets group_size for AWQ.


452-459: KV cache and activation scheme mapping LGTM.

Uppercasing to enum names is consistent.


480-482: Log line LGTM.

Message is informative and low-noise.

Comment on lines +267 to +335
        json_exclude_quantization= json_quant_configs.get('exclude_quantization', None)
        if json_exclude_quantization:
            quant_config.exclude_quant_config = {
                "quant_algo": QuantAlgo(
                    json_exclude_quantization.get('quant_algo', None).upper()
                ) if json_exclude_quantization.get("quant_algo") else None,
                "kv_cache_quant_algo": QuantAlgo(
                    json_exclude_quantization.get("kv_cache_quant_algo").upper()
                ) if json_exclude_quantization.get("kv_cache_quant_algo") else None,
                "activation_scheme": ActivationScheme(
                    json_exclude_quantization.get('activation_scheme', None).upper()
                ) if json_exclude_quantization.get("activation_scheme") else None,
                "group_size": json_exclude_quantization.get('group_size', None),
            }
            if quant_config.exclude_quantization["quant_algo"] in [QuantAlgo.FP8_BLOCK_SCALES, QuantAlgo.W4A8_AWQ]:
                if quant_config.exclude_quantization["group_size"] is None:
                    quant_config.exclude_quantization["group_size"] = 128

        if quant_config.quant_algo in [QuantAlgo.FP8_BLOCK_SCALES, QuantAlgo.W4A8_AWQ]:
            if quant_config.group_size is None:
                quant_config.group_size = 128


⚠️ Potential issue

Typo breaks exclude overrides: use QuantConfig.exclude_quantization consistently and initialize it.

Lines 269 and 356 use exclude_quant_config but QuantConfig defines exclude_quantization. Also, subsequent code (Line 281) expects exclude_quantization. This prevents per-module overrides from being applied and can raise exceptions.

-        json_exclude_quantization= json_quant_configs.get('exclude_quantization', None)
+        json_exclude_quantization = json_quant_configs.get('exclude_quantization', None)
         if json_exclude_quantization:
-            quant_config.exclude_quant_config = {
+            quant_config.exclude_quantization = {
                 "quant_algo": QuantAlgo(
                     json_exclude_quantization.get('quant_algo', None).upper()
                 ) if json_exclude_quantization.get("quant_algo") else None,
                 "kv_cache_quant_algo": QuantAlgo(
                     json_exclude_quantization.get("kv_cache_quant_algo").upper()
                 ) if json_exclude_quantization.get("kv_cache_quant_algo") else None,
                 "activation_scheme": ActivationScheme(
                     json_exclude_quantization.get('activation_scheme', None).upper()
                 ) if json_exclude_quantization.get("activation_scheme") else None,
                 "group_size": json_exclude_quantization.get('group_size', None),
             }
-            if quant_config.exclude_quantization["quant_algo"] in [QuantAlgo.FP8_BLOCK_SCALES, QuantAlgo.W4A8_AWQ]:
+            if quant_config.exclude_quantization["quant_algo"] in {QuantAlgo.FP8_BLOCK_SCALES, QuantAlgo.W4A8_AWQ}:
                 if quant_config.exclude_quantization["group_size"] is None:
                     quant_config.exclude_quantization["group_size"] = 128
-
-        if quant_config.quant_algo in [QuantAlgo.FP8_BLOCK_SCALES, QuantAlgo.W4A8_AWQ]:
+        if quant_config.quant_algo in {QuantAlgo.FP8_BLOCK_SCALES, QuantAlgo.W4A8_AWQ}:
             if quant_config.group_size is None:
                 quant_config.group_size = 128
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/model_config.py around lines 267-287, replace the
misspelled/incorrect property exclude_quant_config with the correct
QuantConfig.exclude_quantization and ensure quant_config.exclude_quantization is
initialized (e.g., to an empty dict or appropriate default) before assigning its
keys; update any other occurrences (e.g., around line 356 mentioned in the
comment) to use exclude_quantization so per-module overrides are applied
consistently and avoid exceptions.

Comment on lines +285 to +337
        if quant_config.quant_algo in [QuantAlgo.FP8_BLOCK_SCALES, QuantAlgo.W4A8_AWQ]:
            if quant_config.group_size is None:
                quant_config.group_size = 128

        return quant_config, layer_quant_config


🛠️ Refactor suggestion

Replicate TRTLLM default excludes for FP8_BLOCK_SCALES (parity with modelopt path).

For Angelslim FP8 block scales, set default exclude_modules when moe_backend == 'TRTLLM', as done in load_modelopt_quant_config.

-        return quant_config, layer_quant_config
+        if (moe_backend == 'TRTLLM'
+                and quant_config.quant_algo == QuantAlgo.FP8_BLOCK_SCALES
+                and quant_config.exclude_modules is None):
+            quant_config.exclude_modules = ["*kv_b_proj*", "*k_b_proj*", "*eh_proj"]
+        return quant_config, layer_quant_config
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/model_config.py around lines 285 to 289, when
quant_config.quant_algo == QuantAlgo.FP8_BLOCK_SCALES and moe_backend ==
'TRTLLM' you must set the default quant_config.exclude_modules to the same
TRTLLM default excludes used by load_modelopt_quant_config; update this branch
to assign quant_config.exclude_modules (if None) to the common
TRTLLM_DEFAULT_EXCLUDES (or the exact list used in load_modelopt_quant_config)
so FP8_BLOCK_SCALES follows the same exclude defaults as the modelopt path.
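
As a quick illustration of that fallback, a sketch assuming plain-string algo names and moe_backend passed in as a string; the glob list is the one proposed in the diff above, not an invented one.

TRTLLM_DEFAULT_EXCLUDES = ["*kv_b_proj*", "*k_b_proj*", "*eh_proj"]

def apply_trtllm_default_excludes(quant_algo, moe_backend, exclude_modules):
    # Only fill in the defaults when nothing was configured explicitly.
    if (moe_backend == "TRTLLM" and quant_algo == "FP8_BLOCK_SCALES"
            and exclude_modules is None):
        return list(TRTLLM_DEFAULT_EXCLUDES)
    return exclude_modules

print(apply_trtllm_default_excludes("FP8_BLOCK_SCALES", "TRTLLM", None))
# -> ['*kv_b_proj*', '*k_b_proj*', '*eh_proj']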

Comment on lines +350 to +403
        if quant_config.exclude_modules:
            if hf_quant_config.get("ignored_layers"):
                quant_config.exclude_modules += hf_quant_config.get("ignored_layers")
        else:
            quant_config.exclude_modules = hf_quant_config.get("ignored_layers")

🛠️ Refactor suggestion

Accept both 'ignored_modules' and 'ignored_layers' from HF config.

Two places in the codebase use different keys. Normalize here to avoid silently missing excludes.

-        if quant_config.exclude_modules:
-            if hf_quant_config.get("ignored_layers"):
-                quant_config.exclude_modules += hf_quant_config.get("ignored_layers")
-        else:
-            quant_config.exclude_modules = hf_quant_config.get("ignored_layers")
+        ignored = hf_quant_config.get("ignored_modules") or hf_quant_config.get("ignored_layers")
+        if ignored:
+            if quant_config.exclude_modules:
+                quant_config.exclude_modules += ignored
+            else:
+                quant_config.exclude_modules = ignored
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/model_config.py around lines 350 to 355, the code only
reads hf_quant_config["ignored_layers"] which misses configs using
"ignored_modules"; update the logic to accept and normalize both keys by
checking for "ignored_modules" first then "ignored_layers" (or merge both if
present), treat missing values as empty lists, ensure
quant_config.exclude_modules is a list before extending/appending, merge the HF
excludes into quant_config.exclude_modules (avoiding None) and optionally
deduplicate the final list.
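
A standalone sketch of that normalization (a hypothetical helper, not a function in the PR): it treats the two keys as equivalent, coerces missing values to empty lists, merges them into the existing excludes, and deduplicates while preserving order.

def merge_hf_ignored(exclude_modules, hf_quant_config):
    # Accept either HF key; missing values become empty lists.
    ignored = (hf_quant_config.get("ignored_modules") or []) + \
              (hf_quant_config.get("ignored_layers") or [])
    if not ignored:
        return exclude_modules
    merged = list(exclude_modules or []) + ignored
    return list(dict.fromkeys(merged))  # de-duplicate, keep first-occurrence order

print(merge_hf_ignored(["lm_head"], {"ignored_layers": ["lm_head", "model.embed_tokens"]}))
# -> ['lm_head', 'model.embed_tokens']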

Comment on lines +356 to +433
        # set exclude_quant_config
        hf_ignored_quantization_config = hf_quant_config.get("ignored_quantization_config")
        if hf_ignored_quantization_config:
            quant_config.exclude_quant_config = {
                "kv_cache_quant_algo": QuantAlgo(
                    hf_ignored_quantization_config.get("kv_cache_quant_method").upper()
                ) if hf_ignored_quantization_config.get("kv_cache_quant_method") else None,
                "activation_scheme": ActivationScheme(
                    hf_ignored_quantization_config.get("activation_scheme").upper()
                ) if hf_ignored_quantization_config.get("activation_scheme") else None,
                "group_size": 128,
            }
            if hf_ignored_quantization_config.get(
                    "quant_method") == "fp8" and hf_ignored_quantization_config.get("weight_block_size", []):
                quant_config.exclude_quantization["quant_algo"] = QuantAlgo.FP8_BLOCK_SCALES
                block_size = hf_ignored_quantization_config.get("weight_block_size", [])
                assert tuple(block_size) == (
                    128,
                    128), "FP8_BLOCK_SCALES only supports block_size=(128,128)"
                quant_config.exclude_quantization["group_size"] = block_size[0]
            elif hf_ignored_quantization_config.get("quant_method") == "fp8":
                quant_config.exclude_quantization["quant_algo"] = QuantAlgo.FP8
            elif hf_ignored_quantization_config.get("quant_method") == "w4a8_awq":
                quant_config.exclude_quantization["quant_algo"] = QuantAlgo.W4A8_AWQ
                quant_config.exclude_quantization["group_size"] = hf_ignored_quantization_config.get(
                    "weight_group_size", 128)
            else:
                raise NotImplementedError(f"Unsupported quantization_config.ignored_quantization_config: "
                                          f"{hf_ignored_quantization_config}.")

⚠️ Potential issue

Fix exclude overrides for HF: wrong attribute name + missing FP8 block handling.

Same naming typo as above; also add FP8_BLOCK_SCALES handling when ignored_quantization_config carries weight_block_size.

-        hf_ignored_quantization_config = hf_quant_config.get("ignored_quantization_config")
-        if hf_ignored_quantization_config:
-            quant_config.exclude_quant_config = {
-                "kv_cache_quant_algo": QuantAlgo(
-                    hf_ignored_quantization_config.get("kv_cache_quant_method").upper()
-                ) if hf_ignored_quantization_config.get("kv_cache_quant_method") else None,
-                "activation_scheme": ActivationScheme(
-                    hf_ignored_quantization_config.get("activation_scheme").upper()
-                ) if hf_ignored_quantization_config.get("activation_scheme") else None,
-                "group_size": 128,
-            }
-            if hf_ignored_quantization_config.get(
-                    "quant_method") == "fp8" and hf_ignored_quantization_config.get("weight_block_size", []):
-                quant_config.exclude_quantization["quant_algo"] = QuantAlgo.FP8_BLOCK_SCALES
-                block_size = hf_ignored_quantization_config.get("weight_block_size", [])
-                assert tuple(block_size) == (
-                    128,
-                    128), "FP8_BLOCK_SCALES only supports block_size=(128,128)"
-                quant_config.exclude_quantization["group_size"] = block_size[0]
-            elif hf_ignored_quantization_config.get("quant_method") == "fp8":
-                quant_config.exclude_quantization["quant_algo"] = QuantAlgo.FP8
-            elif hf_ignored_quantization_config.get("quant_method") == "w4a8_awq":
-                quant_config.exclude_quantization["quant_algo"] = QuantAlgo.W4A8_AWQ
-                quant_config.exclude_quantization["group_size"] = hf_ignored_quantization_config.get(
-                    "weight_group_size", 128)
-            else:
-                raise NotImplementedError(f"Unsupported quantization_config.ignored_quantization_config: "
-                                          f"{hf_ignored_quantization_config}.")
+        hf_ignored_quant = hf_quant_config.get("ignored_quantization_config")
+        if hf_ignored_quant:
+            quant_config.exclude_quantization = {
+                "kv_cache_quant_algo": QuantAlgo(hf_ignored_quant["kv_cache_quant_method"].upper())
+                if hf_ignored_quant.get("kv_cache_quant_method") else None,
+                "activation_scheme": ActivationScheme(hf_ignored_quant["activation_scheme"].upper())
+                if hf_ignored_quant.get("activation_scheme") else None,
+                "group_size": 128,
+            }
+            if hf_ignored_quant.get("quant_method") == "fp8" and hf_ignored_quant.get("weight_block_size"):
+                block_size = hf_ignored_quant["weight_block_size"]
+                assert tuple(block_size) == (128, 128), "FP8_BLOCK_SCALES only supports block_size=(128,128)"
+                quant_config.exclude_quantization["quant_algo"] = QuantAlgo.FP8_BLOCK_SCALES
+                quant_config.exclude_quantization["group_size"] = block_size[0]
+            elif hf_ignored_quant.get("quant_method") == "fp8":
+                quant_config.exclude_quantization["quant_algo"] = QuantAlgo.FP8
+            elif hf_ignored_quant.get("quant_method") == "w4a8_awq":
+                quant_config.exclude_quantization["quant_algo"] = QuantAlgo.W4A8_AWQ
+                quant_config.exclude_quantization["group_size"] = hf_ignored_quant.get("weight_group_size", 128)
+            else:
+                raise NotImplementedError(f"Unsupported quantization_config.ignored_quantization_config: {hf_ignored_quant}.")
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/model_config.py around lines 356 to 385, the code sets
quant_config.exclude_quant_config but later references
quant_config.exclude_quantization (typo/inconsistent attribute name) and it also
doesn't correctly handle the FP8 block-size case; change the initial assignment
to use the correct attribute name exclude_quantization (not
exclude_quant_config) and build that dict with consistent keys
("kv_cache_quant_algo", "activation_scheme", "group_size"); then update the FP8
branch so that when hf_ignored_quantization_config["quant_method"] == "fp8" and
weight_block_size is provided you set quant_algo = QuantAlgo.FP8_BLOCK_SCALES,
assert the block_size equals (128,128) and set group_size = block_size[0],
otherwise set quant_algo = QuantAlgo.FP8 for the non-block case; keep the
w4a8_awq branch setting QuantAlgo.W4A8_AWQ and group_size from weight_group_size
as before, and raise NotImplementedError for unsupported configs.
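
For reference, a minimal sketch of the quant_method-to-(algo, group_size) mapping the fix above mirrors, with plain strings standing in for QuantAlgo members and the input dict shaped like an HF ignored_quantization_config.

def map_ignored_quant(cfg):
    method = cfg.get("quant_method")
    if method == "fp8" and cfg.get("weight_block_size"):
        block = tuple(cfg["weight_block_size"])
        assert block == (128, 128), "FP8_BLOCK_SCALES only supports block_size=(128,128)"
        return "FP8_BLOCK_SCALES", block[0]
    if method == "fp8":
        return "FP8", None
    if method == "w4a8_awq":
        return "W4A8_AWQ", cfg.get("weight_group_size", 128)
    raise NotImplementedError(f"Unsupported ignored_quantization_config: {cfg}")

print(map_ignored_quant({"quant_method": "fp8", "weight_block_size": [128, 128]}))
# -> ('FP8_BLOCK_SCALES', 128)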

Comment on lines +461 to +466
                if quant_config.exclude_modules:
                    if hf_quant_config.get("ignored_modules"):
                        quant_config.exclude_modules += hf_quant_config.get("ignored_modules")
                else:
                    quant_config.exclude_modules = hf_quant_config.get("ignored_modules")
                # set exclude_quant_config
🛠️ Refactor suggestion

Accept both 'ignored_modules' and 'ignored_layers'.

Align with the other loader to avoid missing excludes.

-                if quant_config.exclude_modules:
-                    if hf_quant_config.get("ignored_modules"):
-                        quant_config.exclude_modules += hf_quant_config.get("ignored_modules")
-                else:
-                    quant_config.exclude_modules = hf_quant_config.get("ignored_modules")
+                ignored = hf_quant_config.get("ignored_modules") or hf_quant_config.get("ignored_layers")
+                if ignored:
+                    if quant_config.exclude_modules:
+                        quant_config.exclude_modules += ignored
+                    else:
+                        quant_config.exclude_modules = ignored
🤖 Prompt for AI Agents
In tensorrt_llm/llmapi/llm_utils.py around lines 461 to 466, update the logic
that reads hf_quant_config ignored entries so it accepts both "ignored_modules"
and "ignored_layers" (treating them equivalently), and correctly accumulates
them into quant_config.exclude_modules: read both keys (preferring one if
needed), coerce the result to a list if None, and then either extend the
existing quant_config.exclude_modules or assign a new list (avoiding None
concatenation). Ensure you merge lists rather than overwrite unexpectedly so the
behavior matches the other loader.

Comment on lines +467 to +479
                hf_ignored_quantization_config = hf_quant_config.get("ignored_quantization_config")
                if hf_ignored_quantization_config:
                    quant_config.exclude_quant_config = {
                        "quant_algo": QuantAlgo(
                            hf_ignored_quantization_config.get("quant_method").upper()
                        ) if hf_ignored_quantization_config.get("quant_method") else None,
                        "kv_cache_quant_algo": QuantAlgo(
                            hf_ignored_quantization_config.get("kv_cache_quant_method").upper()
                        ) if hf_ignored_quantization_config.get("kv_cache_quant_method") else None,
                        "activation_scheme": ActivationScheme(
                            hf_ignored_quantization_config.get("activation_scheme").upper()
                        ) if hf_ignored_quantization_config.get("activation_scheme") else None,
                    }
⚠️ Potential issue

Fix exclude overrides: wrong attribute, add FP8 block handling, and group_size.

Use exclude_quantization (not exclude_quant_config) and mirror FP8_BLOCK_SCALES logic.

-                hf_ignored_quantization_config = hf_quant_config.get("ignored_quantization_config")
-                if hf_ignored_quantization_config:
-                    quant_config.exclude_quant_config = {
-                        "quant_algo": QuantAlgo(
-                            hf_ignored_quantization_config.get("quant_method").upper()
-                        ) if hf_ignored_quantization_config.get("quant_method") else None,
-                        "kv_cache_quant_algo": QuantAlgo(
-                            hf_ignored_quantization_config.get("kv_cache_quant_method").upper()
-                        ) if hf_ignored_quantization_config.get("kv_cache_quant_method") else None,
-                        "activation_scheme": ActivationScheme(
-                            hf_ignored_quantization_config.get("activation_scheme").upper()
-                        ) if hf_ignored_quantization_config.get("activation_scheme") else None,
-                    }
+                ignored_q = hf_quant_config.get("ignored_quantization_config")
+                if ignored_q:
+                    quant_config.exclude_quantization = {
+                        "kv_cache_quant_algo": QuantAlgo(ignored_q["kv_cache_quant_method"].upper())
+                        if ignored_q.get("kv_cache_quant_method") else None,
+                        "activation_scheme": ActivationScheme(ignored_q["activation_scheme"].upper())
+                        if ignored_q.get("activation_scheme") else None,
+                        "group_size": 128,
+                    }
+                    if ignored_q.get("quant_method") == "fp8" and ignored_q.get("weight_block_size"):
+                        block_size = ignored_q["weight_block_size"]
+                        assert tuple(block_size) == (128, 128), "FP8_BLOCK_SCALES only supports block_size=(128,128)"
+                        quant_config.exclude_quantization["quant_algo"] = QuantAlgo.FP8_BLOCK_SCALES
+                        quant_config.exclude_quantization["group_size"] = block_size[0]
+                    elif ignored_q.get("quant_method") == "fp8":
+                        quant_config.exclude_quantization["quant_algo"] = QuantAlgo.FP8
+                    elif ignored_q.get("quant_method") == "w4a8_awq":
+                        quant_config.exclude_quantization["quant_algo"] = QuantAlgo.W4A8_AWQ
+                        quant_config.exclude_quantization["group_size"] = ignored_q.get("weight_group_size", 128)
+                    else:
+                        raise NotImplementedError(f"Unsupported quantization_config.ignored_quantization_config: {ignored_q}.")
🤖 Prompt for AI Agents
In tensorrt_llm/llmapi/llm_utils.py around lines 467 to 479, the code mistakenly
sets quant_config.exclude_quant_config instead of
quant_config.exclude_quantization and omits FP8 block handling and group_size;
change the assignment to quant_config.exclude_quantization, map the keys to the
correct QuantAlgo/ActivationScheme enums as before, and add handling for FP8
block scales by mirroring the existing FP8_BLOCK_SCALES logic (include
fp8_block_scales or equivalent field and group_size parsing from
hf_ignored_quantization_config) so excluded FP8 settings and group_size are
applied consistently with the other FP8 handling code.

@bppan force-pushed the support_angelslim_w4 branch from 1dbc226 to 62774f9 on September 10, 2025 at 10:25