
Conversation

@elvischenv elvischenv commented Aug 26, 2025

Purpose

Support Silu+Mul + NVFP4 quant fusion (follow-up to #22448).
Add the following compilation config to enable the fusion:

--compilation-config {"custom_ops":["+silu_and_mul"],"pass_config":{"enable_fusion":true,"enable_noop":true}}
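
For example, serving the NVFP4 checkpoint benchmarked below with the fusion enabled would look roughly like this (a sketch; only the --compilation-config value comes from this PR, the other flags are illustrative):

vllm serve nvidia/Llama-3.3-70B-Instruct-FP4 \
    --kv-cache-dtype fp8 \
    --max-model-len 2048 \
    --compilation-config '{"custom_ops":["+silu_and_mul"],"pass_config":{"enable_fusion":true,"enable_noop":true}}'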

Test Plan && Test Result

Kernel functional:
tests/kernels/quantization/test_silu_nvfp4_quant_fusion.py

====== 8 passed in 2.76s =====

Fusion unit test:
tests/compile/test_silu_mul_quant_fusion.py

====== 3 passed, 1 skipped, 5 warnings in 4.38s =====
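
Both suites can be reproduced with plain pytest (assumed invocation; the kernel test needs a GPU with NVFP4 support):

pytest tests/kernels/quantization/test_silu_nvfp4_quant_fusion.py
pytest tests/compile/test_silu_mul_quant_fusion.py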

lm_eval && benchmarking:
main:

============ Serving Benchmark Result ============
Successful requests:                     640
Maximum request concurrency:             128
Benchmark duration (s):                  174.31
Total input tokens:                      654052
Total generated tokens:                  655360
Request throughput (req/s):              3.67
Output token throughput (tok/s):         3759.83
Total Token throughput (tok/s):          7512.16
---------------Time to First Token----------------
Mean TTFT (ms):                          1458.45
Median TTFT (ms):                        1067.94
P99 TTFT (ms):                           5477.33
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          32.61
Median TPOT (ms):                        32.82
P99 TPOT (ms):                           34.01
---------------Inter-token Latency----------------
Mean ITL (ms):                           32.61
Median ITL (ms):                         29.03
P99 ITL (ms):                            331.72
==================================================
vllm ({'pretrained': 'nvidia/Llama-3.3-70B-Instruct-FP4', 'kv_cache_dtype': 'fp8', 'tensor_parallel_size': 1, 'max_model_len': 2048, 'trust_remote_code': True}), gen_kwargs: (temperature=0.0), limit: 500.0, num_fewshot: 5, batch_size: 200
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.934|±  |0.0111|
|     |       |strict-match    |     5|exact_match|↑  |0.858|±  |0.0156|

PR:

fusion disabled:
============ Serving Benchmark Result ============
Successful requests:                     640
Maximum request concurrency:             128
Benchmark duration (s):                  172.20
Total input tokens:                      654052
Total generated tokens:                  655360
Request throughput (req/s):              3.72
Output token throughput (tok/s):         3805.72
Total Token throughput (tok/s):          7603.85
---------------Time to First Token----------------
Mean TTFT (ms):                          1452.02
Median TTFT (ms):                        1063.62
P99 TTFT (ms):                           5508.88
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          32.21
Median TPOT (ms):                        32.42
P99 TPOT (ms):                           33.67
---------------Inter-token Latency----------------
Mean ITL (ms):                           32.21
Median ITL (ms):                         28.71
P99 ITL (ms):                            325.90
==================================================
vllm ({'pretrained': 'nvidia/Llama-3.3-70B-Instruct-FP4', 'kv_cache_dtype': 'fp8', 'tensor_parallel_size': 1, 'max_model_len': 2048, 'trust_remote_code': True}), gen_kwargs: (temperature=0.0), limit: 500.0, num_fewshot: 5, batch_size: 200
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.932|±  |0.0113|
|     |       |strict-match    |     5|exact_match|↑  |0.842|±  |0.0163|

fusion enabled:
============ Serving Benchmark Result ============
Successful requests:                     640
Maximum request concurrency:             128
Benchmark duration (s):                  170.02
Total input tokens:                      654052
Total generated tokens:                  655360
Request throughput (req/s):              3.76
Output token throughput (tok/s):         3854.49
Total Token throughput (tok/s):          7701.30
---------------Time to First Token----------------
Mean TTFT (ms):                          1430.17
Median TTFT (ms):                        1036.29
P99 TTFT (ms):                           5577.83
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          31.80
Median TPOT (ms):                        32.01
P99 TPOT (ms):                           33.28
---------------Inter-token Latency----------------
Mean ITL (ms):                           31.80
Median ITL (ms):                         28.41
P99 ITL (ms):                            337.28
==================================================
vllm ({'pretrained': 'nvidia/Llama-3.3-70B-Instruct-FP4', 'kv_cache_dtype': 'fp8', 'tensor_parallel_size': 1, 'compilation_config': {'custom_ops': ['+silu_and_mul'], 'pass_config': {'enable_fusion': True, 'enable_noop': True}}, 'max_model_len': 2048, 'trust_remote_code': True}), gen_kwargs: (temperature=0.0), limit: 500.0, num_fewshot: 5, batch_size: 200
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.926|±  |0.0117|
|     |       |strict-match    |     5|exact_match|↑  |0.852|±  |0.0159|
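
The accuracy rows above come from lm_eval with the vllm backend; the exact command isn't shown in the logs, but reconstructing it from the printed config gives roughly (assumed invocation):

lm_eval --model vllm \
    --model_args pretrained=nvidia/Llama-3.3-70B-Instruct-FP4,kv_cache_dtype=fp8,tensor_parallel_size=1,max_model_len=2048,trust_remote_code=True \
    --tasks gsm8k --num_fewshot 5 --limit 500 --batch_size 200 --gen_kwargs temperature=0.0

The fusion-enabled run additionally passes the compilation_config shown in its header.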


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for fusing SiLU+Mul with NVFP4 quantization, which is a valuable performance optimization for models running on NVIDIA GPUs with FP4 support. The changes are well-structured, including a new CUDA kernel, updates to the compilation passes for fusion, and comprehensive tests. The refactoring of the fusion pass and tests to accommodate the new pattern is clean. My review found one potential issue with pointer casting in the CUDA kernel wrapper that could lead to undefined behavior, and I've provided a suggestion to fix it. Overall, this is a solid contribution.

Comment on lines +336 to +368

Severity: high

The pointer casts for output_ptr and sf_out are unsafe and overly complex, and the subsequent reinterpret_cast in the kernel launch can be avoided.

  1. static_cast<int64_t*>(output.data_ptr()) is unsafe. The output tensor is of type torch.uint8, so its data buffer is not guaranteed to have the 8-byte alignment required for int64_t*. This can lead to undefined behavior.
  2. The kernel expects uint32_t* for both out and SFout. It's cleaner to cast directly to this type using reinterpret_cast.

By casting directly to uint32_t* when defining output_ptr and sf_out, you can simplify the kernel launch call by removing the reinterpret_cast there.

void silu_and_mul_nvfp4_quant(torch::Tensor& output,  // [..., d]
                              torch::Tensor& output_sf,
                              torch::Tensor& input,  // [..., 2 * d]
                              torch::Tensor& input_sf) {
  TORCH_CHECK(input.dtype() == torch::kFloat16 ||
              input.dtype() == torch::kBFloat16);
  int32_t m = input.size(0);
  int32_t n = input.size(1) / 2;
  TORCH_CHECK(n % 16 == 0, "The N dimension must be multiple of 16.");
  int multiProcessorCount =
      get_device_attribute(cudaDevAttrMultiProcessorCount, -1);
  // Cast directly to the pointer types the kernel expects; this avoids the
  // unaligned int64_t* cast and the extra reinterpret_cast at the launch site.
  auto input_sf_ptr = static_cast<float const*>(input_sf.data_ptr());
  auto sf_out = reinterpret_cast<uint32_t*>(output_sf.data_ptr());
  auto output_ptr = reinterpret_cast<uint32_t*>(output.data_ptr());
  const at::cuda::OptionalCUDAGuard device_guard(device_of(input));
  auto stream = at::cuda::getCurrentCUDAStream(input.get_device());
  // Each thread handles ELTS_PER_THREAD elements; cap at 1024 threads per block.
  dim3 block(std::min(int(n / ELTS_PER_THREAD), 1024));
  int const numBlocksPerSM = 2048 / block.x;
  dim3 grid(std::min(int(m), multiProcessorCount * numBlocksPerSM));
  VLLM_DISPATCH_HALF_TYPES(
      input.scalar_type(), "act_and_mul_quant_kernel", [&] {
        auto input_ptr = reinterpret_cast<scalar_t const*>(input.data_ptr());
        VLLM_DISPATCH_BYTE_TYPES(
            output.scalar_type(), "fused_act_and_mul_quant_kernel_nvfp4_type",
            [&] {
              vllm::silu_and_cvt_fp16_to_fp4<scalar_t>
                  <<<grid, block, 0, stream>>>(
                      m, n, input_ptr, input_sf_ptr,
                      output_ptr,
                      sf_out);
            });
      });
}

@ProExpertProg ProExpertProg left a comment

Looks nice and clean, thanks for the refactoring! A few final comments; please also create an issue for the kernel comments as a follow-up.

Could you add a FUSED_OPS array here as well?

Comment on lines 59 to 60

Could this reference FUSED_OPS and QUANT_OPS instead?

Also this could use ops_in_model_before (see other tests for how that's checked).

@elvischenv elvischenv force-pushed the elvischenv/silu-nvfp4-quant-fusion branch from ed4f126 to d7831c6 on August 27, 2025 03:55

elvischenv commented Aug 27, 2025

Update: fixed by wrapping the import with yapf: disable / yapf: enable (see the sketch after the hook output below).

There is a conflict between yapf and isort:

yapf................................................................................................Failed
- hook id: yapf
- files were modified by this hook

Reformatting tests/compile/test_silu_mul_quant_fusion.py

yapf modified the code to

from vllm.compilation.activation_quant_fusion import (
    ActivationQuantFusionPass, FUSED_OPS, SILU_MUL_OP)
isort...............................................................................................Failed
- hook id: isort
- files were modified by this hook

isort modified the code to

from vllm.compilation.activation_quant_fusion import (
    FUSED_OPS, SILU_MUL_OP, ActivationQuantFusionPass)
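
The workaround is to take this import out of yapf's hands so isort's ordering stands, roughly like this (a sketch of the yapf: disable / yapf: enable fix):

# yapf: disable
from vllm.compilation.activation_quant_fusion import (
    FUSED_OPS, SILU_MUL_OP, ActivationQuantFusionPass)
# yapf: enable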

@elvischenv elvischenv force-pushed the elvischenv/silu-nvfp4-quant-fusion branch from d7831c6 to 3968b5b on August 27, 2025 06:56
@ProExpertProg ProExpertProg enabled auto-merge (squash) August 27, 2025 13:13
@github-actions github-actions bot added the ready label Aug 27, 2025

elvischenv commented Aug 27, 2025

Looks like it failed to create a tensor on the L4:
https://buildkite.com/vllm/ci/builds/28558/steps/canvas?jid=0198ebac-8e86-4f3c-8697-dcf5829076a9

[2025-08-27T13:21:10Z] /usr/local/lib/python3.12/dist-packages/vllm/compilation/pass_manager.py:66: in configure
[2025-08-27T13:21:10Z]     self.passes += [ActivationQuantFusionPass(config)]
[2025-08-27T13:21:10Z] /usr/local/lib/python3.12/dist-packages/vllm/compilation/activation_quant_fusion.py:170: in __init__
[2025-08-27T13:21:10Z]     pattern_silu_mul_fp8.register(self.patterns)
[2025-08-27T13:21:10Z] /usr/local/lib/python3.12/dist-packages/vllm/compilation/activation_quant_fusion.py:100: in register
[2025-08-27T13:21:10Z]     self.empty_quant(5, 4),  # result
[2025-08-27T13:21:10Z] _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
[2025-08-27T13:21:10Z]
[2025-08-27T13:21:10Z] self = <vllm.compilation.activation_quant_fusion.SiluMulFp8StaticQuantPattern object at 0x7f5cec68be00>
[2025-08-27T13:21:10Z] args = (5, 4), kwargs = {'device': 'cuda', 'dtype': torch.float8_e4m3fn}
[2025-08-27T13:21:10Z]
[2025-08-27T13:21:10Z]     def empty_quant(self, *args, **kwargs):
[2025-08-27T13:21:10Z]         kwargs = {'dtype': self.quant_dtype, 'device': "cuda", **kwargs}
[2025-08-27T13:21:10Z] >       return torch.empty(*args, **kwargs)
[2025-08-27T13:21:10Z] E       RuntimeError: CUDA error: no kernel image is available for execution on the device

I got an L4 locally, tried creating tensors, and it worked. Is the failure related to the driver or something else with the L4 in CI? cc @ProExpertProg @mgoin

$ nvidia-smi
Wed Aug 27 14:59:25 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.65.06              Driver Version: 580.65.06      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      On  |   00000000:C1:00.0 Off |                    0 |
| N/A   42C    P8             16W /   72W |       0MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
$ pip list | grep torch
torch                             2.7.1+cu128
torchaudio                        2.7.1+cu128
torchvision                       0.22.1+cu128
$ python -c "import torch; a = torch.ones((5,4), device='cuda', dtype=torch.float8_e4m3fn); print(a)"
tensor([[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]], device='cuda:0', dtype=torch.float8_e4m3fn)
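
A quick way to check whether the installed torch wheel actually ships kernels for the L4 (sm_89) is to compare the device capability against the wheel's arch list (a diagnostic suggestion, not from the CI logs):

python -c "import torch; print(torch.cuda.get_device_capability(0)); print(torch.cuda.get_arch_list())"

If sm_89 (or a compatible PTX entry) is missing from get_arch_list(), CUDA ops on that device fail with exactly this "no kernel image is available" error.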

stickingjh and others added 5 commits August 28, 2025 00:55
Signed-off-by: jindih <[email protected]>

fix review comment

Signed-off-by: jindih <[email protected]>

revise silu+nvfp4q pattern matching part

Signed-off-by: jindih <[email protected]>
Signed-off-by: elvischenv <[email protected]>
Signed-off-by: elvischenv <[email protected]>
Signed-off-by: elvischenv <[email protected]>
Signed-off-by: elvischenv <[email protected]>
auto-merge was automatically disabled August 28, 2025 07:55

Head branch was pushed to by a user without write access

@elvischenv elvischenv force-pushed the elvischenv/silu-nvfp4-quant-fusion branch from 3968b5b to 8b479b2 on August 28, 2025 07:55

mergify bot commented Aug 28, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @elvischenv.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Aug 28, 2025
@mergify mergify bot removed the needs-rebase label Aug 28, 2025
@ProExpertProg ProExpertProg enabled auto-merge (squash) August 28, 2025 17:53
@ProExpertProg ProExpertProg merged commit 16a45b3 into vllm-project:main Aug 28, 2025
72 checks passed
@github-project-automation github-project-automation bot moved this from To triage to Done in torch.compile integration Aug 28, 2025
zou3519 added a commit to zou3519/vllm that referenced this pull request Aug 29, 2025
@elvischenv elvischenv deleted the elvischenv/silu-nvfp4-quant-fusion branch September 3, 2025 01:51
eicherseiji pushed a commit to eicherseiji/vllm that referenced this pull request Sep 9, 2025
Signed-off-by: jindih <[email protected]>
Signed-off-by: elvischenv <[email protected]>
Co-authored-by: jindih <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Co-authored-by: Luka Govedic <[email protected]>
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: jindih <[email protected]>
Signed-off-by: elvischenv <[email protected]>
Co-authored-by: jindih <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Co-authored-by: Luka Govedic <[email protected]>

Labels

ci/build, ready, torch.compile

Projects

Status: Done

4 participants