Your current environment
The output of python collect_env.py
Collecting environment information...
==============================
System Info
==============================
OS : Ubuntu 20.04.6 LTS (x86_64)
GCC version : (Ubuntu 11.4.0-2ubuntu1~20.04) 11.4.0
Clang version : 10.0.0-4ubuntu1
CMake version : version 4.0.3
Libc version : glibc-2.31
==============================
PyTorch Info
==============================
PyTorch version : 2.7.1+cu128
Is debug build : False
CUDA used to build PyTorch : 12.8
ROCM used to build PyTorch : N/A
==============================
Python Environment
==============================
Python version : 3.12.9 (main, Feb 12 2025, 14:50:50) [Clang 19.1.6 ] (64-bit runtime)
Python platform : Linux-5.13.0-30-generic-x86_64-with-glibc2.31
==============================
CUDA / GPU Info
==============================
Is CUDA available : True
CUDA runtime version : 12.1.66
CUDA_MODULE_LOADING set to : LAZY
GPU models and configuration :
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090
GPU 2: NVIDIA GeForce RTX 3090
GPU 3: NVIDIA GeForce RTX 3090
GPU 4: NVIDIA GeForce RTX 3090
GPU 5: NVIDIA GeForce RTX 3090
GPU 6: NVIDIA GeForce RTX 3090
GPU 7: NVIDIA GeForce RTX 3090
Nvidia driver version : 530.30.02
cuDNN version : Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.6.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.6.0
HIP runtime version : N/A
MIOpen runtime version : N/A
Is XNNPACK available : True
==============================
CPU Info
==============================
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 56
On-line CPU(s) list: 0-55
Thread(s) per core: 2
Core(s) per socket: 14
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
Stepping: 1
CPU MHz: 1218.644
CPU max MHz: 3300.0000
CPU min MHz: 1200.0000
BogoMIPS: 4799.70
Virtualization: VT-x
L1d cache: 896 KiB
L1i cache: 896 KiB
L2 cache: 7 MiB
L3 cache: 70 MiB
NUMA node0 CPU(s): 0-13,28-41
NUMA node1 CPU(s): 14-27,42-55
Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled
Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; Clear CPU buffers; SMT vulnerable
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts md_clear flush_l1d
==============================
Versions of relevant libraries
==============================
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.8.3.14
[pip3] nvidia-cuda-cupti-cu12==12.8.57
[pip3] nvidia-cuda-nvrtc-cu12==12.8.61
[pip3] nvidia-cuda-runtime-cu12==12.8.57
[pip3] nvidia-cudnn-cu12==9.7.1.26
[pip3] nvidia-cudnn-frontend==1.13.0
[pip3] nvidia-cufft-cu12==11.3.3.41
[pip3] nvidia-cufile-cu12==1.13.0.11
[pip3] nvidia-curand-cu12==10.3.9.55
[pip3] nvidia-cusolver-cu12==11.7.2.55
[pip3] nvidia-cusparse-cu12==12.5.7.53
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-ml-py==12.575.51
[pip3] nvidia-nccl-cu12==2.26.2
[pip3] nvidia-nvjitlink-cu12==12.8.61
[pip3] nvidia-nvshmem-cu12==3.3.20
[pip3] nvidia-nvtx-cu12==12.8.55
[pip3] onnx==1.18.0
[pip3] onnx-ir==0.1.4
[pip3] onnxruntime-gpu==1.22.0
[pip3] onnxscript==0.3.2
[pip3] open_clip_torch==3.0.0
[pip3] pynvml==12.0.0
[pip3] pyzmq==27.0.1
[pip3] sentence-transformers==5.0.0
[pip3] torch==2.7.1+cu128
[pip3] torchao==0.12.0+cu128
[pip3] torchaudio==2.7.1+cu128
[pip3] torchdata==0.11.0
[pip3] torchtitan==0.1.0
[pip3] torchvision==0.22.1+cu128
[pip3] transformers==4.54.1
[pip3] triton==3.3.1
[conda] mkl 2024.2.2 ha957f24_16 conda-forge
[conda] mkl-devel 2024.2.2 ha770c72_16 conda-forge
[conda] mkl-include 2024.2.2 ha957f24_16 conda-forge
[conda] mkl-service 2.4.2 py310h22455d7_0 conda-forge
[conda] mkl_fft 1.3.11 py310h5bcb89a_0 conda-forge
[conda] mkl_random 1.2.8 py310hcacb51e_1 conda-forge
[conda] numpy 2.1.3 py310heeff2f4_0
[conda] numpy-base 2.1.3 py310h8a23956_0
[conda] nvidia-ml-py 12.535.108 pypi_0 pypi
[conda] transformers 4.53.1 pypi_0 pypi
==============================
vLLM Info
==============================
ROCM Version : Could not collect
Neuron SDK Version : N/A
vLLM Version : 0.10.1.dev371+g74333ae2f (git sha: 74333ae2f)
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity NUMA Affinity
GPU0 X PIX PHB PHB SYS SYS SYS SYS 0-13,28-41 0
GPU1 PIX X PHB PHB SYS SYS SYS SYS 0-13,28-41 0
GPU2 PHB PHB X PIX SYS SYS SYS SYS 0-13,28-41 0
GPU3 PHB PHB PIX X SYS SYS SYS SYS 0-13,28-41 0
GPU4 SYS SYS SYS SYS X PIX PHB PHB 14-27,42-55 1
GPU5 SYS SYS SYS SYS PIX X PHB PHB 14-27,42-55 1
GPU6 SYS SYS SYS SYS PHB PHB X PIX 14-27,42-55 1
GPU7 SYS SYS SYS SYS PHB PHB PIX X 14-27,42-55 1
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
==============================
Environment Variables
==============================
LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64:/usr/local/cuda-12.1/lib64
CUDA_HOME=/usr/local/cuda
CUDA_HOME=/usr/local/cuda
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY
🐛 Describe the bug
Launching Qwen3-30B-A3B-Thinking-2507-FP8 with tensor_parallel_size=2 and the sequence-parallelism + async TP compile passes enabled (pass_config: enable_sequence_parallelism=true, enable_async_tp=true) crashes during the profile run on RTX 3090 GPUs: while registering its replacement patterns, the AsyncTPPass traces torch.ops.aten._scaled_mm, which requires compute capability >= 9.0 or 8.9 (or ROCm MI300+), but the RTX 3090 is only sm86. Full startup log below.
INFO 08-05 19:37:30 [__init__.py:241] Automatically detected platform cuda.
WARNING 08-05 19:37:32 [config.py:535] The global random seed is set to 0. Since VLLM_ENABLE_V1_MULTIPROCESSING is set to False, this may affect the random state of the Python process that launched vLLM.
INFO 08-05 19:37:42 [config.py:726] Resolved architecture: Qwen3MoeForCausalLM
INFO 08-05 19:37:42 [config.py:1765] Using max model len 10000
INFO 08-05 19:37:42 [arg_utils.py:1188] Using mp-based distributed executor backend for async scheduling.
INFO 08-05 19:37:42 [config.py:2594] Chunked prefill is enabled with max_num_batched_tokens=2048.
WARNING 08-05 19:37:42 [config.py:4914] Batch sizes [1] are removed because they are not multiple of tp_size 2 when sequence parallelism is enabled
INFO 08-05 19:37:51 [__init__.py:241] Automatically detected platform cuda.
(EngineCore_0 pid=776130) INFO 08-05 19:37:53 [core.py:619] Waiting for init message from front-end.
(EngineCore_0 pid=776130) INFO 08-05 19:37:53 [core.py:71] Initializing a V1 LLM engine (v0.10.1.dev371+g74333ae2f) with config: model='/data/pretrained_models/Qwen3-30B-A3B-Thinking-2507-FP8', speculative_config=None, tokenizer='/data/pretrained_models/Qwen3-30B-A3B-Thinking-2507-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=10000, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/data/pretrained_models/Qwen3-30B-A3B-Thinking-2507-FP8, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":["+quant_fp8","+rms_norm","+rms_norm"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{"enable_sequence_parallelism":true,"enable_async_tp":true},"max_capture_size":8,"local_cache_dir":null}
(EngineCore_0 pid=776130) INFO 08-05 19:37:53 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1], buffer_handle=(2, 16777216, 10, 'psm_a66dd76c'), local_subscribe_addr='ipc:///tmp/b4626ebf-18a3-45a1-a010-b4be1e6b00f5', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 08-05 19:38:02 [__init__.py:241] Automatically detected platform cuda.
INFO 08-05 19:38:02 [__init__.py:241] Automatically detected platform cuda.
(VllmWorker TP1 pid=776425) INFO 08-05 19:38:04 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_29e0f920'), local_subscribe_addr='ipc:///tmp/28a910d6-00d7-4a00-908f-91027b657d1c', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker TP0 pid=776424) INFO 08-05 19:38:04 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_fdaefb27'), local_subscribe_addr='ipc:///tmp/a736f931-20cf-4d7b-a383-eef79be4399e', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker TP1 pid=776425) INFO 08-05 19:38:06 [__init__.py:1381] Found nccl from library libnccl.so.2
(VllmWorker TP1 pid=776425) INFO 08-05 19:38:06 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker TP0 pid=776424) INFO 08-05 19:38:06 [__init__.py:1381] Found nccl from library libnccl.so.2
(VllmWorker TP0 pid=776424) INFO 08-05 19:38:06 [pynccl.py:70] vLLM is using nccl==2.26.2
(VllmWorker TP1 pid=776425) INFO 08-05 19:38:06 [custom_all_reduce.py:35] Skipping P2P check and trusting the driver's P2P report.
(VllmWorker TP0 pid=776424) INFO 08-05 19:38:06 [custom_all_reduce.py:35] Skipping P2P check and trusting the driver's P2P report.
(VllmWorker TP1 pid=776425) WARNING 08-05 19:38:06 [custom_all_reduce.py:147] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker TP0 pid=776424) WARNING 08-05 19:38:06 [custom_all_reduce.py:147] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorker TP0 pid=776424) INFO 08-05 19:38:06 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_8a3451e2'), local_subscribe_addr='ipc:///tmp/a6352421-23b6-40be-b804-f672baae9c36', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker TP1 pid=776425) INFO 08-05 19:38:06 [parallel_state.py:1124] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
(VllmWorker TP0 pid=776424) INFO 08-05 19:38:06 [parallel_state.py:1124] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(VllmWorker TP1 pid=776425) INFO 08-05 19:38:06 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
(VllmWorker TP0 pid=776424) INFO 08-05 19:38:06 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
(VllmWorker TP1 pid=776425) INFO 08-05 19:38:06 [gpu_model_runner.py:1908] Starting to load model /data/pretrained_models/Qwen3-30B-A3B-Thinking-2507-FP8...
(VllmWorker TP0 pid=776424) INFO 08-05 19:38:06 [gpu_model_runner.py:1908] Starting to load model /data/pretrained_models/Qwen3-30B-A3B-Thinking-2507-FP8...
(VllmWorker TP1 pid=776425) INFO 08-05 19:38:06 [gpu_model_runner.py:1940] Loading model from scratch...
(VllmWorker TP0 pid=776424) INFO 08-05 19:38:06 [gpu_model_runner.py:1940] Loading model from scratch...
(VllmWorker TP1 pid=776425) INFO 08-05 19:38:07 [cuda.py:276] Using FlashInfer backend on V1 engine.
(VllmWorker TP1 pid=776425) WARNING 08-05 19:38:07 [fp8.py:533] CutlassBlockScaledGroupedGemm not supported on the current platform.
(VllmWorker TP0 pid=776424) INFO 08-05 19:38:07 [cuda.py:276] Using FlashInfer backend on V1 engine.
(VllmWorker TP0 pid=776424) WARNING 08-05 19:38:07 [fp8.py:533] CutlassBlockScaledGroupedGemm not supported on the current platform.
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:00, 3.65it/s]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:04<00:04, 2.35s/it]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:07<00:02, 2.99s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:11<00:00, 3.36s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:11<00:00, 2.94s/it]
(VllmWorker TP0 pid=776424)
(VllmWorker TP0 pid=776424) INFO 08-05 19:38:19 [default_loader.py:262] Loading weights took 11.96 seconds
(VllmWorker TP0 pid=776424) WARNING 08-05 19:38:19 [marlin_utils_fp8.py:82] Your GPU does not have native support for FP8 computation but FP8 quantization is being used. Weight-only FP8 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
(VllmWorker TP1 pid=776425) INFO 08-05 19:38:20 [default_loader.py:262] Loading weights took 12.65 seconds
(VllmWorker TP1 pid=776425) WARNING 08-05 19:38:20 [marlin_utils_fp8.py:82] Your GPU does not have native support for FP8 computation but FP8 quantization is being used. Weight-only FP8 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
(VllmWorker TP0 pid=776424) INFO 08-05 19:38:22 [gpu_model_runner.py:1957] Model loading took 18.2612 GiB and 14.535958 seconds
(VllmWorker TP0 pid=776424) INFO 08-05 19:38:22 [gpu_model_runner.py:1964] EPLB is enabled for model /data/pretrained_models/Qwen3-30B-A3B-Thinking-2507-FP8.
(VllmWorker TP1 pid=776425) INFO 08-05 19:38:22 [gpu_model_runner.py:1957] Model loading took 18.2612 GiB and 15.234463 seconds
(VllmWorker TP1 pid=776425) INFO 08-05 19:38:22 [gpu_model_runner.py:1964] EPLB is enabled for model /data/pretrained_models/Qwen3-30B-A3B-Thinking-2507-FP8.
(VllmWorker TP1 pid=776425) INFO 08-05 19:38:40 [backends.py:530] Using cache directory: /home/mosh/.cache/vllm/torch_compile_cache/d349c6af44/rank_1_0/backbone for vLLM's torch.compile
(VllmWorker TP1 pid=776425) INFO 08-05 19:38:40 [backends.py:541] Dynamo bytecode transform time: 17.00 s
(VllmWorker TP0 pid=776424) INFO 08-05 19:38:40 [backends.py:530] Using cache directory: /home/mosh/.cache/vllm/torch_compile_cache/d349c6af44/rank_0_0/backbone for vLLM's torch.compile
(VllmWorker TP0 pid=776424) INFO 08-05 19:38:40 [backends.py:541] Dynamo bytecode transform time: 17.03 s
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] WorkerProc hit an exception.
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] Traceback (most recent call last):
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 591, in worker_busy_loop
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] output = func(*args, **kwargs)
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] return func(*args, **kwargs)
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 243, in determine_available_memory
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] self.model_runner.profile_run()
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2491, in profile_run
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] = self._dummy_run(self.max_num_tokens, is_profile=True)
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] return func(*args, **kwargs)
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2267, in _dummy_run
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] outputs = self.model(
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] ^^^^^^^^^^^
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] return self._call_impl(*args, **kwargs)
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] return forward_call(*args, **kwargs)
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_moe.py", line 645, in forward
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] hidden_states = self.model(input_ids, positions, intermediate_tensors,
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 272, in __call__
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] output = self.compiled_callable(*args, **kwargs)
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 663, in _fn
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] raise e.remove_dynamo_frames() from None # see TORCHDYNAMO_VERBOSE=1
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/torch/_dynamo/output_graph.py", line 1544, in _call_user_compiler
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] raise BackendCompilerFailed(
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/torch/_dynamo/output_graph.py", line 1519, in _call_user_compiler
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] compiled_fn = compiler_fn(gm, self.example_inputs())
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/torch/_dynamo/repro/after_dynamo.py", line 150, in __call__
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] compiled_gm = compiler_fn(gm, example_inputs)
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/torch/_dynamo/repro/after_dynamo.py", line 150, in __call__
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] compiled_gm = compiler_fn(gm, example_inputs)
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/torch/__init__.py", line 2392, in __call__
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] return self.compiler_fn(model, inputs, **self.kwargs)
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/compilation/backends.py", line 549, in __call__
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] self.configure_post_pass()
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/compilation/backends.py", line 442, in configure_post_pass
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] self.post_grad_pass_manager.configure(self.vllm_config)
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/compilation/pass_manager.py", line 62, in configure
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] self.passes += [AsyncTPPass(config)]
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/compilation/collective_fusion.py", line 369, in __init__
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] self.device).register(self.patterns)
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/compilation/collective_fusion.py", line 170, in register
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] pm.register_replacement(pattern, replacement, self.get_inputs(),
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/torch/_inductor/pattern_matcher.py", line 1446, in register_replacement
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] pattern, gm = gen_pattern_and_search_gm(
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/home/mosh/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 81, in inner
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] return func(*args, **kwds)
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/torch/_inductor/pattern_matcher.py", line 1649, in gen_pattern_and_search_gm
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] search_gm = trace_fn(search_fn, flat_inputs)
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] return func(*args, **kwargs)
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/torch/_inductor/pattern_matcher.py", line 2003, in fwd_only
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] gm = make_fx(fn, decompositions, tracing_mode="real")(*args)
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/torch/fx/experimental/proxy_tensor.py", line 2240, in wrapped
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] return make_fx_tracer.trace(f, *args)
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/torch/fx/experimental/proxy_tensor.py", line 2178, in trace
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] return self._trace_inner(f, *args)
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/torch/fx/experimental/proxy_tensor.py", line 2149, in _trace_inner
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] t = dispatch_trace(
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/torch/_compile.py", line 51, in inner
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] return disable_fn(*args, **kwargs)
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 838, in _fn
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] return fn(*args, **kwargs)
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/torch/fx/experimental/proxy_tensor.py", line 1174, in dispatch_trace
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] graph = tracer.trace(root, concrete_args) # type: ignore[arg-type]
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 838, in _fn
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] return fn(*args, **kwargs)
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/torch/fx/_symbolic_trace.py", line 837, in trace
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] (self.create_arg(fn(*args)),),
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] ^^^^^^^^^
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/torch/fx/experimental/proxy_tensor.py", line 1229, in wrapped
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] out = f(*tensors) # type:ignore[call-arg]
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] ^^^^^^^^^^^
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/vllm/compilation/collective_fusion.py", line 140, in pattern
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] scaled_mm = torch.ops.aten._scaled_mm.default(input,
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/torch/_ops.py", line 756, in __call__
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] return self._op(*args, **kwargs)
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/torch/fx/experimental/proxy_tensor.py", line 1277, in __torch_function__
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] return func(*args, **kwargs)
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/torch/_ops.py", line 756, in __call__
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] return self._op(*args, **kwargs)
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/torch/utils/_stats.py", line 27, in wrapper
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] return fn(*args, **kwargs)
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/torch/fx/experimental/proxy_tensor.py", line 1379, in __torch_dispatch__
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] return proxy_call(self, func, self.pre_dispatch, args, kwargs)
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/torch/fx/experimental/proxy_tensor.py", line 914, in proxy_call
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] out = func(*args, **kwargs)
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] File "/data/lijinghui/uv_projects/.venv/lib/python3.12/site-packages/torch/_ops.py", line 756, in __call__
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] return self._op(*args, **kwargs)
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] torch._dynamo.exc.BackendCompilerFailed: backend='<vllm.compilation.backends.VllmBackend object at 0x7f500808cf50>' raised:
(VllmWorker TP0 pid=776424) ERROR 08-05 19:38:40 [multiproc_executor.py:596] RuntimeError: torch._scaled_mm is only supported on CUDA devices with compute capability >= 9.0 or 8.9, or ROCm MI300+
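For reference, below is a minimal sketch matching the engine configuration logged above. The exact launch command is not included in this report, so the arguments are an assumption reconstructed from the log (TP=2, max model len 10000, seed 0, FP8 checkpoint, sequence parallelism + async TP enabled through pass_config); the capability check at the top shows why torch._scaled_mm is rejected on this hardware.

```python
# Hypothetical reproduction sketch reconstructed from the logged engine config;
# not the original launch command (which is not shown in this report).
import torch
from vllm import LLM

# RTX 3090 reports compute capability (8, 6). torch._scaled_mm requires sm89/sm90
# on CUDA (or ROCm MI300+), which is exactly what the RuntimeError above states.
print(torch.cuda.get_device_capability(0))  # -> (8, 6) on this machine

llm = LLM(
    model="/data/pretrained_models/Qwen3-30B-A3B-Thinking-2507-FP8",
    tensor_parallel_size=2,
    max_model_len=10000,
    seed=0,
    compilation_config={
        "pass_config": {
            "enable_sequence_parallelism": True,
            "enable_async_tp": True,
        },
    },
)
```

If the AsyncTPPass pattern registration is indeed the trigger, leaving enable_async_tp out of pass_config on pre-sm89 GPUs should skip that pass and avoid the crash, since on this hardware the model falls back to the weight-only Marlin FP8 path anyway.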
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.