Description
Your current environment
The output of `python collect_env.py`
```
(RayWorkerWrapper pid=5057, ip=10.121.129.5) Cache shape torch.Size([163840, 64]) [repeated 30x across cluster]
(RayWorkerWrapper pid=5849, ip=10.121.129.12) INFO 01-21 00:46:19 model_runner.py:1099] Loading model weights took 18.9152 GB [repeated 7x across cluster]
(RayWorkerWrapper pid=5148, ip=10.121.129.13) INFO 01-21 00:46:25 model_runner.py:1099] Loading model weights took 21.4118 GB [repeated 8x across cluster]
(RayWorkerWrapper pid=5050, ip=10.121.129.5) INFO 01-21 00:47:24 model_runner.py:1099] Loading model weights took 21.4118 GB [repeated 8x across cluster]
(RayWorkerWrapper pid=5054, ip=10.121.129.5) WARNING 01-21 00:47:31 fused_moe.py:374] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=128,device_name=NVIDIA_L20,dtype=fp8_w8a8.json
(RayWorkerWrapper pid=5053, ip=10.121.129.5) INFO 01-21 00:47:24 model_runner.py:1099] Loading model weights took 21.4118 GB [repeated 7x across cluster]
WARNING 01-21 00:47:34 fused_moe.py:374] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=128,device_name=NVIDIA_L20,dtype=fp8_w8a8.json
(RayWorkerWrapper pid=5146, ip=10.121.129.13) INFO 01-21 00:47:39 worker.py:241] Memory profiling takes 14.78 seconds
(RayWorkerWrapper pid=5146, ip=10.121.129.13) INFO 01-21 00:47:39 worker.py:241] the current vLLM instance can use total_gpu_memory (44.42GiB) x gpu_memory_utilization (0.70) = 31.10GiB
(RayWorkerWrapper pid=5146, ip=10.121.129.13) INFO 01-21 00:47:39 worker.py:241] model weights take 21.41GiB; non_torch_memory takes 0.40GiB; PyTorch activation peak memory takes 0.39GiB; the rest of the memory reserved for KV Cache is 8.89GiB.
(RayWorkerWrapper pid=5856, ip=10.121.129.12) WARNING 01-21 00:47:34 fused_moe.py:374] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=128,device_name=NVIDIA_L20,dtype=fp8_w8a8.json [repeated 30x across cluster]
```
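The memory-profiling lines above follow a simple per-GPU budget. The sketch below just reproduces the arithmetic with the figures copied from the log; only the calculation itself is illustrative:

```python
# KV-cache budget as reported by worker.py in the log above (all values in GiB).
total_gpu_memory = 44.42           # per-GPU memory reported for the L20
gpu_memory_utilization = 0.70      # the gpu_memory_utilization setting in use
usable = total_gpu_memory * gpu_memory_utilization   # ~31.10 GiB available to vLLM

model_weights = 21.41              # model weights per GPU (from the log)
non_torch = 0.40                   # non-torch memory
activation_peak = 0.39             # PyTorch activation peak

kv_cache = usable - model_weights - non_torch - activation_peak
print(f"KV cache budget: {kv_cache:.2f} GiB")   # ~8.89 GiB, matching the log
```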
How would you like to use vllm
I want to run inference of a [specific model](put link here). I don't know how to integrate it with vllm.
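For reference, a minimal offline-inference sketch with the vLLM Python API. The model ID and parallelism settings here are placeholders (not taken from the logs above); substitute your own:

```python
from vllm import LLM, SamplingParams

# Placeholder model ID and parallelism; adjust to the model and GPU count you have.
llm = LLM(
    model="org/model-name",         # hypothetical Hugging Face repo ID
    tensor_parallel_size=8,         # number of GPUs to shard the model across
    gpu_memory_utilization=0.70,    # same knob reported in the profiling log above
)

outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(temperature=0.8, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```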
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.