[Bug]: Not able to deploy Llama-4-Scout-17B-16E-Instruct on vllm-openai v0.8.3 #16197

@rabaja

Description

Your current environment

  1. Download Llama-4-Scout-17B-16E-Instruct to a PVC.
  2. Deploy the model on Azure Kubernetes Service on A100s, 2 GPUs with 80 GB each.
  3. Use the arguments below (a sketch of how these fit into the container spec follows after the traceback):
args:
        - "--model"
        - "/mnt/models/meta-llama-4-scout-17b-16e-instruct"
        - "--api-key"
        - "$(VLLM_API_KEY)"
        - "--tensor-parallel-size"
        - "2"
        - "--dtype"
        - "bfloat16"
        - "--port"
        - "8000"
        - "--max-model-len"
        - "32768"
        - "--max-num-batched-tokens"
        - "32768"
        - "--max-num-seqs"
        - "16"
        - "--gpu-memory-utilization"
        - "0.99"
        - "--served-model-name"
        - "Llama-4-Scout-17B-16E-Instruct"
        - "--trust-remote-code"
        - "--disable-log-requests"
        - "--enable-chunked-prefill"
        - "--enable-prefix-caching"
  4. Getting the following error:
(VllmWorker rank=0 pid=222) INFO 04-07 08:59:23 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=1 pid=239) INFO 04-07 08:59:23 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=0 pid=222) INFO 04-07 08:59:23 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_57eb3085'), local_subscribe_addr='ipc:///tmp/8f0dd0fa-95b6-4959-8738-3b5acb47a883', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=239) INFO 04-07 08:59:23 [parallel_state.py:957] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1
(VllmWorker rank=0 pid=222) INFO 04-07 08:59:23 [parallel_state.py:957] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0
(VllmWorker rank=1 pid=239) INFO 04-07 08:59:23 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=0 pid=222) INFO 04-07 08:59:23 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=0 pid=222) INFO 04-07 08:59:26 [gpu_model_runner.py:1258] Starting to load model /mnt/models/meta-llama-4-scout-17b-16e-instruct...
(VllmWorker rank=0 pid=222) INFO 04-07 08:59:26 [config.py:3334] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 392, 400, 408, 416, 424, 432, 440, 448, 456, 464, 472, 480, 488, 496, 504, 512] is overridden by config [512, 384, 256, 128, 4, 2, 1, 392, 264, 136, 8, 400, 272, 144, 16, 408, 280, 152, 24, 416, 288, 160, 32, 424, 296, 168, 40, 432, 304, 176, 48, 440, 312, 184, 56, 448, 320, 192, 64, 456, 328, 200, 72, 464, 336, 208, 80, 472, 344, 216, 88, 120, 480, 352, 248, 224, 96, 488, 504, 360, 232, 104, 496, 368, 240, 112, 376]
(VllmWorker rank=1 pid=239) INFO 04-07 08:59:26 [gpu_model_runner.py:1258] Starting to load model /mnt/models/meta-llama-4-scout-17b-16e-instruct...
(VllmWorker rank=1 pid=239) INFO 04-07 08:59:26 [config.py:3334] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 392, 400, 408, 416, 424, 432, 440, 448, 456, 464, 472, 480, 488, 496, 504, 512] is overridden by config [512, 384, 256, 128, 4, 2, 1, 392, 264, 136, 8, 400, 272, 144, 16, 408, 280, 152, 24, 416, 288, 160, 32, 424, 296, 168, 40, 432, 304, 176, 48, 440, 312, 184, 56, 448, 320, 192, 64, 456, 328, 200, 72, 464, 336, 208, 80, 472, 344, 216, 88, 120, 480, 352, 248, 224, 96, 488, 504, 360, 232, 104, 496, 368, 240, 112, 376]
(VllmWorker rank=0 pid=222) WARNING 04-07 08:59:26 [config.py:3785] `torch.compile` is turned on, but the model /mnt/models/meta-llama-4-scout-17b-16e-instruct does not support it. Please open an issue on GitHub if you want it to be supported.
(VllmWorker rank=0 pid=222) WARNING 04-07 08:59:26 [config.py:3785] `torch.compile` is turned on, but the model /mnt/models/meta-llama-4-scout-17b-16e-instruct does not support it. Please open an issue on GitHub if you want it to be supported.
(VllmWorker rank=1 pid=239) WARNING 04-07 08:59:26 [config.py:3785] `torch.compile` is turned on, but the model /mnt/models/meta-llama-4-scout-17b-16e-instruct does not support it. Please open an issue on GitHub if you want it to be supported.
(VllmWorker rank=1 pid=239) WARNING 04-07 08:59:26 [config.py:3785] `torch.compile` is turned on, but the model /mnt/models/meta-llama-4-scout-17b-16e-instruct does not support it. Please open an issue on GitHub if you want it to be supported.
(VllmWorker rank=0 pid=222) Process SpawnProcess-1:1:
CRITICAL 04-07 08:59:27 [multiproc_executor.py:49] MulitprocExecutor got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
CRITICAL 04-07 08:59:27 [core_client.py:361] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
 0 nranks 2 cudaDev 0 nvmlDev 0 busId 100000 commId 0xaf1be216fd147d3d - Init COMPLETE
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1121, in <module>
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1069, in run_server
    async with build_async_engine_client(args) as engine_client:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 178, in build_async_engine_client_from_engine_args
    async_llm = AsyncLLM.from_vllm_config(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 136, in from_vllm_config
    return cls(
           ^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 102, in __init__
    self.engine_core = EngineCoreClient.make_client(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 69, in make_client
    return AsyncMPClient(vllm_config, executor_class, log_stats)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 570, in __init__
    super().__init__(
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 401, in __init__
    engine.proc_handle.wait_for_startup()
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/utils.py", line 127, in wait_for_startup
    if self.reader.recv()["status"] != "READY":
       ^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
          ^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/multiprocessing/connection.py", line 430, in _recv_bytes
    buf = self._recv(4)
          ^^^^^^^^^^^^^
  File "/usr/lib/python3.12/multiprocessing/connection.py", line 399, in _recv
    raise EOFError

Any help will be appreciated.
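
For context, here is a minimal sketch of how the arguments from step 3 might sit in the Kubernetes Deployment described in step 2. The image tag, object names, resource requests, Secret, and volume names are assumptions for illustration, not the exact manifest used in this cluster:

# Hypothetical manifest for illustration only: image tag, names, resources,
# and volumes are assumptions, not the exact spec used in this report.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama4-scout
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama4-scout
  template:
    metadata:
      labels:
        app: llama4-scout
    spec:
      containers:
        - name: vllm-openai
          image: vllm/vllm-openai:v0.8.3   # assumed tag matching the reported version
          args:
            - "--model"
            - "/mnt/models/meta-llama-4-scout-17b-16e-instruct"
            - "--tensor-parallel-size"
            - "2"
            # ... remaining arguments exactly as listed in step 3 ...
          env:
            - name: VLLM_API_KEY
              valueFrom:
                secretKeyRef:
                  name: vllm-api-key       # assumed Secret holding the API key
                  key: api-key
          ports:
            - containerPort: 8000          # matches --port 8000
          resources:
            limits:
              nvidia.com/gpu: 2            # two A100 80GB GPUs for --tensor-parallel-size 2
          volumeMounts:
            - name: model-store
              mountPath: /mnt/models       # PVC with the downloaded weights (step 1)
            - name: dshm
              mountPath: /dev/shm          # tensor parallelism needs generous shared memory
      volumes:
        - name: model-store
          persistentVolumeClaim:
            claimName: llama4-scout-pvc    # assumed PVC name
        - name: dshm
          emptyDir:
            medium: Memory

With this layout, once startup completed the container would expose the OpenAI-compatible API on port 8000 under the name given by --served-model-name, authenticated with the key injected through VLLM_API_KEY.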

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
