[Bug]: Not able to deploy Llama-4-Scout-17B-16E-Instruct on vllm-openai v0.8.3 #16197

@rabaja

Description

Your current environment

  1. Download Llama-4-Scout-17B-16E-Instruct to a PVC.
  2. Deploy the model on Azure Kubernetes Service on A100s, 2 GPUs with 80 GB each.
  3. Use the arguments below (a sketch of how these fit into the container spec follows after the traceback):
args:
        - "--model"
        - "/mnt/models/meta-llama-4-scout-17b-16e-instruct"
        - "--api-key"
        - "$(VLLM_API_KEY)"
        - "--tensor-parallel-size"
        - "2"
        - "--dtype"
        - "bfloat16"
        - "--port"
        - "8000"
        - "--max-model-len"
        - "32768"
        - "--max-num-batched-tokens"
        - "32768"
        - "--max-num-seqs"
        - "16"
        - "--gpu-memory-utilization"
        - "0.99"
        - "--served-model-name"
        - "Llama-4-Scout-17B-16E-Instruct"
        - "--trust-remote-code"
        - "--disable-log-requests"
        - "--enable-chunked-prefill"
        - "--enable-prefix-caching"
  4. Getting the following error:
(VllmWorker rank=0 pid=222) INFO 04-07 08:59:23 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=1 pid=239) INFO 04-07 08:59:23 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=0 pid=222) INFO 04-07 08:59:23 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_57eb3085'), local_subscribe_addr='ipc:///tmp/8f0dd0fa-95b6-4959-8738-3b5acb47a883', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=239) INFO 04-07 08:59:23 [parallel_state.py:957] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1
(VllmWorker rank=0 pid=222) INFO 04-07 08:59:23 [parallel_state.py:957] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0
(VllmWorker rank=1 pid=239) INFO 04-07 08:59:23 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=0 pid=222) INFO 04-07 08:59:23 [cuda.py:221] Using Flash Attention backend on V1 engine.
(VllmWorker rank=0 pid=222) INFO 04-07 08:59:26 [gpu_model_runner.py:1258] Starting to load model /mnt/models/meta-llama-4-scout-17b-16e-instruct...
(VllmWorker rank=0 pid=222) INFO 04-07 08:59:26 [config.py:3334] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 392, 400, 408, 416, 424, 432, 440, 448, 456, 464, 472, 480, 488, 496, 504, 512] is overridden by config [512, 384, 256, 128, 4, 2, 1, 392, 264, 136, 8, 400, 272, 144, 16, 408, 280, 152, 24, 416, 288, 160, 32, 424, 296, 168, 40, 432, 304, 176, 48, 440, 312, 184, 56, 448, 320, 192, 64, 456, 328, 200, 72, 464, 336, 208, 80, 472, 344, 216, 88, 120, 480, 352, 248, 224, 96, 488, 504, 360, 232, 104, 496, 368, 240, 112, 376]
(VllmWorker rank=1 pid=239) INFO 04-07 08:59:26 [gpu_model_runner.py:1258] Starting to load model /mnt/models/meta-llama-4-scout-17b-16e-instruct...
(VllmWorker rank=1 pid=239) INFO 04-07 08:59:26 [config.py:3334] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 392, 400, 408, 416, 424, 432, 440, 448, 456, 464, 472, 480, 488, 496, 504, 512] is overridden by config [512, 384, 256, 128, 4, 2, 1, 392, 264, 136, 8, 400, 272, 144, 16, 408, 280, 152, 24, 416, 288, 160, 32, 424, 296, 168, 40, 432, 304, 176, 48, 440, 312, 184, 56, 448, 320, 192, 64, 456, 328, 200, 72, 464, 336, 208, 80, 472, 344, 216, 88, 120, 480, 352, 248, 224, 96, 488, 504, 360, 232, 104, 496, 368, 240, 112, 376]
(VllmWorker rank=0 pid=222) WARNING 04-07 08:59:26 [config.py:3785] `torch.compile` is turned on, but the model /mnt/models/meta-llama-4-scout-17b-16e-instruct does not support it. Please open an issue on GitHub if you want it to be supported.
(VllmWorker rank=0 pid=222) WARNING 04-07 08:59:26 [config.py:3785] `torch.compile` is turned on, but the model /mnt/models/meta-llama-4-scout-17b-16e-instruct does not support it. Please open an issue on GitHub if you want it to be supported.
(VllmWorker rank=1 pid=239) WARNING 04-07 08:59:26 [config.py:3785] `torch.compile` is turned on, but the model /mnt/models/meta-llama-4-scout-17b-16e-instruct does not support it. Please open an issue on GitHub if you want it to be supported.
(VllmWorker rank=1 pid=239) WARNING 04-07 08:59:26 [config.py:3785] `torch.compile` is turned on, but the model /mnt/models/meta-llama-4-scout-17b-16e-instruct does not support it. Please open an issue on GitHub if you want it to be supported.
(VllmWorker rank=0 pid=222) Process SpawnProcess-1:1:
CRITICAL 04-07 08:59:27 [multiproc_executor.py:49] MulitprocExecutor got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
CRITICAL 04-07 08:59:27 [core_client.py:361] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
 0 nranks 2 cudaDev 0 nvmlDev 0 busId 100000 commId 0xaf1be216fd147d3d - Init COMPLETE
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1121, in <module>
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1069, in run_server
    async with build_async_engine_client(args) as engine_client:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 178, in build_async_engine_client_from_engine_args
    async_llm = AsyncLLM.from_vllm_config(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 136, in from_vllm_config
    return cls(
           ^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 102, in __init__
    self.engine_core = EngineCoreClient.make_client(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 69, in make_client
    return AsyncMPClient(vllm_config, executor_class, log_stats)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 570, in __init__
    super().__init__(
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 401, in __init__
    engine.proc_handle.wait_for_startup()
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/utils.py", line 127, in wait_for_startup
    if self.reader.recv()["status"] != "READY":
       ^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
          ^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/multiprocessing/connection.py", line 430, in _recv_bytes
    buf = self._recv(4)
          ^^^^^^^^^^^^^
  File "/usr/lib/python3.12/multiprocessing/connection.py", line 399, in _recv
    raise EOFError

Any help will be appreciated.
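
For context, here is a minimal sketch of how the arguments from step 3 might sit in the Kubernetes Deployment described in step 2. The image tag, object names, resource requests, Secret, and volume names are assumptions for illustration, not the exact manifest used in this cluster:

# Hypothetical manifest for illustration only: image tag, names, resources,
# and volumes are assumptions, not the exact spec used in this report.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama4-scout
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llama4-scout
  template:
    metadata:
      labels:
        app: llama4-scout
    spec:
      containers:
        - name: vllm-openai
          image: vllm/vllm-openai:v0.8.3   # assumed tag matching the reported version
          args:
            - "--model"
            - "/mnt/models/meta-llama-4-scout-17b-16e-instruct"
            - "--tensor-parallel-size"
            - "2"
            # ... remaining arguments exactly as listed in step 3 ...
          env:
            - name: VLLM_API_KEY
              valueFrom:
                secretKeyRef:
                  name: vllm-api-key       # assumed Secret holding the API key
                  key: api-key
          ports:
            - containerPort: 8000          # matches --port 8000
          resources:
            limits:
              nvidia.com/gpu: 2            # two A100 80GB GPUs for --tensor-parallel-size 2
          volumeMounts:
            - name: model-store
              mountPath: /mnt/models       # PVC with the downloaded weights (step 1)
            - name: dshm
              mountPath: /dev/shm          # tensor parallelism needs generous shared memory
      volumes:
        - name: model-store
          persistentVolumeClaim:
            claimName: llama4-scout-pvc    # assumed PVC name
        - name: dshm
          emptyDir:
            medium: Memory

With this layout, once startup completed the container would expose the OpenAI-compatible API on port 8000 under the name given by --served-model-name, authenticated with the key injected through VLLM_API_KEY.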

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
