Your current environment
I ran the following to start the server:
sudo docker pull vllm/vllm-openai:latest
sudo docker run -d \
--gpus all \
--name vllm-8b-bf16-b200 \
-p 8000:8000 \
--ipc=host \
-e HF_TOKEN=... \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--tensor-parallel-size 8 \
--trust-remote-code \
--host 0.0.0.0 \
--port 8000
I then attempted a sweep across input/output length configurations using vllm bench serve and ran into the issues below.
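The sweep itself looked roughly like the following. This is a minimal sketch, not the exact commands: the specific input/output lengths and prompt counts varied per run, and the flag names assume the random-dataset mode of vllm bench serve.

# Hypothetical sweep sketch; actual lengths and prompt counts varied per run.
for in_len in 1024 2048 4096; do
  for out_len in 128 512 1024; do
    vllm bench serve \
      --model meta-llama/Llama-3.1-8B-Instruct \
      --host 127.0.0.1 \
      --port 8000 \
      --dataset-name random \
      --random-input-len "$in_len" \
      --random-output-len "$out_len" \
      --num-prompts 1000
  done
done

The crash appeared partway through the sweep while requests were in flight, as shown in the logs below.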
🐛 Describe the bug
(APIServer pid=1) INFO: 127.0.0.1:42352 - "POST /v1/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 08-25 12:24:14 [loggers.py:123] Engine 000: Avg prompt throughput: 26298.0 tokens/s, Avg generation throughput: 10103.2 tokens/s, Running: 224 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.2%, Prefix cache hit rate: 6.8%
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596] WorkerProc hit an exception.
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596] Traceback (most recent call last):
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 591, in worker_busy_loop
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596] output = func(*args, **kwargs)
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596] return func(*args, **kwargs)
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 362, in execute_model
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596] output = self.model_runner.execute_model(scheduler_output,
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596] return func(*args, **kwargs)
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1733, in execute_model
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596] valid_sampled_token_ids = sampled_token_ids.tolist()
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596] RuntimeError: CUDA error: an illegal memory access was encountered
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596]
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596] Traceback (most recent call last):
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 591, in worker_busy_loop
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596] output = func(*args, **kwargs)
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596] return func(*args, **kwargs)
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 362, in execute_model
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596] output = self.model_runner.execute_model(scheduler_output,
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596] return func(*args, **kwargs)
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 1733, in execute_model
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596] valid_sampled_token_ids = sampled_token_ids.tolist()
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596] ^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596] RuntimeError: CUDA error: an illegal memory access was encountered
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596]
(VllmWorker TP1 pid=406) ERROR 08-25 12:24:17 [multiproc_executor.py:596]
[rank1]:[E825 12:24:17.876068535 ProcessGroupNCCL.cpp:1899] [PG ID 2 PG GUID 3 Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7fb2b19785e8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7fb2b190d4a2 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7fb2b1d24422 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fb24756d5a6 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7fb24757d840 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x782 (0x7fb24757f3d2 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fb247580fdd in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x7fb2379b3253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7fb2b2602ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: clone + 0x44 (0x7fb2b2693a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 2 PG GUID 3 Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7fb2b19785e8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7fb2b190d4a2 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7fb2b1d24422 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fb24756d5a6 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x70 (0x7fb24757d840 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x782 (0x7fb24757f3d2 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fb247580fdd in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xdc253 (0x7fb2379b3253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x94ac3 (0x7fb2b2602ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #9: clone + 0x44 (0x7fb2b2693a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1905 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7fb2b19785e8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xcc7b9e (0x7fb24754fb9e in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x9165ed (0x7fb24719e5ed in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0xdc253 (0x7fb2379b3253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #4: <unknown function> + 0x94ac3 (0x7fb2b2602ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #5: clone + 0x44 (0x7fb2b2693a04 in /usr/lib/x86_64-linux-gnu/libc.so.6)
(EngineCore_0 pid=271) ERROR 08-25 12:24:19 [multiproc_executor.py:146] Worker proc VllmWorker-1 died unexpectedly, shutting down executor.
(VllmWorker TP0 pid=405) INFO 08-25 12:24:19 [multiproc_executor.py:520] Parent process exited, terminating worker
(VllmWorker TP2 pid=407) INFO 08-25 12:24:19 [multiproc_executor.py:520] Parent process exited, terminating worker
(VllmWorker TP3 pid=408) INFO 08-25 12:24:19 [multiproc_executor.py:520] Parent process exited, terminating worker
(VllmWorker TP4 pid=409) INFO 08-25 12:24:19 [multiproc_executor.py:520] Parent process exited, terminating worker
(VllmWorker TP5 pid=410) INFO 08-25 12:24:19 [multiproc_executor.py:520] Parent process exited, terminating worker
(VllmWorker TP6 pid=411) INFO 08-25 12:24:19 [multiproc_executor.py:520] Parent process exited, terminating worker
(VllmWorker TP7 pid=412) INFO 08-25 12:24:19 [multiproc_executor.py:520] Parent process exited, terminating worker
(APIServer pid=1) INFO 08-25 12:24:24 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 7599.4 tokens/s, Running: 46 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 6.8%
(APIServer pid=1) INFO 08-25 12:24:34 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 46 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 6.8%
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [dump_input.py:69] Dumping input data for V1 LLM engine (v0.10.1.1) with config: model='meta-llama/Llama-3.1-8B-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-3.1-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=8, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=meta-llama/Llama-3.1-8B-Instruct, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":1,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null},
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [dump_input.py:76] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=['cmpl-a1608f9679d149b986d6c224b1692e1a-0', 'cmpl-b9f9b4cad7c842c6af3f97cd025b8827-0', 'cmpl-e4ff1c56885d4842ab7b9cc64a74f2cc-0', 'cmpl-e1d55e6ae7c548a197239f8fdadee2ce-0', 'cmpl-5df04664825344ddb17a4a2e03d71fc0-0', 'cmpl-6af42ebea36845f8ac231f6b5a85d788-0', 'cmpl-4ad81fc4d4cb4b4f8040716c6a88e93f-0', 'cmpl-7140426cd0274962a6a11e34a15f1c59-0', 'cmpl-f788a2daf8404fd9b022c7fa09a7a647-0', 'cmpl-c039f23811e84f62bfc4e91c1889ee89-0', 'cmpl-1f3a5ce1766f4089a1d5a5525b1d40ba-0', 'cmpl-4a2b48ee95894d268f0154ae3767a4f2-0', 'cmpl-fbd5471ba88d47f7b25f74fd36878654-0', 'cmpl-2d5a59add4a64ff0be382b958b67d7a6-0', 'cmpl-a8c734c5e2fc4068afa5da1a81db36f7-0', 'cmpl-db324d84c39a46fc96337dca0c5fb016-0', 'cmpl-fd46a95e15df4b479c449777f657bb7c-0', 'cmpl-67843866b6aa4e38aa711a6d7f965d64-0', 'cmpl-dbe7e63fa7204fc4972d56770b10f7e2-0', 'cmpl-f4bcdaea24a54d6b9f030357f63e51c8-0', 'cmpl-0055582e065f485b9544012ee7f10421-0', 'cmpl-7b37a606b8224758a21d919d80be033e-0', 'cmpl-54cc950534024949a6d81f3a8a754c5c-0', 'cmpl-b24988d1c4e54d71a4b98229fc153ec4-0', 'cmpl-e093e5e9465340a9a094a6fbf8717dc7-0', 'cmpl-b2b106465b9242a3981e8e41d3f52059-0', 'cmpl-1f9111a751bf4b8d9900e220977f1999-0', 'cmpl-643c6c37535c47b29eb2bce1eed10fa8-0', 'cmpl-86728ec6560a407d8362075a334b3bec-0', 'cmpl-59ca12e89c60434c845094de2eb20c9a-0', 'cmpl-5c17eac5412f4f27a80697e7fb2d897c-0', 'cmpl-9300f636cd6947a0ba2942e5c06ade7c-0', 'cmpl-73dc79082dab4d2c9865b1f07ce56066-0', 'cmpl-86372334c65b4b5b8c721c630bf7f2eb-0', 'cmpl-e4a76396c85140b2a51bd5ba1153dbdb-0', 'cmpl-721c99b3cf6a4291bbe6733dc2aab4a3-0', 'cmpl-8caaebbd79e748d68b9d22e2dc8fba73-0', 'cmpl-73a225f1176a4115b39c3fc9fc40a473-0', 'cmpl-4f2efb18da2c4a8081108869ae6ef5c7-0', 'cmpl-ea717b38e897454887bf5211398001dd-0', 'cmpl-b5453d5862794f9d983d589396451c8f-0', 'cmpl-b54360cc76ff4079bf66699b81753ae1-0', 'cmpl-935e4c690e93471a82c9fcd90ff4fd73-0', 'cmpl-ad50070e4db743a4a2d0e77634fef7ce-0', 'cmpl-921a25acd8954387b9649171350be691-0', 'cmpl-9d84b57ac81c484b9876dfa52d66a50d-0'], resumed_from_preemption=[false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false], new_token_ids=[], new_block_ids=[[[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]], [[]]], num_computed_tokens=[1790, 1790, 1790, 1790, 1790, 1790, 1789, 1789, 1789, 1789, 1789, 1789, 1789, 1789, 1788, 1788, 1788, 1788, 1788, 1788, 1788, 1788, 1787, 1787, 1787, 1787, 1787, 1787, 1787, 1786, 1786, 1786, 1786, 1786, 1786, 1786, 1786, 1785, 1785, 1785, 1785, 1785, 1785, 1785, 1784, 1784]), num_scheduled_tokens={cmpl-f788a2daf8404fd9b022c7fa09a7a647-0: 1, cmpl-1f9111a751bf4b8d9900e220977f1999-0: 1, cmpl-e4a76396c85140b2a51bd5ba1153dbdb-0: 1, cmpl-b54360cc76ff4079bf66699b81753ae1-0: 1, cmpl-7140426cd0274962a6a11e34a15f1c59-0: 1, cmpl-ad50070e4db743a4a2d0e77634fef7ce-0: 1, cmpl-67843866b6aa4e38aa711a6d7f965d64-0: 1, cmpl-b2b106465b9242a3981e8e41d3f52059-0: 1, cmpl-935e4c690e93471a82c9fcd90ff4fd73-0: 1, 
cmpl-1f3a5ce1766f4089a1d5a5525b1d40ba-0: 1, cmpl-b5453d5862794f9d983d589396451c8f-0: 1, cmpl-fd46a95e15df4b479c449777f657bb7c-0: 1, cmpl-5df04664825344ddb17a4a2e03d71fc0-0: 1, cmpl-b9f9b4cad7c842c6af3f97cd025b8827-0: 1, cmpl-73a225f1176a4115b39c3fc9fc40a473-0: 1, cmpl-b24988d1c4e54d71a4b98229fc153ec4-0: 1, cmpl-e1d55e6ae7c548a197239f8fdadee2ce-0: 1, cmpl-ea717b38e897454887bf5211398001dd-0: 1, cmpl-86372334c65b4b5b8c721c630bf7f2eb-0: 1, cmpl-86728ec6560a407d8362075a334b3bec-0: 1, cmpl-8caaebbd79e748d68b9d22e2dc8fba73-0: 1, cmpl-5c17eac5412f4f27a80697e7fb2d897c-0: 1, cmpl-7b37a606b8224758a21d919d80be033e-0: 1, cmpl-73dc79082dab4d2c9865b1f07ce56066-0: 1, cmpl-4ad81fc4d4cb4b4f8040716c6a88e93f-0: 1, cmpl-59ca12e89c60434c845094de2eb20c9a-0: 1, cmpl-db324d84c39a46fc96337dca0c5fb016-0: 1, cmpl-c039f23811e84f62bfc4e91c1889ee89-0: 1, cmpl-a8c734c5e2fc4068afa5da1a81db36f7-0: 1, cmpl-e093e5e9465340a9a094a6fbf8717dc7-0: 1, cmpl-a1608f9679d149b986d6c224b1692e1a-0: 1, cmpl-54cc950534024949a6d81f3a8a754c5c-0: 1, cmpl-9300f636cd6947a0ba2942e5c06ade7c-0: 1, cmpl-9d84b57ac81c484b9876dfa52d66a50d-0: 1, cmpl-721c99b3cf6a4291bbe6733dc2aab4a3-0: 1, cmpl-dbe7e63fa7204fc4972d56770b10f7e2-0: 1, cmpl-f4bcdaea24a54d6b9f030357f63e51c8-0: 1, cmpl-6af42ebea36845f8ac231f6b5a85d788-0: 1, cmpl-fbd5471ba88d47f7b25f74fd36878654-0: 1, cmpl-921a25acd8954387b9649171350be691-0: 1, cmpl-4f2efb18da2c4a8081108869ae6ef5c7-0: 1, cmpl-643c6c37535c47b29eb2bce1eed10fa8-0: 1, cmpl-4a2b48ee95894d268f0154ae3767a4f2-0: 1, cmpl-0055582e065f485b9544012ee7f10421-0: 1, cmpl-2d5a59add4a64ff0be382b958b67d7a6-0: 1, cmpl-e4ff1c56885d4842ab7b9cc64a74f2cc-0: 1}, total_num_scheduled_tokens=46, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0], finished_req_ids=['cmpl-9e8212d8f3df4fcfbc8a02e347a50af9-0', 'cmpl-141566cfe0d74757902f26519a45838d-0', 'cmpl-0bf2ca4c7ca54bf2b4ea1404847e424a-0', 'cmpl-ee984d614f544d6fa38c0f170a6657f8-0', 'cmpl-f26038f1788a477c8053acd5e5b49ec5-0', 'cmpl-801bcde4959d4a86bd3f4f7ce5df741a-0', 'cmpl-37b1364bc95148a9b512f8baee500fa1-0', 'cmpl-417cf722c01a4844ba0dbd89297d1d91-0'], free_encoder_input_ids=[], structured_output_request_ids={}, grammar_bitmask=null, kv_connector_metadata=null)
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [dump_input.py:79] Dumping scheduler stats: SchedulerStats(num_running_reqs=46, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.00820318034420564, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0), spec_decoding_stats=None, num_corrupted_reqs=0)
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702] EngineCore encountered a fatal error.
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702] Traceback (most recent call last):
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 243, in collective_rpc
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702] result = get_response(w, dequeue_timeout)
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 226, in get_response
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702] status, result = w.worker_response_mq.dequeue(
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 507, in dequeue
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702] with self.acquire_read(timeout, cancel) as buf:
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702] File "/usr/lib/python3.12/contextlib.py", line 137, in __enter__
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702] return next(self.gen)
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702] ^^^^^^^^^^^^^^
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 469, in acquire_read
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702] raise TimeoutError
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702] TimeoutError
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702]
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702] The above exception was the direct cause of the following exception:
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702]
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702] Traceback (most recent call last):
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 693, in run_engine_core
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702] engine_core.run_busy_loop()
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 720, in run_busy_loop
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702] self._process_engine_step()
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 745, in _process_engine_step
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702] outputs, model_executed = self.step_fn()
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702] ^^^^^^^^^^^^^^
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 288, in step
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702] model_output = self.execute_model_with_error_logging(
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 274, in execute_model_with_error_logging
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702] raise err
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 265, in execute_model_with_error_logging
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702] return model_fn(scheduler_output)
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702] ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 173, in execute_model
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702] (output, ) = self.collective_rpc(
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702] ^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 249, in collective_rpc
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702] raise TimeoutError(f"RPC call to {method} timed out.") from e
(EngineCore_0 pid=271) ERROR 08-25 12:29:17 [core.py:702] TimeoutError: RPC call to execute_model timed out.
(APIServer pid=1) ERROR 08-25 12:29:17 [async_llm.py:430] AsyncLLM output_handler failed.
(APIServer pid=1) ERROR 08-25 12:29:17 [async_llm.py:430] Traceback (most recent call last):
(APIServer pid=1) ERROR 08-25 12:29:17 [async_llm.py:430] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 389, in output_handler
(APIServer pid=1) ERROR 08-25 12:29:17 [async_llm.py:430] outputs = await engine_core.get_output_async()
(APIServer pid=1) ERROR 08-25 12:29:17 [async_llm.py:430] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ERROR 08-25 12:29:17 [async_llm.py:430] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 843, in get_output_async
(APIServer pid=1) ERROR 08-25 12:29:17 [async_llm.py:430] raise self._format_exception(outputs) from None
(APIServer pid=1) ERROR 08-25 12:29:17 [async_llm.py:430] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(EngineCore_0 pid=271) Process EngineCore_0:
(EngineCore_0 pid=271) Traceback (most recent call last):
(EngineCore_0 pid=271) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 243, in collective_rpc
(EngineCore_0 pid=271) result = get_response(w, dequeue_timeout)
(EngineCore_0 pid=271) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=271) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 226, in get_response
(EngineCore_0 pid=271) status, result = w.worker_response_mq.dequeue(
(EngineCore_0 pid=271) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=271) File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 507, in dequeue
(EngineCore_0 pid=271) with self.acquire_read(timeout, cancel) as buf:
(EngineCore_0 pid=271) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=271) File "/usr/lib/python3.12/contextlib.py", line 137, in __enter__
(EngineCore_0 pid=271) return next(self.gen)
(EngineCore_0 pid=271) ^^^^^^^^^^^^^^
(EngineCore_0 pid=271) File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 469, in acquire_read
(EngineCore_0 pid=271) raise TimeoutError
(EngineCore_0 pid=271) TimeoutError
(EngineCore_0 pid=271)
(EngineCore_0 pid=271) The above exception was the direct cause of the following exception:
(EngineCore_0 pid=271)
(EngineCore_0 pid=271) Traceback (most recent call last):
(EngineCore_0 pid=271) File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_0 pid=271) self.run()
(EngineCore_0 pid=271) File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_0 pid=271) self._target(*self._args, **self._kwargs)
(EngineCore_0 pid=271) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 704, in run_engine_core
(EngineCore_0 pid=271) raise e
(EngineCore_0 pid=271) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 693, in run_engine_core
(EngineCore_0 pid=271) engine_core.run_busy_loop()
(EngineCore_0 pid=271) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 720, in run_busy_loop
(EngineCore_0 pid=271) self._process_engine_step()
(EngineCore_0 pid=271) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 745, in _process_engine_step
(EngineCore_0 pid=271) outputs, model_executed = self.step_fn()
(EngineCore_0 pid=271) ^^^^^^^^^^^^^^
(EngineCore_0 pid=271) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 288, in step
(EngineCore_0 pid=271) model_output = self.execute_model_with_error_logging(
(EngineCore_0 pid=271) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=271) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 274, in execute_model_with_error_logging
(EngineCore_0 pid=271) raise err
(EngineCore_0 pid=271) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 265, in execute_model_with_error_logging
(EngineCore_0 pid=271) return model_fn(scheduler_output)
(EngineCore_0 pid=271) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=271) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 173, in execute_model
(EngineCore_0 pid=271) (output, ) = self.collective_rpc(
(EngineCore_0 pid=271) ^^^^^^^^^^^^^^^^^^^^
(EngineCore_0 pid=271) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 249, in collective_rpc
(EngineCore_0 pid=271) raise TimeoutError(f"RPC call to {method} timed out.") from e
(EngineCore_0 pid=271) TimeoutError: RPC call to execute_model timed out.
(APIServer pid=1) INFO: Shutting down
(APIServer pid=1) INFO: Waiting for application shutdown.
(APIServer pid=1) INFO: Application shutdown complete.
(APIServer pid=1) INFO: Finished server process [1]
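For reference, the error output above suggests passing CUDA_LAUNCH_BLOCKING=1 to localize the faulting kernel. An untested sketch of the same docker run with that variable added (everything else unchanged from the command above):

# CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous so the stack trace
# points at the actual faulting call; this slows the server and is debug-only.
sudo docker run -d \
  --gpus all \
  --name vllm-8b-bf16-b200 \
  -p 8000:8000 \
  --ipc=host \
  -e HF_TOKEN=... \
  -e CUDA_LAUNCH_BLOCKING=1 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 8 \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port 8000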