
[Bug]: vllm-0.5.3.post1 deploying the Qwen2-72b-instruct-awq model: the service runs normally at first, but errors out under high concurrency #6734

@xinzaifeixiang1992

Description

Your current environment

cuda-12.2
torch-2.3.1
vllm-0.5.3.post1
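For reference, a minimal offline sketch of this setup via vLLM's Python API. The Hugging Face model id and tensor_parallel_size=2 are assumptions (two ranks appear in the NCCL log below), not details confirmed by the reporter:

```python
# Hedged reproduction sketch only: model id and tensor_parallel_size are
# inferred from the issue title and the two ranks in the NCCL log, not
# confirmed launch parameters.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2-72B-Instruct-AWQ",  # assumed model id
    quantization="awq",
    tensor_parallel_size=2,               # log shows [PG 2 Rank 0] and [PG 2 Rank 1]
)
outputs = llm.generate(["hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```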

🐛 Describe the bug

[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0c70fb1897 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f0c70f61b25 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f0c71089718 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f0c722868e6 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f0c7228a9e8 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f0c7229005c in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f0c72290dcc in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdbbf4 (0x7f0cbdd45bf4 in /data/anaconda3/envs/qwen/bin/../lib/libstdc++.so.6)
frame #8: + 0x7ea5 (0x7f0cc5608ea5 in /lib64/libpthread.so.0)
frame #9: clone + 0x6d (0x7f0cc5331b0d in /lib64/libc.so.6)

[rank0]:[E ProcessGroupNCCL.cpp:1414] [PG 2 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f0c70fb1897 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f0c70f61b25 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f0c71089718 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7f0c722868e6 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f0c7228a9e8 in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x77c (0x7f0c7229005c in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f0c72290dcc in /data/anaconda3/envs/qwen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xdbbf4 (0x7f0cbdd45bf4 in /data/anaconda3/envs/qwen/bin/../lib/libstdc++.so.6)
frame #8: + 0x7ea5 (0x7f0cc5608ea5 in /lib64/libpthread.so.0)
frame #9: clone + 0x6d (0x7f0cc5331b0d in /lib64/libc.so.6)

/data/anaconda3/envs/qwen/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
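Following the debugging hints printed in the traceback itself, a minimal sketch of how one might rerun to localize the illegal memory access; these variables must be set before any CUDA context is created (i.e., before the engine starts), and the values are illustrative:

```python
# Debugging sketch based on the suggestions in the log above.
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # synchronous kernel launches, so the
                                          # stack trace points at the actual
                                          # failing kernel rather than a later
                                          # asynchronous API call
os.environ["TORCH_USE_CUDA_DSA"] = "1"    # device-side assertions; only takes
                                          # effect with a torch build compiled
                                          # with DSA support
os.environ["NCCL_DEBUG"] = "INFO"         # optional: verbose NCCL logging

# ...then construct the vLLM engine as usual and replay the
# high-concurrency workload that triggers the crash.
```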
