
[Bug]: Error when running distributed inference with vllm+ray #5779

@JKYtydt

Description


Your current environment

Python==3.10.14
vllm==0.5.0.post1
ray==2.24.0

Node status

Active:
1 node_37c2b26800cc853721ef351ca107c298ae77efcb5504d8e0c900ed1d
1 node_62d48658974f4114465450f53fd97c10fcfe6d40b4e896a90a383682
Pending:
(no pending nodes)
Recent failures:
(no failures)

Resources

Usage:
0.0/52.0 CPU
0.0/2.0 GPU
0B/9.01GiB memory
0B/4.14GiB object_store_memory

Demands:
(no resource demands)
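
For reference, the same cluster view can be cross-checked from the Python driver. This is a minimal sketch, not part of the original report; it only assumes the script is launched on a node that is already attached to this Ray cluster:

import ray

ray.init(address="auto")        # attach to the running cluster instead of starting a new one
print(ray.cluster_resources())  # expect 52.0 CPU and 2.0 GPU, matching the status above
for node in ray.nodes():
    print(node["NodeManagerAddress"], node["Resources"].get("GPU", 0), node["Alive"])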

🐛 Describe the bug

I ran into a problem when Gloo tries to establish the full-mesh connection and have not found a solution.
The script is as follows:
from vllm import LLM

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

llm = LLM(model="/mnt/d/llm/qwen/qwen1.5_0.5b", trust_remote_code=True, gpu_memory_utilization=0.4, enforce_eager=True, tensor_parallel_size=2, swap_space=1)

outputs = llm.generate(prompts)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

The error is as follows:
[rank0]: Traceback (most recent call last):
[rank0]: File "/data/vllm_test.py", line 13, in
[rank0]: llm = LLM(model="/mnt/d/llm/qwen/qwen1.5_0.5b", trust_remote_code=True, gpu_memory_utilization=0.4,enforce_eager=True,tensor_parallel_size=2,swap_space=1)
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 144, in init
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 363, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 223, in init
[rank0]: self.model_executor = executor_class(
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 25, in init
[rank0]: super().init(*args, **kwargs)
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 41, in init
[rank0]: self._init_executor()
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 40, in _init_executor
[rank0]: self._init_workers_ray(placement_group)
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 171, in _init_workers_ray
[rank0]: self._run_workers("init_device")
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 246, in _run_workers
[rank0]: driver_worker_output = self.driver_worker.execute_method(
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 149, in execute_method
[rank0]: raise e
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 140, in execute_method
[rank0]: return executor(*args, **kwargs)
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/worker/worker.py", line 115, in init_device
[rank0]: init_worker_distributed_environment(self.parallel_config, self.rank,
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/worker/worker.py", line 354, in init_worker_distributed_environment
[rank0]: init_distributed_environment(parallel_config.world_size, rank,
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 553, in init_distributed_environment
[rank0]: _WORLD = GroupCoordinator(
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 120, in init
[rank0]: cpu_group = torch.distributed.new_group(ranks, backend="gloo")
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 89, in wrapper
[rank0]: func_return = func(*args, **kwargs)
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3868, in new_group
[rank0]: return _new_group_with_tag(ranks, timeout, backend, pg_options, None, use_local_synchronization=use_local_synchronization)
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3939, in _new_group_with_tag
[rank0]: pg, pg_store = _new_process_group_helper(
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1509, in _new_process_group_helper
[rank0]: backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
[rank0]: RuntimeError: Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
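
The RuntimeError is raised while vLLM creates the CPU (Gloo) process group that accompanies the NCCL group (cpu_group = torch.distributed.new_group(ranks, backend="gloo") in the traceback), so the failure happens in the TCP full-mesh handshake between the two nodes rather than in the GPU path. Below is a minimal sketch to exercise that handshake outside vLLM, assuming a 2-rank setup; the interface name eth0, the master address/port, and the RANK values are placeholders, not values from this environment:

import os
import torch
import torch.distributed as dist

# Assumed values: replace with the NIC both nodes share and the head node's IP.
os.environ.setdefault("GLOO_SOCKET_IFNAME", "eth0")
os.environ.setdefault("MASTER_ADDR", "10.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

rank = int(os.environ.get("RANK", "0"))  # run with RANK=0 on the head node, RANK=1 on the worker

dist.init_process_group(backend="gloo", rank=rank, world_size=2)

# A small all_reduce uses the same TCP pairs that Gloo builds during connectFullMesh.
t = torch.ones(1)
dist.all_reduce(t)
print(f"rank {rank}: all_reduce ok, value = {t.item()}")
dist.destroy_process_group()

If this standalone test also fails, the problem is in the network/interface selection between the two nodes (GLOO_SOCKET_IFNAME is the standard Gloo knob for pinning the interface) rather than in vLLM itself.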
