Your current environment
Python==3.10.14
vllm==0.5.0.post1
ray==2.24.0
Node status
Active:
1 node_37c2b26800cc853721ef351ca107c298ae77efcb5504d8e0c900ed1d
1 node_62d48658974f4114465450f53fd97c10fcfe6d40b4e896a90a383682
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
Usage:
0.0/52.0 CPU
0.0/2.0 GPU
0B/9.01GiB memory
0B/4.14GiB object_store_memory
Demands:
(no resource demands)
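The cluster status above appears to be the output of "ray status" on the head node. As a quick sanity check (a minimal sketch; it assumes the script below is launched on the head node of this already-running two-node cluster), both nodes and their GPUs can be listed from Python before constructing the LLM:

# Sanity check: confirm Ray sees both nodes and both GPUs before building the engine.
import ray

ray.init(address="auto")  # attach to the running cluster instead of starting a local one
for node in ray.nodes():  # one entry per node in the cluster
    print(node["NodeManagerAddress"], node["Resources"].get("GPU", 0), node["Alive"])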
🐛 Describe the bug
I ran into a problem when Gloo tries to establish its full-mesh connection, and I have not found a solution.
The script is as follows:
from vllm import LLM

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
llm = LLM(
    model="/mnt/d/llm/qwen/qwen1.5_0.5b",
    trust_remote_code=True,
    gpu_memory_utilization=0.4,
    enforce_eager=True,
    tensor_parallel_size=2,
    swap_space=1,
)
outputs = llm.generate(prompts)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
The error is as follows:
[rank0]: Traceback (most recent call last):
[rank0]: File "/data/vllm_test.py", line 13, in
[rank0]: llm = LLM(model="/mnt/d/llm/qwen/qwen1.5_0.5b", trust_remote_code=True, gpu_memory_utilization=0.4, enforce_eager=True, tensor_parallel_size=2, swap_space=1)
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 144, in init
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 363, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 223, in init
[rank0]: self.model_executor = executor_class(
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 25, in init
[rank0]: super().init(*args, **kwargs)
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 41, in init
[rank0]: self._init_executor()
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 40, in _init_executor
[rank0]: self._init_workers_ray(placement_group)
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 171, in _init_workers_ray
[rank0]: self._run_workers("init_device")
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 246, in _run_workers
[rank0]: driver_worker_output = self.driver_worker.execute_method(
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 149, in execute_method
[rank0]: raise e
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 140, in execute_method
[rank0]: return executor(*args, **kwargs)
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/worker/worker.py", line 115, in init_device
[rank0]: init_worker_distributed_environment(self.parallel_config, self.rank,
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/worker/worker.py", line 354, in init_worker_distributed_environment
[rank0]: init_distributed_environment(parallel_config.world_size, rank,
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 553, in init_distributed_environment
[rank0]: _WORLD = GroupCoordinator(
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/distributed/parallel_state.py", line 120, in init
[rank0]: cpu_group = torch.distributed.new_group(ranks, backend="gloo")
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 89, in wrapper
[rank0]: func_return = func(*args, **kwargs)
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3868, in new_group
[rank0]: return _new_group_with_tag(ranks, timeout, backend, pg_options, None, use_local_synchronization=use_local_synchronization)
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3939, in _new_group_with_tag
[rank0]: pg, pg_store = _new_process_group_helper(
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1509, in _new_process_group_helper
[rank0]: backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
[rank0]: RuntimeError: Gloo connectFullMesh failed with [../third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
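For context, "Gloo connectFullMesh failed" is raised while torch.distributed.new_group(ranks, backend="gloo") opens TCP connections between all ranks, so it usually points at the two nodes not being able to reach each other on the network interface Gloo picked. Below is a minimal sketch of a commonly suggested mitigation, pinning Gloo to a specific interface; the interface name "eth0" is an assumption, not taken from this report, and the variable has to be present in the environment of every Ray node, not only the driver:

import os

# Assumption: "eth0" is the NIC that connects the two Ray nodes; replace it with the
# actual interface on your machines. Because the workers are spawned by Ray on remote
# nodes, this variable should also be exported there (e.g. before running `ray start`),
# not only in this driver script.
os.environ["GLOO_SOCKET_IFNAME"] = "eth0"

from vllm import LLM  # construct the engine only after the environment is configured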