[Bug]: ValueError: Ray does not allocate any GPUs on the driver node. Consider adjusting the Ray placement group or running the driver on a GPU node. #6956

@youkaichao

Your current environment

The output of `python collect_env.py`

🐛 Describe the bug

We should improve the way we create the default placement group in the following code:

    num_devices_in_cluster = ray.cluster_resources().get(device_str, 0)
    if parallel_config.world_size > num_devices_in_cluster:
        raise ValueError(
            f"The number of required {device_str}s exceeds the total "
            f"number of available {device_str}s in the placement group.")
    # Create a new placement group
    placement_group_specs = ([{
        device_str: 1
    }] * parallel_config.world_size)
    current_placement_group = ray.util.placement_group(
        placement_group_specs)
    # Wait until PG is ready - this will block until all
    # requested resources are available, and will timeout

  1. If we are already in a placement group but it does not contain the current node, error out. (This is a rare case; users usually don't set placement groups.)
  2. If we are not in a placement group, we create one. Make sure it contains the current node.
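
The two cases above could be sketched roughly as follows. This is only an illustration, not the actual fix: the helper names `existing_group_covers_node` and `build_placement_group_specs` are hypothetical, and the sketch assumes that Ray exposes a per-node resource named `node:<ip>` and that a bundle-to-node mapping is available for an existing placement group (for example from `ray.util.placement_group_table`). Both helpers are plain Python, so they can be checked without a running cluster.

```python
from typing import Dict, List, Mapping


def existing_group_covers_node(bundles_to_node_id: Mapping[int, str],
                               current_node_id: str) -> bool:
    """Case 1: report whether any bundle of an existing group is on this node.

    `bundles_to_node_id` is assumed to map bundle index -> Ray node id.
    """
    return current_node_id in bundles_to_node_id.values()


def build_placement_group_specs(world_size: int, device_str: str,
                                current_ip: str) -> List[Dict[str, float]]:
    """Case 2: one bundle per worker, with the first bundle tied to this node."""
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    # Build a separate dict per bundle. `[{...}] * n` would repeat the same
    # dict object n times, so editing one bundle would edit all of them.
    specs: List[Dict[str, float]] = [{device_str: 1.0}
                                     for _ in range(world_size)]
    # Assumption: requesting a small fraction of the "node:<ip>" resource
    # forces this bundle to be scheduled on the node with that IP.
    specs[0][f"node:{current_ip}"] = 0.001
    return specs


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(build_placement_group_specs(2, "GPU", "10.0.0.1"))
    print(existing_group_covers_node({0: "node-a", 1: "node-b"}, "node-c"))
```

With helpers like these, the caller would raise the `ValueError` from the issue title when `existing_group_covers_node` returns `False`, and would otherwise pass the result of `build_placement_group_specs` to `ray.util.placement_group`.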

Labels: bug, ray
