-
-
Notifications
You must be signed in to change notification settings - Fork 10.6k
[Ray backend] Better error when pg topology is bad. #7584
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
youkaichao
merged 24 commits into
vllm-project:main
from
rkooo567:fix-ray-value-error-due-to-missing-pg
Aug 23, 2024
Merged
Changes from all commits
Commits
Show all changes
24 commits
Select commit
Hold shift + click to select a range
16bcb3f
basic working
rkooo567 7f75862
move test to a 2 nodes tests.
rkooo567 3a5ab31
lint fix test
rkooo567 59cdddb
Merge branch 'main' into fix-ray-value-error-due-to-missing-pg
rkooo567 0f8655f
fixed
rkooo567 9ba1f73
fixed
rkooo567 17d4385
linting
rkooo567 c361edd
Merge branch 'main' into fix-ray-value-error-due-to-missing-pg
rkooo567 30e0df2
.
rkooo567 c832faf
Merge branch 'main' into fix-ray-value-error-due-to-missing-pg
rkooo567 a85dcbc
fixed
rkooo567 87654a6
ip
rkooo567 a8e5aca
fixed
rkooo567 a3f9719
test 2
rkooo567 5ef0512
fix
rkooo567 53848df
fixed
rkooo567 59f8e2b
fix
rkooo567 2c91954
Merge branch 'main' into fix-ray-value-error-due-to-missing-pg
youkaichao f5772fc
update tests
youkaichao a01a1ff
update tests
youkaichao 09f841a
refine
youkaichao 196d857
fix test
youkaichao 4802308
fix test
youkaichao 8d6b00a
rearrange tests
youkaichao File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,64 @@ | ||
"""Make sure ray assigns GPU workers to the correct node. | ||
|
||
Run: | ||
```sh | ||
cd $VLLM_PATH/tests | ||
|
||
pytest distributed/test_multi_node_assignment.py | ||
``` | ||
""" | ||
|
||
import os | ||
|
||
import pytest | ||
import ray | ||
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy | ||
|
||
from vllm import initialize_ray_cluster | ||
from vllm.config import ParallelConfig | ||
from vllm.executor.ray_utils import _wait_until_pg_removed | ||
from vllm.utils import get_ip | ||
|
||
VLLM_MULTI_NODE = os.getenv("VLLM_MULTI_NODE", "0") == "1" | ||
|
||
|
||
@pytest.mark.skipif(not VLLM_MULTI_NODE, | ||
reason="Need at least 2 nodes to run the test.") | ||
def test_multi_node_assignment() -> None: | ||
|
||
# NOTE: important to keep this class definition here | ||
# to let ray use cloudpickle to serialize it. | ||
class Actor: | ||
|
||
def get_ip(self): | ||
return get_ip() | ||
|
||
for _ in range(10): | ||
config = ParallelConfig(1, 2) | ||
initialize_ray_cluster(config) | ||
|
||
current_ip = get_ip() | ||
workers = [] | ||
for bundle_id, bundle in enumerate( | ||
config.placement_group.bundle_specs): | ||
if not bundle.get("GPU", 0): | ||
continue | ||
scheduling_strategy = PlacementGroupSchedulingStrategy( | ||
placement_group=config.placement_group, | ||
placement_group_capture_child_tasks=True, | ||
placement_group_bundle_index=bundle_id, | ||
) | ||
|
||
worker = ray.remote( | ||
num_cpus=0, | ||
num_gpus=1, | ||
scheduling_strategy=scheduling_strategy, | ||
)(Actor).remote() | ||
worker_ip = ray.get(worker.get_ip.remote()) | ||
assert worker_ip == current_ip | ||
workers.append(worker) | ||
|
||
for worker in workers: | ||
ray.kill(worker) | ||
|
||
_wait_until_pg_removed(config.placement_group) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this the key to make sure current node is included in the placement group?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes that's right. I failed to find other clean way to support this unfortunately...