dist.ddp: make rendezvous work out of the box on all schedulers (#400)
Summary:
Every OSS scheduler we support provides a special environment variable for the rank0 host address. This creates a new macro that provides that for all the schedulers and updates `dist.ddp` to use it.
This macro differs from the others: `rank0_env` provides the name of the environment variable to read, not the address itself. To use it, the component either needs to read that variable itself or run through a shell, with something like `f"/bin/bash -c main.py $${macros.rank0_env}"`. We wrap dist.ddp with `/bin/bash` to achieve this. Because of the macro template language, we need to write `$$` instead of just `$`.
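The escaping can be sketched in isolation. This is a minimal illustration, assuming the macro template language behaves like Python's `string.Template`, where `$$` collapses to a literal `$`. `RANK0_ENV_MACRO`, `build_command`, `substitute`, the `--host` flag and the variable name `EXAMPLE_RANK0_HOST` are hypothetical stand-ins, not TorchX names.

```python
from string import Template
from typing import List

# Hypothetical stand-in for macros.rank0_env: a placeholder that is later
# replaced with the NAME of the scheduler's rank0 environment variable.
RANK0_ENV_MACRO = "${rank0_env}"


def build_command(script: str) -> List[str]:
    # "$$" passes through the f-string unchanged, and the placeholder is
    # spliced in right after it, giving "$$${rank0_env}".
    arg = f"python {script} --host $${RANK0_ENV_MACRO}"
    # Wrapping in bash lets the shell expand the variable at runtime.
    return ["/bin/bash", "-c", arg]


def substitute(args: List[str], rank0_env: str) -> List[str]:
    # Assumed substitution step: "$$" becomes "$" and "${rank0_env}"
    # becomes the variable name.
    return [Template(a).safe_substitute(rank0_env=rank0_env) for a in args]


cmd = substitute(build_command("main.py"), "EXAMPLE_RANK0_HOST")

# Without the "$$" escape, only the bare variable name is left, which bash
# would treat as literal text rather than expand.
unescaped = Template(f"--host {RANK0_ENV_MACRO}").safe_substitute(
    rank0_env="EXAMPLE_RANK0_HOST"
)

print(cmd)
print(unescaped)
```

With the escape, the argument handed to bash contains `$EXAMPLE_RANK0_HOST`, which the shell resolves on the host where the job runs.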
Other misc changes:
* fixed ray tests to skip if `ray` isn't installed
* deleted TORCHX_IMAGE_EXAMPLES since it doesn't exist anymore
* updated dist.ddp to have `-m`, `--max_restarts` and set LOGLEVEL by default
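
For the ray test change, one common way to skip tests when an optional dependency is missing looks like the sketch below. It is an assumption about the general pattern, not the code in this commit, and `RaySchedulerSmokeTest` is a hypothetical test class.

```python
import importlib.util
import unittest

# True only when the optional "ray" package can be found; nothing is imported.
HAS_RAY = importlib.util.find_spec("ray") is not None


@unittest.skipUnless(HAS_RAY, "ray is not installed")
class RaySchedulerSmokeTest(unittest.TestCase):  # hypothetical test class
    def test_import(self) -> None:
        # Import inside the test so collecting this module never fails
        # on machines without ray.
        import ray

        self.assertTrue(hasattr(ray, "init"))
```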
Pull Request resolved: #400
Test Plan:
Updated slurm, docker and aws batch integ tests to use `dist.ddp` instead of bespoke usage.
Tested k8s and local_cwd manually.
I haven't tested ray.
Updated component test framework to use two workers.
Reviewed By: kiukchung
Differential Revision: D34437397
Pulled By: d4l3k
fbshipit-source-id: 24f7fc3ab94be1f824d3649baa78dea8309c076c