Commit 0ddec7f

Kiuk Chung authored and facebook-github-bot committed
(torchx/scheduler) add pretty print repr lambda for SlurmBatchRunRequest for dryrun (#417)
Summary:
Pull Request resolved: #417

Makes `torchx run --dryrun -s slurm` pretty print the slurm scheduler request object.

BEFORE:
```
torchx 2022-03-09 14:52:15 INFO === SCHEDULER REQUEST ===
SlurmBatchRequest(cmd=['sbatch', '--parsable'], replicas={'trainer-0': SlurmReplicaRequest(name='trainer-0', entrypoint='python', args=['-m', 'torch.distributed.run', '--rdzv_backend', 'c10d', '--rdzv_id', '${app_id}', '--nnodes', '1', '--nproc_per_node', '8', '--tee', '3', '--role', '', '-m', 'torchx.examples.apps.fb.compute_world_size.main'], srun_opts={'output': 'slurm-${app_id}-trainer-0.out', 'error': 'slurm-${app_id}-trainer-0.err'}, sbatch_opts={'ntasks-per-node': '1', 'cpus-per-task': '56', 'mem': '1572864', 'gpus-per-task': '8'}, env={'HYDRA_MAIN_MODULE': 'torchx.examples.apps.fb.compute_world_size.main', 'NCCL_DEBUG': 'INFO', 'NCCL_ASYNC_ERROR_HANDLING': '1', 'NCCL_DEBUG_SUBSYS': 'INIT,ENV,GRAPH', 'MALLOC_CONF': 'oversize_threshold:67108864,dirty_decay_ms:180000,muzzy_decay_ms:180000', 'NCCL_SOCKET_IFNAME': 'eth1', 'NCCL_IB_HCA': 'mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1', 'LOGLEVEL': 'WARNING'})})
```

AFTER:
```
=== SCHEDULER REQUEST ===
#!/bin/bash
#SBATCH --job-name=trainer-0 --ntasks-per-node=1 --cpus-per-task=56 --mem=1572864 --gpus-per-task=8
# exit on error
set -e
export PYTHONUNBUFFERED=1
export SLURM_UNBUFFEREDIO=1
srun --output=slurm-"$SLURM_JOB_ID"-trainer-0.out --error=slurm-"$SLURM_JOB_ID"-trainer-0.err --export=ALL,HYDRA_MAIN_MODULE=torchx.examples.apps.fb.compute_world_size.main,NCCL_DEBUG=INFO,NCCL_ASYNC_ERROR_HANDLING=1,NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH,MALLOC_CONF=oversize_threshold:67108864,dirty_decay_ms:180000,muzzy_decay_ms:180000,NCCL_SOCKET_IFNAME=eth1,NCCL_IB_HCA=mlx5_0:1,mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,LOGLEVEL=WARNING python -m torch.distributed.run --rdzv_backend c10d --rdzv_id ''"$SLURM_JOB_ID"'' --nnodes 1 --nproc_per_node 8 --tee 3 --role '' -m torchx.examples.apps.fb.compute_world_size.main
```

Reviewed By: mannatsingh, aivanou

Differential Revision: D34770528

fbshipit-source-id: 21c05348c7e660181735470d180f305b9623c673
1 parent 5c649da commit 0ddec7f

1 file changed: +9 -1 lines changed

torchx/schedulers/slurm_scheduler.py

Lines changed: 9 additions & 1 deletion
@@ -196,9 +196,16 @@ def materialize(self) -> str:

 srun {" ".join(srun_groups)}
 """
-        sbatch_cmd = self.cmd + sbatch_groups
         return script

+    def __repr__(self) -> str:
+        return f"""{' '.join(self.cmd + ['$SBATCH_SCRIPT'])}
+
+#----------------
+# SBATCH_SCRIPT
+#----------------
+{self.materialize()}"""
+

 class SlurmScheduler(Scheduler):
     """
@@ -345,6 +352,7 @@ def _submit_dryrun(
             cmd=cmd,
             replicas=replicas,
         )
+
         return AppDryRunInfo(req, repr)

     def _validate(self, app: AppDef, scheduler: SchedulerBackend) -> None:
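
The pattern in this diff can be tried in isolation: a request object renders its batch script in `materialize()`, its `__repr__` prints the submit command followed by that script, and the dry-run holder is given the builtin `repr` as its formatter. The following is a minimal sketch under stated assumptions, not the torchx implementation: `BatchRequest`, `DryRunInfo`, their fields, and the `my_app` module name are hypothetical stand-ins, and the script template is shortened.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class BatchRequest:
    """Hypothetical stand-in for the request class touched by the diff."""

    cmd: List[str]
    sbatch_opts: Dict[str, str]
    srun_args: List[str]

    def materialize(self) -> str:
        # Render the batch script text that would be submitted.
        opts = " ".join(f"--{k}={v}" for k, v in self.sbatch_opts.items())
        srun = " ".join(self.srun_args)
        return f"#!/bin/bash\n#SBATCH {opts}\n\n# exit on error\nset -e\n\nsrun {srun}\n"

    def __repr__(self) -> str:
        # A __repr__ defined in the class body is kept by @dataclass,
        # so repr(request) shows the command line plus the script.
        command = " ".join(self.cmd + ["$SBATCH_SCRIPT"])
        banner = "#----------------\n# SBATCH_SCRIPT\n#----------------"
        return f"{command}\n\n{banner}\n{self.materialize()}"


class DryRunInfo:
    """Hypothetical holder that pairs a request with a formatter callable."""

    def __init__(self, request: object, fmt: Callable[[object], str]) -> None:
        self.request = request
        self._fmt = fmt

    def __str__(self) -> str:
        # Printing the dry-run info delegates to the formatter.
        return self._fmt(self.request)


req = BatchRequest(
    cmd=["sbatch", "--parsable"],
    sbatch_opts={"job-name": "trainer-0", "ntasks-per-node": "1"},
    srun_args=["python", "-m", "my_app"],  # my_app is a placeholder module
)
info = DryRunInfo(req, repr)  # builtin repr picks up the custom __repr__
print(info)
```

Passing `repr` as the formatter keeps the formatting logic on the request class itself, so any other caller that logs the request gets the same readable output.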
