Closed
Description
Background information
What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)
Open MPI 5.0.x nightly snapshot openmpi-v5.0.x-202203030340-563c565.tar.gz
What version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)
Open MPI 5.0.x nightly snapshot openmpi-v5.0.x-202203030340-563c565.tar.gz
Please describe the system on which you are running
- Operating system/version: RHEL 7
- Computer hardware: x86_64 cluster
- WLM: Slurm
Details of the problem
The context of the issue is indirect launch of a job under control of a debugger.
Broadly following the indirect.c example,
shell$ salloc -N 1
shell$ indirect mpirun -n 2 <app>
gives output
An PRTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
Changing the command to use
shell$ salloc -N 1
shell$ indirect mpirun --mca plm ssh -n 2 <app>
allows the job to complete, so the issue appears to be specific to the Slurm integration.
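For anyone trying to reproduce this, the two invocations above can be run back to back with a small helper. This is only a sketch: `build_command` and `run_variants` are hypothetical names, it assumes the `indirect` binary built from the example is on `PATH`, and it assumes the daemon failure shows up as a non-zero exit status, which has not been verified here.

```python
import shlex
import subprocess


def build_command(app, nprocs=2, plm=None, tool="indirect"):
    """Return the argv list for launching `app` through mpirun under the tool.

    `tool` is assumed to be the binary built from the indirect.c example.
    """
    cmd = [tool, "mpirun"]
    if plm is not None:
        # Same option as in the report: --mca plm ssh
        cmd += ["--mca", "plm", plm]
    cmd += ["-n", str(nprocs), app]
    return cmd


def run_variants(app, runner=subprocess.run):
    """Run the default and the ssh variants; return {label: exit code}."""
    results = {}
    for label, plm in (("default", None), ("ssh", "ssh")):
        cmd = build_command(app, plm=plm)
        print("running:", shlex.join(cmd))
        results[label] = runner(cmd).returncode
    return results
```

Call `run_variants("<app>")` from inside the `salloc -N 1` allocation, with `<app>` replaced by the actual executable; a result where only the `default` entry is non-zero would match the behaviour described above.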