Indirect launch fails using slurm #1251

@david-edwards-arm

Description

Background information

What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)

Open MPI 5.0.x nightly snapshot openmpi-v5.0.x-202203030340-563c565.tar.gz

What version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)

Open MPI 5.0.x nightly snapshot openmpi-v5.0.x-202203030340-563c565.tar.gz

Please describe the system on which you are running

  • Operating system/version: RHEL 7
  • Computer hardware: x86_64 cluster
  • WLM: Slurm

Details of the problem

The context of this issue is the indirect launch of a job under the control of a debugger. Broadly following the indirect.c example, running

shell$ salloc -N 1
shell$ indirect mpirun -n 2 <app>

gives the output

An PRTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
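
For reference, the launching tool here is essentially a PMIx tool that asks its server to spawn the launcher, which in turn launches the application. Below is a minimal sketch of that pattern, loosely following the shape of the indirect.c example; it is not the example itself (event handlers, the debugger attach step, and most error handling are omitted), and the specific info attributes used are illustrative assumptions rather than a copy of the example's code.

/* Minimal sketch of an indirect-launch tool (assumption: a simplification
 * of the shape of indirect.c, not a copy of it). */
#include <stdio.h>
#include <string.h>
#include <pmix_tool.h>

int main(int argc, char **argv)
{
    pmix_proc_t myproc;
    pmix_app_t app;
    pmix_info_t info[2];
    pmix_nspace_t nspace;
    pmix_status_t rc;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <launcher> [args...]\n", argv[0]);
        return 1;
    }

    /* Initialize as a PMIx tool */
    rc = PMIx_tool_init(&myproc, NULL, 0);
    if (PMIX_SUCCESS != rc) {
        fprintf(stderr, "PMIx_tool_init failed: %s\n", PMIx_Error_string(rc));
        return 1;
    }

    /* Everything after our own name is the launcher invocation,
     * e.g. "mpirun -n 2 <app>" */
    PMIX_APP_CONSTRUCT(&app);
    app.cmd = strdup(argv[1]);
    for (int i = 1; i < argc; i++) {
        PMIX_ARGV_APPEND(rc, app.argv, argv[i]);
    }
    app.maxprocs = 1;

    /* Mark the spawned process as a tool and forward its stdout */
    PMIX_INFO_LOAD(&info[0], PMIX_SPAWN_TOOL, NULL, PMIX_BOOL);
    PMIX_INFO_LOAD(&info[1], PMIX_FWD_STDOUT, NULL, PMIX_BOOL);

    /* Ask the PMIx server to spawn the launcher */
    rc = PMIx_Spawn(info, 2, &app, 1, nspace);
    if (PMIX_SUCCESS != rc) {
        fprintf(stderr, "PMIx_Spawn failed: %s\n", PMIx_Error_string(rc));
    } else {
        fprintf(stderr, "launcher running as nspace %s\n", nspace);
        /* a real tool would now register for and wait on a
         * job-terminated event before finalizing */
    }

    PMIx_tool_finalize();
    return (PMIX_SUCCESS == rc) ? 0 : 1;
}

When mpirun is spawned this way it still launches its prted daemons through whichever PLM component PRRTE selects, which is where the Slurm-specific failure below shows up.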

Changing the command to force the ssh launcher,

shell$ salloc -N 1
shell$ indirect mpirun --mca plm ssh -n 2 <app>

allows the job to complete, so the issue appears to be specific to the Slurm integration.
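
To narrow this down, raising the PLM framework's verbosity on the failing case should show which launcher component is selected and the srun invocation used to start the prted daemons (a diagnostic suggestion on my part, not something from the original report; the exact output will vary):

shell$ salloc -N 1
shell$ indirect mpirun --mca plm_base_verbose 10 -n 2 <app>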
