OMPI5 --timeout parameter not killing job after timeout gets exceeded #12313

Closed
a-szegel opened this issue Feb 6, 2024 · 5 comments
@a-szegel
Member

a-szegel commented Feb 6, 2024

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

v5.0.0

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Pulled from AWS EFA Installer v1.30.0

# On an AWS instance
curl -O https://efa-installer.amazonaws.com/aws-efa-installer-1.30.0.tar.gz
tar -xf aws-efa-installer-1.30.0.tar.gz && cd aws-efa-installer
sudo ./efa_installer.sh -y
module load openmpi5

Please describe the system on which you are running

This issue has been seen across the following systems:

centos7-hpc6a.48xlarge: test_imb[openmpi5-MPI1-Reduce_local]
debian10-c6gn.16xlarge: test_omb_collective[openmpi5-osu_iallreduce-host] 
rhel7-hpc6a.48xlarge: test_imb[openmpi5-MPI1-Reduce_local] 
rhel8-c6gn.16xlarge: test_imb[openmpi5-MPI1-Bcast] 

Details of the problem

My team's OMPI5 jobs are launched with a timeout (--timeout 1800 or --timeout 3600), but we are seeing jobs hang for roughly a day. An example run command:

export PATH=/opt/amazon/openmpi5/bin:$PATH
export FI_EFA_USE_DEVICE_RDMA=1
export LD_LIBRARY_PATH=/home/ec2-user/tmp/PortaFiducia/build/libraries/libfabric/main/install/libfabric/lib
export FI_PROVIDER=efa
/opt/amazon/openmpi5/bin/mpirun --wdir . -n 192 \
    --hostfile /home/ec2-user/tmp/PortaFiducia/hostfile \
    --map-by ppr:96:node --timeout 1800 \
    -x FI_EFA_USE_DEVICE_RDMA=1 \
    -x LD_LIBRARY_PATH=/home/ec2-user/tmp/PortaFiducia/build/libraries/libfabric/main/install/libfabric/lib \
    -x FI_PROVIDER=efa -x PATH \
    /home/ec2-user/tmp/PortaFiducia/build/workloads/imb/openmpi-v5.0.0-installer/source/mpi-benchmarks-IMB-v2021.7/IMB-MPI1 Reduce_local -npmin 192 -iter 200 -time 20 -mem 1 2>&1 | tee node2-ppn96.txt

We are seeing the MPI jobs run for much longer than that:

2024-01-29 05:43:09] test_suites/libfabric/test_imb.py::test_imb[openmpi5-MPI1-Reduce_local] 2024-01-30 01:38:08,827 - WARNING - test_orchestrator - Test is being timed out...
2024-01-30 01:38:08,827 - INFO - test_orchestrator - Stopping timer...

The run logs currently don't get saved by our CI system in the event of a timeout (so I don't have better logs as of now), but I am working on that and will update the ticket when I get them.

We don't see this with OMPI4 (but that might just mean it isn't hanging in this way). The behavior is intermittent, which suggests some sort of race.
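
Until mpirun's own --timeout can be relied on, an outer watchdog is one way to keep a CI run from hanging for hours. The following is a minimal sketch, assuming GNU coreutils `timeout` is available on the launch node. `run_with_watchdog` is a hypothetical helper name, not part of Open MPI or the CI described above, and `sleep 30` stands in for the real mpirun command line.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not an Open MPI feature): run a command
# under coreutils `timeout` so it is terminated even if the launcher's own
# --timeout never fires.
run_with_watchdog() {
    local limit="$1"   # seconds before SIGTERM is sent
    local grace="$2"   # extra seconds to wait before SIGKILL is sent
    shift 2
    timeout --signal=TERM --kill-after="$grace" "$limit" "$@"
}

# Stand-in for the mpirun command: a process that outlives the limit.
status=0
run_with_watchdog 1 1 sleep 30 || status=$?
echo "exit status: $status"
```

GNU `timeout` exits with status 124 when the limit is hit and the command stops on the first signal, which gives the caller a way to tell a watchdog kill apart from an ordinary failure. For an mpirun job, the limit would presumably be set somewhat above the --timeout value so the built-in mechanism gets the first chance. This sketch only bounds the local launcher process; it makes no guarantee about cleaning up ranks or daemons on remote nodes.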

@wenduwan
Contributor

wenduwan commented Feb 6, 2024

The symptom aligns with #12064

The issue has been fixed in 5.0.1

@a-szegel
Member Author

a-szegel commented Feb 6, 2024

My concern is that the timeout failed to kill the job; I understand the hang itself has been fixed in 5.0.1.

@wenduwan
Contributor

wenduwan commented Feb 6, 2024

We will re-evaluate the issue after the 5.0.2 release. The timeout functionality is also implemented in PRRTE, so it is possible the hang fix also resolves this issue.

@wenduwan wenduwan self-assigned this Feb 6, 2024
@rhc54
Contributor

rhc54 commented Feb 6, 2024

FWIW: working fine in PRRTE master

@wenduwan
Contributor

Issue not observed in 5.0.2. Resolving.
