# On an AWS instance
curl -O https://efa-installer.amazonaws.com/aws-efa-installer-1.30.0.tar.gz
tar -xf aws-efa-installer-1.30.0.tar.gz && cd aws-efa-installer
sudo ./efa_installer.sh -y
module load openmpi5
Please describe the system on which you are running
This issue has been seen across the following systems:
My team's OMPI5 jobs are launched with a timeout (--timeout 1800 or --timeout 3600), but we are seeing jobs hang for a day. An example run command:
We are seeing the MPI jobs run for much longer than that:
2024-01-29 05:43:09] test_suites/libfabric/test_imb.py::test_imb[openmpi5-MPI1-Reduce_local] 2024-01-30 01:38:08,827 - WARNING - test_orchestrator - Test is being timed out...
2024-01-30 01:38:08,827 - INFO - test_orchestrator - Stopping timer...
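While the launcher's own --timeout is not being honored, one possible stopgap is an outer wall-clock guard around the whole launch. The following is only a minimal sketch, assuming GNU coreutils `timeout` is available on the launch node; the helper name `run_with_guard` and its arguments are hypothetical and are not part of Open MPI or PRRTE.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command under an outer wall-clock limit.
# Usage: run_with_guard LIMIT_SECONDS GRACE_SECONDS command [args...]
run_with_guard() {
    local limit="$1" grace="$2" status
    shift 2
    # coreutils `timeout` sends SIGTERM once LIMIT seconds have elapsed,
    # then SIGKILL GRACE seconds later if the command is still running.
    timeout --signal=TERM --kill-after="$grace" "$limit" "$@"
    status=$?
    # 124: the limit was hit. 137 (128+9): the command died from SIGKILL,
    # which is what the kill-after step produces.
    if [ "$status" -eq 124 ] || [ "$status" -eq 137 ]; then
        echo "outer guard fired (limit ${limit}s): $*" >&2
    fi
    return "$status"
}
```

Setting the outer limit slightly above the inner one (for example 1900 seconds around a launch that passes --timeout 1800) would let the launcher's own timeout fire first when it works. This guard only acts on the local launcher process; whether remote ranks are cleaned up still depends on how the launcher handles the signal.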
The run logs currently don't get saved by our CI system in the event of a timeout (so I don't have better logs as of now), but I am working on that and will update the ticket when I get them.
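One way to make sure output survives a timed-out run is to write it to disk as the job runs rather than collecting it afterwards. This is a rough sketch under that assumption, not a description of our CI system; the helper name `run_logged` and the log path layout are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command with stdout and stderr written to a
# log file as the job runs, so partial output remains if it is killed.
# Usage: run_logged LOGFILE command [args...]
run_logged() {
    local logfile="$1" status
    shift
    mkdir -p "$(dirname "$logfile")"
    {
        echo "=== start $(date -u +%Y-%m-%dT%H:%M:%SZ): $*"
        "$@"
        status=$?
        echo "=== exit status ${status} at $(date -u +%Y-%m-%dT%H:%M:%SZ)"
    } >"$logfile" 2>&1
    # Hand the command's own exit status back to the caller.
    return "$status"
}
```

If the job is killed before it exits, the start marker and whatever output the job had already flushed stay in the file, and the missing exit-status line itself indicates the run did not finish normally. Programs that buffer output when not writing to a terminal may still lose their last unflushed lines.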
We don't see this with OMPI4, though that might just mean it isn't hanging in this way. The behavior is not consistent from run to run, which suggests some sort of race.
We will re-evaluate the issue after the 5.0.2 release. The timeout functionality is also implemented in PRRTE, so it is possible that the hang fix also resolves this issue.
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
v5.0.0
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Pulled from AWS EFA Installer v1.30.0