Return error from node failure #10389
Comments
As the large banner on your output indicates, using ULFM with RMA windows is very experimental at this point. We have had success in the past running some code using the You may still have luck modifying your test program in the following ways:
@abouteiller Can you look into the "Sorry! You were supposed to get help..." issue? It seems like there was supposed to be a real help message there.
The 'node-died' issue appears to be related to Are you using an internal prte? (it is indicated in the final lines of 'configure' output)
I am using an internal prte. I tried adding the path to the Open MPI installation to PATH, but that did not fix the missing error text. It's not a problem though.
Calling MPI_Win_set_errhandler(window, MPI_ERRORS_RETURN) after creating the window does not fix the problem: the node crash is still not returned as an error. The program seems to crash as soon as MPI_Get attempts to access a node that is inaccessible (node-died error) or that has been killed with SIGKILL (fails silently).
Is it safe to say that I should wait for the full Open MPI 5 release? Thank you for your time.
Side note: I do not have RDMA installed, and the program runs correctly when there are no node failures, so perhaps RDMA is being reported in error?
The root cause for the missing help message is here: openpmix/prrte#1360
FWIW: the missing help message problem has been fixed in PRRTE (both master and release branches).
Background information
What version of Open MPI are you using?
v5.0.0rc7
Describe how Open MPI was installed
tarball
Please describe the system on which you are running
Details of the problem
I am trying to make a distributed system built on Open MPI continue past a node failure. To do this, I must detect and handle the node failure.
I am using Open MPI v5.0.0rc7, run with "--with-ft ulfm", and have called "MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN)". However, the node failure does not seem to be returned as an error that the code can handle.
Example:
I have also tried running with "/home/ompi5rc7/bin/mpirun --with-ft ulfm --mca btl tcp,self -n 2 --hostfile ../hosts ./a.out", but I get the same output. I am not using RDMA.
Is it possible to print out the error code after a node failure?