ompi-tests/oneside/test_start2 hang with large group size on ompi master #9447
Comments
The root cause of this hang is a deadlock. Among the ranks that hang at the post stage, one rank is special: it has submitted an atomic operation and is waiting for its completion by calling the progress engine (opal_progress()), which never returns.

The details are as follows. In the post stage, each rank needs to set up a channel to every other rank in the group. osc/rdma implements that with two btl atomic operations: first, a rank uses the btl's atomic fetch to get a channel index on the peer; second, it uses the btl's atomic compare-and-swap to occupy that communication channel of the peer. One thing worth noting is that the number of communication channels is limited (https://github.com/open-mpi/ompi/blob/master/ompi/mca/osc/rdma/osc_rdma_types.h#L113), so when all channels are taken a rank has to keep retrying until one becomes available. The problem is that both atomic actions are implemented as blocking operations: osc/rdma submits an atomic operation and then calls the progress engine (opal_progress()) until it completes.

The following is an example of how the deadlock can happen. We have a group of 72 ranks, and rank X is the special rank. All other 71 ranks need to set up a post channel to rank X; the first 32 are the lucky ones and each of them gets a channel. The remaining 39 ranks have to wait: they keep calling btl_cswap on rank X, waiting for a channel to become available. Because btl/tcp uses active-message RDMA for atomics, those 39 ranks keep sending messages to rank X. Rank X itself needs to set up post channels to the others, so it submits btl atomic operations and then calls opal_progress() to wait for their completion; but the stream of incoming active-message atomics from the waiting ranks keeps opal_progress() busy so it never returns, rank X never completes its own atomics, and the whole group deadlocks.
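A minimal sketch of the blocking pattern described above; the helper names (submit_btl_fadd, submit_btl_cswap) and the completion flags are hypothetical stand-ins, not the actual osc/rdma symbols:

```c
/* Illustrative sketch, not the actual osc/rdma code: each BTL atomic is
 * submitted and then waited on by spinning in the progress engine. */
#include <stdbool.h>
#include "opal/runtime/opal_progress.h"

static volatile bool fadd_done, cswap_done;  /* set by BTL completion callbacks */

extern void submit_btl_fadd(void);   /* hypothetical: fetch a channel slot on the peer */
extern void submit_btl_cswap(void);  /* hypothetical: claim that slot on the peer      */

static void post_to_peer(void)
{
    fadd_done = false;
    submit_btl_fadd();
    while (!fadd_done) {
        opal_progress();  /* the wait never completes if this call never returns */
    }

    cswap_done = false;
    submit_btl_cswap();
    while (!cswap_done) {
        opal_progress();  /* rank X spins here while peers keep retrying their cswap */
    }
}
```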
I'm not sure I understand how you can have an endless stream of events on a particular rank that would block opal_progress from returning upstream. opal_progress only pulls events from the network once, and once they are all handled it returns to the caller, which means the upper level gets a chance to move forward if it has received the expected answer.
Thank you! This is what I observed, and I also found it puzzling. I will look into it.
Earlier you mentioned that opal_progress only pulls events from the network once. I believe you meant the EVLOOP_ONCE flag. I found that it is not the case: the default flag passed to the event loop does include EVLOOP_ONCE; however, this flag is removed during MPI_Init (here).

So the behavior of opal_progress is different before and after MPI_Init. I wonder which one is the intended behavior of opal_progress. @bwbarrett @rhc54, I would appreciate your input on this topic.
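For context, a rough sketch of the mechanism under discussion (paraphrased, not verbatim OMPI code; header paths are approximate): opal_progress() drives libevent with whatever flags are currently configured, and that flag set can be changed at runtime via opal_progress_set_event_flag().

```c
/* Paraphrased sketch, not verbatim OMPI code; header paths approximate. */
#include "opal/runtime/opal_progress.h"
#include "opal/mca/event/event.h"

/* opal_progress() essentially runs the libevent loop with the currently
 * configured flag set (opal_progress_event_flag in the real code): */
static int progress_events(int flag)
{
    return opal_event_loop(opal_sync_event_base, flag);
}

/* During MPI_Init the configured flags are replaced so that only the
 * non-blocking flag remains, i.e. EVLOOP_ONCE is dropped: */
static void drop_evloop_once(void)
{
    opal_progress_set_event_flag(OPAL_EVLOOP_NONBLOCK);
}
```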
I've always used
I am not sure I agree with the interpretation of EVLOOP_ONCE. The relevant code is in https://github.com/libevent/libevent/blob/master/event.c (between line 1981 and line 2054).
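A condensed paraphrase of that dispatch loop (simplified, not the verbatim lines; see the linked event.c for the real code):

```c
/* Condensed paraphrase of event_base_loop() in libevent's event.c --
 * simplified, not the verbatim lines 1981-2054. */
while (!done) {
    if (base->event_gotterm || base->event_break)
        break;

    /* nothing registered and nothing active: return to the caller */
    if (!(flags & EVLOOP_NO_EXIT_ON_EMPTY) &&
        !event_haveevents(base) && !N_ACTIVE_CALLBACKS(base)) {
        retval = 1;
        goto done;
    }

    /* wait for ready events (or just poll, with EVLOOP_NONBLOCK) */
    res = evsel->dispatch(base, tv_p);

    if (N_ACTIVE_CALLBACKS(base)) {
        int n = event_process_active(base);  /* run the ready callbacks */
        if ((flags & EVLOOP_ONCE) && N_ACTIVE_CALLBACKS(base) == 0 && n != 0)
            done = 1;                        /* ONCE: stop after one batch */
    } else if (flags & EVLOOP_NONBLOCK) {
        done = 1;                            /* NONBLOCK: nothing ready, return */
    }
}
```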
My reading of the code is: by default, event_base_loop keeps dispatching and processing events and only returns when there are no events left to handle (or loopbreak/loopexit is called). EVLOOP_ONCE means: when something has happened, only do one iteration. So without the EVLOOP_ONCE flag, event_base_loop (and therefore opal_progress) can keep looping for as long as new events keep arriving.
you're right.
@bosilca Thank you! Based on the comment in the code and the commit message, I believe the current code intended to make sure opal_progress() does not block in the event loop; the comment for the function says as much, but the flag actually used after MPI_Init does not guarantee it. So I opened #9480 to fix the issue.
PR merged and backported (#9573)
Thank you for taking the time to submit an issue!
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
ompi master branch + PR #9400
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
compiled from source
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
Please describe the system on which you are running
Details of the problem
I encountered this issue when testing my PR #9400, which fixes a set of issues with osc/rdma + btl/tcp. I did some debugging and found that the root cause of the issue is not my PR, but something in osc/rdma.
But first, I want to provide a reproducer:
Run onesided/test_start2 from ompi-tests over btl/tcp with a larger group size (I used 72 ranks on 2 nodes). The test will hang.
The start2 test does post-start-complete-wait on a group consisting of the whole world communicator.
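For reference, a sketch of that pattern (illustrative, not the actual test_start2 source):

```c
/* Sketch of post-start-complete-wait over the world group (illustrative,
 * not the actual test_start2 code). */
#include <mpi.h>

static void pscw_world(MPI_Win win, MPI_Comm comm)
{
    MPI_Group world_group;
    MPI_Comm_group(comm, &world_group);

    MPI_Win_post(world_group, 0, win);    /* exposure epoch: the "post" stage  */
    MPI_Win_start(world_group, 0, win);   /* access epoch:   the "start" stage */

    /* ... MPI_Put/MPI_Get to the peers would go here ... */

    MPI_Win_complete(win);                /* end the access epoch   */
    MPI_Win_wait(win);                    /* end the exposure epoch */

    MPI_Group_free(&world_group);
}
```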
Initial investigation shows that different ranks hang at different stages:
some ranks passed the post stage and hang in the start stage;
other ranks hang at the post stage.