Hang during MPI_Init #11566
I find that of all 512 processes, all but 5 have a stack trace like:
But the other 5 have one that looks like:
I thought that perhaps I happened to catch these 5 outside of the sleep, but when I issue another stack trace I find the same processes in that state, so it seems they are actually waiting on something.
Some statistics: with #11563 I only see 71 failures after 2370 tests (3%).
Do you observe this if not using SGE?
Yes, this still occurs without compiling with SGE support (at least not explicitly).
I've tried to reproduce this with the v5.0.x branch and cannot. I am able to reproduce it on the main branch. I think it's highly probable that the latest pmix/prrte submodules have introduced some bugs. It's good that this isn't a 5.0.x blocker, at least.
Judging based on where it is hanging:
I'd guess that something has gone wrong in
@rhc54, do you think this may be a PMIx issue?
Could be? Could also be in PRRTE, but I don't know. I suggest you try to chase it down, as the PRRTE submodule in the v5.0 branch will be catching up to what is in main very soon. You could try adding
Just to see, I updated the PRRTE submodule to match OMPI main and haven't hit the issue, though I don't know for sure it's not some PRRTE issue. Will see if I can get some time to investigate.
Sounds like it is probably a PMIx problem. We did recently make a change to the client fence code - I cannot reproduce this problem, but I'll take a look at the code.
The only thing I could find between PMIx v4.2 (what is in OMPI v5) and PMIx master (what is in OMPI main) is that the latter wasn't sorting the procs in the fence operation. I'm not sure why that would manifest the problem stated here, so I expect this isn't related. I fixed it anyway so we have consistency between the PMIx branches. I'm afraid I'll have to rely on others to provide more debug info - like I said, I'm unable to reproduce the problem here.
One thing occurs to me. The biggest difference between PMIx v4.x and PMIx master is the new shared memory subsystem. Could you folks please add
This fixes it on main! 0/178 runs failed, where previously 1/5 runs failed. Double-confirmed by re-running without the environment variable, after which the failures returned.
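For anyone reproducing this, the setting being toggled here is the PMIx gds/shmem component. A hedged sketch of two common ways to keep it out of a run; the environment-variable spelling follows the standard PMIx MCA convention and is an assumption, while the command-line form is the one quoted later in this thread:

export PMIX_MCA_gds=hash                    # assumed env form; selects the hash component instead of shmem
$PREFIX/bin/mpirun --mca gds ^shmem <app>   # command-line form quoted later in the thread; <app> is a placeholder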
@samuelkgutierrez Can you please take a look into this? It appears that the modex information storage is having a problem. It might be that we need the failover code after all, or it could be a race condition in the code where you store it.
I wonder if this is related to openpmix/openpmix#2967? This certainly looks like a race condition. My understanding, though, is that the modex code in
I've spent some time trying to track this hang down, but haven't found anything obvious on my end. So any help here is greatly appreciated. If memory serves me, I was able to replicate this hang with
For those that are able to help take a look, this is the modex code in
@wckzhang would it be possible to get a stack trace from all the threads to see where in PMIx this is happening? Thank you.
I think it should be possible; I'll try to get that, though I haven't tried reproducing with fewer than 144 ranks.
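For reference, a hedged example of one way to capture a backtrace of every thread in a single hung rank without an interactive session; gdb availability on the compute nodes and the <pid> placeholder are assumptions:

gdb --batch -p <pid> -ex "thread apply all bt"

Running this twice a few seconds apart, as was done earlier in the thread, helps distinguish a genuine hang from a slowly moving process.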
I think the question @samuelkgutierrez is trying to answer is whether the client has recvd the job info and is hung trying to process it, or whether the client has not recvd the job info at all. The referenced PMIx issue indicates that it may just be one or two ranks that are actually stuck. There is another fence at the end of MPI_Init, so it is possible that all the other ranks are sitting in that fence waiting for the "stuck" ranks to catch up. Hence, you won't see the "hello" output from anyone. What I would do is add some output right after that referenced code that simply states "rank N has completed modex" so you can see which ranks are stuck, and then give us the full stack trace for just one of those.
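A hedged sketch of the kind of one-line debug print being suggested; the helper name, and the assumption that the rank is available where the modex completes, are illustrative only:

/* illustrative helper: print and flush immediately so the output survives a later hang */
#include <stdio.h>
#include <unistd.h>

static void report_modex_complete(int rank)
{
    fprintf(stderr, "[pid %ld] rank %d has completed modex\n",
            (long) getpid(), rank);
    fflush(stderr);
}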
We iterated off-list. What Sam was remembering was something completely unrelated to the MPI_Init hang issue. This seems to be something specific to the gds/shmem component, as disabling it resolves the problem. I suspect the problem is on the client side, and that the shmem component is hitting an issue that causes the client to not complete the fence operation. I'll start looking thru the code, but the results of the above test from someone who can reproduce the problem would help a lot.
I put the print right after the referenced code; it does not print anything during the hang, so it doesn't seem like that was the case:
Added --prtemca pmix_server_verbose 5 but that doesn't seem to give any useful information:
That is... totally unexpected. However, I went and checked, and there is sadly no debug output in the "fence" upcall. Sigh. I'll fix that for next time. I'm also going to add some output in the shmem component to bark when it runs into trouble. Will update when it is ready.
Okay, I would deeply appreciate it if you could update both submodules:

$ cd 3rd-party/openpmix
$ git checkout master
$ git pull
$ cd ../prrte
$ git checkout master
$ git pull

You may have to do a
Once you have that done, please add the following to your respective command lines:

--prtemca pmix_server_verbose 5 --pmixmca pmix_client_force_debug_verbose 5 --leave-session-attached

On a "good" run, I expect you will see something like the following output (amongst other things).

From each daemon:

[rhc-node01:61350] [prterun-rhc-node01-61350@0,0] FENCE UPCALLED ON NODE <hostname>

From each client:

[rhc-node02:00168] client:unpack fence received status SUCCESS
[rhc-node02:00168] gds:shmem recv modex complete called
[rhc-node02:00168] gds:shmem connection complete with status SUCCESS
[rhc-node02:00168] client:unpack fence modex complete with status SUCCESS

On a "bad" run, I expect you will see the same output from the daemons. However, on at least one node, you should see at least one proc that hangs after emitting:

[rhc-node02:00168] client:unpack fence received status SUCCESS
[rhc-node02:00168] gds:shmem recv modex complete called
...no further output

This will confirm that the hang is due to some problem with attaching to the shmem region after completing the modex. If you could then attach to one of those processes, it would help to get a full stack trace from it showing all threads.
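For concreteness, a hedged sketch of what the reporter's launch line (quoted in the issue body below) might look like with the suggested debug options added; the paths and hostfile are the reporter's, and the combination is only illustrative:

$PREFIX/bin/mpirun --prefix=$PREFIX -np 512 -N 64 \
    -hostfile /home/ec2-user/PortaFiducia/hostfile \
    --prtemca pmix_server_verbose 5 \
    --pmixmca pmix_client_force_debug_verbose 5 \
    --leave-session-attached \
    /home/ec2-user/ompi-tests/ibm/collective/allgatherv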
Working on it, but it is proving to take a while to reproduce. I suspect it's because, as Luke noted about #11563, the latest pmix/prrte pointers only have a 3% error rate compared to the 22% error rate on the OMPI main branch.
This was the (tail) output of one hang:
I just noticed that only 3 nodes are calling the fence upcall; one node is missing.
I added some more prints. When I checked GDB, the hang was in this section of code this time:
I added some prints before and after this fence as well; now trying to reproduce again.
I guess the rank is not instantiated at this point in instance.c, as getting the rank from comm world only returns 0. I was, however, able to check that on a hang there is one fewer process calling PMIx_Fence at this location (143 procs) than on a successful run (144 procs). So one proc is missing when the fence is called.
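A hedged sketch of the kind of bracketing prints described here, written against the raw PMIx client API for clarity; the actual call in instance.c goes through OMPI's RTE wrappers, so the helper name and placement are illustrative only:

/* illustrative: count how many ranks reach and leave the fence */
#include <stdio.h>
#include <pmix.h>

static pmix_status_t traced_fence(const pmix_proc_t *myproc)
{
    pmix_status_t rc;

    fprintf(stderr, "rank %u: entering PMIx_Fence\n", myproc->rank);
    fflush(stderr);

    /* NULL procs / 0 nprocs fences across all members of the namespace */
    rc = PMIx_Fence(NULL, 0, NULL, 0);

    fprintf(stderr, "rank %u: PMIx_Fence returned %d\n", myproc->rank, (int) rc);
    fflush(stderr);
    return rc;
}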
Let me see if I understand what you are saying. You put some prints around the call to
If you set
Is that an accurate summary? If so, then it sounds like the next step would be to put some print statements above this point and see if we can localize where the hang occurs.
That is highly suspicious as the rank is set when you init the RTE, which must be long before you get to a Fence.
Yes
Yes
Not sure if it hangs or something else is occurring.
Yes (or --mca gds ^shmem)
All the prints look like:
This is the diff for the prints:
Very odd that you get rank=0 on every proc - it sounds like it thinks you are starting a bunch of singletons. I don't know why else the rank would be set to 0. Is the rank 0 on runs that don't hang? I'm wondering if that might be a clue to what is happening here.
rank=0 is on all runs, so I don't think it's related, plus these prints occur in this if statement:
It's only in this function that the rank is 0; when I put prints in the ompi_instance_init functions, it properly prints out a correct rank. I assumed that the rank wasn't set at that point.
Okay, that makes no sense to me, but let's set it aside for now. Can you add print statements in the code to get an idea of where it hangs?
It makes sense once you understand what is going on. The OMPI object for
I noticed this a few days ago when I was looking into why
The good news is that at the start of the function I printed a statement which printed 144 times, so the error should be contained in this one function. I should be able to narrow it down from here.
I think my prints are affecting the timing of the hang, so I might have to reduce the number of prints to reproduce the issue.
I see - thanks for the explanation! The comm_world rank is provided by
Predefined communicators aren't set up if an application is only using sessions.
May be time to look more at #11451?
Are those fences occurring before this one? We need to ensure that we are getting thru all of the fences. If we block on one, then we won't get to the next.
It looks like it's hanging somewhere in this code block. Any prints I put in here seem to make it much harder to reproduce. I suspect it's stuck in ompi_rte_init:
Briefly looking into the function, it does PMIx init and various operations, so it looks suspicious. However, considering that adding more prints seems to make it harder to reproduce, I'm going to just manually attach to all 144 processes and see if one of them is doing something different.
If you can find where in that function it blocks, it would help a lot. I'm betting it's in
I couldn't find any different stack trace. There should be 36 procs per node, but it looks like on the hanging node (the one without a fence upcall) there are only 35 procs. Could something be happening that kills a process in PMIx?
It shouldn't happen, but I cannot say that it is impossible. What I'd do is put print statements around the
I suppose it is possible we are exiting with zero status - PRRTE would think that is normal and do nothing about it. I'll see if we can add some debug to help detect that situation should you find we are not returning from
Yes, I'm pretty sure PMIx_Init is killing a proc. It doesn't seem to be just hanging; there's a proc missing on one node each time, and I confirmed that this print:
The "Enter" print printed 144 times, the error print printed 0 times, and the exit print printed 143 times. There is also a missing proc when I look at "top", but only on the node without a fence upcall.
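A hedged sketch of the Enter/error/exit instrumentation described above, written against the public PMIx client API; the helper name and its exact placement inside OMPI's RTE init are assumptions for illustration:

/* illustrative: bracket PMIx_Init so a vanished proc shows an "Enter"
 * line with no matching "Exit" line */
#include <stdio.h>
#include <unistd.h>
#include <pmix.h>

static pmix_status_t traced_pmix_init(pmix_proc_t *myproc)
{
    pmix_status_t rc;

    fprintf(stderr, "[pid %ld] Enter PMIx_Init\n", (long) getpid());
    fflush(stderr);

    rc = PMIx_Init(myproc, NULL, 0);
    if (PMIX_SUCCESS != rc) {
        fprintf(stderr, "[pid %ld] PMIx_Init failed: %d\n",
                (long) getpid(), (int) rc);
    }

    fprintf(stderr, "[pid %ld] Exit PMIx_Init (rank %u)\n",
            (long) getpid(), myproc->rank);
    fflush(stderr);
    return rc;
}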
Okay, that helps a bunch - thanks! I'm going to disable the gds/shmem component until someone has time to address this problem. I'll post a submodule update to pick up the change and report it here, as I lack commit privileges for OMPI.
Okay, the update is here
A straightforward way to test this might include registering an
I've been thinking about this some. Here are some general thoughts and questions:
The lack of any indicators is indeed puzzling. If the process was terminating abnormally (e.g., SIGBUS), then PRRTE would (a) bark loudly, and (b) terminate the entire job. It wouldn't be allowed to hang. Likewise, if at least one local process calls
I see a few possibilities:
The fact that the problem goes away with gds/shmem disabled would seem to indicate that it is somehow related to that component being present. Just not clear precisely how it is involved.
Another thing that might help track this down: it could be helpful to strip this down to the bare minimum. Use an application that consists of
That doesn't rule out that PMIx has a problem with the shmem rendezvous, as the address spaces might collide, but it would rule out a problem in the PMIx code itself.
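One plausible reading of "bare minimum", offered only as a hedged sketch (the thread does not say exactly what was run): an MPI program that does nothing except initialize and finalize, so any hang points squarely at startup rather than at the collective under test.

/* minimal init/finalize reproducer sketch */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d initialized\n", rank);
    MPI_Finalize();
    return 0;
}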
Okay, I went ahead and ran a
Biggest difference: I did not see any lost process. All procs were present. I checked
FWIW: my cmd line was
@samuelkgutierrez Perhaps that formula will help you reproduce it?
Thank you, @rhc54. This is certainly helpful. I'm debugging as we speak.
Hi, @wckzhang. Can you please do me a favor? I've found the likely cause of your hangs using shmem. I've updated the way we do shared-memory segment reference counting and the fix is in OpenPMIx master (openpmix/openpmix#3051). When you have a moment, could you please retest using both OpenPMIx and PRRTE master? The only thing is that you'll have to unignore the shmem component before running autogen.pl by adding your user moniker to src/mca/gds/shmem/.pmix_unignore. Thank you and please let me know if you have any questions.
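A hedged sketch of that unignore step; using $USER assumes your login name matches the "user moniker" PMIx checks, and the rebuild sequence is only illustrative:

cd 3rd-party/openpmix
echo "$USER" >> src/mca/gds/shmem/.pmix_unignore
cd ../..
./autogen.pl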
@wckzhang and I spoke offline. He was kind enough to give me access to his environment. I can verify that after 200 test runs I did not see any hangs. Without the fix in place, the program hung after about 25 iterations. @rhc54 looks like openpmix/openpmix#3051 fixes this issue.
Fixed by openpmix/openpmix#3051. Closing.
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
main branch
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
From source:
./configure --with-sge --without-verbs --with-libfabric=/opt/amazon/efa --disable-man-pages --enable-debug --prefix=/home/ec2-user/ompi
If you are building/installing from a git clone, please copy-n-paste the output from
git submodule status
Before and after #11563
I'm looking at the stack trace and hangs from after the PR, but I suspect it's the same issue as before the PR, just happening less often for some reason.
Please describe the system on which you are running
Details of the problem
Hang during startup of a very simple all-gather test. (ibm/collective/allgatherv). Seems to be more common with larger runs. Test case is 512 ranks on 8 hosts.
By inserting prints at the front I can see the allgatherv program is launched and can print "hello", but sometimes it hangs forever during MPI_Init().
Stack trace:
Launch command:
$PREFIX/bin/mpirun --prefix=$PREFIX -np 512 -N 64 -hostfile /home/ec2-user/PortaFiducia/hostfile /home/ec2-user/ompi-tests/ibm/collective/allgatherv