Hang during MPI_Init #11566

Closed
lrbison opened this issue Apr 6, 2023 · 56 comments

@lrbison
Contributor

lrbison commented Apr 6, 2023

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

main branch

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From source:
./configure --with-sge --without-verbs --with-libfabric=/opt/amazon/efa --disable-man-pages --enable-debug --prefix=/home/ec2-user/ompi

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Before and after #11563

I'm looking at the stack traces and hangs from after the PR, but I suspect it's the same issue as before the PR, just happening less often for some reason.

Please describe the system on which you are running

  • Operating system/version: Amazon Linux 2
  • Computer hardware: Arm (c7gn)
  • Network type: efa

Details of the problem

Hang during startup of a very simple all-gather test (ibm/collective/allgatherv). It seems to be more common with larger runs. The test case is 512 ranks on 8 hosts.

By inserting prints at the start I can see that the allgatherv program is launched and can print "hello", but sometimes it hangs forever during MPI_Init().

Stack trace:

#0  0x0000ffff7ee07f7c in nanosleep () from /lib64/libc.so.6
#1  0x0000ffff7ee2fd40 in usleep () from /lib64/libc.so.6
#2  0x0000ffff7efecb10 in ompi_mpi_instance_init_common (argc=1, argv=0xfffff5404768) at instance/instance.c:739
#3  0x0000ffff7efece9c in ompi_mpi_instance_init (ts_level=0, info=0xffff7f4891e8 <ompi_mpi_info_null>, errhandler=0xffff7f481228 <ompi_mpi_errors_are_fatal>,
    instance=0xffff7f491cc8 <ompi_mpi_instance_default>, argc=1, argv=0xfffff5404768) at instance/instance.c:814
#4  0x0000ffff7efdb1b0 in ompi_mpi_init (argc=1, argv=0xfffff5404768, requested=0, provided=0xfffff54045ac, reinit_ok=false) at runtime/ompi_mpi_init.c:359
#5  0x0000ffff7f0512a4 in PMPI_Init (argc=0xfffff54045dc, argv=0xfffff54045d0) at init.c:67

Launch command:

$PREFIX/bin/mpirun --prefix=$PREFIX -np 512 -N 64 -hostfile /home/ec2-user/PortaFiducia/hostfile /home/ec2-user/ompi-tests/ibm/collective/allgatherv

@lrbison
Contributor Author

lrbison commented Apr 6, 2023

I find that of all 512 processes, all but 5 have a stack trace like:

#0  0x0000ffffba8d5f7c in nanosleep () from /lib64/libc.so.6
#1  0x0000ffffba8fdd40 in usleep () from /lib64/libc.so.6
#2  0x0000ffffbaabab10 in ompi_mpi_instance_init_common (argc=1, argv=0xffffd105cc38) at instance/instance.c:739
#3  0x0000ffffbaabae9c in ompi_mpi_instance_init (ts_level=0, info=0xffffbaf571e8 <ompi_mpi_info_null>, errhandler=0xffffbaf4f228 <ompi_mpi_errors_are_fatal>, instance=0xffffbaf5fcc8 <ompi_mpi_instance_default>, argc=1, argv=0xffffd105cc38) at instance/instance.c:814
#4  0x0000ffffbaaa91b0 in ompi_mpi_init (argc=1, argv=0xffffd105cc38, requested=0, provided=0xffffd105ca7c, reinit_ok=false) at runtime/ompi_mpi_init.c:359
#5  0x0000ffffbab1f2a4 in PMPI_Init (argc=0xffffd105caac, argv=0xffffd105caa0) at init.c:67

But the other 5 have one that looks like:

#0  0x0000ffff86741be4 in epoll_pwait () from /lib64/libc.so.6
#1  0x0000ffff85ed2fbc in epoll_dispatch () from /lib64/libevent_core-2.0.so.5
#2  0x0000ffff85ebc3a8 in event_base_loop () from /lib64/libevent_core-2.0.so.5
#3  0x0000ffff865208f4 in opal_progress_events () at runtime/opal_progress.c:188
#4  0x0000ffff86520a0c in opal_progress () at runtime/opal_progress.c:238
#5  0x0000ffff868f7b08 in ompi_mpi_instance_init_common (argc=1, argv=0xffffec8e22b8) at instance/instance.c:739
#6  0x0000ffff868f7e9c in ompi_mpi_instance_init (ts_level=0, info=0xffff86d941e8 <ompi_mpi_info_null>, errhandler=0xffff86d8c228 <ompi_mpi_errors_are_fatal>, instance=0xffff86d9ccc8 <ompi_mpi_instance_default>, argc=1, argv=0xffffec8e22b8) at instance/instance.c:814
#7  0x0000ffff868e61b0 in ompi_mpi_init (argc=1, argv=0xffffec8e22b8, requested=0, provided=0xffffec8e20fc, reinit_ok=false) at runtime/ompi_mpi_init.c:359
#8  0x0000ffff8695c2a4 in PMPI_Init (argc=0xffffec8e212c, argv=0xffffec8e2120) at init.c:67

I thought that perhaps I happened to catch these 5 outside of the sleep, but when I issue another stack trace I find the same processes in that state, so it seems they are actually waiting on something.
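For readers following the stack traces: the nanosleep/usleep frames come from OMPI's lazy-wait loop, which spins on a completion flag while driving the OPAL event engine. A rough paraphrase (not the literal OMPI_LAZY_WAIT_FOR_COMPLETION macro) is sketched below; the five procs caught in epoll_pwait are in the same loop, just snapshotted inside opal_progress() rather than the sleep.

#include <stdbool.h>
#include <unistd.h>
#include "opal/runtime/opal_progress.h"   /* OMPI-internal header declaring opal_progress() */

/* Rough paraphrase of OMPI_LAZY_WAIT_FOR_COMPLETION: poll the flag, cycle the
 * event library, and sleep briefly on each pass. The usleep() here is what
 * shows up as the nanosleep()/usleep() frames in the traces above. */
static void lazy_wait_for_completion(volatile bool *active)
{
    while (*active) {
        opal_progress();   /* drives libevent; the 5 "other" procs were caught in here */
        usleep(100);       /* the frame the remaining 507 procs were caught in */
    }
}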

@lrbison
Contributor Author

lrbison commented Apr 6, 2023

Some statistics:

with #11563 I see only 71 failures after 2370 tests (3%)
without the PR, I see 533 failures after 2370 tests (22%)

@hppritcha
Member

Do you observe this if not using SGE?

@mjkoop
Contributor

mjkoop commented Apr 11, 2023

Yes, this still occurs without compiling with SGE support (at least not explicitly enabled).

@wckzhang
Contributor

I've tried to reproduce this with the v5.0.x branch and cannot. I am able to reproduce it on the main branch. I think it's highly probable that the latest pmix/prrte submodules have introduced some bugs. It's good that this isn't a 5.0.x blocker, at least.

@wckzhang
Contributor

Judging based on where it is hanging:

    if (!opal_process_info.is_singleton) {                                                                              
        /* if we executed the above fence in the background, then                                                       
         * we have to wait here for it to complete. However, there                                                      
         * is no reason to do two barriers! */                                                                          
        if (background_fence) {                                                                                         
            OMPI_LAZY_WAIT_FOR_COMPLETION(active);                                                                      
        } else if (!ompi_async_mpi_init) {                                                                              
            /* wait for everyone to reach this point - this is a hard                                                   
             * barrier requirement at this time, though we hope to relax                                                
             * it at a later point */                                                                                   
            bool flag = false;                                                                                          
            active = true;                                                                                              
            OPAL_POST_OBJECT(&active);                                                                                  
            PMIX_INFO_LOAD(&info[0], PMIX_COLLECT_DATA, &flag, PMIX_BOOL);                                              
            if (PMIX_SUCCESS != (rc = PMIx_Fence_nb(NULL, 0, info, 1,                                                   
                                                    fence_release, (void*)&active))) {                                  
                ret = opal_pmix_convert_status(rc);                                                                     
                return ompi_instance_print_error ("PMIx_Fence_nb() failed", ret);                                       
            }                                                                                                           
            OMPI_LAZY_WAIT_FOR_COMPLETION(active);   // <---- Here                                                                   
        }                                                                                                               
    }

I'd guess that something has gone wrong in PMIx_Fence_nb

@rhc54 do you think this may be a PMIx issue?
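For context on the hang mechanism: fence_release is the completion callback handed to PMIx_Fence_nb, and when the PMIx server releases the fence it clears active, the flag that OMPI_LAZY_WAIT_FOR_COMPLETION polls. A hang at the marked line therefore means the callback never fired. A paraphrase of the callback (not the literal instance.c source):

/* Paraphrase of the fence_release callback in ompi/instance/instance.c:
 * PMIx invokes it once the fence completes, and it clears the flag that
 * OMPI_LAZY_WAIT_FOR_COMPLETION spins on. */
static void fence_release(pmix_status_t status, void *cbdata)
{
    volatile bool *act = (volatile bool *) cbdata;
    (void) status;              /* the real code checks/logs the status */
    OPAL_ACQUIRE_OBJECT(act);   /* OMPI's memory-barrier wrappers around the flag */
    *act = false;               /* releases the lazy-wait loop in init */
    OPAL_POST_OBJECT(act);
}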

@rhc54
Contributor

rhc54 commented Apr 13, 2023

Could be? Could also be in PRRTE, but I don't know. I suggest you try to chase it down as the PRRTE submodule in the v5.0 branch will be catching up to what is in main very soon. You could try adding --prtemca pmix_server_verbose 5 to the cmd line and see if the local daemon is getting the "fence" upcall from the PMIx server.

@wckzhang
Contributor

Just to see, I updated the PRRTE submodule to match OMPI main and haven't hit the issue, though I don't know for sure that it's not some PRRTE issue. I'll see if I can get some time to investigate.

@rhc54
Contributor

rhc54 commented Apr 13, 2023

Sounds like it is probably a PMIx problem. We did recently make a change to the client fence code - I cannot reproduce this problem, but I'll take a look at the code.

@rhc54
Contributor

rhc54 commented Apr 13, 2023

Only thing I could find between PMIx v4.2 (what is in OMPI v5) and PMIx master (what is in OMPI main) is that the latter wasn't sorting the procs in the fence operation. I'm not sure why that would manifest the problem stated here, so I expect this isn't related. I fixed it anyway so we have consistency between the PMIx branches.

I'm afraid I'll have to rely on others to provide more debug info - like I said, I'm unable to reproduce the problem here.

@rhc54
Contributor

rhc54 commented Apr 13, 2023

One thing occurs to me. The biggest difference between PMIx v4.x and PMIx master is the new shared memory subsystem. Could you folks please add PMIX_MCA_gds=hash to your environment and run your tests again? I'd like to see if that works.

@lrbison
Contributor Author

lrbison commented Apr 13, 2023

PMIX_MCA_gds=hash

This fixes it on main! 0/178 runs failed, where previously 1/5 runs failed. Double-confirmed by re-running without the environment variable: the failures returned.

@rhc54
Contributor

rhc54 commented Apr 13, 2023

@samuelkgutierrez Can you please take a look into this? It appears that the modex information storage is having a problem. Might be we need the failover code after all, or it could be a race condition in the code where you store it.

@samuelkgutierrez
Member

I wonder if this is related to openpmix/openpmix#2967?

This certainly looks like a race condition. My understanding, though, is that the modex code in gds is single-threaded as far as the component is concerned. Perhaps there is an interaction, related to differences in behavior between hash and shmem, that is causing this in gds/shmem?

I've spent some time trying to track this hang down, but haven't found anything obvious on my end. So any help here is greatly appreciated. If memory serves me, I was able to replicate this hang with hash, but it was certainly much less frequent.

For those that are able to help take a look, this is the modex code in gds/shmem in question:

@samuelkgutierrez
Member

Judging based on where it is hanging:

    [... the instance.c fence code block quoted in the comment above ...]

I'd guess that something has gone wrong in PMIx_Fence_nb

@rhc54 do you think this may be a PMIx issue?

@wckzhang would it be possible to get a stack trace from all the threads to see where in PMIx this is happening? Thank you.

@wckzhang
Contributor

I think it should be possible; I'll try to get that, though I haven't tried reproducing with fewer than 144 ranks.

@rhc54
Contributor

rhc54 commented Apr 13, 2023

I think what @samuelkgutierrez is trying to determine is whether the client has received the job info and is hung trying to process it, or whether the client has not received the job info at all. The referenced PMIx issue indicates that it may just be one or two ranks that are actually stuck.

There is another fence at the end of MPI_Init, so it is possible that all the other ranks are sitting in that fence waiting for the "stuck" ranks to catch up. Hence, you won't see the "hello" output from anyone.

What I would do is add some output right after that referenced code that simply states "rank N has completed modex" so you can see which ranks are stuck, and then give us the full stack trace for just one of those.

@rhc54
Contributor

rhc54 commented Apr 13, 2023

If memory serves me, I was able to replicate this hang with hash, but it was certainly much less frequent.

We iterated off-list. What Sam was remembering was something completely unrelated to the MPI_Init hang issue. This seems to be something specific to the gds/shmem component, as disabling it resolves the problem. I suspect the problem is on the client side, and that the shmem component is hitting an issue that causes the client to not complete the fence operation.

I'll start looking thru the code, but the results of the above test from someone who can reproduce the problem would help a lot.

@wckzhang
Contributor

I put the print right after the referenced code; it does not print anything during the hang, so it doesn't seem like that was the case:

    if (!opal_process_info.is_singleton) {                                                                              
        /* if we executed the above fence in the background, then                                                       
         * we have to wait here for it to complete. However, there                                                      
         * is no reason to do two barriers! */                                                                          
        if (background_fence) {                                                                                         
            OMPI_LAZY_WAIT_FOR_COMPLETION(active);                                                                      
        } else if (!ompi_async_mpi_init) {                                                                              
            /* wait for everyone to reach this point - this is a hard                                                   
             * barrier requirement at this time, though we hope to relax                                                
             * it at a later point */                                                                                   
            bool flag = false;                                                                                          
            active = true;                                                                                              
            OPAL_POST_OBJECT(&active);                                                                                  
            PMIX_INFO_LOAD(&info[0], PMIX_COLLECT_DATA, &flag, PMIX_BOOL);                                              
            if (PMIX_SUCCESS != (rc = PMIx_Fence_nb(NULL, 0, info, 1,                                                   
                                                    fence_release, (void*)&active))) {                                  
                ret = opal_pmix_convert_status(rc);                                                                     
                return ompi_instance_print_error ("PMIx_Fence_nb() failed", ret);                                       
            }                                                                                                           
            OMPI_LAZY_WAIT_FOR_COMPLETION(active);   // <---- Hangs Here                                                                   
        }                                                                                                               
    }
/* Added prints here */
int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
printf("Rank %d has completed the fence.\n", rank);

@wckzhang
Contributor

wckzhang commented Apr 14, 2023

Added --prtemca pmix_server_verbose 5 but that doesn't seem to give any useful information:

[ip-172-31-20-101:10470] [prterun-ip-172-31-20-101-10470@0,0] spawn upcalled on behalf of proc prterun-ip-172-31-20-101-10470@0:0 with 8 job infos
[ip-172-31-20-101:10470] [prterun-ip-172-31-20-101-10470@0,0] spawn called from proc [prterun-ip-172-31-20-101-10470@0,0] with 1 apps
[ip-172-31-20-101:10470] [prterun-ip-172-31-20-101-10470@0,0] register nspace for prterun-ip-172-31-20-101-10470@1
[ip-172-31-20-101:10470] UUID: ipv4://06:e9:bb:c0:ae:7f OSNAME: eth0 TYPE: NETWORK MIND: 11 MAXD: 11
[ip-172-31-20-101:10470] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: rdmap0s6 TYPE: OPENFABRICS MIND: 11 MAXD: 11
[ip-172-31-20-101:10470] UUID: ipv4://06:e9:bb:c0:ae:7f OSNAME: eth0 TYPE: NETWORK MIND: 11 MAXD: 11
[ip-172-31-20-101:10470] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: rdmap0s6 TYPE: OPENFABRICS MIND: 11 MAXD: 11
[ip-172-31-20-101:10470] UUID: ipv4://06:e9:bb:c0:ae:7f OSNAME: eth0 TYPE: NETWORK MIND: 11 MAXD: 11
[ip-172-31-20-101:10470] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: rdmap0s6 TYPE: OPENFABRICS MIND: 11 MAXD: 11
[ip-172-31-20-101:10470] UUID: ipv4://06:e9:bb:c0:ae:7f OSNAME: eth0 TYPE: NETWORK MIND: 11 MAXD: 11
[ip-172-31-20-101:10470] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: rdmap0s6 TYPE: OPENFABRICS MIND: 11 MAXD: 11
[ip-172-31-20-101:10470] UUID: ipv4://06:e9:bb:c0:ae:7f OSNAME: eth0 TYPE: NETWORK MIND: 11 MAXD: 11
[ip-172-31-20-101:10470] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: rdmap0s6 TYPE: OPENFABRICS MIND: 11 MAXD: 11
[ip-172-31-20-101:10470] UUID: ipv4://06:e9:bb:c0:ae:7f OSNAME: eth0 TYPE: NETWORK MIND: 11 MAXD: 11
[ip-172-31-20-101:10470] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: rdmap0s6 TYPE: OPENFABRICS MIND: 11 MAXD: 11
[ip-172-31-20-101:10470] UUID: ipv4://06:e9:bb:c0:ae:7f OSNAME: eth0 TYPE: NETWORK MIND: 11 MAXD: 11
[ip-172-31-20-101:10470] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: rdmap0s6 TYPE: OPENFABRICS MIND: 11 MAXD: 11
[ip-172-31-20-101:10470] UUID: ipv4://06:e9:bb:c0:ae:7f OSNAME: eth0 TYPE: NETWORK MIND: 11 MAXD: 11
[ip-172-31-20-101:10470] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: rdmap0s6 TYPE: OPENFABRICS MIND: 11 MAXD: 11

/* Hangs here */

@rhc54
Contributor

rhc54 commented Apr 14, 2023

That is....totally unexpected. However, I went and checked and there is sadly no debug output in the "fence" upcall. Sigh. I'll fix that for next time.

I'm also going to add some output in the shmem component to bark when it runs into trouble. Will update when it is ready.

@rhc54
Contributor

rhc54 commented Apr 14, 2023

Okay, I would deeply appreciate it if you could git pull both the PMIx and PRRTE submodules master branches:

$ cd 3rd-party/openpmix
$ git checkout master
$ git pull
$ cd ../prrte
$ git checkout master
$ git pull

You may have to do a git submodule update after each git pull as the submodules each have their own submodules that were updated recently. You'll then have to rerun autogen.pl, reconfigure, and rebuild.

Once you have that done, please add the following to your respective mpirun cmd line:

--prtemca pmix_server_verbose 5 --pmixmca pmix_client_force_debug_verbose 5 --leave-session-attached

On a "good" run, I expect you will see something like the following output (amongst other things):

From each daemon:

[rhc-node01:61350] [prterun-rhc-node01-61350@0,0] FENCE UPCALLED ON NODE <hostname>

From each client:

[rhc-node02:00168] client:unpack fence received status SUCCESS
[rhc-node02:00168] gds:shmem recv modex complete called
[rhc-node02:00168] gds:shmem connection complete with status SUCCESS
[rhc-node02:00168] client:unpack fence modex complete with status SUCCESS

On a "bad" run, I expect you will see the same output from the daemons. However, on at least one node, you should see at least one proc that hangs after emitting:

[rhc-node02:00168] client:unpack fence received status SUCCESS
[rhc-node02:00168] gds:shmem recv modex complete called
...no further output

This will confirm that the hang is due to some problem with attaching to the shmem region after completing the modex. If you could then attach to one of those processes, it would help to get a full stack trace from it showing all threads.

@wckzhang
Contributor

Doing it, but it is taking a while to reproduce. I suspect that's because, from what Luke said above about #11563, the latest pmix/prrte pointers only have a 3% failure rate compared to the 22% failure rate on the OMPI main branch.

@wckzhang
Contributor

This was the (tail) output of one hang:

[queue-c5n18xlarge-st-c5n18xlarge-3:08575] UUID: ipv4://06:9f:0a:ba:3c:47 OSNAME: eth0 TYPE: NETWORK MIND: 11 MAXD: 11
[queue-c5n18xlarge-st-c5n18xlarge-3:08575] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: rdmap0s6 TYPE: OPENFABRICS MIND: 11 MAXD: 11
[queue-c5n18xlarge-st-c5n18xlarge-4:09131] [prterun-ip-172-31-20-101-19515@0,4] FENCE UPCALLED ON NODE queue-c5n18xlarge-st-c5n18xlarge-4
[queue-c5n18xlarge-st-c5n18xlarge-3:08575] [prterun-ip-172-31-20-101-19515@0,3] FENCE UPCALLED ON NODE queue-c5n18xlarge-st-c5n18xlarge-3
[queue-c5n18xlarge-st-c5n18xlarge-1:02052] [prterun-ip-172-31-20-101-19515@0,1] FENCE UPCALLED ON NODE queue-c5n18xlarge-st-c5n18xlarge-1
/* Hangs */
/* Output when I enter newline in cmd */
[ip-172-31-20-101:19515] [prterun-ip-172-31-20-101-19515@0,0] pmix_server_stdin_push to dest [prterun-ip-172-31-20-101-19515@1,0]: size 1
[ip-172-31-20-101:19515] [prterun-ip-172-31-20-101-19515@0,0] pmix_server_stdin_push to dest [prterun-ip-172-31-20-101-19515@1,0]: size 1
[ip-172-31-20-101:19515] [prterun-ip-172-31-20-101-19515@0,0] pmix_server_stdin_push to dest [prterun-ip-172-31-20-101-19515@1,0]: size 1
[ip-172-31-20-101:19515] [prterun-ip-172-31-20-101-19515@0,0] pmix_server_stdin_push to dest [prterun-ip-172-31-20-101-19515@1,0]: size 1

@wckzhang
Contributor

I just noticed that only 3 nodes are calling the fence upcall; one node is missing.

@wckzhang
Contributor

After adding some more prints, when I checked GDB the hang was in this section of code this time:

    if (!opal_process_info.is_singleton) {                                                                              
        if (opal_pmix_base_async_modex) {                                                                               
            /* if we are doing an async modex, but we are collecting all                                                
             * data, then execute the non-blocking modex in the background.                                             
             * All calls to modex_recv will be cached until the background                                              
             * modex completes. If collect_all_data is false, then we skip                                              
             * the fence completely and retrieve data on-demand from the                                                
             * source node.                                                                                             
             */                                                                                                         
            if (opal_pmix_collect_all_data) {                                                                           
                /* execute the fence_nb in the background to collect                                                    
                 * the data */                                                                                          
                background_fence = true;                                                                                
                active = true;                                                                                          
                OPAL_POST_OBJECT(&active);                                                                              
                PMIX_INFO_LOAD(&info[0], PMIX_COLLECT_DATA, &opal_pmix_collect_all_data, PMIX_BOOL);                    
                if( PMIX_SUCCESS != (rc = PMIx_Fence_nb(NULL, 0, NULL, 0,                                               
                                                        fence_release,                                                  
                                                        (void*)&active))) {                                             
                    ret = opal_pmix_convert_status(rc);                                                                 
                    error = "PMIx_Fence_nb() failed";                                                                   
                    goto error;                                                                                         
                }                                                                                                       
            }                                                                                                           
        } else {                                                                                                        
            /* we want to do the modex - we block at this point, but we must                                            
             * do so in a manner that allows us to call opal_progress so our                                            
             * event library can be cycled as we have tied PMIx to that                                                 
             * event base */                                                                                            
            active = true;                                                                                              
            OPAL_POST_OBJECT(&active);                                                                                  
            PMIX_INFO_LOAD(&info[0], PMIX_COLLECT_DATA, &opal_pmix_collect_all_data, PMIX_BOOL);                        
            rc = PMIx_Fence_nb(NULL, 0, info, 1, fence_release, (void*)&active);                                        
            if( PMIX_SUCCESS != rc) {                                                                                   
                ret = opal_pmix_convert_status(rc);                                                                     
                error = "PMIx_Fence() failed";                                                                          
                goto error;                                                                                             
            }                                                                                                           
            /* cannot just wait on thread as we need to call opal_progress */                                           
            OMPI_LAZY_WAIT_FOR_COMPLETION(active);                                <--- hangs here                                     
        }                                                                                                               
    }                                               

I added some prints before and after this fence as well; now trying to reproduce again.

@wckzhang
Contributor

I guess the rank is not instantiated at this point in instance.c, as getting the rank from COMM_WORLD only returns 0.

I was, however, able to check that on a hang there is one fewer process calling PMIx_Fence at this location (143 procs) than on a successful run (144 procs). So one proc is missing when it comes time to call the fence.

@rhc54
Contributor

rhc54 commented Apr 14, 2023

Let me see if I understand what you are saying. You put some prints around the call to PMIx_Fence_nb. When the app hangs, you are seeing that at least one rank never actually calls PMIx_Fence_nb - it hangs somewhere before that point.

If you set PMIX_MCA_gds=hash (thereby disabling the shmem component), the problem disappears.

Is that an accurate summary? If so, then it sounds like the next step would be to put some print statements above this point and see if we can localize where the hang occurs.

I guess the rank is not instantiated at this point in instance.c, as getting the rank from COMM_WORLD only returns 0.

That is highly suspicious as the rank is set when you init the RTE, which must be long before you get to a Fence.

@wckzhang
Contributor

wckzhang commented Apr 14, 2023

You put some prints around the call to PMIx_Fence_nb.

Yes

When the app hangs, you are seeing that at least one rank never actually calls PMIx_Fence_nb

Yes

it hangs somewhere before that point.

Not sure if it hangs or something else is occurring.

If you set PMIX_MCA_gds=hash (thereby disabling the shmem component), the problem disappears.

Yes (or --mca gds ^shmem)

That is highly suspicious as the rank is set when you init the RTE, which must be long before you get to a Fence.

All the prints look like:

Rank 0 calling fence
Rank 0 waiting on fence
Rank 0 calling fence
Rank 0 waiting on fence
Rank 0 calling fence
Rank 0 waiting on fence
Rank 0 calling fence
Rank 0 waiting on fence
Rank 0 calling fence
Rank 0 waiting on fence
Rank 0 calling fence

This is the diff for the prints:

+++ b/ompi/instance/instance.c
@@ -537,7 +537,8 @@ static int ompi_mpi_instance_init_common (int argc, char **argv)
         OMPI_TIMING_NEXT("pmix-barrier-2");
     }
 #endif
-
+    int rank;
+   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (! opal_process_info.is_singleton) {
         if (opal_pmix_base_async_modex) {
             /* if we are doing an async modex, but we are collecting all
@@ -569,13 +570,16 @@ static int ompi_mpi_instance_init_common (int argc, char **argv)
             active = true;
             OPAL_POST_OBJECT(&active);
             PMIX_INFO_LOAD(&info[0], PMIX_COLLECT_DATA, &opal_pmix_collect_all_data, PMIX_BOOL);
+           printf("Rank %d calling fence\n", rank);
             rc = PMIx_Fence_nb(NULL, 0, info, 1, fence_release, (void*)&active);
             if( PMIX_SUCCESS != rc) {
                 ret = opal_pmix_convert_status(rc);
                 return ompi_instance_print_error ("PMIx_Fence() failed", ret);
             }
             /* cannot just wait on thread as we need to call opal_progress */
+           printf("Rank %d waiting on fence\n", rank);
             OMPI_LAZY_WAIT_FOR_COMPLETION(active);
+           printf("Rank %d completed fence\n", rank);
         }
     }

@rhc54
Contributor

rhc54 commented Apr 14, 2023

Very odd that you get rank=0 on every proc - it sounds like it thinks you are starting a bunch of singletons? I don't know why else the rank would be set to 0.

Is the rank=0 on runs that don't hang? I'm wondering if that might be a clue to what is happening here.

@wckzhang
Contributor

wckzhang commented Apr 14, 2023

rank=0 is on all runs, so I don't think it's related; plus, these prints occur inside this if statement:

    if (! opal_process_info.is_singleton) {

It's only in this function that the rank is 0; when I put prints in the ompi_instance_init functions, they properly print out the correct rank. I assumed the rank just wasn't set at that point.

@rhc54
Contributor

rhc54 commented Apr 14, 2023

Okay, makes no sense to me, but let's set it aside for now. Can you add print statements in the code to get an idea of where it hangs?

@bosilca
Member

bosilca commented Apr 14, 2023

It makes sense once you understand what is going on. The OMPI object for MPI_COMM_WORLD is static and therefore initialized with rank = 0 on all nodes. As the printf happens very early in the process, it is normal that all processes show rank 0, simply because the predefined objects are not yet initialized.

I noticed this a few days ago when I was looking into why add_procs added a bunch of rank 0s (in MPI_COMM_WORLD). The root cause was that in the sessions code, the PML/BML initialization is triggered very early in the process, before the predefined communicators are correctly set up.
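A practical note following from this, offered as a sketch rather than existing code: this early in init the reliable rank is the one PMIx filled in, not MPI_COMM_WORLD's, so the debug prints can be tagged with the PMIx proc ID instead. opal_process_info.myprocid is the pmix_proc_t that PMIx_Init() populates (see the PMIx_Init call quoted later in this thread):

    /* Sketch only: tag early debug output with the PMIx-provided rank rather
     * than MPI_COMM_WORLD, which is still the static default (rank 0) here. */
    printf("proc %u calling fence\n", (unsigned) opal_process_info.myprocid.rank);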

@wckzhang
Contributor

The good news is that I printed a statement at the start of the function and it printed 144 times, so the error should be contained within this one function. I should be able to narrow it down from here.

@wckzhang
Contributor

I think my prints are affecting the timing of the hang, so I might have to reduce the number of prints to reproduce the issue.

@rhc54
Contributor

rhc54 commented Apr 14, 2023

It makes sense once you understand what is going on. The OMPI object for MPI_COMM_WORLD is static and therefore initialized with rank = 0 on all nodes. As the printf happens very early in the process, it is normal that all processes show rank 0, simply because the predefined objects are not yet initialized.

I noticed this a few days ago when I was looking into why add_procs added a bunch of rank 0s (in MPI_COMM_WORLD). The root cause was that in the sessions code, the PML/BML initialization is triggered very early in the process, before the predefined communicators are correctly set up.

I see - thanks for the explanation! The comm_world rank is provided by PMIx_Init, which has to occur before anything in this file can take place. Would it make sense and alleviate confusion to assign the comm_world rank at that point?

@hppritcha
Member

Predefined communicators aren't set up if an application is only using sessions.

@hppritcha
Member

May be time to look more at #11451?

@rhc54
Contributor

rhc54 commented Apr 15, 2023

Are those fences occurring before this one? We need to ensure that we are getting thru all of the fences. If we block on one, then we won't get to the next.

@wckzhang
Contributor

It looks like it's hanging somewhere in this code block. Any prints I put in here seem to make it much harder to reproduce. I suspect it's stuck in ompi_rte_init:

    ret = mca_base_var_find("opal", "event", "*", "event_include");
    if (ret >= 0) {
        char *allvalue = "all";
        /* We have to explicitly "set" the MCA param value here
           because libevent initialization will re-register the MCA
           param and therefore override the default. Setting the value
           here puts the desired value ("all") in different storage
           that is not overwritten if/when the MCA param is
           re-registered. This is unless the user has specified a different
           value for this MCA parameter. Make sure we check to see if the
           default is specified before forcing "all" in case that is not what
           the user desires. Note that we do *NOT* set this value as an
           environment variable, just so that it won't be inherited by
           any spawned processes and potentially cause unintended
           side-effects with launching RTE tools... */
        mca_base_var_set_value(ret, allvalue, 4, MCA_BASE_VAR_SOURCE_DEFAULT, NULL);
    }

    OMPI_TIMING_NEXT("initialization");

    /* Setup RTE */
    if (OMPI_SUCCESS != (ret = ompi_rte_init (&argc, &argv))) {
            printf("EXTREME ERROR!\n");
        return ompi_instance_print_error ("ompi_mpi_init: ompi_rte_init failed", ret);
    }

Briefly looking into the function, it does PMIx init and various other operations, so it looks suspicious. However, considering that adding more prints seems to make the issue harder to reproduce, I'm going to just manually attach to all 144 processes and see if one of them is doing something different.

@rhc54
Contributor

rhc54 commented Apr 15, 2023

If you can find where in that function it blocks, it would help a lot. I'm betting it's in PMIx_Init, and the problem is that the shmem component cannot attach, which is why disabling that component makes the problem go away. However, it could be that you get a little further and we hang while trying to retrieve some piece of info.

@wckzhang
Contributor

I couldn't find any different stack trace. There should be 36 procs per node, but it looks like on the hanging node (the one without a fence upcall) there are only 35 procs. Could something be happening that kills a process in PMIx?

@rhc54
Contributor

rhc54 commented Apr 15, 2023

It shouldn't happen, but I cannot say that it is impossible. What I'd do is put print statements around the PMIx_Init call and see if you always exit it cleanly. If you see the print going in, but don't see a print coming out and the proc is gone - that's a sure sign we are somehow calling exit or causing you to segfault. Puzzling thing: PRRTE should see either of those situations and abort the job.

@rhc54
Contributor

rhc54 commented Apr 15, 2023

I suppose it is possible we are exiting with zero status - PRRTE would think that is normal and do nothing about it. I'll see if we can add some debug to help detect that situation should you find we are not returning from PMIx_Init and the proc is gone. If the proc is still there, then we are likely hung and PRRTE won't know about it.

@wckzhang
Contributor

Yes, I'm pretty sure PMIx_Init is killing a proc. It doesn't seem to be just hanging; there's a proc missing on one node each time, and I confirmed it with these prints:

 /* initialize the selected module */
    printf("Entering pmix init\n");
    if (!PMIx_Initialized() && (PMIX_SUCCESS != (ret = PMIx_Init(&opal_process_info.myprocid, NULL, 0)))) {
        printf("PMIx_Init errored\n");
    .
    .
    .
    else {
        printf("Exiting pmix init successfully\n");
    }

The "Entering" print appears 144 times, the error print 0 times, and the exit print 143 times. There is also a missing proc when I look at top, but only on the node without a fence upcall.

@rhc54
Contributor

rhc54 commented Apr 15, 2023

Okay, that helps a bunch - thanks! I'm going to disable the gds/shmem component until someone has time to address this problem. I'll post a submodule update to pick up the change and report it here, as I lack commit privileges for OMPI.

@rhc54
Contributor

rhc54 commented Apr 15, 2023

Okay, the update is here

@samuelkgutierrez
Member

samuelkgutierrez commented Apr 15, 2023

I suppose it is possible we are exiting with zero status - PRRTE would think that is normal and do nothing about it.

A straightforward way to test this might include registering an atexit() function before PMIx_Init() is called that emits useful debugging information.
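A minimal sketch of that idea, assuming nothing beyond libc; the probe name and message are hypothetical, not existing OMPI/PMIx code:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical exit probe: registered before PMIx_Init() so that a process
 * which quietly reaches exit() during init still leaves a marker (and its
 * PID) in the job output. */
static void exit_probe(void)
{
    fprintf(stderr, "[pid %ld] exit() reached -- process terminating\n",
            (long) getpid());
}

int main(void)
{
    atexit(exit_probe);   /* register before PMIx_Init() is called */
    /* ... PMIx_Init(&myproc, NULL, 0) and the rest of init would follow ... */
    return 0;
}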

@samuelkgutierrez
Member

I've been thinking about this some. Here are some general thoughts and questions:

  • The gds/shmem code does a good job of checking return codes and logging errors when they occur. Exceptional paths are logged with either PMIX_ERROR_LOG() or pmix_show_help(). In the past, failures in this component have been easy to identify because of this. I wonder what is different in this case? If a failure is occurring in gds/shmem, why aren't error messages being emitted?
  • Another possibility is that processes are terminating because of a particularly nasty bug that is causing a SIGBUS or a SIGSEGV. In my experience, those are relatively conspicuous and typically hard to miss. Are there any signs that this is happening? (A minimal crash-probe sketch follows below.)
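A minimal crash-probe sketch for the second point, assuming nothing beyond libc; the handler and message are hypothetical, not existing PMIx code. It makes a silent SIGBUS/SIGSEGV conspicuous while preserving the default fatal behavior:

#include <signal.h>
#include <unistd.h>

/* Hypothetical crash probe: emit a marker using only async-signal-safe calls,
 * then restore the default handler and re-raise so the process still dies
 * with the original signal (and the runtime sees the abnormal exit). */
static void crash_probe(int sig)
{
    static const char msg[] = "caught fatal signal during init\n";
    ssize_t rc = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void) rc;
    signal(sig, SIG_DFL);
    raise(sig);
}

int main(void)
{
    signal(SIGBUS, crash_probe);    /* install before PMIx_Init() */
    signal(SIGSEGV, crash_probe);
    /* ... PMIx_Init() and the rest of the client would follow ... */
    return 0;
}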

@rhc54
Contributor

rhc54 commented Apr 16, 2023

The lack of any indicators is indeed puzzling. If the process was terminating abnormally (e.g., SIGBUS), then PRRTE would (a) bark loudly, and (b) terminate the entire job. It wouldn't be allowed to hang. Likewise, if at least one local process calls PMIx_Init, then PRRTE marks the job as requiring formal finalization - i.e., all processes are required to call PMIx_Finalize prior to exiting. If any process fails to do so, then PRRTE will declare the job to be abnormally terminated, bark loudly, and kill the entire job.

I see a few possibilities:

  • it is possible that the process that terminates manages to do so prior to any other local process calling PMIx_Init. In this scenario, the job will not have been marked as requiring formal finalization, and so PRRTE will let that process go by without taking any further action. First indication would indeed be in a hang during the first fence. This would fit the fairly low incident rate, though it begs the question as to why the proc is terminating.
  • PMIx currently requires that every local process in a job use the same gds component - we only track the active component on a per-job basis. If one local process were to reject the shmem component, but the others select it, then that would create a problem. I'm not entirely sure how that would manifest inside PMIx - it could well cause a problem that falls outside our typical error paths.
  • there may be nothing wrong with the shmem code logic, but it could be causing memory corruption that results in a process semi-randomly stepping on a bad place. Depending on what gets stepped on, we may just cause the proc to terminate. I consider this unlikely as one would expect PRRTE to see something abnormal (exit status or something) and we aren't seeing that, but I can't rule it out.

The fact that the problem goes away with gds/shmem disabled would seem to indicate that it is somehow related to that component being present. Just not clear precisely how it is involved.

@rhc54
Contributor

rhc54 commented Apr 17, 2023

Another thing that might help track this down: it could be helpful to strip this down to the bare minimum. Use an application that consists of just PMIx_Init, PMIx_Fence, and PMIx_Finalize (see the sketch below). This will run much faster than the MPI version and remove all the MPI code overhead from the equation. If you can cycle that application without hitting problems, then the issue is something to do with the interaction with OMPI and not just something inside of PMIx.

That doesn't rule out PMIx having a problem with the shmem rendezvous, as the address spaces might collide. But it would rule out a problem in the PMIx code itself.
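A minimal sketch of such a test app, assuming the PMIx headers and library are available (compile against libpmix and launch with prterun); error handling is kept to the bare minimum:

#include <stdio.h>
#include <pmix.h>

int main(void)
{
    pmix_proc_t myproc;
    pmix_status_t rc;

    rc = PMIx_Init(&myproc, NULL, 0);
    if (PMIX_SUCCESS != rc) {
        fprintf(stderr, "PMIx_Init failed: %s\n", PMIx_Error_string(rc));
        return 1;
    }

    /* blocking fence across all procs in this namespace */
    rc = PMIx_Fence(NULL, 0, NULL, 0);
    if (PMIX_SUCCESS != rc) {
        fprintf(stderr, "%s.%u: PMIx_Fence failed: %s\n",
                myproc.nspace, myproc.rank, PMIx_Error_string(rc));
    }

    rc = PMIx_Finalize(NULL, 0);
    return (PMIX_SUCCESS == rc) ? 0 : 1;
}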

@rhc54
Contributor

rhc54 commented Apr 17, 2023

Okay, I went ahead and ran a PMIx_Init, PMIx_Fence, and PMIx_Finalize app on a continuous cycle using a single VM, heavily oversubscribed to ensure things were loaded. Without the gds/shmem component, it ran forever without a problem. However, with the gds/shmem component active, it hung after about 530 iterations.

Biggest difference: I did not see any lost process. All procs were present. I checked prterun and it was sitting patiently waiting for events. I also checked a couple of procs and found them in PMIx_Fence waiting for release. I didn't check them all to see if one was hung in PMIx_Init.

FWIW: my cmd line was prterun -n 44 --map-by :oversubscribe simple where simple was the trivial app described above. This VM is pretty weak, so 44 procs was a significant burden on it.

@samuelkgutierrez Perhaps that formula will help you reproduce it?

@samuelkgutierrez
Member

Thank you, @rhc54. This is certainly helpful. I'm debugging as we speak.

@samuelkgutierrez
Member

Hi, @wckzhang. Can you please do me a favor? I've found the likely cause of your hangs using shmem. I've updated the way we do shared-memory segment reference counting and the fix is in OpenPMIx master (openpmix/openpmix#3051).

When you have a moment, could you please retest using both OpenPMIx and PRRTE master? The only thing is that you'll have to unignore the shmem component before running autogen.pl by adding your user moniker to src/mca/gds/shmem/.pmix_unignore.

Thank you and please let me know if you have any questions.

@samuelkgutierrez
Member

@wckzhang and I spoke offline. He was kind enough to give me access to his environment. I can verify that after 200 test runs I did not see any hangs. Without the fix in place, the program hung after about 25 iterations. @rhc54 looks like openpmix/openpmix#3051 fixes this issue.

@wckzhang
Contributor

Fixed by openpmix/openpmix#3051 - closing.
