ompi_request_wait() hang @ coll module enable (in threaded application?) #9780

Closed
gkatev opened this issue Dec 21, 2021 · 13 comments

@gkatev
Contributor

gkatev commented Dec 21, 2021

Hi, I'm encountering an issue with point-to-point messages in a collective module's _module_enable(), in a specific application (whose code is not too easy to navigate...)

Environment:

Single-node, CentOS 8, aarch64
Open MPI v5.0.0rc2, built from git, release build
Tensorflow 2.7.0, Horovod 0.23.0

The application is Horovod's synthetic tensorflow2 benchmark.

I've been looking into producing minimal reproducible code, but I'm not yet there. I was looking for some feedback before heading down a potential rabbit hole. Is it even "legal" to call PML methods in a coll-module's enable function?

I am able to reproduce the issue, with the specific application, by adding something like this to the enable function (e.g. in coll/basic's, in this case):

#include "ompi/mca/pml/pml.h"

static int test_recv_wait(mca_coll_base_module_t *module,
		struct ompi_communicator_t *comm) {
	
	int rank = ompi_comm_rank(comm);
	
	char buf[10];
	int ret;
	
	printf("COMM %s\n", comm->c_name);
	
	if(rank == 0) {
		printf("rank %d sending to rank %d, tag %d on comm %s\n",
			rank, 1, 345345, comm->c_name);
		
		ret = MCA_PML_CALL(send(buf, 10, MPI_BYTE, 1, 345345,
			MCA_PML_BASE_SEND_STANDARD, comm));
		
		if(ret != OMPI_SUCCESS) {
			printf("PML CALL ERROR\n");
			return ret;
		}
		
		printf("SEND DONE\n");
	}
	
	if(rank == 1) {
		printf("rank %d receiving from rank %d, tag %d on comm %s\n",
			rank, 0, 345345, comm->c_name);
		
		// ret = MCA_PML_CALL(recv(buf, 10,
			// MPI_BYTE, 0, 345345, comm, MPI_STATUS_IGNORE));
		
		ompi_request_t *req;
		
		ret = MCA_PML_CALL(irecv(buf, 10, MPI_BYTE,
			0, 345345, comm, &req));
		
		if(ret != OMPI_SUCCESS) {
			printf("PML CALL ERROR\n");
			return ret;
		}
		
		printf("IRECV DONE\n");
		
		// int completed = 0;
		// while(!completed)
			// ompi_request_test(&req, &completed, MPI_STATUS_IGNORE);
		
		// while(!req->req_complete)
			// opal_progress();
		
		ompi_request_wait(&req, MPI_STATUS_IGNORE);
		
		printf("RECEIVED\n");
	}
	
	return OMPI_SUCCESS;
}

/*
 * Init module on the communicator
 */
int
mca_coll_basic_module_enable(mca_coll_base_module_t *module,
                             struct ompi_communicator_t *comm)
{
    /* prepare the placeholder for the array of request* */
    module->base_data = OBJ_NEW(mca_coll_base_comm_t);
    if (NULL == module->base_data) {
        return OMPI_ERROR;
    }
    
    if(ompi_comm_size(comm) > 1) {
		if(test_recv_wait(module, comm) != OMPI_SUCCESS)
			return OMPI_ERROR;
	}
	
    /* All done */
    return OMPI_SUCCESS;
}

The process with rank 1 hangs at the ompi_request_wait call. The same happens if I use the recv code instead of the irecv one. But everything works okay with the ompi_request_test and the req->req_complete loops. The application appears to be using threads -- not sure if this is directly related, if at all.

I'm not familiar with the requests' code, but with some quick prints the process appears to hang here:

SYNC_WAIT(&sync);

Some example debug output from prints in the code above and in ompi_request_wait_completion() (note also that the hang happens on the dup-ed communicator, not on MPI_COMM_WORLD):

@1,1]<stdout>:COMM MPI_COMM_WORLD
@1,1]<stdout>:rank 1 receiving from rank 0, tag 345345 on comm MPI_COMM_WORLD
@1,1]<stdout>:IRECV DONE
@1,1]<stdout>:Entered ompi_request_wait_completion()
@1,1]<stdout>:Before SYNC_WAIT()
@1,0]<stdout>:COMM MPI_COMM_WORLD
@1,0]<stdout>:rank 0 sending to rank 1, tag 345345 on comm MPI_COMM_WORLD
@1,0]<stdout>:SEND DONE
@1,0]<stdout>:Entered ompi_request_wait_completion()
@1,0]<stdout>:Before SYNC_WAIT()
@1,1]<stdout>:After SYNC_WAIT()
@1,1]<stdout>:RECEIVED
@1,1]<stdout>:Entered ompi_request_wait_completion()
@1,1]<stdout>:Before SYNC_WAIT()
@1,1]<stdout>:After SYNC_WAIT()
@1,0]<stdout>:After SYNC_WAIT()
@1,0]<stdout>:Entered ompi_request_wait_completion()
@1,0]<stdout>:Before SYNC_WAIT()
@1,0]<stdout>:COMM MPI COMMUNICATOR 3 DUP FROM 0
@1,0]<stdout>:rank 0 sending to rank 1, tag 345345 on comm MPI COMMUNICATOR 3 DUP FROM 0
@1,0]<stdout>:SEND DONE
@1,0]<stdout>:After SYNC_WAIT()
@1,1]<stdout>:Entered ompi_request_wait_completion()
@1,1]<stdout>:Before SYNC_WAIT()
@1,1]<stdout>:COMM MPI COMMUNICATOR 3 DUP FROM 0
@1,1]<stdout>:rank 1 receiving from rank 0, tag 345345 on comm MPI COMMUNICATOR 3 DUP FROM 0
@1,1]<stdout>:IRECV DONE
@1,1]<stdout>:Entered ompi_request_wait_completion()
@1,1]<stdout>:Before SYNC_WAIT()
@1,0]<stdout>:Entered ompi_request_wait_completion()
@1,0]<stdout>:Before SYNC_WAIT()
<hang>

I will see if I can get something more reproducible application-wise. Though someone who does have horovod/tensorflow installed might (/should?) also be able to reproduce it. In the meantime any insight is appreciated, and I am available for further testing.

@gkatev gkatev changed the title ompi_request_wait() hang @ coll module enable, in threaded application ompi_request_wait() hang @ coll module enable (in threaded application?) Dec 21, 2021
@devreal
Contributor

devreal commented Dec 21, 2021

There seems to be something going on with multiple threads:

@1,0]<stdout>:COMM MPI COMMUNICATOR 3 DUP FROM 0
[...]
@1,1]<stdout>:COMM MPI COMMUNICATOR 3 DUP FROM 0
[...]
@1,1]<stdout>:Entered ompi_request_wait_completion()
@1,1]<stdout>:Before SYNC_WAIT()
@1,0]<stdout>:Entered ompi_request_wait_completion()
@1,0]<stdout>:Before SYNC_WAIT()

Can you attach gdb and see if two threads are stuck? And also check that MPI is properly initialized with MPI_THREAD_MULTIPLE and that two threads are not issuing collectives on the same communicator?
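
Something along these lines on the hung rank should show it (roughly):

$ gdb -p <pid of the hung rank>
(gdb) thread apply all bt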

@bosilca
Member

bosilca commented Dec 21, 2021

The SYNC_WAIT is the OMPI way of waiting for completion of events, including communications. Thus, any lost or incorrectly matched/unsatisfied receive will block in this function, even though the wait_sync itself might not be the issue. @devreal is right, something is fishy in your output, as if the first process, after doing the send, went into the receive code but without printing the "receiving from rank" message. Is this the entire output or partial?

@gkatev
Contributor Author

gkatev commented Dec 21, 2021

As far as I can tell, process 0 could indeed have gone into receive mode after the send on the dup-ed comm, just not in the code that I inserted for debugging. The "rank xx sending/receiving..." message is printed in the module's enable method, not in the PML's send/recv handler. Indeed, the output might be lacking something and making the situation look different than it actually is -- it is the entire output.

I will check the threads' states more thoroughly. I do remember seeing that one of the two processes was at 100% CPU while the other was at 0%, for what that's worth. @devreal are you saying that something is going on with threads based on the output?

What I initially focused on in the output:

@1,0]<stdout>:COMM MPI COMMUNICATOR 3 DUP FROM 0
@1,0]<stdout>:rank 0 sending to rank 1, tag 345345 on comm MPI COMMUNICATOR 3 DUP FROM 0
@1,0]<stdout>:SEND DONE
@1,1]<stdout>:COMM MPI COMMUNICATOR 3 DUP FROM 0
@1,1]<stdout>:rank 1 receiving from rank 0, tag 345345 on comm MPI COMMUNICATOR 3 DUP FROM 0
@1,1]<stdout>:IRECV DONE
@1,1]<stdout>:Entered ompi_request_wait_completion()
@1,1]<stdout>:Before SYNC_WAIT()

Which can be interpreted as rank 0 sending and the send completing, while rank 1 is stuck waiting. Perhaps rank 0 then enters receive mode, but its peer (rank 1) never sends anything since it's stuck. This is further supported by the fact that it seems to work with the test or the req_complete loop.

However, it is possible that the output is incomplete, and that rank 0's send gets matched with a different/wild receive somewhere... (but could it, if we are still in the comm's creation?). Indeed I can't say for certain that the application is correct. (I will try to verify, though I expect tracing the python/tensorflow code to be a bit of a nightmare :-))

Since the problem triggers when the coll-module is enabled, I assumed that this all happens as a result of an MPI_Comm_dup call. Not sure if multiple threads are actually involved in this call -- it would be "illegal" if multiple threads called MPI_Comm_dup, correct?

@bosilca
Member

bosilca commented Dec 21, 2021

It would be illegal if multiple threads called MPI_Comm_dup with the same underlying communicator. Otherwise the code would be correct.

The fact that it works correctly if you replace the WAIT with a test is worrisome, as it might indicate that one of the completion signals is missed by the WAIT (the test uses the status of the request as an indication of completion, so it is a different mechanism). What communication PML/BTL are you using? If you are using the UCX PML you could try a different combination, OB1/TCP/SM (using ipoib), to see if the issue is reproducible there as well. This will give us a hint on where to look next.
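
For example, something along the lines of (adjusting the rest of the command line to your setup):

$ mpirun --mca pml ob1 --mca btl tcp,sm,self ...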

You mention coll-module several times, but this looks legitimate to me, as we need to enable some collective module. Can you be more specific and check which collective module is triggering the issue?

Btw, I never compiled tensorflow with my own MPI. Are there any specific steps to do, or do I just configure and compile with the target MPI in the PATH?

@gkatev
Contributor Author

gkatev commented Dec 21, 2021

The above tests are with ob1+vader(sm)+xpmem. I only have libnbc and basic enabled, and the debug/triggering code is inserted in mca_coll_basic_module_enable(). However, the effect is also similar with other coll components and is not specific to basic. Essentially this happens (in this specific application) when point-to-point methods are called in a component's coll_module_enable function. Tomorrow I will try to add some prints to ob1's code and make sure all sends/recvs are accounted for.

Regarding horovod/tensorflow, I did have some trouble installing it, but that was theoretically due to some arm64 packaging irregularities. If all works as it should, the process might be more straightforward. You can try these steps:

  • Create/enable a python virtualenv (not actually strictly necessary)
  • $ pip install tensorflow (or maybe tensorflow-cpu (?), I built mine from source with cuda support disabled). I specifically have version 2.7.0.
  • For horovod, HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_MPI=1 pip install horovod[tensorflow] (openmpi's stuff in PATH/LD_LIBRARY_PATH) (https://horovod.readthedocs.io/en/stable/install_include.html)
  • (watch out for pip's caching when re-installing packages...)

The application: https://github.com/horovod/horovod/blob/master/examples/tensorflow2/tensorflow2_synthetic_benchmark.py (https://horovod.readthedocs.io/en/stable/benchmarks_include.html)

Nothing too special in mpirun's command line. Eg:
$ $(which mpirun) --host <host>:2 --mca coll libnbc,basic --mca pml ob1 --mca btl vader,self --output tag python horovod/examples/tensorflow2/tensorflow2_synthetic_benchmark.py

@gkatev
Contributor Author

gkatev commented Dec 22, 2021

I am able to reproduce with this app (and with the additional triggering code in the enable function)

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
	// MPI_Init(NULL, NULL);
	
	int thread_level_provided;
	MPI_Init_thread(NULL, NULL, MPI_THREAD_MULTIPLE, &thread_level_provided);
	
	printf("Requested thread level %s\n",
		(thread_level_provided == MPI_THREAD_MULTIPLE ? "OK" : "NOT OK"));
	
	MPI_Comm dcomm;
	MPI_Comm_dup(MPI_COMM_WORLD, &dcomm);
	
	MPI_Finalize();
	
	return 0;
}
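
(Compiled and run roughly like this, with the same MCA settings as before; repro.c is just my name for the file above:)

$ mpicc repro.c -o repro
$ mpirun -np 2 --mca coll libnbc,basic --mca pml ob1 --mca btl vader,self ./repro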

Using the non-threaded Init all works ok.
Can you confirm?

@devreal
Contributor

devreal commented Dec 22, 2021

Can you provide a diff of the changes in the enable function that trigger this issue? I cannot reproduce with vanilla Open MPI and I'm not sure what changes you mean.

@gkatev
Contributor Author

gkatev commented Dec 22, 2021

Yes, sorry for the confusion, here's the diff:

diff --git a/ompi/mca/coll/basic/coll_basic_module.c b/ompi/mca/coll/basic/coll_basic_module.c
index e9f93bd..f4afbbf 100644
--- a/ompi/mca/coll/basic/coll_basic_module.c
+++ b/ompi/mca/coll/basic/coll_basic_module.c
@@ -144,6 +144,66 @@ mca_coll_basic_comm_query(struct ompi_communicator_t *comm,
     return &(basic_module->super);
 }
 
+#include "ompi/mca/pml/pml.h"
+
+static int test_recv_wait(mca_coll_base_module_t *module,
+		struct ompi_communicator_t *comm) {
+	
+	int rank = ompi_comm_rank(comm);
+	
+	char buf[10];
+	int ret;
+	
+	printf("COMM %s\n", comm->c_name);
+	
+	if(rank == 0) {
+		printf("rank %d sending to rank %d, tag %d on comm %s\n",
+			rank, 1, 345345, comm->c_name);
+		
+		ret = MCA_PML_CALL(send(buf, 10, MPI_BYTE, 1, 345345,
+			MCA_PML_BASE_SEND_STANDARD, comm));
+		
+		if(ret != OMPI_SUCCESS) {
+			printf("PML CALL ERROR\n");
+			return ret;
+		}
+		
+		printf("SEND DONE\n");
+	}
+	
+	if(rank == 1) {
+		printf("rank %d receiving from rank %d, tag %d on comm %s\n",
+			rank, 0, 345345, comm->c_name);
+		
+		// ret = MCA_PML_CALL(recv(buf, 10,
+			// MPI_BYTE, 0, 345345, comm, MPI_STATUS_IGNORE));
+		
+		ompi_request_t *req;
+		
+		ret = MCA_PML_CALL(irecv(buf, 10, MPI_BYTE,
+			0, 345345, comm, &req));
+		
+		if(ret != OMPI_SUCCESS) {
+			printf("PML CALL ERROR\n");
+			return ret;
+		}
+		
+		printf("IRECV DONE\n");
+		
+		// int completed = 0;
+		// while(!completed)
+			// ompi_request_test(&req, &completed, MPI_STATUS_IGNORE);
+		
+		// while(!req->req_complete)
+			// opal_progress();
+		
+		ompi_request_wait(&req, MPI_STATUS_IGNORE);
+		
+		printf("RECEIVED\n");
+	}
+	
+	return OMPI_SUCCESS;
+}
 
 /*
  * Init module on the communicator
@@ -157,7 +217,12 @@ mca_coll_basic_module_enable(mca_coll_base_module_t *module,
     if (NULL == module->base_data) {
         return OMPI_ERROR;
     }
-
+    
+    if(ompi_comm_size(comm) > 1) {
+		if(test_recv_wait(module, comm) != OMPI_SUCCESS)
+			return OMPI_ERROR;
+	}
+	
     /* All done */
     return OMPI_SUCCESS;
 }

@devreal
Contributor

devreal commented Dec 22, 2021

Thanks! I can reproduce the hang: the sender ends up in MPI_Finalize while the receiver is stuck in ompi_sync_wait_mt. The problem seems to be a recursive wait. The stack trace I get:

(gdb) bt
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fffffffd240) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7fffffffd248, cond=0x7fffffffd218) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x7fffffffd218, mutex=0x7fffffffd248) at pthread_cond_wait.c:638
#3  0x00007ffff783c03c in opal_thread_internal_cond_wait (p_mutex=0x7fffffffd248, p_cond=0x7fffffffd218) at ../../../../opal/mca/threads/pthreads/threads_pthreads_mutex.h:161
#4  ompi_sync_wait_mt (sync=sync@entry=0x7fffffffd210) at ../../../../opal/mca/threads/base/wait_sync.c:101
#5  0x00007ffff7c588a6 in ompi_request_wait_completion (req=0x555555697e00) at ../../ompi/request/request.h:468
#6  ompi_request_default_wait (req_ptr=0x7fffffffd2d0, status=0x0) at ../../ompi/request/req_wait.c:40
#7  0x00007ffff7d74c31 in test_recv_wait (module=<optimized out>, comm=0x5555555ab930) at ../../../../../ompi/mca/coll/basic/coll_basic_module.c:200
#8  mca_coll_basic_module_enable (module=<optimized out>, comm=0x5555555ab930) at ../../../../../ompi/mca/coll/basic/coll_basic_module.c:222
#9  mca_coll_basic_module_enable (module=<optimized out>, comm=0x5555555ab930) at ../../../../../ompi/mca/coll/basic/coll_basic_module.c:212
#10 0x00007ffff7cd245d in mca_coll_base_comm_select (comm=<optimized out>) at ../../../../ompi/mca/coll/base/coll_base_comm_select.c:139
#11 0x00007ffff7c20ad2 in ompi_comm_activate_nb_complete (request=<optimized out>) at ../../ompi/communicator/comm_cid.c:665
#12 0x00007ffff7c27136 in ompi_comm_request_progress () at ../../ompi/communicator/comm_request.c:142
#13 ompi_comm_request_progress () at ../../ompi/communicator/comm_request.c:99
#14 0x00007ffff779d414 in opal_progress () at ../../opal/runtime/opal_progress.c:224
#15 0x00007ffff783c095 in ompi_sync_wait_mt (sync=sync@entry=0x7fffffffd4c0) at ../../../../opal/mca/threads/base/wait_sync.c:121
#16 0x00007ffff7c1f6e5 in ompi_request_wait_completion (req=0x5555556a2be8) at ../../ompi/request/request.h:468
#17 0x00007ffff7c2698e in ompi_comm_activate (newcomm=newcomm@entry=0x7fffffffd5c0, comm=comm@entry=0x555555558020 <ompi_mpi_comm_world>, bridgecomm=bridgecomm@entry=0x0, 
    arg0=arg0@entry=0x0, arg1=arg1@entry=0x0, send_first=send_first@entry=false, mode=32) at ../../ompi/communicator/comm_cid.c:623
#18 0x00007ffff7c1b7f4 in ompi_comm_dup_with_info (comm=comm@entry=0x555555558020 <ompi_mpi_comm_world>, info=info@entry=0x0, newcomm=newcomm@entry=0x7fffffffd670)
    at ../../ompi/communicator/comm.c:1046
#19 0x00007ffff7c1b9be in ompi_comm_dup (comm=comm@entry=0x555555558020 <ompi_mpi_comm_world>, newcomm=newcomm@entry=0x7fffffffd670) at ../../ompi/communicator/comm.c:994
#20 0x00007ffff7c85359 in PMPI_Comm_dup (comm=0x555555558020 <ompi_mpi_comm_world>, newcomm=0x7fffffffd670) at pcomm_dup.c:73
#21 0x0000555555555245 in main ()

Notice the calls to ompi_sync_wait_mt in frame 15 and frame 4. The first wait is there because blocking comm dup is implemented in terms of non-blocking comm dup + wait. While waiting for the communicator activation to complete, its completion callback fires, leading to your additional code, which again waits for communication. However, this time the code in ompi_sync_wait_mt sees that there is already a thread actively waiting and suspends itself, unaware of the fact that it's the same thread. This behavior is specific to MPI_THREAD_MULTIPLE, where we only want one thread actively polling while all others are suspended.
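
To make the cycle concrete, here is a minimal standalone sketch (plain C with made-up names, not Open MPI code) of the same hazard: a wait primitive that allows only one active poller deadlocks as soon as the progress path it drives issues another wait on the same thread.

/* sync_wait_recursion.c -- illustrative only, NOT Open MPI code.
 * Models the "single active poller" behavior described above. */
#include <stdio.h>
#include <stdbool.h>

static bool poller_active = false;   /* only one caller may poll at a time */
static bool outer_done = false, inner_done = false;

static void progress(void);

static void sync_wait(volatile bool *completed)
{
    if (!poller_active) {
        poller_active = true;
        while (!*completed)          /* the outer wait actively drives progress */
            progress();
        poller_active = false;
    } else {
        /* The real code would block here (pthread_cond_wait), expecting the
         * active poller -- supposedly another thread -- to signal completion.
         * But the active poller is this same thread, now blocked: deadlock. */
        printf("nested wait issued from the progress path -> deadlock\n");
    }
}

static void progress(void)
{
    static bool callback_fired = false;
    /* A completion callback fires from inside progress (comparable to
     * module_enable being reached from ompi_comm_request_progress)... */
    if (!callback_fired) {
        callback_fired = true;
        sync_wait(&inner_done);      /* ...and waits again: the recursion */
    }
    outer_done = true;
}

int main(void)
{
    sync_wait(&outer_done);          /* like the blocking MPI_Comm_dup */
    return 0;
}

In the real ompi_sync_wait_mt the nested caller blocks in pthread_cond_wait (frames 0-4 above), waiting for a signal from the "active" poller -- which is the very thread that is now blocked.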

So this explains the problem with your added code. I'm not sure how such a case could be triggered without your test_recv_wait in Horovod.

I guess the answer to your initial question:

Is it even "legal" to call PML methods in a coll-module's enable function?

is No, at least not blocking functions. Now it would be interesting to see if we do that anywhere in the collective code. A stack trace of the original application would be really helpful here :)

@gkatev
Contributor Author

gkatev commented Dec 22, 2021

Ah I see, I think I mostly understand. Essentially this is not related to Horovod, it just occurred with it because apparently it uses MPI_Comm_dup + MPI_THREAD_MULTIPLE.

The underlying reason is that I use PML functions (actually, coll/base methods) during the initialization of my collectives component (in order to exchange extra required info), which happens in _module_enable. If PML calls at that time are not supported, I suppose I can do the initialization lazily the first time a collective operation is called.

Basically the problem is that wait calls opal_progress, and that in turn might result in a new (recursive) call to wait, correct? Could this occur for another reason? Maybe something like: if a non-blocking collective op is created and waited on, and its implementation (executed as a result of wait/progress) calls recv/wait, resulting in a recursive wait?

@devreal
Contributor

devreal commented Dec 22, 2021

Yes, if you're experimenting with additional communication in the collective modules you have to be careful (that was not clear from the original issue; I was assuming it was an issue with Horovod and vanilla Open MPI ;)). Let me rephrase my statement on PML usage: you can use PML operations pretty much anywhere (the coll and comm operations do that too, after all) but you have to be careful, because you're doing open-heart surgery here. Most importantly, you cannot recursively wait for completion.

Basically the problem is that wait calls opal_progress, and that in turn might result in a new (recursive) call to wait, correct? Could this occur due to another reason? Maybe something like, if a non-blocking collective op is created and waited-on, and its implemenentation (executed as a result of wait/progress) calls recv/wait, resulting in a recursive wait?

Correct.

I guess @bosilca can say a word or two about how best to handle cases where completed communication triggers more communication that needs to be waited for. Your testing cycles work (as you pointed out) but they are not really efficient. But then, this might not be relevant during communicator duplication...

@gkatev
Contributor Author

gkatev commented Dec 24, 2021

I see, yes, as it turns out, PML calls (or at least recv/wait?) cannot be used during the communicator's initialization, as an MPI_Comm_dup (or similar/other) call will trigger a hang. From a very quick look into libnbc, I think my example with non-blocking collectives is not a problem either, as the request fields appear to be checked directly (1a41482).

For my case, I switched to lazy initialization and all is okay. Thanks!
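
In case it helps someone doing something similar, the shape of what I ended up with is roughly this (a sketch only -- my_module_t, initialized and my_lazy_init are names from my own component, not upstream code):

/* Lazy-initialization sketch: module_enable() no longer touches the PML;
 * the one-time exchange happens on the first collective call, i.e. outside
 * the comm-activation progress path, where blocking is safe. */

typedef struct {
    mca_coll_base_module_t super;
    int initialized;                 /* set once the PML exchange has run */
} my_module_t;

/* performs the PML exchange of the extra info; may block */
static int my_lazy_init(my_module_t *m, struct ompi_communicator_t *comm);

static int my_allreduce(const void *sbuf, void *rbuf, int count,
                        struct ompi_datatype_t *dtype, struct ompi_op_t *op,
                        struct ompi_communicator_t *comm,
                        mca_coll_base_module_t *module)
{
    my_module_t *m = (my_module_t *) module;

    if (!m->initialized) {
        int ret = my_lazy_init(m, comm);
        if (OMPI_SUCCESS != ret)
            return ret;

        m->initialized = 1;
    }

    /* ... the actual allreduce implementation ... */
    return OMPI_SUCCESS;
}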

@gkatev gkatev closed this as completed Dec 24, 2021
@bosilca
Member

bosilca commented Jan 3, 2022

The basic rule of thumb is to never block the execution flow when it is started from an active message handler or from a request completion callback. Not even in a multithreaded case, because there is no guarantee that the BTL has released all internal locks before triggering the AM callback. If you need to create new communications, they must be nonblocking, and the delayed completion will be triggered by the communication progress.
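
As a sketch of that in a module-enable path (made-up names and tag, not existing code): only post the communication there and keep the request around; its completion is then picked up by whatever progresses communication later, e.g. a test at the first collective call.

/* Sketch only: post nonblocking communication from module_enable and defer
 * completion; never wait inside this (callback-driven) path. */
#include "ompi/mca/pml/pml.h"

#define MY_SETUP_TAG 345345          /* tag reused from the test code above */

typedef struct {
    mca_coll_base_module_t super;
    char setup_buf[10];
    ompi_request_t *setup_req;       /* completed later, outside this path */
} my_module_t;

static int my_module_enable(mca_coll_base_module_t *module,
                            struct ompi_communicator_t *comm)
{
    my_module_t *m = (my_module_t *) module;
    m->setup_req = NULL;

    if (0 == ompi_comm_rank(comm))   /* nothing to receive on rank 0 here */
        return OMPI_SUCCESS;

    /* Posting is fine; blocking on the result here is what deadlocks. */
    return MCA_PML_CALL(irecv(m->setup_buf, 10, MPI_BYTE, 0, MY_SETUP_TAG,
                              comm, &m->setup_req));
}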
