ompi_request_wait() hang @ coll module enable (in threaded application?) #9780
There seems to be something going on with multiple threads. Can you attach gdb and see if two threads are stuck? And also check that MPI is properly initialized with the requested thread level.
As far as I can tell, process 0 could indeed have gone into receive mode after the send on the DUP-ED comm, just not at the code that I inserted for debugging. The "rank xx sending/receiving..." message is printed in the module's enable method, not in the pml's send/recv handler. Indeed the output might be lacking something and making the situation look different than it potentially is -- it is the entire output. I will check the threads' states more thoroughly. I do remember seeing that one of the two processes was at 100% while the other one was at 0%, for what that's worth. @devreal do you say that something is going on with threads based on the output? What I initially focused on in the output:
This can be interpreted as rank 0 sending and the send completing, while rank 1 is stuck waiting. Perhaps then rank 0 enters receive mode, but its peer (rank 1) never sends anything since it's stuck. Further exacerbated by the fact that it seems to work with the test or the req_complete loop. However, it is possible that the output is incomplete, and this rank 0's send gets matched with a different/wild receive somewhere (but could it, if we are still in the comm's creation?). Indeed I can't say that the application is correct. (Will try; I expect tracing the (python/tensorflow) code to be a bit of a nightmare :-)) Since the problem triggers when the coll module is enabled, I assumed that this all happens as a result of an MPI_Comm_dup.
It would be illegal if multiple threads called MPI_Comm_dup with the same underlying communicator. Otherwise the code would be correct. The fact that it works correctly if you replace the WAIT with a test is worrisome, as it might indicate that one of the completion signals is missed by the WAIT (the test uses the status of the request as an indication of completion, so it is a different mechanism). What communication PML/BTL are you using? If you are using the UCX PML you could try a different combination, OB1/TCP/SM (using ipoib), to see if the issue is reproducible there as well. This will give us a hint on where to look next. You mention the coll module several times, but this looks legitimate to me, as we need to enable some collective module. Can you be more specific and check which collective module is triggering the issue? Btw, I never compiled tensorflow with my own MPI. Is there any specific step to do, or just configure and compile with the target MPI in the PATH?
The above tests are with ob1+vader(sm)+xpmem. I only have

Regarding horovod/tensorflow, I did have some trouble installing it, but that was theoretically due to some arm64 packaging irregularities. If all works as it should, the process might be more straightforward. You can try these steps:
The application: https://github.com/horovod/horovod/blob/master/examples/tensorflow2/tensorflow2_synthetic_benchmark.py (https://horovod.readthedocs.io/en/stable/benchmarks_include.html). Nothing too special in mpirun's command line.
I am able to reproduce with this app (and with the additional triggering code in the enable function):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    // MPI_Init(NULL, NULL);

    int thread_level_provided;
    MPI_Init_thread(NULL, NULL, MPI_THREAD_MULTIPLE, &thread_level_provided);
    printf("Requested thread level %s\n",
           (thread_level_provided == MPI_THREAD_MULTIPLE ? "OK" : "NOT OK"));

    MPI_Comm dcomm;
    MPI_Comm_dup(MPI_COMM_WORLD, &dcomm);

    MPI_Finalize();
    return 0;
}
```

Using the non-threaded Init all works ok.
Can you provide a diff of the changes in the coll/basic module?
Yes, sorry for the confusion, here's the diff:

```diff
diff --git a/ompi/mca/coll/basic/coll_basic_module.c b/ompi/mca/coll/basic/coll_basic_module.c
index e9f93bd..f4afbbf 100644
--- a/ompi/mca/coll/basic/coll_basic_module.c
+++ b/ompi/mca/coll/basic/coll_basic_module.c
@@ -144,6 +144,66 @@ mca_coll_basic_comm_query(struct ompi_communicator_t *comm,
     return &(basic_module->super);
 }
 
+#include "ompi/mca/pml/pml.h"
+
+static int test_recv_wait(mca_coll_base_module_t *module,
+                          struct ompi_communicator_t *comm) {
+
+    int rank = ompi_comm_rank(comm);
+
+    char buf[10];
+    int ret;
+
+    printf("COMM %s\n", comm->c_name);
+
+    if(rank == 0) {
+        printf("rank %d sending to rank %d, tag %d on comm %s\n",
+               rank, 1, 345345, comm->c_name);
+
+        ret = MCA_PML_CALL(send(buf, 10, MPI_BYTE, 1, 345345,
+                                MCA_PML_BASE_SEND_STANDARD, comm));
+
+        if(ret != OMPI_SUCCESS) {
+            printf("PML CALL ERROR\n");
+            return ret;
+        }
+
+        printf("SEND DONE\n");
+    }
+
+    if(rank == 1) {
+        printf("rank %d receiving from rank %d, tag %d on comm %s\n",
+               rank, 0, 345345, comm->c_name);
+
+        // ret = MCA_PML_CALL(recv(buf, 10,
+        //                         MPI_BYTE, 0, 345345, comm, MPI_STATUS_IGNORE));
+
+        ompi_request_t *req;
+
+        ret = MCA_PML_CALL(irecv(buf, 10, MPI_BYTE,
+                                 0, 345345, comm, &req));
+
+        if(ret != OMPI_SUCCESS) {
+            printf("PML CALL ERROR\n");
+            return ret;
+        }
+
+        printf("IRECV DONE\n");
+
+        // int completed = 0;
+        // while(!completed)
+        //     ompi_request_test(&req, &completed, MPI_STATUS_IGNORE);
+
+        // while(!req->req_complete)
+        //     opal_progress();
+
+        ompi_request_wait(&req, MPI_STATUS_IGNORE);
+
+        printf("RECEIVED\n");
+    }
+
+    return OMPI_SUCCESS;
+}
 
 /*
  * Init module on the communicator
@@ -157,7 +217,12 @@ mca_coll_basic_module_enable(mca_coll_base_module_t *module,
     if (NULL == module->base_data) {
         return OMPI_ERROR;
     }
-
+
+    if(ompi_comm_size(comm) > 1) {
+        if(test_recv_wait(module, comm) != OMPI_SUCCESS)
+            return OMPI_ERROR;
+    }
+
     /* All done */
     return OMPI_SUCCESS;
 }
```
Thanks! I can reproduce the hang; I looked at where the sender ends up. Notice the calls in that trace: this explains the problem with your added code. I'm not sure how such a case could be triggered without your modification, though.

I guess the answer to your initial question (whether it is legal to call PML methods in a coll module's enable function) is: No, at least not blocking functions. Now it would be interesting to see if we do that anywhere in the collective code. A stack trace of the original application would be really helpful here :)
Ah I see, I think I mostly understand. Essentially this is not related to Horovod; it just occurred with it because apparently it uses MPI_THREAD_MULTIPLE. The underlying reason is that I use PML functions (actually, coll/base methods) during the initialization of my collectives component (in order to exchange extra required info), which happens in the module's enable method. Basically the problem is that the blocking wait there is reached from within the communicator's creation/progress path, so it can never complete.
Yes, if you're experimenting with additional communication in the collectives modules you have to be careful (that was not clear from the original issue; I was assuming it was an issue with Horovod and vanilla Open MPI ;)). Let me rephrase my statement on PML usage: you can use PML operations pretty much anywhere (the coll and comm operations do that too after all) but you have to be careful because you're doing open heart surgery here. Most importantly, you cannot recursively wait for completion.
Correct. I guess @bosilca can say a word or two about how to best handle cases where completed communication triggers more communication that needs to be waited for. Your testing cycles work (as you pointed out) but that is not really efficient. But then, this might not be relevant during communicator duplication...
I see, yes, as it turns out, PML calls (or at least recv/wait?) cannot be used during the communicator's initialization, as an MPI_Comm_dup (or similar/other) call will trigger a hang. From a very quick look into libnbc, I think my example with non-blocking collectives is not a problem either, as the request fields appear to be checked directly (1a41482). For my case, I switched to lazy initialization and all is okay. Thanks!
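For readers landing here later, a minimal sketch of the kind of lazy initialization mentioned above (the mca_coll_mycomp_* names, the initialized flag, and the bcast entry point shown are illustrative assumptions, not code from this issue): the setup exchange is deferred from module_enable() to the first collective call, where blocking on requests no longer happens inside the communicator-creation path.

```c
// Sketch only: hypothetical component "mycomp"; assumes the standard
// OMPI coll framework types from ompi/mca/coll/coll.h.
#include <stdbool.h>
#include "ompi/constants.h"
#include "ompi/mca/coll/coll.h"
#include "ompi/communicator/communicator.h"

typedef struct {
    mca_coll_base_module_t super;   /* base module, as in other coll components */
    bool initialized;               /* has the one-time setup exchange run yet? */
} mca_coll_mycomp_module_t;

/* Deferred setup: safe to block here because we are invoked from a regular
 * user-level MPI call, not from inside communicator creation/progress. */
static int mca_coll_mycomp_lazy_init(mca_coll_mycomp_module_t *m,
                                     struct ompi_communicator_t *comm)
{
    /* ... exchange the extra per-communicator info here ... */
    m->initialized = true;
    return OMPI_SUCCESS;
}

/* Every collective entry point first makes sure the lazy init has run. */
int mca_coll_mycomp_bcast(void *buff, int count,
                          struct ompi_datatype_t *datatype, int root,
                          struct ompi_communicator_t *comm,
                          mca_coll_base_module_t *module)
{
    mca_coll_mycomp_module_t *m = (mca_coll_mycomp_module_t *)module;

    if (!m->initialized) {
        int ret = mca_coll_mycomp_lazy_init(m, comm);
        if (OMPI_SUCCESS != ret) {
            return ret;
        }
    }

    /* ... actual broadcast implementation ... */
    return OMPI_SUCCESS;
}
```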
The basic rule of thumb is to never block the execution flow when it is started from an active message handler or from a request completion callback. Not even in a multithreaded case, because there is no guarantee that the BTL has released all internal locks before triggering the AM callback. If you need to create new communications, they must be nonblocking, and the delayed completion will be triggered by the communication progress.
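A minimal sketch of that rule of thumb, reusing the same PML calls, tag, and buffer size as the test diff above (the "mycomp" module type and its pending_req field are hypothetical, not an existing component): the receive is posted non-blockingly during enable and completion is only checked, never waited for, from a later entry point once the progress engine has had a chance to run.

```c
// Sketch only: hypothetical "mycomp" module; the request is posted during
// module_enable() but is never waited on there.
#include <stdbool.h>
#include "mpi.h"
#include "ompi/mca/coll/coll.h"
#include "ompi/mca/pml/pml.h"
#include "ompi/request/request.h"

typedef struct {
    mca_coll_base_module_t super;
    ompi_request_t *pending_req;    /* outstanding setup receive, NULL if none */
    char setup_buf[10];             /* same size/tag as the test code above */
} mca_coll_mycomp_module_t;

/* Called from module_enable(): only post the receive, never block. */
static int mca_coll_mycomp_post_setup(mca_coll_mycomp_module_t *m,
                                      struct ompi_communicator_t *comm)
{
    return MCA_PML_CALL(irecv(m->setup_buf, sizeof(m->setup_buf), MPI_BYTE,
                              0, 345345, comm, &m->pending_req));
}

/* Called later (e.g. from the first collective): check, don't wait.
 * The request is completed by the normal communication progress. */
static bool mca_coll_mycomp_setup_done(mca_coll_mycomp_module_t *m)
{
    int completed = 0;

    if (NULL == m->pending_req) {
        return true;                /* nothing outstanding */
    }
    ompi_request_test(&m->pending_req, &completed, MPI_STATUS_IGNORE);
    if (completed) {
        m->pending_req = NULL;      /* completed request released by the test */
    }
    return completed;
}
```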
Hi, I'm encountering an issue with point-to-point messages in a collective module's _module_enable(), in a specific application (whose code is not too easy to navigate...).

Environment:
The application is Horovod's synthetic tensorflow2 benchmark.

I've been looking into producing minimal reproducible code, but I'm not yet there. I was looking for some feedback before heading down a potential rabbit hole. Is it even "legal" to call PML methods in a coll-module's enable function?

I am able to reproduce the issue, with the specific application, by adding something like this in the enable function (e.g. in coll/basic's in this case). The process with rank 1 hangs at the ompi_request_wait call. The same happens if I use the recv code instead of the irecv one. But all works okay with the ompi_request_test and the req->req_complete loops. The application appears to be using threads -- not sure if this is directly related, if at all.

I'm not familiar with the requests' code, but with some quick prints the process appears to hang here: ompi/ompi/request/request.h, line 468 in 7fa73f1.

Some example debug output from prints in the code above and in ompi_request_wait_completion() (note also that the hang happens on the dup-ed communicator and not on mpi_comm_world):

I will see if I can get something more reproducible application-wise. Though someone that does have horovod/tensorflow installed might (/should?) also be able to reproduce it. In the meantime any insight is appreciated, and I am available for further testing.