ompi_request_wait() hang @ coll module enable (in threaded application?) #9780

Closed
gkatev opened this issue Dec 21, 2021 · 13 comments

@gkatev
Contributor

gkatev commented Dec 21, 2021

Hi, I'm encountering an issue with point-to-point messages in a collective module's _module_enable(), in a specific application (whose code is not too easy to navigate...)

Environment:

Single-node, CentOS 8, aarch64
Open MPI v5.0.0rc2, built from git, release build
Tensorflow 2.7.0, Horovod 0.23.0

The application is Horovod's synthetic tensorflow2 benchmark.

I've been looking into producing minimal reproducible code, but I'm not yet there. I was looking for some feedback before heading down a potential rabbit hole. Is it even "legal" to call PML methods in a coll-module's enable function?

I am able to reproduce the issue, with the specific application, by adding something like this to the enable function (e.g. in coll/basic's, in this case):

#include "ompi/mca/pml/pml.h"

static int test_recv_wait(mca_coll_base_module_t *module,
		struct ompi_communicator_t *comm) {
	
	int rank = ompi_comm_rank(comm);
	
	char buf[10];
	int ret;
	
	printf("COMM %s\n", comm->c_name);
	
	if(rank == 0) {
		printf("rank %d sending to rank %d, tag %d on comm %s\n",
			rank, 1, 345345, comm->c_name);
		
		ret = MCA_PML_CALL(send(buf, 10, MPI_BYTE, 1, 345345,
			MCA_PML_BASE_SEND_STANDARD, comm));
		
		if(ret != OMPI_SUCCESS) {
			printf("PML CALL ERROR\n");
			return ret;
		}
		
		printf("SEND DONE\n");
	}
	
	if(rank == 1) {
		printf("rank %d receiving from rank %d, tag %d on comm %s\n",
			rank, 0, 345345, comm->c_name);
		
		// ret = MCA_PML_CALL(recv(buf, 10,
			// MPI_BYTE, 0, 345345, comm, MPI_STATUS_IGNORE));
		
		ompi_request_t *req;
		
		ret = MCA_PML_CALL(irecv(buf, 10, MPI_BYTE,
			0, 345345, comm, &req));
		
		if(ret != OMPI_SUCCESS) {
			printf("PML CALL ERROR\n");
			return ret;
		}
		
		printf("IRECV DONE\n");
		
		// int completed = 0;
		// while(!completed)
			// ompi_request_test(&req, &completed, MPI_STATUS_IGNORE);
		
		// while(!req->req_complete)
			// opal_progress();
		
		ompi_request_wait(&req, MPI_STATUS_IGNORE);
		
		printf("RECEIVED\n");
	}
	
	return OMPI_SUCCESS;
}

/*
 * Init module on the communicator
 */
int
mca_coll_basic_module_enable(mca_coll_base_module_t *module,
                             struct ompi_communicator_t *comm)
{
    /* prepare the placeholder for the array of request* */
    module->base_data = OBJ_NEW(mca_coll_base_comm_t);
    if (NULL == module->base_data) {
        return OMPI_ERROR;
    }
    
    if(ompi_comm_size(comm) > 1) {
		if(test_recv_wait(module, comm) != OMPI_SUCCESS)
			return OMPI_ERROR;
	}
	
    /* All done */
    return OMPI_SUCCESS;
}

The process with rank 1 hangs at the ompi_request_wait call. The same happens if I use the recv code instead of the irecv one. But everything works okay with the ompi_request_test and the req->req_complete loops. The application appears to be using threads -- not sure if this is directly related, if at all.

I'm not familiar with the requests' code, but with some quick prints the process appears to hang here:

SYNC_WAIT(&sync);

Some example debug output from prints in the code above and in ompi_request_wait_completion() (note also that the hang happens on the dup-ed communicator, not on MPI_COMM_WORLD):

@1,1]<stdout>:COMM MPI_COMM_WORLD
@1,1]<stdout>:rank 1 receiving from rank 0, tag 345345 on comm MPI_COMM_WORLD
@1,1]<stdout>:IRECV DONE
@1,1]<stdout>:Entered ompi_request_wait_completion()
@1,1]<stdout>:Before SYNC_WAIT()
@1,0]<stdout>:COMM MPI_COMM_WORLD
@1,0]<stdout>:rank 0 sending to rank 1, tag 345345 on comm MPI_COMM_WORLD
@1,0]<stdout>:SEND DONE
@1,0]<stdout>:Entered ompi_request_wait_completion()
@1,0]<stdout>:Before SYNC_WAIT()
@1,1]<stdout>:After SYNC_WAIT()
@1,1]<stdout>:RECEIVED
@1,1]<stdout>:Entered ompi_request_wait_completion()
@1,1]<stdout>:Before SYNC_WAIT()
@1,1]<stdout>:After SYNC_WAIT()
@1,0]<stdout>:After SYNC_WAIT()
@1,0]<stdout>:Entered ompi_request_wait_completion()
@1,0]<stdout>:Before SYNC_WAIT()
@1,0]<stdout>:COMM MPI COMMUNICATOR 3 DUP FROM 0
@1,0]<stdout>:rank 0 sending to rank 1, tag 345345 on comm MPI COMMUNICATOR 3 DUP FROM 0
@1,0]<stdout>:SEND DONE
@1,0]<stdout>:After SYNC_WAIT()
@1,1]<stdout>:Entered ompi_request_wait_completion()
@1,1]<stdout>:Before SYNC_WAIT()
@1,1]<stdout>:COMM MPI COMMUNICATOR 3 DUP FROM 0
@1,1]<stdout>:rank 1 receiving from rank 0, tag 345345 on comm MPI COMMUNICATOR 3 DUP FROM 0
@1,1]<stdout>:IRECV DONE
@1,1]<stdout>:Entered ompi_request_wait_completion()
@1,1]<stdout>:Before SYNC_WAIT()
@1,0]<stdout>:Entered ompi_request_wait_completion()
@1,0]<stdout>:Before SYNC_WAIT()
<hang>

I will see if I can get something more reproducible application-wise. Though someone who does have horovod/tensorflow installed might (/should?) also be able to reproduce it. In the meantime any insight is appreciated, and I am available for further testing.

@gkatev gkatev changed the title ompi_request_wait() hang @ coll module enable, in threaded application ompi_request_wait() hang @ coll module enable (in threaded application?) Dec 21, 2021
@devreal
Contributor

devreal commented Dec 21, 2021

There seems to be something going on with multiple threads:

@1,0]<stdout>:COMM MPI COMMUNICATOR 3 DUP FROM 0
[...]
@1,1]<stdout>:COMM MPI COMMUNICATOR 3 DUP FROM 0
[...]
@1,1]<stdout>:Entered ompi_request_wait_completion()
@1,1]<stdout>:Before SYNC_WAIT()
@1,0]<stdout>:Entered ompi_request_wait_completion()
@1,0]<stdout>:Before SYNC_WAIT()

Can you attach gdb and see if two threads are stuck? And also check that MPI is properly initialized with MPI_THREAD_MULTIPLE and that two threads are not issuing collectives on the same communicator?
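
Something along these lines on the hung rank should show it (roughly):

$ gdb -p <pid of the hung rank>
(gdb) thread apply all bt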

@bosilca
Member

bosilca commented Dec 21, 2021

The SYNC_WAIT is the OMPI way of waiting for completion of events, including communications. Thus, any lost or incorrectly matched/unsatisfied receive will block in this function, even though the wait_sync itself might not be the issue. @devreal is right, something is fishy in your output, as if the first process, after doing the send, went into the receive code but without printing the "receiving from rank" message. Is this the entire output or partial?

@gkatev
Contributor Author

gkatev commented Dec 21, 2021

As far as I can tell, process 0 could indeed have gone into receive mode after the send on the dup-ed comm, just not in the code that I inserted for debugging. The "rank xx sending/receiving..." message is printed in the module's enable method, not in the PML's send/recv handler. Indeed, the output might be lacking something and making the situation look different than it actually is -- it is the entire output.

I will check the threads' states more thoroughly. I do remember seeing that one of the two processes was at 100% CPU while the other was at 0%, for what that's worth. @devreal are you saying that something is going on with threads based on the output?

What I initially focused on in the output:

@1,0]<stdout>:COMM MPI COMMUNICATOR 3 DUP FROM 0
@1,0]<stdout>:rank 0 sending to rank 1, tag 345345 on comm MPI COMMUNICATOR 3 DUP FROM 0
@1,0]<stdout>:SEND DONE
@1,1]<stdout>:COMM MPI COMMUNICATOR 3 DUP FROM 0
@1,1]<stdout>:rank 1 receiving from rank 0, tag 345345 on comm MPI COMMUNICATOR 3 DUP FROM 0
@1,1]<stdout>:IRECV DONE
@1,1]<stdout>:Entered ompi_request_wait_completion()
@1,1]<stdout>:Before SYNC_WAIT()

Which can be interpreted as rank 0 sending and the send completing, while rank 1 is stuck waiting. Perhaps rank 0 then enters receive mode, but its peer (rank 1) never sends anything since it's stuck. This is further supported by the fact that it seems to work with the test or the req_complete loop.

However, it is possible that the output is incomplete, and that rank 0's send gets matched with a different/wild receive somewhere... (but could it, if we are still in the comm's creation?). Indeed I can't say for certain that the application is correct. (I will try to verify, though I expect tracing the python/tensorflow code to be a bit of a nightmare :-))

Since the problem triggers when the coll-module is enabled, I assumed that this all happens as a result of an MPI_Comm_dup call. Not sure if multiple threads are actually involved in this call -- it would be "illegal" if multiple threads called MPI_Comm_dup, correct?

@bosilca
Member

bosilca commented Dec 21, 2021

It would be illegal if multiple threads called MPI_Comm_dup with the same underlying communicator. Otherwise the code would be correct.

The fact that it works correctly if you replace the WAIT with a test is worrisome, as it might indicate that one of the completion signals is missed by the WAIT (the test uses the status of the request as an indication of completion, so it is a different mechanism). What communication PML/BTL are you using? If you are using the UCX PML you could try a different combination, OB1/TCP/SM (using ipoib), to see if the issue is reproducible there as well. This will give us a hint on where to look next.
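
For example, something along the lines of (adjusting the rest of the command line to your setup):

$ mpirun --mca pml ob1 --mca btl tcp,sm,self ...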

You mention coll-module several times, but this looks legitimate to me, as we need to enable some collective module. Can you be more specific and check which collective module is triggering the issue?

Btw, I never compiled tensorflow with my own MPI. Are there any specific steps to do, or do I just configure and compile with the target MPI in the PATH?

@gkatev
Contributor Author

gkatev commented Dec 21, 2021

The above tests are with ob1+vader(sm)+xpmem. I only have libnbc and basic enabled, and the debug/triggering code is inserted in mca_coll_basic_module_enable(). However, the effect is also similar with other coll components and is not specific to basic. Essentially this happens (in this specific application) when point-to-point methods are called in a component's coll_module_enable function. Tomorrow I will try to add some prints to ob1's code and make sure all sends/recvs are accounted for.

Regarding horovod/tensorflow, I did have some trouble installing it, but that was theoretically due to some arm64 packaging irregularities. If all works as it should, the process might be more straightforward. You can try these steps:

  • Create/enable a python virtualenv (not actually strictly necessary)
  • $ pip install tensorflow (or maybe tensorflow-cpu (?), I built mine from source with cuda support disabled). I specifically have version 2.7.0.
  • For horovod, HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_MPI=1 pip install horovod[tensorflow] (openmpi's stuff in PATH/LD_LIBRARY_PATH) (https://horovod.readthedocs.io/en/stable/install_include.html)
  • (watch out for pip's caching when re-installing packages...)

The application: https://github.com/horovod/horovod/blob/master/examples/tensorflow2/tensorflow2_synthetic_benchmark.py (https://horovod.readthedocs.io/en/stable/benchmarks_include.html)

Nothing too special in mpirun's command line. Eg:
$ $(which mpirun) --host <host>:2 --mca coll libnbc,basic --mca pml ob1 --mca btl vader,self --output tag python horovod/examples/tensorflow2/tensorflow2_synthetic_benchmark.py

@gkatev
Contributor Author

gkatev commented Dec 22, 2021

I am able to reproduce with this app (and with the additional triggering code in the enable function)

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
	// MPI_Init(NULL, NULL);
	
	int thread_level_provided;
	MPI_Init_thread(NULL, NULL, MPI_THREAD_MULTIPLE, &thread_level_provided);
	
	printf("Requested thread level %s\n",
		(thread_level_provided == MPI_THREAD_MULTIPLE ? "OK" : "NOT OK"));
	
	MPI_Comm dcomm;
	MPI_Comm_dup(MPI_COMM_WORLD, &dcomm);
	
	MPI_Finalize();
	
	return 0;
}
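
(Compiled and run roughly like this, with the same MCA settings as before; repro.c is just my name for the file above:)

$ mpicc repro.c -o repro
$ mpirun -np 2 --mca coll libnbc,basic --mca pml ob1 --mca btl vader,self ./repro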

Using the non-threaded Init all works ok.
Can you confirm?

@devreal
Contributor

devreal commented Dec 22, 2021

Can you provide a diff of the changes in the enable function that trigger this issue? I cannot reproduce with vanilla Open MPI and I'm not sure what changes you mean.

@gkatev
Contributor Author

gkatev commented Dec 22, 2021

Yes, sorry for the confusion, here's the diff:

diff --git a/ompi/mca/coll/basic/coll_basic_module.c b/ompi/mca/coll/basic/coll_basic_module.c
index e9f93bd..f4afbbf 100644
--- a/ompi/mca/coll/basic/coll_basic_module.c
+++ b/ompi/mca/coll/basic/coll_basic_module.c
@@ -144,6 +144,66 @@ mca_coll_basic_comm_query(struct ompi_communicator_t *comm,
     return &(basic_module->super);
 }
 
+#include "ompi/mca/pml/pml.h"
+
+static int test_recv_wait(mca_coll_base_module_t *module,
+		struct ompi_communicator_t *comm) {
+	
+	int rank = ompi_comm_rank(comm);
+	
+	char buf[10];
+	int ret;
+	
+	printf("COMM %s\n", comm->c_name);
+	
+	if(rank == 0) {
+		printf("rank %d sending to rank %d, tag %d on comm %s\n",
+			rank, 1, 345345, comm->c_name);
+		
+		ret = MCA_PML_CALL(send(buf, 10, MPI_BYTE, 1, 345345,
+			MCA_PML_BASE_SEND_STANDARD, comm));
+		
+		if(ret != OMPI_SUCCESS) {
+			printf("PML CALL ERROR\n");
+			return ret;
+		}
+		
+		printf("SEND DONE\n");
+	}
+	
+	if(rank == 1) {
+		printf("rank %d receiving from rank %d, tag %d on comm %s\n",
+			rank, 0, 345345, comm->c_name);
+		
+		// ret = MCA_PML_CALL(recv(buf, 10,
+			// MPI_BYTE, 0, 345345, comm, MPI_STATUS_IGNORE));
+		
+		ompi_request_t *req;
+		
+		ret = MCA_PML_CALL(irecv(buf, 10, MPI_BYTE,
+			0, 345345, comm, &req));
+		
+		if(ret != OMPI_SUCCESS) {
+			printf("PML CALL ERROR\n");
+			return ret;
+		}
+		
+		printf("IRECV DONE\n");
+		
+		// int completed = 0;
+		// while(!completed)
+			// ompi_request_test(&req, &completed, MPI_STATUS_IGNORE);
+		
+		// while(!req->req_complete)
+			// opal_progress();
+		
+		ompi_request_wait(&req, MPI_STATUS_IGNORE);
+		
+		printf("RECEIVED\n");
+	}
+	
+	return OMPI_SUCCESS;
+}
 
 /*
  * Init module on the communicator
@@ -157,7 +217,12 @@ mca_coll_basic_module_enable(mca_coll_base_module_t *module,
     if (NULL == module->base_data) {
         return OMPI_ERROR;
     }
-
+    
+    if(ompi_comm_size(comm) > 1) {
+		if(test_recv_wait(module, comm) != OMPI_SUCCESS)
+			return OMPI_ERROR;
+	}
+	
     /* All done */
     return OMPI_SUCCESS;
 }

@devreal
Contributor

devreal commented Dec 22, 2021

Thanks! I can reproduce the hang: the sender ends up in MPI_Finalize while the receiver is stuck in ompi_sync_wait_mt. The problem seems to be a recursive wait. The stack trace I get:

(gdb) bt
#0  futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fffffffd240) at ../sysdeps/nptl/futex-internal.h:183
#1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7fffffffd248, cond=0x7fffffffd218) at pthread_cond_wait.c:508
#2  __pthread_cond_wait (cond=0x7fffffffd218, mutex=0x7fffffffd248) at pthread_cond_wait.c:638
#3  0x00007ffff783c03c in opal_thread_internal_cond_wait (p_mutex=0x7fffffffd248, p_cond=0x7fffffffd218) at ../../../../opal/mca/threads/pthreads/threads_pthreads_mutex.h:161
#4  ompi_sync_wait_mt (sync=sync@entry=0x7fffffffd210) at ../../../../opal/mca/threads/base/wait_sync.c:101
#5  0x00007ffff7c588a6 in ompi_request_wait_completion (req=0x555555697e00) at ../../ompi/request/request.h:468
#6  ompi_request_default_wait (req_ptr=0x7fffffffd2d0, status=0x0) at ../../ompi/request/req_wait.c:40
#7  0x00007ffff7d74c31 in test_recv_wait (module=<optimized out>, comm=0x5555555ab930) at ../../../../../ompi/mca/coll/basic/coll_basic_module.c:200
#8  mca_coll_basic_module_enable (module=<optimized out>, comm=0x5555555ab930) at ../../../../../ompi/mca/coll/basic/coll_basic_module.c:222
#9  mca_coll_basic_module_enable (module=<optimized out>, comm=0x5555555ab930) at ../../../../../ompi/mca/coll/basic/coll_basic_module.c:212
#10 0x00007ffff7cd245d in mca_coll_base_comm_select (comm=<optimized out>) at ../../../../ompi/mca/coll/base/coll_base_comm_select.c:139
#11 0x00007ffff7c20ad2 in ompi_comm_activate_nb_complete (request=<optimized out>) at ../../ompi/communicator/comm_cid.c:665
#12 0x00007ffff7c27136 in ompi_comm_request_progress () at ../../ompi/communicator/comm_request.c:142
#13 ompi_comm_request_progress () at ../../ompi/communicator/comm_request.c:99
#14 0x00007ffff779d414 in opal_progress () at ../../opal/runtime/opal_progress.c:224
#15 0x00007ffff783c095 in ompi_sync_wait_mt (sync=sync@entry=0x7fffffffd4c0) at ../../../../opal/mca/threads/base/wait_sync.c:121
#16 0x00007ffff7c1f6e5 in ompi_request_wait_completion (req=0x5555556a2be8) at ../../ompi/request/request.h:468
#17 0x00007ffff7c2698e in ompi_comm_activate (newcomm=newcomm@entry=0x7fffffffd5c0, comm=comm@entry=0x555555558020 <ompi_mpi_comm_world>, bridgecomm=bridgecomm@entry=0x0, 
    arg0=arg0@entry=0x0, arg1=arg1@entry=0x0, send_first=send_first@entry=false, mode=32) at ../../ompi/communicator/comm_cid.c:623
#18 0x00007ffff7c1b7f4 in ompi_comm_dup_with_info (comm=comm@entry=0x555555558020 <ompi_mpi_comm_world>, info=info@entry=0x0, newcomm=newcomm@entry=0x7fffffffd670)
    at ../../ompi/communicator/comm.c:1046
#19 0x00007ffff7c1b9be in ompi_comm_dup (comm=comm@entry=0x555555558020 <ompi_mpi_comm_world>, newcomm=newcomm@entry=0x7fffffffd670) at ../../ompi/communicator/comm.c:994
#20 0x00007ffff7c85359 in PMPI_Comm_dup (comm=0x555555558020 <ompi_mpi_comm_world>, newcomm=0x7fffffffd670) at pcomm_dup.c:73
#21 0x0000555555555245 in main ()

Notice the calls to ompi_sync_wait_mt in frame 15 and frame 4. The first wait is there because blocking comm dup is implemented in terms of non-blocking comm dup + wait. While waiting for the communicator activation to complete, its completion callback fires, leading to your additional code, which again waits for communication. However, this time the code in ompi_sync_wait_mt sees that there is already a thread actively waiting and suspends itself, unaware of the fact that it's the same thread. This behavior is specific to MPI_THREAD_MULTIPLE, where we only want one thread actively polling while all others are suspended.
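
To make the cycle concrete, here is a minimal standalone sketch (plain C with made-up names, not Open MPI code) of the same hazard: a wait primitive that allows only one active poller deadlocks as soon as the progress path it drives issues another wait on the same thread.

/* sync_wait_recursion.c -- illustrative only, NOT Open MPI code.
 * Models the "single active poller" behavior described above. */
#include <stdio.h>
#include <stdbool.h>

static bool poller_active = false;   /* only one caller may poll at a time */
static bool outer_done = false, inner_done = false;

static void progress(void);

static void sync_wait(volatile bool *completed)
{
    if (!poller_active) {
        poller_active = true;
        while (!*completed)          /* the outer wait actively drives progress */
            progress();
        poller_active = false;
    } else {
        /* The real code would block here (pthread_cond_wait), expecting the
         * active poller -- supposedly another thread -- to signal completion.
         * But the active poller is this same thread, now blocked: deadlock. */
        printf("nested wait issued from the progress path -> deadlock\n");
    }
}

static void progress(void)
{
    static bool callback_fired = false;
    /* A completion callback fires from inside progress (comparable to
     * module_enable being reached from ompi_comm_request_progress)... */
    if (!callback_fired) {
        callback_fired = true;
        sync_wait(&inner_done);      /* ...and waits again: the recursion */
    }
    outer_done = true;
}

int main(void)
{
    sync_wait(&outer_done);          /* like the blocking MPI_Comm_dup */
    return 0;
}

In the real ompi_sync_wait_mt the nested caller blocks in pthread_cond_wait (frames 0-4 above), waiting for a signal from the "active" poller -- which is the very thread that is now blocked.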

So this explains the problem with your added code. I'm not sure how such a case could be triggered without your test_recv_wait in Horovod.

I guess the answer to your initial question:

Is it even "legal" to call PML methods in a coll-module's enable function?

is No, at least not blocking functions. Now it would be interesting to see if we do that anywhere in the collective code. A stack trace of the original application would be really helpful here :)

@gkatev
Contributor Author

gkatev commented Dec 22, 2021

Ah I see, I think I mostly understand. Essentially this is not related to Horovod, it just occurred with it because apparently it uses MPI_Comm_dup + MPI_THREAD_MULTIPLE.

The underlying reason is that I use PML functions (actually, coll/base methods) during the initialization of my collectives component (in order to exchange extra required info), which happens in _module_enable. If PML calls at that time are not supported, I suppose I can do the initialization lazily the first time a collective operation is called.

Basically the problem is that wait calls opal_progress, and that in turn might result in a new (recursive) call to wait, correct? Could this occur for another reason? Maybe something like: if a non-blocking collective op is created and waited on, and its implementation (executed as a result of wait/progress) calls recv/wait, resulting in a recursive wait?

@devreal
Contributor

devreal commented Dec 22, 2021

Yes, if you're experimenting with additional communication in the collective modules you have to be careful (that was not clear from the original issue; I was assuming it was an issue with Horovod and vanilla Open MPI ;)). Let me rephrase my statement on PML usage: you can use PML operations pretty much anywhere (the coll and comm operations do that too, after all) but you have to be careful, because you're doing open-heart surgery here. Most importantly, you cannot recursively wait for completion.

Basically the problem is that wait calls opal_progress, and that in turn might result in a new (recursive) call to wait, correct? Could this occur due to another reason? Maybe something like, if a non-blocking collective op is created and waited-on, and its implemenentation (executed as a result of wait/progress) calls recv/wait, resulting in a recursive wait?

Correct.

I guess @bosilca can say a word or two about how best to handle cases where completed communication triggers more communication that needs to be waited for. Your testing cycles work (as you pointed out) but they are not really efficient. But then, this might not be relevant during communicator duplication...

@gkatev
Contributor Author

gkatev commented Dec 24, 2021

I see, yes, as it turns out, PML calls (or at least recv/wait?) cannot be used during the communicator's initialization, as an MPI_Comm_dup (or similar/other) call will trigger a hang. From a very quick look into libnbc, I think my example with non-blocking collectives is not a problem either, as the request fields appear to be checked directly (1a41482).

For my case, I switched to lazy initialization and all is okay. Thanks!
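
In case it helps someone doing something similar, the shape of what I ended up with is roughly this (a sketch only -- my_module_t, initialized and my_lazy_init are names from my own component, not upstream code):

/* Lazy-initialization sketch: module_enable() no longer touches the PML;
 * the one-time exchange happens on the first collective call, i.e. outside
 * the comm-activation progress path, where blocking is safe. */

typedef struct {
    mca_coll_base_module_t super;
    int initialized;                 /* set once the PML exchange has run */
} my_module_t;

/* performs the PML exchange of the extra info; may block */
static int my_lazy_init(my_module_t *m, struct ompi_communicator_t *comm);

static int my_allreduce(const void *sbuf, void *rbuf, int count,
                        struct ompi_datatype_t *dtype, struct ompi_op_t *op,
                        struct ompi_communicator_t *comm,
                        mca_coll_base_module_t *module)
{
    my_module_t *m = (my_module_t *) module;

    if (!m->initialized) {
        int ret = my_lazy_init(m, comm);
        if (OMPI_SUCCESS != ret)
            return ret;

        m->initialized = 1;
    }

    /* ... the actual allreduce implementation ... */
    return OMPI_SUCCESS;
}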

@gkatev gkatev closed this as completed Dec 24, 2021
@bosilca
Member

bosilca commented Jan 3, 2022

The basic rule of thumb is to never block the execution flow when it is started from an active message handler or from a request completion callback. Not even in a multithreaded case, because there is no guarantee that the BTL has released all internal locks before triggering the AM callback. If you need to create new communications, they must be nonblocking, and the delayed completion will be triggered by the communication progress.
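
As a sketch of that in a module-enable path (made-up names and tag, not existing code): only post the communication there and keep the request around; its completion is then picked up by whatever progresses communication later, e.g. a test at the first collective call.

/* Sketch only: post nonblocking communication from module_enable and defer
 * completion; never wait inside this (callback-driven) path. */
#include "ompi/mca/pml/pml.h"

#define MY_SETUP_TAG 345345          /* tag reused from the test code above */

typedef struct {
    mca_coll_base_module_t super;
    char setup_buf[10];
    ompi_request_t *setup_req;       /* completed later, outside this path */
} my_module_t;

static int my_module_enable(mca_coll_base_module_t *module,
                            struct ompi_communicator_t *comm)
{
    my_module_t *m = (my_module_t *) module;
    m->setup_req = NULL;

    if (0 == ompi_comm_rank(comm))   /* nothing to receive on rank 0 here */
        return OMPI_SUCCESS;

    /* Posting is fine; blocking on the result here is what deadlocks. */
    return MCA_PML_CALL(irecv(m->setup_buf, 10, MPI_BYTE, 0, MY_SETUP_TAG,
                              comm, &m->setup_req));
}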
