
ompi_request_wait() hang @ coll module enable (in threaded application?) #9780


Hi, I'm encountering an issue with point-to-point messages in a collective module's _module_enable(), in a specific application (whose code is not too easy to navigate...)

Environment:

Single-node, CentOS 8, aarch64
Open MPI v5.0.0rc2, built from git, release build
Tensorflow 2.7.0, Horovod 0.23.0

The application is Horovod's synthetic tensorflow2 benchmark.

I've been looking into producing minimal reproducible code, but I'm not yet there. I was looking for some feedback before heading down a potential rabbit hole. Is it even "legal" to call PML methods in a coll-module's enable function?

I am able to reproduce the issue, with the specific application, by adding something like this to the enable function (e.g. coll/basic's, in this case):

#include "ompi/mca/pml/pml.h"

static int test_recv_wait(mca_coll_base_module_t *module,
		struct ompi_communicator_t *comm) {
	
	int rank = ompi_comm_rank(comm);
	
	char buf[10];
	int ret;
	
	printf("COMM %s\n", comm->c_name);
	
	if(rank == 0) {
		printf("rank %d sending to rank %d, tag %d on comm %s\n",
			rank, 1, 345345, comm->c_name);
		
		ret = MCA_PML_CALL(send(buf, 10, MPI_BYTE, 1, 345345,
			MCA_PML_BASE_SEND_STANDARD, comm));
		
		if(ret != OMPI_SUCCESS) {
			printf("PML CALL ERROR\n");
			return ret;
		}
		
		printf("SEND DONE\n");
	}
	
	if(rank == 1) {
		printf("rank %d receiving from rank %d, tag %d on comm %s\n",
			rank, 0, 345345, comm->c_name);
		
		// ret = MCA_PML_CALL(recv(buf, 10,
			// MPI_BYTE, 0, 345345, comm, MPI_STATUS_IGNORE));
		
		ompi_request_t *req;
		
		ret = MCA_PML_CALL(irecv(buf, 10, MPI_BYTE,
			0, 345345, comm, &req));
		
		if(ret != OMPI_SUCCESS) {
			printf("PML CALL ERROR\n");
			return ret;
		}
		
		printf("IRECV DONE\n");
		
		// int completed = 0;
		// while(!completed)
			// ompi_request_test(&req, &completed, MPI_STATUS_IGNORE);
		
		// while(!req->req_complete)
			// opal_progress();
		
		ompi_request_wait(&req, MPI_STATUS_IGNORE);
		
		printf("RECEIVED\n");
	}
	
	return OMPI_SUCCESS;
}

/*
 * Init module on the communicator
 */
int
mca_coll_basic_module_enable(mca_coll_base_module_t *module,
                             struct ompi_communicator_t *comm)
{
    /* prepare the placeholder for the array of request* */
    module->base_data = OBJ_NEW(mca_coll_base_comm_t);
    if (NULL == module->base_data) {
        return OMPI_ERROR;
    }
    
    if(ompi_comm_size(comm) > 1) {
        if(test_recv_wait(module, comm) != OMPI_SUCCESS)
            return OMPI_ERROR;
    }

    /* All done */
    return OMPI_SUCCESS;
}

The process with rank 1 hangs at the ompi_request_wait() call. The same happens if I use the blocking recv code instead of the irecv one, but everything works fine with either the ompi_request_test loop or the req->req_complete/opal_progress loop. The application appears to be using threads -- not sure if this is directly related, if at all.
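
For clarity, this is the polling variant (the commented-out loop from the snippet above) that does complete here:

// Polling workaround that completes in my tests: test the request in a
// loop instead of blocking in ompi_request_wait()
int completed = 0;
while(!completed)
	ompi_request_test(&req, &completed, MPI_STATUS_IGNORE);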

I'm not familiar with the request code, but with some quick prints the process appears to hang here:

SYNC_WAIT(&sync);
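
For context, my rough reading of ompi_request_wait_completion() is that it has approximately the shape below -- this is a hand-written sketch from memory, not the actual source, so details may well be off; the marked SYNC_WAIT() is the spot where the stuck process sits:

/* Sketch of ompi_request_wait_completion() as I understand it
 * (paraphrased, not the real Open MPI code) */
if(opal_using_threads() && !REQUEST_COMPLETE(req)) {
	ompi_wait_sync_t sync;
	WAIT_SYNC_INIT(&sync, 1);     /* set up a sync object for this wait */
	/* ... the sync object gets attached to the request ... */
	SYNC_WAIT(&sync);             /* <-- the stuck process sits here */
	WAIT_SYNC_RELEASE(&sync);
} else {
	while(!REQUEST_COMPLETE(req)) /* non-threaded path: poll for completion */
		opal_progress();
}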

Some example debug output, from prints in the code above and in ompi_request_wait_completion() (note also that the hang happens on the dup-ed communicator and not on MPI_COMM_WORLD):

@1,1]<stdout>:COMM MPI_COMM_WORLD
@1,1]<stdout>:rank 1 receiving from rank 0, tag 345345 on comm MPI_COMM_WORLD
@1,1]<stdout>:IRECV DONE
@1,1]<stdout>:Entered ompi_request_wait_completion()
@1,1]<stdout>:Before SYNC_WAIT()
@1,0]<stdout>:COMM MPI_COMM_WORLD
@1,0]<stdout>:rank 0 sending to rank 1, tag 345345 on comm MPI_COMM_WORLD
@1,0]<stdout>:SEND DONE
@1,0]<stdout>:Entered ompi_request_wait_completion()
@1,0]<stdout>:Before SYNC_WAIT()
@1,1]<stdout>:After SYNC_WAIT()
@1,1]<stdout>:RECEIVED
@1,1]<stdout>:Entered ompi_request_wait_completion()
@1,1]<stdout>:Before SYNC_WAIT()
@1,1]<stdout>:After SYNC_WAIT()
@1,0]<stdout>:After SYNC_WAIT()
@1,0]<stdout>:Entered ompi_request_wait_completion()
@1,0]<stdout>:Before SYNC_WAIT()
@1,0]<stdout>:COMM MPI COMMUNICATOR 3 DUP FROM 0
@1,0]<stdout>:rank 0 sending to rank 1, tag 345345 on comm MPI COMMUNICATOR 3 DUP FROM 0
@1,0]<stdout>:SEND DONE
@1,0]<stdout>:After SYNC_WAIT()
@1,1]<stdout>:Entered ompi_request_wait_completion()
@1,1]<stdout>:Before SYNC_WAIT()
@1,1]<stdout>:COMM MPI COMMUNICATOR 3 DUP FROM 0
@1,1]<stdout>:rank 1 receiving from rank 0, tag 345345 on comm MPI COMMUNICATOR 3 DUP FROM 0
@1,1]<stdout>:IRECV DONE
@1,1]<stdout>:Entered ompi_request_wait_completion()
@1,1]<stdout>:Before SYNC_WAIT()
@1,0]<stdout>:Entered ompi_request_wait_completion()
@1,0]<stdout>:Before SYNC_WAIT()
<hang>

I will see if I can get something more reproducible application-wise, though someone who does have Horovod/TensorFlow installed might (/should?) also be able to reproduce it. In the meantime, any insight is appreciated, and I am available for further testing.
