Open MPI 3.0.0 hangs in code using the GPU aware MPI feature

Thank you for taking the time to submit an issue!

## Background information

### What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

3.0.0

### Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

from source using openmpi-3.0.0.tar.bz2
./configure --prefix=/opt/openmpi/v3.0.0 --with-cuda=/usr/local/cuda-9.0 --without-ucx --with-pmi --with-knem=/opt/knem-1.1.2.90mlnx2

### Please describe the system on which you are running

* Operating system/version: CentOS Linux release 7.3.1611 (Core)
* Computer hardware: Intel(R) Xeon(R) CPU E5-2687W v4 @ 3.00GHz
* Network type: MLNX_OFED_LINUX-4.0-2.0.0.1

CUDA 9.0.176
NVIDIA driver 384.81

-----------------------------

## Details of the problem

I am testing with an internal application which simply has a couple of MPI_Sendrecv() with GPU memory pointers.
```
    while ( l2_norm > tol && iter < iter_max )
    {
        CUDA_RT_CALL( cudaMemsetAsync(l2_norm_d, 0 , sizeof(real), compute_stream ) );
        launch_jacobi_kernel( a_new, a, l2_norm_d, iy_start, iy_end, nx, compute_stream );
        CUDA_RT_CALL( cudaEventRecord( compute_done, compute_stream ) );
        if ( (iter % nccheck) == 0 || (!csv && (iter % 100) == 0) ) {
            CUDA_RT_CALL( cudaMemcpyAsync( l2_norm_h, l2_norm_d, sizeof(real), cudaMemcpyDeviceToHost, compute_stream ) );
        }
        const int top = rank > 0 ? rank - 1 : (size-1);
        const int bottom = (rank+1)%size;
        CUDA_RT_CALL( cudaEventSynchronize( compute_done ) );
        MPI_CALL( MPI_Sendrecv( a_new+iy_start*nx,   nx, MPI_REAL_TYPE, top   , 0, a_new+(iy_end*nx), nx, MPI_REAL_TYPE, bottom, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE ));
        MPI_CALL( MPI_Sendrecv( a_new+(iy_end-1)*nx, nx, MPI_REAL_TYPE, bottom, 0, a_new,             nx, MPI_REAL_TYPE, top,    0, MPI_COMM_WORLD, MPI_STATUS_IGNORE ));
        POP_RANGE

 ```
up to 2048x2048 it works:

```shell
[1] GPU0 name=Tesla P100-PCIE-16GB clockRate=1328500 memoryClockRate=715000 multiProcessorCount=56 <==
[0] GPU0 name=Tesla P100-PCIE-16GB clockRate=1328500 memoryClockRate=715000 multiProcessorCount=56 <==
Single GPU jacobi relaxation: 1000 iterations on 2048 x 2048 mesh with norm check every 100 iterations
    0, 11.310944
  100, 0.317338
  200, 0.189262
  300, 0.139756
  400, 0.112668
  500, 0.095314
  600, 0.083131
  700, 0.074048
  800, 0.066984
  900, 0.061312
[1] allocated a/a_new size=2099200 reals
[1] using MPI
...
​[brdw0.nvidia.com:39733] CUDA: cuMemHostRegister OK on test region
[brdw0.nvidia.com:39733] CUDA: the extra gpu memory check is off
[brdw0.nvidia.com:39733] CUDA: initialized
Jacobi relaxation: 1000 iterations on 2048 x 2048 mesh with norm check every 100 iterations
    0, 11.310951
  100, 0.317339
  200, 0.189263
  300, 0.139756
  400, 0.112668
  500, 0.095314
  600, 0.083131
  700, 0.074049
  800, 0.066984
  900, 0.061312
Num GPUs: 2.
2048x2048: 1 GPU:   0.4874 s, 2 GPUs:   0.2866 s, speedup:     1.70, efficiency:    85.04 
1 GPU: single kernel execution took 0.000457 s
```

for 4096x4096 it hangs:
```shell
[1] GPU0 name=Tesla P100-PCIE-16GB clockRate=1328500 memoryClockRate=715000 multiProcessorCount=56 <==
[0] GPU0 name=Tesla P100-PCIE-16GB clockRate=1328500 memoryClockRate=715000 multiProcessorCount=56 <==
Single GPU jacobi relaxation: 1000 iterations on 4096 x 4096 mesh with norm check every 100 iterations
    0, 15.998030
  100, 0.448909
  200, 0.267773
  300, 0.197771
  400, 0.159468
  500, 0.134929
  600, 0.117704
  700, 0.104862
  800, 0.094873
  900, 0.086856
[1] allocated a/a_new size=8392704 reals
[1] using MPI
[brdw0.nvidia.com:39595] CUDA: entering stage three init
[brdw0.nvidia.com:39595] CUDA: cuCtxGetCurrent succeeded
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xe0f000, bufsize=4096
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xe15000, bufsize=4096
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xe18000, bufsize=4096
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xe1b000, bufsize=20480
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xe21000, bufsize=20480
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xe28000, bufsize=102400
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xe42000, bufsize=102400
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f85ed6000, bufsize=1052672
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f84014000, bufsize=1052672
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xeb4000, bufsize=8192
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xeb9000, bufsize=8192
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xebe000, bufsize=69632
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xed2000, bufsize=69632
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xee6000, bufsize=69632
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xefa000, bufsize=69632
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xf0e000, bufsize=69632
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xf22000, bufsize=69632
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xf36000, bufsize=69632
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xf4a000, bufsize=69632
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f85e73000, bufsize=397312
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f85e10000, bufsize=397312
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7e50e000, bufsize=397312
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7e4ab000, bufsize=397312
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7e448000, bufsize=397312
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7e3e5000, bufsize=397312
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7e382000, bufsize=397312
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7e0fe000, bufsize=397312
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7c08e000, bufsize=4198400
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7bc8b000, bufsize=4198400
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7b888000, bufsize=4198400
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7b485000, bufsize=4198400
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7b082000, bufsize=4198400
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7ac7f000, bufsize=4198400
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7a87c000, bufsize=4198400
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7a479000, bufsize=4198400
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on test region
[brdw0.nvidia.com:39595] CUDA: the extra gpu memory check is off
[brdw0.nvidia.com:39595] CUDA: initialized
[brdw0.nvidia.com:39595] CUDA: progress_one_cuda_dtoh_event, outstanding_events=1
[brdw0.nvidia.com:39595] CUDA: cuEventQuery returned CUDA_ERROR_NOT_READY
[brdw0.nvidia.com:39595] CUDA: progress_one_cuda_dtoh_event, outstanding_events=1
[brdw0.nvidia.com:39595] CUDA: cuEventQuery returned 0
[brdw1.nvidia.com:37290] CUDA: progress_one_cuda_dtoh_event, outstanding_events=1
[brdw1.nvidia.com:37290] CUDA: cuEventQuery returned CUDA_ERROR_NOT_READY
[brdw1.nvidia.com:37290] CUDA: progress_one_cuda_dtoh_event, outstanding_events=1
[brdw1.nvidia.com:37290] CUDA: cuEventQuery returned 0
```




Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Open MPI 3.0.0 hangs in code using the GPU aware MPI feature #4650

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Please describe the system on which you are running

Details of the problem

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Open MPI 3.0.0 hangs in code using the GPU aware MPI feature #4650

Description

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Please describe the system on which you are running

Details of the problem

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions