
Open MPI 3.0.0 hangs in code using the GPU aware MPI feature #4650

@drossetti

Description

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

3.0.0

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

from source using openmpi-3.0.0.tar.bz2
./configure --prefix=/opt/openmpi/v3.0.0 --with-cuda=/usr/local/cuda-9.0 --without-ucx --with-pmi --with-knem=/opt/knem-1.1.2.90mlnx2

Please describe the system on which you are running

  • Operating system/version: CentOS Linux release 7.3.1611 (Core)
  • Computer hardware: Intel(R) Xeon(R) CPU E5-2687W v4 @ 3.00GHz
  • Network type: MLNX_OFED_LINUX-4.0-2.0.0.1
  • CUDA: 9.0.176
  • NVIDIA driver: 384.81


Details of the problem

I am testing with an internal application that simply performs a couple of MPI_Sendrecv() calls with GPU memory pointers:

    while ( l2_norm > tol && iter < iter_max )
    {
        /* Reset the norm accumulator and launch one Jacobi step on the compute stream. */
        CUDA_RT_CALL( cudaMemsetAsync( l2_norm_d, 0, sizeof(real), compute_stream ) );
        launch_jacobi_kernel( a_new, a, l2_norm_d, iy_start, iy_end, nx, compute_stream );
        CUDA_RT_CALL( cudaEventRecord( compute_done, compute_stream ) );
        /* Periodically copy the residual norm back to the host. */
        if ( (iter % nccheck) == 0 || (!csv && (iter % 100) == 0) ) {
            CUDA_RT_CALL( cudaMemcpyAsync( l2_norm_h, l2_norm_d, sizeof(real), cudaMemcpyDeviceToHost, compute_stream ) );
        }
        /* Ring neighbors (periodic in y). */
        const int top = rank > 0 ? rank - 1 : (size-1);
        const int bottom = (rank+1)%size;
        /* Make sure the kernel has finished before touching its output. */
        CUDA_RT_CALL( cudaEventSynchronize( compute_done ) );
        /* Halo exchange: GPU memory pointers are passed directly to MPI. */
        MPI_CALL( MPI_Sendrecv( a_new+iy_start*nx,   nx, MPI_REAL_TYPE, top   , 0, a_new+(iy_end*nx), nx, MPI_REAL_TYPE, bottom, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE ));
        MPI_CALL( MPI_Sendrecv( a_new+(iy_end-1)*nx, nx, MPI_REAL_TYPE, bottom, 0, a_new,             nx, MPI_REAL_TYPE, top,    0, MPI_COMM_WORLD, MPI_STATUS_IGNORE ));
        POP_RANGE  /* closes an NVTX range; the matching PUSH_RANGE is outside this excerpt */
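
For reference, the pattern can be reduced to a standalone sketch along these lines (illustrative names, not the internal application itself); it assumes only that device pointers from cudaMalloc() are handed straight to MPI_Sendrecv():

    /* Standalone sketch of the pattern above (illustrative, not the internal
       application): each rank exchanges a device buffer with its ring
       neighbors via MPI_Sendrecv, relying on GPU-aware Open MPI. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        MPI_Init(&argc, &argv);
        int rank, size, ndev;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        cudaGetDeviceCount(&ndev);
        cudaSetDevice(rank % ndev);              /* one GPU per rank */

        const int nx = 4096;                     /* one halo row of the failing case */
        float *send_d, *recv_d;                  /* the app's real may be double */
        cudaMalloc(&send_d, nx * sizeof(float));
        cudaMalloc(&recv_d, nx * sizeof(float));
        cudaMemset(send_d, 0, nx * sizeof(float));

        const int top    = rank > 0 ? rank - 1 : size - 1;
        const int bottom = (rank + 1) % size;

        /* Device pointers handed straight to MPI, as in the application loop. */
        for (int iter = 0; iter < 1000; ++iter)
            MPI_Sendrecv(send_d, nx, MPI_FLOAT, top,    0,
                         recv_d, nx, MPI_FLOAT, bottom, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        if (rank == 0) printf("done\n");
        cudaFree(send_d);
        cudaFree(recv_d);
        MPI_Finalize();
        return 0;
    }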

Up to 2048x2048 it works:

[1] GPU0 name=Tesla P100-PCIE-16GB clockRate=1328500 memoryClockRate=715000 multiProcessorCount=56 <==
[0] GPU0 name=Tesla P100-PCIE-16GB clockRate=1328500 memoryClockRate=715000 multiProcessorCount=56 <==
Single GPU jacobi relaxation: 1000 iterations on 2048 x 2048 mesh with norm check every 100 iterations
    0, 11.310944
  100, 0.317338
  200, 0.189262
  300, 0.139756
  400, 0.112668
  500, 0.095314
  600, 0.083131
  700, 0.074048
  800, 0.066984
  900, 0.061312
[1] allocated a/a_new size=2099200 reals
[1] using MPI
...
[brdw0.nvidia.com:39733] CUDA: cuMemHostRegister OK on test region
[brdw0.nvidia.com:39733] CUDA: the extra gpu memory check is off
[brdw0.nvidia.com:39733] CUDA: initialized
Jacobi relaxation: 1000 iterations on 2048 x 2048 mesh with norm check every 100 iterations
    0, 11.310951
  100, 0.317339
  200, 0.189263
  300, 0.139756
  400, 0.112668
  500, 0.095314
  600, 0.083131
  700, 0.074049
  800, 0.066984
  900, 0.061312
Num GPUs: 2.
2048x2048: 1 GPU:   0.4874 s, 2 GPUs:   0.2866 s, speedup:     1.70, efficiency:    85.04 
1 GPU: single kernel execution took 0.000457 s
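
(As a sanity check on the printed numbers: speedup = 0.4874 s / 0.2866 s ≈ 1.70, and efficiency = 1.70 / 2 GPUs ≈ 85.04%.)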

For 4096x4096 it hangs:

[1] GPU0 name=Tesla P100-PCIE-16GB clockRate=1328500 memoryClockRate=715000 multiProcessorCount=56 <==
[0] GPU0 name=Tesla P100-PCIE-16GB clockRate=1328500 memoryClockRate=715000 multiProcessorCount=56 <==
Single GPU jacobi relaxation: 1000 iterations on 4096 x 4096 mesh with norm check every 100 iterations
    0, 15.998030
  100, 0.448909
  200, 0.267773
  300, 0.197771
  400, 0.159468
  500, 0.134929
  600, 0.117704
  700, 0.104862
  800, 0.094873
  900, 0.086856
[1] allocated a/a_new size=8392704 reals
[1] using MPI
[brdw0.nvidia.com:39595] CUDA: entering stage three init
[brdw0.nvidia.com:39595] CUDA: cuCtxGetCurrent succeeded
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xe0f000, bufsize=4096
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xe15000, bufsize=4096
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xe18000, bufsize=4096
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xe1b000, bufsize=20480
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xe21000, bufsize=20480
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xe28000, bufsize=102400
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xe42000, bufsize=102400
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f85ed6000, bufsize=1052672
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f84014000, bufsize=1052672
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xeb4000, bufsize=8192
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xeb9000, bufsize=8192
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xebe000, bufsize=69632
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xed2000, bufsize=69632
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xee6000, bufsize=69632
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xefa000, bufsize=69632
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xf0e000, bufsize=69632
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xf22000, bufsize=69632
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xf36000, bufsize=69632
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xf4a000, bufsize=69632
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f85e73000, bufsize=397312
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f85e10000, bufsize=397312
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7e50e000, bufsize=397312
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7e4ab000, bufsize=397312
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7e448000, bufsize=397312
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7e3e5000, bufsize=397312
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7e382000, bufsize=397312
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7e0fe000, bufsize=397312
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7c08e000, bufsize=4198400
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7bc8b000, bufsize=4198400
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7b888000, bufsize=4198400
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7b485000, bufsize=4198400
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7b082000, bufsize=4198400
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7ac7f000, bufsize=4198400
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7a87c000, bufsize=4198400
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7a479000, bufsize=4198400
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on test region
[brdw0.nvidia.com:39595] CUDA: the extra gpu memory check is off
[brdw0.nvidia.com:39595] CUDA: initialized
[brdw0.nvidia.com:39595] CUDA: progress_one_cuda_dtoh_event, outstanding_events=1
[brdw0.nvidia.com:39595] CUDA: cuEventQuery returned CUDA_ERROR_NOT_READY
[brdw0.nvidia.com:39595] CUDA: progress_one_cuda_dtoh_event, outstanding_events=1
[brdw0.nvidia.com:39595] CUDA: cuEventQuery returned 0
[brdw1.nvidia.com:37290] CUDA: progress_one_cuda_dtoh_event, outstanding_events=1
[brdw1.nvidia.com:37290] CUDA: cuEventQuery returned CUDA_ERROR_NOT_READY
[brdw1.nvidia.com:37290] CUDA: progress_one_cuda_dtoh_event, outstanding_events=1
[brdw1.nvidia.com:37290] CUDA: cuEventQuery returned 0
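
The last lines from both ranks show the smcuda progress loop polling cuEventQuery on a single outstanding device-to-host event; after that, neither side logs further progress. To help isolate whether the GPU-aware path itself is at fault, a host-staged variant of the halo exchange can be tried. The sketch below reuses the variable names and error-check macros from the excerpt above and is illustrative only, not the application's actual code; the pinned buffers would be allocated once, outside the iteration loop:

    /* Host-staged halo exchange: MPI only ever sees host pointers, bypassing
       the GPU-aware (smcuda/openib CUDA) path. Illustrative sketch. */
    real *send_top_h, *send_bot_h, *recv_top_h, *recv_bot_h;
    CUDA_RT_CALL( cudaMallocHost( &send_top_h, nx * sizeof(real) ) );
    CUDA_RT_CALL( cudaMallocHost( &send_bot_h, nx * sizeof(real) ) );
    CUDA_RT_CALL( cudaMallocHost( &recv_top_h, nx * sizeof(real) ) );
    CUDA_RT_CALL( cudaMallocHost( &recv_bot_h, nx * sizeof(real) ) );

    /* Inside the loop, replacing the two MPI_Sendrecv() calls; cudaMemcpy is
       synchronous, so no extra event synchronization is needed here. */
    CUDA_RT_CALL( cudaMemcpy( send_top_h, a_new + iy_start * nx,
                              nx * sizeof(real), cudaMemcpyDeviceToHost ) );
    CUDA_RT_CALL( cudaMemcpy( send_bot_h, a_new + (iy_end - 1) * nx,
                              nx * sizeof(real), cudaMemcpyDeviceToHost ) );
    MPI_CALL( MPI_Sendrecv( send_top_h, nx, MPI_REAL_TYPE, top,    0,
                            recv_bot_h, nx, MPI_REAL_TYPE, bottom, 0,
                            MPI_COMM_WORLD, MPI_STATUS_IGNORE ) );
    MPI_CALL( MPI_Sendrecv( send_bot_h, nx, MPI_REAL_TYPE, bottom, 0,
                            recv_top_h, nx, MPI_REAL_TYPE, top,    0,
                            MPI_COMM_WORLD, MPI_STATUS_IGNORE ) );
    CUDA_RT_CALL( cudaMemcpy( a_new + iy_end * nx, recv_bot_h,
                              nx * sizeof(real), cudaMemcpyHostToDevice ) );
    CUDA_RT_CALL( cudaMemcpy( a_new, recv_top_h,
                              nx * sizeof(real), cudaMemcpyHostToDevice ) );

If this staged version runs to completion at 4096x4096, that would point at the device-pointer path rather than the application logic.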
