## Background information

### What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
3.0.0
### Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From source, using the openmpi-3.0.0.tar.bz2 release tarball, configured with:

```
./configure --prefix=/opt/openmpi/v3.0.0 --with-cuda=/usr/local/cuda-9.0 --without-ucx --with-pmi --with-knem=/opt/knem-1.1.2.90mlnx2
```
### Please describe the system on which you are running

- Operating system/version: CentOS Linux release 7.3.1611 (Core)
- Computer hardware: Intel(R) Xeon(R) CPU E5-2687W v4 @ 3.00GHz
- Network type: MLNX_OFED_LINUX-4.0-2.0.0.1
- CUDA toolkit: 9.0.176
- NVIDIA driver: 384.81
## Details of the problem

I am testing with an internal application that simply has a couple of MPI_Sendrecv() calls passing GPU (device) memory pointers. The relevant part of the iteration loop:
```c
while ( l2_norm > tol && iter < iter_max )
{
    CUDA_RT_CALL( cudaMemsetAsync( l2_norm_d, 0, sizeof(real), compute_stream ) );
    launch_jacobi_kernel( a_new, a, l2_norm_d, iy_start, iy_end, nx, compute_stream );
    CUDA_RT_CALL( cudaEventRecord( compute_done, compute_stream ) );

    if ( (iter % nccheck) == 0 || (!csv && (iter % 100) == 0) ) {
        CUDA_RT_CALL( cudaMemcpyAsync( l2_norm_h, l2_norm_d, sizeof(real), cudaMemcpyDeviceToHost, compute_stream ) );
    }

    const int top    = rank > 0 ? rank - 1 : (size - 1);
    const int bottom = (rank + 1) % size;

    CUDA_RT_CALL( cudaEventSynchronize( compute_done ) );
    MPI_CALL( MPI_Sendrecv( a_new + iy_start * nx,       nx, MPI_REAL_TYPE, top,    0,
                            a_new + (iy_end * nx),       nx, MPI_REAL_TYPE, bottom, 0,
                            MPI_COMM_WORLD, MPI_STATUS_IGNORE ) );
    MPI_CALL( MPI_Sendrecv( a_new + (iy_end - 1) * nx,   nx, MPI_REAL_TYPE, bottom, 0,
                            a_new,                       nx, MPI_REAL_TYPE, top,    0,
                            MPI_COMM_WORLD, MPI_STATUS_IGNORE ) );
    POP_RANGE
    /* ... remainder of the loop body elided ... */
}
```
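In case it helps triage, the same MPI_Sendrecv-on-device-pointer pattern can be isolated along these lines. This is a sketch, not the internal application: the GPU selection, message size, and iteration count are assumptions chosen to mirror the failing 4096-wide case.

```c
/* Minimal CUDA-aware MPI_Sendrecv exerciser (sketch; sizes and
 * device assignment are assumptions, not the real application). */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* 4096 elements per halo row: the width at which the hang appears */
    const int nx = 4096;
    float *sendbuf, *recvbuf;
    cudaSetDevice(rank % 2);               /* assumption: two GPUs per node */
    cudaMalloc((void **)&sendbuf, nx * sizeof(float));
    cudaMalloc((void **)&recvbuf, nx * sizeof(float));
    cudaMemset(sendbuf, 0, nx * sizeof(float));

    const int top    = rank > 0 ? rank - 1 : size - 1;
    const int bottom = (rank + 1) % size;

    for (int iter = 0; iter < 1000; ++iter) {
        /* Device pointers passed directly to MPI, as in the application */
        MPI_Sendrecv(sendbuf, nx, MPI_FLOAT, top, 0,
                     recvbuf, nx, MPI_FLOAT, bottom, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(sendbuf, nx, MPI_FLOAT, bottom, 0,
                     recvbuf, nx, MPI_FLOAT, top, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    if (rank == 0) printf("done\n");
    cudaFree(sendbuf);
    cudaFree(recvbuf);
    MPI_Finalize();
    return 0;
}
```

Built with `mpicc` against the CUDA-aware 3.0.0 install and run with `mpirun -np 2`, this exercises the same device-to-device Sendrecv path without the rest of the Jacobi solver.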
Up to a 2048x2048 mesh it works:
```
[1] GPU0 name=Tesla P100-PCIE-16GB clockRate=1328500 memoryClockRate=715000 multiProcessorCount=56 <==
[0] GPU0 name=Tesla P100-PCIE-16GB clockRate=1328500 memoryClockRate=715000 multiProcessorCount=56 <==
Single GPU jacobi relaxation: 1000 iterations on 2048 x 2048 mesh with norm check every 100 iterations
    0, 11.310944
  100, 0.317338
  200, 0.189262
  300, 0.139756
  400, 0.112668
  500, 0.095314
  600, 0.083131
  700, 0.074048
  800, 0.066984
  900, 0.061312
[1] allocated a/a_new size=2099200 reals
[1] using MPI
...
[brdw0.nvidia.com:39733] CUDA: cuMemHostRegister OK on test region
[brdw0.nvidia.com:39733] CUDA: the extra gpu memory check is off
[brdw0.nvidia.com:39733] CUDA: initialized
Jacobi relaxation: 1000 iterations on 2048 x 2048 mesh with norm check every 100 iterations
    0, 11.310951
  100, 0.317339
  200, 0.189263
  300, 0.139756
  400, 0.112668
  500, 0.095314
  600, 0.083131
  700, 0.074049
  800, 0.066984
  900, 0.061312
Num GPUs: 2.
2048x2048: 1 GPU: 0.4874 s, 2 GPUs: 0.2866 s, speedup: 1.70, efficiency: 85.04
1 GPU: single kernel execution took 0.000457 s
```
For a 4096x4096 mesh it hangs:
```
[1] GPU0 name=Tesla P100-PCIE-16GB clockRate=1328500 memoryClockRate=715000 multiProcessorCount=56 <==
[0] GPU0 name=Tesla P100-PCIE-16GB clockRate=1328500 memoryClockRate=715000 multiProcessorCount=56 <==
Single GPU jacobi relaxation: 1000 iterations on 4096 x 4096 mesh with norm check every 100 iterations
    0, 15.998030
  100, 0.448909
  200, 0.267773
  300, 0.197771
  400, 0.159468
  500, 0.134929
  600, 0.117704
  700, 0.104862
  800, 0.094873
  900, 0.086856
[1] allocated a/a_new size=8392704 reals
[1] using MPI
[brdw0.nvidia.com:39595] CUDA: entering stage three init
[brdw0.nvidia.com:39595] CUDA: cuCtxGetCurrent succeeded
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xe0f000, bufsize=4096
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xe15000, bufsize=4096
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xe18000, bufsize=4096
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xe1b000, bufsize=20480
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xe21000, bufsize=20480
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xe28000, bufsize=102400
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xe42000, bufsize=102400
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f85ed6000, bufsize=1052672
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f84014000, bufsize=1052672
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xeb4000, bufsize=8192
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xeb9000, bufsize=8192
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xebe000, bufsize=69632
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xed2000, bufsize=69632
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xee6000, bufsize=69632
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xefa000, bufsize=69632
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xf0e000, bufsize=69632
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xf22000, bufsize=69632
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xf36000, bufsize=69632
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xf4a000, bufsize=69632
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f85e73000, bufsize=397312
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f85e10000, bufsize=397312
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7e50e000, bufsize=397312
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7e4ab000, bufsize=397312
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7e448000, bufsize=397312
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7e3e5000, bufsize=397312
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7e382000, bufsize=397312
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7e0fe000, bufsize=397312
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7c08e000, bufsize=4198400
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7bc8b000, bufsize=4198400
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7b888000, bufsize=4198400
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7b485000, bufsize=4198400
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7b082000, bufsize=4198400
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7ac7f000, bufsize=4198400
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7a87c000, bufsize=4198400
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7a479000, bufsize=4198400
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on test region
[brdw0.nvidia.com:39595] CUDA: the extra gpu memory check is off
[brdw0.nvidia.com:39595] CUDA: initialized
[brdw0.nvidia.com:39595] CUDA: progress_one_cuda_dtoh_event, outstanding_events=1
[brdw0.nvidia.com:39595] CUDA: cuEventQuery returned CUDA_ERROR_NOT_READY
[brdw0.nvidia.com:39595] CUDA: progress_one_cuda_dtoh_event, outstanding_events=1
[brdw0.nvidia.com:39595] CUDA: cuEventQuery returned 0
[brdw1.nvidia.com:37290] CUDA: progress_one_cuda_dtoh_event, outstanding_events=1
[brdw1.nvidia.com:37290] CUDA: cuEventQuery returned CUDA_ERROR_NOT_READY
[brdw1.nvidia.com:37290] CUDA: progress_one_cuda_dtoh_event, outstanding_events=1
[brdw1.nvidia.com:37290] CUDA: cuEventQuery returned 0
```