Open MPI 3.0.0 hangs in code using the GPU aware MPI feature #4650

Open
drossetti opened this issue Dec 20, 2017 · 13 comments

Comments

@drossetti

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

3.0.0

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

from source using openmpi-3.0.0.tar.bz2
./configure --prefix=/opt/openmpi/v3.0.0 --with-cuda=/usr/local/cuda-9.0 --without-ucx --with-pmi --with-knem=/opt/knem-1.1.2.90mlnx2
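
(As a sanity check, whether a build like this actually has CUDA-aware support can be queried programmatically; a minimal sketch using Open MPI's mpi-ext.h extension, compiled with mpicc:)

    #include <stdio.h>
    #include <mpi.h>
    #include <mpi-ext.h>   /* Open MPI extension header: defines MPIX_CUDA_AWARE_SUPPORT */

    int main(int argc, char *argv[])
    {
        MPI_Init(&argc, &argv);
    #if defined(MPIX_CUDA_AWARE_SUPPORT) && MPIX_CUDA_AWARE_SUPPORT
        printf("compile-time CUDA-aware support: yes\n");
        printf("run-time CUDA-aware support: %s\n",
               MPIX_Query_cuda_support() ? "yes" : "no");
    #else
        printf("compile-time CUDA-aware support: no\n");
    #endif
        MPI_Finalize();
        return 0;
    }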

Please describe the system on which you are running

  • Operating system/version: CentOS Linux release 7.3.1611 (Core)
  • Computer hardware: Intel(R) Xeon(R) CPU E5-2687W v4 @ 3.00GHz
  • Network type: MLNX_OFED_LINUX-4.0-2.0.0.1
  • CUDA: 9.0.176
  • NVIDIA driver: 384.81


Details of the problem

I am testing with an internal application that simply has a couple of MPI_Sendrecv() calls with GPU memory pointers.

    while ( l2_norm > tol && iter < iter_max )
    {
        CUDA_RT_CALL( cudaMemsetAsync(l2_norm_d, 0 , sizeof(real), compute_stream ) );
        launch_jacobi_kernel( a_new, a, l2_norm_d, iy_start, iy_end, nx, compute_stream );
        CUDA_RT_CALL( cudaEventRecord( compute_done, compute_stream ) );
        if ( (iter % nccheck) == 0 || (!csv && (iter % 100) == 0) ) {
            CUDA_RT_CALL( cudaMemcpyAsync( l2_norm_h, l2_norm_d, sizeof(real), cudaMemcpyDeviceToHost, compute_stream ) );
        }
        const int top = rank > 0 ? rank - 1 : (size-1);
        const int bottom = (rank+1)%size;
        CUDA_RT_CALL( cudaEventSynchronize( compute_done ) );
        MPI_CALL( MPI_Sendrecv( a_new+iy_start*nx,   nx, MPI_REAL_TYPE, top   , 0, a_new+(iy_end*nx), nx, MPI_REAL_TYPE, bottom, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE ));
        MPI_CALL( MPI_Sendrecv( a_new+(iy_end-1)*nx, nx, MPI_REAL_TYPE, bottom, 0, a_new,             nx, MPI_REAL_TYPE, top,    0, MPI_COMM_WORLD, MPI_STATUS_IGNORE ));
        POP_RANGE
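
For context, a hypothetical standalone sketch of the same pattern (device pointers passed straight to MPI_Sendrecv); the buffer layout is illustrative and error checking is omitted:

    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char *argv[])
    {
        int rank, size;
        const int nx = 4096;                 /* one halo row of the 4096x4096 case */
        double *a_new = NULL;                /* device buffer holding two halo rows */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        cudaSetDevice(0);                    /* one GPU per rank, as in the runs below */
        cudaMalloc((void **)&a_new, 2 * nx * sizeof(double));
        cudaMemset(a_new, 0, 2 * nx * sizeof(double));

        const int top    = rank > 0 ? rank - 1 : size - 1;
        const int bottom = (rank + 1) % size;

        /* Exchange one row of device memory with each neighbor; this is the
         * GPU-aware exchange the application performs every iteration. */
        MPI_Sendrecv(a_new,      nx, MPI_DOUBLE, top,    0,
                     a_new + nx, nx, MPI_DOUBLE, bottom, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        cudaFree(a_new);
        MPI_Finalize();
        return 0;
    }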

Up to a 2048x2048 mesh it works:

[1] GPU0 name=Tesla P100-PCIE-16GB clockRate=1328500 memoryClockRate=715000 multiProcessorCount=56 <==
[0] GPU0 name=Tesla P100-PCIE-16GB clockRate=1328500 memoryClockRate=715000 multiProcessorCount=56 <==
Single GPU jacobi relaxation: 1000 iterations on 2048 x 2048 mesh with norm check every 100 iterations
    0, 11.310944
  100, 0.317338
  200, 0.189262
  300, 0.139756
  400, 0.112668
  500, 0.095314
  600, 0.083131
  700, 0.074048
  800, 0.066984
  900, 0.061312
[1] allocated a/a_new size=2099200 reals
[1] using MPI
...
[brdw0.nvidia.com:39733] CUDA: cuMemHostRegister OK on test region
[brdw0.nvidia.com:39733] CUDA: the extra gpu memory check is off
[brdw0.nvidia.com:39733] CUDA: initialized
Jacobi relaxation: 1000 iterations on 2048 x 2048 mesh with norm check every 100 iterations
    0, 11.310951
  100, 0.317339
  200, 0.189263
  300, 0.139756
  400, 0.112668
  500, 0.095314
  600, 0.083131
  700, 0.074049
  800, 0.066984
  900, 0.061312
Num GPUs: 2.
2048x2048: 1 GPU:   0.4874 s, 2 GPUs:   0.2866 s, speedup:     1.70, efficiency:    85.04 
1 GPU: single kernel execution took 0.000457 s

For a 4096x4096 mesh it hangs:

[1] GPU0 name=Tesla P100-PCIE-16GB clockRate=1328500 memoryClockRate=715000 multiProcessorCount=56 <==
[0] GPU0 name=Tesla P100-PCIE-16GB clockRate=1328500 memoryClockRate=715000 multiProcessorCount=56 <==
Single GPU jacobi relaxation: 1000 iterations on 4096 x 4096 mesh with norm check every 100 iterations
    0, 15.998030
  100, 0.448909
  200, 0.267773
  300, 0.197771
  400, 0.159468
  500, 0.134929
  600, 0.117704
  700, 0.104862
  800, 0.094873
  900, 0.086856
[1] allocated a/a_new size=8392704 reals
[1] using MPI
[brdw0.nvidia.com:39595] CUDA: entering stage three init
[brdw0.nvidia.com:39595] CUDA: cuCtxGetCurrent succeeded
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xe0f000, bufsize=4096
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xe15000, bufsize=4096
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xe18000, bufsize=4096
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xe1b000, bufsize=20480
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xe21000, bufsize=20480
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xe28000, bufsize=102400
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xe42000, bufsize=102400
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f85ed6000, bufsize=1052672
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f84014000, bufsize=1052672
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xeb4000, bufsize=8192
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xeb9000, bufsize=8192
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xebe000, bufsize=69632
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xed2000, bufsize=69632
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xee6000, bufsize=69632
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xefa000, bufsize=69632
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xf0e000, bufsize=69632
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xf22000, bufsize=69632
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xf36000, bufsize=69632
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0xf4a000, bufsize=69632
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f85e73000, bufsize=397312
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f85e10000, bufsize=397312
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7e50e000, bufsize=397312
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7e4ab000, bufsize=397312
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7e448000, bufsize=397312
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7e3e5000, bufsize=397312
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7e382000, bufsize=397312
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7e0fe000, bufsize=397312
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7c08e000, bufsize=4198400
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7bc8b000, bufsize=4198400
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7b888000, bufsize=4198400
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7b485000, bufsize=4198400
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7b082000, bufsize=4198400
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7ac7f000, bufsize=4198400
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7a87c000, bufsize=4198400
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on rcache grdma: address=0x7f7f7a479000, bufsize=4198400
[brdw0.nvidia.com:39595] CUDA: cuMemHostRegister OK on test region
[brdw0.nvidia.com:39595] CUDA: the extra gpu memory check is off
[brdw0.nvidia.com:39595] CUDA: initialized
[brdw0.nvidia.com:39595] CUDA: progress_one_cuda_dtoh_event, outstanding_events=1
[brdw0.nvidia.com:39595] CUDA: cuEventQuery returned CUDA_ERROR_NOT_READY
[brdw0.nvidia.com:39595] CUDA: progress_one_cuda_dtoh_event, outstanding_events=1
[brdw0.nvidia.com:39595] CUDA: cuEventQuery returned 0
[brdw1.nvidia.com:37290] CUDA: progress_one_cuda_dtoh_event, outstanding_events=1
[brdw1.nvidia.com:37290] CUDA: cuEventQuery returned CUDA_ERROR_NOT_READY
[brdw1.nvidia.com:37290] CUDA: progress_one_cuda_dtoh_event, outstanding_events=1
[brdw1.nvidia.com:37290] CUDA: cuEventQuery returned 0
@drossetti
Author

As a workaround, setting the following environment variable avoids the hang:
export OMPI_MCA_btl_openib_cuda_async_recv=false
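
(The same setting can also be passed on the mpirun command line as --mca btl_openib_cuda_async_recv false, following Open MPI's usual OMPI_MCA_* naming convention.)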

@drossetti
Author

Probably the same issue as #4649.

@backyes

backyes commented Dec 28, 2017

I encountered a similar GDR hang, with one core stuck at 100% user time.

When I use the CUDA-aware perftest tool from the MLNX_OFED stack, the GDR feature works fine, whereas the OSU benchmark with CUDA-aware MPI hangs forever.

@jsquyres
Member

jsquyres commented Jan 3, 2018

@drossetti Is someone at NVIDIA looking into this and/or #4649?

@drossetti
Author

@Akshay-Venkatesh do you know if anybody is working on this problem?

@Akshay-Venkatesh
Contributor

@jsquyres
I looked at this issue when I raised the now-closed post concerning it, but no resources have been assigned to it since then. Given that the openib BTL is not the way forward, this has lower priority. Is there anyone familiar with the PML progress path (the culprit here, since there was no progress trigger) who can look into it?

@jsquyres
Member

I'm sure that someone can talk you through how the PML works (e.g., @bosilca, whom I think you guys already know), but I'm unaware of anyone who has the GPU resources to debug this issue.

@jsquyres
Member

More specifically: AFAIK, NVIDIA added this GPU Direct RDMA code to Open MPI. It would be great if NVIDIA would support and maintain it. Thanks!

@Akshay-Venkatesh
Contributor

Akshay-Venkatesh commented Mar 30, 2018

@jsquyres
After some digging around, I was able to locate the cause of the hang.

git blame shows that mca_pml_ob1_progress was commented out in commit c2cd717, whose message, surprisingly, reads "Don't refcount the predefined datatypes."
This is the specific line:
c2cd717#diff-946cfcf2824a3b50dbd054321338388dR61

@bosilca @jsquyres
What was the reasoning behind setting the progress function to NULL? Was this intentional?

The fix is simple on master and the hang goes away, but I haven't tested it extensively. Let me know your thoughts:
ob1-progress-hang-patch.txt

CC: @drossetti

@bosilca
Member

bosilca commented Apr 1, 2018

The reason was to decrease the latency in cases where there is no backlog on the OB1 PML (i.e. no messages on mca_pml_ob1.send_pending). Your proposed patch would reintroduce that performance hit.

The function mca_pml_ob1_enable_progress is supposed to be called when OB1 progress is required, in other words when messages are not driving the progress themselves. Calling mca_pml_ob1_enable_progress on the execution path that leads to the deadlock is the desirable approach, as it keeps the non-CUDA execution path at its current performance level.

@Akshay-Venkatesh
Contributor

Akshay-Venkatesh commented Apr 2, 2018

@bosilca Thanks for the explanation. Is it reasonable to enable mca_pml_ob1_progress for builds configured with --with-cuda?

Something along these lines...

#if OPAL_CUDA_SUPPORT
mca_pml_ob1_progress,
#else
NULL,
#endif

I understand that non-CUDA transfers may take a performance hit in CUDA builds, but at least the deadlock would be avoided. If not, do you have suggestions on how asynchronous receives for CUDA rendezvous transfers could be made not to depend on mca_pml_ob1_progress (if that is possible at all)?

@bosilca
Member

bosilca commented Apr 2, 2018

Doing so would keep the performance hit in all OMPI versions shipped with CUDA support, which, if I'm not mistaken, includes all versions shipped by distros.

A possible solution to your problem is to check the CONVERTOR_CUDA flag on the convertor for every send/recv/window operation; if it is set, make sure the OB1 progress is registered by calling mca_pml_ob1_enable_progress. However, I don't like what we have right now: it's a non-symmetrical solution, since once a costly event is registered, the progress remains enabled forever. A much cleaner solution would be to have the CUDA part of the progress (i.e. mca_pml_ob1_process_pending_cuda_async_copies) registered as a progress function only for as long as there are pending CUDA events (basically unregistering it when no pending CUDA events remain). I would be happy to work with you toward such a solution.
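
To make that concrete, here is a rough sketch of the dynamic-registration idea (illustrative only, not actual Open MPI code: opal_progress_register/opal_progress_unregister are the existing OPAL progress hooks, mca_pml_ob1_process_pending_cuda_async_copies is the function named above with an assumed signature, and the helper names, flag, and event-count query are hypothetical):

    #include <stdbool.h>
    #include "opal/runtime/opal_progress.h"   /* opal_progress_register/unregister */

    static bool cuda_progress_registered = false;           /* hypothetical flag */

    /* Progress callback: drain pending CUDA async copies and unregister
     * itself once no CUDA events remain outstanding. */
    static int mca_pml_ob1_cuda_progress(void)
    {
        /* assumed to return the number of completed copies */
        int completed = mca_pml_ob1_process_pending_cuda_async_copies();

        if (0 == mca_pml_ob1_outstanding_cuda_events()) {    /* hypothetical query */
            opal_progress_unregister(mca_pml_ob1_cuda_progress);
            cuda_progress_registered = false;
        }
        return completed;
    }

    /* Called wherever a CUDA async copy is queued, e.g. when the
     * CONVERTOR_CUDA flag is set on the convertor for a send/recv. */
    static void mca_pml_ob1_cuda_need_progress(void)
    {
        if (!cuda_progress_registered) {
            opal_progress_register(mca_pml_ob1_cuda_progress);
            cuda_progress_registered = true;
        }
    }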

@Akshay-Venkatesh
Contributor

I agree with your suggestion. I've sent you an email for a meeting.
