Open MPI 3.0.0 hangs in code using the GPU aware MPI feature #4650
Comments
As a workaround, the following environment variable works:
Probably the same issue as #4649.
I encountered a similar GDR hang, with one core stuck at 100% user CPU. When I use the perftest tool with CUDA-aware support from the mlnx_ofed stack, the GDR feature works fine; the OSU benchmark with CUDA-aware MPI hangs forever.
@drossetti Is someone at NVIDIA looking into this and/or #4649?
@Akshay-Venkatesh do you know if anybody is working on this problem?
@jsquyres
I'm sure that someone can talk you through how the PML works (e.g., @bosilca, whom I think you guys already know), but I'm unaware of anyone who has the GPU resources to debug this issue.
More specifically: AFAIK, NVIDIA added this GPU Direct RDMA code to Open MPI. It would be great if NVIDIA would support and maintain it. Thanks!
@jsquyres git blame shows that mca_pml_ob1_progress was commented out in this commit. Surprisingly, the commit message reads "Don't refcount the predefined datatypes." @bosilca @jsquyres The fix is simple on master and the hang goes away, but I haven't tested extensively. Let me know your thoughts. CC: @drossetti
The reason was to decrease the latency for cases where there is no backlog on the OB1 PML (i.e., no messages on mca_pml_ob1.send_pending). Your proposed patch would reinstate this performance hit. The function mca_pml_ob1_enable_progress is supposed to be called when OB1 progress is required, in other words when messages are not driving the progress themselves. Calling mca_pml_ob1_enable_progress on the execution path that leads to the deadlock is the desirable approach, as it will keep the non-CUDA execution path at its current performance level.
@bosilca Thanks for the explanation. Is it reasonable to enable mca_pml_ob1_progress for builds configured with --with-cuda? Something along these lines: #if OPAL_CUDA_SUPPORT ... I understand that non-CUDA transfers may take a performance hit in CUDA builds, but at least the deadlock would be avoided. If not, do you have suggestions on how asynchronous receives for CUDA rendezvous transfers could be made not to depend on mca_pml_ob1_progress (if that is at all possible)?
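A rough illustration of what such a guard might look like (my own sketch, not the commenter's actual snippet; the enclosing function name is hypothetical, while opal_progress_register, mca_pml_ob1_progress, and mca_pml_ob1_enable_progress are existing Open MPI symbols mentioned in this thread):

```c
/* Hypothetical sketch: keep PML progress registered only in CUDA-aware builds. */
static int pml_ob1_register_progress_sketch(void)
{
#if OPAL_CUDA_SUPPORT
    /* CUDA builds: always drive the PML progress loop so pending
     * asynchronous CUDA copies are advanced even when no incoming
     * message traffic is driving progress. */
    return opal_progress_register(mca_pml_ob1_progress);
#else
    /* Non-CUDA builds: keep the existing low-latency behavior and only
     * enable progress on demand via mca_pml_ob1_enable_progress(). */
    return OMPI_SUCCESS;
#endif
}
```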
Doing so would maintain the performance hit on all OMPI versions shipped with CUDA support, which, if I'm not mistaken, include all versions shipped by distros. A possible solution to your problem is to check, for every send/recv/window, the CONVERTOR_CUDA flag on the convertor. If it is set, make sure the OB1 progress is registered by calling mca_pml_ob1_enable_progress. However, I don't like what we have right now; it's a non-symmetrical solution: once a costly event is registered, the progress remains enabled forever. A much cleaner solution would be to have the CUDA part of the progress (aka mca_pml_ob1_process_pending_cuda_async_copies) registered as a progress function only for as long as there are pending CUDA events (basically unregistering it when no pending CUDA events remain). I would be happy to work with you toward such a solution.
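A minimal sketch of the symmetric register/unregister scheme described above. The counter and hook names are hypothetical, real code would need thread-safe updates, and the return convention of mca_pml_ob1_process_pending_cuda_async_copies is assumed; only opal_progress_register/opal_progress_unregister and the function names quoted in this thread come from Open MPI:

```c
static int pending_cuda_events = 0;      /* hypothetical counter (atomics omitted) */

static int cuda_copy_progress(void)      /* hypothetical progress callback */
{
    /* Assumed to return the number of asynchronous copies completed. */
    int completed = mca_pml_ob1_process_pending_cuda_async_copies();

    pending_cuda_events -= completed;
    if (0 == pending_cuda_events) {
        /* No pending CUDA events remain: unregister ourselves so the
         * non-CUDA fast path pays no permanent progress cost. */
        opal_progress_unregister(cuda_copy_progress);
    }
    return completed;
}

static void cuda_copy_enqueued(void)     /* hypothetical enqueue hook */
{
    if (0 == pending_cuda_events++) {
        /* First pending event: start driving CUDA copy progress. */
        opal_progress_register(cuda_copy_progress);
    }
}
```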
I agree with your suggestion. I've sent you an email for a meeting.
Thank you for taking the time to submit an issue!
Background information
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
3.0.0
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
from source using openmpi-3.0.0.tar.bz2
./configure --prefix=/opt/openmpi/v3.0.0 --with-cuda=/usr/local/cuda-9.0 --without-ucx --with-pmi --with-knem=/opt/knem-1.1.2.90mlnx2
Please describe the system on which you are running
CUDA 9.0.176
NVIDIA driver 384.81
Details of the problem
I am testing with an internal application which simply has a couple of MPI_Sendrecv() calls with GPU memory pointers.
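A hypothetical minimal reproducer sketch of this pattern (not the internal application): two ranks exchange an N x N float buffer held in GPU memory via a CUDA-aware MPI_Sendrecv.

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    int n = (argc > 1) ? atoi(argv[1]) : 4096;   /* e.g. 2048 works, 4096 hangs */
    size_t bytes = (size_t)n * n * sizeof(float);
    float *d_send, *d_recv;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Device buffers; their pointers are handed directly to MPI
     * (requires a CUDA-aware Open MPI build). */
    cudaMalloc((void **)&d_send, bytes);
    cudaMalloc((void **)&d_recv, bytes);
    cudaMemset(d_send, 0, bytes);

    int peer = rank ^ 1;   /* pair up ranks 0<->1, 2<->3, ... */
    if (peer < size) {
        MPI_Sendrecv(d_send, n * n, MPI_FLOAT, peer, 0,
                     d_recv, n * n, MPI_FLOAT, peer, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    if (0 == rank) printf("Sendrecv of %dx%d floats completed\n", n, n);

    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}
```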
Up to 2048x2048 it works:
For 4096x4096 it hangs: