CUDA GPUDirect blocks on message size greater than GPUDirect RDMA limit with openib btl #3972
Comments
Are there any updates on this issue? Open MPI 3.0 hangs for me with CUDA+IB where 2.1 works fine, so maybe it's the same problem.
@jladd-mlnx, what's the recommended path for GPUDirect on Mellanox hardware in v3.0.x? Is it still OpenIB (and we should fix this issue), or is it UCX and we should figure out how to encourage a move to UCX?
@Felandric I was made aware of this bug about 20 mins ago. I missed my assignment notification that happened a year ago. My apologies. @bwbarrett I'm able to verify that the issue doesn't occur with UCX CUDA support.
@Felandric So using …
@hppritcha I thought that a decision was made a few months ago that …
@Akshay-Venkatesh Are you referring to #4650? Recall that openib is not the default for IB networks in the upcoming v4.0.0 and will likely go away in future releases.
@jsquyres I was indeed referring to that. I agree that this would be of little consequence from 4.0.0 onwards, but I think that …
@hppritcha @bwbarrett Opinions for v2.x / v3.0.x / v3.1.x? We're not going to do a new v2.x release for this, but we could commit it so that it's at least there (e.g., if anyone uses the nightly tarball).
@jsquyres This approach suffices for the time being. Admins usually install an Open MPI package on a cluster, so at least internally we can recommend picking up the nightly tarball.
Per the 2018-09-18 webex, @Akshay-Venkatesh is going to make a PR for master + release branches to change the MCA var default value as described in #3972 (comment).
BTW, also per the 2018-09-18 webex, it turns out that we are going to do another v2.1.x release (sigh) because of #5696. So I'm going to add the v2.x label to this issue, too.
Disable async receive for CUDA under OpenIB. While a performance optimization, it also causes incorrect results for transfers larger than the GPUDirect RDMA limit. This change has been validated and approved by Akshay. References open-mpi#3972 Signed-off-by: Brian Barrett <[email protected]>
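For users on existing releases who cannot pick up the patched branches, the same effect can likely be obtained at run time by disabling the CUDA async-receive path in the openib BTL. The MCA variable name below is an assumption inferred from the commit message above; confirm it exists in your build with `ompi_info --all | grep cuda_async` before relying on it.

```shell
# Hypothetical run-time equivalent of the fix: turn off the CUDA async-receive
# optimization in the openib BTL (variable name assumed from the commit text).
mpirun --mca btl_openib_cuda_async_recv 0 ...
```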
Workaround is in master; pull requests for v2.1.x, v3.0.x, v3.1.x, and v4.0.x opened.
Pull requests all merged. Closing.
Background information
When using CUDA GPUDirect to send and receive directly from GPU buffers with message sizes greater than the RDMA limit, Open MPI simply hangs instead of exhibiting the expected behavior of staging the buffers through host memory.
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
v3.0.x
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Git clone.
Commands used:
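The original command listing was not captured in this copy of the issue; a typical CUDA-aware build from a git clone looks roughly like the following, where the install prefix and CUDA path are assumptions:

```shell
# Hypothetical build steps for a CUDA-aware Open MPI from a git clone;
# adjust --prefix and the CUDA location for your system.
./autogen.pl
./configure --prefix=$HOME/ompi-install --with-cuda=/usr/local/cuda
make -j 8 install
```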
Please describe the system on which you are running
Operating system/version: Red Hat Enterprise Linux Server 7.2 (Maipo) (3.10.0-327.el7.x86_64)
CUDA distribution: 8.0
CUDA GPUDirect Driver: Mellanox OFED GPUDirect RDMA
Computer hardware: 2 nodes, 48 cores/node, 2 GPUs per node
CPUs: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
GPUs: NVIDIA Tesla P100-PCIE-16GB
Network type: InfiniBand
Details of the problem
GPUDirect is intended to improve latency and bandwidth by allowing RDMA transfers directly from GPU to GPU, bypassing the CPU. However, there is a message size limit on RDMA transfers; above this limit, Open MPI is expected to stage the buffers through host memory with cudaMemcpy.
The limit can be changed with the parameter "-mca btl_openib_cuda_rdma_limit x", where x is the message size in bytes.
Unfortunately, rather than behave as expected, the program simply hangs on MPI_Recv.
According to this, the hang may occur because Open MPI uses blocking copies. The option "-mca mpi_common_cuda_cumemcpy_async 1" is meant to instruct Open MPI to use non-blocking asynchronous copies when staging buffers through the host.
However, enabling this option has no effect, and according to this, the option has been enabled by default since Open MPI v1.10.
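For reference, both options are passed on the mpirun command line; the process count and values in this example are illustrative, not from the original report:

```shell
# Illustrative invocation: raise the GPUDirect RDMA limit and explicitly
# request asynchronous copies for host staging.
mpirun -np 2 --mca btl_openib_cuda_rdma_limit 65536 \
       --mca mpi_common_cuda_cumemcpy_async 1 ./a.out
```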
I attempted the same test in v1.10 and it was successful.
Below is gpudirect_bug.c, to replicate the behavior. argv[1] is taken as the number of floating point elements in the send and receive buffers, so the actual array size is that multiplied by sizeof(float).
The btl_openib_cuda_rdma_limit parameter is a limit on the total message size; in my case an extra 56 bytes appeared to be used as a header, so a limit of 1000 is effectively a limit of 944 bytes of user data. This small discrepancy is a minor inconvenience but should possibly be corrected to match the user message length.
On my command line I passed 1000, so the array was 4000 bytes while the RDMA limit was 1000 bytes; since the message was above the limit, it hung. Message sizes below the limit behave as expected.
gpudirect_bug.c:
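The original listing was not captured in this copy of the issue; the sketch below is a hypothetical reconstruction matching the description above (rank 0 sends argv[1] floats from a device buffer, rank 1 receives directly into device memory), not the reporter's exact code.

```c
/*
 * Hypothetical reconstruction of the reproducer.  Rank 0 fills a device
 * buffer of argv[1] floats and sends it directly from GPU memory; rank 1
 * receives directly into GPU memory.  With a CUDA-aware Open MPI, passing
 * device pointers to MPI_Send/MPI_Recv is legal; the hang described above
 * occurs in MPI_Recv once the message exceeds btl_openib_cuda_rdma_limit.
 */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, i, n;
    float *host, *dev;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    n = (argc > 1) ? atoi(argv[1]) : 1000;      /* number of float elements */
    host = (float *) malloc(n * sizeof(float));
    cudaMalloc((void **) &dev, n * sizeof(float));

    if (rank == 0) {
        for (i = 0; i < n; i++) host[i] = (float) i;
        cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
        MPI_Send(dev, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);    /* device pointer */
    } else if (rank == 1) {
        /* Hangs here when n * sizeof(float) exceeds the GPUDirect RDMA limit */
        MPI_Recv(dev, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
        printf("rank 1 received %d floats, last = %f\n", n, host[n - 1]);
    }

    cudaFree(dev);
    free(host);
    MPI_Finalize();
    return 0;
}
```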
Compilation:
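The original compile line was not preserved; something along these lines should work, where the CUDA install path is an assumption:

```shell
# mpicc from the CUDA-aware Open MPI build; link against the CUDA runtime.
mpicc gpudirect_bug.c -o gpudirect_bug \
      -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcudart
```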
Command line:
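The exact invocation was not preserved either; a command consistent with the description above (1000 float elements, a 1000-byte RDMA limit, one rank per node) would be:

```shell
# Two ranks on two nodes; a 4000-byte GPU-to-GPU message against a 1000-byte
# GPUDirect RDMA limit, which reproduces the hang in MPI_Recv.
mpirun -np 2 -npernode 1 --mca btl_openib_cuda_rdma_limit 1000 ./gpudirect_bug 1000
```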
Stacktrace: