Performance impact when running psm2 through ofi mtl versus psm2 mtl #8762
Comments
Adding @mwheinz to see this as it will impact PSM2 (obviously :P)
So this disables CUDA initialization for all OFI providers, not just PSM3 and PSM2, correct? Do you know how this might interact with OFI's HMEM interfaces?
Yes. I am not sure; I know that FI_HMEM support appears to have been added in 5.x (master), but I am not sure whether Open MPI relies on this feature there or not. I also know that there are some code differences in the master branch, which I have not had time to check yet. I plan to do some testing with it, but figured I would open this to see if anyone had an answer before I went down that route.
The FI_HMEM work in the OFI MTL does functionally depend on the CUDA convertor check under a build with CUDA support, and we should NOT opt the MTL out of it. This is how OMPI determines whether a virtual address is backed by a GPU or not and sets the MR iface attributes in fi_mr_regattr() accordingly. We know this is an expensive check, but there's no way around it, unless each libfabric provider does that check itself, which makes FI_HMEM moot. Can you post some data? Is the PSM3 provider also doing a redundant pointer check? If you run it through nvprof/nsys or the like, you can see where the overhead is coming from if it is more than what we expect. cc: @wckzhang
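For illustration, here is a minimal sketch of how the result of that GPU/host classification could be fed into fi_mr_regattr() through the MR iface attribute. This is not the actual OMPI code; the function name `register_buffer` and the `is_cuda_ptr`/`cuda_device` parameters are made up for the example.

```c
#include <sys/uio.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

/* Register a buffer with libfabric, telling the provider whether it is
 * CUDA device memory. The is_cuda_ptr decision is exactly what the
 * convertor's CUDA pointer check supplies. */
static int register_buffer(struct fid_domain *domain, void *buf, size_t len,
                           int is_cuda_ptr, int cuda_device,
                           struct fid_mr **mr)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct fi_mr_attr attr = {
        .mr_iov    = &iov,
        .iov_count = 1,
        .access    = FI_SEND | FI_RECV,
        .iface     = is_cuda_ptr ? FI_HMEM_CUDA : FI_HMEM_SYSTEM,
    };

    if (is_cuda_ptr) {
        attr.device.cuda = cuda_device;  /* CUDA device ordinal owning buf */
    }

    return fi_mr_regattr(domain, &attr, 0, mr);
}
```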
Yeah, in order to communicate to the provider that a buffer is a GPU buffer, we need to have the convertor cuda code active to do pointer detection in the function mca_common_cuda_is_gpu_buffer. This has to happen at some point; I'd imagine that the psm2 provider does this detection again underneath. The pointer detection costs pretty much nothing for host buffers, which explains why H-H sees no impact either. If the psm3 provider supports FI_HMEM, the code in master should avoid a redundant check.
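As a rough approximation of what that per-pointer check involves and why it is essentially free for host buffers, here is a sketch using the CUDA driver API; the real mca_common_cuda_is_gpu_buffer differs in detail, and the helper name below is made up.

```c
#include <stdint.h>
#include <cuda.h>

/* Query the CUDA driver for the memory type backing an address. For an
 * ordinary host pointer the attribute lookup fails immediately, which is
 * consistent with H-H latency being unaffected by the check. */
static int looks_like_gpu_buffer(const void *ptr)
{
    CUmemorytype mem_type = 0;
    CUresult res = cuPointerGetAttribute(&mem_type,
                                         CU_POINTER_ATTRIBUTE_MEMORY_TYPE,
                                         (CUdeviceptr)(uintptr_t)ptr);
    return (res == CUDA_SUCCESS) && (mem_type == CU_MEMORYTYPE_DEVICE);
}
```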
I meant to say PSM2, not PSM3. Can you blame me?
@acgoldma This is exactly the type of confusion that the name "PSM3" induces. 🤯
I took a quick skim of the OFI MTL code and it appears to error out if someone requests cuda but is using an older libfabric or a provider that does not have FI_HMEM.

Is FI_HMEM support in Open MPI 4.x? I did not see it in the code. Many customers are not going to want to jump from 4.x to 5.x (we saw this with 3.x to 4.x), so backporting a simple fix to 4.x would be preferred if you do not plan to backport the FI_HMEM support.

@jsquyres, @rajachan: As for the PSM3 name, while it is not my decision, I will inform those who can change it, but it is unlikely to change. As stated before, it is common to name a library after a protocol that it implements. It is also common to increment a major version when a protocol is no longer compatible with the older one. PSM3 is the 3rd iteration (v3.0) of the PSM (Intel's Performance Scaled Messaging) protocol.
The problem with the previous behavior (using libfabric without FI_HMEM alongside a CUDA-compiled OMPI) is that we cannot guarantee that the selected provider can even support device buffers, so I don't believe the previous selection logic can be used as-is; some providers would break completely even if they were selected. I'm not sure we can reliably support device transfers with a libfabric version prior to v1.9 unless we resort to device-host copies. The FI_HMEM support in OMPI hasn't been considered for backport into 4.x since it's a fairly major change. I'm not sure if there's a way to disable the cuda code at runtime; it's possible there is, but I don't recall any way to do so off the top of my head, and I haven't thought about it much.
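For reference, here is a sketch of the kind of capability probe this discussion implies: with libfabric v1.9 or newer, fi_getinfo can be asked for providers that advertise FI_HMEM, and an empty result means device buffers cannot be handed to the provider directly. The helper name and the exact cap set are illustrative assumptions, not Open MPI's actual selection logic.

```c
#include <string.h>
#include <rdma/fabric.h>

/* Return nonzero if the named provider reports FI_HMEM support. */
static int provider_supports_hmem(const char *prov_name)
{
    struct fi_info *hints = fi_allocinfo();
    struct fi_info *info = NULL;
    int supported = 0;

    if (NULL == hints) {
        return 0;
    }

    hints->caps = FI_MSG | FI_TAGGED | FI_HMEM;
    hints->fabric_attr->prov_name = strdup(prov_name);

    if (0 == fi_getinfo(FI_VERSION(1, 9), NULL, NULL, 0, hints, &info) &&
        NULL != info) {
        supported = 1;
        fi_freeinfo(info);
    }

    fi_freeinfo(hints);
    return supported;
}
```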
William captured the essence of it. OFI MTL CUDA buffer support in v4.0.x and v4.1.x is... non-existent. We fixed this in master and in v5.0.x with #8536
Opting the MTL out of the check in the 4.x series, where there is no working CUDA buffer support to lose, seems reasonable.

cc: @open-mpi/ofi @jsquyres @hppritcha
I was finally able to test with v5.0, after modifying psm2/psm3 to set the FI_HMEM flag when built in cuda mode. So that should solve the 5.0 case for now, but the question of what to do for 4.x remains.
That's good! The proposed change for 4.x looks fine to me given this discussion. |
Thank you for taking the time to submit an issue!
Background information
While testing psm3 cuda support with Open MPI, I compared against psm2 as a pseudo-baseline. PSM3 is purely an OFI provider and does not have a separate API like psm2 does. When comparing the two, I noticed that psm2 showed a massive latency difference (osu_latency) between the ofi mtl and the psm2 mtl when running Device to Device, while Host to Host was nearly identical.
Upon code inspection I noticed that the psm2 mtl disables a feature (the convertor CUDA initialization).
I added a similar line to the OFI mtl and was able to 'fix' the performance issue, making the performance of the two MTLs identical.
This does not seem like a true fix, as I am not sure which other OFI providers might require this feature or might also want to disable it.
I could add a wrapper to check whether the provider is psm2/psm3 before setting the flag, and let other providers be added to the list as needed, but I wanted to ask upstream what they thought.
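To make the wrapper idea concrete, here is a sketch of what such a provider-name check might look like before setting the opt-out flag. The flag name used here is the one the psm2 mtl is understood to set; the helper name, its placement, and the provider list are assumptions for illustration, not actual Open MPI code.

```c
#include <string.h>
#include <rdma/fabric.h>
#include "ompi/mca/mtl/mtl.h"   /* assumed header for the MTL flag definitions */

/* Only opt out of the convertor CUDA initialization for providers that are
 * known to do their own GPU pointer detection underneath; everyone else
 * keeps the default behavior. */
static void maybe_disable_cuda_init(const struct fi_info *prov,
                                    uint32_t *mtl_flags)
{
    static const char *self_detecting[] = { "psm2", "psm3", NULL };

    for (int i = 0; NULL != self_detecting[i]; i++) {
        if (0 == strcmp(prov->fabric_attr->prov_name, self_detecting[i])) {
            *mtl_flags |= MCA_MTL_BASE_FLAG_CUDA_INIT_DISABLE;
            return;
        }
    }
}
```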
Is there a command-line flag I could set or override so that a code change is not needed? Could we add one?
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v4.0.3 / v4.0.5
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
rebuild of SRPM with custom options (Similar to IFS OPA and IEFS releases)
If you are building/installing from a git clone, please copy-n-paste the output from
git submodule status
n/a
Please describe the system on which you are running
Details of the problem
Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.
Replacing OFI with the psm2 mtl:
```
/usr/mpi/gcc/openmpi-4.0.3-cuda-hfi/bin/mpirun \
    -np 2 -map-by node --allow-run-as-root \
    -machinefile /root/mpi_apps/mpi_hosts \
    -mca mtl psm2 \
    -x PSM2_IDENTIFY=1 \
    -x PSM2_CUDA=1 \
    -x PSM2_GPUDIRECT=1 \
    osu-micro-benchmarks-5.6.3/mpi/pt2pt/osu_latency D D

# Size          Latency (us)
...
8                       3.68
...
```
With my 'fix' to add the disable-CUDA flag to the OFI mtl, the latency was identical.