Performance impact when running psm2 through ofi mtl versus psm2 mtl #8762

Closed

acgoldma opened this issue Apr 5, 2021 · 12 comments

@acgoldma
Contributor

acgoldma commented Apr 5, 2021

Thank you for taking the time to submit an issue!

Background information

While testing psm3 CUDA support with Open MPI, I compared against psm2 performance as a pseudo-baseline. PSM3 is purely an OFI provider and does not have a separate API like psm2. When comparing the two, I noticed that psm2 showed a massive difference in latency (osu_latency) between the ofi mtl and the psm2 mtl when running Device to Device, while Host to Host was nearly identical.

Upon code inspection, I noticed that the psm2 mtl disables a feature:

#if OPAL_CUDA_SUPPORT
    ompi_mtl_psm2.super.mtl_flags |= MCA_MTL_BASE_FLAG_CUDA_INIT_DISABLE;
#endif

I added a similar line to the OFI mtl and was able to 'fix' the performance issue; with that change the performance was identical.
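
For reference, the experiment amounted to roughly the following in the OFI mtl component code (a sketch of what I tried, not a polished patch; the provider-specific version is proposed later in this thread):

#if OPAL_CUDA_SUPPORT
    /* Experiment: opt the OFI mtl out of the OMPI CUDA convertor the same
     * way the psm2 mtl does.  Note this applies to every OFI provider. */
    ompi_mtl_ofi.base.mtl_flags |= MCA_MTL_BASE_FLAG_CUDA_INIT_DISABLE;
#endif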

This does not seem like a true fix, as I am not sure which other OFI providers might require this feature or might also want to disable it.
I could add a check for whether the provider is psm2/psm3 before setting the flag and let other providers add themselves to the list as needed, but I wanted to ask upstream what you thought.

Is there a command-line flag I could set/override so that a code change is not needed? Could we add one?

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v4.0.3 / v4.0.5

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Rebuild of the SRPM with custom options (similar to IFS OPA and IEFS releases)

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

n/a

Please describe the system on which you are running

  • Operating system/version: RHEL8.x SLES 15.x
  • Computer hardware: 2x Servers with: Intel® Xeon® Gold 6138F + Nvidia V100 GPU (Back to Back Fabric)
  • Network type: OPA hfi1

Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

/usr/mpi/gcc/openmpi-4.0.3-cuda-hfi/bin/mpirun \
   -np 2 -map-by node --allow-run-as-root \
   -machinefile /root/mpi_apps/mpi_hosts \
   -mca mtl ofi -x 'FI_PROVIDER=psm2' \
    -x PSM2_IDENTIFY=1 \
    -x PSM2_CUDA=1 \
    -x PSM2_GPUDIRECT=1 \
     osu-micro-benchmarks-5.6.3/mpi/pt2pt/osu_latency D D
# Size          Latency (us)
...
8                       8.63
...

Replacing the ofi mtl with the psm2 mtl:

/usr/mpi/gcc/openmpi-4.0.3-cuda-hfi/bin/mpirun \
   -np 2 -map-by node --allow-run-as-root \
   -machinefile /root/mpi_apps/mpi_hosts \
   -mca mtl psm2 \
    -x PSM2_IDENTIFY=1 \
    -x PSM2_CUDA=1 \
    -x PSM2_GPUDIRECT=1 \
     osu-micro-benchmarks-5.6.3/mpi/pt2pt/osu_latency D D
# Size          Latency (us)
...
8                        3.68
...

With my 'fix' adding the CUDA-disable flag to the ofi mtl:

/usr/mpi/gcc/openmpi-4.0.3-cuda-hfi/bin/mpirun \
   -np 2 -map-by node --allow-run-as-root \
   -machinefile /root/mpi_apps/mpi_hosts \
   -mca mtl ofi -x 'FI_PROVIDER=psm2' \
    -x PSM2_IDENTIFY=1 \
    -x PSM2_CUDA=1 \
    -x PSM2_GPUDIRECT=1 \
     osu-micro-benchmarks-5.6.3/mpi/pt2pt/osu_latency D D
# Size          Latency (us)
...
8                        3.72
...
@acgoldma
Contributor Author

acgoldma commented Apr 5, 2021

Adding @mwheinz to see this, as it will impact PSM2 (obviously :P).

@mwheinz

mwheinz commented Apr 5, 2021

So this disables CUDA initialization for all OFI providers, not just PSM3 and PSM2, correct? Do you know how this might interact with OFI's HMEM interfaces?

@acgoldma
Contributor Author

acgoldma commented Apr 5, 2021

This does not seem like a true fix, as I am not sure which other OFI providers might require this feature or might also want to disable it.
I could add a check for whether the provider is psm2/psm3 before setting the flag and let other providers add themselves to the list as needed, but I wanted to ask upstream what you thought.

Yes. I am not sure; I know that FI_HMEM support appears to be added in 5.x (master), but I am not sure whether it relies on this feature in Open MPI or not. I also know that there are some code differences in the master branch, which I have not had time to check yet. I plan to do some testing with it, but figured I would open this to see if anyone had an answer before I went down that route.

@rajachan
Member

rajachan commented Apr 5, 2021

The FI_HMEM work in the OFI MTL does functionally depend on the CUDA converter check in a build with CUDA support, and we should NOT opt the MTL out of it. This is how OMPI determines whether a virtual address is backed by a GPU and sets the MR iface attributes in fi_mr_regattr() accordingly. We know this is an expensive check, but there is no way around it unless each libfabric provider does that check itself, which would make FI_HMEM moot.
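
To illustrate the mechanism, here is a hedged sketch against the libfabric API (not OMPI's actual registration path; the helper name and parameters are made up):

#include <stdbool.h>
#include <sys/uio.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

/* Sketch: after the CUDA convertor has classified the buffer, the memory
 * registration tells the provider which HMEM interface backs the address. */
static int register_buffer(struct fid_domain *domain, void *buf, size_t len,
                           bool is_cuda, int cuda_dev, struct fid_mr **mr)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct fi_mr_attr attr = {
        .mr_iov    = &iov,
        .iov_count = 1,
        .access    = FI_SEND | FI_RECV,
        .iface     = is_cuda ? FI_HMEM_CUDA : FI_HMEM_SYSTEM,
    };
    if (is_cuda) {
        attr.device.cuda = cuda_dev;   /* GPU that owns the buffer */
    }
    return fi_mr_regattr(domain, &attr, 0, mr);
}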

Can you post some data? Is the PSM3 provider also doing a redundant pointer check? If you run it through nvprof/nsys or the like, you can see where the overhead is coming from if it is more than what we expect.

cc: @wckzhang

@wckzhang
Contributor

wckzhang commented Apr 5, 2021

Yeah, in order to communicate to the provider that a buffer is a GPU buffer, we need to have the convertor CUDA code active to do pointer detection in the function mca_common_cuda_is_gpu_buffer. This has to happen at some point; I'd imagine the psm2 provider does this detection again underneath. The pointer detection costs practically nothing for host buffers, which explains why H-H sees no impact. If the psm3 provider supports FI_HMEM, the code in master should avoid a redundant check.
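
For context, here is roughly what such a pointer check boils down to; this uses the CUDA driver API directly and is only an illustration, not OMPI's exact mca_common_cuda_is_gpu_buffer implementation:

#include <stdint.h>
#include <cuda.h>

/* Ask the CUDA driver what kind of memory backs the address.  For plain
 * host pointers the query returns quickly without reporting device memory,
 * which is why Host-to-Host latency is unaffected. */
static int looks_like_gpu_buffer(const void *buf)
{
    CUmemorytype mem_type = (CUmemorytype)0;
    CUresult res = cuPointerGetAttribute(&mem_type,
                                         CU_POINTER_ATTRIBUTE_MEMORY_TYPE,
                                         (CUdeviceptr)(uintptr_t)buf);
    return (res == CUDA_SUCCESS && mem_type == CU_MEMORYTYPE_DEVICE);
}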

@rajachan
Member

rajachan commented Apr 5, 2021

Is the PSM3 provider also doing a redundant pointer check?

I meant to say PSM2, not PSM3. Can you blame me?

@jsquyres
Member

jsquyres commented Apr 5, 2021

I meant to say PSM2, not PSM3. Can you blame me?

@acgoldma This is exactly the type of confusion that the name "PSM3" induces. 🤯

@acgoldma
Contributor Author

acgoldma commented Apr 5, 2021

FI_HMEM is a 'newer' feature (OFI v1.9 and OMPI 5.x); we are looking into it and will most likely support it. However, that does not really address libfabric versions before OFI v1.9 or OMPI versions before v5.x.

I took a quick skim of the OFI MTL code, and it appears to error out if someone requests CUDA but is using an older libfabric or a provider that does not have FI_HMEM. This seems to change behavior: something that would have worked in 4.x will now error out in 5.x?

Is FI_HMEM support in Open MPI 4.x? I did not see it in the code.
If not, is there a way to disable the CUDA converter at runtime?
Would you be against adding one (at least for older versions of OFI or OMPI)?
e.g. -mca cuda_converter [disable|0|off]
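
For illustration only, here is roughly how such a knob could be wired into the ofi mtl; the "cuda_disable" parameter name and this wiring are hypothetical and do not exist today:

/* Hypothetical sketch: expose the opt-out as an MCA boolean instead of a
 * compile-time code change. */
static bool ofi_cuda_disable = false;

static int ompi_mtl_ofi_component_register(void)
{
    (void) mca_base_component_var_register(&mca_mtl_ofi_component.super.mtl_version,
                                           "cuda_disable",
                                           "Skip OMPI's CUDA convertor for the ofi mtl "
                                           "(for providers that detect GPU buffers themselves)",
                                           MCA_BASE_VAR_TYPE_BOOL, NULL, 0, 0,
                                           OPAL_INFO_LVL_3,
                                           MCA_BASE_VAR_SCOPE_READONLY,
                                           &ofi_cuda_disable);
    return OMPI_SUCCESS;
}

/* ...and later, during component init: */
#if OPAL_CUDA_SUPPORT
    if (ofi_cuda_disable) {
        ompi_mtl_ofi.base.mtl_flags |= MCA_MTL_BASE_FLAG_CUDA_INIT_DISABLE;
    }
#endif

Something like that could then be toggled at runtime with -mca mtl_ofi_cuda_disable 1 on the mpirun command line.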

Many customers are not going to want to jump from 4.x to 5.x (we saw this with 3.x to 4.x), so backporting a simple fix to 4.x would be preferred if you do not plan to backport FI_HMEM support.

@jsquyres, @rajachan: As for the PSM3 name, while it is not my decision, I will inform those who can change it, but it is unlikely to change. As stated before, it is common to name a library after the protocol it implements. It is also common to increment the major version when a protocol is no longer compatible with the older one. PSM3 is the 3rd iteration (v3.0) of the PSM (Intel's Performance Scaled Messaging) protocol.

@wckzhang
Contributor

wckzhang commented Apr 5, 2021

The problem with the previous behavior of using libfabric without FI_HMEM alongside a CUDA-compiled OMPI is that we cannot guarantee that the selected provider can even support device buffers. Thus I don't believe the previous selection logic can be used appropriately; some providers would break completely even if they were selected. I'm not sure we can reliably support device transfers with a libfabric version prior to v1.9 unless we resort to device-host copies.

The support for FI_HMEM in OMPI hasn't been considered for backport into 4.x since it's a fairly major change.

I'm not sure if there's a way to disable the CUDA code at runtime; it's possible there is, but I don't recall one off the top of my head.

I haven't thought -mca cuda_converter [disable|0|off] through, but there could be room for something like that. I do think the 4.x series GPU support through the ofi mtl is unreliable, though, since we cannot safely select providers for GPU support.

@rajachan
Member

rajachan commented Apr 6, 2021

Is FI_HMEM support in Open MPI 4.x? I did not see it in the code.

William captured the essence of it. OFI MTL CUDA buffer support in v4.0.x and v4.1.x is... non-existent. We fixed this in master and in v5.0.x with #8536

Would you be against adding one (at least for older versions of OFI or OMPI)?

Opting the MTL out with MCA_MTL_BASE_FLAG_CUDA_INIT_DISABLE in the release branches that did not have FI_HMEM checks might be okay (assuming the RMs are fine taking that change), as that wouldn't materially change libfabric provider-specific behaviors, aside from the performance gain PSM2 will get.

cc: @open-mpi/ofi @jsquyres @hppritcha

@acgoldma
Contributor Author

I was finally able to test with v5.0, after modifying psm2/psm3 to set the FI_HMEM flag when built in CUDA mode.
It worked great: performance was even better than with v4.x in all test cases, and there was minimal difference between the ofi mtl and the psm2 mtl.
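
Roughly, the provider-side change amounts to something like this (a sketch assuming the provider advertises HMEM support in the info it returns from fi_getinfo when built with CUDA; the PSM_CUDA guard is illustrative):

#if defined(PSM_CUDA)   /* illustrative build guard */
    /* Advertise GPU (HMEM) buffer support so the OMPI v5.x ofi mtl takes
     * the FI_HMEM path instead of the CUDA convertor. */
    info->caps |= FI_HMEM;
    info->domain_attr->mr_mode |= FI_MR_HMEM;
#endif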

So that should solve the 5.0 case for now. As for 4.x:
Can I add something like the following for v4.1.x/v4.0.x:

diff --git a/ompi/mca/mtl/ofi/mtl_ofi_component.c b/ompi/mca/mtl/ofi/mtl_ofi_component.c
index ec88f67fe2..09197c620a 100644
--- a/ompi/mca/mtl/ofi/mtl_ofi_component.c
+++ b/ompi/mca/mtl/ofi/mtl_ofi_component.c
@@ -900,6 +900,16 @@ select_prov:
          prov->domain_attr->mr_mode = FI_MR_BASIC;
     }

+#if OPAL_CUDA_SUPPORT
+    /**
+     * Some providers do not require the use of the CUDA convertor
+     * in OMPI and its use will cause performance degradation. The
+     * following providers will disable it when selected.
+     */
+    if (!strncmp(prov->fabric_attr->prov_name, "psm3", 4)
+        || !strncmp(prov->fabric_attr->prov_name, "psm2", 4))
+    {
+        ompi_mtl_ofi.base.mtl_flags |= MCA_MTL_BASE_FLAG_CUDA_INIT_DISABLE;
+    }
+#endif /* OPAL_CUDA_SUPPORT */
+
     /**
      * Create the access domain, which is the physical or virtual network or
      * hardware port/collection of ports.  Returns a domain object that ca

@rajachan
Member

I was finally able to test with v5.0, after modifying psm2/psm3 to set the FI_HMEM flag when built in CUDA mode.
It worked great: performance was even better than with v4.x in all test cases, and there was minimal difference between the ofi mtl and the psm2 mtl.

That's good! The proposed change for 4.x looks fine to me given this discussion.
