Description
Thank you for taking the time to submit an issue!
Background information
While testing PSM3 CUDA support with Open MPI, I compared against PSM2 performance as a pseudo-baseline. PSM3 is implemented purely as an OFI provider and has no separate API the way PSM2 does. When comparing the two, I noticed that PSM2 showed a massive latency difference (osu_latency) between the ofi mtl and the psm2 mtl when running Device-to-Device, while Host-to-Host was nearly identical.
Upon code inspection I noticed that the psm2 mtl sets a flag to disable Open MPI's own CUDA initialization, leaving GPU buffer handling to the PSM2 library:
```c
#if OPAL_CUDA_SUPPORT
    ompi_mtl_psm2.super.mtl_flags |= MCA_MTL_BASE_FLAG_CUDA_INIT_DISABLE;
#endif
```
I added a similar line to the OFI mtl and was able to 'fix' the performance issue: latency became essentially identical to the psm2 mtl.

This does not seem like a true fix, though, because I am not sure which other OFI providers require Open MPI's CUDA support and which might also want it disabled.

I could add a check that the provider is psm2/psm3 before setting the flag, and let other providers be added to the list as needed, but I wanted to ask upstream what you think first. Is there a command-line flag I could set/override so that a code change is not needed? If not, could one be added?
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v4.0.3 / v4.0.5
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
rebuild of SRPM with custom options (Similar to IFS OPA and IEFS releases)
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status
n/a
Please describe the system on which you are running
- Operating system/version: RHEL 8.x / SLES 15.x
- Computer hardware: 2x Servers with: Intel® Xeon® Gold 6138F + Nvidia V100 GPU (Back to Back Fabric)
- Network type: OPA hfi1
Details of the problem
Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.
```shell
/usr/mpi/gcc/openmpi-4.0.3-cuda-hfi/bin/mpirun \
    -np 2 -map-by node --allow-run-as-root \
    -machinefile /root/mpi_apps/mpi_hosts \
    -mca mtl ofi -x 'FI_PROVIDER=psm2' \
    -x PSM2_IDENTIFY=1 \
    -x PSM2_CUDA=1 \
    -x PSM2_GPUDIRECT=1 \
    osu-micro-benchmarks-5.6.3/mpi/pt2pt/osu_latency D D
```

```
# Size    Latency (us)
...
8         8.63
...
```
Same run with the psm2 mtl instead of the ofi mtl:
```shell
/usr/mpi/gcc/openmpi-4.0.3-cuda-hfi/bin/mpirun \
    -np 2 -map-by node --allow-run-as-root \
    -machinefile /root/mpi_apps/mpi_hosts \
    -mca mtl psm2 \
    -x PSM2_IDENTIFY=1 \
    -x PSM2_CUDA=1 \
    -x PSM2_GPUDIRECT=1 \
    osu-micro-benchmarks-5.6.3/mpi/pt2pt/osu_latency D D
```

```
# Size    Latency (us)
...
8         3.68
...
```
With my 'fix' adding the CUDA-disable flag to the OFI mtl:
```shell
/usr/mpi/gcc/openmpi-4.0.3-cuda-hfi/bin/mpirun \
    -np 2 -map-by node --allow-run-as-root \
    -machinefile /root/mpi_apps/mpi_hosts \
    -mca mtl ofi -x 'FI_PROVIDER=psm2' \
    -x PSM2_IDENTIFY=1 \
    -x PSM2_CUDA=1 \
    -x PSM2_GPUDIRECT=1 \
    osu-micro-benchmarks-5.6.3/mpi/pt2pt/osu_latency D D
```

```
# Size    Latency (us)
...
8         3.72
...
```