Performance impact when running psm2 through ofi mtl versus psm2 mtl #8762

Closed

acgoldma opened this issue Apr 5, 2021 · 12 comments

@acgoldma
Contributor

acgoldma commented Apr 5, 2021

Thank you for taking the time to submit an issue!

Background information

While testing psm3 CUDA support with Open MPI, I compared against psm2 performance as a pseudo-baseline. PSM3 is purely an OFI provider and does not have a separate API like psm2. When comparing the two, I noticed that psm2 showed a massive difference in latency (osu_latency) between the ofi mtl and the psm2 mtl when running Device to Device, while Host to Host was nearly identical.

Upon code inspection, I noticed that the psm2 mtl disables a feature:

#if OPAL_CUDA_SUPPORT
    ompi_mtl_psm2.super.mtl_flags |= MCA_MTL_BASE_FLAG_CUDA_INIT_DISABLE;
#endif

I added a similar line to the OFI mtl and was able to 'fix' the performance issue; with that change the performance was identical.
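
For reference, the experiment amounted to roughly the following in the OFI mtl component code (a sketch of what I tried, not a polished patch; the provider-specific version is proposed later in this thread):

#if OPAL_CUDA_SUPPORT
    /* Experiment: opt the OFI mtl out of the OMPI CUDA convertor the same
     * way the psm2 mtl does.  Note this applies to every OFI provider. */
    ompi_mtl_ofi.base.mtl_flags |= MCA_MTL_BASE_FLAG_CUDA_INIT_DISABLE;
#endif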

This does not seem like a true fix, as I am not sure which other OFI providers might require this feature or might also want to disable it.
I could add a check for whether the provider is psm2/psm3 before setting the flag and let other providers add themselves to the list as needed, but I wanted to ask upstream what you thought.

Is there a command-line flag I could set/override so that a code change is not needed? Could we add one?

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v4.0.3 / v4.0.5

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Rebuild of the SRPM with custom options (similar to IFS OPA and IEFS releases)

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

n/a

Please describe the system on which you are running

  • Operating system/version: RHEL8.x SLES 15.x
  • Computer hardware: 2x Servers with: Intel® Xeon® Gold 6138F + Nvidia V100 GPU (Back to Back Fabric)
  • Network type: OPA hfi1

Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

/usr/mpi/gcc/openmpi-4.0.3-cuda-hfi/bin/mpirun \
   -np 2 -map-by node --allow-run-as-root \
   -machinefile /root/mpi_apps/mpi_hosts \
   -mca mtl ofi -x 'FI_PROVIDER=psm2' \
    -x PSM2_IDENTIFY=1 \
    -x PSM2_CUDA=1 \
    -x PSM2_GPUDIRECT=1 \
     osu-micro-benchmarks-5.6.3/mpi/pt2pt/osu_latency D D
# Size          Latency (us)
...
8                       8.63
...

Replacing the ofi mtl with the psm2 mtl:

/usr/mpi/gcc/openmpi-4.0.3-cuda-hfi/bin/mpirun \
   -np 2 -map-by node --allow-run-as-root \
   -machinefile /root/mpi_apps/mpi_hosts \
   -mca mtl psm2 \
    -x PSM2_IDENTIFY=1 \
    -x PSM2_CUDA=1 \
    -x PSM2_GPUDIRECT=1 \
     osu-micro-benchmarks-5.6.3/mpi/pt2pt/osu_latency D D
# Size          Latency (us)
...
8                        3.68
...

With my 'fix' adding the CUDA-disable flag to the ofi mtl:

/usr/mpi/gcc/openmpi-4.0.3-cuda-hfi/bin/mpirun \
   -np 2 -map-by node --allow-run-as-root \
   -machinefile /root/mpi_apps/mpi_hosts \
   -mca mtl ofi -x 'FI_PROVIDER=psm2' \
    -x PSM2_IDENTIFY=1 \
    -x PSM2_CUDA=1 \
    -x PSM2_GPUDIRECT=1 \
     osu-micro-benchmarks-5.6.3/mpi/pt2pt/osu_latency D D
# Size          Latency (us)
...
8                        3.72
...
@acgoldma
Contributor Author

acgoldma commented Apr 5, 2021

Adding @mwheinz to see this, as it will impact PSM2 (obviously :P).

@mwheinz

mwheinz commented Apr 5, 2021

So this disables CUDA initialization for all OFI providers, not just PSM3 and PSM2, correct? Do you know how this might interact with OFI's HMEM interfaces?

@acgoldma
Contributor Author

acgoldma commented Apr 5, 2021

This does not seem like a true fix, as I am not sure which other OFI providers might require this feature or might also want to disable it.
I could add a check for whether the provider is psm2/psm3 before setting the flag and let other providers add themselves to the list as needed, but I wanted to ask upstream what you thought.

Yes. I am not sure; I know that FI_HMEM support appears to be added in 5.x (master), but I am not sure whether it relies on this feature in Open MPI or not. I also know that there are some code differences in the master branch, which I have not had time to check yet. I plan to do some testing with it, but figured I would open this to see if anyone had an answer before I went down that route.

@rajachan
Member

rajachan commented Apr 5, 2021

The FI_HMEM work in the OFI MTL does functionally depend on the CUDA converter check in a build with CUDA support, and we should NOT opt the MTL out of it. This is how OMPI determines whether a virtual address is backed by a GPU and sets the MR iface attributes in fi_mr_regattr() accordingly. We know this is an expensive check, but there is no way around it unless each libfabric provider does that check itself, which would make FI_HMEM moot.
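
To illustrate the mechanism, here is a hedged sketch against the libfabric API (not OMPI's actual registration path; the helper name and parameters are made up):

#include <stdbool.h>
#include <sys/uio.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

/* Sketch: after the CUDA convertor has classified the buffer, the memory
 * registration tells the provider which HMEM interface backs the address. */
static int register_buffer(struct fid_domain *domain, void *buf, size_t len,
                           bool is_cuda, int cuda_dev, struct fid_mr **mr)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct fi_mr_attr attr = {
        .mr_iov    = &iov,
        .iov_count = 1,
        .access    = FI_SEND | FI_RECV,
        .iface     = is_cuda ? FI_HMEM_CUDA : FI_HMEM_SYSTEM,
    };
    if (is_cuda) {
        attr.device.cuda = cuda_dev;   /* GPU that owns the buffer */
    }
    return fi_mr_regattr(domain, &attr, 0, mr);
}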

Can you post some data? Is the PSM3 provider also doing a redundant pointer check? If you run it through nvprof/nsys or the like, you can see where the overhead is coming from if it is more than what we expect.

cc: @wckzhang

@wckzhang
Contributor

wckzhang commented Apr 5, 2021

Yeah, in order to communicate to the provider that a buffer is a GPU buffer, we need to have the convertor CUDA code active to do pointer detection in the function mca_common_cuda_is_gpu_buffer. This has to happen at some point; I'd imagine the psm2 provider does this detection again underneath. The pointer detection costs practically nothing for host buffers, which explains why H-H sees no impact. If the psm3 provider supports FI_HMEM, the code in master should avoid a redundant check.
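
For context, here is roughly what such a pointer check boils down to; this uses the CUDA driver API directly and is only an illustration, not OMPI's exact mca_common_cuda_is_gpu_buffer implementation:

#include <stdint.h>
#include <cuda.h>

/* Ask the CUDA driver what kind of memory backs the address.  For plain
 * host pointers the query returns quickly without reporting device memory,
 * which is why Host-to-Host latency is unaffected. */
static int looks_like_gpu_buffer(const void *buf)
{
    CUmemorytype mem_type = (CUmemorytype)0;
    CUresult res = cuPointerGetAttribute(&mem_type,
                                         CU_POINTER_ATTRIBUTE_MEMORY_TYPE,
                                         (CUdeviceptr)(uintptr_t)buf);
    return (res == CUDA_SUCCESS && mem_type == CU_MEMORYTYPE_DEVICE);
}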

@rajachan
Member

rajachan commented Apr 5, 2021

Is the PSM3 provider also doing a redundant pointer check?

I meant to say PSM2, not PSM3. Can you blame me?

@jsquyres
Member

jsquyres commented Apr 5, 2021

I meant to say PSM2, not PSM3. Can you blame me?

@acgoldma This is exactly the type of confusion that the name "PSM3" induces. 🤯

@acgoldma
Contributor Author

acgoldma commented Apr 5, 2021

FI_HMEM is a 'newer' feature (OFI v1.9 and OMPI 5.x); we are looking into it and will most likely support it. However, that does not really address libfabric versions before OFI v1.9 or OMPI versions before v5.x.

I took a quick skim of the OFI MTL code, and it appears to error out if someone requests CUDA but is using an older libfabric or a provider that does not have FI_HMEM. This seems to change behavior: something that would have worked in 4.x will now error out in 5.x?

Is FI_HMEM support in Open MPI 4.x? I did not see it in the code.
If not, is there a way to disable the CUDA converter at runtime?
Would you be against adding one (at least for older versions of OFI or OMPI)?
e.g. -mca cuda_converter [disable|0|off]
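
For illustration only, here is roughly how such a knob could be wired into the ofi mtl; the "cuda_disable" parameter name and this wiring are hypothetical and do not exist today:

/* Hypothetical sketch: expose the opt-out as an MCA boolean instead of a
 * compile-time code change. */
static bool ofi_cuda_disable = false;

static int ompi_mtl_ofi_component_register(void)
{
    (void) mca_base_component_var_register(&mca_mtl_ofi_component.super.mtl_version,
                                           "cuda_disable",
                                           "Skip OMPI's CUDA convertor for the ofi mtl "
                                           "(for providers that detect GPU buffers themselves)",
                                           MCA_BASE_VAR_TYPE_BOOL, NULL, 0, 0,
                                           OPAL_INFO_LVL_3,
                                           MCA_BASE_VAR_SCOPE_READONLY,
                                           &ofi_cuda_disable);
    return OMPI_SUCCESS;
}

/* ...and later, during component init: */
#if OPAL_CUDA_SUPPORT
    if (ofi_cuda_disable) {
        ompi_mtl_ofi.base.mtl_flags |= MCA_MTL_BASE_FLAG_CUDA_INIT_DISABLE;
    }
#endif

Something like that could then be toggled at runtime with -mca mtl_ofi_cuda_disable 1 on the mpirun command line.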

Many customers are not going to want to jump from 4.x to 5.x (we saw this with 3.x to 4.x), so backporting a simple fix to 4.x would be preferred if you do not plan to backport FI_HMEM support.

@jsquyres, @rajachan: As for the PSM3 name, while it is not my decision, I will inform those who can change it, but it is unlikely to change. As stated before, it is common to name a library after the protocol it implements. It is also common to increment the major version when a protocol is no longer compatible with the older one. PSM3 is the 3rd iteration (v3.0) of the PSM (Intel's Performance Scaled Messaging) protocol.

@wckzhang
Contributor

wckzhang commented Apr 5, 2021

The problem with the previous behavior of using libfabric without FI_HMEM alongside a CUDA-compiled OMPI is that we cannot guarantee that the selected provider can even support device buffers. Thus I don't believe the previous selection logic can be used appropriately; some providers would break completely even if they were selected. I'm not sure we can reliably support device transfers with a libfabric version prior to v1.9 unless we resort to device-host copies.

The support for FI_HMEM in OMPI hasn't been considered for backport into 4.x since it's a fairly major change.

I'm not sure if there's a way to disable the CUDA code at runtime; it's possible there is, but I don't recall one off the top of my head.

I haven't thought -mca cuda_converter [disable|0|off] through, but there could be room for something like that. I do think the 4.x series GPU support through the ofi mtl is unreliable, though, since we cannot safely select providers for GPU support.

@rajachan
Member

rajachan commented Apr 6, 2021

Is FI_HMEM support in Open MPI 4.x? I did not see it in the code.

William captured the essence of it. OFI MTL CUDA buffer support in v4.0.x and v4.1.x is... non-existent. We fixed this in master and in v5.0.x with #8536

Would you be against adding one (at least for older versions of OFI or OMPI)?

Opting the MTL out with MCA_MTL_BASE_FLAG_CUDA_INIT_DISABLE in the release branches that did not have FI_HMEM checks might be okay (assuming the RMs are fine taking that change), as that wouldn't materially change libfabric provider-specific behaviors, aside from the performance gain PSM2 will get.

cc: @open-mpi/ofi @jsquyres @hppritcha

@acgoldma
Contributor Author

I was finally able to test with v5.0, after modifying psm2/psm3 to set the FI_HMEM flag when built in CUDA mode.
It worked great: performance was even better than with v4.x in all test cases, and there was minimal difference between the ofi mtl and the psm2 mtl.
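
Roughly, the provider-side change amounts to something like this (a sketch assuming the provider advertises HMEM support in the info it returns from fi_getinfo when built with CUDA; the PSM_CUDA guard is illustrative):

#if defined(PSM_CUDA)   /* illustrative build guard */
    /* Advertise GPU (HMEM) buffer support so the OMPI v5.x ofi mtl takes
     * the FI_HMEM path instead of the CUDA convertor. */
    info->caps |= FI_HMEM;
    info->domain_attr->mr_mode |= FI_MR_HMEM;
#endif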

So that should solve the 5.0 case for now. As for 4.x:
Can I add something like the following for v4.1.x/v4.0.x:

diff --git a/ompi/mca/mtl/ofi/mtl_ofi_component.c b/ompi/mca/mtl/ofi/mtl_ofi_component.c
index ec88f67fe2..09197c620a 100644
--- a/ompi/mca/mtl/ofi/mtl_ofi_component.c
+++ b/ompi/mca/mtl/ofi/mtl_ofi_component.c
@@ -900,6 +900,16 @@ select_prov:
          prov->domain_attr->mr_mode = FI_MR_BASIC;
     }

+#if OPAL_CUDA_SUPPORT
+    /**
+     * Some providers do not require the use of the CUDA convertor
+     * in OMPI and its use will cause performance degradation. The
+     * following providers will disable it when selected.
+     */
+    if (!strncmp(prov->fabric_attr->prov_name, "psm3", 4)
+        || !strncmp(prov->fabric_attr->prov_name, "psm2", 4))
+    {
+        ompi_mtl_ofi.base.mtl_flags |= MCA_MTL_BASE_FLAG_CUDA_INIT_DISABLE;
+    }
+#endif /* OPAL_CUDA_SUPPORT */
+
     /**
      * Create the access domain, which is the physical or virtual network or
      * hardware port/collection of ports.  Returns a domain object that ca

@rajachan
Member

I was finally able to test with v5.0, after modifying psm2/psm3 to set the FI_HMEM flag when built in CUDA mode.
It worked great: performance was even better than with v4.x in all test cases, and there was minimal difference between the ofi mtl and the psm2 mtl.

That's good! The proposed change for 4.x looks fine to me given this discussion.
