pml/ucx: move pmix finalize to the end of ompi_rte_finalize() #11228
Conversation
Signed-off-by: Mamzi Bayatpour <[email protected]>
So.....you are saying that UCX introduces two additional fence operations into the MPI lifecycle??? One in init, and another in finalize? Why would you want to do that - it was Mellanox that originally was so concerned about launch scaling. Seems like something isn't quite right in this discussion. Perhaps the right question would be to ask why UCX requires these extra operations and spend a little effort to remove them. Up to you guys - it's your product and your customers. Perhaps the concerns over startup/finalize times no longer exist there?
Why the pmix_fence in the opal_common_ucx_del_procs call? This won't work in the general case, i.e. using MPI Sessions. Note the UCX transport within OMPI doesn't currently support sessions, or more specifically MPI_Comm_create_from_group, so it doesn't exactly matter at this point.
Moreover, pml_add_procs and pml_del_procs are not supposed to be collective calls, so no collective behavior should be inserted in there or badness will follow.
I believe that is why we have a "fence" at the end of MPI init - to support those PMLs that need a fence before becoming operational. It isn't in "add_procs" itself - it is performed separately. Having a fence inside add_procs is...potentially problematic.
I found that this code is problematic unless there's a full pmix_fence prior to add_procs, at least for OB1:
If I set opal_pmix_collect_all_data to false we don't need the pmix barrier, but that's not the default.
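For readers following the thread, here is a minimal sketch of what such a job-wide fence looks like at the PMIx client level; `fence_with_collect` and `myproc` are illustrative names, not OMPI code:

```c
#include <stdbool.h>
#include <pmix.h>

/* Hypothetical helper (not OMPI code): perform a job-wide fence,
 * optionally collecting everyone's posted modex data. */
static pmix_status_t fence_with_collect(const pmix_proc_t *myproc, bool collect)
{
    pmix_proc_t wildcard;
    pmix_info_t info;
    pmix_status_t rc;

    /* Address every process in my namespace. */
    PMIX_LOAD_PROCID(&wildcard, myproc->nspace, PMIX_RANK_WILDCARD);

    /* collect=true asks the fence to gather all data posted via
     * PMIx_Put/PMIx_Commit so later lookups can be satisfied locally;
     * collect=false turns the fence into a plain barrier. */
    PMIX_INFO_CONSTRUCT(&info);
    PMIX_INFO_LOAD(&info, PMIX_COLLECT_DATA, &collect, PMIX_BOOL);

    rc = PMIx_Fence(&wildcard, 1, &info, 1);
    PMIX_INFO_DESTRUCT(&info);
    return rc;
}
```

With collect=false (the direct-modex case) the fence is only a barrier, so any later lookup of a peer's data has to be fetched on demand, which is the tradeoff the rest of this discussion keeps running into.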
Hmmm...looking through the code in ompi_mpi_init, it appears to me that someone has made a significant mistake. They put ... What you need is a more linear code flow that supports "instances" as well as the initial "MPI Init" call. It should allow the modex to occur prior to any "add_procs", then initialization of transports and infrastructure, followed by a single "fence" at the end of Init for those who need it. Might be worth the time to straighten this out now, or else you'll continue to struggle with insertion of arbitrary fences all over the code. The issue really is that the "instance" code duplicates so much of "mpi_init" but isn't well integrated into it. One or the other should probably be eliminated, but that's just my initial impression.
Something hit me about this piece of the code @hppritcha posted:

```c
if (!opal_pmix_collect_all_data) {
    /*
     * If direct modex, then compare our PML with the peer's PML
     * for all procs
     */
    for (i = 0; i < nprocs; i++) {
        ret = mca_pml_base_pml_check_selected_impl(
            my_pml,
            procs[i]->super.proc_name);
        if (ret) {
            return ret;
        }
    }
```

Are you folks really saying that if someone doesn't want a full data exchange, meaning they are going with direct modex - you are seriously going to force a complete retrieval of every proc's information by every proc, just to check their PML selection?? I wonder if you realize that this creates an NxN exchange of OOB messages across the entire job? It is, quite simply, the worst possible thing you could do as it is the most inefficient method for exchanging all that data.

If you truly require this check, then you should just remove the async modex code completely and force everyone to do the full exchange. I am only aware of AWS needing this check - so perhaps there is another method you could use for selecting it? Perhaps AWS could add an envar to their images indicating that full modex coupled with this check must be performed for all apps? It would then avoid penalizing everyone else. Or if you can think of some other method for controlling selection, that would be worth some thought. As things stand, your only choice is to perform a full modex (which is burdensome) or perform an even worse NxN lookup. Not good choices for most people.
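To make the cost pattern concrete, here is a rough sketch, with a hypothetical key and helper name, of what a per-peer lookup loop amounts to at the PMIx level:

```c
#include <stdint.h>
#include <pmix.h>

/* Illustrative sketch only: "example.pml.key" and the helper name are
 * made up, standing in for the PML selection value OMPI publishes.
 * Under direct modex, every PMIx_Get that misses the local cache turns
 * into an on-demand request toward the peer's server, so N ranks each
 * looping over N peers produces an N x N message pattern. */
static int check_every_peer(const pmix_proc_t *myproc, uint32_t job_size)
{
    pmix_proc_t peer;
    pmix_value_t *val = NULL;
    pmix_status_t rc;

    for (uint32_t r = 0; r < job_size; r++) {
        PMIX_LOAD_PROCID(&peer, myproc->nspace, r);
        rc = PMIx_Get(&peer, "example.pml.key", NULL, 0, &val);
        if (PMIX_SUCCESS != rc) {
            return -1;               /* peer's value could not be fetched */
        }
        PMIX_VALUE_RELEASE(val);     /* would compare against our own PML here */
    }
    return 0;
}
```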
I am not sure about the provenance of this check, but I can say it has an MPI world process model flavor about it.
Would it help for us to talk about it? I can jump on a meeting (RM or devel or whatever) if it would help. It's somewhat orthogonal to this PR, though it all loosely falls under the "can we optimize/reduce all these fence operations" category.
Just curious: @yosefe Does your approval serve as indication that you don't care about the extra "fence" operations in the UCX code path? Or did you not read the above discussion about the problems this creates?
Yes, UCX requires this fence in order to synchronize the destroy flow (a UCX endpoint must be destroyed before the target UCX worker). In any case, PMIx seems to be a lower SW layer than OPAL, so IMO it should be closed after OPAL.
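As a hedged illustration of that constraint (a sketch assuming an already-created worker and endpoint array, not OMPI's actual `opal_common_ucx_del_procs`):

```c
#include <ucp/api/ucp.h>
#include <pmix.h>

/* A sketch of the ordering constraint: all ranks must finish
 * disconnecting their endpoints before any rank destroys the worker
 * those endpoints were targeting. */
static void shutdown_sketch(ucp_worker_h worker, ucp_ep_h *eps, int neps)
{
    for (int i = 0; i < neps; i++) {
        ucs_status_ptr_t req = ucp_ep_close_nb(eps[i], UCP_EP_CLOSE_MODE_FLUSH);
        if (UCS_PTR_IS_PTR(req)) {
            /* Progress the worker until the close/flush completes. */
            while (UCS_INPROGRESS == ucp_request_check_status(req)) {
                ucp_worker_progress(worker);
            }
            ucp_request_free(req);
        }
    }

    /* Synchronize across all ranks; in OMPI this is the PMIx_Fence
     * issued from opal_common_ucx_mca_pmix_fence(), which is why PMIx
     * must still be alive at this point. */
    PMIx_Fence(NULL, 0, NULL, 0);

    ucp_worker_destroy(worker);
}
```

The fence sits between closing the endpoints and destroying the worker, which is precisely why it has to run before PMIx itself is finalized; that ordering is what this PR restores.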
@rhc54 the instance_init code is supposed to fire up the minimal amount of stuff needed to support a session. The author of all this code called it an instance rather than a session. Curiously, the original version of this code (or at least the last version prior to the developer's departure from LANL) had no pmix fences, but the add_procs at that time seemed to be okay with this. Note this work was originally done in 2018... Looks like commit 5418cc5 changed the behavior of
We need to come up with a finalize solution that doesn't require a pmix fence before del_procs, as this won't necessarily work for a session finalize. I don't like the coupling of one UCX consumer's actions with another's. This XPMEM problem is likely related to #9868.
Agreed - I don't dispute the move, only wondering if we should take a closer look at the overall architecture.
I wasn't trying to be critical of the code, and hope it didn't come across that way. My overall impression from looking at it is that quite a few things have happened around MPI init/finalize, and people have patched spot problems as they appeared. As a result, some significant inefficiencies have crept into the code - e.g., one can "fix" a problem by inserting another "fence", but maybe the right answer is to figure out how to take advantage of the existing fence?
Well, digging more into this, it seems there are various places where OPAL_MODEX_RECV_IMMEDIATE is used in add_procs methods (see init_sm_endpoint), and these don't work correctly if there isn't a PMIx_Fence prior to the add_procs method for that PML/BTL, etc.
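For reference, a minimal sketch, with a made-up key name, of why an "immediate" modex receive depends on data having been collected beforehand:

```c
#include <stdbool.h>
#include <pmix.h>

/* Illustrative sketch; "example.sm.endpoint" is a made-up key standing
 * in for whatever a component like the sm BTL actually publishes.
 * PMIX_IMMEDIATE tells the server to return an error if the value is
 * not already available rather than asking the host environment for
 * it, so without a prior collecting PMIx_Fence the lookup can fail. */
static pmix_status_t immediate_get_sketch(const pmix_proc_t *peer)
{
    pmix_info_t info;
    pmix_value_t *val = NULL;
    bool flag = true;
    pmix_status_t rc;

    PMIX_INFO_CONSTRUCT(&info);
    PMIX_INFO_LOAD(&info, PMIX_IMMEDIATE, &flag, PMIX_BOOL);

    rc = PMIx_Get(peer, "example.sm.endpoint", &info, 1, &val);
    if (PMIX_SUCCESS == rc) {
        PMIX_VALUE_RELEASE(val);
    }
    PMIX_INFO_DESTRUCT(&info);
    return rc;
}
```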
@MamziB please open up the v5.0.x version of this PR |
v5.0.x fails inside MPI_Finalize when we run NWCHEM (when --mca pml ucx is set).
We searched for a smaller reproducer and found that osu_allgather also fails with the exact same backtrace. Please see the bottom of this description for the backtrace and how to reproduce it.
We suspected that the issue might be coming from the UCX library, so we searched for a UCX commit that does not fail with these tests.
We found the commit below, prior to which there is no such failure:
Then we reached out to @yosefe for his thoughts. He debugged the Open MPI main branch and found that, since this commit enables shared memory, processes must be accessing remote memory that no longer exists when ep_disconnect is called. It seems the issue is that PMIx is finalized before UCX del_procs is called. As a result, PMIx_Fence() called from opal_common_ucx_mca_pmix_fence() fails silently and is actually a no-op.
Based on these suggestions, this PR was made. Please let me know your thoughts. This PR fixes all these failures in NWCHEM and osu_allgather.
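For clarity, a minimal sketch of the ordering this PR establishes; the helper below is a hypothetical stand-in for OMPI's actual UCX cleanup, not the real function:

```c
#include <pmix.h>

/* Hypothetical stand-in for OMPI's UCX cleanup path: disconnect the
 * UCX endpoints, then fence so every rank has finished before anyone
 * tears down a worker (see the UCX sketch earlier in the thread). */
static void ucx_del_procs_and_cleanup(void)
{
    /* ... close endpoints and progress until complete ... */
    PMIx_Fence(NULL, 0, NULL, 0);   /* opal_common_ucx_mca_pmix_fence() in OMPI */
}

static void finalize_ordering_sketch(void)
{
    /* 1. UCX del_procs (including its fence) runs while PMIx is still
     *    initialized, so the fence is a real barrier and not a no-op. */
    ucx_del_procs_and_cleanup();

    /* 2. Only afterwards is PMIx shut down - in OMPI, at the end of
     *    ompi_rte_finalize(). */
    PMIx_Finalize(NULL, 0);
}
```

Before the change, PMIx was finalized first, so the fence inside the UCX cleanup silently became a no-op and ranks could end up disconnecting endpoints whose target memory was already gone.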
Signed-off-by: Mamzi Bayatpour [email protected]
Reproduce the issue: