pml/ucx: move pmix finalize to the end of ompi_rte_finalize() #11228
Conversation
Signed-off-by: Mamzi Bayatpour <[email protected]>
So.....you are saying that UCX introduces two additional fence operations into the MPI lifecycle??? One in init, and another in finalize? Why would you want to do that - it was Mellanox that originally was so concerned about launch scaling. Seems like something isn't quite right in this discussion. Perhaps the right question would be to ask why UCX requires these extra operations and spend a little effort to remove them. Up to you guys - it's your product and your customers. Perhaps the concerns over startup/finalize times no longer exist there?
Why the pmix_fence in the opal_common_ucx_del_procs call? This won't work in the general case, i.e. using MPI Sessions. Note the UCX transport within OMPI doesn't currently support sessions, or more specifically MPI_Comm_create_from_group, so it doesn't exactly matter at this point.
Moreover, pml_add_procs and pml_del_procs are not supposed to be collective calls, so no collective behavior should be inserted in there or badness will follow.
I believe that is why we have a "fence" at the end of MPI init - to support those PMLs that need a fence before becoming operational. It isn't in "add_procs" itself - it is performed separately. Having a fence inside add_procs is...potentially problematic.
I found that this code is problematic unless there's a full pmix_fence prior to add_procs, at least for OB1:
If I set opal_pmix_collect_all_data to false we don't need the pmix barrier, but that's not the default.
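For readers following the thread, here is a minimal sketch of what such a job-wide fence looks like at the PMIx client level; `fence_with_collect` and `myproc` are illustrative names, not OMPI code:

```c
#include <stdbool.h>
#include <pmix.h>

/* Hypothetical helper (not OMPI code): perform a job-wide fence,
 * optionally collecting everyone's posted modex data. */
static pmix_status_t fence_with_collect(const pmix_proc_t *myproc, bool collect)
{
    pmix_proc_t wildcard;
    pmix_info_t info;
    pmix_status_t rc;

    /* Address every process in my namespace. */
    PMIX_LOAD_PROCID(&wildcard, myproc->nspace, PMIX_RANK_WILDCARD);

    /* collect=true asks the fence to gather all data posted via
     * PMIx_Put/PMIx_Commit so later lookups can be satisfied locally;
     * collect=false turns the fence into a plain barrier. */
    PMIX_INFO_CONSTRUCT(&info);
    PMIX_INFO_LOAD(&info, PMIX_COLLECT_DATA, &collect, PMIX_BOOL);

    rc = PMIx_Fence(&wildcard, 1, &info, 1);
    PMIX_INFO_DESTRUCT(&info);
    return rc;
}
```

With collect=false (the direct-modex case) the fence is only a barrier, so any later lookup of a peer's data has to be fetched on demand, which is the tradeoff the rest of this discussion keeps running into.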
Hmmm...looking through the code in ompi_mpi_init, it appears to me that someone has made a significant mistake. They put ... What you need is a more linear code flow that supports "instances" as well as the initial "MPI Init" call. It should allow the modex to occur prior to any "add_procs", then initialization of transports and infrastructure, followed by a single "fence" at the end of Init for those who need it. Might be worth the time to straighten this out now, or else you'll continue to struggle with insertion of arbitrary fences all over the code. The issue really is that the "instance" code duplicates so much of "mpi_init" but isn't well integrated into it. One or the other should probably be eliminated, but that's just my initial impression.
Something hit me about this piece of the code @hppritcha posted:

```c
if (!opal_pmix_collect_all_data) {
    /*
     * If direct modex, then compare our PML with the peer's PML
     * for all procs
     */
    for (i = 0; i < nprocs; i++) {
        ret = mca_pml_base_pml_check_selected_impl(
            my_pml,
            procs[i]->super.proc_name);
        if (ret) {
            return ret;
        }
    }
```

Are you folks really saying that if someone doesn't want a full data exchange, meaning they are going with direct modex - you are seriously going to force a complete retrieval of every proc's information by every proc, just to check their PML selection?? I wonder if you realize that this creates an NxN exchange of OOB messages across the entire job? It is, quite simply, the worst possible thing you could do as it is the most inefficient method for exchanging all that data.

If you truly require this check, then you should just remove the async modex code completely and force everyone to do the full exchange. I am only aware of AWS needing this check - so perhaps there is another method you could use for selecting it? Perhaps AWS could add an envar to their images indicating that full modex coupled with this check must be performed for all apps? It would then avoid penalizing everyone else. Or if you can think of some other method for controlling selection, that would be worth some thought. As things stand, your only choice is to perform a full modex (which is burdensome) or perform an even worse NxN lookup. Not good choices for most people.
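To make the cost pattern concrete, here is a rough sketch, with a hypothetical key and helper name, of what a per-peer lookup loop amounts to at the PMIx level:

```c
#include <stdint.h>
#include <pmix.h>

/* Illustrative sketch only: "example.pml.key" and the helper name are
 * made up, standing in for the PML selection value OMPI publishes.
 * Under direct modex, every PMIx_Get that misses the local cache turns
 * into an on-demand request toward the peer's server, so N ranks each
 * looping over N peers produces an N x N message pattern. */
static int check_every_peer(const pmix_proc_t *myproc, uint32_t job_size)
{
    pmix_proc_t peer;
    pmix_value_t *val = NULL;
    pmix_status_t rc;

    for (uint32_t r = 0; r < job_size; r++) {
        PMIX_LOAD_PROCID(&peer, myproc->nspace, r);
        rc = PMIx_Get(&peer, "example.pml.key", NULL, 0, &val);
        if (PMIX_SUCCESS != rc) {
            return -1;               /* peer's value could not be fetched */
        }
        PMIX_VALUE_RELEASE(val);     /* would compare against our own PML here */
    }
    return 0;
}
```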
I am not sure about the provenance of this check, but I can say it has an MPI world process model flavor about it.
Would it help for us to talk about it? I can jump on a meeting (RM or devel or whatever) if it would help. It's somewhat orthogonal to this PR, though it all loosely falls under the "can we optimize/reduce all these fence operations" category.
Just curious: @yosefe Does your approval serve as indication that you don't care about the extra "fence" operations in the UCX code path? Or did you not read the above discussion about the problems this creates?
Yes, UCX requires this fence in order to synchronize the destroy flow (a UCX endpoint must be destroyed before the target UCX worker). In any case, PMIx seems to be a lower SW layer than OPAL, so IMO it should be closed after OPAL.
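As a hedged illustration of that constraint (a sketch assuming an already-created worker and endpoint array, not OMPI's actual `opal_common_ucx_del_procs`):

```c
#include <ucp/api/ucp.h>
#include <pmix.h>

/* A sketch of the ordering constraint: all ranks must finish
 * disconnecting their endpoints before any rank destroys the worker
 * those endpoints were targeting. */
static void shutdown_sketch(ucp_worker_h worker, ucp_ep_h *eps, int neps)
{
    for (int i = 0; i < neps; i++) {
        ucs_status_ptr_t req = ucp_ep_close_nb(eps[i], UCP_EP_CLOSE_MODE_FLUSH);
        if (UCS_PTR_IS_PTR(req)) {
            /* Progress the worker until the close/flush completes. */
            while (UCS_INPROGRESS == ucp_request_check_status(req)) {
                ucp_worker_progress(worker);
            }
            ucp_request_free(req);
        }
    }

    /* Synchronize across all ranks; in OMPI this is the PMIx_Fence
     * issued from opal_common_ucx_mca_pmix_fence(), which is why PMIx
     * must still be alive at this point. */
    PMIx_Fence(NULL, 0, NULL, 0);

    ucp_worker_destroy(worker);
}
```

The fence sits between closing the endpoints and destroying the worker, which is precisely why it has to run before PMIx itself is finalized; that ordering is what this PR restores.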
@rhc54 the instance_init code is supposed to fire up the minimal amount of stuff needed to support a session. The author of all this code called it an instance rather than a session. Curiously, the original version of this code (or at least the last version prior to the developer's departure from LANL) had no pmix fences, but the add_procs at that time seemed to be okay with this. Note this work was originally done in 2018... Looks like commit 5418cc5 changed the behavior of
We need to come up with a finalize solution that doesn't require a pmix fence before del_procs, as this won't necessarily work for a session finalize. I don't like the coupling of one UCX consumer's actions with another's. This XPMEM problem is likely related to #9868.
Agreed - I don't dispute the move, only wondering if we should take a closer look at the overall architecture.
I wasn't trying to be critical of the code, and hope it didn't come across that way. My overall impression from looking at it is that quite a few things have happened around MPI init/finalize, and people have patched spot problems as they appeared. As a result, some significant inefficiencies have crept into the code - e.g., one can "fix" a problem by inserting another "fence", but maybe the right answer is to figure out how to take advantage of the existing fence?
Well, digging more into this, it seems there are various places where OPAL_MODEX_RECV_IMMEDIATE is used in add_procs methods (see init_sm_endpoint), and these don't work correctly if there isn't a PMIx_Fence prior to the add_procs method for that PML/BTL, etc.
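For reference, a minimal sketch, with a made-up key name, of why an "immediate" modex receive depends on data having been collected beforehand:

```c
#include <stdbool.h>
#include <pmix.h>

/* Illustrative sketch; "example.sm.endpoint" is a made-up key standing
 * in for whatever a component like the sm BTL actually publishes.
 * PMIX_IMMEDIATE tells the server to return an error if the value is
 * not already available rather than asking the host environment for
 * it, so without a prior collecting PMIx_Fence the lookup can fail. */
static pmix_status_t immediate_get_sketch(const pmix_proc_t *peer)
{
    pmix_info_t info;
    pmix_value_t *val = NULL;
    bool flag = true;
    pmix_status_t rc;

    PMIX_INFO_CONSTRUCT(&info);
    PMIX_INFO_LOAD(&info, PMIX_IMMEDIATE, &flag, PMIX_BOOL);

    rc = PMIx_Get(peer, "example.sm.endpoint", &info, 1, &val);
    if (PMIX_SUCCESS == rc) {
        PMIX_VALUE_RELEASE(val);
    }
    PMIX_INFO_DESTRUCT(&info);
    return rc;
}
```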
@MamziB please open up the v5.0.x version of this PR |
v5.0.x fails inside MPI_Finalize when we run NWCHEM (when --mca pml ucx is set).
We searched for a smaller reproducer and found that osu_allgather also fails with the exact same backtrace. Please see the bottom of this description for the backtrace and how to reproduce it.
We suspected that the issue might be coming from the UCX library, so we searched for a UCX commit that does not fail with these tests.
We found the commit below, prior to which there is no such failure:
Then we reached out to @yosefe for his thoughts. He debugged the Open MPI main branch and found that, since this commit enables shared memory, processes must be accessing remote memory that no longer exists when ep_disconnect is called. It seems the issue is that PMIx is finalized before UCX del_procs is called. As a result, PMIx_Fence() called from opal_common_ucx_mca_pmix_fence() fails silently and is actually a no-op.
Based on these suggestions, this PR was made. Please let me know your thoughts. This PR fixes all these failures in NWCHEM and osu_allgather.
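For clarity, a minimal sketch of the ordering this PR establishes; the helper below is a hypothetical stand-in for OMPI's actual UCX cleanup, not the real function:

```c
#include <pmix.h>

/* Hypothetical stand-in for OMPI's UCX cleanup path: disconnect the
 * UCX endpoints, then fence so every rank has finished before anyone
 * tears down a worker (see the UCX sketch earlier in the thread). */
static void ucx_del_procs_and_cleanup(void)
{
    /* ... close endpoints and progress until complete ... */
    PMIx_Fence(NULL, 0, NULL, 0);   /* opal_common_ucx_mca_pmix_fence() in OMPI */
}

static void finalize_ordering_sketch(void)
{
    /* 1. UCX del_procs (including its fence) runs while PMIx is still
     *    initialized, so the fence is a real barrier and not a no-op. */
    ucx_del_procs_and_cleanup();

    /* 2. Only afterwards is PMIx shut down - in OMPI, at the end of
     *    ompi_rte_finalize(). */
    PMIx_Finalize(NULL, 0);
}
```

Before the change, PMIx was finalized first, so the fence inside the UCX cleanup silently became a no-op and ranks could end up disconnecting endpoints whose target memory was already gone.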
Signed-off-by: Mamzi Bayatpour [email protected]
Reproduce the issue: