Skip to content

osc/ucx: Move the ucx_mca_register/deregister to open/close #10758

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Sep 2, 2022

Conversation

jjhursey
Copy link
Member

@jjhursey jjhursey commented Sep 2, 2022

  • In the scenario where UCX is built into Open MPI, but not used at runtime
    then the memory hooks infrastructure can get into a situation where it has
    an invalid function pointer resulting in a segv.
  • Move the opal_common_ucx_mca_register and opal_common_ucx_mca_deregister
    calls to the component open and close functions, just like the UCX PML.

 * In the scenario where UCX is built into Open MPI, but not used at runtime
   then the memory hooks infrastructure can get into a situation where it has
   an invalid function pointer resulting in a segv.
 * Move the `opal_common_ucx_mca_register` and `opal_common_ucx_mca_deregister`
   calls to the component open and close functions, just like the UCX PML.

Signed-off-by: Joshua Hursey <[email protected]>
@jjhursey
Copy link
Member Author

jjhursey commented Sep 2, 2022

This one was tricky and hand to do with component opening/initializing/finalizing/closing, unloading in a scenario that is probably not commonly tested.

From what I can tell, the problem in v4.1.x is that the common/ucx is being unloaded while the opal_common_ucx_mem_release_cb memory hook is still registered. So when a memory release happens in the finalization of the OSC components it encounters an invalid function pointer (the memory it points to is no longer valid). So it segv's.

I'm running in a Docker container without IB. So the default pml_ucx_tls will exclude the ucx PML because there is no device match. If I set export OMPI_MCA_pml_ucx_tls=tcp,self,sysv,posix then the ucx PML is selected, and the test case runs fine without a segv.

The test case was simple: mpirun -np 2 examples/hello_c

Thus the scenario is when the ucx PML is loaded and disqualifies itself (due to the pml_ucx_tls default), and the ucx OSC is loaded and also disqualifies itself. A scenario where UCX is built in, but cannot be used at runtime.

The difference is that the ucx PML calls opal_common_ucx_mca_register and opal_common_ucx_mca_deregister in mca_pml_ucx_component_open and mca_pml_ucx_component_close, respectively. However, usc OSC calls opal_common_ucx_mca_register in component_init instead of open, and (most critically) calls opal_common_ucx_mca_deregister in component_finalize (instead of in a component_close like the PML) which is called during MPI_Finalize.

This probably did not impact main or v5.0.x because they don't rely as strongly on DSOs, unlike the older branches. However, it is good to apply the fix to main and v5.0.x to keep things uniform across the branches.

@janjust
Copy link
Contributor

janjust commented Sep 2, 2022

@jjhursey Thanks for the fix! Can you open the equivalent v4.1 and v5.0 PR?

@janjust janjust merged commit cdb631d into open-mpi:main Sep 2, 2022
@jjhursey jjhursey deleted the fix-ucx-reg-segv branch September 6, 2022 13:53
@jjhursey
Copy link
Member Author

jjhursey commented Sep 6, 2022

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants