Skip to content

mca: Dynamic components link against project lib #4121

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Aug 25, 2017

Conversation

jjhursey
Copy link
Member

@jjhursey jjhursey commented Aug 21, 2017

MCA components in ompi/
       $(top_builddir)/ompi/lib@[email protected]
MCA components in orte/
       $(top_builddir)/orte/lib@[email protected]
MCA components in opal/
       $(top_builddir)/opal/lib@[email protected]

Note: The changes in this commit were automated. Some components
were not included because they are statically built only.

@@ -37,6 +38,9 @@ mcacomponentdir = $(ompilibdir)
mcacomponent_LTLIBRARIES = $(component_install)
mca_bml_r2_la_SOURCES = $(r2_sources)
mca_bml_r2_la_LDFLAGS = -module -avoid-version
mca_bml_r2_la_LIBADD = $(top_builddir)/ompi/lib@[email protected] \
$(top_builddir)/orte/lib@[email protected] \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blah; First, we should find a way to macro-ize this, so there's only one thing in LIBADD.

But second, is a dependency on all three libraries the right thing here? It means that a bump in the major version of any one of the three will result in a load failure, which I guess is fine?

@jjhursey
Copy link
Member Author

Note the WIP tag. Initial commit basically reversed this commit with a script. Initial PR posted to give folks an early chance to test.

This PR needs some cleanup still including:

  • Cleanup Java bindings (dlopen trick no longer needed)
  • Does not include the oshmem project (only ompi/orte/opal)
  • See if we clean generalize the per-component change using project-level rules or additional build system scripts.

If you have more suggestions for changes feel free to keep the conversation going here or on the issue.

@dalcinl
Copy link
Contributor

dalcinl commented Aug 21, 2017

@jjhursey I confirm all of mpi4py testsuite passes. I have two minor regressions, but they seem to be unrelated to this PR (these failures occur even if I dlopen libmpi).

@rhc54
Copy link
Contributor

rhc54 commented Aug 21, 2017

@jjhursey You really cannot do this - building "-no-orte" and using your own RTE will cause all of these components to fail. You can only be certain that the project the component belongs to will have built - you cannot assume any other project did.

@bwbarrett
Copy link
Member

I'm not sure I completely agree with @rhc54. I agree that we probably shouldn't explicitly include the liborte / libopal paths in every OMPI makefile (or libopal in every ORTE makefile). But I think at configure time, you can know which liborte you're going to link with (at least, whether it's a relative path or out-of-tree libopen-rte), since you can know when you link libmpi. So that's solvable, but with more work than we have today.

It's also worth noting that not all components have a build-time dependency on libopen-rte, so it's not clear that we should be linking libopen-rte into every component. Probably safest to do that during the initial pass, but definitely not necessary in general.

@jjhursey
Copy link
Member Author

I could try a version where each project level only links to it's project level library (and not that of the one(s) beneath it). I suspect that will still work for the driving use case since libmpi links against libopen-rte and libopen-pal (and libopen-rte links against libopen-pal). That seems a bit cleaner.

@rhc54
Copy link
Contributor

rhc54 commented Aug 21, 2017

@bwbarrett I'm afraid that isn't true as the RTE enters the MPI layer solely thru the ompi/mca/rte framework. There is nothing in the configure line to tell you which RTE component is going to be selected, and only the RTE component actually links in the RTE library.

@rhc54
Copy link
Contributor

rhc54 commented Aug 21, 2017

This is why there are no RTE symbols anywhere in the OMPI layer - not in the base, nor in any component.

@bwbarrett
Copy link
Member

@rhc54, ah, if we've actually completed that transition, then I agree with you. We shouldn't link libopen-rte.so into the OMPI modules (nor should we need to).

Still an open question about open-pal.so, but it's not needed, so seems simpler to avoid it and just link in the project library.

@jjhursey jjhursey force-pushed the explore/dlopen-local branch from a61e2f7 to 822d749 Compare August 21, 2017 21:51
@jjhursey
Copy link
Member Author

I just re-posted the commit where each component only links against it's project's lib. I'm rebuilding now to confirm it still fixes the problem.

@jjhursey
Copy link
Member Author

I confirmed that this linking model works for the python binding example.

@jjhursey
Copy link
Member Author

Per teleconf - This is looking good. A few notes:

  • Add the update script to contrib/ - so we have it for later
  • Note that PMIx and hwloc projects may want to consider a similar patch for their component setup.
  • Add oshmem project

I'll try to polish this off today/tomorrow.

@dalcinl
Copy link
Contributor

dalcinl commented Aug 22, 2017

@jjhursey I've rebuilt with your latest changes, mpi4py worked just fine. Do you still need help with the Java bindings? I can contribute a patch, but I have no clue how to test it.

@jjhursey jjhursey force-pushed the explore/dlopen-local branch 2 times, most recently from be33adf to f04dc6f Compare August 22, 2017 19:07
@dalcinl
Copy link
Contributor

dalcinl commented Aug 22, 2017

@jjhursey autogen.pl after git clean -dxf produces the following warnings. Is that expected?

...
opal/mca/btl/openib/Makefile.am:121: warning: mca_btl_openib_la_LIBADD was already defined in condition TRUE, which includes condition OPAL_cuda_support ...
opal/mca/btl/openib/Makefile.am:117: ... 'mca_btl_openib_la_LIBADD' previously defined here
configure.ac: installing 'config/ylwrap'
opal/mca/btl/smcuda/Makefile.am:57: warning: mca_btl_smcuda_la_LIBADD was already defined in condition TRUE, which includes condition OPAL_cuda_support ...
opal/mca/btl/smcuda/Makefile.am:52: ... 'mca_btl_smcuda_la_LIBADD' previously defined here
opal/mca/rcache/gpusm/Makefile.am:54: warning: mca_rcache_gpusm_la_LIBADD was already defined in condition TRUE, which includes condition OPAL_cuda_support ...
opal/mca/rcache/gpusm/Makefile.am:51: ... 'mca_rcache_gpusm_la_LIBADD' previously defined here
opal/mca/rcache/grdma/Makefile.am:53: warning: mca_rcache_grdma_la_LIBADD was already defined in condition TRUE, which includes condition OPAL_cuda_support ...
opal/mca/rcache/grdma/Makefile.am:50: ... 'mca_rcache_grdma_la_LIBADD' previously defined here
opal/mca/rcache/rgpusm/Makefile.am:52: warning: mca_rcache_rgpusm_la_LIBADD was already defined in condition TRUE, which includes condition OPAL_cuda_support ...
opal/mca/rcache/rgpusm/Makefile.am:49: ... 'mca_rcache_rgpusm_la_LIBADD' previously defined here
...

@jjhursey
Copy link
Member Author

@dalcinl Ah good catch. I'll fix that in the next series of pushes (missed the += in the script).
I also have the patch for Java that I'll push up too. I tested it locally with a simple Java MPI program and it worked fine.

@jjhursey jjhursey force-pushed the explore/dlopen-local branch from f04dc6f to 95a0552 Compare August 22, 2017 21:46
@dalcinl
Copy link
Contributor

dalcinl commented Aug 23, 2017

One minor detail and a question about Java bindings:

  • @jjhursey Remove the line static void *libmpi = NULL; from file ompi/mpi/java/c/mpi_MPI.c.
  • @jsquyres Is the call to opal_init_psm() still needed? See 5071602 for reference.

@jsquyres
Copy link
Member

@dalcinl Unfortunately, I do not think we'll ever be able to remove that call to opal_init_psm().

@jjhursey jjhursey force-pushed the explore/dlopen-local branch from 04e7fb3 to d6f5de3 Compare August 23, 2017 13:48
@jjhursey
Copy link
Member Author

I just pushed an update to the Java bindings commit that removed the static void *libmpi = NULL;. Thanks for catching that.

For the opal_init_psm() I wasn't sure if we needed it still or not. It seems like we still do from Jeff's comment. We should probably get someone to test PSM+Java with this branch to make sure all is ok.

@jsquyres
Copy link
Member

@matcabral Can you test the Java bindings on this branch, since it has PSM[2] implications?

Thanks!

@rhc54
Copy link
Contributor

rhc54 commented Aug 23, 2017

I'll only modify that "never" a bit to be "for the next few years". AFAIK, the problem in the PSM library has been resolved, and so we only need it to protect against earlier versions until "everyone" has upgraded.

@dalcinl
Copy link
Contributor

dalcinl commented Aug 23, 2017

@jsquyres I'm not sure you understood my question. Do you need to call opal_init_psm() right there in the Java bindings bootstrap function before MPI is initialized?

@matcabral
Copy link
Contributor

Hi all,
@jsquyres I will test your request.
@dalcinl , this has to be done here because and old version of PSM/PSM2 was hijacking signal handlers in the library constructor. Which implies this happened whenever it was dlopened. Details at #1781 . As @rhc54 mentions, this was fixed in newer versions.

@matcabral
Copy link
Contributor

matcabral commented Aug 23, 2017

@jsquyres I just remembered the JAVA bindings are expected to not work with PSM:
https://www.open-mpi.org/faq/?category=java#java_limitations

@dalcinl
Copy link
Contributor

dalcinl commented Aug 24, 2017

@matcabral Maybe that comment in the FAQ is actually outdated?

OK, so now I think you should keep the opal_init_psm() call in the Java bindings. If the Java bindings are loaded but MPI_Init() is never called, you may be in trouble with the PSM signal hijacking. So better to stay safe and keep the call.

@jjhursey
Copy link
Member Author

Once I get the 👍 from the reviewers we can merge, and then consider the change for the release branches.

@dalcinl
Copy link
Contributor

dalcinl commented Aug 24, 2017

@jsquyres What release is going to get this change? Maybe 3.0.0?
@jjhursey Do you know whether Spectrum MPI will incorporate the changes in this PR?

@jsquyres
Copy link
Member

@bwbarrett Is this too late for v3.0.0?

Copy link
Member

@jsquyres jsquyres left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty minor nit-picks in the python.

'oshmem' : ["$(top_builddir)/oshmem/liboshmem.la"],
}
# projects= {'ompi' : ["$(top_builddir)/ompi/lib@[email protected]"],
# }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably remove this commented-out code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I'll fix.

print("")

print("="*40);
print("{:>3} : Files skipped".format(len(skipped_files)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why len(..) here, but str(len(...)) elsewhere?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably just need the len( version here since I'm using the new .format mode that doesn't need it to be converted to a string first. I'll update the others.

@bwbarrett
Copy link
Member

I think it's probably too late for 3.0.0, but since this doesn't break ABI, I think I'd be ok pulling it in for 3.0.1. Let's see how rc4 goes; if there are other major issues, we can re-evaluate.

 * This script will search for all of the `Makefile.am` files in each
   of the project-level components. Then it adds the project-level
   library to `mca_FRAMEWORK_COMPONENT_la_LIBADD`.
   - If the library is already in the LIBADD list then it's skipped.
     So it is safe to run multiple times on the same codebase.

Signed-off-by: Joshua Hursey <[email protected]>
 * Resolves open-mpi#3705
 * Components should link against the project level library to better
   support `dlopen` with `RTLD_LOCAL`.
 * Extend the `mca_FRAMEWORK_COMPONENT_la_LIBADD` in the `Makefile.am`
   with the appropriate project level library:
```
MCA components in ompi/
       $(top_builddir)/ompi/lib@[email protected]
MCA components in orte/
       $(top_builddir)/orte/lib@[email protected]
MCA components in opal/
       $(top_builddir)/opal/lib@[email protected]
MCA components in oshmem/
       $(top_builddir)/oshmem/liboshmem.la"
```

Note: The changes in this commit were automated by the script in
the commit that proceeds it with the `libadd_mca_comp_update.py`
script. Some components were not included in this change because
they are statically built only.

Signed-off-by: Joshua Hursey <[email protected]>
 * See discussion on Issue open-mpi#3705 regarding why this is no longer needed.

Signed-off-by: Joshua Hursey <[email protected]>
@jjhursey jjhursey force-pushed the explore/dlopen-local branch from d6f5de3 to 49c40f0 Compare August 24, 2017 15:56
@jjhursey
Copy link
Member Author

@jsquyres I just re-pushed fixing the python items you mentioned.

@gpaulsen
Copy link
Member

We'd love this to go into v3.0.0 (If there's going to be another RC of substantial content). If v3.0.0 ships, we're excited about that too. 🥇 v3.0.1 will work in that case.

@jjhursey
Copy link
Member Author

@rhc or @bwbarrett Any other comments on this PR before I merge?

@rhc54
Copy link
Contributor

rhc54 commented Aug 25, 2017

ok with me!

@bwbarrett
Copy link
Member

@jjhursey, Works for me.

@jjhursey jjhursey merged commit ad87aa2 into open-mpi:master Aug 25, 2017
@jjhursey jjhursey deleted the explore/dlopen-local branch August 25, 2017 18:15
@rhc54
Copy link
Contributor

rhc54 commented Aug 30, 2017

Ouch - I have found one downside to this PR. Now every top-level compile will rerun thru every component and relink it, even if nothing changed in it or anywhere in the code base. Significantly slower build times.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants