Fixing VERSION file for v5.0.0rc10 #11365

Merged: gpaulsen merged 1 commit into open-mpi:v5.0.x from topic/v5.0.x/VERSION on Mar 8, 2023

Conversation

@gpaulsen (Member) commented Jan 31, 2023

bot:notacherrypick

Open MPI v5.0.0 shared libraries are ABI compatible with v4.1.x, with a few possible subtle exceptions for Fortran.

In the rare case that you compile your application in such a way that the size of an integer for C is different from the size of an integer in Fortran, you'll need to rebuild and relink your application.

There are some additional Fortran API changes involving intents and asyncs (INTENT and ASYNCHRONOUS attributes), along with changing some interfaces from named to unnamed.

Refer to https://docs.open-mpi.org/en/v5.0.x/version-numbering.html#shared-library-version-number for the shared library versioning policy.

Signed-off-by: Geoffrey Paulsen [email protected]


@gpaulsen force-pushed the topic/v5.0.x/VERSION branch from cec1cd5 to f5ebffb on January 31, 2023 18:45

@gpaulsen force-pushed the topic/v5.0.x/VERSION branch from f5ebffb to b94bc01 on February 1, 2023 11:20
@janjust (Contributor) commented Feb 1, 2023

I tested the PR by compiling a hello world with the v4.1.x mpicc/mpifort and running it in a v5.0.x environment. v5.0.x with this PR succeeds; without this PR it fails because it cannot find the shared library:
./a.out: error while loading shared libraries: libopen-rte.so.40: cannot open shared object file: No such file or directory

@bwbarrett (Member)

@janjust which Fortran interface did you use (since they all use the mpifort compiler)?

@janjust (Contributor) commented Feb 1, 2023

f90

@lrbison (Contributor) commented Feb 1, 2023

My testing in examples:

find -maxdepth 1 -type f -executable -exec mpirun -n 2 --prefix=/fsx/lrbison/installs/mpi-main -x LD_LIBRARY_PATH=/fsx/lrbison/installs/mpi-main/lib {} \; >/dev/null
./hello_c: Symbol `ompi_mpi_comm_world' has different size in shared object, consider re-linking
./hello_c: Symbol `ompi_mpi_comm_world' has different size in shared object, consider re-linking
./ring_c: Symbol `ompi_mpi_comm_world' has different size in shared object, consider re-linking
./ring_c: Symbol `ompi_mpi_comm_world' has different size in shared object, consider re-linking
./connectivity_c: Symbol `ompi_mpi_comm_world' has different size in shared object, consider re-linking
./connectivity_c: Symbol `ompi_mpi_comm_world' has different size in shared object, consider re-linking

Which ran on the following examples:

./hello_c
./ring_c
./connectivity_c
*
./hello_mpifh
./ring_mpifh
./hello_usempi
./ring_usempi
./hello_usempif08
./ring_usempif08

*I removed spc_example from the list, as I couldn't get it to run correctly in my 4.1.4 version anyway.

Looking at the output of the runs (too long to post here), all* of the runs succeed, although there are warnings in the C examples (not in the Fortran ones).

I must pass LD_LIBRARY_PATH explicitly, even with --prefix or /absolute/path/to/mpirun; otherwise I get:

orte_ess_init failed
  --> Returned value Unable to start a daemon on the local node (-127) instead of ORTE_SUCCESS

So my conclusion is that the Fortran ABI is probably stable. I'm not sure about the C interface, given the warnings.

@bwbarrett (Member)

Yes, we need to dig further; I'm surprised we changed MCW's (MPI_COMM_WORLD's) size; that needs investigating.

@janjust (Contributor) commented Feb 1, 2023

I used the C interfaces, a simple hello world example.

$cat hello.f90
program helloworld
  use mpi
  implicit none
  integer :: rank, comsize, ierr

  call MPI_Init(ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, comsize, ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  print *,'Hello World, from task ', rank, 'of', comsize

  call MPI_Finalize(ierr)
end program helloworld

@jsquyres (Member) commented Feb 2, 2023

I used the C interfaces, a simple hello world example.

@janjust I'm not sure what you mean -- you say you used the C interfaces, but you showed an example Fortran program...?

@janjust (Contributor) commented Feb 2, 2023

Never mind -- Fortran and I are in different universes. I thought there might be different MPI interfaces for Fortran programs and that this was what Brian was asking about. Either way, I used the above program, and mpif90 to build and test it.

@jsquyres (Member) commented Feb 2, 2023

We have a bunch of top-level MPI interfaces:

  • C
  • Fortran mpif.h
    • In most -- but not all -- cases, these end up calling the C interfaces (see the sketch just after this list)
  • Two different flavors (depending on compiler support) of Fortran mpi module
    • In most (all? I don't remember offhand) cases, these are effectively the mpif.h interfaces (i.e., the mpi module is just formal Fortran declarations for our mpif.h Fortran interfaces)
  • Fortran mpi_f08 module
    • In most (all? I don't remember offhand) cases, these call a F08 shim routine that then effectively calls the mpi module interface (which, per above, is effectively an alias for the mpif.h Fortran interfaces)
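For illustration, here is a minimal sketch of how an mpif.h-style binding typically forwards to the C interface. This is an assumption-laden sketch, not Open MPI's actual source; the real bindings handle several compiler name-mangling conventions and additional bookkeeping.

#include <mpi.h>

/* Sketch only: the Fortran compiler turns "CALL MPI_COMM_RANK(comm, rank, ierr)"
 * into a call to a mangled symbol such as mpi_comm_rank_, which a thin C
 * wrapper can provide.  Fortran handles are integers (MPI_Fint), so the
 * wrapper converts them to C handles before calling the C binding. */
void mpi_comm_rank_(MPI_Fint *comm, MPI_Fint *rank, MPI_Fint *ierr)
{
    MPI_Comm c_comm = MPI_Comm_f2c(*comm);
    int c_rank = -1;

    *ierr = (MPI_Fint) MPI_Comm_rank(c_comm, &c_rank);
    *rank = (MPI_Fint) c_rank;
}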

@gpaulsen (Member, Author) commented Feb 2, 2023

This PR is attempting to resolve issue #11347. Linking here.

@lrbison (Contributor) commented Feb 2, 2023

I think Tommy's example is using the mpif.h include file, but without explicit "external" statements or MPI_ADDRESS_KIND constants. Fortran will let you call subroutines it knows nothing about; it just hopes you know what you are doing and looks for the symbol during linking. For example, I think you could add a few extra integer arguments to MPI_Init there, and the compile and link would complete without error, but the run could fail mysteriously.

@lrbison (Contributor) commented Feb 2, 2023

It looks like PREDEFINED_COMMUNICATOR_PAD went from 512 to 1024 on main and v5.0.x with #3634.

This is the source of the size warning.
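For context, here is a minimal sketch of the padding pattern being discussed (illustrative names only, not Open MPI's actual source). Predefined objects such as ompi_mpi_comm_world are exported through a fixed-size padded type, so the exported symbol's size stays constant as long as the internal struct fits within the pad:

#define PREDEFINED_COMMUNICATOR_PAD 512   /* 512 in v4.1.x; #3634 bumped it to 1024 */

/* Internal communicator struct: free to gain fields, provided its size
 * stays at or below the pad. */
typedef struct example_communicator_t {
    int   c_index;
    void *c_keyhash;
    /* ... */
} example_communicator_t;

/* ABI-visible wrapper: the padding keeps sizeof() fixed. */
typedef struct example_predefined_communicator_t {
    example_communicator_t comm;
    char padding[PREDEFINED_COMMUNICATOR_PAD - sizeof(example_communicator_t)];
} example_predefined_communicator_t;

/* The kind of symbol applications link against (like ompi_mpi_comm_world): */
example_predefined_communicator_t example_mpi_comm_world;

Changing the pad changes sizeof(example_predefined_communicator_t), and with it the size of the exported symbol, which is exactly what triggers the loader's "different size in shared object" warning shown earlier.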

@jsquyres (Member) commented Feb 2, 2023

I think Tommy's example is using the mpif.h include file

His example used use mpi, not mpif.h.

It looks like PREDEFINED_COMMUNICATOR_PAD went from 512 to 1024

Ah yes; we all forgot about that.

So here's the question (and I don't remember): did we need to do that, or was that just a precautionary bump in size back when we thought we were ok with breaking ABI between 4.x and 5.x? I.e., if we move the PAD back to 512, is it still big enough?

@lrbison (Contributor) commented Feb 2, 2023

use mpi

Ha, oops, I missed that line.

Looks like the ompi_communicator_t struct is 538 bytes when I compile on main. 😞

@bwbarrett (Member)

I think Sessions put us over the edge. It might be time for another layer of indirection; I believe some of the big pieces in the communicator could survive being moved to an auxiliary structure that we reference via pointer. I don't have time this week to take on any of that work, but it does seem like that's our path forward.
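A minimal sketch of that indirection (hypothetical names, not an actual Open MPI patch): larger or newer members move behind a single pointer, so the padded, ABI-visible communicator keeps its old size.

/* Heap-allocated side structure; it can grow freely in future releases
 * without changing the exported symbol's size. */
typedef struct example_comm_extras_t {
    void *session;        /* e.g. Sessions-related state */
    void *instance_info;  /* ...other large or newer members... */
} example_comm_extras_t;

typedef struct example_communicator_t {
    int                    c_index;
    example_comm_extras_t *c_extras;   /* one pointer in the padded struct
                                          instead of the members themselves */
} example_communicator_t;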

@jsquyres (Member) commented Feb 2, 2023

@open-mpi/ompi-rm-5-0-x I filed #11373 to track this issue.

@gpaulsen (Member, Author)

Now that #11373 has been merged to v5.0.x, this change is ready to go.

@gpaulsen marked this pull request as ready for review on February 23, 2023 15:48
@gpaulsen requested a review from bwbarrett on February 23, 2023 15:48
@janjust (Contributor) commented Feb 23, 2023

👍 looks good to me

@bwbarrett (Member) left a review comment

Before we commit this, can we either do the testing or remove the "must be tested" part of the commit message?

@wckzhang (Contributor) commented Mar 2, 2023

What's the status on the testing? I can do some compatibility testing if it hasn't been done so far.

bot:notacherrypick

Open MPI v5.0.0 shared libraries are ABI compatible with v4.1.x with a
few subtle possible exceptions for Fortran.

In the rare case that you compile your application in such a way that
the size of an integer for C is different from the size of an integer
in Fortran, you'll need to rebuild and relink your application.

There are some additional Fortran API changes involving intents and
asyncs, along with changing some interfaces from named to unnamed.

Resetting the age of all internal libraries to 0 for v5.0.0

Refer to https://docs.open-mpi.org/en/v5.0.x/version-numbering.html#shared-library-version-number for policy

Signed-off-by: Geoffrey Paulsen <[email protected]>
@gpaulsen (Member, Author) commented Mar 7, 2023

Retesting. Hope to merge today assuming all goes well.

@gpaulsen force-pushed the topic/v5.0.x/VERSION branch from b94bc01 to 6aaa8b5 on March 7, 2023 18:46
@gpaulsen (Member, Author) commented Mar 7, 2023

Testing succeeded. The force push only updated the commit message (no code change).

@lrbison (Contributor) commented Mar 7, 2023

@gpaulsen, did you build any test libraries? Wei (@wzamazon) did some testing on a related issue; I'll relay his results here:

If I use mpicc from Open MPI 4.1.4 to generate a shared library:
mpicc -o libtest.so -shared -fPIC test_mpilib.c
The file test_mpilib.c has a function that calls MPI_Allreduce.
The resulting libtest.so is directly linked to libopen-rte.so.40 and libopen-pal.so.40.

But I don't think we install libopen-rte.so.40, because we use libprrte.so now, and I think our libopen-pal.so library is at version 80 (not 40).

We found this because we were attempting to run (closed-source) Ansys Fluent using Open MPI 5.0.x, but found that its .so needed these files.
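(test_mpilib.c itself was not posted; as a purely hypothetical sketch, a library function that calls MPI_Allreduce, matching the description above, could look like this:)

#include <mpi.h>

/* Hypothetical test_mpilib.c, built as described with
 * "mpicc -o libtest.so -shared -fPIC test_mpilib.c". */
int test_allreduce_sum(int value)
{
    int sum = 0;

    /* Sum "value" across all ranks in MPI_COMM_WORLD. */
    MPI_Allreduce(&value, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    return sum;
}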

@gpaulsen (Member, Author) commented Mar 7, 2023

No, I built the tests from examples and our internal test harness against the tip of v4.0.x, then rm -Rf'd that OMPI install and configured the tip of OMPI v5.0.x with the same directory via --prefix. I did NOT rebuild the tests; I just reran them and they ran successfully.

@lrbison (Contributor) commented Mar 7, 2023

OK, I did some testing as well and confirmed that OMPI 4.x is NOT directly linking those libraries as I reported in my last post; sorry for the confusion. (Ansys Fluent, however, is, but we will follow up with them separately.)

I have double-checked all the executables in the examples folders, and with this PR they all run successfully.

@jsquyres added this to the v5.0.0 milestone on Mar 7, 2023
@bwbarrett (Member)

For posterity, it's worth noting that ldd does the full recursion of the library chain. So if you run ldd on a binary that links against libmpi.so, you will see libopen-rte and libopen-pal appear, as they are pulled in as dependencies.

I did check that mpicc and ompi-c.pc in versions since 3.0.0 have not included libopen-rte or libopen-pal for dynamic linking. So while Fluent isn't backwards compatible, I don't think that is actually due to Open MPI; we shouldn't be held responsible for their linking against internal libraries.

@gpaulsen merged commit c5fe4aa into open-mpi:v5.0.x on Mar 8, 2023
@gpaulsen deleted the topic/v5.0.x/VERSION branch on March 8, 2023 02:26