Patch for linking libfabric #2519

Closed
amckinstry opened this issue Dec 5, 2016 · 22 comments · Fixed by #6363

@amckinstry

Linking libfabric breaks on Debian/Ubuntu systems (at least) without the following patch:

Author: Gianfranco Costamagna <[email protected]>

--- openmpi-2.0.1.orig/ompi/mca/mtl/ofi/Makefile.am
+++ openmpi-2.0.1/ompi/mca/mtl/ofi/Makefile.am
@@ -43,7 +43,7 @@ mca_mtl_ofi_la_SOURCES = $(mtl_ofi_sourc
 mca_mtl_ofi_la_LDFLAGS = \
         $(ompi_mtl_ofi_LDFLAGS) \
         -module -avoid-version
-mca_mtl_ofi_la_LIBADD = $(ompi_mtl_ofi_LIBS) \
+mca_mtl_ofi_la_LIBADD = $(ompi_mtl_ofi_LIBS) $(opal_common_libfabric_LIBS) \
         $(OPAL_TOP_BUILDDIR)/opal/mca/common/libfabric/lib@OPAL_LIB_PREFIX@mca_common_libfabric.la

 noinst_LTLIBRARIES = $(component_noinst)
@ggouaillardet
Contributor

@amckinstry thanks for the patch

@rhc54 @jsquyres that is an interesting one ...

We link libmca_common_libfabric.la with -lfabric and naively hope that libfabric.so ends up as a dependency.
As shown by ldd, that is true on CentOS 7, but not on Debian (I tested Ubuntu 14.04.3 LTS). If I understand correctly, the reason is that libmca_common_libfabric.la does not need libfabric.so at all (!)

Indeed, opal/mca/common/libfabric/common_libfabric.c only contains

int mca_common_libfabric_register_mca_variables(void)
{
    return OPAL_SUCCESS;
}

A possible fix is the suggested patch.
Another one is to really make libmca_common_libfabric.la depend on libfabric.so,
for example

int mca_common_libfabric_register_mca_variables(void)
{
    /* Calling fi_version() makes this library reference a libfabric symbol,
       so libfabric.so is actually needed at link time. */
    if (fi_version() >= FI_VERSION(1,0)) {
        return OPAL_SUCCESS;
    } else {
        return OPAL_ERROR;
    }
}

And another one is to simply remove opal/mca/common/libfabric (it does not do much today).

any thoughts ?
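
To make the `fi_version() >= FI_VERSION(1,0)` comparison above concrete, here is a small self-contained sketch of how the version is packed into one integer. The `DEMO_` macros are stand-ins written for this illustration and are assumed to mirror the `FI_VERSION`, `FI_MAJOR` and `FI_MINOR` macros from `<rdma/fabric.h>`; `demo_version_at_least` is a hypothetical helper, not Open MPI code.

```c
#include <stdint.h>

/* Stand-ins assumed to mirror FI_VERSION / FI_MAJOR / FI_MINOR from
 * <rdma/fabric.h>; the real header is not needed to build this sketch. */
#define DEMO_FI_VERSION(major, minor) ((uint32_t)(((major) << 16) | (minor)))
#define DEMO_FI_MAJOR(version) ((version) >> 16)
#define DEMO_FI_MINOR(version) ((version) & 0xFFFF)

/* Hypothetical helper: returns 1 if `runtime` (the value fi_version() would
 * report) is at least major.minor, else 0. The major number sits in the high
 * 16 bits, so a plain integer comparison orders versions correctly. */
int demo_version_at_least(uint32_t runtime, unsigned major, unsigned minor)
{
    return runtime >= DEMO_FI_VERSION(major, minor);
}
```

With these definitions, a runtime value of `DEMO_FI_VERSION(1, 7)` passes the `1.0` check, while `DEMO_FI_VERSION(0, 9)` does not.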

@rhc54
Contributor

rhc54 commented Dec 6, 2016

There was some logic behind that component, but I honestly don't recall. I'd just use your patch for now.

@jsquyres
Member

jsquyres commented Dec 6, 2016

@ggouaillardet Good call -- yes, calling fi_version() should do the trick.

@jsquyres jsquyres added the bug label Dec 6, 2016
@rashikakheria

I am able to reproduce this issue using OMPI v4.0.x on Ubuntu 16.04. I can see that libmpi.so does not list libfabric among its dynamic dependencies.

ubuntu@ip-10-0-1-177:~/aws-ofi-nccl$ ldd /home/ubuntu/mpi-install/lib/libmpi.so
	linux-vdso.so.1 =>  (0x00007ffcd41e9000)
	libopen-rte.so.40 => /home/ubuntu/mpi-install/lib/libopen-rte.so.40 (0x00007ff897ea5000)
	libopen-pal.so.40 => /home/ubuntu/mpi-install/lib/libopen-pal.so.40 (0x00007ff897b91000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007ff897989000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007ff897680000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007ff897463000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ff897099000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007ff896e95000)
	libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007ff896c92000)
	/lib64/ld-linux-x86-64.so.2 (0x00007ff898475000)

This causes basic MPI tests like ring to fail with the following error:

/home/ubuntu/ompi/examples/ring_c: symbol lookup error: /home/ubuntu/mpi-install/lib/openmpi/mca_mtl_ofi.so: undefined symbol: fi_dupinfo

Command used to configure OMPI build:

./configure --with-ofi=<absolute-path to libfabric installation>

Any suggestions on fixing this?

@ggouaillardet
Contributor

Sounds like the fix never landed in the repository! Will do tomorrow.

@jsquyres
Member

libmpi.so is not supposed to link against libfabric.so -- mca_mtl_ofi.so is supposed to link against libfabric.so.

Specifically, if you ldd $libdir/openmpi/mca_mtl_ofi.so, I think you'll see that it links against some libfabric.so on your system. It likely would not have linked/installed, otherwise.

I suspect that the issue you're seeing here is that you are linking against an older version of libfabric that does not have the fi_dupinfo() call. If that really is the case, we should ameliorate this in the Open MPI code somehow (i.e., have configure check libfabric for fi_dupinfo() and if it doesn't have it, either code around it or disqualify that installation of libfabric).

What version of libfabric are you linking against?

You should open a new issue for this -- this existing issue was a different problem that was already resolved / closed.

@rashikakheria

@jsquyres I checked the output of ldd mca_mtl_ofi.so and can see that it is not linked against libfabric:

ubuntu@ip-10-0-1-177:~/mpi-install/lib/openmpi$ ldd mca_mtl_ofi.so
	linux-vdso.so.1 =>  (0x00007ffd6fad2000)
	libmpi.so.40 => /home/ubuntu/mpi-install/lib/libmpi.so.40 (0x00007fbce6098000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fbce5e7b000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fbce5ab1000)
	libopen-rte.so.40 => /home/ubuntu/mpi-install/lib/libopen-rte.so.40 (0x00007fbce57fa000)
	libopen-pal.so.40 => /home/ubuntu/mpi-install/lib/libopen-pal.so.40 (0x00007fbce54f1000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fbce52e9000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fbce4fe0000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fbce659a000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fbce4ddc000)
	libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007fbce4bd9000)

I have confirmed in the past that my libfabric build does have a fi_dupinfo() call. See below:

ubuntu@ip-10-0-1-177:~$ objdump -T ofi-install/lib/libfabric.so | grep fi_dupinfo
00000000000121aa g    DF .text	00000000000004b3  FABRIC_1.2  fi_dupinfo
0000000000017b77 g    DF .text	000000000000046d (FABRIC_1.0) fi_dupinfo
0000000000018100 g    DF .text	000000000000007c (FABRIC_1.1) fi_dupinfo

Also, I am using OFI v1.7.x

I saw that this issue was still open and hence wanted to confirm if it was ever merged. I can open a new issue for this.
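
The `objdump -T` check above can also be done programmatically. The following is a minimal sketch, not part of Open MPI: `library_exports` is a hypothetical helper that uses the standard `dlopen`/`dlsym` calls to ask whether a shared object (or, with a NULL path, the running program and its dependencies) can resolve a symbol such as `fi_dupinfo`.

```c
#include <dlfcn.h>
#include <stddef.h>

/* Hypothetical helper (not Open MPI code).
 * Returns  1 if `symbol` can be resolved through `path`,
 *          0 if the object loads but the symbol is missing,
 *         -1 if the object itself cannot be loaded.
 * A NULL `path` asks about the running program and its dependencies. */
int library_exports(const char *path, const char *symbol)
{
    void *handle = dlopen(path, RTLD_LAZY | RTLD_LOCAL);
    if (handle == NULL) {
        return -1;
    }

    dlerror();                          /* clear any stale error state */
    (void) dlsym(handle, symbol);
    int found = (dlerror() == NULL);    /* NULL means the lookup succeeded */

    dlclose(handle);
    return found;
}
```

Because `dlsym` searches the opened object together with its recorded dependencies, calling this on an `mca_mtl_ofi.so` that lacks a libfabric dependency would be expected to return 0 for `fi_dupinfo`, while calling it on `libfabric.so` itself should return 1. On older glibc versions the program may need to be linked with `-ldl`.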

@jsquyres
Member

That is pretty weird to me -- I have no idea how you would have an mca_mtl_ofi.so that does not link against libfabric. For example, here's my build from 4.0.0:

$ ldd ~/bogus/lib/openmpi/mca_mtl_ofi.so | grep fabric
        libfabric.so.1 => /home/jsquyres/libfabric-1.6.1/install/lib/libfabric.so.1 (0x00002aaaab9bd000)

Can you open a new issue and include the stdout from your configure, your config.log, and the stdout from make? (you might need to paste those large files into a gist or something)

@ggouaillardet
Contributor

@jsquyres per my previous analysis, libfabric.so is pulled in indirectly via the common lib ... except on Ubuntu.

ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Jan 29, 2019
use $(opal_common_ofi_*) variables since these are the only
one defined (by opal/mca/common/ofi/configure.m4)

Refs. open-mpi#2519

Thanks Alastair McKinstry for the report and initial fix.
Thanks Rashika Kheria for the reminder.

Signed-off-by: Gilles Gouaillardet <[email protected]>
@matcabral matcabral self-assigned this Jan 29, 2019
@matcabral
Contributor

Hi @rashikakheria, would you please share the details of your setup? I have Ubuntu 16.04.5 LTS (in KVM) but cannot reproduce the issue you are seeing. I'm building from master and I have libfabric v1.7.0.

thanks,

macabral@ubuntu16:/tmp/ompi$ git log -1 |head -1
commit ea40d48

macabral@ubuntu16:/tmp/ompi$ head config.log |grep -e "./configure"
$ ./configure --with-ofi --prefix=/tmp/ompi-master-git

macabral@ubuntu16:/tmp$ ldd /tmp/ompi-master-git/lib/openmpi/mca_mtl_ofi.so
linux-vdso.so.1 => (0x00007ffd0c3ba000)
libmpi.so.0 => /tmp/ompi-master-git/lib/libmpi.so.0 (0x00007f446bc19000)
libopen-rte.so.0 => /tmp/ompi-master-git/lib/libopen-rte.so.0 (0x00007f446b960000)
libopen-pal.so.0 => /tmp/ompi-master-git/lib/libopen-pal.so.0 (0x00007f446b64d000)
libfabric.so.1 => /usr/local/lib/libfabric.so.1 (0x00007f446b3b8000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f446b19b000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f446add1000)
/lib64/ld-linux-x86-64.so.2 (0x00007f446c141000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f446abc9000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f446a8c0000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f446a6bc000)
libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007f446a4b9000)

macabral@ubuntu16:/tmp/mpi_helloworld$ mpirun -np 2 -mca pml cm -mca mtl ofi -mca mtl_ofi_provider_include sockets ./mpi_helloworld
Hello world from host ubuntu16 processor ubuntu16, rank 0 out of 2 processors
Hello world from host ubuntu16 processor ubuntu16, rank 1 out of 2 processors

@rashikakheria

@matcabral I tried the exact version you mentioned and still see the same issue

ubuntu@ip-10-0-1-51:~/ompi$ git log -1 |head -1
commit ea40d488993e3f1be8b7de943f4a751cfcfe37a6
ubuntu@ip-10-0-1-51:~/ompi$ head config.log |grep -e "./configure"
  $ ./configure --with-ofi --prefix=/home/ubuntu/ompi/install
ubuntu@ip-10-0-1-51:~/ompi$ ldd ./install/lib/openmpi/mca_mtl_ofi.so
	linux-vdso.so.1 =>  (0x00007ffd4cd68000)
	libmpi.so.0 => /home/ubuntu/ompi/install/lib/libmpi.so.0 (0x00007f14f1e0d000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f14f1bf0000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f14f1826000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f14f2335000)
	libopen-rte.so.0 => /home/ubuntu/ompi/install/lib/libopen-rte.so.0 (0x00007f14f156d000)
	libopen-pal.so.0 => /home/ubuntu/ompi/install/lib/libopen-pal.so.0 (0x00007f14f1259000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f14f1051000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f14f0d48000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f14f0b44000)
	libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007f14f0941000)

I am also using Ubuntu 16.04 LTS. Here is my kernel version:

ubuntu@ip-10-0-1-51:~/ompi$ uname -a
Linux ip-10-0-1-51 4.4.0-1072-aws #82-Ubuntu SMP Fri Nov 2 15:00:21 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Let's continue the discussion in a new issue, as requested by @jsquyres.

@jsquyres
Member

jsquyres commented Feb 6, 2019

This is quite odd. I see the following:

mca_mtl_ofi_la_LIBADD = $(top_builddir)/ompi/lib@OMPI_LIBMPI_NAME@.la \
$(ompi_mtl_ofi_LIBS) \
$(OPAL_TOP_BUILDDIR)/opal/mca/common/ofi/lib@OPAL_LIB_PREFIX@mca_common_ofi.la

which shows that mca_mtl_ofi.la should be linking against both $(ompi_mtl_ofi_LIBS) and libmca_common_ofi.la, and

lib@OPAL_LIB_PREFIX@mca_common_ofi_la_LIBADD = $(opal_common_ofi_LIBS)

which shows that libmca_common_ofi.la should be linking against $(opal_common_ofi_LIBS).

In my build:

  • $(ompi_mtl_ofi_LIBS) is empty
  • $(opal_common_ofi_LIBS) is -lfabric

And:

$ ldd mca_mtl_ofi.so
...
        libmca_common_ofi.so.0 => /home/jsquyres/bogus/lib/libmca_common_ofi.so.0 (0x00002aaaab958000)
        libfabric.so.1 => /home/jsquyres/libfabric-1.6.1/install/lib/libfabric.so.1 (0x00002aaaabb59000)
...

Showing that mca_mtl_ofi.so is linked against both libfabric and the OPAL common OFI library. And just for more fun:

$ ldd libmca_common_ofi.so.0
...
        libfabric.so.1 => /home/jsquyres/libfabric-1.6.1/install/lib/libfabric.so.1 (0x00002aaaaacaf000)
...

Showing that the OPAL common OFI library is linked against libfabric.

@jsquyres
Member

jsquyres commented Feb 6, 2019

Blarg. I just deleted my last comment because it was wrong. In both cases, Libtool inserted /path/to/libfabric.so in the linker line. So in both cases, libmca_common_ofi.so was properly linked against libfabric.so.

@jsquyres
Member

jsquyres commented Feb 6, 2019

Note that there was a second issue opened for a while and some discussion happened over there -- be sure to see #6360 for some additional content. We closed that issue and will continue the discussion here, just to keep it all together.

@jsquyres
Member

jsquyres commented Feb 6, 2019

@rashikakheria Can you do this in your build tree:

$ cd opal/mca/common/ofi
$ rm libmca_common_ofi.la
$ make V=1

and send the output?

@rashikakheria

Here is the output:

ubuntu@ip-10-0-1-51:~/ompi$ cd opal/mca/common/ofi
ubuntu@ip-10-0-1-51:~/ompi/opal/mca/common/ofi$ ls libmca_common_ofi.la
libmca_common_ofi.la
ubuntu@ip-10-0-1-51:~/ompi/opal/mca/common/ofi$ rm libmca_common_ofi.la
ubuntu@ip-10-0-1-51:~/ompi/opal/mca/common/ofi$ make V=1
/bin/bash ../../../../libtool  --tag=CC   --mode=link gcc  -O3 -DNDEBUG -Wall -Wundef -Wno-long-long -Wsign-compare -Wmissing-prototypes -Wstrict-prototypes -Wcomment -pedantic -Werror-implicit-function-declaration -finline-functions -fno-strict-aliasing -mcx16 -pthread  -version-info 0:0:0  -o libmca_common_ofi.la -rpath /home/ubuntu/ompi/install/lib  common_ofi.lo -lfabric  -lrt -lm -lutil
libtool: link: rm -fr  .libs/libmca_common_ofi.la .libs/libmca_common_ofi.lai .libs/libmca_common_ofi.so .libs/libmca_common_ofi.so.0 .libs/libmca_common_ofi.so.0.0.0
libtool: link: gcc -shared  -fPIC -DPIC  .libs/common_ofi.o   -lfabric -lrt -lm -lutil  -O3 -mcx16 -pthread   -pthread -Wl,-soname -Wl,libmca_common_ofi.so.0 -o .libs/libmca_common_ofi.so.0.0.0
libtool: link: (cd ".libs" && rm -f "libmca_common_ofi.so.0" && ln -s "libmca_common_ofi.so.0.0.0" "libmca_common_ofi.so.0")
libtool: link: (cd ".libs" && rm -f "libmca_common_ofi.so" && ln -s "libmca_common_ofi.so.0.0.0" "libmca_common_ofi.so")
libtool: link: ( cd ".libs" && rm -f "libmca_common_ofi.la" && ln -s "../libmca_common_ofi.la" "libmca_common_ofi.la" )
if test -z "libmca_common_ofi.la"; then \
  rm -f "libmca_common_ofi.la"; \
  ln -s "libmca_common_ofi_noinst.la" "libmca_common_ofi.la"; \
fi

@ggouaillardet
Contributor

the patch fixes the issue for me (up-to-date ubuntu xenial)

stock master does not depend on libfabric.so

gilles@ubuntu:~$ ldd local/ompi/lib/openmpi/mca_mtl_ofi.so 
	linux-vdso.so.1 =>  (0x00007fff3c5bb000)
	libmpi.so.0 => /home/gilles/local/ompi/lib/libmpi.so.0 (0x00007f4c0365d000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f4c03439000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f4c0306f000)
	/lib64/ld-linux-x86-64.so.2 (0x000055fe6ec89000)
	libopen-rte.so.0 => /home/gilles/local/ompi/lib/libopen-rte.so.0 (0x00007f4c02d4e000)
	libopen-pal.so.0 => /home/gilles/local/ompi/lib/libopen-pal.so.0 (0x00007f4c029e2000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f4c027da000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f4c024d0000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f4c022cc000)
	libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007f4c020c9000)

patched master depends on libfabric.so

gilles@ubuntu:~$ ldd local/ompi.6313/lib/openmpi/mca_mtl_ofi.so 
	linux-vdso.so.1 =>  (0x00007ffdfa1cc000)
	libmpi.so.0 => /home/gilles/local/ompi.6313/lib/libmpi.so.0 (0x00007fce92560000)
	libfabric.so.1 => /home/gilles/local/libfabric-1.7.0/lib/libfabric.so.1 (0x00007fce92293000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fce9206f000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fce91ca4000)
	/lib64/ld-linux-x86-64.so.2 (0x0000563654d0e000)
	libopen-rte.so.0 => /home/gilles/local/ompi.6313/lib/libopen-rte.so.0 (0x00007fce91984000)
	libopen-pal.so.0 => /home/gilles/local/ompi.6313/lib/libopen-pal.so.0 (0x00007fce91618000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fce9140f000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fce91106000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fce90f02000)
	libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007fce90cfe000)

here is the up-to-date patch

diff --git a/ompi/mca/mtl/ofi/Makefile.am b/ompi/mca/mtl/ofi/Makefile.am
index 2499f85..58c7ce2 100644
--- a/ompi/mca/mtl/ofi/Makefile.am
+++ b/ompi/mca/mtl/ofi/Makefile.am
@@ -5,6 +5,8 @@
 # Copyright (c) 2017      Los Alamos National Security, LLC.  All rights
 #                         reserved.
 # Copyright (c) 2017      IBM Corporation.  All rights reserved.
+# Copyright (c) 2019      Research Organization for Information Science
+#                         and Technology (RIST).  All rights reserved.
 # $COPYRIGHT$
 #
 # Additional copyrights may follow
@@ -18,7 +20,7 @@ EXTRA_DIST = post_configure.sh \
 MAINTAINERCLEANFILES = \
        $(generated_sources)
 
-AM_CPPFLAGS = $(ompi_mtl_ofi_CPPFLAGS) $(opal_common_ofi_CPPFLAGS)
+AM_CPPFLAGS = $(opal_common_ofi_CPPFLAGS)
 
 dist_ompidata_DATA = help-mtl-ofi.txt
 
@@ -55,7 +57,7 @@ mtl_ofi_sources = \
 # files should be added to generated_source_modules, as well as adding
 # their .c variants to generated_sources.
 %.c : %.pm;
-       $(PERL) generate-opt-funcs.pl $@
+       $(PERL) -I$(top_srcdir)/ompi/mca/mtl/ofi  $(top_srcdir)/ompi/mca/mtl/ofi/generate-opt-funcs.pl $@
 
 # Make the output library in this directory, and name it either
 # mca_<type>_<name>.la (for DSO builds) or libmca_<type>_<name>.la
@@ -73,15 +75,15 @@ mcacomponentdir = $(ompilibdir)
 mcacomponent_LTLIBRARIES = $(component_install)
 mca_mtl_ofi_la_SOURCES = $(mtl_ofi_sources)
 mca_mtl_ofi_la_LDFLAGS = \
-        $(ompi_mtl_ofi_LDFLAGS) \
+        $(opal_common_ofi_LDFLAGS) \
         -module -avoid-version
 mca_mtl_ofi_la_LIBADD = $(top_builddir)/ompi/lib@OMPI_LIBMPI_NAME@.la \
-       $(ompi_mtl_ofi_LIBS) \
+        $(opal_common_ofi_LIBS) \
         $(OPAL_TOP_BUILDDIR)/opal/mca/common/ofi/lib@OPAL_LIB_PREFIX@mca_common_ofi.la
 
 noinst_LTLIBRARIES = $(component_noinst)
 libmca_mtl_ofi_la_SOURCES = $(mtl_ofi_sources)
 libmca_mtl_ofi_la_LDFLAGS = \
-        $(ompi_mtl_ofi_LDFLAGS) \
+        $(opal_common_ofi_LDFLAGS) \
         -module -avoid-version
-libmca_mtl_ofi_la_LIBADD = $(ompi_mtl_ofi_LIBS)
+libmca_mtl_ofi_la_LIBADD = $(opal_common_ofi_LIBS)

mca_mtl_ofi.so currently relies on libmca_common_ofi.so to access libfabric.so.

common/ofi does not use any symbols from libfabric.so.
On Red Hat, libmca_common_ofi.so does depend on libfabric.so (because that is what we tell the linker to do). On Ubuntu we pass the same flags, but libtool/ld drops the libfabric.so dependency since no libfabric symbol is referenced, hence the failure.
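
One way to read the explanation above: the behaviour matches a linker that records a library as a dependency only when that library resolves at least one otherwise undefined symbol. GNU ld calls this mode `--as-needed`; that the Ubuntu toolchain applies it here is an assumption, since the thread does not name the flag. The toy model below is not linker code, and every name in it is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Toy model, not real linker code: a library and the symbols it exports. */
struct demo_lib {
    const char *name;
    const char *const *exports;   /* NULL-terminated list */
};

/* Returns 1 if `lib` exports `symbol`, else 0. */
static int demo_lib_has(const struct demo_lib *lib, const char *symbol)
{
    for (size_t i = 0; lib->exports[i] != NULL; i++) {
        if (strcmp(lib->exports[i], symbol) == 0) {
            return 1;
        }
    }
    return 0;
}

/* Models an "as-needed" link: the library is recorded as a dependency only
 * if it resolves at least one of the object's undefined symbols.
 * `undefined` is a NULL-terminated list. */
int demo_recorded_as_needed(const struct demo_lib *lib,
                            const char *const *undefined)
{
    for (size_t i = 0; undefined[i] != NULL; i++) {
        if (demo_lib_has(lib, undefined[i])) {
            return 1;
        }
    }
    return 0;
}
```

In this model, an object whose undefined symbols include no `fi_*` name (like common/ofi as described above) does not get libfabric recorded, while an object that references `fi_dupinfo` does. That is consistent with both proposed fixes: listing `$(opal_common_ofi_LIBS)` directly for `mca_mtl_ofi.la`, or calling `fi_version()` from the common library.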

@jsquyres
Member

jsquyres commented Feb 6, 2019

@ggouaillardet I think you just hit the nail on the head: common/ofi doesn't actually utilize any libfabric symbols.

This brings up (again) the idea that we should just delete common/ofi, because it never fulfilled its original purpose and just causes indirect problems like this. I think we talked about this in the last week or two on the weekly webex, but I don't think an issue was created for it.

EDIT: Correction -- we talked about this on the webex and I put a comment on #6313.

@ggouaillardet
Contributor

@jsquyres I will improve #6313 tomorrow (per Brian's comments)

@jsquyres
Member

jsquyres commented Feb 6, 2019

@ggouaillardet I've got a solution in the works. Should have a PR shortly.

jsquyres added a commit to jsquyres/ompi that referenced this issue Feb 6, 2019
It never lived up to its purpose (and has caused amorphous indirect
errors such as open-mpi#2519), so
delete it.

Signed-off-by: Jeff Squyres <[email protected]>
@jsquyres
Member

jsquyres commented Feb 6, 2019

Please see PR #6363, which is a rollup of all known outstanding OFI configure/linking issues.

jsquyres added a commit to jsquyres/ompi that referenced this issue Feb 7, 2019
It never lived up to its purpose (and has caused amorphous indirect
errors such as open-mpi#2519), so
delete it.

Signed-off-by: Jeff Squyres <[email protected]>
jsquyres added a commit to jsquyres/ompi that referenced this issue Feb 7, 2019
It never lived up to its purpose (and has caused amorphous indirect
errors such as open-mpi#2519), so
delete it.

Signed-off-by: Jeff Squyres <[email protected]>
(cherry picked from commit dd20174)
guserav added a commit to guserav/ompi that referenced this issue Aug 12, 2019
As discussed in open-mpi#2519 the common component does not depend
on libfabric yet. This commit introduces this dependency by just calling
fi_version().

Signed-off-by: guserav <[email protected]>
bosilca pushed a commit to bosilca/ompi that referenced this issue Dec 27, 2019
As discussed in open-mpi#2519 the common component does not depend
on libfabric yet. This commit introduces this dependency by just calling
fi_version().

Signed-off-by: guserav <[email protected]>
bwbarrett pushed a commit to bwbarrett/ompi that referenced this issue Jun 1, 2020
As discussed in open-mpi#2519 the common component does not depend
on libfabric yet. This commit introduces this dependency by just calling
fi_version().

Signed-off-by: guserav <[email protected]>
(cherry picked from commit 8a67a95)
Signed-off-by: Brian Barrett <[email protected]>
bwbarrett pushed a commit to bwbarrett/ompi that referenced this issue Jun 10, 2020
As discussed in open-mpi#2519 the common component does not depend
on libfabric yet. This commit introduces this dependency by just calling
fi_version().

Signed-off-by: guserav <[email protected]>
(cherry picked from commit 8a67a95)
Signed-off-by: Brian Barrett <[email protected]>
bwbarrett pushed a commit to bwbarrett/ompi that referenced this issue Jun 11, 2020
As discussed in open-mpi#2519 the common component does not depend
on libfabric yet. This commit introduces this dependency by just calling
fi_version().

Signed-off-by: guserav <[email protected]>
(cherry picked from commit 8a67a95)
Signed-off-by: Brian Barrett <[email protected]>
bwbarrett pushed a commit to bwbarrett/ompi that referenced this issue Jun 17, 2020
As discussed in open-mpi#2519 the common component does not depend
on libfabric yet. This commit introduces this dependency by just calling
fi_version().

Signed-off-by: guserav <[email protected]>
(cherry picked from commit 8a67a95)
Signed-off-by: Brian Barrett <[email protected]>
clrpackages pushed a commit to clearlinux-pkgs/openmpi that referenced this issue Apr 27, 2021
….1.1

Aboorva Devarajan (3):
      pml/ucx: fix zero sized datatype transfers
      pml/ob1: fix build issue in CUDA path
      ompi/group: fix proc pointer comparison in groups

Alex Anenkov (1):
      coll/libnbc: add recursive doubling algorithm for MPI_Iallreduce

Aravind Gopalakrishnan (7):
      MTL OFI: Ask for FI_THREAD_DOMAIN support when not using MPI_THREAD_MULTIPLE
      MTL/OFI: Add OFI Scalable Endpoint support
      Fix for SEP when num local procs is greater than available contexts
      mtl/ofi: Add MCA variables to enable SEP and to request number of OFI contexts
      mtl/ofi: Fix reference to help text object
      btl/ofi: Fix valgrind complaints on uninitialized pointer use
      mtl/ofi: Fix segfault when not using Thread-Grouping feature

Artem Polyakov (3):
      schizo/slurm: Disable binding in case of Slurm direct launch
      pmix/pmix3x: Fix internal PMIx discovery logic.
      pmix: Fix detection of Externally-built PMIx

Aurelien Bouteiller (1):
      Always return a valid error code from collective operations

Austen Lauria (7):
      Make a managed allocation filter a hostfile/hostlist.
      Fix bug where orte under a managed allocation does not honor -host.
      Make sure MPIR_Breakpoint() is compiled without CFLAGS.
      osc/rdma: Tighten up concurrent memory region access.
      Fix case where debuggers cannot read the MPIR proctable.
      Powerpc atomics: Force usage of powerpc assembly.
      Fix "variadic macros" warning.

Bert Wesarg (2):
      oshmem/mca/sshmem: Fix build with `--enable-mem-debug`
      fs/lustre: Remove unneeded includes

Brelle Emmanuel (1):
      Bull update of coll/han : added barrier, a 'simple' scatter, some Doxygen and some fixes

Brian Barrett (20):
      dist: Start v4.1.x release series
      Revert "Remove the OFI/BTL component"
      mtl/ofi: Fix crash if no providers found
      mtl/ofi: Print descriptive error message on modex failure
      mtl/ofi: Provide av count hint during initialization
      ofi: Call add_procs through PML
      dist: Add OFI backports to NEWS
      coll libnbc: Remove dead code
      dist: Add Collectives backports to NEWS
      dist: Move version to 4.1.0rc1
      dist: Update NEWS file for 4.1.0rc1
      dist: Update version to 4.1.0rc2
      dist: Update NEWS for 4.1.0
      dist: Update NEWS file from branches
      dist: Add NEWS items for recent commits in v4.1.x series
      dist: Bump version after releasing 4.1.0rc2
      opal: Remove outdated MacOS workaround
      opal: Disable memory patcher component on MacOS
      dist: Prep for 4.1.1rc3
      dist: Update VERSION and README for v4.1.1rc4

Charles Shereda (1):
      Fixed uninitialzed memory access bug in base64 encoding.

Christoph Niethammer (3):
      Accept UCX 1.8 in configure of btl/uct
      Fix memory leak in configure, which prevents leak sanitizer usage
      Fix error with stricter quoting requirements of autoconf-2.70

Devendar Bureddy (1):
      UCX: initialize cuda from ucx pml component

Dipti Kothari (1):
      mca/pml: PML check for direct modex

Edgar Gabriel (6):
      common/ompio: use avg. file view size in the aggregator selection logic
      ompio: resync v4.1 branch to master
      fbtl/posix: ensure progressing aio requests
      common_ompio_file_set_view: fix handling of  MPI_DISPLACEMENT_CURRENT
      fbtl_posix_progress: aio_return can indicate partial completion
      common_ompio_file_set_view: recognize negative disp in access

Geoffrey Paulsen (1):
      Adding SLURM binding policy change to README

George Bosilca (19):
      Remove few warnings in libnbc identified by clang-1000.11.45.2
      Use the unaligned SSE memory access primitive.
      Check unaligned ops for correctness.
      Fix the cacheline usage in the CUDA BTL.
      A complete overhaul of the HAN code.
      Fix partial packing of non data elements.
      Fix HAN issues reported by Coverity.
      A started generalized request should be marked as pending.
      Major update to the AVX* detection and support
      AVX code generation improvements
      A better test for MPI_OP performance.
      Always specify the target architecture for AVX
      Early selection of the best PML.
      Prevent the establishment of new BTL connections during matching
      A new binomial scatter using packed data on intermediary processes.
      Always include the stddef.h header.
      Reenable the heterogeneous support.
      Fixing the partial pack unpack issue.
      Fix the Makefile to include the correct test.

Gilles Gouaillardet (14):
      mtl/ofi: fix configury when VPATH is used
      coll/libnbc: fix NBC_Unpack()
      coll/cuda: remove unnecessary references to ORTE
      mpi/c: fix param checks in [I]Neighbor_alltoall{v,w}
      fortran.m4: reword error message when sizeof(int) != sizeof(INTEGER)
      configury: make build Reproducible
      op/avx: check for _mm512_mullo_epi64() AVX512 intrinsic
      coll/base: do not drop const qualifier
      configury: fix OPAL_GET_VERSION
      configury: fix typos
      autogen.pl: patch libtool.m4 for OSX Big Sur
      gcc_builtin: fix performance regression on x86_64
      ofi: fix typo in macro name
      atomic/gcc_builtin: only apply the workaround when required.

Goldman, Adam (2):
      mtl/ofi: Add mising cq_data_size in hints for ofi mtl
      mtl/ofi: Disable CUDA convertor for specified ofi providers

Harumi Kuno (6):
      Fix mca_btl_ofi_finalize clean-up logic
      Add comments about order of close ops
      set ep to NULL to avoid double close
      mtl_btl_ofi_rcache_init() before creating domain
      Fix language text for example
      Fix .so filenames

Howard Pritchard (7):
      RAS:ALPS add support for ANL Cobalt
      add a common ofi whitelist/blacklist
      ofi mtl: fix problem with mrecv
      suppress icc long double message
      OFI: patch OFI MTL for GNI provider
      add blurb about issue 7968 to the README
      OSC/RDMA: fix typo in btl selection logic

Jeff Squyres (43):
      mpi.h.in: fixups for static assert messages
      mpi.h.in: Remove //-style comments
      tests/asm/run_tests: fix basename usage
      .mailmap: Add entry for Harumi Kuno
      mtl/ofi/Makefile.am: down with tabs!
      btl/ofi/Makefile.am: down with tabs!
      mtl/ofi: add a .gitignore
      mtl/ofi: check for FI_LOCAL_COMM+FI_REMOTE_COMM
      ofi: revamp OPAL_CHECK_OFI configury
      libnbc: remove some stale/dead code
      common_ofi: fix preprocessor macro typo
      pmix3x: Remove --enable-install-libpmix option
      fortran.m4: disallow when sizeof(int) != sizeof(INTEGER)
      opal_get_version.m4: properly quote dir args
      configure: abort if dirs with spaces are used
      opal_functions.m4: remove redundant code
      configure.ac: Add workaround on MacOS for "readlink -f"
      getdate.sh: make the date(1) usage more portable
      coll/adapt and coll/han: fix trivial compiler warnings
      keyval_parse.c: ensure to init values
      keyval_parse.c: update whitespace/comments
      NEWS: More updates for v4.1.0
      config/Makefile.am: ensure getdate.sh is in dist tarball
      opal_functions.m4: add comment
      orterun.1in: fix minor mistake in :PE=2 example
      orterun.1in: define "slot" and "processor element"
      orterun.1in: add some markup
      Fix many compiler warnings
      VERSION: 4.1.0rc4
      coll/base: fix compiler warnings
      NEWS: OMPIO is now the default everywhere
      v4.1.0: README, VERSION, and LICENSE final updates
      VERSION: Onward to v4.1.1
      MPI_Init_thread(3): update refs about MPI_THREAD_MULTIPLE
      MPI_Init_thread(3): fix statement about C++ binding
      config: Stash known-good copies of config.guess|sub
      autogen: use newer config.sub|guess if available
      op_avx: use MCA enum flags instead of integer values
      op_avx: Fix MCA enum flags
      First cut at Git commit checks as Github Actions
      git-commit-check: fix typo
      git-commit-checker: require cherry picks on this branch
      git-commit-checks: use a better name

Joseph Schuchart (19):
      osc rdma: check for outstanding fragments before completing a request
      OSC UCX: make sure no-op fetch in rget/rput is properly aligned
      osc rdma: check for outstanding fragments before completing a request in ompi_osc_rdma_put_complete_flush as well
      osc/rdma: fail query_btls if no endpoint for non-local peer is found
      OPAL: fix string buffer allocation for large env variables
      coll/tuned: add hint about dynamic rules to mca parameters
      coll/tuned: Mark global static algorithm as const
      coll/tuned: don't select algorithms knowing when it's clear they would fall back to linear
      coll/tuned: fix minor errors in comments
      COLL TUNED: remove stray selection of linear algs for alreduce and allgather
      COLL TUNED: Use per-rank data size instead of total size for decision
      coll/base: Fix collective module selection preference treatment
      coll/[sm|han|adapt]: don't disqualify on priority 0
      coll/han: remove references to experimental solo and shared collective components
      coll/han: reduce default segment size for reduce/allreduce to 64k
      OSC RDMA: put memory of each process into separate pages
      OSC RDMA: only touch pages before memory registration, don't fill them
      coll/han: fix coll preference selection in mca_coll_han_comm_create_new
      Fix man page for MPI_Win_attach

Josh Hursey (12):
      Add detection for JSM direct launch
      v4.1.x: schizo/jsm: Disable binding when direct launched
      Fix cpu-list for non-uniform nodes
      Update Internal PMIx to OpenPMIx v3.2.1rc1
      Disable man pages for internal OpenPMIx
      v4.1.x: Update Internal PMIx to OpenPMIx v3.2.1
      Fix external PMIx v4.x check
      Fix --debug-daemons CLI option
      Remove the orte_static_ports rollup path
      Check for librt when building LSF support
      LSF Config: Cleanup logic
      Fix/Cleanup the return value documentation for mpirun

Leonid Genkin (1):
      Replace usage of the deprecated NB API of UCX with NBX

Mark Allen (3):
      noinline to avoid compiler reading TOC before PATCHER_BEGIN
      symbol pollution
      make Type_create_resized set FLAG_USER_UB

Matias A Cabral (2):
      MTL OFI: Add support for mem_tag_format
      MTL_OFI: Changed Recv cancel to be non-blocking

Michael Heinz (2):
      Add check for PSM2 reference counting to PSM2 MTL #7721
      Add minimum library version needed to use PSM2 in OMPI #7779

Mikhail Brinskii (2):
      COLL/TUNED: Add linear scatter using isend for mlnx platform
      SHMEM/SCOLL: Fix inplace reductions

Mikhail Kurnosov (11):
      coll/base/allgatherv: fix MPI_IN_PLACE processing
      coll/libnbc: add recursive doubling algorithm for MPI_Iscan
      coll/libnbc: add recursive doubling algorithm for MPI_Iexscan
      coll/libnbc: add Rabenseifner's algorithm for MPI_Ireduce
      coll/libnbc: add knomial tree algorithm for MPI_Ibcast
      coll/libnbc: add recursive doubling algorithm for MPI_Iallgather
      coll/libnbc: add Rabenseifner's algorithm for MPI_Iallreduce
      coll/libnbc/ireduce: silence Coverity warning CID 1440360
      coll/libnbc: remove debug output
      Fix a typo in parsing locality string: L0 changed to L1
      coll/base: reduce memory consumption in Scatter

NARIBAYASHI Akira (1):
      opal/util: Fix typo

Nathan Hjelm (7):
      osc/rdma: fix bug in attach for non-debug builds
      opal: disable the __atomic built-in atomics by default on AArch64
      osc/rdma: ensure bml add_procs has been called for all local procs
      osc/rdma: fix errors in derived datatype handling for accumulate
      osc/rdma: rearrange accumulate code
      osc/rdma: remove extra retain on fop
      osc/rdma: fix amo-based accumulate

Nikola Dancejic (5):
      common/ofi: Added multi-NIC support to provider selection
      common/ofi: Fixing compilation issue with ofi versions that do not support fi_info.nic
      v4.1.x: common/ofi: added address format check to fix provider selection
      Adding ofi include to CPPFLAGS so that configure is able to check fabric.h
      v4.1.x: Using package_rank to select between NIC of equal distance from the process.

Pak Lui (1):
      oshmem/tools/oshmem_info: fix an issue with fortran keyword when compiling param.c

Raghu Raja (8):
      mtl/ofi: Do not fail if error CQ is empty
      mtl/ofi: Fix erroneous FI_PEEK/FI_CLAIM usage
      mtl/ofi: Check cq_data_size without querying providers again
      VERSION: 4.1.0rc5
      common/ofi: Use opal_show_help() to call out lack of locality info
      NEWS and VERSION updates for 4.1.1rc1
      NEWS updates for v4.1.1rc2
      VERSION updates for v4.1.1rc2

Ralph Castain (12):
      Increment the vpid after assignment
      Correct computation of relative locality
      Correctly skip the "mpirun" node when launching orted on it
      Remove PMIx man page setup
      Fix the verbose output in ess base
      Update PMIx to v3.2.2
      Update Slurm launch support
      Adjust copyrights
      Let Slurm know that our daemons are not MPI tasks
      Update PMIx to v3.2.3
      Add the userid to the vader backing file path
      Retrieve cpuset when configured with pmix rte

Robert Wespetal (1):
      mtl/ofi: Add workaround for EFA local/remote capabilities bug

Sami Ilvonen (1):
      Add fence_nb to flux pmix

Sergey Oblomov (4):
      COMMON/UCX: improved missing events test
      PML/UCX: improved error processing in MPI_Recv
      SPML/UCX: removed direct dependency to SPML UCX
      OSHMEM/SEGMENT-REGISTRATION: added segment filtering

Spruit, Neil R (1):
      MTL_OFI: Generation of specialized functions at build time

Thananon Patinyasakdikul (2):
      btl/ofi: Added 2 side communication support.
      btl/ofi: fixed compiler warning on OSX.

Tim Wickberg (1):
      Revert "v4.1.x: Update Slurm launch support"

Todd Kordenbrock (2):
      Use the active PML to call add_procs()
      mtl-portals4: replace abort() with ompi_rte_abort()

Tomislav Janjusic (1):
      Coll/hcoll: adding scatterv interface

Valentin Petrov (4):
      coll/hcoll: reduce_scatter(block) interface
      coll/hcoll: compile warning fix
      coll/hcoll: scatterv inplace fix
      PML/UCX: don't do pml_check_selected call

Wei Zhang (4):
      oob/tcp: fix a race condition on stop_thread pipe
      [v4.1.x] ompi : add memory barrier in PMIx registration callback
      [v4.1.x] btl/ofi: fix memory leaks in error handling path
      [4.1.x] orte/orted: enable OPAL's mutli-thread support

William Zhang (8):
      coll/tuned: Fix typos
      coll/tuned: Add NULL check to prevent segfault
      coll/tuned: Change the default collective algorithm selection
      btl/ofi: Use common provider include/exclude list
      btl/ofi: Disable EFA provider in versions earlier than libfabric 1.12.0
      btl/ofi: Disable ofi_rxm provider
      coll/tuned: Revert RSB and RS default algorithms
      coll/tuned: Fix dynamic message size for gather and scatter

Xi Luo (2):
      Bring ADAPT collective to 4.1
      Initial import of the HAN collective module

Yossi Itigin (3):
      ucx: disable version 1.8
      ucx: check supported transports and devices for setting priority
      pml/ucx: ignore request leak by default, override by mca param

bsergentm (1):
      Coll/han Bull

dongzhong (1):
      Add supports for MPI_OP using AVX512, AVX2 and MMX

guserav (4):
      Revert "Remove opal/mca/common/ofi."
      common/ofi: Fix check for OFI in build files
      common/ofi: Fix open-mpi/ompi#2519
      common/ofi: Set HPE as owner of component

raafatfeki (2):
      fs/gpfs: Support of GPFS file system
      fs/ime & fbtl/ime: Support of IME file system

tomhers (1):
      BTL/OFI: Fix missing include file.

4.1.1 -- April, 2021
--------------------

- Fix a number of datatype issues, including an issue with
  improper handling of partial datatypes that could lead to
  an unexpected application failure.
- Change UCX PML to not warn about MPI_Request leaks during
  MPI_FINALIZE by default.  The old behavior can be restored with
  the mca_pml_ucx_request_leak_check MCA parameter.
- Reverted temporary solution that worked around launch issues in
  SLURM v20.11.{0,1,2}. SchedMD encourages users to avoid these
  versions and to upgrade to v20.11.3 or newer.
- Updated PMIx to v3.2.2.
- Fixed configuration issue on Apple Silicon observed with
  Homebrew. Thanks to François-Xavier Coudert for reporting the issue.

(NEWS truncated at 15 lines)