CUDA build: make all fails with undefined references on master and v5.0.x #8656

Closed
dbonner opened this issue Mar 19, 2021 · 17 comments

@dbonner

dbonner commented Mar 19, 2021

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

branch: master
hash: d18d3f6

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

(For machine 1: 256 threads, machine 2: 36 threads, machine 3: 12 threads)
git clone --recursive https://github.com/open-mpi/ompi.git -j 256
cd ompi
export AUTOMAKE_JOBS=256
./autogen.pl
./configure --disable-picky --prefix=/usr/local --with-cuda=/usr/local/cuda-11.2 --with-ucx=/usr/local/ucx
make -j 256 all
---> ERROR

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

7145774 3rd-party/openpmix (v1.1.3-2852-g7145774e)
284d15d7b9be51c07ae3a3964b1567fde1a106e2 3rd-party/prrte (dev-31005-g284d15d7b9)

Please describe the system on which you are running

  • Operating system/version:
  • Computer hardware:
  • Network type:

I have tried this on 3 machines' bare metal and all 3 machines showed the same error:

  1. Dual AMD Epyc 7742, 8 x Nvidia A-100 40Gig
  2. Intel i9-10980XE, Nvidia 2080 Ti
  3. Intel i7-9750H, Nvidia 2080 MaxQ

All machines are set up with the same software:

Ubuntu 20.10
gcc-10
Cuda 11.2 update 2
nv_peer_memory built from latest source
gdrcopy built from latest source
ucx built from latest source
mlnx_ofed - latest version

Details of the problem

Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.

Note: If you include verbatim output (or a code block), please use a GitHub Markdown code block like below:

shell$ mpirun -np 2 ./hello_world
shell$ make -j 256 install
make[2]: Entering directory '/home/daniel/ompi/opal/tools/wrappers'
  CC       opal_wrapper.o
  CCLD     opal_wrapper
/usr/bin/ld: /usr/local/lib/libmca_common_cuda.so.0: undefined reference to `opal_cuda_add_initialization_function'
/usr/bin/ld: ../../../opal/.libs/libopen-pal.so: undefined reference to `opal_cuda_memmove'
/usr/bin/ld: ../../../opal/.libs/libopen-pal.so: undefined reference to `opal_cuda_memcpy'
/usr/bin/ld: ../../../opal/.libs/libopen-pal.so: undefined reference to `mca_cuda_convertor_init'
/usr/bin/ld: ../../../opal/.libs/libopen-pal.so: undefined reference to `opal_cuda_check_bufs'
/usr/bin/ld: ../../../opal/.libs/libopen-pal.so: undefined reference to `opal_cuda_memcpy_sync'
collect2: error: ld returned 1 exit status
make[2]: *** [Makefile:1443: opal_wrapper] Error 1
make[2]: Leaving directory '/home/daniel/ompi/opal/tools/wrappers'
make[1]: *** [Makefile:1868: all-recursive] Error 1
make[1]: Leaving directory '/home/daniel/ompi/opal'
make: *** [Makefile:1437: all-recursive] Error 1
Command exited with non-zero status 2
104.88user 28.62system 0:50.56elapsed 264%CPU (0avgtext+0avgdata 22904maxresident)k
3608inputs+327112outputs (0major+7489985minor)pagefaults 0swaps
@jsquyres
Member

FYI @Akshay-Venkatesh

@awlauria
Contributor

awlauria commented Apr 1, 2021

Looks the same as #8736

@awlauria changed the title from "make all fails with undefined references" to "CUDA build: make all fails with undefined references" on Apr 5, 2021
@awlauria
Contributor

awlauria commented Apr 5, 2021

@mwheinz also reported this is an issue on v5.0.x, not just master.

@awlauria changed the title from "CUDA build: make all fails with undefined references" to "CUDA build: make all fails with undefined references on master and v5.0.x" on Apr 5, 2021
@wckzhang
Contributor

wckzhang commented Apr 6, 2021

I was able to reproduce #8736 with duplicate symbols, looking into it

@wckzhang
Contributor

wckzhang commented Apr 6, 2021

This issue appeared sometime after my patch series, doing a bisect to see if I can pin which commit is the cause.

@wckzhang
Contributor

wckzhang commented Apr 6, 2021

It appears to have started from:

commit 856a2b7f6f6f3380c7617114d9007fa00e631095 (HEAD)
Merge: ff1ba016d6 930260cb45
Author: Brian Barrett <[email protected]>
Date:   Thu Mar 25 07:17:07 2021 -0700

    Merge pull request #8132 from bwbarrett/feature/3rdparty-packaging
    
    Change MCA component build style default to static

Not sure I understand it, I'll have to think about it

@bwbarrett
Member

Try compiling with --enable-mca-static --disable-mca-dso and I bet the issue was there before that patch :)
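
A minimal sketch of the configure invocation this suggestion implies, reusing the options from the original report. The two MCA flags are the ones named in the comment above; the `ompi_configure_args` helper is hypothetical, and the prefix and CUDA/UCX paths are those from the report and may differ on another machine.

```shell
#!/bin/sh
# Hypothetical helper: print the configure options from the original report
# plus the two flags suggested above for forcing a static MCA component build.
ompi_configure_args() {
    printf '%s\n' \
        --disable-picky \
        --prefix=/usr/local \
        --with-cuda=/usr/local/cuda-11.2 \
        --with-ucx=/usr/local/ucx \
        --enable-mca-static \
        --disable-mca-dso
}

# Usage, from the top of an Open MPI checkout after ./autogen.pl:
#   ./configure $(ompi_configure_args) && make all
```

If the same undefined references appear with these flags on a commit before the merge of #8132, that would suggest the problem predates the change of default build style.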

@wckzhang
Contributor

wckzhang commented Apr 6, 2021

I think I have the gist of what happened. I'll try the static build before that patch.

@wckzhang
Contributor

wckzhang commented Apr 8, 2021

So I reverted my move of opal_datatype_cuda -> common_cuda and it compiled fine. I'm pretty sure this segment of code is at fault, since it compiled without it. I'm not sure why I added it in the first place:

if OPAL_cuda_support
lib@OPAL_LIB_PREFIX@open_pal_la_LIBADD += \
        mca/common/cuda/libmca_common_cuda.la
lib@OPAL_LIB_PREFIX@open_pal_la_DEPENDENCIES += \
        mca/common/cuda/libmca_common_cuda.la
endif

@wckzhang
Contributor

wckzhang commented Apr 8, 2021

This is probably a different issue from #8736; I'm not sure why they were called out as duplicates. I don't think they'll be fixed by the same patch, but it's worth a try. @dbonner, can you see if this issue still occurs with #8788?

@dbonner
Author

dbonner commented Apr 8, 2021

Thanks for working on this @wckzhang
I believe that #8788 does fix the problem with my machine/setup.
I built ompi successfully with the following steps:

git clone --recursive https://github.com/wckzhang/ompi.git
cd ompi
git checkout remotes/origin/compile
git checkout --recurse-submodules c81cdd76897499fb42099ef784fb2dfd86cc9f06     # I set the repo to the time of your commit. I'm not sure if I needed to do this.

The first time I built ompi from the above, I noticed that libraries from a previous build were left behind (including libmca*), so I erased existing libraries with this process:

sudo make -j 256 uninstall
sudo rm -rf /usr/local/lib/openmpi
sudo rm -rf /usr/local/lib/prte
sudo rm -rf /usr/local/lib/pmix
sudo rm /usr/local/lib/ompi*
sudo rm /usr/local/lib/libmca*
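
A hedged sketch for reviewing such leftovers before deleting anything. The original link error names `/usr/local/lib/libmca_common_cuda.so.0`, an installed copy rather than one in the build tree, which may indicate that the linker picked up a stale library of this kind. The `list_stale_ompi_files` helper is hypothetical; its patterns simply mirror the `rm` commands above.

```shell
#!/bin/sh
# Hypothetical helper: list Open MPI-related leftovers under an install prefix
# so they can be reviewed before anything is removed.
list_stale_ompi_files() {
    prefix=${1:-/usr/local}
    for path in \
        "$prefix"/lib/openmpi \
        "$prefix"/lib/prte \
        "$prefix"/lib/pmix \
        "$prefix"/lib/ompi* \
        "$prefix"/lib/libmca*
    do
        # An unmatched glob stays literal, so keep only paths that exist.
        if [ -e "$path" ]; then
            printf '%s\n' "$path"
        fi
    done
}

# Usage: list_stale_ompi_files /usr/local
```

The function only prints paths; removing them remains a separate, deliberate step.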

Then I built ompi using the attached sh script.
I then tested that openmpi worked with horovod/tensorflow, and it did :) using the attached python script with this command:

mpirun \
    -np 8 -H localhost:8 \
    --bind-to none --map-by slot \
    -x NCCL_DEBUG=INFO \
    -x LD_LIBRARY_PATH \
    -x PATH \
    -x NCCL_TREE_THRESHOLD=0 \
    -x RDMAV_FORK_SAFE=1 \
    --mca btl tcp,self \
    --mca btl_tcp_if_exclude lo,docker0 \
    python /home/daniel/localgpu/hvd_bps_bench/hvd_tensorflow2_synthetic_benchmark.py --fp16-allreduce

hvd_tensorflow2_synthetic_benchmark.py.txt
ompi-amd.sh.txt
Much appreciated :)
Daniel

@rajachan
Member

rajachan commented Apr 8, 2021

GitHub PR hack for future reference :) When testing PRs, you can replace this:

git clone --recursive https://github.com/wckzhang/ompi.git
cd ompi
git checkout remotes/origin/compile
git checkout --recurse-submodules c81cdd76897499fb42099ef784fb2dfd86cc9f06

with this from within a clone of the OMPI repo:

git fetch origin pull/8788/head:$local_branch_name
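
A minimal sketch of the full fetch-and-switch sequence, assuming the clone's `origin` remote points at the OMPI repository on GitHub. The `checkout_pr` helper and the default `pr-<number>` branch name are hypothetical; the PR number in the usage line is #8788, the PR referenced earlier in this thread.

```shell
#!/bin/sh
# Hypothetical helper: fetch a GitHub pull request into a local branch and
# switch to it. Run from inside a clone whose "origin" remote is the OMPI repo.
checkout_pr() {
    pr_number=$1
    local_branch_name=${2:-pr-$pr_number}
    # GitHub exposes each pull request's head as refs/pull/<number>/head.
    git fetch origin "pull/$pr_number/head:$local_branch_name" &&
        git checkout "$local_branch_name"
}

# Usage: checkout_pr 8788
```

Because OMPI uses submodules, running `git submodule update --init --recursive` after the checkout is probably also needed to bring them in line with the PR.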

@dbonner
Author

dbonner commented Apr 9, 2021

Thanks @rajachan the method you posted for testing PRs is much easier than what I did. Definitely will use the method you posted next time :)

@jsquyres
Member

jsquyres commented Apr 9, 2021

You can also use https://hub.github.com/ -- I use it all the time to check out PRs. It does all the magic git commands you need behind the scenes. E.g.:

cd path/to/your/ompi/clone
hub checkout https://github.com/open-mpi/ompi/pull/8788

@dbonner
Author

dbonner commented Apr 10, 2021

Thanks @jsquyres . I'll give 'hub' a try :)

@awlauria
Contributor

Can this issue be closed?

master: #8788
v5.0.x: #8809

@gpaulsen
Member

If this is still an issue, please reopen with a comment.
