CUDA build: make all fails with undefined references on master and v5.0.x #8656
Comments
Looks the same as #8736.
@mwheinz also reported that this is an issue on v5.0.x, not just master.
I was able to reproduce #8736 with duplicate symbols; looking into it.
This issue appeared sometime after my patch series. I'm doing a bisect to pin down which commit is the cause.
It appears to have started from:
I'm not sure I understand it; I'll have to think about it.
Try compiling with
I think I have the gist of what happened. I'll try a static build from before that patch.
I reverted my move of opal_datatype_cuda -> common_cuda and it compiled fine. I'm fairly sure this segment of code is at fault, since the build succeeds without it. I'm not sure why I added it in the first place:
Thanks for working on this @wckzhang
The first time I built ompi from the above, I noticed that libraries from a previous build were left behind (including libmca*), so I erased existing libraries with this process:
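The exact cleanup commands are not shown here. As a cautious sketch only, leftover libraries under an install prefix could first be listed (not deleted) for review. The helper name `list_stale_libs` is hypothetical, and the `libmpi*` and `libopen-*` patterns are assumptions added alongside the `libmca*` libraries mentioned above; the `/usr/local` default matches the `--prefix` used in this report.

```shell
#!/bin/sh
# Hypothetical helper: list (never delete) Open MPI libraries that an
# earlier install may have left directly under PREFIX/lib.
list_stale_libs() {
    prefix="$1"
    # Nothing to report if the prefix has no lib directory.
    [ -d "$prefix/lib" ] || return 0
    # Name patterns other than libmca* are assumptions; adjust as needed.
    find "$prefix/lib" -maxdepth 1 \
        \( -name 'libmca*' -o -name 'libmpi*' -o -name 'libopen-*' \) \
        -print | sort
}

# Dry run against the prefix used in this report (override with PREFIX=...).
list_stale_libs "${PREFIX:-/usr/local}"
```

Reviewing the list before removing anything avoids deleting unrelated libraries that happen to live under the same prefix. Plugin directories below `lib` are not searched by this sketch.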
Then I built ompi using the attached sh script.
hvd_tensorflow2_synthetic_benchmark.py.txt
GitHub PR hack for future reference :) When testing PRs, you can replace this:
with this from within a clone of the OMPI repo:
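The commands referred to here are not reproduced. As a rough sketch of one well-known approach (not necessarily the one posted), GitHub exposes every pull request under the ref `pull/<number>/head`, so a PR can be fetched into a local branch from an existing clone. The PR number `1234`, the remote name `origin`, the `pr-<number>` branch name, and the helper `pr_refspec` are all placeholders chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper: build the refspec that maps a GitHub pull request
# onto a local branch named pr-<number>.
pr_refspec() {
    pr="$1"
    echo "pull/${pr}/head:pr-${pr}"
}

PR=1234   # placeholder; substitute the PR under test

# Printed rather than executed, so the sketch runs without network access.
echo "git fetch origin $(pr_refspec "$PR")"
echo "git checkout pr-$PR"
```

Running the two printed commands inside a clone of the repository leaves the PR checked out on a local branch, ready to configure and build.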
Thanks @rajachan, the method you posted for testing PRs is much easier than what I did. I'll definitely use it next time :)
You can also use https://hub.github.com/ -- I use it all the time to check out PRs. It does all the magic git commands you need behind the scenes. E.g.:
Thanks @jsquyres. I'll give 'hub' a try :)
If this is still an issue, please reopen with a comment.
Thank you for taking the time to submit an issue!
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
branch: master
hash: d18d3f6
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
(For machine 1: 256 threads, machine 2: 36 threads, machine 3: 12 threads)
git clone --recursive https://github.com/open-mpi/ompi.git -j 256
cd ompi
export AUTOMAKE_JOBS=256
./autogen.pl
./configure --disable-picky --prefix=/usr/local --with-cuda=/usr/local/cuda-11.2 --with-ucx=/usr/local/ucx
make -j 256 all
---> ERROR
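The steps above hard-code 256 jobs, while the three machines have 256, 36, and 12 threads. A minimal sketch, assuming a POSIX shell with `nproc` or `getconf` available, derives the job count per machine instead. The helper name `detect_jobs` is hypothetical, and the build commands are only printed, not run.

```shell
#!/bin/sh
# Hypothetical helper: pick a parallel job count from the CPU count.
detect_jobs() {
    # Try nproc (GNU coreutils), then getconf, then fall back to 1.
    n=$(nproc 2>/dev/null) ||
        n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) ||
        n=1
    echo "$n"
}

JOBS=$(detect_jobs)

# Print the commands that would be run with the detected job count.
echo "export AUTOMAKE_JOBS=$JOBS"
echo "make -j $JOBS all"
```

Using the same script on all three machines keeps the reproduction steps identical while matching the parallelism to each host.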
If you are building/installing from a git clone, please copy-n-paste the output from
git submodule status
.7145774 3rd-party/openpmix (v1.1.3-2852-g7145774e)
284d15d7b9be51c07ae3a3964b1567fde1a106e2 3rd-party/prrte (dev-31005-g284d15d7b9)
Please describe the system on which you are running
I have tried this on bare metal on 3 machines, and all 3 showed the same error.
All machines are set up with the same software:
Ubuntu 20.10
gcc-10
CUDA 11.2 Update 2
nv_peer_memory built from latest source
gdrcopy built from latest source
ucx built from latest source
mlnx_ofed - latest version
Details of the problem
Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.
Note: If you include verbatim output (or a code block), please use a GitHub Markdown code block like below: