v5.0.x accelerator/cuda: Add delayed initialization logic #11296

Closed
wants to merge 1,690 commits into from

wckzhang
Contributor

Backport of PR #11253.

wckzhang and others added 30 commits October 11, 2022 21:54
Signed-off-by: William Zhang <[email protected]>
(cherry picked from commit f0580fd)
Instead of dlopening cuda, add direct dependency on libcuda.
This also means we can remove the dlopen dependency.

Signed-off-by: William Zhang <[email protected]>
(cherry picked from commit 26e244c)
Signed-off-by: William Zhang <[email protected]>
(cherry picked from commit 8ed9056)
Signed-off-by: William Zhang <[email protected]>
(cherry picked from commit d5ba0a3)
Signed-off-by: George Bosilca <[email protected]>
(cherry picked from commit d8c1471)
Signed-off-by: William Zhang <[email protected]>
Many compilers (tested with gcc and clang) treat memcpy and memmove
as intrinsic functions. They also lack proper syntactic matching,
which prevents the use of any intrinsic name as a member of a
structure. Use different names for the two members of the accelerator
framework that handle memory copies and moves.

Fixes open-mpi#10869.

Signed-off-by: George Bosilca <[email protected]>
(cherry picked from commit 8be95d7)
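
A minimal sketch of the renaming idea described in this commit message. The C standard allows library functions such as memcpy to also be provided as function-like macros, so a call written as `module->memcpy(...)` can be rewritten by the preprocessor before the compiler sees the member access. Giving the members their own names avoids the clash. The struct, the member names `mem_copy` and `mem_move`, and the host helpers below are hypothetical and only illustrate the pattern; they are not taken from the Open MPI sources.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical module table, for illustration only. The members are
 * deliberately NOT named memcpy/memmove so that no macro or builtin
 * with those names can interfere with the member access. */
typedef struct {
    int (*mem_copy)(void *dest, const void *src, size_t size);
    int (*mem_move)(void *dest, const void *src, size_t size);
} sketch_accel_module_t;

/* Host-only stand-ins for the copy and move operations. */
static int sketch_host_mem_copy(void *dest, const void *src, size_t size)
{
    assert(dest != NULL && src != NULL);
    memcpy(dest, src, size);   /* regions must not overlap */
    return 0;
}

static int sketch_host_mem_move(void *dest, const void *src, size_t size)
{
    assert(dest != NULL && src != NULL);
    memmove(dest, src, size);  /* regions may overlap */
    return 0;
}

static const sketch_accel_module_t sketch_host_module = {
    .mem_copy = sketch_host_mem_copy,
    .mem_move = sketch_host_mem_move,
};
```

A caller would then write `sketch_host_module.mem_copy(dst, src, n)`, which cannot be mistaken for a call to the library memcpy.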
Signed-off-by: William Zhang <[email protected]>
This PR removes the function table from the rocm component (and hence
the dlopen functionality), as well as the lock used during initialization
and shutdown. Some minor changes to the configure and Makefile logic are
also required.

Signed-off-by: Edgar Gabriel <[email protected]>
(cherry picked from commit 86bc10a)
Previously did not use the right offset and caused
data validation issues.

Signed-off-by: William Zhang <[email protected]>
(cherry picked from commit 36a35fb)
The selected component was not properly skipped due
to using the wrong pointer for the skip parameter.

Also changed to using mca_base_framework_components_close

Signed-off-by: William Zhang <[email protected]>
(cherry picked from commit 0bbe734)
Previously did not use the right offset, same as
36a35fb

Signed-off-by: William Zhang <[email protected]>
(cherry picked from commit 13bcfab)
Added updated documentation for the dso type cuda
support and the updated ofi mtl support.

Signed-off-by: William Zhang <[email protected]>
(cherry picked from commit f914632)
Implement MPI_COMM_TYPE_HW_UNGUIDED for MPI_Comm_split_type for V5.0.x
Make opal/ompi symbols used in only one file 'static'

Fix review comments

Signed-off-by: David Wootton <[email protected]>
(cherry picked from commit 8ebf0d2)
Signed-off-by: Joseph Schuchart <[email protected]>
(cherry picked from commit 77e502b)
…d-v5.0.x

[v5.0.x] Fix compilation of x86-64-asm based atomic backend
Signed-off-by: Boris Karasev <[email protected]>
Co-authored-by: Sergey Oblomov <[email protected]>
(cherry picked from commit 8362a2d)
George Katevenis had two names in git logs.  Add a mailmap entry
to force use of his full name.

Signed-off-by: Brian Barrett <[email protected]>
(cherry picked from commit ce5d507)
I had a slightly different name for commits from my personal email account,
resulting in two AUTHORS entries.  Update the name so that they are merged
into one entry.

Signed-off-by: Brian Barrett <[email protected]>
(cherry picked from commit d427ab0)
Signed-off-by: Edgar Gabriel <[email protected]>
(cherry picked from commit 9bdaf3e)
Add some formulaic text to the MPIX man pages:

* Indicated that these functions are only present if the corresponding
  extension was built
* Described the available preprocessor macros
* Added a link to the Open MPI Extensions section
* Fixed string errors in the example code
* Used proper #if conditionals in the example
* Added a See Also section

Signed-off-by: Jeff Squyres <[email protected]>
(cherry picked from commit cc976e7)
Signed-off-by: George Katevenis <[email protected]>
(cherry picked from commit 9e13c2a)
…_warn

v5.0.x: ucx/pml: show warning if already unsupported UCX version is used
…ge-updates

v5.0.x: docs: Minor updates to MPIX man pages
v5.0.x: Initialize opal/smsc outside of btl/sm, to enable its use without it
Signed-off-by: David Wootton <[email protected]>
(cherry picked from commit 3fcad0e)
Fix 1 byte overlay in comm_method_string: Coverity CID 1515829
Under normal circumstances epoll and poll produce similar performance on Linux.
When busy polling is enabled they do not. Testing with a TCP-based system shows
a significant performance degradation when using poll with busy waiting enabled.
This performance regression is not seen when using epoll. This PR adjusts the
default value of opal_event_include to epoll on Linux only to fix the
regression.

Fixes open-mpi#10929

Signed-off-by: Nathan Hjelm <[email protected]>
(cherry picked from commit 279f6b6)
On architectures that store long doubles as 80-bit extended precision
or as 64-bit "float64"s, we need conversions to 128-bit quad precision to
satisfy MPI_Pack_external/Unpack_external.  I added a couple more
arguments to pFunction to know what architecture the 'to' and 'from'
buffers are.  Previously we had architecture info 'local' and 'remote'
but I don't know how to correlate local/remote with to/from without
adding more arguments as I did.

With the increased information about the context, the conversion function
can now convert the long double as needed.

I'm using code Lisandro Dalcin contributed for the floating point
conversions in f80_to_f128, f64_to_f128, f128_to_f80, and f128_to_f64.
These conversion functions require the data to be in local endianness,
but one of the sides in pack/unpack is always local so operations can
be done in an order that allows the long double conversion to see the
data in local endianness.

I also added a path that uses __float128 for the conversion when
HAVE___FLOAT128 is defined, as that ought to be a more reliable
method than rolling our own bitwise conversions.

The reason for all the arch.h changes is the former code was
inconsistent as to how bits were labeled within a byte, and had
masks like LONGISxx that didn't match the bits they were supposed
to contain.

Signed-off-by: Mark Allen <[email protected]>
(cherry picked from commit 308a94e)
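
To make the bitwise conversion mentioned above more concrete, here is a rough sketch of widening a 64-bit double into an IEEE 754 binary128 bit pattern. It is not the contributed f64_to_f128 code from this commit. The type `sketch_f128_bits_t` and the function `sketch_f64_to_f128_bits` are hypothetical names, and the sketch assumes that `double` is IEEE 754 binary64 on the local machine. The result is returned as two logical 64-bit halves, so byte order in a packed buffer is left to the caller.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical holder for a binary128 bit pattern.
 * hi: sign (1 bit), exponent (15 bits), top 48 fraction bits.
 * lo: remaining 64 fraction bits. */
typedef struct {
    uint64_t hi;
    uint64_t lo;
} sketch_f128_bits_t;

static sketch_f128_bits_t sketch_f64_to_f128_bits(double value)
{
    uint64_t bits;
    memcpy(&bits, &value, sizeof bits);  /* assumes IEEE 754 binary64 */

    uint64_t sign = bits >> 63;
    uint64_t exp  = (bits >> 52) & 0x7FF;
    uint64_t frac = bits & 0xFFFFFFFFFFFFFULL;  /* low 52 bits */
    uint64_t exp128;

    if (exp == 0x7FF) {
        exp128 = 0x7FFF;                 /* infinity or NaN: keep fraction */
    } else if (exp != 0) {
        exp128 = exp - 1023 + 16383;     /* normal: rebias the exponent */
    } else if (frac == 0) {
        exp128 = 0;                      /* signed zero */
    } else {
        /* Subnormal double: it is a normal number in binary128, so find
         * the leading 1 bit, drop it, and left-align what remains. */
        assert(frac != 0);
        int top = 51;
        while (((frac >> top) & 1) == 0) {
            top--;
        }
        exp128 = (uint64_t)(top - 1074 + 16383);
        frac = (frac ^ (1ULL << top)) << (52 - top);
    }

    sketch_f128_bits_t out;
    /* The 52 fraction bits become the top 52 of the 112-bit fraction. */
    out.hi = (sign << 63) | (exp128 << 48) | (frac >> 4);
    out.lo = frac << 60;
    return out;
}
```

Widening never loses precision, so no rounding is needed in this direction. The narrowing conversions (f128_to_f64, f128_to_f80) would need rounding and overflow handling that this sketch does not cover.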
OpenPMIx commits since last update:

b4a55542 - Update NEWS
2b92a6af - Handle session-info in the gds/hash component
4dd99584 - Handle app-info in the gds/hash component
59c8b8c3 - Update NEWS
7b0bb406 - Stop-in-init applies to all procs in a job
8c4cdd37 - Cleanup some store/retrieve issues
b1a65392 - Update EXCEPTIONS
2fb902e5 - Provide a little more useful error output
3147fba1 - Add some debug macros for tracking key values
8ebc45fc - PMIX_OBJ_STATIC_INIT: fixed initialization
ca350205 - Roll to rc2
31362d74 - Enhance the performance of the var_scope_push/pop script
4685b607 - pnet/nvd: Fix macro escaping issue
92fbde60 - llvm/oneapi: fixes to bring pmix up to iso c99
f1171cf5 - Fix some memory leaks and cleanup macro defns
fed0ad14 - Plug a memory leak

PRRTe commits since last update:

a3e81f2efb - Update NEWS
34735ca44a - Pickup missing changes
c49aa76728 - Reduce debugger confusion
3fde1e53cc - misc unused var cleanups
69b0570e8a - remove unused vars and fix rc/ret typo
0041d2278c - more unused vars
f9049d514a - unused var
df47bc6dea - squash warnings
135452cd11 - remove some unused vars
428a51cc6a - Cleanup grpcomm cruft
cc402aa402 - Support query of pset membership
08c03741ba - plm/tm: Fix build breakage
0387c18b7b - Fix memory leaks in RML and at job termination.
a001245cfa - Update Open MPI mpirun help text
1a25f6602f - Change --stop-in-* to take optional arguments.
799d7fd769 - schizo/ompi: Fix --report-pid/sid.
59b4d6bb81 - alps fixes for mca move
ecb4d2d125 - Allow prterun to act as prun
d5d47c8ea6 - Catch some more component updates
b2302a09dd - ras/lsf: Fix build breakage
4349e72a6f - Fix a typo and expand debugger example range to cover MPI
1d2bfabb81 - Fix mapping by pe-list when oversubscribed
6944a64068 - Push launch-agent CLI into the env
b79d6b0a03 - Fix print statement
1b850dd64f - Actually support the output-proctable option
96096dc428 - Fix some memory leaks during resource mapping.
989a73cc9b - Fix some memory leaks during resource mapping.
c7691c7e82 - Update NEWS
7aa528613a - Fix --preload-binary.
111d2baddd - schizo/ompi: Fix --use-hwthread-cpus option.
87fb5c670a - Plug a memory leak
220b7e80a1 - BuildRequires: gcc
5f591bf93b - Complete help text on notifications

Signed-off-by: Austen Lauria <[email protected]>
David Wootton and others added 28 commits December 19, 2022 09:23
Coverity CID 1498717

Signed-off-by: David Wootton <[email protected]>
(cherry picked from commit 64a8d74)
v5: Fix uninitialized pointer in mca_smpl_ucx_register
v5: Fix memory leak in dpm_convert (dpm.c)
v5: Fixing missing lock release in mca_pml_ob1_record_htod_event
v5: Fix missing lock release in oshmem_proc_group_create
v5: Fix memory leak in mca_coll_han_init_dynamic_rules: Coverity CID 1516452
v5: Fix invalid access after free in do_recv: Coverity CID 1517308
You cannot daemonize the "prte" executable when spawning it
to support a singleton as that will cause things to hang. Also
fix IO forwarding thru the singleton for the spawned child
procs by correcting a mistake that caused the IOF request
attributes to be overlooked when constructing the job info for
the PMIx_Spawn call.

Includes an update to the PMIx and PRRTE submodule pointers to
pick up a couple of relevant corrections there. See:

openpmix/prrte#1621
openpmix/openpmix#2881

This brings the submodule pointers to the HEAD of their respective
release branches, which are basically at an rc1 level (but not
tagged yet).

Signed-off-by: Ralph Castain <[email protected]>
(cherry picked from commit 8ca4d7c)
Coverity CID 1458001

Signed-off-by: David Wootton <[email protected]>
(cherry picked from commit 8a798fd)
Fix a segfault when operating with a login node
that has a different topology than the compute
nodes.

Signed-off-by: Ralph Castain <[email protected]>
…nagement Layer. I have corrected it.

Signed-off-by: zhuodong <[email protected]>
(cherry picked from commit 19b515f)
The net provider is an enhanced version of tcp provider, therefore
should also be excluded.

Signed-off-by: Wei Zhang <[email protected]>
(cherry picked from commit d7ef0d4)
v5: Fix memory leak in mca_btl_tcp_proc_handle_modex_addresses
v5.0.x: Increment the PMIx/PRRTE submodule pointers
Signed-off-by: Mamzi Bayatpour  <[email protected]>
(cherry picked from commit a12aa2f)
which sets the LD_LIBRARY_PATH to point to a system pmix
which is too old for the prte used by main and v5.0.x.

Signed-off-by: Howard Pritchard <[email protected]>
(cherry picked from commit fdaa901)
…-read-write-v5.0

common/ompio: implement pipelined read and write operation
…ap-v5.0

fs/lustre: fix assignment of info objects to lustre args
…r-fix-v5.0

accelerator/rocm: fix check_addr function
…ovider

[v5.0.x] opal/common/ofi: add net to provider exclude list
…_module_v50x

LANL/CI: workaround for aocc module
application

Signed-off-by: Mamzi Bayatpour  <[email protected]>
Co-authored-by: Tomislav Janjusic <[email protected]>
(cherry picked from commit 076fca7)
…evel-v5

v5.0.x OSC/UCX: avoid creating ucp context if the application does not have MPI-RMA
v5.0.x: pml/ucx: move pmix finalize to the end of ompi_rte_finalize()
The current implementation requires the application to
do cudaInit before calling MPI_Init. Added delayed
initialization logic to wait as long as possible
before creating resources requiring a cuContext.

Signed-off-by: William Zhang <[email protected]>
(cherry picked from commit b751060)
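
The delayed initialization idea can be sketched roughly as follows. This is an assumption-laden illustration, not the code from this PR: every name is hypothetical, the boolean `sketch_context_ready` stands in for "the application has created a cuContext", and no CUDA calls are made. The sketch is also single-threaded; a real component would need a lock around the check.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state, for illustration only. */
static bool sketch_context_ready = false;  /* stand-in for "a cuContext exists" */
static bool sketch_initialized   = false;
static int  sketch_init_calls    = 0;      /* counts how often setup really ran */

/* Setup that can only succeed once the context exists. */
static int sketch_create_resources(void)
{
    assert(sketch_context_ready);
    sketch_init_calls++;
    return 0;
}

/* Called at the start of every operation that needs the resources.
 * Returns 0 once initialized, or -1 if initialization must be retried
 * later because the context is not available yet. */
static int sketch_delayed_init(void)
{
    if (sketch_initialized) {
        return 0;                  /* fast path: already set up */
    }
    if (!sketch_context_ready) {
        return -1;                 /* too early: try again on the next call */
    }
    if (sketch_create_resources() != 0) {
        return -1;
    }
    sketch_initialized = true;
    return 0;
}
```

With this shape, the MPI_Init-time code path does nothing that needs the context, and the first operation issued after the application has set up its context triggers the real setup exactly once.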
Signed-off-by: William Zhang <[email protected]>
(cherry picked from commit 48ae44b)