Skip to content

pr/fix-yield_when_idle #7154

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 751 commits into from
Closed

Conversation

wbailey2
Copy link
Contributor

Cherry-pick of mpi_yield_while_idle setting in v.4.0.x .

Issue #6433 notes that although yield_while_idle was already fixed it had not been pushed to v.4.0.x . To remedy this I cherry-picked the fix from user ggouaillardet and pushed it to v.4.0.x.

@jsquyres
Fixes #6433

xinzhao3 and others added 30 commits March 21, 2019 23:59
…s/delete oshmem_barrier in shmem_ctx_destroy

ompi/oshmem/spml/ucx: optimize spml ucx progress

Signed-off-by: Tomislav Janjusic <[email protected]>
(cherry picked from commit 9c3d00b)
Signed-off-by: Scott Miller <[email protected]>
(cherry picked from commit 6b294e0)
Describing Issue 6114 with v4.0.0 in README.
[skip ci]

Signed-off-by: Howard Pritchard <[email protected]>
Signed-off-by: Howard Pritchard <[email protected]>
Signed-off-by: Bert Wesarg <[email protected]>
(cherry picked from commit 18525ce)
See open-mpi#5798 (comment)
for a lengthy explanation

Signed-off-by: Ralph Castain <[email protected]>
(cherry picked from commit 57f6b94)
Functionality moved to PMIx

Signed-off-by: Ralph Castain <[email protected]>
(cherry picked from commit cfdd08d)
…esent in the stack starting from MPI_Init.

This is so when a debugger attaches using MPIR, it can step out of this stack back into main.
This cannot be done with certain aggressive optimisations and missing debug information.

Signed-off-by: James Clark <[email protected]>
Signed-off-by: Jeff Squyres <[email protected]>

Co-authored-by: Jeff Squyres <[email protected]>

(cherry-picked from 20f5840)
v4.0.x: shmem/fortran: Fix invalid datatype size in call to atomic cswap
…s-v4.0.x

v4.0.x: Oshmem multiple contexts
Add a compilation flag that adds unwind info to all files that are present in the stack starting from MPI_Init (v4.0.x cherry pick)
Signed-off-by: Ben Menadue <[email protected]>
(cherry picked from commit 063596b)
…tion

Fix use of bitwise operation in CPP condition (v4.0.x)
In fint_2_int.h there are some conversion macros for logicals. It has
one path for OMPI_SIZEOF_FORTRAN_LOGICAL != SIZEOF_INT where a new array
would be allocated and the conversions then might expand to
    c_array[i] = (array[i] == 0 ? 0 : 1)
and another path for OMPI_SIZEOF_FORTRAN_LOGICAL == SIZEOF_INT where it
does things "in place", so the same conversion there would just be
    array[i] = (array[i] == 0 ? 0 : 1)

The problem is some of the logical arrays being converted are INPUT
arguments. And it's possible for some compilers to even put the argument
in read-only memory so the above "in place" conversion SEGV's.  A
testcase I have used
    call MPI_CART_SUB(oldcomm, (/.true.,.false./), newcomm, ierr)
and gfortran put the second arg in read-only mem.

In cart_sub_f.c you can trace the ompi_fortran_logical_t *remain_dims arg.
remain_dims[] is for input only, but the file uses
    OMPI_LOGICAL_ARRAY_NAME_DECL(remain_dims);
    OMPI_ARRAY_LOGICAL_2_INT(remain_dims, ndims);
    PMPI_Cart_sub(..., OMPI_LOGICAL_ARRAY_NAME_CONVERT(remain_dims), ...);
    OMPI_ARRAY_INT_2_LOGICAL(remain_dims, ndims);
to convert it to c-ints make a C call then restore it to Fortran logicals
before returning.

It's not always wrong to convert purely in-place, eg cart_get_f.c has
a periods[] that's exclusively for OUTPUT and it would be fine with the
macros as they were. But I still say the macros are invalid because they
don't distinguish whether they're being used on INPUT or OUTPUT args and
thus they can't be used in a way that's legal for both cases.

It might be possible to fix the macros by adding more of them so that
cart_create_f.c and cart_get_f.c would use different macros that give
more context. But my fix here is just to turn off the first block and
make all paths run as if OMPI_SIZEOF_FORTRAN_LOGICAL != SIZEOF_INT.

The main macros that get enlarged by this change are
    define OMPI_ARRAY_LOGICAL_2_INT_ALLOC : mallocs now
    define OMPI_ARRAY_LOGICAL_2_INT : also mallocs now
But these are only used in 4 places, three of which are the purpose of
this checkin, to avoid the former in-place expansion of an INPUT arg:
    cart_create_f.c
    cart_map_f.c
    cart_sub_f.c
and one of which is an OUPUT arg that was fine and that gets
unnecessarily expanded into a separate array by this checkin.
    cart_get_f.c

So I think an unnecessary malloc in cart_get_f.c is the only downside
to this change, where the logicals array argument could have been used
and converted in place.

Signed-off-by: Mark Allen <[email protected]>

Update provided by Gilles Gouaillardet to keep the in-place option
if OMPI_FORTRAN_VALUE_TRUE == 1 where no conversion is needed.

Signed-off-by: Gilles Gouaillardet <[email protected]>
(cherry picked from commit 0a7f1e3)
…text

v4.0.x: add missing #include to oshmem/shmem/c/shmem_context.c.
v4.0.x: scoll/fca: add missing argument to call to original broadcast
v4.0.x: Cleanup race condition in finalize that leads to incomplete vader cleanup
We missed an assert to check if ALLOW_OVERTAKE is set or not before
validating the sequence number and this will cause deadlock.

Signed-off-by: Thananon Patinyasakdikul <[email protected]>
(cherry picked from commit 0263456)
v4.0.x: pml/ob1: fix deadlock with communicator flag ALLOW_OVERTAKE.
      The new proc group is created from the "world_group" based on the
      ranks mapping which can be directly taken from proc_name->vpid.

Signed-off-by: Valentin Petrov <[email protected]>
V4.0.x Fixes the O(N^2) loop in the mca_scoll_mpi_comm_query
…ull_addr_handling

OSC/UCX: correctly handle NULL origin addr and MPI_NO_OP
In the case the btl_get fails Ob1 tries to fallback on btl_put first but
the return code was ignored. So the code fell back on both btl_put and
btl_send.

Signed-off-by: Brelle Emmanuel <[email protected]>
(cherry picked from commit 9c689f2)
gpaulsen and others added 27 commits September 24, 2019 12:26
…ompilation-v4.0

IKRIT: restored compilation - v4.0
one off patch for v4.0.x.  for some reason commit on master
didn't have this problem.

Signed-off-by: Howard Pritchard <[email protected]>
mtl/ofi: replace OMPI_UNLIKELY with OPAL version
Remove code for multiple OOB progress threads as it is an optimization
nobody uses. Also turns out to have a race condition that can cause
segfault on finalize, so maybe good that nobody is using it.

Signed-off-by: Ralph Castain <[email protected]>
(cherry picked from commit 41eb41c)
(cherry picked from commit a2f35c1)
v4.0.x: Cleanup stale code in ORTE/OOB
…letion.

 * The user can set `-mca odls_base_sigkill_timeout 30` to have ORTE wait
   30 seconds before sending SIGTERM then another 30 seconds before sending
   SIGKILL to remaining processes. This usually happens on an abnormal
   termination. Sometimes the user wants to delay the cleanup to give the
   system time to write out corefile or run other diagnostics.
 * The problem is that child processes may be completing while ORTE is
   in this loop. The SIGCHLD will interrupt the `sleep` system call.
   Without the loop the sleep could effectively be ignored in this case.
   - Sleep returns the amount of time remaining to sleep. If it was
     interrupted by a signal then it is a positive number less than or
     equal to the parameter passed to it. If it slept the whole time
     then it returns 0.

Signed-off-by: Joshua Hursey <[email protected]>
(cherry picked from commit 0e8a97c)
Rename "get_nsec()" to "get_ticks()" to more accurately reflect that
this function has no correlation to wall clock time at all.

Signed-off-by: Jeff Squyres <[email protected]>
(cherry picked from commit ce2910a)
New MCA parameter: btl_usnic_ack_iteration_delay.  Set this to the
number of times through the usNIC component progress function before
sending a standalone ACK (vs. piggy-backing the ACK on any other send
going to the target peer).

Use "ticks" language to clarify that we're really counting the number
of times through the usNIC component DATA_CHANNEL completion check (to
check for incoming messages) -- it has no relation to wall clock time
whatsoever.

Also slightly change the channel-checking scheme in usNIC component
progress: only check the PRIORITY channel once (vs. checking it once,
not finding anything, and then falling through the progress_2() where we
check PRIORITY again and then check the DATA channel).

As before, if our "progress" libevent fires, increment the tick
counter enough to guarantee that all endpoints that need an ACK will
get triggered to send standalone ACKs the next time through progress,
if necessary.

Signed-off-by: Jeff Squyres <[email protected]>
(cherry picked from commit 968b1a5)
Significantly increase the default retrans timeout.  If the
retrans timeout is too soon, we can end up in a retransmission storm
where the logic will continually re-transmit the same frames during a
single run through the usNIC progress function (because the timer for
a single frame expires before we have run through re-transmitting all
the frames pending re-transmission).

Signed-off-by: Jeff Squyres <[email protected]>
(cherry picked from commit 3cc95d8)
New MCA param: btl_usnic_max_resends_per_iteration.  This is the max
number of resends we'll do in a single pass through usNIC component
progress.  This prevents progress from getting stuck in an endless
loop of retransmissions (i.e., if more retransmissions are triggered
during the sending of retransmissions).  Specifically: we need to
leave the resend loop to allow receives to happen (which may ACK
messages we have sent previously, and therefore cause pending resends
to be moot).

Signed-off-by: Jeff Squyres <[email protected]>
(cherry picked from commit 27e3040)
Move the prefix area from the head to the body in relevant size
computations.  This fixes a problem in high traffic situations where
usNIC may have sent from unregistered memory.

Signed-off-by: Jeff Squyres <[email protected]>
(cherry picked from commit fe7f772)
[skip ci]

Signed-off-by: Howard Pritchard <[email protected]>
…_v402

NEWS: update for the v4.0.2 release
It was previously accidentally set to 0.

Signed-off-by: Jeff Squyres <[email protected]>
(cherry picked from commit 132e4ca)
Signed-off-by: Jeff Squyres <[email protected]>
(cherry picked from commit 3080033)
…and-optimizations

v4.0.x: usnic fixes and optimizations
Change the ncounts argument to MPI_Count and use
MPI_Status_set_elements_x for enabling read/write operations beyond
the 2GB limit.

Thanks to  Richard Warren from the HDF5 group for reporting the issue
and providing the suggested fix for romio.

Signed-off-by: Edgar Gabriel <[email protected]>
(cherry picked from commit 8a3abbf)
individual read/write operations exceeding 2GB fail in ompio
due to improper conversions from size_t to int in two different
locations. This commit fixes an issue reported by Richard Warren
from the HDF5 group.

Fixes Issue open-mpi#7045

Cherry-picked from commit a130f56

Signed-off-by: Edgar Gabriel <[email protected]>
v4.0.x:Fix the sigkill timeout sleep to prevent SIGCHLD from preventing completion
Signed-off-by: Joseph Schuchart <[email protected]>
(cherry picked from commit 37e6bbb)
Signed-off-by: Scott Miller <[email protected]>
(cherry picked from commit c1b8599)

Conflicts:
	orte/mca/plm/rsh/plm_rsh_module.c
plm/rsh: Add chdir option to change directory before orted exec
…-bug

comomn_ompio_file_read/write: fix 2GB limiting issue
…-status-set-elements-fix

MPIR_Status_set_bytes: fix for large count sizes
Ensure that grequestx continuously make progress (v4.0.x)
in schizo/ompi, sets the new OMPI_MCA_mpi_oversubscribe environment
variable according to the node oversubscription state.

This MCA parameter is used to set the default value of the
mpi_yield_when_idle parameter.

This two steps tango is needed so the mpi_yield_when_idle setting
is always honored when set in a config file.

Refs. open-mpi#6433

Signed-off-by: Gilles Gouaillardet <[email protected]>
Issue open-mpi#6433 notes that although yield_when_idle was fixed it had not
actually been pushed to v.4.0.x. To remedy this I cherry picked the
fix from user ggouaillardet and pushed it to v.4.0.x.

Signed-off-by: William Bailey <[email protected]>
@ompiteam-bot
Copy link

Can one of the admins verify this patch?

@ggouaillardet
Copy link
Contributor

Please re-issue this PR, and make sure it is against the v4.0.x branch and not the master branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

mpi_yield_when_idle setting from etc/openmpi-mca-params.conf is ignored in 4.0.0