v4.0.x: Fix the sigkill timeout sleep to prevent SIGCHLD from preventing completion #7033

Merged 1 commit on Oct 22, 2019
28 changes: 24 additions & 4 deletions orte/mca/odls/base/odls_base_default_fns.c
@@ -1767,7 +1767,7 @@ int orte_odls_base_default_kill_local_procs(opal_pointer_array_t *procs,
orte_proc_t *child;
opal_list_t procs_killed;
orte_proc_t *proc, proctmp;
-int i, j;
+int i, j, ret;
opal_pointer_array_t procarray, *procptr;
bool do_cleanup;
orte_odls_quick_caddy_t *cd;
@@ -1913,7 +1913,17 @@ int orte_odls_base_default_kill_local_procs(opal_pointer_array_t *procs,
/* if we are issuing signals, then we need to wait a little
* and send the next in sequence */
if (0 < opal_list_get_size(&procs_killed)) {
-sleep(orte_odls_globals.timeout_before_sigkill);
+/* Wait a little. Do so in a loop since sleep() can be interrupted by a
+ * signal. Most likely SIGCHLD in this case */
+ret = orte_odls_globals.timeout_before_sigkill;
+while( ret > 0 ) {
+OPAL_OUTPUT_VERBOSE((5, orte_odls_base_framework.framework_output,
+"%s Sleep %d sec (total = %d)",
+ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
+ret, orte_odls_globals.timeout_before_sigkill));
+ret = sleep(ret);
+}
+
/* issue a SIGTERM to all */
OPAL_LIST_FOREACH(cd, &procs_killed, orte_odls_quick_caddy_t) {
OPAL_OUTPUT_VERBOSE((5, orte_odls_base_framework.framework_output,
@@ -1922,8 +1932,18 @@ int orte_odls_base_default_kill_local_procs(opal_pointer_array_t *procs,
ORTE_NAME_PRINT(&cd->child->name)));
kill_local(cd->child->pid, SIGTERM);
}
-/* wait a little again */
-sleep(orte_odls_globals.timeout_before_sigkill);
+
+/* Wait a little. Do so in a loop since sleep() can be interrupted by a
+ * signal. Most likely SIGCHLD in this case */
+ret = orte_odls_globals.timeout_before_sigkill;
+while( ret > 0 ) {
+OPAL_OUTPUT_VERBOSE((5, orte_odls_base_framework.framework_output,
+"%s Sleep %d sec (total = %d)",
+ORTE_NAME_PRINT(ORTE_PROC_MY_NAME),
+ret, orte_odls_globals.timeout_before_sigkill));
+ret = sleep(ret);
+}
+
/* issue a SIGKILL to all */
OPAL_LIST_FOREACH(cd, &procs_killed, orte_odls_quick_caddy_t) {
OPAL_OUTPUT_VERBOSE((5, orte_odls_base_framework.framework_output,
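
For readers who want to try the pattern outside of ORTE, the retry loop in this change can be sketched as a small standalone C unit. This is only an illustration under stated assumptions: the names sleep_full, demo_interrupted_sleep, and on_sigchld are hypothetical and are not part of Open MPI, and the 200 ms child delay is an arbitrary choice to make SIGCHLD arrive while the parent is sleeping.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical no-op handler. With a handler installed, SIGCHLD
 * interrupts sleep() instead of being discarded by default. */
static void on_sigchld(int signo) { (void)signo; }

/* Hypothetical helper mirroring the loop in the diff: keep calling
 * sleep() with the unslept remainder until it reaches zero.
 * Returns how many times sleep() came back early. */
int sleep_full(unsigned int seconds)
{
    int interruptions = 0;
    unsigned int remaining = seconds;

    while (remaining > 0) {
        remaining = sleep(remaining);   /* 0 means the time elapsed */
        if (remaining > 0) {
            interruptions++;
        }
    }
    return interruptions;
}

/* Hypothetical demo: fork a child that exits after about 200 ms so
 * SIGCHLD lands mid-sleep, then report the elapsed wall time (in
 * seconds) of sleep_full() in the parent. */
double demo_interrupted_sleep(unsigned int seconds)
{
    struct sigaction sa, old;
    struct timespec start, end;
    struct timespec child_delay = {0, 200L * 1000L * 1000L};
    pid_t pid;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGCHLD, &sa, &old);

    clock_gettime(CLOCK_MONOTONIC, &start);
    pid = fork();
    if (pid == 0) {
        nanosleep(&child_delay, NULL);
        _exit(0);                       /* parent receives SIGCHLD */
    }

    sleep_full(seconds);
    clock_gettime(CLOCK_MONOTONIC, &end);

    if (pid > 0) {
        waitpid(pid, NULL, 0);          /* reap the child */
    }
    sigaction(SIGCHLD, &old, NULL);     /* restore previous disposition */

    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Calling demo_interrupted_sleep(2) should take roughly two seconds, whereas a single bare sleep(2) in the same situation may return after about 200 ms. Because sleep() reports the unslept time in whole seconds, the total wait is only approximately the requested duration.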