v4.0.x:Fix the sigkill timeout sleep to prevent SIGCHLD from preventing completion #7033

Merged 1 commit on Oct 22, 2019

jjhursey (Member) commented Oct 2, 2019

  • The user can set -mca odls_base_sigkill_timeout 30 to have ORTE wait
    30 seconds before sending SIGTERM, then another 30 seconds before
    sending SIGKILL to the remaining processes. This usually happens on
    abnormal termination. Sometimes the user wants to delay the cleanup to
    give the system time to write out core files or run other diagnostics.
  • The problem is that child processes may be completing while ORTE is
    waiting out this timeout. Each SIGCHLD interrupts the sleep call, so
    unless the sleep is retried in a loop, the timeout could effectively be
    ignored.
    • sleep returns the amount of time remaining to sleep. If it was
      interrupted by a signal, then it is a positive number less than or
      equal to the parameter passed to it. If it slept the whole time,
      then it returns 0.
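
The retry pattern described above can be sketched in C as follows. This is a minimal illustration, not the actual ORTE change: the names wait_full_timeout, sleep_fn_t, and fake_interrupted_sleep are hypothetical, and the sleep function is passed in as a parameter only so the loop can be exercised without really waiting.

```c
#include <unistd.h>

/* Hypothetical type: any function with the same shape as POSIX sleep(). */
typedef unsigned int (*sleep_fn_t)(unsigned int);

/*
 * Hypothetical helper (not the ORTE code): keep sleeping until the whole
 * timeout has elapsed. sleep() returns the unslept seconds when a signal
 * such as SIGCHLD interrupts it, so that value is fed back in until the
 * call reports 0. Returns how many sleep calls were needed.
 */
static unsigned int wait_full_timeout(unsigned int seconds, sleep_fn_t sleep_fn)
{
    unsigned int calls = 0;
    unsigned int remaining = seconds;

    while (remaining > 0) {
        remaining = sleep_fn(remaining);
        calls++;
    }
    return calls;
}

/*
 * Hypothetical stand-in for sleep(): pretends a signal arrives after one
 * second of every call, so it always reports one second less than asked.
 */
static unsigned int fake_interrupted_sleep(unsigned int seconds)
{
    return seconds > 0 ? seconds - 1 : 0;
}
```

In real use the second argument would simply be sleep, for example wait_full_timeout(30, sleep). A single unguarded sleep(30) call, by contrast, returns as soon as the first signal handler runs.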

Signed-off-by: Joshua Hursey <[email protected]>
(cherry picked from commit 0e8a97c)
jjhursey added this to the v4.0.2 milestone Oct 2, 2019
jjhursey requested a review from rhc54 October 2, 2019 19:51
rhc54 changed the title from "Fix the sigkill timeout sleep to prevent SIGCHLD from preventing completion" to "v4.0.x:Fix the sigkill timeout sleep to prevent SIGCHLD from preventing completion" Oct 2, 2019
gpaulsen modified the milestones: v4.0.2, v4.0.3 Oct 3, 2019
gpaulsen (Member) commented Oct 3, 2019

Punting until v4.0.3.

gpaulsen merged commit 3dba9ec into open-mpi:v4.0.x Oct 22, 2019
jjhursey deleted the v4-fix-sigkill-wait branch October 22, 2019 18:00

3 participants