Update PMIx/PRRTE submodule pointers #11563

Closed
@lrbison lrbison commented Apr 5, 2023

Includes a fix for intermittent launch hangs on main

Includes a fix for intermittent launch hangs on main

Signed-off-by: Luke Robison <[email protected]>

lrbison commented Apr 5, 2023

Additional testing suggests that bumping the version does not fix the intermittent hangs. Closing this PR until I find the real cause.

@lrbison lrbison closed this Apr 5, 2023

lrbison commented Apr 6, 2023

Letting my tests run overnight suggests that updating the submodules does bring some improvement on the main branch. The following screenshot compares this PR (labeled "Forward", on top) against main as it stands now. Each dot represents a completed Open MPI run of a simple benchmark on 8 hosts with 64 ranks each, and each F represents a run that failed by hanging.

[image: screenshot comparing test runs of this PR ("Forward") against current main]

As you can see, the main branch as it stands now has many more failures than this PR. I will still try to find a stack trace for the remaining hangs.
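For readers who want to reproduce this kind of dot/F record, a repeat-run loop could be sketched roughly as below. This is not the script used for the runs above. The function names, the timeout values, and the stand-in commands are all hypothetical, and the sketch assumes that a run which exceeds the timeout counts as a hang. It also marks a non-zero exit as F, which is an extra assumption beyond the hang-only failures described above.

```python
import subprocess
import sys


def run_once(cmd, timeout_s):
    """Run cmd once. Return True if it exits with status 0 before the timeout."""
    try:
        result = subprocess.run(
            cmd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # Assumption: a run that exceeds the timeout is treated as a hang.
        return False
    # Assumption: any non-zero exit status is also recorded as a failure.
    return result.returncode == 0


def run_trials(cmd, trials, timeout_s):
    """Run cmd repeatedly and return one mark per run: '.' for success, 'F' for failure."""
    return "".join("." if run_once(cmd, timeout_s) else "F" for _ in range(trials))


# Stand-in commands so the sketch runs anywhere; a real test would pass the
# actual benchmark launch command here instead.
quick_cmd = [sys.executable, "-c", "pass"]
hanging_cmd = [sys.executable, "-c", "import time; time.sleep(30)"]

print(run_trials(quick_cmd, trials=3, timeout_s=30))
print(run_trials(hanging_cmd, trials=1, timeout_s=1))
```

Note that `subprocess.run` only kills the direct child process on timeout, so a real multi-host launch would likely need additional cleanup of leftover ranks between runs.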

Some statistics:

With this PR, I see only 71 failures in 2370 tests (3%).
Without this PR, I see 533 failures in 2370 tests (22%).
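
The percentages can be rechecked from the raw counts with a few lines of Python. This is only a sketch of the arithmetic; the helper name `failure_rate` is made up for illustration.

```python
def failure_rate(failures: int, total: int) -> float:
    """Return failures as a percentage of total runs."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * failures / total


with_pr = failure_rate(71, 2370)      # counts reported with this PR
without_pr = failure_rate(533, 2370)  # counts reported on current main

print(f"with this PR:    {with_pr:.1f}%")     # 3.0%
print(f"without this PR: {without_pr:.1f}%")  # 22.5%
print(f"ratio:           {without_pr / with_pr:.1f}x")  # 7.5x
```

In other words, current main hangs roughly 7.5 times as often as this PR in these runs.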

@lrbison lrbison reopened this Apr 6, 2023
@lrbison lrbison mentioned this pull request Apr 6, 2023

lrbison commented Apr 14, 2023

Progress on finding the real root cause is happening over at #11566. Closing this PR.

@lrbison lrbison closed this Apr 14, 2023