Skip to content

v3.0.x: Fix oob_tcp tcp_component_close segfault with active listeners #7012

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Sep 26, 2019

Conversation

jsquyres
Copy link
Member

oob_tcp in non-HNP mode shares libevent event_base with oob_base [1].
orte_oob_base_close calls:
(1) oob_tcp component_shutdown, then
(2) opal_progress_thread_finalize, then
(3) oob_tcp tcp_component_close [2].
opal_progress_thread_finalize calls tracker_destructor [3] that frees the
event_base [4]. If any oob_tcp event listeners are active at this time, oob_tcp
will crash trying to delete them at [5] [6].

This change moves oob_tcp event listener cleanup from component_close to
component_shutdown so that it happens before the event_base is freed.

[1] https://github.com/open-mpi/ompi/blob/v4.0.1/orte/mca/oob/tcp/oob_tcp_listener.c#L160
[2] https://github.com/open-mpi/ompi/blob/v4.0.1/orte/mca/oob/base/oob_base_frame.c#L95
[3] https://github.com/open-mpi/ompi/blob/v4.0.1/opal/runtime/opal_progress_threads.c#L232
[4] https://github.com/open-mpi/ompi/blob/v4.0.1/opal/runtime/opal_progress_threads.c#L65
[5] https://github.com/open-mpi/ompi/blob/v4.0.1/orte/mca/oob/tcp/oob_tcp_component.c#L192
[6] https://github.com/open-mpi/ompi/blob/v4.0.1/orte/mca/oob/tcp/oob_tcp_listener.c#L955

Signed-off-by: Orivej Desh [email protected]
(cherry picked from commit 78b7e34)

oob_tcp in non-HNP mode shares libevent event_base with oob_base [1].
orte_oob_base_close calls:
(1) oob_tcp component_shutdown, then
(2) opal_progress_thread_finalize, then
(3) oob_tcp tcp_component_close [2].
opal_progress_thread_finalize calls tracker_destructor [3] that frees the
event_base [4]. If any oob_tcp event listeners are active at this time, oob_tcp
will crash trying to delete them at [5] [6].

This change moves oob_tcp event listener cleanup from component_close to
component_shutdown so that it happens before the event_base is freed.

[1] https://github.com/open-mpi/ompi/blob/v4.0.1/orte/mca/oob/tcp/oob_tcp_listener.c#L160
[2] https://github.com/open-mpi/ompi/blob/v4.0.1/orte/mca/oob/base/oob_base_frame.c#L95
[3] https://github.com/open-mpi/ompi/blob/v4.0.1/opal/runtime/opal_progress_threads.c#L232
[4] https://github.com/open-mpi/ompi/blob/v4.0.1/opal/runtime/opal_progress_threads.c#L65
[5] https://github.com/open-mpi/ompi/blob/v4.0.1/orte/mca/oob/tcp/oob_tcp_component.c#L192
[6] https://github.com/open-mpi/ompi/blob/v4.0.1/orte/mca/oob/tcp/oob_tcp_listener.c#L955

Signed-off-by: Orivej Desh <[email protected]>
(cherry picked from commit 78b7e34)
@jsquyres
Copy link
Member Author

This PR should hypothetically become moot when #7010 is cherry-picked back to the v3.0.x branch. But I'm making it anyway, just so that it's not forgotten (in case we do turn out to need it).

@wckzhang
Copy link
Contributor

Looks sane and self contained to me, do you know how often this segfault happens?

@jsquyres
Copy link
Member Author

@wckzhang This particular seg fault wasn't happening to me on the v3.0.x / v3.1.x branches, but I didn't know if this was a race condition (and therefore may not happen every time), and just hadn't been brought back to the v3.0.x / v3.1.x branches.

Also, I don't know if it's now moot because of #7014...?

@wckzhang
Copy link
Contributor

Theoretically, I'd assume #7014 would make this patch unnecessary...

Ralph had this comment in #7010 which makes me wonder if this patch actually solved the issue since apparently there was still a race condition after this patch?

Also turns out to have a race condition that can cause
segfault on finalize, so maybe good that nobody is using it.

@rhc54
Copy link
Contributor

rhc54 commented Sep 26, 2019

Best to just run a quick check - this patch should no longer be necessary.

@jsquyres
Copy link
Member Author

I ran the following:

I didn't get segv's in any of those cases. But #7014 is definitely a race condition. I didn't know if this PR was also addressing a race condition (and therefore I may or may not actually experience the problem).

Copy link
Contributor

@rhc54 rhc54 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't be needed but does no harm - go for it!

@jsquyres jsquyres merged commit c7ddec9 into open-mpi:v3.0.x Sep 26, 2019
@jsquyres jsquyres deleted the pr/v3.0.x/oob-tcp-fix branch September 26, 2019 23:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants