Skip to content

Fix oob_tcp tcp_component_close segfault with active listeners #6796

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jul 8, 2019

Conversation

orivej
Copy link
Contributor

@orivej orivej commented Jul 4, 2019

oob_tcp in non-HNP mode shares libevent event_base with oob_base [1].
orte_oob_base_close calls:
(1) oob_tcp component_shutdown, then
(2) opal_progress_thread_finalize, then
(3) oob_tcp tcp_component_close [2].
opal_progress_thread_finalize calls tracker_destructor [3] that frees the event_base [4]. If any oob_tcp event listeners are active at this time, oob_tcp will crash trying to delete them at [5] [6].

This change moves oob_tcp event listener cleanup from component_close to component_shutdown so that it happens before the event_base is freed.

[1] https://github.com/open-mpi/ompi/blob/v4.0.1/orte/mca/oob/tcp/oob_tcp_listener.c#L160
[2] https://github.com/open-mpi/ompi/blob/v4.0.1/orte/mca/oob/base/oob_base_frame.c#L95
[3] https://github.com/open-mpi/ompi/blob/v4.0.1/opal/runtime/opal_progress_threads.c#L232
[4] https://github.com/open-mpi/ompi/blob/v4.0.1/opal/runtime/opal_progress_threads.c#L65
[5] https://github.com/open-mpi/ompi/blob/v4.0.1/orte/mca/oob/tcp/oob_tcp_component.c#L192
[6] https://github.com/open-mpi/ompi/blob/v4.0.1/orte/mca/oob/tcp/oob_tcp_listener.c#L955

@ompiteam-bot
Copy link

Can one of the admins verify this patch?

oob_tcp in non-HNP mode shares libevent event_base with oob_base [1].
orte_oob_base_close calls:
(1) oob_tcp component_shutdown, then
(2) opal_progress_thread_finalize, then
(3) oob_tcp tcp_component_close [2].
opal_progress_thread_finalize calls tracker_destructor [3] that frees the
event_base [4]. If any oob_tcp event listeners are active at this time, oob_tcp
will crash trying to delete them at [5] [6].

This change moves oob_tcp event listener cleanup from component_close to
component_shutdown so that it happens before the event_base is freed.

[1] https://github.com/open-mpi/ompi/blob/v4.0.1/orte/mca/oob/tcp/oob_tcp_listener.c#L160
[2] https://github.com/open-mpi/ompi/blob/v4.0.1/orte/mca/oob/base/oob_base_frame.c#L95
[3] https://github.com/open-mpi/ompi/blob/v4.0.1/opal/runtime/opal_progress_threads.c#L232
[4] https://github.com/open-mpi/ompi/blob/v4.0.1/opal/runtime/opal_progress_threads.c#L65
[5] https://github.com/open-mpi/ompi/blob/v4.0.1/orte/mca/oob/tcp/oob_tcp_component.c#L192
[6] https://github.com/open-mpi/ompi/blob/v4.0.1/orte/mca/oob/tcp/oob_tcp_listener.c#L955

Signed-off-by: Orivej Desh <[email protected]>
@orivej orivej force-pushed the fix-tcp_component_close-segfault branch from 10c7447 to 78b7e34 Compare July 4, 2019 20:45
@orivej
Copy link
Contributor Author

orivej commented Jul 4, 2019

I was debugging a reliable orted crash on a host that could not connect back to its master host. Here are the backtraces that I have obtained with OpenMPI 4.0.1.

This is the initial (expected) part of the backtrace:

main+176 at orte/tools/orted/orted.c:60
orte_daemon+12052 at orte/orted/orted_main.c:1048
orte_finalize+210 at orte/runtime/orte_finalize.c:77
rte_finalize+-985443 at orte/mca/ess/env/ess_env_module.c:122
orte_ess_base_orted_finalize+540 at orte/mca/ess/base/ess_base_std_orted.c:695
mca_base_framework_close+310 at opal/mca/base/mca_base_framework.c:216

This is the rest that crashes:

orte_oob_base_close+487 at orte/mca/oob/base/oob_base_frame.c:110
mca_base_framework_components_close+40 at opal/mca/base/mca_base_components_close.c:65
mca_base_components_close+103 at opal/mca/base/mca_base_components_close.c:85
mca_base_component_close+46 at opal/mca/base/mca_base_components_close.c:53
tcp_component_close+135 at orte/mca/oob/tcp/oob_tcp_component.c:192
opal_obj_run_destructors+-831097 at opal/class/opal_object.h:462
tcp_ev_des+42 at orte/mca/oob/tcp/oob_tcp_listener.c:955
event_del+146 at libevent/event.c:2204
evthread_posix_lock+64 at libevent/evthread_pthread.c:72
SIGSEGV

It crashes because the event base with its lock were freed here:

orte_oob_base_close+234 at orte/mca/oob/base/oob_base_frame.c:95
opal_progress_thread_finalize+276 at opal/runtime/opal_progress_threads.c:232
opal_obj_run_destructors+-371001 at opal/class/opal_object.h:462
tracker_destructor+90 at opal/runtime/opal_progress_threads.c:65
event_base_free+1489 at libevent/event.c:791

@bwbarrett
Copy link
Member

ok to test

@rhc54 rhc54 added this to the v4.0.2 milestone Jul 8, 2019
Copy link
Member

@jsquyres jsquyres left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch -- thank you!

@jsquyres jsquyres merged commit b738fa2 into open-mpi:master Jul 8, 2019
rhc54 added a commit to rhc54/prrte that referenced this pull request Jul 10, 2019
Backport open-mpi/ompi#6796

Signed-off-by: Ralph Castain <[email protected]>
@hppritcha hppritcha removed this from the v4.0.2 milestone Jul 11, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants