PR 7010 broke mpirun, at least on systems using slurmd on head node #7020

Closed
hppritcha opened this issue Sep 30, 2019 · 1 comment

@hppritcha
Member

I updated my OMPI work area on a Cray-XC system to head of master yesterday, which pulled in #7010, and now mpirun is broken:

hpp@nid00192:/usr/projects/hpctools/hpp/ompi/examples> (master *)mpirun -np 8 ./connectivity_c
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).

Here's what gdb shows:

#0  0x00002aaaab088a14 in opal_libevent2022_event_priority_set (ev=0x7dfb30, pri=3) at event.c:1859
#1  0x00002aaab2449490 in orte_oob_tcp_start_listening () at oob_tcp_listener.c:165
#2  0x00002aaab2444673 in component_startup () at oob_tcp_component.c:636
#3  0x00002aaaaad67194 in orte_oob_base_select () at base/oob_base_select.c:87
#4  0x00002aaaaad47d66 in orte_ess_base_orted_setup () at base/ess_base_std_orted.c:393
#5  0x00002aaaadd5d0fa in rte_init () at ess_slurm_module.c:79
#6  0x00002aaaaad9b49c in orte_init (pargc=0x7fffffff68bc, pargv=0x7fffffff68b0, flags=2) at runtime/orte_init.c:273
#7  0x00002aaaaad1f5a0 in orte_daemon (argc=16, argv=0x7fffffff6e28) at orted/orted_main.c:358
#8  0x0000000000400838 in main (argc=16, argv=0x7fffffff6e28) at orted.c:60
(gdb) list
1854	{
1855		_event_debug_assert_is_setup(ev);
1856	
1857		if (ev->ev_flags & EVLIST_ACTIVE)
1858			return (-1);
1859		if (pri < 0 || pri >= ev->ev_base->nactivequeues)
1860			return (-1);
1861	
1862		ev->ev_pri = pri;
1863	
(gdb) print ev
$1 = (struct event *) 0x7dfb30
(gdb) print ev->ev_base
$2 = (struct event_base *) 0x0
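
To make the failure mode in the trace concrete, here is a minimal sketch of a priority setter whose range check reads through the event's base pointer, as line 1859 of the listing does. The `toy_` types and function are hypothetical stand-ins, not libevent's real definitions, and the NULL guard is added purely for illustration; the actual fix may look quite different.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the libevent structs in the trace. */
struct toy_event_base {
    int nactivequeues;              /* number of priority queues in the base */
};

struct toy_event {
    struct toy_event_base *ev_base; /* NULL when no base was ever assigned */
    int ev_pri;
};

/*
 * Same shape as the range check shown by gdb, which reads
 * ev->ev_base->nactivequeues. Without the guard below, a NULL ev_base
 * makes that read an invalid dereference.
 */
static int toy_event_priority_set(struct toy_event *ev, int pri)
{
    assert(ev != NULL);

    if (ev->ev_base == NULL)        /* illustrative guard, not in the listing */
        return -1;
    if (pri < 0 || pri >= ev->ev_base->nactivequeues)
        return -1;

    ev->ev_pri = pri;
    return 0;
}
```

With `ev_base` left NULL, the guarded version returns -1 instead of faulting; once a base with enough queues is attached, the same call with `pri=3` succeeds.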

The problem with the OOB "remove old code" PR (#7010) is that it removed these lines, which assigned orte_oob_base.ev_base:

@@ -122,13 +110,6 @@ static int orte_oob_base_open(mca_base_open_flag_t flags)
     opal_hash_table_init(&orte_oob_base.peers, 128);
     OBJ_CONSTRUCT(&orte_oob_base.actives, opal_list_t);
 
-    if (ORTE_PROC_IS_APP || ORTE_PROC_IS_TOOL) {
-        orte_oob_base.ev_base = orte_event_base;
-    } else {
-        orte_oob_base.ev_base = opal_progress_thread_init("OOB-BASE");
-    }
-
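
The selection logic in the removed lines can be sketched in isolation as follows. This is a toy model under stated assumptions: the `toy_` names, the enum, and the two base parameters are hypothetical and only mirror the branch structure of the diff (applications and tools reuse the shared base, everything else gets a dedicated one); it does not call any real ORTE or OPAL function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an event base; only its identity matters here. */
struct toy_event_base {
    const char *name;
};

/* Hypothetical process kinds, loosely modelled on the macros in the diff. */
enum toy_proc_type { TOY_PROC_APP, TOY_PROC_TOOL, TOY_PROC_DAEMON };

/*
 * Mirrors the removed branch: apps and tools share the existing event base,
 * while other process types use a dedicated progress-thread base.
 * Skipping this step entirely leaves the OOB base pointer unset.
 */
static struct toy_event_base *
toy_pick_oob_base(enum toy_proc_type type,
                  struct toy_event_base *shared_base,
                  struct toy_event_base *progress_thread_base)
{
    assert(shared_base != NULL && progress_thread_base != NULL);

    if (type == TOY_PROC_APP || type == TOY_PROC_TOOL)
        return shared_base;
    return progress_thread_base;
}
```

The point of the sketch is that every process type ends up with a non-NULL base, which is the property lost when the assignment is dropped.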

Opening a PR shortly to fix this.

@hppritcha changed the title from "PR 7070 broke mpirun, at least on systems using slurmd on head node" to "PR 7010 broke mpirun, at least on systems using slurmd on head node" on Sep 30, 2019
@hppritcha
Member Author

Fixed by #7022.