I updated my OMPI work area on a Cray-XC system to head of master yesterday, which pulled in #7010, and now mpirun is broken:
hpp@nid00192:/usr/projects/hpctools/hpp/ompi/examples> (master *)mpirun -np 8 ./connectivity_c
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
Here's what gdb shows:
#0 0x00002aaaab088a14 in opal_libevent2022_event_priority_set (ev=0x7dfb30, pri=3) at event.c:1859
#1 0x00002aaab2449490 in orte_oob_tcp_start_listening () at oob_tcp_listener.c:165
#2 0x00002aaab2444673 in component_startup () at oob_tcp_component.c:636
#3 0x00002aaaaad67194 in orte_oob_base_select () at base/oob_base_select.c:87
#4 0x00002aaaaad47d66 in orte_ess_base_orted_setup () at base/ess_base_std_orted.c:393
#5 0x00002aaaadd5d0fa in rte_init () at ess_slurm_module.c:79
#6 0x00002aaaaad9b49c in orte_init (pargc=0x7fffffff68bc, pargv=0x7fffffff68b0, flags=2) at runtime/orte_init.c:273
#7 0x00002aaaaad1f5a0 in orte_daemon (argc=16, argv=0x7fffffff6e28) at orted/orted_main.c:358
#8 0x0000000000400838 in main (argc=16, argv=0x7fffffff6e28) at orted.c:60
(gdb) list
1854 {
1855 _event_debug_assert_is_setup(ev);
1856
1857 if (ev->ev_flags & EVLIST_ACTIVE)
1858 return (-1);
1859 if (pri < 0 || pri >= ev->ev_base->nactivequeues)
1860 return (-1);
1861
1862 ev->ev_pri = pri;
1863
(gdb) print ev
$1 = (struct event *) 0x7dfb30
(gdb) print ev->ev_base
$2 = (struct event_base *) 0x0
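The gdb session above shows that `ev->ev_base` is NULL when the priority is set, so the range check at event.c:1859 dereferences a null pointer. A minimal sketch of that failure mode follows. It uses hypothetical mock types (`mock_event`, `mock_event_base`) rather than libevent's real structures, and the NULL guard is an illustrative assumption, not the fix applied in Open MPI:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for libevent's structures. Only the fields
 * touched by the gdb listing above are modeled. */
struct mock_event_base {
    int nactivequeues;
};

struct mock_event {
    struct mock_event_base *ev_base;
    int ev_pri;
};

/* Mirrors the range check shown at event.c:1859, with an added NULL
 * guard so an event that was never bound to a base fails cleanly
 * instead of crashing. */
static int mock_event_priority_set(struct mock_event *ev, int pri)
{
    if (ev->ev_base == NULL)
        return -1;                      /* event was never assigned a base */
    if (pri < 0 || pri >= ev->ev_base->nactivequeues)
        return -1;                      /* priority out of range */
    ev->ev_pri = pri;
    return 0;
}

/* Binds the event to a base. This is the step that must happen
 * before a priority can be set. */
static void mock_event_assign(struct mock_event *ev,
                              struct mock_event_base *base)
{
    assert(base != NULL);
    ev->ev_base = base;
    ev->ev_pri = 0;
}
```

In this sketch, calling `mock_event_priority_set` on an event whose `ev_base` is still NULL returns -1, while calling it after `mock_event_assign` succeeds for any priority below `nactivequeues`.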
The problem with the PR that removed the old OOB code is that it also removed these lines:
hppritcha changed the title from "PR 7070 broke mpirun, at least on systems using slurmd on head node" to "PR 7010 broke mpirun, at least on systems using slurmd on head node" on Sep 30, 2019
Opening a PR shortly to fix this.