Ensure that nodes are always used in order provided #6493

Merged
merged 4 commits into from
Mar 19, 2019

rhc54
Contributor

@rhc54 rhc54 commented Mar 15, 2019

If a user provides a list of nodes to use via -host or -hostfile, then
ensure that the ranks are placed according to that order. Also fix a bug
where the number of slots on a node was incorrectly computed for
localhost if the name given didn't exactly match the return from
get_hostname.

Fixes #6298

Signed-off-by: Ralph Castain [email protected]
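
The intended behavior can be illustrated with a small sketch. This is not Open MPI's mapper: `place_ranks` and its one-slot-per-host default are hypothetical names and assumptions, chosen only to show ranks following the order in which the hosts were listed.

```python
def place_ranks(hosts, slots_per_host=1):
    """Assign consecutive ranks to hosts in exactly the order given.

    Hypothetical helper for illustration only; it is not part of Open MPI.
    """
    placement = []
    rank = 0
    for host in hosts:  # preserve the user's order; no sorting
        for _ in range(slots_per_host):
            placement.append((rank, host))
            rank += 1
    return placement


# Equivalent in spirit to `--host mpi002,mpi001`
print(place_ranks(["mpi002", "mpi001"]))  # [(0, 'mpi002'), (1, 'mpi001')]
```

Under this model, reversing the host list reverses which node receives rank 0, which is the behavior the PR description asks for.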

@rhc54 rhc54 requested a review from jsquyres March 15, 2019 20:00
@jjhursey
Member

Does this PR also fix issue #4327? It sounds related, so I thought I would ask.

@jsquyres
Member

jsquyres commented Mar 15, 2019

This sorta works. In one case, everything appears fine:

$ mpirun --host mpi001,mpi002 ./foo.sh
mpi001: 0 rank
mpi002: 1 rank

But if I switch the order, it hangs:

$ mpirun --host mpi002,mpi001 ./foo.sh
mpi001: 1 rank
mpi002: 1 rank
[...hang...]

Notice that both of them got an $OMPI_COMM_WORLD_RANK of 1.
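
A check like this can be automated. The sketch below assumes the script prints one line per process in the `host: N rank` form shown above; `check_ranks` is a hypothetical helper, not part of any Open MPI tooling.

```python
import re

# Matches lines such as "mpi001: 1 rank"
LINE_RE = re.compile(r"^(?P<host>\S+): (?P<rank>\d+) rank$")


def check_ranks(lines, expected_size):
    """Return (duplicates, missing) for output lines like 'mpi001: 1 rank'."""
    seen = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # ignore anything that is not a rank line
        seen.setdefault(int(m.group("rank")), []).append(m.group("host"))
    duplicates = {r: h for r, h in seen.items() if len(h) > 1}
    missing = [r for r in range(expected_size) if r not in seen]
    return duplicates, missing


dups, missing = check_ranks(["mpi001: 1 rank", "mpi002: 1 rank"], expected_size=2)
print(dups)     # {1: ['mpi001', 'mpi002']}
print(missing)  # [0]
```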

It looks like the orted is still running on mpi002; that's why mpirun is hanging.

I gdb-attached to the orted -- it's just in the event loop -- not a whole lot of detail you can glean from that:

#0  0x00000039b46df393 in poll () from /lib64/libc.so.6
#1  0x00002aaaaae6f425 in poll_dispatch (base=0x64b6f0, tv=0x0) at poll.c:165
#2  0x00002aaaaae665c2 in opal_libevent2022_event_base_loop (base=0x64b6f0, flags=1) at event.c:1630
#3  0x00002aaaaaafd04c in orte_daemon (argc=32, argv=0x7fffffffd5f8) at orted/orted_main.c:1024
#4  0x00000000004008ae in main (argc=32, argv=0x7fffffffd5f8) at orted.c:60

I ran with `--mca state_base_verbose 100`; here's the output I got:

[mpi001:29887] mca: base: components_register: registering framework state components
[mpi001:29887] mca: base: components_register: found loaded component novm
[mpi001:29887] mca: base: components_register: component novm has no register or open function
[mpi001:29887] mca: base: components_register: found loaded component orted
[mpi001:29887] mca: base: components_register: component orted has no register or open function
[mpi001:29887] mca: base: components_register: found loaded component app
[mpi001:29887] mca: base: components_register: component app has no register or open function
[mpi001:29887] mca: base: components_register: found loaded component tool
[mpi001:29887] mca: base: components_register: component tool has no register or open function
[mpi001:29887] mca: base: components_register: found loaded component hnp
[mpi001:29887] mca: base: components_register: component hnp has no register or open function
[mpi001:29887] mca: base: components_open: opening state components
[mpi001:29887] mca: base: components_open: found loaded component novm
[mpi001:29887] mca: base: components_open: component novm open function successful
[mpi001:29887] mca: base: components_open: found loaded component orted
[mpi001:29887] mca: base: components_open: component orted open function successful
[mpi001:29887] mca: base: components_open: found loaded component app
[mpi001:29887] mca: base: components_open: component app open function successful
[mpi001:29887] mca: base: components_open: found loaded component tool
[mpi001:29887] mca: base: components_open: component tool open function successful
[mpi001:29887] mca: base: components_open: found loaded component hnp
[mpi001:29887] mca: base: components_open: component hnp open function successful
[mpi001:29887] mca:base:select: Auto-selecting state components
[mpi001:29887] mca:base:select:(state) Querying component [novm]
[mpi001:29887] mca:base:select:(state) Querying component [orted]
[mpi001:29887] mca:base:select:(state) Querying component [app]
[mpi001:29887] mca:base:select:(state) Querying component [tool]
[mpi001:29887] mca:base:select:(state) Querying component [hnp]
[mpi001:29887] mca:base:select:(state) Query of component [hnp] set priority to 60
[mpi001:29887] mca:base:select:(state) Selected component [hnp]
[mpi001:29887] mca: base: close: component novm closed
[mpi001:29887] mca: base: close: unloading component novm
[mpi001:29887] mca: base: close: component orted closed
[mpi001:29887] mca: base: close: unloading component orted
[mpi001:29887] mca: base: close: component app closed
[mpi001:29887] mca: base: close: unloading component app
[mpi001:29887] mca: base: close: component tool closed
[mpi001:29887] mca: base: close: unloading component tool
[mpi001:29887] ORTE_JOB_STATE_MACHINE:
[mpi001:29887] 	State: PENDING INIT cbfunc: DEFINED
[mpi001:29887] 	State: INIT_COMPLETE cbfunc: DEFINED
[mpi001:29887] 	State: PENDING ALLOCATION cbfunc: DEFINED
[mpi001:29887] 	State: ALLOCATION COMPLETE cbfunc: DEFINED
[mpi001:29887] 	State: DAEMONS LAUNCHED cbfunc: DEFINED
[mpi001:29887] 	State: ALL DAEMONS REPORTED cbfunc: DEFINED
[mpi001:29887] 	State: VM READY cbfunc: DEFINED
[mpi001:29887] 	State: PENDING MAPPING cbfunc: DEFINED
[mpi001:29887] 	State: MAP COMPLETE cbfunc: DEFINED
[mpi001:29887] 	State: PENDING FINAL SYSTEM PREP cbfunc: DEFINED
[mpi001:29887] 	State: PENDING APP LAUNCH cbfunc: DEFINED
[mpi001:29887] 	State: SENDING LAUNCH MSG cbfunc: DEFINED
[mpi001:29887] 	State: LOCAL LAUNCH COMPLETE cbfunc: DEFINED
[mpi001:29887] 	State: RUNNING cbfunc: DEFINED
[mpi001:29887] 	State: SYNC REGISTERED cbfunc: DEFINED
[mpi001:29887] 	State: NORMALLY TERMINATED cbfunc: DEFINED
[mpi001:29887] 	State: NOTIFY COMPLETED cbfunc: DEFINED
[mpi001:29887] 	State: NOTIFIED cbfunc: DEFINED
[mpi001:29887] 	State: ALL JOBS COMPLETE cbfunc: DEFINED
[mpi001:29887] 	State: DAEMONS TERMINATED cbfunc: DEFINED
[mpi001:29887] 	State: FORCED EXIT cbfunc: DEFINED
[mpi001:29887] 	State: REPORT PROGRESS cbfunc: DEFINED
[mpi001:29887] ORTE_PROC_STATE_MACHINE:
[mpi001:29887] 	State: RUNNING cbfunc: DEFINED
[mpi001:29887] 	State: SYNC REGISTERED cbfunc: DEFINED
[mpi001:29887] 	State: IOF COMPLETE cbfunc: DEFINED
[mpi001:29887] 	State: WAITPID FIRED cbfunc: DEFINED
[mpi001:29887] 	State: NORMALLY TERMINATED cbfunc: DEFINED
[mpi001:29887] [[14356,0],0] ACTIVATE JOB [INVALID] STATE PENDING INIT AT plm_rsh_module.c:931
[mpi001:29887] [[14356,0],0] ACTIVATING JOB [INVALID] STATE PENDING INIT PRI 4
[mpi001:29887] [[14356,0],0] ACTIVATE JOB [14356,1] STATE INIT_COMPLETE AT base/plm_base_launch_support.c:451
[mpi001:29887] [[14356,0],0] ACTIVATING JOB [14356,1] STATE INIT_COMPLETE PRI 4
[mpi001:29887] [[14356,0],0] ACTIVATE JOB [14356,1] STATE PENDING ALLOCATION AT base/plm_base_launch_support.c:464
[mpi001:29887] [[14356,0],0] ACTIVATING JOB [14356,1] STATE PENDING ALLOCATION PRI 4
[mpi001:29887] [[14356,0],0] ACTIVATE JOB [14356,1] STATE ALLOCATION COMPLETE AT base/ras_base_allocate.c:474
[mpi001:29887] [[14356,0],0] ACTIVATING JOB [14356,1] STATE ALLOCATION COMPLETE PRI 4
[mpi001:29887] [[14356,0],0] ACTIVATE JOB [14356,1] STATE PENDING DAEMON LAUNCH AT base/plm_base_launch_support.c:279
[mpi001:29887] [[14356,0],0] ACTIVATING JOB [14356,1] STATE PENDING DAEMON LAUNCH PRI 4
[mpi002:03515] mca: base: components_register: registering framework state components
[mpi002:03515] mca: base: components_register: found loaded component novm
[mpi002:03515] mca: base: components_register: component novm has no register or open function
[mpi002:03515] mca: base: components_register: found loaded component orted
[mpi002:03515] mca: base: components_register: component orted has no register or open function
[mpi002:03515] mca: base: components_register: found loaded component app
[mpi002:03515] mca: base: components_register: component app has no register or open function
[mpi002:03515] mca: base: components_register: found loaded component tool
[mpi002:03515] mca: base: components_register: component tool has no register or open function
[mpi002:03515] mca: base: components_register: found loaded component hnp
[mpi002:03515] mca: base: components_register: component hnp has no register or open function
[mpi002:03515] mca: base: components_open: opening state components
[mpi002:03515] mca: base: components_open: found loaded component novm
[mpi002:03515] mca: base: components_open: component novm open function successful
[mpi002:03515] mca: base: components_open: found loaded component orted
[mpi002:03515] mca: base: components_open: component orted open function successful
[mpi002:03515] mca: base: components_open: found loaded component app
[mpi002:03515] mca: base: components_open: component app open function successful
[mpi002:03515] mca: base: components_open: found loaded component tool
[mpi002:03515] mca: base: components_open: component tool open function successful
[mpi002:03515] mca: base: components_open: found loaded component hnp
[mpi002:03515] mca: base: components_open: component hnp open function successful
[mpi002:03515] mca:base:select: Auto-selecting state components
[mpi002:03515] mca:base:select:(state) Querying component [novm]
[mpi002:03515] mca:base:select:(state) Querying component [orted]
[mpi002:03515] mca:base:select:(state) Query of component [orted] set priority to 100
[mpi002:03515] mca:base:select:(state) Querying component [app]
[mpi002:03515] mca:base:select:(state) Querying component [tool]
[mpi002:03515] mca:base:select:(state) Querying component [hnp]
[mpi002:03515] mca:base:select:(state) Selected component [orted]
[mpi002:03515] mca: base: close: component novm closed
[mpi002:03515] mca: base: close: unloading component novm
[mpi002:03515] mca: base: close: component app closed
[mpi002:03515] mca: base: close: unloading component app
[mpi002:03515] mca: base: close: component tool closed
[mpi002:03515] mca: base: close: unloading component tool
[mpi002:03515] mca: base: close: component hnp closed
[mpi002:03515] mca: base: close: unloading component hnp
[mpi002:03515] ORTE_JOB_STATE_MACHINE:
[mpi002:03515] 	State: LOCAL LAUNCH COMPLETE cbfunc: DEFINED
[mpi002:03515] 	State: FORCED EXIT cbfunc: DEFINED
[mpi002:03515] 	State: DAEMONS TERMINATED cbfunc: DEFINED
[mpi002:03515] ORTE_PROC_STATE_MACHINE:
[mpi002:03515] 	State: RUNNING cbfunc: DEFINED
[mpi002:03515] 	State: SYNC REGISTERED cbfunc: DEFINED
[mpi002:03515] 	State: IOF COMPLETE cbfunc: DEFINED
[mpi002:03515] 	State: WAITPID FIRED cbfunc: DEFINED
[mpi002:03515] 	State: NORMALLY TERMINATED cbfunc: DEFINED
Daemon [[14356,0],1] checking in as pid 3515 on host mpi002
[mpi002:03515] [[14356,0],1] orted: up and running - waiting for commands!
[mpi001:29887] [[14356,0],0] ACTIVATE JOB [14356,1] STATE ALL DAEMONS REPORTED AT base/plm_base_launch_support.c:1400
[mpi001:29887] [[14356,0],0] ACTIVATING JOB [14356,1] STATE ALL DAEMONS REPORTED PRI 4
[mpi001:29887] [[14356,0],0] ACTIVATE JOB [14356,1] STATE VM READY AT base/plm_base_launch_support.c:258
[mpi001:29887] [[14356,0],0] ACTIVATING JOB [14356,1] STATE VM READY PRI 4
[mpi001:29887] [[14356,0],0] orted:comm:process_commands() Processing Command: Unknown Command!
[mpi001:29887] [[14356,0],0] orted_cmd: received pass_node_info
[mpi001:29887] [[14356,0],0] ACTIVATE JOB [14356,1] STATE PENDING MAPPING AT base/plm_base_launch_support.c:307
[mpi001:29887] [[14356,0],0] ACTIVATING JOB [14356,1] STATE PENDING MAPPING PRI 4
[mpi002:03515] [[14356,0],1] orted:comm:process_commands() Processing Command: Unknown Command!
[mpi002:03515] [[14356,0],1] orted_cmd: received pass_node_info
[mpi001:29887] [[14356,0],0] ACTIVATE JOB [14356,1] STATE MAP COMPLETE AT base/rmaps_base_map_job.c:479
[mpi001:29887] [[14356,0],0] ACTIVATING JOB [14356,1] STATE MAP COMPLETE PRI 4
[mpi001:29887] [[14356,0],0] ACTIVATE JOB [14356,1] STATE PENDING FINAL SYSTEM PREP AT base/plm_base_launch_support.c:337
[mpi001:29887] [[14356,0],0] ACTIVATING JOB [14356,1] STATE PENDING FINAL SYSTEM PREP PRI 4
[mpi001:29887] [[14356,0],0] ACTIVATE JOB [14356,1] STATE PENDING APP LAUNCH AT base/plm_base_launch_support.c:564
[mpi001:29887] [[14356,0],0] ACTIVATING JOB [14356,1] STATE PENDING APP LAUNCH PRI 4
[mpi001:29887] [[14356,0],0] ACTIVATE JOB [14356,1] STATE SENDING LAUNCH MSG AT base/odls_base_default_fns.c:133
[mpi001:29887] [[14356,0],0] ACTIVATING JOB [14356,1] STATE SENDING LAUNCH MSG PRI 4
[mpi001:29887] [[14356,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_ADD_LOCAL_PROCS
[mpi001:29887] [[14356,0],0] orted_cmd: received add_local_procs
[mpi002:03515] [[14356,0],1] orted:comm:process_commands() Processing Command: ORTE_DAEMON_ADD_LOCAL_PROCS
[mpi002:03515] [[14356,0],1] orted_cmd: received add_local_procs
[mpi001:29887] [[14356,0],0] ACTIVATE PROC [[14356,1],1] STATE RUNNING AT base/odls_base_default_fns.c:1047
[mpi001:29887] [[14356,0],0] ACTIVATING PROC [[14356,1],1] STATE RUNNING PRI 4
[mpi001:29887] [[14356,0],0] state:base:track_procs called for proc [[14356,1],1] state RUNNING
[mpi002:03515] [[14356,0],1] ACTIVATE PROC [[14356,1],1] STATE RUNNING AT base/odls_base_default_fns.c:1047
[mpi002:03515] [[14356,0],1] ACTIVATING PROC [[14356,1],1] STATE RUNNING PRI 4
[mpi002:03515] [[14356,0],1] state:orted:track_procs called for proc [[14356,1],1] state RUNNING
[mpi002:03515] [[14356,0],1] ACTIVATE JOB [14356,1] STATE LOCAL LAUNCH COMPLETE AT state_orted.c:295
[mpi002:03515] [[14356,0],1] ACTIVATING JOB [14356,1] STATE LOCAL LAUNCH COMPLETE PRI 4
[mpi002:03515] [[14356,0],1] state:orted:track_jobs sending local launch complete for job [14356,1]
[mpi001:29887] [[14356,0],0] ACTIVATE PROC [[14356,1],1] STATE RUNNING AT base/plm_base_receive.c:351
[mpi001:29887] [[14356,0],0] ACTIVATING PROC [[14356,1],1] STATE RUNNING PRI 4
[mpi001:29887] [[14356,0],0] state:base:track_procs called for proc [[14356,1],1] state RUNNING
[mpi001:29887] [[14356,0],0] ACTIVATE JOB [14356,1] STATE RUNNING AT base/state_base_fns.c:665
[mpi001:29887] [[14356,0],0] ACTIVATING JOB [14356,1] STATE RUNNING PRI 4
[mpi001:29887] [[14356,0],0] ACTIVATE PROC [[14356,1],1] STATE IOF COMPLETE AT iof_hnp_read.c:308
[mpi001:29887] [[14356,0],0] ACTIVATING PROC [[14356,1],1] STATE IOF COMPLETE PRI 4
mpi001: 1 rank
[mpi001:29887] [[14356,0],0] state:base:track_procs called for proc [[14356,1],1] state IOF COMPLETE
[mpi002:03515] [[14356,0],1] ACTIVATE PROC [[14356,1],1] STATE IOF COMPLETE AT iof_orted_read.c:181
[mpi002:03515] [[14356,0],1] ACTIVATING PROC [[14356,1],1] STATE IOF COMPLETE PRI 4
[mpi002:03515] [[14356,0],1] ACTIVATE PROC [[14356,1],1] STATE WAITPID FIRED AT base/odls_base_default_fns.c:1741
[mpi002:03515] [[14356,0],1] ACTIVATING PROC [[14356,1],1] STATE WAITPID FIRED PRI 4
[mpi002:03515] [[14356,0],1] state:orted:track_procs called for proc [[14356,1],1] state IOF COMPLETE
[mpi002:03515] [[14356,0],1] state:orted:track_procs called for proc [[14356,1],1] state WAITPID FIRED
[mpi002:03515] [[14356,0],1] ACTIVATE PROC [[14356,1],1] STATE NORMALLY TERMINATED AT state_orted.c:370
[mpi002:03515] [[14356,0],1] ACTIVATING PROC [[14356,1],1] STATE NORMALLY TERMINATED PRI 4
[mpi002:03515] [[14356,0],1] state:orted:track_procs called for proc [[14356,1],1] state NORMALLY TERMINATED
mpi002: 1 rank
[mpi002:03515] [[14356,0],1] state:orted: SENDING JOB LOCAL TERMINATION UPDATE FOR JOB [14356,1]
[mpi001:29887] [[14356,0],0] ACTIVATE PROC [[14356,1],1] STATE NORMALLY TERMINATED AT base/plm_base_receive.c:351
[mpi001:29887] [[14356,0],0] ACTIVATING PROC [[14356,1],1] STATE NORMALLY TERMINATED PRI 4
[mpi001:29887] [[14356,0],0] state:base:track_procs called for proc [[14356,1],1] state NORMALLY TERMINATED
[mpi001:29887] [[14356,0],0] state:base:cleanup_node on proc [[14356,1],1]
[mpi002:03515] [[14356,0],1] state:orted releasing procs from node mpi001
[mpi002:03515] [[14356,0],1] state:orted releasing proc [[14356,1],0] from node mpi001
[mpi002:03515] [[14356,0],1] state:orted releasing procs from node mpi002
[mpi002:03515] [[14356,0],1] state:orted releasing proc [[14356,1],1] from node mpi002
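
To pick out which daemon activated which process from a log like this, a small filter helps. The sketch below is an assumption-laden convenience, not an Open MPI tool: the regular expression is inferred only from the `ACTIVATE PROC` lines shown above, and `proc_events` is a hypothetical name.

```python
import re

# Pattern inferred from the "ACTIVATE PROC" lines in the log above.
ACTIVATE_PROC = re.compile(
    r"^\[(?P<host>[^:\]]+):(?P<pid>\d+)\] "
    r"\[\[\d+,\d+\],\d+\] ACTIVATE PROC "
    r"\[\[(?P<job>\d+),(?P<sub>\d+)\],(?P<rank>\d+)\] "
    r"STATE (?P<state>.+?) AT (?P<where>\S+)$"
)


def proc_events(lines):
    """Yield (host, rank, state, source location) for each ACTIVATE PROC line."""
    for line in lines:
        m = ACTIVATE_PROC.match(line.strip())
        if m:
            yield m["host"], int(m["rank"]), m["state"], m["where"]


sample = [
    "[mpi001:29887] [[14356,0],0] ACTIVATE PROC [[14356,1],1] STATE RUNNING AT base/odls_base_default_fns.c:1047",
    "[mpi002:03515] [[14356,0],1] ACTIVATE PROC [[14356,1],1] STATE RUNNING AT base/odls_base_default_fns.c:1047",
    "[mpi001:29887] [[14356,0],0] ACTIVATING PROC [[14356,1],1] STATE RUNNING PRI 4",
]
for event in proc_events(sample):
    print(event)
```

On the three sample lines, only the two `ACTIVATE PROC` entries are reported, and both name rank 1, one from each host.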

@rhc54
Contributor Author

rhc54 commented Mar 15, 2019

I see the problem: the two daemons both think they are launching rank 1, so nobody reports rank 0 as "complete". Obviously, we need the two to get the same mapping answer. I'll play with it a little, but I can't promise an immediate fix.
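
This is consistent with the log above: the HNP on mpi001 started proc [[14356,1],1] locally, while the orted on mpi002 also started [[14356,1],1] and later released [[14356,1],0] from node mpi001. A toy model of the disagreement follows, assuming each daemon derives ranks from its own view of the node order. The function and variable names are hypothetical, and this is not ORTE code.

```python
def local_ranks(node_order, my_host):
    """Ranks a daemon would launch if ranks follow its own view of node order."""
    return [rank for rank, host in enumerate(node_order) if host == my_host]


# Assumed views: the HNP honors the user's order, the remote daemon does not.
hnp_view = ["mpi002", "mpi001"]    # order given on the command line
orted_view = ["mpi001", "mpi002"]  # some other order held by the remote daemon

launched = local_ranks(hnp_view, "mpi001") + local_ranks(orted_view, "mpi002")
never_launched = sorted(set(range(2)) - set(launched))

print(launched)        # [1, 1]
print(never_launched)  # [0]
```

With one shared view on both daemons, each rank would be launched exactly once and the job could report completion.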

rhc54 added 2 commits March 16, 2019 01:20
Signed-off-by: Ralph Castain <[email protected]>
@rhc54
Contributor Author

rhc54 commented Mar 16, 2019

@jsquyres Could you please give this another try? I'm afraid my VirtualBox machines are down at the moment due to an upgrade, so I can't verify it.

@jsquyres
Member

I'm afraid I get hangs now, in both the ssh and SLURM cases.

Output from `mpirun --mca state_base_verbose 100 --debug-daemons --host mpi021,mpi022 ./foo.sh`, ssh run (no SLURM):

[mpi021:16094] mca: base: components_register: registering framework state components
[mpi021:16094] mca: base: components_register: found loaded component novm
[mpi021:16094] mca: base: components_register: component novm has no register or open function
[mpi021:16094] mca: base: components_register: found loaded component orted
[mpi021:16094] mca: base: components_register: component orted has no register or open function
[mpi021:16094] mca: base: components_register: found loaded component app
[mpi021:16094] mca: base: components_register: component app has no register or open function
[mpi021:16094] mca: base: components_register: found loaded component tool
[mpi021:16094] mca: base: components_register: component tool has no register or open function
[mpi021:16094] mca: base: components_register: found loaded component hnp
[mpi021:16094] mca: base: components_register: component hnp has no register or open function
[mpi021:16094] mca: base: components_open: opening state components
[mpi021:16094] mca: base: components_open: found loaded component novm
[mpi021:16094] mca: base: components_open: component novm open function successful
[mpi021:16094] mca: base: components_open: found loaded component orted
[mpi021:16094] mca: base: components_open: component orted open function successful
[mpi021:16094] mca: base: components_open: found loaded component app
[mpi021:16094] mca: base: components_open: component app open function successful
[mpi021:16094] mca: base: components_open: found loaded component tool
[mpi021:16094] mca: base: components_open: component tool open function successful
[mpi021:16094] mca: base: components_open: found loaded component hnp
[mpi021:16094] mca: base: components_open: component hnp open function successful
[mpi021:16094] mca:base:select: Auto-selecting state components
[mpi021:16094] mca:base:select:(state) Querying component [novm]
[mpi021:16094] mca:base:select:(state) Querying component [orted]
[mpi021:16094] mca:base:select:(state) Querying component [app]
[mpi021:16094] mca:base:select:(state) Querying component [tool]
[mpi021:16094] mca:base:select:(state) Querying component [hnp]
[mpi021:16094] mca:base:select:(state) Query of component [hnp] set priority to 60
[mpi021:16094] mca:base:select:(state) Selected component [hnp]
[mpi021:16094] mca: base: close: component novm closed
[mpi021:16094] mca: base: close: unloading component novm
[mpi021:16094] mca: base: close: component orted closed
[mpi021:16094] mca: base: close: unloading component orted
[mpi021:16094] mca: base: close: component app closed
[mpi021:16094] mca: base: close: unloading component app
[mpi021:16094] mca: base: close: component tool closed
[mpi021:16094] mca: base: close: unloading component tool
[mpi021:16094] ORTE_JOB_STATE_MACHINE:
[mpi021:16094] 	State: PENDING INIT cbfunc: DEFINED
[mpi021:16094] 	State: INIT_COMPLETE cbfunc: DEFINED
[mpi021:16094] 	State: PENDING ALLOCATION cbfunc: DEFINED
[mpi021:16094] 	State: ALLOCATION COMPLETE cbfunc: DEFINED
[mpi021:16094] 	State: DAEMONS LAUNCHED cbfunc: DEFINED
[mpi021:16094] 	State: ALL DAEMONS REPORTED cbfunc: DEFINED
[mpi021:16094] 	State: VM READY cbfunc: DEFINED
[mpi021:16094] 	State: PENDING MAPPING cbfunc: DEFINED
[mpi021:16094] 	State: MAP COMPLETE cbfunc: DEFINED
[mpi021:16094] 	State: PENDING FINAL SYSTEM PREP cbfunc: DEFINED
[mpi021:16094] 	State: PENDING APP LAUNCH cbfunc: DEFINED
[mpi021:16094] 	State: SENDING LAUNCH MSG cbfunc: DEFINED
[mpi021:16094] 	State: LOCAL LAUNCH COMPLETE cbfunc: DEFINED
[mpi021:16094] 	State: RUNNING cbfunc: DEFINED
[mpi021:16094] 	State: SYNC REGISTERED cbfunc: DEFINED
[mpi021:16094] 	State: NORMALLY TERMINATED cbfunc: DEFINED
[mpi021:16094] 	State: NOTIFY COMPLETED cbfunc: DEFINED
[mpi021:16094] 	State: NOTIFIED cbfunc: DEFINED
[mpi021:16094] 	State: ALL JOBS COMPLETE cbfunc: DEFINED
[mpi021:16094] 	State: DAEMONS TERMINATED cbfunc: DEFINED
[mpi021:16094] 	State: FORCED EXIT cbfunc: DEFINED
[mpi021:16094] 	State: REPORT PROGRESS cbfunc: DEFINED
[mpi021:16094] ORTE_PROC_STATE_MACHINE:
[mpi021:16094] 	State: RUNNING cbfunc: DEFINED
[mpi021:16094] 	State: SYNC REGISTERED cbfunc: DEFINED
[mpi021:16094] 	State: IOF COMPLETE cbfunc: DEFINED
[mpi021:16094] 	State: WAITPID FIRED cbfunc: DEFINED
[mpi021:16094] 	State: NORMALLY TERMINATED cbfunc: DEFINED
[mpi021:16094] [[32894,0],0] ACTIVATE JOB [INVALID] STATE PENDING INIT AT plm_rsh_module.c:931
[mpi021:16094] [[32894,0],0] ACTIVATING JOB [INVALID] STATE PENDING INIT PRI 4
[mpi021:16094] [[32894,0],0] ACTIVATE JOB [32894,1] STATE INIT_COMPLETE AT base/plm_base_launch_support.c:451
[mpi021:16094] [[32894,0],0] ACTIVATING JOB [32894,1] STATE INIT_COMPLETE PRI 4
[mpi021:16094] [[32894,0],0] ACTIVATE JOB [32894,1] STATE PENDING ALLOCATION AT base/plm_base_launch_support.c:464
[mpi021:16094] [[32894,0],0] ACTIVATING JOB [32894,1] STATE PENDING ALLOCATION PRI 4
[mpi021:16094] [[32894,0],0] ACTIVATE JOB [32894,1] STATE ALLOCATION COMPLETE AT base/ras_base_allocate.c:474
[mpi021:16094] [[32894,0],0] ACTIVATING JOB [32894,1] STATE ALLOCATION COMPLETE PRI 4
[mpi021:16094] [[32894,0],0] ACTIVATE JOB [32894,1] STATE PENDING DAEMON LAUNCH AT base/plm_base_launch_support.c:279
[mpi021:16094] [[32894,0],0] ACTIVATING JOB [32894,1] STATE PENDING DAEMON LAUNCH PRI 4
[mpi022:09348] mca: base: components_register: registering framework state components
[mpi022:09348] mca: base: components_register: found loaded component novm
[mpi022:09348] mca: base: components_register: component novm has no register or open function
[mpi022:09348] mca: base: components_register: found loaded component orted
[mpi022:09348] mca: base: components_register: component orted has no register or open function
[mpi022:09348] mca: base: components_register: found loaded component app
[mpi022:09348] mca: base: components_register: component app has no register or open function
[mpi022:09348] mca: base: components_register: found loaded component tool
[mpi022:09348] mca: base: components_register: component tool has no register or open function
[mpi022:09348] mca: base: components_register: found loaded component hnp
[mpi022:09348] mca: base: components_register: component hnp has no register or open function
[mpi022:09348] mca: base: components_open: opening state components
[mpi022:09348] mca: base: components_open: found loaded component novm
[mpi022:09348] mca: base: components_open: component novm open function successful
[mpi022:09348] mca: base: components_open: found loaded component orted
[mpi022:09348] mca: base: components_open: component orted open function successful
[mpi022:09348] mca: base: components_open: found loaded component app
[mpi022:09348] mca: base: components_open: component app open function successful
[mpi022:09348] mca: base: components_open: found loaded component tool
[mpi022:09348] mca: base: components_open: component tool open function successful
[mpi022:09348] mca: base: components_open: found loaded component hnp
[mpi022:09348] mca: base: components_open: component hnp open function successful
[mpi022:09348] mca:base:select: Auto-selecting state components
[mpi022:09348] mca:base:select:(state) Querying component [novm]
[mpi022:09348] mca:base:select:(state) Querying component [orted]
[mpi022:09348] mca:base:select:(state) Query of component [orted] set priority to 100
[mpi022:09348] mca:base:select:(state) Querying component [app]
[mpi022:09348] mca:base:select:(state) Querying component [tool]
[mpi022:09348] mca:base:select:(state) Querying component [hnp]
[mpi022:09348] mca:base:select:(state) Selected component [orted]
[mpi022:09348] mca: base: close: component novm closed
[mpi022:09348] mca: base: close: unloading component novm
[mpi022:09348] mca: base: close: component app closed
[mpi022:09348] mca: base: close: unloading component app
[mpi022:09348] mca: base: close: component tool closed
[mpi022:09348] mca: base: close: unloading component tool
[mpi022:09348] mca: base: close: component hnp closed
[mpi022:09348] mca: base: close: unloading component hnp
[mpi022:09348] ORTE_JOB_STATE_MACHINE:
[mpi022:09348] 	State: LOCAL LAUNCH COMPLETE cbfunc: DEFINED
[mpi022:09348] 	State: FORCED EXIT cbfunc: DEFINED
[mpi022:09348] 	State: DAEMONS TERMINATED cbfunc: DEFINED
[mpi022:09348] ORTE_PROC_STATE_MACHINE:
[mpi022:09348] 	State: RUNNING cbfunc: DEFINED
[mpi022:09348] 	State: SYNC REGISTERED cbfunc: DEFINED
[mpi022:09348] 	State: IOF COMPLETE cbfunc: DEFINED
[mpi022:09348] 	State: WAITPID FIRED cbfunc: DEFINED
[mpi022:09348] 	State: NORMALLY TERMINATED cbfunc: DEFINED
Daemon [[32894,0],1] checking in as pid 9348 on host mpi022
[mpi022:09348] [[32894,0],1] orted: up and running - waiting for commands!
[mpi021:16094] [[32894,0],0] ACTIVATE JOB [32894,1] STATE ALL DAEMONS REPORTED AT base/plm_base_launch_support.c:1400
[mpi021:16094] [[32894,0],0] ACTIVATING JOB [32894,1] STATE ALL DAEMONS REPORTED PRI 4
[mpi021:16094] [[32894,0],0] ACTIVATE JOB [32894,1] STATE VM READY AT base/plm_base_launch_support.c:258
[mpi021:16094] [[32894,0],0] ACTIVATING JOB [32894,1] STATE VM READY PRI 4
[mpi021:16094] [[32894,0],0] orted:comm:process_commands() Processing Command: Unknown Command!
[mpi021:16094] [[32894,0],0] orted_cmd: received pass_node_info
[mpi021:16094] [[32894,0],0] ACTIVATE JOB [32894,1] STATE PENDING MAPPING AT base/plm_base_launch_support.c:307
[mpi021:16094] [[32894,0],0] ACTIVATING JOB [32894,1] STATE PENDING MAPPING PRI 4
[mpi022:09348] [[32894,0],1] orted:comm:process_commands() Processing Command: Unknown Command!
[mpi022:09348] [[32894,0],1] orted_cmd: received pass_node_info
[mpi021:16094] [[32894,0],0] ACTIVATE JOB [32894,1] STATE MAP COMPLETE AT base/rmaps_base_map_job.c:479
[mpi021:16094] [[32894,0],0] ACTIVATING JOB [32894,1] STATE MAP COMPLETE PRI 4
[mpi021:16094] [[32894,0],0] ACTIVATE JOB [32894,1] STATE PENDING FINAL SYSTEM PREP AT base/plm_base_launch_support.c:337
[mpi021:16094] [[32894,0],0] ACTIVATING JOB [32894,1] STATE PENDING FINAL SYSTEM PREP PRI 4
[mpi021:16094] [[32894,0],0] ACTIVATE JOB [32894,1] STATE PENDING APP LAUNCH AT base/plm_base_launch_support.c:564
[mpi021:16094] [[32894,0],0] ACTIVATING JOB [32894,1] STATE PENDING APP LAUNCH PRI 4
[mpi021:16094] [[32894,0],0] ACTIVATE JOB [32894,1] STATE SENDING LAUNCH MSG AT base/odls_base_default_fns.c:133
[mpi021:16094] [[32894,0],0] ACTIVATING JOB [32894,1] STATE SENDING LAUNCH MSG PRI 4
[mpi021:16094] [[32894,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_ADD_LOCAL_PROCS
[mpi021:16094] [[32894,0],0] orted_cmd: received add_local_procs
[mpi022:09348] [[32894,0],1] orted:comm:process_commands() Processing Command: ORTE_DAEMON_ADD_LOCAL_PROCS
[mpi022:09348] [[32894,0],1] orted_cmd: received add_local_procs
[mpi021:16094] [[32894,0],0] ACTIVATE PROC [[32894,1],0] STATE RUNNING AT base/odls_base_default_fns.c:1047
[mpi021:16094] [[32894,0],0] ACTIVATING PROC [[32894,1],0] STATE RUNNING PRI 4
[mpi021:16094] [[32894,0],0] state:base:track_procs called for proc [[32894,1],0] state RUNNING
[mpi021:16094] [[32894,0],0] ACTIVATE PROC [[32894,1],0] STATE IOF COMPLETE AT iof_hnp_read.c:308
[mpi021:16094] [[32894,0],0] ACTIVATING PROC [[32894,1],0] STATE IOF COMPLETE PRI 4
mpi021: 0 rank
[mpi021:16094] [[32894,0],0] ACTIVATE PROC [[32894,1],0] STATE WAITPID FIRED AT base/odls_base_default_fns.c:1741
[mpi021:16094] [[32894,0],0] ACTIVATING PROC [[32894,1],0] STATE WAITPID FIRED PRI 4
[mpi021:16094] [[32894,0],0] state:base:track_procs called for proc [[32894,1],0] state IOF COMPLETE
[mpi021:16094] [[32894,0],0] state:base:track_procs called for proc [[32894,1],0] state WAITPID FIRED
[mpi021:16094] [[32894,0],0] ACTIVATE PROC [[32894,1],0] STATE NORMALLY TERMINATED AT base/state_base_fns.c:697
[mpi021:16094] [[32894,0],0] ACTIVATING PROC [[32894,1],0] STATE NORMALLY TERMINATED PRI 4
[mpi021:16094] [[32894,0],0] state:base:track_procs called for proc [[32894,1],0] state NORMALLY TERMINATED
[mpi021:16094] [[32894,0],0] state:base:cleanup_node on proc [[32894,1],0]

@rhc54
Contributor Author

rhc54 commented Mar 18, 2019

Think I got this now - managed to restore my Vboxes and verify it here. Please give it a spin.

@jsquyres
Member

Yay!

$ mpirun --host mpi021,mpi022 ./foo.sh | sort
mpi021: 0 rank
mpi022: 1 rank
$ mpirun --host mpi022,mpi021 ./foo.sh | sort
mpi021: 1 rank
mpi022: 0 rank

Thank you!

@rhc54
Contributor Author

rhc54 commented Mar 18, 2019

@jsquyres Looks like the Cray host is timing out. Otherwise it looks okay.

@jsquyres
Member

bot:ompi:retest

Successfully merging this pull request may close these issues.

Question about rank ordering of processes