Ensure that nodes are always used in order provided #6493
If a user provides a list of nodes to use via -host or -hostfile, then ensure that the ranks are placed according to that order. Also fix a bug where the number of slots on a node was incorrectly computed for localhost if the name given didn't exactly match the return from get_hostname. Signed-off-by: Ralph Castain <[email protected]>
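The ordering guarantee described above can be illustrated with a minimal sketch. This is not Open MPI's mapper code; `place_ranks` and `slots_per_host` are hypothetical names chosen for illustration, and the sketch assumes every host has the same number of slots.

```python
def place_ranks(hosts, slots_per_host=1):
    """Assign consecutive ranks to hosts in the order the user listed them.

    Hypothetical helper for illustration only; not part of Open MPI.
    """
    placement = []
    rank = 0
    for host in hosts:  # walk the list as given; never sort or dedupe-reorder it
        for _ in range(slots_per_host):
            placement.append((rank, host))
            rank += 1
    return placement


# Swapping the order of the hosts swaps which host receives rank 0.
forward = place_ranks(["mpi021", "mpi022"])
reverse = place_ranks(["mpi022", "mpi021"])
```

Under these assumptions, `forward` puts rank 0 on `mpi021` and `reverse` puts rank 0 on `mpi022`, which is the behavior a user would expect from `--host mpi021,mpi022` versus `--host mpi022,mpi021`.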
Does this PR also fix issue #4327? It sounds related, so I thought I would ask.
This sorta works. In one case, everything appears fine:
But if I switch the order, it hangs:
Notice that both of them got an It looks like the I gdb-attached to the orted -- it's just in the event loop -- not a whole lot of detail you can glean from that:
I ran with `--mca state_base_verbose 100`; here's the output I got:
I see the problem: the two daemons both think they are launching rank 1, so nobody reports rank 0 as "complete". Obviously, we need the two to get the same mapping answer. I'll play a little with it, but I can't promise an immediate fix.
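A toy model of that failure mode, assuming each daemon derives its own rank from a locally ordered node list. The thread does not say why the two views differed, so the two orderings below are hypothetical, as are the names `map_ranks` and `local_rank`.

```python
def map_ranks(node_order):
    """Map each node to a rank according to its position in the list."""
    return {node: rank for rank, node in enumerate(node_order)}


def local_rank(me, node_order):
    """Rank that the daemon on `me` believes it must launch."""
    return map_ranks(node_order)[me]


# Hypothetical divergent views: each daemon lists the *other* node first.
view_on_mpi021 = ["mpi022", "mpi021"]
view_on_mpi022 = ["mpi021", "mpi022"]

launched = {
    local_rank("mpi021", view_on_mpi021),
    local_rank("mpi022", view_on_mpi022),
}
missing = {0, 1} - launched  # ranks that no daemon ever reports

# With one shared ordering, every rank is claimed exactly once.
shared = ["mpi021", "mpi022"]
launched_shared = {local_rank(node, shared) for node in shared}
```

In this sketch both daemons claim rank 1 and rank 0 is never reported, so anything waiting for all ranks to report would wait forever. Feeding both daemons the same ordered list removes the conflict.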
Signed-off-by: Ralph Castain <[email protected]>
Signed-off-by: Ralph Castain <[email protected]>
@jsquyres Could you please give this a try again? I'm afraid my VirtualBox machines are down at the moment due to an upgrade, so I can't verify it.
I'm afraid I get hangs now, in both the ssh and SLURM cases. Output from `mpirun --mca state_base_verbose 100 --debug-daemons --host mpi021,mpi022 ./foo.sh`, ssh run (no SLURM):
Signed-off-by: Ralph Castain <[email protected]>
Think I got this now. I managed to restore my Vboxes and verify it here. Please give it a spin.
Yay!
Thank you!
@jsquyres Looks like the Cray host is timing out. Otherwise looks okay.
bot:ompi:retest
If a user provides a list of nodes to use via -host or -hostfile, then
ensure that the ranks are placed according to that order. Also fix a bug
where the number of slots on a node was incorrectly computed for
localhost if the name given didn't exactly match the return from
get_hostname.
Fixes #6298
Signed-off-by: Ralph Castain [email protected]