Description
We've noticed a difference in MPI_COMM_WORLD (MCW) rank ordering behavior between Open MPI v2.1.x and v3.0.x (and beyond). It is easiest to describe using 2 variations of the following 2 examples (i.e., 4 different cases):
$ cat foo.sh
#!/bin/sh
echo "`hostname`: MCW rank $OMPI_COMM_WORLD_RANK"
$ mpirun --host aaa,bbb ./foo.sh
[...output 1...]
$ mpirun --host bbb,aaa ./foo.sh
[...output 2...]
Case 1: OMPI v2.1.x + localhost
- Open MPI v2.1.x
- When launching mpirun from machine aaa (i.e., when launching on localhost)
In this case, the two outputs are:
# Output 1
aaa: MCW rank 0
bbb: MCW rank 1
# Output 2
aaa: MCW rank 1
bbb: MCW rank 0
Notice that the order of MCW ranks follows the order of the hosts in the --host argument.
Case 2: OMPI v2.1.x + no localhost
- Open MPI v2.1.x
- When launching mpirun from a 3rd machine (i.e., when not launching on localhost)
In this case, the two outputs are:
# Output 1
aaa: MCW rank 0
bbb: MCW rank 1
# Output 2
aaa: MCW rank 1
bbb: MCW rank 0
Notice that, just like case 1, the order of MCW ranks follows the order of the hosts in the --host argument.
Case 3: OMPI v3.0.x + localhost
- Open MPI v3.0.x and beyond
- When launching mpirun from machine aaa (i.e., when launching on localhost)
In this case, the two outputs are:
# Output 1
aaa: MCW rank 0
bbb: MCW rank 1
# Output 2
aaa: MCW rank 0
bbb: MCW rank 1
Notice that the order of MCW ranks does not follow the order of the hosts in the --host argument: it stays constant.
Case 4: OMPI v3.0.x + no localhost
- Open MPI v3.0.x and beyond
- When launching mpirun from a 3rd machine (i.e., when not launching on localhost)
In this case, the two outputs are:
# Output 1
aaa: MCW rank 0
bbb: MCW rank 1
# Output 2
aaa: MCW rank 1
bbb: MCW rank 0
Notice that, just like cases 1 and 2 but unlike case 3, the order of MCW ranks follows the order of the hosts in the --host argument.
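In case it helps anyone reproduce this, here is a minimal sketch of a helper that checks whether the rank order in the output of foo.sh matches the order given to --host. The function names hosts_by_rank and follows_host_order are hypothetical (they are not part of Open MPI), and the sketch assumes one process per host and that hostname prints the same short names that were passed to --host.

```sh
#!/bin/sh
# Hypothetical helpers, not part of Open MPI.
# Assumes input lines of the form "<host>: MCW rank <n>", one process per host.

# Read foo.sh output on stdin and print the hostnames as a
# comma-separated list, ordered by ascending MCW rank.
hosts_by_rank() {
    sed -n 's/^\([^:]*\): MCW rank \([0-9][0-9]*\)$/\2 \1/p' |
        sort -n |
        awk '{ printf "%s%s", (NR > 1 ? "," : ""), $2 } END { print "" }'
}

# Succeed (exit status 0) if the rank order read from stdin matches the
# comma-separated host list given as $1, i.e. the --host argument.
follows_host_order() {
    [ "$(hosts_by_rank)" = "$1" ]
}
```

With these functions defined, something like the following should report whether a given run follows the --host ordering:

mpirun --host bbb,aaa ./foo.sh | follows_host_order bbb,aaa && echo "ranks follow --host order"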
Do we know / remember if case 3 is intentional?
We ask because:
- the behavior changed from v2.1.x to v3.0.x (and beyond)
- the behavior is different depending on whether localhost is in the --host list or not (which, if this was a deliberate change in behavior, seems odd)
...or is rank ordering according to the ordering of hosts in --host not guaranteed? I.e., are cases 1, 2, and 4 just happenstance?
FYI @bturrubiates