Question about rank ordering of processes #6298

Closed
@jsquyres

We've noticed a difference in rank ordering behavior. It is easiest to describe by running the following two mpirun commands in four different cases (two Open MPI versions, each launched from two different places):

$ cat foo.sh
#!/bin/sh
echo "`hostname`: MCW rank $OMPI_COMM_WORLD_RANK"
$ mpirun --host aaa,bbb ./foo.sh
[...output 1...]
$ mpirun --host bbb,aaa ./foo.sh
[...output 2...]

Case 1: OMPI v2.1.x + localhost

  • Open MPI v2.1.x
  • When launching mpirun from machine aaa (i.e., when launching on localhost)

In this case, the two outputs are:

# Output 1
aaa: MCW rank 0
bbb: MCW rank 1
# Output 2
aaa: MCW rank 1
bbb: MCW rank 0

Notice that the order of MCW ranks follows the order of the hosts in the --host argument.

Case 2: OMPI v2.1.x + no localhost

  • Open MPI v2.1.x
  • When launching mpirun from a 3rd machine (i.e., when not launching on localhost):

In this case, the two outputs are:

# Output 1
aaa: MCW rank 0
bbb: MCW rank 1
# Output 2
aaa: MCW rank 1
bbb: MCW rank 0

Notice that -- just like case 1 -- the order of MCW ranks follows the order of the hosts in the --host argument.

Case 3: OMPI v3.0.x + localhost

  • Open MPI v3.0.x and beyond
  • When launching mpirun from machine aaa (i.e., when launching on localhost)

In this case, the two outputs are:

# Output 1
aaa: MCW rank 0
bbb: MCW rank 1
# Output 2
aaa: MCW rank 0
bbb: MCW rank 1

Notice that the order of MCW ranks does not follow the order of the hosts in the --host argument -- it stays constant.

Case 4: OMPI v3.0.x + no localhost

  • Open MPI v3.0.x and beyond
  • When launching mpirun from a 3rd machine (i.e., when not launching on localhost):

In this case, the two outputs are:

# Output 1
aaa: MCW rank 0
bbb: MCW rank 1
# Output 2
aaa: MCW rank 1
bbb: MCW rank 0

Notice that -- just like cases 1 and 2, but unlike case 3 -- the order of MCW ranks follows the order of the hosts in the --host argument.


Do we know / remember if case 3 is intentional?

We ask because:

  • the behavior changed from v2.1.x to v3.0.x (and beyond)
  • the behavior is different depending on whether localhost is in the --host list or not (which, if this was a deliberate change in behavior, seems odd)

...or is it simply not guaranteed that rank ordering follows the order of the hosts in --host? I.e., are cases 1, 2, and 4 just happenstance?
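
For anyone who wants to check the four cases mechanically, here is a minimal sketch of a helper. The function name follows_host_order is hypothetical (it is not part of Open MPI), and it assumes one process per host and the exact "host: MCW rank N" line format that foo.sh prints above.

```shell
#!/bin/sh
# Hypothetical helper, not part of Open MPI.
# Reads "host: MCW rank N" lines on stdin (the foo.sh output format) and
# succeeds only if the hosts, ordered by rank, match the comma-separated
# host list given as $1. Assumes one process per host.
follows_host_order() {
    # Expected order: the --host list, one host per line.
    expected=$(echo "$1" | tr ',' '\n')
    # Actual order: sort the output numerically by rank (4th field),
    # then keep only the hostname before the colon.
    actual=$(sort -t' ' -k4,4n | cut -d: -f1)
    [ "$expected" = "$actual" ]
}
```

It could be used like this: `mpirun --host bbb,aaa ./foo.sh | follows_host_order bbb,aaa && echo "follows --host order"`. Sorting by rank first means the result does not depend on the order in which mpirun happens to forward the output lines.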

FYI @bturrubiates
