Heck if I know - the consensus has changed over the years. IIRC, the last time we went around on this, I believe we decided that the ordering should follow the --host list. However, we always got in knots over the various cases (when resources are managed, a hostfile is provided, etc.).
The problem is that the ordering can be really important when you are on clusters with topological fabrics. Most users don't know how the nodes sit on the topology, but the scheduler does and assigns the hosts in the required order for best performance. In those cases, you really only want -host to act as a filter and not necessarily specify the ordering.
That said, we have had users complain about that behavior too, regardless of the possible performance impact. What you probably really need is a different "marker" in the -host option that indicates a request for rigid ordering. We already have markers for empty nodes, so adding another marker to indicate rigid ordering shouldn't be too hard. You then just need to ensure that the node ordering on the list of available nodes (as constructed in rmaps_base_support_fns.c) matches the requested ordering so the map gets constructed correctly.
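To make the two behaviors concrete, here is a minimal sketch of the difference between treating the host list as a filter and treating it as a rigid ordering. It is not Open MPI's implementation (the real logic lives in C, in rmaps_base_support_fns.c); the function name `order_nodes`, the `rigid` flag standing in for the proposed marker, and the host names other than machineaaa are all hypothetical.

```python
def order_nodes(allocated, requested, rigid=False):
    """Return the nodes to map ranks onto, in mapping order.

    allocated: node names in the order the scheduler assigned them.
    requested: node names in the order given on the host list.
    rigid:     hypothetical stand-in for a "rigid ordering" marker.
    """
    if rigid:
        # Rigid mode: follow the requested order, keeping only
        # nodes that are actually part of the allocation.
        available = set(allocated)
        return [node for node in requested if node in available]
    # Filter mode: keep the scheduler's order and drop any node
    # that was not requested.
    wanted = set(requested)
    return [node for node in allocated if node in wanted]


# Hypothetical allocation and host list, for illustration only.
allocated = ["machineaaa", "machinebbb", "machineccc"]
requested = ["machinebbb", "machineaaa"]

print(order_nodes(allocated, requested))              # scheduler order kept
print(order_nodes(allocated, requested, rigid=True))  # requested order kept
```

In filter mode the ranks would land on machineaaa first regardless of how the host list is written; in rigid mode they would land on machinebbb first. That is the distinction the marker would have to control.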
We've noticed a difference in rank ordering behavior. It is easiest to describe this using 2 variations of the following 2 examples (i.e., 4 different cases):
Case 1: OMPI v2.1.x + localhost
Run mpirun from machineaaa (i.e., when launching on localhost). In this case, the two outputs are:
Notice that the order of MCW ranks follows the order of the hosts in the --host argument.

Case 2: OMPI v2.1.x + no localhost
Run mpirun from a 3rd machine (i.e., when not launching on localhost). In this case, the two outputs are:
Notice that -- just like case 1 -- the order of MCW ranks follows the order of the hosts in the --host argument.

Case 3: OMPI v3.0.x + localhost
Run mpirun from machineaaa (i.e., when launching on localhost). In this case, the two outputs are:
Notice that the order of MCW ranks does not follow the order of the hosts in the --host argument -- it stays constant.

Case 4: OMPI v3.0.x + no localhost
Run mpirun from a 3rd machine (i.e., when not launching on localhost). In this case, the two outputs are:
Notice that -- just like cases 1 and 2, but unlike case 3 -- the order of MCW ranks follows the order of the hosts in the --host argument.

Do we know / remember if case 3 is intentional?
We ask because the behavior is inconsistent as to whether rank ordering follows the --host list or not (which, if this was a deliberate change in behavior, seems odd).
...or is rank ordering according to the ordering of hosts in --host not guaranteed? I.e., are cases 1, 2, and 4 just happenstance?

FYI @bturrubiates