Commit bf3980d

fix hang in -np 3 --rank-by core
The following command hangs:

% mpirun --rank-by core -np 3 --report-bindings hostname

because of a loop where i is supposed to cycle through an array of size num_objs, but for some reason it's only looking at node->num_procs entries.

I changed the counter so it stays in the loop (stays on this node) until it makes a full cycle through the array of objects without any assignments then it ends the loop so it can go to the next node.

Signed-off-by: Mark Allen <[email protected]>
1 parent bdd92a7 commit bf3980d

1 file changed: +20 -1 lines changed
orte/mca/rmaps/base/rmaps_base_ranking.c

Lines changed: 20 additions & 1 deletion
@@ -378,8 +378,25 @@ static int rank_by(orte_job_t *jdata,
              * Perhaps someday someone will come up with a more efficient
              * algorithm, but this works for now.
              */
+            // In 3.x this was two loops:
+            //     while (cnt < app->num_procs)
+            //         for (i=0; i<num_objs; ...)
+            // Then in 4.x it switched to
+            //     while (cnt < app->num_procs && i < (int)node->num_procs)
+            // where that extra i part seems wrong to me.  First of all if anything
+            // it seems like it should be i<num_objs since that's the array i is
+            // cycling through, but even then all the usage of i below is
+            // (i % num_objs) so I think i is intended to wrap and you should
+            // keep looping until you've made all the assignments you can for
+            // this node.
+            //
+            // So that's what I added the other loop counter for, figuring if it
+            // cycles through the whole array of objs without making an assignment
+            // it's time for this loop to end and the outer loop to take us to the
+            // next node.
             i = 0;
-            while (cnt < app->num_procs && i < (int)node->num_procs) {
+            int niters_of_i_without_assigning_a_proc = 0;
+            while (cnt < app->num_procs && niters_of_i_without_assigning_a_proc <= num_objs) {
                 /* get the next object */
                 obj = (hwloc_obj_t)opal_pointer_array_get_item(&objs, i % num_objs);
                 if (NULL == obj) {
@@ -447,6 +464,7 @@ static int rank_by(orte_job_t *jdata,
                         return rc;
                     }
                     num_ranked++;
+                    niters_of_i_without_assigning_a_proc = 0;
                     /* track where the highest vpid landed - this is our
                      * new bookmark
                      */
@@ -455,6 +473,7 @@ static int rank_by(orte_job_t *jdata,
                     break;
                 }
                 i++;
+                ++niters_of_i_without_assigning_a_proc;
             }
         }
         /* cleanup */
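
The termination rule this change introduces can be tried out in isolation. The sketch below is not the ORTE code: `rank_round_robin`, `pending`, and `order` are hypothetical names, and a plain integer array stands in for the hwloc objects and their unranked procs. It only mirrors the shape of the patched loop: `i` wraps through the objects via `i % num_objs`, the idle counter is reset on every assignment and incremented at the end of every iteration, and the loop ends once a full pass over the objects has produced no assignment.

```c
/* Simplified model of the patched loop (hypothetical names, not ORTE API).
 *
 * pending[k]  : number of procs on object k that still need a rank
 * num_objs    : number of objects on this node
 * total_procs : number of procs the app wants ranked overall
 * order[]     : receives the object index chosen for each assigned rank
 *
 * Returns how many procs were ranked on this node. */
int rank_round_robin(int *pending, int num_objs, int total_procs, int *order)
{
    int cnt = 0;                    /* procs ranked so far */
    int i = 0;                      /* wraps through the objects via i % num_objs */
    int niters_without_assign = 0;  /* iterations since the last assignment */

    while (cnt < total_procs && niters_without_assign <= num_objs) {
        int slot = i % num_objs;

        if (pending[slot] > 0) {
            /* assign the next rank to a proc on this object */
            pending[slot]--;
            order[cnt] = slot;
            cnt++;
            niters_without_assign = 0;
        }

        i++;
        ++niters_without_assign;
    }

    return cnt;
}
```

With `pending = {2, 0, 1}`, three objects, and five procs requested, the sketch assigns ranks to objects 0, 2, 0 and then returns 3 after one further pass with no assignment, which is the point where the real outer loop would move on to the next node.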
