Error launching under slurm (Out of resource) #11371

Closed
gkatev opened this issue Feb 2, 2023 · 41 comments · Fixed by openpmix/prrte#1669

@gkatev
Contributor

gkatev commented Feb 2, 2023

Hi, I've been unable to start MPI jobs under slurm reservations with the latest main.
I'm under salloc -N 1 -n 48, and the message is:

$ mpirun -n 1 hostname
--------------------------------------------------------------------------
Your job failed to map. Either no mapper was available, or none
of the available mappers was able to perform the requested
mapping operation.

  Mapper result:       Out of resource
  Application:         hostname
  #procs to be mapped: 1
  Mapping policy:      BYSLOT
  Binding policy:      CORE

--------------------------------------------------------------------------

This happens with 1 as well as with 2 nodes in the reservation. It also doesn't work in 5.0.x, but in 5.0.0rc8 all is well. It doesn't happen when not under slurm.

I tried to chase it down a bit:

$ mpirun -n 1 --prtemca rmaps_base_verbose 10 --display alloc --output tag hostname
[deepv:02593] mca: base: component_find: searching NULL for rmaps components
[deepv:02593] mca: base: find_dyn_components: checking NULL for rmaps components
[deepv:02593] pmix:mca: base: components_register: registering framework rmaps components
[deepv:02593] pmix:mca: base: components_register: found loaded component ppr
[deepv:02593] pmix:mca: base: components_register: component ppr register function successful
[deepv:02593] pmix:mca: base: components_register: found loaded component rank_file
[deepv:02593] pmix:mca: base: components_register: component rank_file has no register or open function
[deepv:02593] pmix:mca: base: components_register: found loaded component round_robin
[deepv:02593] pmix:mca: base: components_register: component round_robin register function successful
[deepv:02593] pmix:mca: base: components_register: found loaded component seq
[deepv:02593] pmix:mca: base: components_register: component seq register function successful
[deepv:02593] [prterun-deepv-2593@0,0] rmaps:base set policy with slot
[deepv:02593] mca: base: components_open: opening rmaps components
[deepv:02593] mca: base: components_open: found loaded component ppr
[deepv:02593] mca: base: components_open: component ppr open function successful
[deepv:02593] mca: base: components_open: found loaded component rank_file
[deepv:02593] mca: base: components_open: found loaded component round_robin
[deepv:02593] mca: base: components_open: component round_robin open function successful
[deepv:02593] mca: base: components_open: found loaded component seq
[deepv:02593] mca: base: components_open: component seq open function successful
[deepv:02593] mca:rmaps:select: checking available component ppr
[deepv:02593] mca:rmaps:select: Querying component [ppr]
[deepv:02593] mca:rmaps:select: checking available component rank_file
[deepv:02593] mca:rmaps:select: Querying component [rank_file]
[deepv:02593] mca:rmaps:select: checking available component round_robin
[deepv:02593] mca:rmaps:select: Querying component [round_robin]
[deepv:02593] mca:rmaps:select: checking available component seq
[deepv:02593] mca:rmaps:select: Querying component [seq]
[deepv:02593] [prterun-deepv-2593@0,0]: Final mapper priorities
[deepv:02593] 	Mapper: rank_file Priority: 100
[deepv:02593] 	Mapper: ppr Priority: 90
[deepv:02593] 	Mapper: seq Priority: 60
[deepv:02593] 	Mapper: round_robin Priority: 10

======================   ALLOCATED NODES   ======================
    dp-dam01: slots=48 max_slots=0 slots_inuse=0 state=UP
	Flags: DAEMON_LAUNCHED:SLOTS_GIVEN
	aliases: 10.2.10.41,10.2.17.81
=================================================================
[deepv:02593] mca:rmaps: mapping job prterun-deepv-2593@1
[deepv:02593] mca:rmaps: setting mapping policies for job prterun-deepv-2593@1 inherit TRUE hwtcpus FALSE
[deepv:02593] mca:rmaps mapping given by MCA param
[deepv:02593] mca:rmaps[540] default binding policy given
[deepv:02593] mca:rmaps:rf: job prterun-deepv-2593@1 not using rankfile policy
[deepv:02593] mca:rmaps:ppr: job prterun-deepv-2593@1 not using ppr mapper PPR NULL policy PPR NOTSET
[deepv:02593] [prterun-deepv-2593@0,0] rmaps:seq called on job prterun-deepv-2593@1
[deepv:02593] mca:rmaps:seq: job prterun-deepv-2593@1 not using seq mapper
[deepv:02593] mca:rmaps:rr: mapping job prterun-deepv-2593@1
[deepv:02593] [prterun-deepv-2593@0,0] Starting with 1 nodes in list
[deepv:02593] [prterun-deepv-2593@0,0] Filtering thru apps
[deepv:02593] [prterun-deepv-2593@0,0] Retained 1 nodes in list
[deepv:02593] [prterun-deepv-2593@0,0] node dp-dam01 has 48 slots available
[deepv:02593] AVAILABLE NODES FOR MAPPING:
[deepv:02593]     node: dp-dam01 daemon: 1 slots_available: 48
[deepv:02593] mca:rmaps:rr: mapping by slot for job prterun-deepv-2593@1 slots 48 num_procs 1
[deepv:02593] mca:rmaps:rr:slot working node dp-dam01
[deepv:02593] [prterun-deepv-2593@0,0] get_avail_ncpus: node dp-dam01 has 0 procs on it
[deepv:02593] mca:rmaps:rr:slot job prterun-deepv-2593@1 is oversubscribed - performing second pass
[deepv:02593] mca:rmaps:rr:slot working node dp-dam01
[deepv:02593] [prterun-deepv-2593@0,0] get_avail_ncpus: node dp-dam01 has 0 procs on it
--------------------------------------------------------------------------
Your job failed to map. Either no mapper was available, or none
of the available mappers was able to perform the requested
mapping operation.

  Mapper result:       Out of resource
  Application:         hostname
  #procs to be mapped: 1
  Mapping policy:      BYSLOT
  Binding policy:      CORE

--------------------------------------------------------------------------

It looks to me like the failure starts happening because prte_rmaps_base_get_ncpus() returns 0. These debug prints:

diff --git a/src/mca/rmaps/base/rmaps_base_support_fns.c b/src/mca/rmaps/base/rmaps_base_support_fns.c
index 8a2974a90f..c345c2e727 100644
--- a/src/mca/rmaps/base/rmaps_base_support_fns.c
+++ b/src/mca/rmaps/base/rmaps_base_support_fns.c
@@ -668,6 +668,7 @@ int prte_rmaps_base_get_ncpus(prte_node_t *node,
     int ncpus;
 
 #if HWLOC_API_VERSION < 0x20000
+    printf("HWLOC_API_VERSION < 0x20000\n");
     hwloc_obj_t root;
     root = hwloc_get_root_obj(node->topology->topo);
     if (NULL == options->job_cpuset) {
@@ -679,6 +680,7 @@ int prte_rmaps_base_get_ncpus(prte_node_t *node,
         hwloc_bitmap_and(prte_rmaps_base.available, prte_rmaps_base.available, obj->allowed_cpuset);
     }
 #else
+    printf("HWLOC_API_VERSION >= 0x20000\n");
     if (NULL == options->job_cpuset) {
         hwloc_bitmap_copy(prte_rmaps_base.available, hwloc_topology_get_allowed_cpuset(node->topology->topo));
     } else {
diff --git a/src/mca/rmaps/round_robin/rmaps_rr_mappers.c b/src/mca/rmaps/round_robin/rmaps_rr_mappers.c
index 484449ce7a..b3e631fea6 100644
--- a/src/mca/rmaps/round_robin/rmaps_rr_mappers.c
+++ b/src/mca/rmaps/round_robin/rmaps_rr_mappers.c
@@ -123,6 +123,7 @@ pass:
          * the user didn't specify a required binding, then we set
          * the binding policy to do-not-bind for this node */
         ncpus = prte_rmaps_base_get_ncpus(node, NULL, options);
+        printf("prte_rmaps_base_get_ncpus() = %d\n", ncpus);
         if (options->nprocs > ncpus &&
             options->nprocs <= node->slots_available &&
             !PRTE_BINDING_POLICY_IS_SET(jdata->map->binding)) {

Produce:

prte_rmaps_base_get_ncpus() = 0
HWLOC_API_VERSION >= 0x20000
@rhc54
Contributor

rhc54 commented Feb 2, 2023

Try this while under the Slurm allocation: mpirun --prtemca plm ssh -n 1 hostname. Also, what version of Slurm are you using?

@gkatev
Contributor Author

gkatev commented Feb 2, 2023

Do we expect this command to hang? (It does.) I'm using my cluster's system-wide version, which as far as I can see is 22.05.7.

@gkatev
Contributor Author

gkatev commented Feb 2, 2023

I realized I had map-by=slot and bind-to=core set in an env var. Apparently without these params the problem does not appear. But I don't suppose this combination is "illegal"?

Revised effects without any env vars:

$ mpirun -n 1 --map-by slot --bind-to core hostname
--------------------------------------------------------------------------
Your job failed to map. Either no mapper was available, or none
of the available mappers was able to perform the requested
mapping operation.

  Mapper result:       Out of resource
  Application:         hostname
  #procs to be mapped: 1
  Mapping policy:      BYSLOT
  Binding policy:      CORE

--------------------------------------------------------------------------

$ mpirun -n 1 --bind-to core hostname
--------------------------------------------------------------------------
Your job failed to map. Either no mapper was available, or none
of the available mappers was able to perform the requested
mapping operation.

  Mapper result:       Out of resource
  Application:         hostname
  #procs to be mapped: 1
  Mapping policy:      BYCORE
  Binding policy:      CORE

--------------------------------------------------------------------------

$ mpirun -n 1 --map-by slot hostname
/* all good */

$ mpirun -n 1 hostname
/* all good */

Still stands that it appears in main and 5.0.x but not 5.0.0rc8.

Edit: When running the second command above (mpirun -n 1 --bind-to core hostname), that "HWLOC_API_VERSION >= 0x20000" debug print I had added appears in the output 2306 times!

@rhc54
Contributor

rhc54 commented Feb 2, 2023

If you specify the binding, then we require that the binding be done - i.e., if you say --bind-to core, then we error out if we cannot do it for some reason. If you don't specify the binding, then the default binding becomes optional - so if we can't bind for some reason, we still go ahead and map (we just leave the resulting procs unbound).

My best guess here is that Slurm is assigning you to some place where we cannot bind your procs to cores. Perhaps there are only hwthreads and no cores? If you do --map-by slot:hwthreadcpus --bind-to hwthread, does it work?

I don't have access to a Slurm machine, but we aren't hearing of any problems from people who do - so I'm thinking there might be something about this Slurm setup that is causing the problem. I asked for the version because I was just contacted by SchedMD about a bug in one of their releases that causes OMPI some issues, but your version doesn't match so that isn't the cause.

@gkatev
Contributor Author

gkatev commented Feb 2, 2023

I see. Perhaps I also need to contact the system's maintainers; I'm not super familiar with slurm either.

The hwthread command does yield an improvement, but I can only get at most 20 processes (which seems arbitrary; the system has 48 cores / 96 threads across 2 sockets):

$ mpirun -n 1 --map-by slot:hwtcpus --bind-to hwthread hostname
HWLOC_API_VERSION >= 0x20000
prte_rmaps_base_get_ncpus() = 20
HWLOC_API_VERSION >= 0x20000
/* All good */

$ mpirun -n 21 --map-by slot:hwtcpus --bind-to hwthread hostname
HWLOC_API_VERSION >= 0x20000
prte_rmaps_base_get_ncpus() = 20
HWLOC_API_VERSION >= 0x20000
HWLOC_API_VERSION >= 0x20000
prte_rmaps_base_get_ncpus() = 0
HWLOC_API_VERSION >= 0x20000
/* Out of resource */

I will also try out a couple different salloc parameters (e.g. cpus instead of tasks, or something like that) to see if I spot an improvement.

@rhc54
Contributor

rhc54 commented Feb 2, 2023

Could be some kind of cgroup setting - we are only seeing 20 hwthreads in the allocation. Sounds suspicious.

@naughtont3
Contributor

Not sure this fits all of your issue, but here are a few items we check/use for running prte/ompi within a local slurm system.

1. Configure/build: ensure you configured with `--enable-slurm`.

2. Ensure the run uses the slurm PLM and RAS modules:

       export PRTE_MCA_ras=slurm
       export PRTE_MCA_plm=slurm

       # *** or directly on the command line ***
       mpirun \
           --prtemca plm slurm \
           --prtemca ras slurm \
           ...

3. Setting the following options may also be helpful:

       # Useful to ensure all cores in the allocation can be used
       # when running mpirun inside the allocation
       export PRTE_MCA_ras_slurm_use_entire_allocation=1

       # Needed for VNI enabled Cray XE SS11 system
       export PRTE_MCA_ras_base_launch_orted_on_hn=1

       # *** or directly on the command line ***
       mpirun \
           --prtemca ras_slurm_use_entire_allocation 1 \
           --prtemca ras_base_launch_orted_on_hn 1 \
           ...

4. SLURM allocation option for reserving service cores, which reduces the number of cores available for apps. The -S n option controls this; e.g., CRUSHER @ OLCF currently defaults to -S 8 (56 cores).

       salloc -S 0 ... # do not reserve any cores for system noise
       salloc -S 1 ... # reserve 1 core
       salloc -S 8 ... # reserve 8 cores (one per l3cache) [default]

   Look at SLURM_JOB_CPUS_PER_NODE inside the allocation to better understand what you have available. This value is used for #slots when using the ras_slurm_use_entire_allocation=1 option.

5. SLURM allocation option for hwthread mapping/usage: if you intend to use hwthreads for apps (e.g., --map-by hwthread), add --threads-per-core to your allocation request. For example:

       salloc -N 2 -t 10 -A $MYACCT -S 0 --threads-per-core 2

6. Useful debug options:

       --prtemca ras_base_verbose 20           # See that #slots matches SLURM_JOB_CPUS_PER_NODE
       --prtemca plm_base_verbose 20           # Ensure slurm is selected

@gkatev
Contributor Author

gkatev commented Feb 3, 2023

Regarding the initial claim about the versions, there was a mistake in my 5.0.0rc8 installation. So after clearing that up, it is in fact 5.0.0rc7 that works fine.

5.0.0rc8 also errors out, even without the bind-to param (maybe the default there was to bind to core?), but with a different message:

$ 5.0.0.rc8/mpirun -n 1 hostname
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------

$ 5.0.0.rc8/mpirun -n 1 --map-by slot hostname
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------

$ 5.0.0.rc8/mpirun -n 1 --bind-to none hostname
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------

5.0.x and main also error out, as described above (no change)

Do we know how this variable (5.0.0rc7 vs 5.0.0rc8) might play a role in this? I see a bunch of changes in prte_rmaps_rr_byslot (assuming this is indeed what's used in these scenarios) between rc7 and rc8.

@rhc54
Contributor

rhc54 commented Feb 3, 2023

I know that those release candidates are quite stale (in terms of PRRTE), and so I can't be sure you aren't just hitting old problems that have already been fixed. Likewise, OMPI main was just updated yesterday (IIRC - it was in the last day or two).

Are you using the head of OMPI main as of today? If not, it might be worth updating so we know we are all looking at the same thing. You might also want to go into the 3rd-party openpmix and prrte submodule directories and do "git checkout master" followed by "git pull" on each of them, just to ensure you have the latest of both code bases.

Please be sure to configure with --enable-debug.

The error message is really suspicious to me. For whatever reason, you don't seem to have a usable allocation. Let's try with an updated OMPI main (as per above) and see what you get. If it fails, then add "--prtemca rmaps_base_verbose 100" to the mpirun cmd line.

@gkatev
Contributor Author

gkatev commented Feb 3, 2023

I see. In the above I had prrte @ dc6ccf6 and openpmix @ 415d704 (a few days old). I made a new build with the latest prrte/openpmix master.

I'm now at prrte @ 081890a, openpmix @ 0818181 (debug build); the problem remains.

$ mpirun -n 1 --map-by slot --bind-to core --prtemca rmaps_base_verbose 100 hostname
[deepv:22377] mca: base: component_find: searching NULL for rmaps components
[deepv:22377] mca: base: find_dyn_components: checking NULL for rmaps components
[deepv:22377] pmix:mca: base: components_register: registering framework rmaps components
[deepv:22377] pmix:mca: base: components_register: found loaded component ppr
[deepv:22377] pmix:mca: base: components_register: component ppr register function successful
[deepv:22377] pmix:mca: base: components_register: found loaded component rank_file
[deepv:22377] pmix:mca: base: components_register: component rank_file has no register or open function
[deepv:22377] pmix:mca: base: components_register: found loaded component round_robin
[deepv:22377] pmix:mca: base: components_register: component round_robin register function successful
[deepv:22377] pmix:mca: base: components_register: found loaded component seq
[deepv:22377] pmix:mca: base: components_register: component seq register function successful
[deepv:22377] mca: base: components_open: opening rmaps components
[deepv:22377] mca: base: components_open: found loaded component ppr
[deepv:22377] mca: base: components_open: component ppr open function successful
[deepv:22377] mca: base: components_open: found loaded component rank_file
[deepv:22377] mca: base: components_open: found loaded component round_robin
[deepv:22377] mca: base: components_open: component round_robin open function successful
[deepv:22377] mca: base: components_open: found loaded component seq
[deepv:22377] mca: base: components_open: component seq open function successful
[deepv:22377] mca:rmaps:select: checking available component ppr
[deepv:22377] mca:rmaps:select: Querying component [ppr]
[deepv:22377] mca:rmaps:select: checking available component rank_file
[deepv:22377] mca:rmaps:select: Querying component [rank_file]
[deepv:22377] mca:rmaps:select: checking available component round_robin
[deepv:22377] mca:rmaps:select: Querying component [round_robin]
[deepv:22377] mca:rmaps:select: checking available component seq
[deepv:22377] mca:rmaps:select: Querying component [seq]
[deepv:22377] [prterun-deepv-22377@0,0]: Final mapper priorities
[deepv:22377] 	Mapper: rank_file Priority: 100
[deepv:22377] 	Mapper: ppr Priority: 90
[deepv:22377] 	Mapper: seq Priority: 60
[deepv:22377] 	Mapper: round_robin Priority: 10
[deepv:22377] [prterun-deepv-22377@0,0] rmaps:base set policy with slot
[deepv:22377] mca:rmaps: mapping job prterun-deepv-22377@1
[deepv:22377] mca:rmaps: setting mapping policies for job prterun-deepv-22377@1 inherit TRUE hwtcpus FALSE
[deepv:22377] mca:rmaps:rf: job prterun-deepv-22377@1 not using rankfile policy
[deepv:22377] mca:rmaps:ppr: job prterun-deepv-22377@1 not using ppr mapper PPR NULL policy PPR NOTSET
[deepv:22377] [prterun-deepv-22377@0,0] rmaps:seq called on job prterun-deepv-22377@1
[deepv:22377] mca:rmaps:seq: job prterun-deepv-22377@1 not using seq mapper
[deepv:22377] mca:rmaps:rr: mapping job prterun-deepv-22377@1
[deepv:22377] [prterun-deepv-22377@0,0] Starting with 1 nodes in list
[deepv:22377] [prterun-deepv-22377@0,0] Filtering thru apps
[deepv:22377] [prterun-deepv-22377@0,0] Retained 1 nodes in list
[deepv:22377] [prterun-deepv-22377@0,0] node dp-dam01 has 48 slots available
[deepv:22377] AVAILABLE NODES FOR MAPPING:
[deepv:22377]     node: dp-dam01 daemon: 1 slots_available: 48
[deepv:22377] mca:rmaps:rr: mapping by slot for job prterun-deepv-22377@1 slots 48 num_procs 1
[deepv:22377] mca:rmaps:rr:slot working node dp-dam01
[deepv:22377] [prterun-deepv-22377@0,0] get_avail_ncpus: node dp-dam01 has 0 procs on it
[deepv:22377] mca:rmaps:rr:slot job prterun-deepv-22377@1 is oversubscribed - performing second pass
[deepv:22377] mca:rmaps:rr:slot working node dp-dam01
[deepv:22377] [prterun-deepv-22377@0,0] get_avail_ncpus: node dp-dam01 has 0 procs on it
--------------------------------------------------------------------------
Your job failed to map. Either no mapper was available, or none
of the available mappers was able to perform the requested
mapping operation.

  Mapper result:       Out of resource
  Application:         hostname
  #procs to be mapped: 1
  Mapping policy:      BYSLOT
  Binding policy:      CORE

--------------------------------------------------------------------------

Again, with hwthreads I can spawn up to 20 procs but no more. Could you elaborate a bit on the cgroup thing? I don't have any experience with it. Is there an easy way to check whether it's in effect?

Also, thanks @naughtont3 for the debug info; generally things seem in order. Under my salloc -N 1 -n 48, which I hope is right (it's what I've used for some time now):

$ echo $SLURM_JOB_CPUS_PER_NODE 
96
$ echo $SLURM_NTASKS
48
$ echo $SLURM_NPROCS
48

Do we think this looks like a slurm installation issue or like a prrte issue? Should prte_rmaps_base_get_ncpus return 0? Is prte_rmaps_base_get_ncpus affected by slurm or does it just use hwloc to find the number of cpus on the node?

@rhc54
Contributor

rhc54 commented Feb 3, 2023

Do we think this looks like a slurm installation issue or like a prrte issue?

Really not sure at this point. Your Slurm envars look correct, but it is also clear that we are not seeing the allocation.

Should prte_rmaps_base_get_ncpus return 0? Is prte_rmaps_base_get_ncpus affected by slurm or does it just use hwloc to find the number of cpus on the node?

If "get_ncpus" returns 0, then we can't assign a process to that location as we don't have any available cpus. It uses hwloc to find the cpus, but the number of available cpus is impacted by Slurm, which can control what hwloc sees by setting a "window" of allowed cpus (cgroups is the mechanism by which that is done).

Basically, think of it as Slurm "binding" anything you run to a specified set of cpus. mpirun can assign a process to any cpu inside that window, but not outside it.

In this case, the allocation output from mpirun indicates that it sees 48 slots allocated to it, which matches the SLURM_NTASKS we were provided. What we aren't seeing are the cpus on the node.
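
(As an aside, not part of PRRTE itself: a minimal hwloc sketch, assuming hwloc >= 2.0, that prints the two sets side by side, i.e. the machine's allowed cpuset and the cpuset the calling process is bound to; the latter is the "window" Slurm imposes. The file name and build line are just illustrative.)

/* check_window.c: compare the topology's allowed cpuset against the set
 * this process is actually bound to. Build with: cc check_window.c -lhwloc */
#include <stdio.h>
#include <stdlib.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_bitmap_t bound = hwloc_bitmap_alloc();
    char *allowed_str = NULL, *bound_str = NULL;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* cpus that exist and are allowed on the machine (e.g. 0-95 here) */
    hwloc_bitmap_list_asprintf(&allowed_str, hwloc_topology_get_allowed_cpuset(topo));

    /* cpus this process may actually run on (e.g. 0-19 under the Slurm window) */
    if (0 <= hwloc_get_cpubind(topo, bound, HWLOC_CPUBIND_PROCESS)) {
        hwloc_bitmap_list_asprintf(&bound_str, bound);
    }

    printf("allowed: %s\nbound:   %s\n",
           (NULL == allowed_str) ? "unknown" : allowed_str,
           (NULL == bound_str) ? "unknown" : bound_str);

    free(allowed_str);
    free(bound_str);
    hwloc_bitmap_free(bound);
    hwloc_topology_destroy(topo);
    return 0;
}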

Try applying the following patch to the 3rd-party/prrte directory:

diff --git a/src/mca/rmaps/base/rmaps_base_support_fns.c b/src/mca/rmaps/base/rmaps_base_support_fns.c
index c7d04044ab..2fa6b7d6bc 100644
--- a/src/mca/rmaps/base/rmaps_base_support_fns.c
+++ b/src/mca/rmaps/base/rmaps_base_support_fns.c
@@ -674,6 +674,7 @@ int prte_rmaps_base_get_ncpus(prte_node_t *node,
                               prte_rmaps_options_t *options)
 {
     int ncpus;
+    char *tmp;
 
 #if HWLOC_API_VERSION < 0x20000
     hwloc_obj_t root;
@@ -687,15 +688,30 @@ int prte_rmaps_base_get_ncpus(prte_node_t *node,
         hwloc_bitmap_and(prte_rmaps_base.available, prte_rmaps_base.available, obj->allowed_cpuset);
     }
 #else
+    if (NULL != options->job_cpuset) {
+        tmp = NULL;
+        hwloc_bitmap_list_asprintf(&tmp, options->job_cpuset);
+        pmix_output(0, "JOBCPUSET: %s", (NULL == tmp) ? "NO-CPUS" : tmp);
+    } else {
+        pmix_output(0, "JOBCPUSET IS NULL");
+    }
+    tmp = NULL;
+    hwloc_bitmap_list_asprintf(&tmp, hwloc_topology_get_allowed_cpuset(node->topology->topo));
+    pmix_output(0, "ALLOWEDCPUSET: %s", (NULL == tmp) ? "NULL" : tmp);
     if (NULL == options->job_cpuset) {
         hwloc_bitmap_copy(prte_rmaps_base.available, hwloc_topology_get_allowed_cpuset(node->topology->topo));
     } else {
         hwloc_bitmap_and(prte_rmaps_base.available, hwloc_topology_get_allowed_cpuset(node->topology->topo), options->job_cpuset);
     }
+    pmix_output(0, "OBJ: %s", (NULL == obj) ? "NULL" : "NON-NULL");
     if (NULL != obj) {
         hwloc_bitmap_and(prte_rmaps_base.available, prte_rmaps_base.available, obj->cpuset);
     }
 #endif
+    tmp = NULL;
+    hwloc_bitmap_list_asprintf(&tmp, prte_rmaps_base.available);
+    pmix_output(0, "GETNCPUS: %s", (NULL == tmp) ? "NULL" : tmp);
+
     if (options->use_hwthreads) {
         ncpus = hwloc_bitmap_weight(prte_rmaps_base.available);
     } else {

Should hopefully provide a little more insight into what is going on.

@gkatev
Contributor Author

gkatev commented Feb 3, 2023

I see, that sounds reasonable. Let me also note that I'm not 100% confident in my salloc params, but I did also try other ones, e.g. with -t or -c instead of -n, but no dice.

With the above patch applied:

$ mpirun -n 1 --map-by slot --bind-to core --prtemca rmaps_base_verbose 100 hostname
[deepv:27740] mca: base: component_find: searching NULL for rmaps components
[deepv:27740] mca: base: find_dyn_components: checking NULL for rmaps components
[deepv:27740] pmix:mca: base: components_register: registering framework rmaps components
[deepv:27740] pmix:mca: base: components_register: found loaded component ppr
[deepv:27740] pmix:mca: base: components_register: component ppr register function successful
[deepv:27740] pmix:mca: base: components_register: found loaded component rank_file
[deepv:27740] pmix:mca: base: components_register: component rank_file has no register or open function
[deepv:27740] pmix:mca: base: components_register: found loaded component round_robin
[deepv:27740] pmix:mca: base: components_register: component round_robin register function successful
[deepv:27740] pmix:mca: base: components_register: found loaded component seq
[deepv:27740] pmix:mca: base: components_register: component seq register function successful
[deepv:27740] mca: base: components_open: opening rmaps components
[deepv:27740] mca: base: components_open: found loaded component ppr
[deepv:27740] mca: base: components_open: component ppr open function successful
[deepv:27740] mca: base: components_open: found loaded component rank_file
[deepv:27740] mca: base: components_open: found loaded component round_robin
[deepv:27740] mca: base: components_open: component round_robin open function successful
[deepv:27740] mca: base: components_open: found loaded component seq
[deepv:27740] mca: base: components_open: component seq open function successful
[deepv:27740] mca:rmaps:select: checking available component ppr
[deepv:27740] mca:rmaps:select: Querying component [ppr]
[deepv:27740] mca:rmaps:select: checking available component rank_file
[deepv:27740] mca:rmaps:select: Querying component [rank_file]
[deepv:27740] mca:rmaps:select: checking available component round_robin
[deepv:27740] mca:rmaps:select: Querying component [round_robin]
[deepv:27740] mca:rmaps:select: checking available component seq
[deepv:27740] mca:rmaps:select: Querying component [seq]
[deepv:27740] [prterun-deepv-27740@0,0]: Final mapper priorities
[deepv:27740] 	Mapper: rank_file Priority: 100
[deepv:27740] 	Mapper: ppr Priority: 90
[deepv:27740] 	Mapper: seq Priority: 60
[deepv:27740] 	Mapper: round_robin Priority: 10
[deepv:27740] [prterun-deepv-27740@0,0] rmaps:base set policy with slot
[deepv:27740] mca:rmaps: mapping job prterun-deepv-27740@1
[deepv:27740] mca:rmaps: setting mapping policies for job prterun-deepv-27740@1 inherit TRUE hwtcpus FALSE
[deepv:27740] mca:rmaps:rf: job prterun-deepv-27740@1 not using rankfile policy
[deepv:27740] mca:rmaps:ppr: job prterun-deepv-27740@1 not using ppr mapper PPR NULL policy PPR NOTSET
[deepv:27740] [prterun-deepv-27740@0,0] rmaps:seq called on job prterun-deepv-27740@1
[deepv:27740] mca:rmaps:seq: job prterun-deepv-27740@1 not using seq mapper
[deepv:27740] mca:rmaps:rr: mapping job prterun-deepv-27740@1
[deepv:27740] [prterun-deepv-27740@0,0] Starting with 1 nodes in list
[deepv:27740] [prterun-deepv-27740@0,0] Filtering thru apps
[deepv:27740] [prterun-deepv-27740@0,0] Retained 1 nodes in list
[deepv:27740] [prterun-deepv-27740@0,0] node dp-dam01 has 48 slots available
[deepv:27740] AVAILABLE NODES FOR MAPPING:
[deepv:27740]     node: dp-dam01 daemon: 1 slots_available: 48
[deepv:27740] mca:rmaps:rr: mapping by slot for job prterun-deepv-27740@1 slots 48 num_procs 1
[deepv:27740] mca:rmaps:rr:slot working node dp-dam01
[deepv:27740] JOBCPUSET: 0-19
[deepv:27740] ALLOWEDCPUSET: 0-95
[deepv:27740] OBJ: NULL
[deepv:27740] GETNCPUS: 0-19
prte_rmaps_base_get_ncpus() = 0
[deepv:27740] [prterun-deepv-27740@0,0] get_avail_ncpus: node dp-dam01 has 0 procs on it
[deepv:27740] JOBCPUSET: 0-19
[deepv:27740] ALLOWEDCPUSET: 0-95
[deepv:27740] OBJ: NULL
[deepv:27740] GETNCPUS: 0-19
[deepv:27740] mca:rmaps:rr:slot job prterun-deepv-27740@1 is oversubscribed - performing second pass
[deepv:27740] mca:rmaps:rr:slot working node dp-dam01
[deepv:27740] JOBCPUSET: 0-19
[deepv:27740] ALLOWEDCPUSET: 0-95
[deepv:27740] OBJ: NULL
[deepv:27740] GETNCPUS: 0-19
prte_rmaps_base_get_ncpus() = 0
[deepv:27740] [prterun-deepv-27740@0,0] get_avail_ncpus: node dp-dam01 has 0 procs on it
[deepv:27740] JOBCPUSET: 0-19
[deepv:27740] ALLOWEDCPUSET: 0-95
[deepv:27740] OBJ: NULL
[deepv:27740] GETNCPUS: 0-19
--------------------------------------------------------------------------
Your job failed to map. Either no mapper was available, or none
of the available mappers was able to perform the requested
mapping operation.

  Mapper result:       Out of resource
  Application:         hostname
  #procs to be mapped: 1
  Mapping policy:      BYSLOT
  Binding policy:      CORE

--------------------------------------------------------------------------

@rhc54
Contributor

rhc54 commented Feb 3, 2023

Hmmm...well, that certainly wasn't what I expected to see! It looks like you have something that is setting the job cpuset (like an MCA param for "hwloc_default_cpu_list") that is restricting the available cpus to 0-19. Could you add the following diff:

diff --git a/src/mca/rmaps/base/rmaps_base_map_job.c b/src/mca/rmaps/base/rmaps_base_map_job.c
index 4ae7df6493..a81267e8fa 100644
--- a/src/mca/rmaps/base/rmaps_base_map_job.c
+++ b/src/mca/rmaps/base/rmaps_base_map_job.c
@@ -312,6 +312,7 @@ void prte_rmaps_base_map_job(int fd, short args, void *cbdata)
 
     /* set some convenience params */
     prte_get_attribute(&jdata->attributes, PRTE_JOB_CPUSET, (void**)&options.cpuset, PMIX_STRING);
+    pmix_output(0, "JOBCPUSET ATTR: %s", (NULL == options.cpuset) ? "NULL" : options.cpuset);
     if (prte_get_attribute(&jdata->attributes, PRTE_JOB_PES_PER_PROC, (void **) &u16ptr, PMIX_UINT16)) {
         options.cpus_per_rank = u16;
     } else {
diff --git a/src/mca/rmaps/round_robin/rmaps_rr_mappers.c b/src/mca/rmaps/round_robin/rmaps_rr_mappers.c
index 1ba0053f17..df791071b1 100644
--- a/src/mca/rmaps/round_robin/rmaps_rr_mappers.c
+++ b/src/mca/rmaps/round_robin/rmaps_rr_mappers.c
@@ -53,6 +53,7 @@ int prte_rmaps_rr_byslot(prte_job_t *jdata,
     prte_proc_t *proc;
     bool second_pass = false;
     prte_binding_policy_t savebind = options->bind;
+    char *tmp;
 
     pmix_output_verbose(2, prte_rmaps_base_framework.framework_output,
                         "mca:rmaps:rr: mapping by slot for job %s slots %d num_procs %lu",
@@ -84,6 +85,14 @@ pass:
                             "mca:rmaps:rr:slot working node %s", node->name);
 
         prte_rmaps_base_get_cpuset(jdata, node, options);
+        pmix_output(0, "CPUSET MAPPER: %s", (NULL == options->cpuset) ? "NULL" : options->cpuset);
+        if (NULL == options->job_cpuset) {
+            pmix_output(0, "JOBCPUSET MAPPER: NULL");
+        } else {
+            tmp = NULL;
+            hwloc_bitmap_list_asprintf(&tmp, options->job_cpuset);
+            pmix_output(0, "JOBCPUSET MAPPER: %s", (NULL == tmp) ? "NULL" : tmp);
+        }
 
         /* compute the number of procs to go on this node */
         if (second_pass) {

@gkatev
Contributor Author

gkatev commented Feb 3, 2023

Hmm, my env seems clear; here's the new output:

$ env | grep -i MCA
/* clear */

$ mpirun -n 1 --map-by slot --bind-to core --prtemca rmaps_base_verbose 100 hostname
[deepv:24019] mca: base: component_find: searching NULL for rmaps components
[deepv:24019] mca: base: find_dyn_components: checking NULL for rmaps components
[deepv:24019] pmix:mca: base: components_register: registering framework rmaps components
[deepv:24019] pmix:mca: base: components_register: found loaded component ppr
[deepv:24019] pmix:mca: base: components_register: component ppr register function successful
[deepv:24019] pmix:mca: base: components_register: found loaded component rank_file
[deepv:24019] pmix:mca: base: components_register: component rank_file has no register or open function
[deepv:24019] pmix:mca: base: components_register: found loaded component round_robin
[deepv:24019] pmix:mca: base: components_register: component round_robin register function successful
[deepv:24019] pmix:mca: base: components_register: found loaded component seq
[deepv:24019] pmix:mca: base: components_register: component seq register function successful
[deepv:24019] mca: base: components_open: opening rmaps components
[deepv:24019] mca: base: components_open: found loaded component ppr
[deepv:24019] mca: base: components_open: component ppr open function successful
[deepv:24019] mca: base: components_open: found loaded component rank_file
[deepv:24019] mca: base: components_open: found loaded component round_robin
[deepv:24019] mca: base: components_open: component round_robin open function successful
[deepv:24019] mca: base: components_open: found loaded component seq
[deepv:24019] mca: base: components_open: component seq open function successful
[deepv:24019] mca:rmaps:select: checking available component ppr
[deepv:24019] mca:rmaps:select: Querying component [ppr]
[deepv:24019] mca:rmaps:select: checking available component rank_file
[deepv:24019] mca:rmaps:select: Querying component [rank_file]
[deepv:24019] mca:rmaps:select: checking available component round_robin
[deepv:24019] mca:rmaps:select: Querying component [round_robin]
[deepv:24019] mca:rmaps:select: checking available component seq
[deepv:24019] mca:rmaps:select: Querying component [seq]
[deepv:24019] [prterun-deepv-24019@0,0]: Final mapper priorities
[deepv:24019] 	Mapper: rank_file Priority: 100
[deepv:24019] 	Mapper: ppr Priority: 90
[deepv:24019] 	Mapper: seq Priority: 60
[deepv:24019] 	Mapper: round_robin Priority: 10
[deepv:24019] [prterun-deepv-24019@0,0] rmaps:base set policy with slot
[deepv:24019] mca:rmaps: mapping job prterun-deepv-24019@1
[deepv:24019] JOBCPUSET ATTR: NULL
[deepv:24019] mca:rmaps: setting mapping policies for job prterun-deepv-24019@1 inherit TRUE hwtcpus FALSE
[deepv:24019] mca:rmaps:rf: job prterun-deepv-24019@1 not using rankfile policy
[deepv:24019] mca:rmaps:ppr: job prterun-deepv-24019@1 not using ppr mapper PPR NULL policy PPR NOTSET
[deepv:24019] [prterun-deepv-24019@0,0] rmaps:seq called on job prterun-deepv-24019@1
[deepv:24019] mca:rmaps:seq: job prterun-deepv-24019@1 not using seq mapper
[deepv:24019] mca:rmaps:rr: mapping job prterun-deepv-24019@1
[deepv:24019] [prterun-deepv-24019@0,0] Starting with 1 nodes in list
[deepv:24019] [prterun-deepv-24019@0,0] Filtering thru apps
[deepv:24019] [prterun-deepv-24019@0,0] Retained 1 nodes in list
[deepv:24019] [prterun-deepv-24019@0,0] node dp-dam01 has 48 slots available
[deepv:24019] AVAILABLE NODES FOR MAPPING:
[deepv:24019]     node: dp-dam01 daemon: 1 slots_available: 48
[deepv:24019] mca:rmaps:rr: mapping by slot for job prterun-deepv-24019@1 slots 48 num_procs 1
[deepv:24019] mca:rmaps:rr:slot working node dp-dam01
[deepv:24019] CPUSET MAPPER: NULL
[deepv:24019] JOBCPUSET MAPPER: 0-19
[deepv:24019] JOBCPUSET: 0-19
[deepv:24019] ALLOWEDCPUSET: 0-95
[deepv:24019] OBJ: NULL
[deepv:24019] GETNCPUS: 0-19
prte_rmaps_base_get_ncpus() = 0
[deepv:24019] [prterun-deepv-24019@0,0] get_avail_ncpus: node dp-dam01 has 0 procs on it
[deepv:24019] JOBCPUSET: 0-19
[deepv:24019] ALLOWEDCPUSET: 0-95
[deepv:24019] OBJ: NULL
[deepv:24019] GETNCPUS: 0-19
[deepv:24019] mca:rmaps:rr:slot job prterun-deepv-24019@1 is oversubscribed - performing second pass
[deepv:24019] mca:rmaps:rr:slot working node dp-dam01
[deepv:24019] CPUSET MAPPER: NULL
[deepv:24019] JOBCPUSET MAPPER: 0-19
[deepv:24019] JOBCPUSET: 0-19
[deepv:24019] ALLOWEDCPUSET: 0-95
[deepv:24019] OBJ: NULL
[deepv:24019] GETNCPUS: 0-19
prte_rmaps_base_get_ncpus() = 0
[deepv:24019] [prterun-deepv-24019@0,0] get_avail_ncpus: node dp-dam01 has 0 procs on it
[deepv:24019] JOBCPUSET: 0-19
[deepv:24019] ALLOWEDCPUSET: 0-95
[deepv:24019] OBJ: NULL
[deepv:24019] GETNCPUS: 0-19
--------------------------------------------------------------------------
Your job failed to map. Either no mapper was available, or none
of the available mappers was able to perform the requested
mapping operation.

  Mapper result:       Out of resource
  Application:         hostname
  #procs to be mapped: 1
  Mapping policy:      BYSLOT
  Binding policy:      CORE

--------------------------------------------------------------------------

@rhc54
Contributor

rhc54 commented Feb 3, 2023

Weird - okay, let's try the following diff. This includes all the prior ones as they are now going to interleave, so go into the prrte directory and do a git reset --hard before applying it.

diff --git a/src/hwloc/hwloc_base_util.c b/src/hwloc/hwloc_base_util.c
index d1bdfb6940..31d53829f7 100644
--- a/src/hwloc/hwloc_base_util.c
+++ b/src/hwloc/hwloc_base_util.c
@@ -167,6 +167,7 @@ hwloc_cpuset_t prte_hwloc_base_generate_cpuset(hwloc_topology_t topo,
 hwloc_cpuset_t prte_hwloc_base_setup_summary(hwloc_topology_t topo)
 {
     hwloc_cpuset_t avail = NULL;
+    char *tmp;
 
     avail = hwloc_bitmap_alloc();
     /* get the cpus we are bound to */
@@ -194,6 +195,9 @@ hwloc_cpuset_t prte_hwloc_base_setup_summary(hwloc_topology_t topo)
 #else
     hwloc_bitmap_copy(avail, hwloc_topology_get_allowed_cpuset(topo));
 #endif
+    tmp = NULL;
+    hwloc_bitmap_list_asprintf(&tmp, avail);
+    pmix_output(0, "SETUPSUMMARY: %s", (NULL == tmp) ? "NULL" : tmp);
 
     return avail;
 }
@@ -209,9 +213,11 @@ hwloc_cpuset_t prte_hwloc_base_filter_cpus(hwloc_topology_t topo)
     if (NULL == prte_hwloc_default_cpu_list) {
         PMIX_OUTPUT_VERBOSE((5, prte_hwloc_base_output,
                              "hwloc:base: no cpus specified - using root available cpuset"));
+        pmix_output(0, "NO DEFAULT CPU LIST");
         avail = prte_hwloc_base_setup_summary(topo);
     } else {
         PMIX_OUTPUT_VERBOSE((5, prte_hwloc_base_output, "hwloc:base: filtering cpuset"));
+        pmix_output(0, "FILTERING CPUSET: %s", prte_hwloc_default_cpu_list);
         avail = prte_hwloc_base_generate_cpuset(topo, prte_hwloc_default_use_hwthread_cpus,
                                                 prte_hwloc_default_cpu_list);
     }
diff --git a/src/mca/ess/hnp/ess_hnp_module.c b/src/mca/ess/hnp/ess_hnp_module.c
index a6e330342b..49931c0ce0 100644
--- a/src/mca/ess/hnp/ess_hnp_module.c
+++ b/src/mca/ess/hnp/ess_hnp_module.c
@@ -393,6 +393,10 @@ static int rte_init(int argc, char **argv)
     t->index = pmix_pointer_array_add(prte_node_topologies, t);
     node->topology = t;
     node->available = prte_hwloc_base_filter_cpus(prte_hwloc_topology);
+    error = NULL;
+    hwloc_bitmap_list_asprintf(&error, node->available);
+    pmix_output(0, "ESS: %s", (NULL == error) ? "NULL" : error);
+
     if (15 < pmix_output_get_verbosity(prte_ess_base_framework.framework_output)) {
         char *output = NULL;
         pmix_output(0, "%s Topology Info:", PRTE_NAME_PRINT(PRTE_PROC_MY_NAME));
diff --git a/src/mca/rmaps/base/rmaps_base_map_job.c b/src/mca/rmaps/base/rmaps_base_map_job.c
index 4ae7df6493..a81267e8fa 100644
--- a/src/mca/rmaps/base/rmaps_base_map_job.c
+++ b/src/mca/rmaps/base/rmaps_base_map_job.c
@@ -312,6 +312,7 @@ void prte_rmaps_base_map_job(int fd, short args, void *cbdata)
 
     /* set some convenience params */
     prte_get_attribute(&jdata->attributes, PRTE_JOB_CPUSET, (void**)&options.cpuset, PMIX_STRING);
+    pmix_output(0, "JOBCPUSET ATTR: %s", (NULL == options.cpuset) ? "NULL" : options.cpuset);
     if (prte_get_attribute(&jdata->attributes, PRTE_JOB_PES_PER_PROC, (void **) &u16ptr, PMIX_UINT16)) {
         options.cpus_per_rank = u16;
     } else {
diff --git a/src/mca/rmaps/base/rmaps_base_support_fns.c b/src/mca/rmaps/base/rmaps_base_support_fns.c
index c7d04044ab..b78411b2bc 100644
--- a/src/mca/rmaps/base/rmaps_base_support_fns.c
+++ b/src/mca/rmaps/base/rmaps_base_support_fns.c
@@ -674,6 +674,7 @@ int prte_rmaps_base_get_ncpus(prte_node_t *node,
                               prte_rmaps_options_t *options)
 {
     int ncpus;
+    char *tmp;
 
 #if HWLOC_API_VERSION < 0x20000
     hwloc_obj_t root;
@@ -687,15 +688,30 @@ int prte_rmaps_base_get_ncpus(prte_node_t *node,
         hwloc_bitmap_and(prte_rmaps_base.available, prte_rmaps_base.available, obj->allowed_cpuset);
     }
 #else
+    if (NULL != options->job_cpuset) {
+        tmp = NULL;
+        hwloc_bitmap_list_asprintf(&tmp, options->job_cpuset);
+        pmix_output(0, "JOBCPUSET: %s", (NULL == tmp) ? "NO-CPUS" : tmp);
+    } else {
+        pmix_output(0, "JOBCPUSET IS NULL");
+    }
+    tmp = NULL;
+    hwloc_bitmap_list_asprintf(&tmp, hwloc_topology_get_allowed_cpuset(node->topology->topo));
+    pmix_output(0, "ALLOWEDCPUSET: %s", (NULL == tmp) ? "NULL" : tmp);
     if (NULL == options->job_cpuset) {
         hwloc_bitmap_copy(prte_rmaps_base.available, hwloc_topology_get_allowed_cpuset(node->topology->topo));
     } else {
         hwloc_bitmap_and(prte_rmaps_base.available, hwloc_topology_get_allowed_cpuset(node->topology->topo), options->job_cpuset);
     }
+    pmix_output(0, "OBJ: %s", (NULL == obj) ? "NULL" : "NON-NULL");
     if (NULL != obj) {
         hwloc_bitmap_and(prte_rmaps_base.available, prte_rmaps_base.available, obj->cpuset);
     }
 #endif
+    tmp = NULL;
+    hwloc_bitmap_list_asprintf(&tmp, prte_rmaps_base.available);
+    pmix_output(0, "GETNCPUS: %s", (NULL == tmp) ? "NULL" : tmp);
+
     if (options->use_hwthreads) {
         ncpus = hwloc_bitmap_weight(prte_rmaps_base.available);
     } else {
@@ -788,6 +804,7 @@ void prte_rmaps_base_get_cpuset(prte_job_t *jdata,
                                 prte_node_t *node,
                                 prte_rmaps_options_t *options)
 {
+    char *tmp;
     PRTE_HIDE_UNUSED_PARAMS(jdata);
     
     if (NULL != options->cpuset) {
@@ -795,7 +812,10 @@ void prte_rmaps_base_get_cpuset(prte_job_t *jdata,
                                                               options->use_hwthreads,
                                                               options->cpuset);
     } else {
-        options->job_cpuset = hwloc_bitmap_dup(node->available);
+    tmp = NULL;
+    hwloc_bitmap_list_asprintf(&tmp, node->available);
+    pmix_output(0, "AVAILABLECPUSET: %s", (NULL == tmp) ? "NULL" : tmp);
+         options->job_cpuset = hwloc_bitmap_dup(node->available);
     }
 }
 
diff --git a/src/mca/rmaps/round_robin/rmaps_rr_mappers.c b/src/mca/rmaps/round_robin/rmaps_rr_mappers.c
index 69080d3d06..25cfd419a6 100644
--- a/src/mca/rmaps/round_robin/rmaps_rr_mappers.c
+++ b/src/mca/rmaps/round_robin/rmaps_rr_mappers.c
@@ -53,6 +53,7 @@ int prte_rmaps_rr_byslot(prte_job_t *jdata,
     prte_proc_t *proc;
     bool second_pass = false;
     prte_binding_policy_t savebind = options->bind;
+    char *tmp;
 
     pmix_output_verbose(2, prte_rmaps_base_framework.framework_output,
                         "mca:rmaps:rr: mapping by slot for job %s slots %d num_procs %lu",
@@ -84,6 +85,14 @@ pass:
                             "mca:rmaps:rr:slot working node %s", node->name);
 
         prte_rmaps_base_get_cpuset(jdata, node, options);
+        pmix_output(0, "CPUSET MAPPER: %s", (NULL == options->cpuset) ? "NULL" : options->cpuset);
+        if (NULL == options->job_cpuset) {
+            pmix_output(0, "JOBCPUSET MAPPER: NULL");
+        } else {
+            tmp = NULL;
+            hwloc_bitmap_list_asprintf(&tmp, options->job_cpuset);
+            pmix_output(0, "JOBCPUSET MAPPER: %s", (NULL == tmp) ? "NULL" : tmp);
+        }
 
         /* compute the number of procs to go on this node */
         if (second_pass) {

@gkatev
Contributor Author

gkatev commented Feb 3, 2023

$ mpirun -n 1 --map-by slot --bind-to core --prtemca rmaps_base_verbose 100 hostname
[deepv:05186] mca: base: component_find: searching NULL for rmaps components
[deepv:05186] mca: base: find_dyn_components: checking NULL for rmaps components
[deepv:05186] pmix:mca: base: components_register: registering framework rmaps components
[deepv:05186] pmix:mca: base: components_register: found loaded component ppr
[deepv:05186] pmix:mca: base: components_register: component ppr register function successful
[deepv:05186] pmix:mca: base: components_register: found loaded component rank_file
[deepv:05186] pmix:mca: base: components_register: component rank_file has no register or open function
[deepv:05186] pmix:mca: base: components_register: found loaded component round_robin
[deepv:05186] pmix:mca: base: components_register: component round_robin register function successful
[deepv:05186] pmix:mca: base: components_register: found loaded component seq
[deepv:05186] pmix:mca: base: components_register: component seq register function successful
[deepv:05186] mca: base: components_open: opening rmaps components
[deepv:05186] mca: base: components_open: found loaded component ppr
[deepv:05186] mca: base: components_open: component ppr open function successful
[deepv:05186] mca: base: components_open: found loaded component rank_file
[deepv:05186] mca: base: components_open: found loaded component round_robin
[deepv:05186] mca: base: components_open: component round_robin open function successful
[deepv:05186] mca: base: components_open: found loaded component seq
[deepv:05186] mca: base: components_open: component seq open function successful
[deepv:05186] mca:rmaps:select: checking available component ppr
[deepv:05186] mca:rmaps:select: Querying component [ppr]
[deepv:05186] mca:rmaps:select: checking available component rank_file
[deepv:05186] mca:rmaps:select: Querying component [rank_file]
[deepv:05186] mca:rmaps:select: checking available component round_robin
[deepv:05186] mca:rmaps:select: Querying component [round_robin]
[deepv:05186] mca:rmaps:select: checking available component seq
[deepv:05186] mca:rmaps:select: Querying component [seq]
[deepv:05186] [prterun-deepv-5186@0,0]: Final mapper priorities
[deepv:05186] 	Mapper: rank_file Priority: 100
[deepv:05186] 	Mapper: ppr Priority: 90
[deepv:05186] 	Mapper: seq Priority: 60
[deepv:05186] 	Mapper: round_robin Priority: 10
[deepv:05186] NO DEFAULT CPU LIST
[deepv:05186] ESS: 0-19
[deepv:05186] NO DEFAULT CPU LIST
[deepv:05186] [prterun-deepv-5186@0,0] rmaps:base set policy with slot
[deepv:05186] mca:rmaps: mapping job prterun-deepv-5186@1
[deepv:05186] JOBCPUSET ATTR: NULL
[deepv:05186] mca:rmaps: setting mapping policies for job prterun-deepv-5186@1 inherit TRUE hwtcpus FALSE
[deepv:05186] mca:rmaps:rf: job prterun-deepv-5186@1 not using rankfile policy
[deepv:05186] mca:rmaps:ppr: job prterun-deepv-5186@1 not using ppr mapper PPR NULL policy PPR NOTSET
[deepv:05186] [prterun-deepv-5186@0,0] rmaps:seq called on job prterun-deepv-5186@1
[deepv:05186] mca:rmaps:seq: job prterun-deepv-5186@1 not using seq mapper
[deepv:05186] mca:rmaps:rr: mapping job prterun-deepv-5186@1
[deepv:05186] [prterun-deepv-5186@0,0] Starting with 1 nodes in list
[deepv:05186] [prterun-deepv-5186@0,0] Filtering thru apps
[deepv:05186] [prterun-deepv-5186@0,0] Retained 1 nodes in list
[deepv:05186] [prterun-deepv-5186@0,0] node dp-dam01 has 48 slots available
[deepv:05186] AVAILABLE NODES FOR MAPPING:
[deepv:05186]     node: dp-dam01 daemon: 1 slots_available: 48
[deepv:05186] mca:rmaps:rr: mapping by slot for job prterun-deepv-5186@1 slots 48 num_procs 1
[deepv:05186] mca:rmaps:rr:slot working node dp-dam01
[deepv:05186] AVAILABLECPUSET: 0-19
[deepv:05186] CPUSET MAPPER: NULL
[deepv:05186] JOBCPUSET MAPPER: 0-19
[deepv:05186] JOBCPUSET: 0-19
[deepv:05186] ALLOWEDCPUSET: 0-95
[deepv:05186] OBJ: NULL
[deepv:05186] GETNCPUS: 0-19
[deepv:05186] [prterun-deepv-5186@0,0] get_avail_ncpus: node dp-dam01 has 0 procs on it
[deepv:05186] JOBCPUSET: 0-19
[deepv:05186] ALLOWEDCPUSET: 0-95
[deepv:05186] OBJ: NULL
[deepv:05186] GETNCPUS: 0-19
[deepv:05186] mca:rmaps:rr:slot job prterun-deepv-5186@1 is oversubscribed - performing second pass
[deepv:05186] mca:rmaps:rr:slot working node dp-dam01
[deepv:05186] AVAILABLECPUSET: 0-19
[deepv:05186] CPUSET MAPPER: NULL
[deepv:05186] JOBCPUSET MAPPER: 0-19
[deepv:05186] JOBCPUSET: 0-19
[deepv:05186] ALLOWEDCPUSET: 0-95
[deepv:05186] OBJ: NULL
[deepv:05186] GETNCPUS: 0-19
[deepv:05186] [prterun-deepv-5186@0,0] get_avail_ncpus: node dp-dam01 has 0 procs on it
[deepv:05186] JOBCPUSET: 0-19
[deepv:05186] ALLOWEDCPUSET: 0-95
[deepv:05186] OBJ: NULL
[deepv:05186] GETNCPUS: 0-19
--------------------------------------------------------------------------
Your job failed to map. Either no mapper was available, or none
of the available mappers was able to perform the requested
mapping operation.

  Mapper result:       Out of resource
  Application:         hostname
  #procs to be mapped: 1
  Mapping policy:      BYSLOT
  Binding policy:      CORE

--------------------------------------------------------------------------

@rhc54
Contributor

rhc54 commented Feb 3, 2023

Well, here's the problem:

[deepv:05186] NO DEFAULT CPU LIST
[deepv:05186] ESS: 0-19
[deepv:05186] NO DEFAULT CPU LIST

A little light is beginning to show. Try adding this:

diff --git a/src/hwloc/hwloc_base_util.c b/src/hwloc/hwloc_base_util.c
index d1bdfb6940..065eeb5f27 100644
--- a/src/hwloc/hwloc_base_util.c
+++ b/src/hwloc/hwloc_base_util.c
@@ -167,11 +167,15 @@ hwloc_cpuset_t prte_hwloc_base_generate_cpuset(hwloc_topology_t topo,
 hwloc_cpuset_t prte_hwloc_base_setup_summary(hwloc_topology_t topo)
 {
     hwloc_cpuset_t avail = NULL;
+    char *tmp;
 
     avail = hwloc_bitmap_alloc();
     /* get the cpus we are bound to */
     if (!prte_hwloc_synthetic_topo &&
         0 <= hwloc_get_cpubind(topo, avail, HWLOC_CPUBIND_PROCESS)) {
+        tmp = NULL;
+        hwloc_bitmap_list_asprintf(&tmp, avail);
+        pmix_output(0, "WE ARE BOUND: %s", tmp);
         return avail;
     }

What appears to be happening (and this diff should confirm it) is that Slurm is binding mpirun to hwthreads 0-19, which then constrains anything we do to that range. It also appears that this machine has no cores, but is configured as strictly hwthreads. So bind-to core has to fail because there are no cores.

@gkatev
Contributor Author

gkatev commented Feb 3, 2023

Interesting, and weird. Any idea how this constraint is happening, or where the source of the problem lies in general? From what I understand, it's the result of a buggy/misconfigured slurm environment? (There's also the question of why it worked with rc7, but I suppose anything is possible; it could even have been failing silently.)

With the latest diff:

$ mpirun -n 1 --map-by slot --bind-to core --prtemca rmaps_base_verbose 100 hostname
[deepv:21042] mca: base: component_find: searching NULL for rmaps components
[deepv:21042] mca: base: find_dyn_components: checking NULL for rmaps components
[deepv:21042] pmix:mca: base: components_register: registering framework rmaps components
[deepv:21042] pmix:mca: base: components_register: found loaded component ppr
[deepv:21042] pmix:mca: base: components_register: component ppr register function successful
[deepv:21042] pmix:mca: base: components_register: found loaded component rank_file
[deepv:21042] pmix:mca: base: components_register: component rank_file has no register or open function
[deepv:21042] pmix:mca: base: components_register: found loaded component round_robin
[deepv:21042] pmix:mca: base: components_register: component round_robin register function successful
[deepv:21042] pmix:mca: base: components_register: found loaded component seq
[deepv:21042] pmix:mca: base: components_register: component seq register function successful
[deepv:21042] mca: base: components_open: opening rmaps components
[deepv:21042] mca: base: components_open: found loaded component ppr
[deepv:21042] mca: base: components_open: component ppr open function successful
[deepv:21042] mca: base: components_open: found loaded component rank_file
[deepv:21042] mca: base: components_open: found loaded component round_robin
[deepv:21042] mca: base: components_open: component round_robin open function successful
[deepv:21042] mca: base: components_open: found loaded component seq
[deepv:21042] mca: base: components_open: component seq open function successful
[deepv:21042] mca:rmaps:select: checking available component ppr
[deepv:21042] mca:rmaps:select: Querying component [ppr]
[deepv:21042] mca:rmaps:select: checking available component rank_file
[deepv:21042] mca:rmaps:select: Querying component [rank_file]
[deepv:21042] mca:rmaps:select: checking available component round_robin
[deepv:21042] mca:rmaps:select: Querying component [round_robin]
[deepv:21042] mca:rmaps:select: checking available component seq
[deepv:21042] mca:rmaps:select: Querying component [seq]
[deepv:21042] [prterun-deepv-21042@0,0]: Final mapper priorities
[deepv:21042] 	Mapper: rank_file Priority: 100
[deepv:21042] 	Mapper: ppr Priority: 90
[deepv:21042] 	Mapper: seq Priority: 60
[deepv:21042] 	Mapper: round_robin Priority: 10
[deepv:21042] NO DEFAULT CPU LIST
[deepv:21042] WE ARE BOUND: 0-19
[deepv:21042] ESS: 0-19
[deepv:21042] NO DEFAULT CPU LIST
[deepv:21042] WE ARE BOUND: 0-19
[deepv:21042] [prterun-deepv-21042@0,0] rmaps:base set policy with slot
[deepv:21042] mca:rmaps: mapping job prterun-deepv-21042@1
[deepv:21042] JOBCPUSET ATTR: NULL
[deepv:21042] mca:rmaps: setting mapping policies for job prterun-deepv-21042@1 inherit TRUE hwtcpus FALSE
[deepv:21042] mca:rmaps:rf: job prterun-deepv-21042@1 not using rankfile policy
[deepv:21042] mca:rmaps:ppr: job prterun-deepv-21042@1 not using ppr mapper PPR NULL policy PPR NOTSET
[deepv:21042] [prterun-deepv-21042@0,0] rmaps:seq called on job prterun-deepv-21042@1
[deepv:21042] mca:rmaps:seq: job prterun-deepv-21042@1 not using seq mapper
[deepv:21042] mca:rmaps:rr: mapping job prterun-deepv-21042@1
[deepv:21042] [prterun-deepv-21042@0,0] Starting with 1 nodes in list
[deepv:21042] [prterun-deepv-21042@0,0] Filtering thru apps
[deepv:21042] [prterun-deepv-21042@0,0] Retained 1 nodes in list
[deepv:21042] [prterun-deepv-21042@0,0] node dp-dam01 has 48 slots available
[deepv:21042] AVAILABLE NODES FOR MAPPING:
[deepv:21042]     node: dp-dam01 daemon: 1 slots_available: 48
[deepv:21042] mca:rmaps:rr: mapping by slot for job prterun-deepv-21042@1 slots 48 num_procs 1
[deepv:21042] mca:rmaps:rr:slot working node dp-dam01
[deepv:21042] AVAILABLECPUSET: 0-19
[deepv:21042] CPUSET MAPPER: NULL
[deepv:21042] JOBCPUSET MAPPER: 0-19
[deepv:21042] JOBCPUSET: 0-19
[deepv:21042] ALLOWEDCPUSET: 0-95
[deepv:21042] OBJ: NULL
[deepv:21042] GETNCPUS: 0-19
[deepv:21042] [prterun-deepv-21042@0,0] get_avail_ncpus: node dp-dam01 has 0 procs on it
[deepv:21042] JOBCPUSET: 0-19
[deepv:21042] ALLOWEDCPUSET: 0-95
[deepv:21042] OBJ: NULL
[deepv:21042] GETNCPUS: 0-19
[deepv:21042] mca:rmaps:rr:slot job prterun-deepv-21042@1 is oversubscribed - performing second pass
[deepv:21042] mca:rmaps:rr:slot working node dp-dam01
[deepv:21042] AVAILABLECPUSET: 0-19
[deepv:21042] CPUSET MAPPER: NULL
[deepv:21042] JOBCPUSET MAPPER: 0-19
[deepv:21042] JOBCPUSET: 0-19
[deepv:21042] ALLOWEDCPUSET: 0-95
[deepv:21042] OBJ: NULL
[deepv:21042] GETNCPUS: 0-19
[deepv:21042] [prterun-deepv-21042@0,0] get_avail_ncpus: node dp-dam01 has 0 procs on it
[deepv:21042] JOBCPUSET: 0-19
[deepv:21042] ALLOWEDCPUSET: 0-95
[deepv:21042] OBJ: NULL
[deepv:21042] GETNCPUS: 0-19
--------------------------------------------------------------------------
Your job failed to map. Either no mapper was available, or none
of the available mappers was able to perform the requested
mapping operation.

  Mapper result:       Out of resource
  Application:         hostname
  #procs to be mapped: 1
  Mapping policy:      BYSLOT
  Binding policy:      CORE

--------------------------------------------------------------------------

@rhc54
Contributor

rhc54 commented Feb 3, 2023

Yeah, you're hitting an external binding constraint. Could be your admins made a change to the Slurm config - have you retried that prior rc to see if it currently works? Also possible that we were ignoring something that we now pay attention to.

What kind of machine is this? If you run lstopo from HWLOC, does it report the existence of any cores? Or just PUs? I can add an error check (and more useful error message) if we are asked to bind-to core but there are no cores on the machine.

@gkatev
Contributor Author

gkatev commented Feb 3, 2023

I see, thanks. I'll also contact the admins and see what insight they can offer.

The machine is a 2x Xeon 8260 (but it's also happening on another partition w/ different CPUs). All looks in order in lstopo, with all expected cores per socket and 2 PUs per core.

Yes, rc7 still works. Here are some logs for completeness:

$ 5.0.0rc7/mpirun --map-by slot --bind-to core --display map hostname

========================   JOB MAP   ========================
Data for JOB mpirun-deepv-8862@1 offset 0 Total slots allocated 48
    Mapping policy: BYSLOT:NOOVERSUBSCRIBE  Ranking policy: SLOT Binding policy: CORE
    Cpu set: N/A  PPR: N/A  Cpus-per-rank: N/A  Cpu Type: CORE


Data for node: dp-dam01	Num slots: 48	Max slots: 0	Num procs: 48
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 0 Bound: package[0][core:0]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 1 Bound: package[0][core:1]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 2 Bound: package[0][core:2]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 3 Bound: package[0][core:3]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 4 Bound: package[0][core:4]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 5 Bound: package[0][core:5]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 6 Bound: package[0][core:6]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 7 Bound: package[0][core:7]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 8 Bound: package[0][core:8]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 9 Bound: package[0][core:9]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 10 Bound: package[0][core:10]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 11 Bound: package[0][core:11]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 12 Bound: package[0][core:12]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 13 Bound: package[0][core:13]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 14 Bound: package[0][core:14]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 15 Bound: package[0][core:15]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 16 Bound: package[0][core:16]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 17 Bound: package[0][core:17]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 18 Bound: package[0][core:18]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 19 Bound: package[0][core:19]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 20 Bound: package[0][core:20]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 21 Bound: package[0][core:21]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 22 Bound: package[0][core:22]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 23 Bound: package[0][core:23]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 24 Bound: package[1][core:24]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 25 Bound: package[1][core:25]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 26 Bound: package[1][core:26]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 27 Bound: package[1][core:27]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 28 Bound: package[1][core:28]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 29 Bound: package[1][core:29]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 30 Bound: package[1][core:30]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 31 Bound: package[1][core:31]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 32 Bound: package[1][core:32]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 33 Bound: package[1][core:33]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 34 Bound: package[1][core:34]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 35 Bound: package[1][core:35]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 36 Bound: package[1][core:36]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 37 Bound: package[1][core:37]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 38 Bound: package[1][core:38]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 39 Bound: package[1][core:39]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 40 Bound: package[1][core:40]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 41 Bound: package[1][core:41]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 42 Bound: package[1][core:42]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 43 Bound: package[1][core:43]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 44 Bound: package[1][core:44]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 45 Bound: package[1][core:45]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 46 Bound: package[1][core:46]
        Process jobid: mpirun-deepv-8862@1 App: 0 Process rank: 47 Bound: package[1][core:47]

=============================================================
$ 5.0.0rc7/mpirun --map-by slot --bind-to core --display map --prtemca rmaps_base_verbose 100 hostname
[deepv:19088] mca: base: components_register: registering framework rmaps components
[deepv:19088] mca: base: components_register: found loaded component mindist
[deepv:19088] mca: base: components_register: component mindist register function successful
[deepv:19088] mca: base: components_register: found loaded component ppr
[deepv:19088] mca: base: components_register: component ppr register function successful
[deepv:19088] mca: base: components_register: found loaded component rank_file
[deepv:19088] mca: base: components_register: component rank_file has no register or open function
[deepv:19088] mca: base: components_register: found loaded component round_robin
[deepv:19088] mca: base: components_register: component round_robin register function successful
[deepv:19088] mca: base: components_register: found loaded component seq
[deepv:19088] mca: base: components_register: component seq register function successful
[deepv:19088] mca: base: components_open: opening rmaps components
[deepv:19088] mca: base: components_open: found loaded component mindist
[deepv:19088] mca: base: components_open: component mindist open function successful
[deepv:19088] mca: base: components_open: found loaded component ppr
[deepv:19088] mca: base: components_open: component ppr open function successful
[deepv:19088] mca: base: components_open: found loaded component rank_file
[deepv:19088] mca: base: components_open: found loaded component round_robin
[deepv:19088] mca: base: components_open: component round_robin open function successful
[deepv:19088] mca: base: components_open: found loaded component seq
[deepv:19088] mca: base: components_open: component seq open function successful
[deepv:19088] mca:rmaps:select: checking available component mindist
[deepv:19088] mca:rmaps:select: Querying component [mindist]
[deepv:19088] mca:rmaps:select: checking available component ppr
[deepv:19088] mca:rmaps:select: Querying component [ppr]
[deepv:19088] mca:rmaps:select: checking available component rank_file
[deepv:19088] mca:rmaps:select: Querying component [rank_file]
[deepv:19088] mca:rmaps:select: checking available component round_robin
[deepv:19088] mca:rmaps:select: Querying component [round_robin]
[deepv:19088] mca:rmaps:select: checking available component seq
[deepv:19088] mca:rmaps:select: Querying component [seq]
[deepv:19088] [mpirun-deepv-19088@0,0]: Final mapper priorities
[deepv:19088] 	Mapper: ppr Priority: 90
[deepv:19088] 	Mapper: seq Priority: 60
[deepv:19088] 	Mapper: mindist Priority: 20
[deepv:19088] 	Mapper: round_robin Priority: 10
[deepv:19088] 	Mapper: rank_file Priority: 0
[deepv:19088] [mpirun-deepv-19088@0,0] rmaps:base set policy with :display
[deepv:19088] [mpirun-deepv-19088@0,0] rmaps:base policy  modifiers display provided
[deepv:19088] [mpirun-deepv-19088@0,0] rmaps:base check modifiers with display
[deepv:19088] [mpirun-deepv-19088@0,0] rmaps:base set policy with slot
[deepv:19088] mca:rmaps: mapping job mpirun-deepv-19088@1
[deepv:19088] AVAILABLE NODES FOR MAPPING:
[deepv:19088]     node: dp-dam01 daemon: 1 slots_available: 48
[deepv:19088] mca:rmaps: setting mapping policies for job mpirun-deepv-19088@1 nprocs 48 inherit TRUE hwtcpus FALSE
[deepv:19088] mca:rmaps:ppr: job mpirun-deepv-19088@1 not using ppr mapper PPR NULL policy PPR NOTSET
[deepv:19088] mca:rmaps:seq: job mpirun-deepv-19088@1 not using seq mapper
[deepv:19088] mca:rmaps:mindist: job mpirun-deepv-19088@1 not using mindist mapper
[deepv:19088] mca:rmaps:rr: mapping job mpirun-deepv-19088@1
[deepv:19088] AVAILABLE NODES FOR MAPPING:
[deepv:19088]     node: dp-dam01 daemon: 1 slots_available: 48
[deepv:19088] mca:rmaps:rr: mapping by slot for job mpirun-deepv-19088@1 slots 48 num_procs 48
[deepv:19088] mca:rmaps:rr:slot working node dp-dam01
[deepv:19088] mca:rmaps:rr:slot assigning 48 procs to node dp-dam01
[deepv:19088] RANKING POLICY: SLOT
[deepv:19088] mca:rmaps:base: computing vpids by slot for job mpirun-deepv-19088@1
[deepv:19088] mca:rmaps:rank assigning vpid 0
[deepv:19088] mca:rmaps:rank assigning vpid 1
[deepv:19088] mca:rmaps:rank assigning vpid 2
[deepv:19088] mca:rmaps:rank assigning vpid 3
[deepv:19088] mca:rmaps:rank assigning vpid 4
[deepv:19088] mca:rmaps:rank assigning vpid 5
[deepv:19088] mca:rmaps:rank assigning vpid 6
[deepv:19088] mca:rmaps:rank assigning vpid 7
[deepv:19088] mca:rmaps:rank assigning vpid 8
[deepv:19088] mca:rmaps:rank assigning vpid 9
[deepv:19088] mca:rmaps:rank assigning vpid 10
[deepv:19088] mca:rmaps:rank assigning vpid 11
[deepv:19088] mca:rmaps:rank assigning vpid 12
[deepv:19088] mca:rmaps:rank assigning vpid 13
[deepv:19088] mca:rmaps:rank assigning vpid 14
[deepv:19088] mca:rmaps:rank assigning vpid 15
[deepv:19088] mca:rmaps:rank assigning vpid 16
[deepv:19088] mca:rmaps:rank assigning vpid 17
[deepv:19088] mca:rmaps:rank assigning vpid 18
[deepv:19088] mca:rmaps:rank assigning vpid 19
[deepv:19088] mca:rmaps:rank assigning vpid 20
[deepv:19088] mca:rmaps:rank assigning vpid 21
[deepv:19088] mca:rmaps:rank assigning vpid 22
[deepv:19088] mca:rmaps:rank assigning vpid 23
[deepv:19088] mca:rmaps:rank assigning vpid 24
[deepv:19088] mca:rmaps:rank assigning vpid 25
[deepv:19088] mca:rmaps:rank assigning vpid 26
[deepv:19088] mca:rmaps:rank assigning vpid 27
[deepv:19088] mca:rmaps:rank assigning vpid 28
[deepv:19088] mca:rmaps:rank assigning vpid 29
[deepv:19088] mca:rmaps:rank assigning vpid 30
[deepv:19088] mca:rmaps:rank assigning vpid 31
[deepv:19088] mca:rmaps:rank assigning vpid 32
[deepv:19088] mca:rmaps:rank assigning vpid 33
[deepv:19088] mca:rmaps:rank assigning vpid 34
[deepv:19088] mca:rmaps:rank assigning vpid 35
[deepv:19088] mca:rmaps:rank assigning vpid 36
[deepv:19088] mca:rmaps:rank assigning vpid 37
[deepv:19088] mca:rmaps:rank assigning vpid 38
[deepv:19088] mca:rmaps:rank assigning vpid 39
[deepv:19088] mca:rmaps:rank assigning vpid 40
[deepv:19088] mca:rmaps:rank assigning vpid 41
[deepv:19088] mca:rmaps:rank assigning vpid 42
[deepv:19088] mca:rmaps:rank assigning vpid 43
[deepv:19088] mca:rmaps:rank assigning vpid 44
[deepv:19088] mca:rmaps:rank assigning vpid 45
[deepv:19088] mca:rmaps:rank assigning vpid 46
[deepv:19088] mca:rmaps:rank assigning vpid 47
[deepv:19088] mca:rmaps: compute bindings for job mpirun-deepv-19088@1 with policy CORE[4007]
[deepv:19088] mca:rmaps: computing bindings for job mpirun-deepv-19088@1
[deepv:19088] [mpirun-deepv-19088@0,0] bind_depth: 5
[deepv:19088] mca:rmaps: bind downward for job mpirun-deepv-19088@1 with bindings CORE
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: node dp-dam01 has 48 procs on it
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,0]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,1]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,2]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,3]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,4]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,5]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,6]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,7]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,8]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,9]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,10]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,11]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,12]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,13]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,14]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,15]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,16]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,17]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,18]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,19]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,20]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,21]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,22]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,23]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,24]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,25]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,26]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,27]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,28]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,29]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,30]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,31]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,32]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,33]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,34]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,35]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,36]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,37]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,38]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,39]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,40]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,41]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,42]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,43]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,44]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,45]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,46]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,47]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,0] BITMAP 0,48
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,0][dp-dam01] TO package[0][core:0]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,1] BITMAP 1,49
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,1][dp-dam01] TO package[0][core:1]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,2] BITMAP 2,50
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,2][dp-dam01] TO package[0][core:2]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,3] BITMAP 3,51
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,3][dp-dam01] TO package[0][core:3]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,4] BITMAP 4,52
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,4][dp-dam01] TO package[0][core:4]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,5] BITMAP 5,53
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,5][dp-dam01] TO package[0][core:5]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,6] BITMAP 6,54
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,6][dp-dam01] TO package[0][core:6]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,7] BITMAP 7,55
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,7][dp-dam01] TO package[0][core:7]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,8] BITMAP 8,56
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,8][dp-dam01] TO package[0][core:8]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,9] BITMAP 9,57
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,9][dp-dam01] TO package[0][core:9]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,10] BITMAP 10,58
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,10][dp-dam01] TO package[0][core:10]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,11] BITMAP 11,59
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,11][dp-dam01] TO package[0][core:11]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,12] BITMAP 12,60
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,12][dp-dam01] TO package[0][core:12]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,13] BITMAP 13,61
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,13][dp-dam01] TO package[0][core:13]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,14] BITMAP 14,62
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,14][dp-dam01] TO package[0][core:14]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,15] BITMAP 15,63
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,15][dp-dam01] TO package[0][core:15]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,16] BITMAP 16,64
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,16][dp-dam01] TO package[0][core:16]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,17] BITMAP 17,65
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,17][dp-dam01] TO package[0][core:17]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,18] BITMAP 18,66
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,18][dp-dam01] TO package[0][core:18]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,19] BITMAP 19,67
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,19][dp-dam01] TO package[0][core:19]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,20] BITMAP 20,68
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,20][dp-dam01] TO package[0][core:20]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,21] BITMAP 21,69
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,21][dp-dam01] TO package[0][core:21]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,22] BITMAP 22,70
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,22][dp-dam01] TO package[0][core:22]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,23] BITMAP 23,71
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,23][dp-dam01] TO package[0][core:23]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,24] BITMAP 24,72
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,24][dp-dam01] TO package[1][core:24]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,25] BITMAP 25,73
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,25][dp-dam01] TO package[1][core:25]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,26] BITMAP 26,74
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,26][dp-dam01] TO package[1][core:26]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,27] BITMAP 27,75
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,27][dp-dam01] TO package[1][core:27]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,28] BITMAP 28,76
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,28][dp-dam01] TO package[1][core:28]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,29] BITMAP 29,77
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,29][dp-dam01] TO package[1][core:29]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,30] BITMAP 30,78
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,30][dp-dam01] TO package[1][core:30]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,31] BITMAP 31,79
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,31][dp-dam01] TO package[1][core:31]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,32] BITMAP 32,80
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,32][dp-dam01] TO package[1][core:32]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,33] BITMAP 33,81
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,33][dp-dam01] TO package[1][core:33]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,34] BITMAP 34,82
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,34][dp-dam01] TO package[1][core:34]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,35] BITMAP 35,83
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,35][dp-dam01] TO package[1][core:35]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,36] BITMAP 36,84
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,36][dp-dam01] TO package[1][core:36]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,37] BITMAP 37,85
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,37][dp-dam01] TO package[1][core:37]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,38] BITMAP 38,86
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,38][dp-dam01] TO package[1][core:38]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,39] BITMAP 39,87
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,39][dp-dam01] TO package[1][core:39]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,40] BITMAP 40,88
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,40][dp-dam01] TO package[1][core:40]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,41] BITMAP 41,89
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,41][dp-dam01] TO package[1][core:41]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,42] BITMAP 42,90
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,42][dp-dam01] TO package[1][core:42]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,43] BITMAP 43,91
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,43][dp-dam01] TO package[1][core:43]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,44] BITMAP 44,92
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,44][dp-dam01] TO package[1][core:44]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,45] BITMAP 45,93
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,45][dp-dam01] TO package[1][core:45]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,46] BITMAP 46,94
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,46][dp-dam01] TO package[1][core:46]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,47] BITMAP 47,95
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,47][dp-dam01] TO package[1][core:47]

========================   JOB MAP   ========================
Data for JOB mpirun-deepv-19088@1 offset 0 Total slots allocated 48
    Mapping policy: BYSLOT:NOOVERSUBSCRIBE  Ranking policy: SLOT Binding policy: CORE
    Cpu set: N/A  PPR: N/A  Cpus-per-rank: N/A  Cpu Type: CORE


Data for node: dp-dam01	Num slots: 48	Max slots: 0	Num procs: 48
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 0 Bound: package[0][core:0]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 1 Bound: package[0][core:1]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 2 Bound: package[0][core:2]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 3 Bound: package[0][core:3]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 4 Bound: package[0][core:4]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 5 Bound: package[0][core:5]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 6 Bound: package[0][core:6]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 7 Bound: package[0][core:7]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 8 Bound: package[0][core:8]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 9 Bound: package[0][core:9]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 10 Bound: package[0][core:10]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 11 Bound: package[0][core:11]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 12 Bound: package[0][core:12]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 13 Bound: package[0][core:13]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 14 Bound: package[0][core:14]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 15 Bound: package[0][core:15]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 16 Bound: package[0][core:16]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 17 Bound: package[0][core:17]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 18 Bound: package[0][core:18]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 19 Bound: package[0][core:19]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 20 Bound: package[0][core:20]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 21 Bound: package[0][core:21]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 22 Bound: package[0][core:22]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 23 Bound: package[0][core:23]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 24 Bound: package[1][core:24]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 25 Bound: package[1][core:25]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 26 Bound: package[1][core:26]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 27 Bound: package[1][core:27]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 28 Bound: package[1][core:28]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 29 Bound: package[1][core:29]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 30 Bound: package[1][core:30]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 31 Bound: package[1][core:31]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 32 Bound: package[1][core:32]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 33 Bound: package[1][core:33]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 34 Bound: package[1][core:34]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 35 Bound: package[1][core:35]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 36 Bound: package[1][core:36]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 37 Bound: package[1][core:37]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 38 Bound: package[1][core:38]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 39 Bound: package[1][core:39]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 40 Bound: package[1][core:40]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 41 Bound: package[1][core:41]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 42 Bound: package[1][core:42]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 43 Bound: package[1][core:43]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 44 Bound: package[1][core:44]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 45 Bound: package[1][core:45]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 46 Bound: package[1][core:46]
        Process jobid: mpirun-deepv-19088@1 App: 0 Process rank: 47 Bound: package[1][core:47]

=============================================================

[deepv:19088] mca:rmaps: compute bindings for job mpirun-deepv-19088@1 with policy CORE[4007]
[deepv:19088] mca:rmaps: computing bindings for job mpirun-deepv-19088@1
[deepv:19088] [mpirun-deepv-19088@0,0] bind_depth: 5
[deepv:19088] mca:rmaps: bind downward for job mpirun-deepv-19088@1 with bindings CORE
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: node dp-dam01 has 48 procs on it
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,0]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,1]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,2]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,3]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,4]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,5]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,6]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,7]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,8]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,9]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,10]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,11]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,12]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,13]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,14]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,15]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,16]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,17]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,18]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,19]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,20]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,21]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,22]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,23]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,24]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,25]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,26]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,27]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,28]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,29]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,30]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,31]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,32]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,33]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,34]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,35]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,36]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,37]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,38]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,39]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,40]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,41]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,42]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,43]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,44]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,45]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,46]
[deepv:19088] [mpirun-deepv-19088@0,0] reset_usage: ignoring proc [mpirun-deepv-19088@1,47]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,0] BITMAP 0,48
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,0][dp-dam01] TO package[0][core:0]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,1] BITMAP 1,49
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,1][dp-dam01] TO package[0][core:1]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,2] BITMAP 2,50
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,2][dp-dam01] TO package[0][core:2]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,3] BITMAP 3,51
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,3][dp-dam01] TO package[0][core:3]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,4] BITMAP 4,52
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,4][dp-dam01] TO package[0][core:4]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,5] BITMAP 5,53
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,5][dp-dam01] TO package[0][core:5]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,6] BITMAP 6,54
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,6][dp-dam01] TO package[0][core:6]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,7] BITMAP 7,55
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,7][dp-dam01] TO package[0][core:7]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,8] BITMAP 8,56
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,8][dp-dam01] TO package[0][core:8]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,9] BITMAP 9,57
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,9][dp-dam01] TO package[0][core:9]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,10] BITMAP 10,58
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,10][dp-dam01] TO package[0][core:10]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,11] BITMAP 11,59
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,11][dp-dam01] TO package[0][core:11]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,12] BITMAP 12,60
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,12][dp-dam01] TO package[0][core:12]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,13] BITMAP 13,61
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,13][dp-dam01] TO package[0][core:13]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,14] BITMAP 14,62
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,14][dp-dam01] TO package[0][core:14]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,15] BITMAP 15,63
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,15][dp-dam01] TO package[0][core:15]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,16] BITMAP 16,64
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,16][dp-dam01] TO package[0][core:16]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,17] BITMAP 17,65
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,17][dp-dam01] TO package[0][core:17]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,18] BITMAP 18,66
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,18][dp-dam01] TO package[0][core:18]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,19] BITMAP 19,67
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,19][dp-dam01] TO package[0][core:19]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,20] BITMAP 20,68
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,20][dp-dam01] TO package[0][core:20]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,21] BITMAP 21,69
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,21][dp-dam01] TO package[0][core:21]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,22] BITMAP 22,70
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,22][dp-dam01] TO package[0][core:22]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,23] BITMAP 23,71
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,23][dp-dam01] TO package[0][core:23]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,24] BITMAP 24,72
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,24][dp-dam01] TO package[1][core:24]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,25] BITMAP 25,73
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,25][dp-dam01] TO package[1][core:25]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,26] BITMAP 26,74
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,26][dp-dam01] TO package[1][core:26]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,27] BITMAP 27,75
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,27][dp-dam01] TO package[1][core:27]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,28] BITMAP 28,76
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,28][dp-dam01] TO package[1][core:28]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,29] BITMAP 29,77
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,29][dp-dam01] TO package[1][core:29]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,30] BITMAP 30,78
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,30][dp-dam01] TO package[1][core:30]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,31] BITMAP 31,79
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,31][dp-dam01] TO package[1][core:31]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,32] BITMAP 32,80
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,32][dp-dam01] TO package[1][core:32]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,33] BITMAP 33,81
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,33][dp-dam01] TO package[1][core:33]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,34] BITMAP 34,82
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,34][dp-dam01] TO package[1][core:34]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,35] BITMAP 35,83
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,35][dp-dam01] TO package[1][core:35]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,36] BITMAP 36,84
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,36][dp-dam01] TO package[1][core:36]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,37] BITMAP 37,85
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,37][dp-dam01] TO package[1][core:37]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,38] BITMAP 38,86
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,38][dp-dam01] TO package[1][core:38]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,39] BITMAP 39,87
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,39][dp-dam01] TO package[1][core:39]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,40] BITMAP 40,88
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,40][dp-dam01] TO package[1][core:40]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,41] BITMAP 41,89
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,41][dp-dam01] TO package[1][core:41]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,42] BITMAP 42,90
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,42][dp-dam01] TO package[1][core:42]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,43] BITMAP 43,91
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,43][dp-dam01] TO package[1][core:43]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,44] BITMAP 44,92
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,44][dp-dam01] TO package[1][core:44]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,45] BITMAP 45,93
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,45][dp-dam01] TO package[1][core:45]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,46] BITMAP 46,94
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,46][dp-dam01] TO package[1][core:46]
[deepv:19088] [mpirun-deepv-19088@0,0] PROC [mpirun-deepv-19088@1,47] BITMAP 47,95
[deepv:19088] [mpirun-deepv-19088@0,0] BOUND PROC [mpirun-deepv-19088@1,47][dp-dam01] TO package[1][core:47]

@rhc54
Contributor

rhc54 commented Feb 3, 2023

Oh my - that is a very old version of PRRTE in that rc! We haven't used that binding algo in quite some time.

Can you generate an xml output from lstopo and post it for me? I'd like to see if I can locally reproduce the problem.

@gkatev
Contributor Author

gkatev commented Feb 3, 2023

Yes here it is: topo.xml.txt

@rhc54
Contributor

rhc54 commented Feb 3, 2023

Yeah, there is something funny going on with respect to Slurm - everything works fine for me with your topology. It's that externally applied binding that is causing the problem. The key is to figure out where it is coming from.

@rhc54
Contributor

rhc54 commented Feb 4, 2023

I had a thought hit me and checked your topology to confirm it. The reason we cannot bind-to core is that the applied binding only allocates one hwthread from each core. In other words, there are two hwts per core, but you are only being allocated the first one on each core. So we cannot bind you to a core because you don't "own" both hwts on any core.

This looks increasingly like a Slurm configuration issue.
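
For anyone who wants to check this condition directly, here is a small standalone hwloc sketch (my own illustration, not PRRTE code) that counts how many cores are fully covered by the process's current binding, i.e. cores for which every PU is available. If that count is zero, a bind-to core request cannot be satisfied, which is the situation described above.

/* standalone sketch (not PRRTE code): count the cores whose PUs are all
 * contained in our current binding */
#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_bitmap_t bound = hwloc_bitmap_alloc();
    int i, ncores, fully_owned = 0;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* the binding applied to this process (e.g. by Slurm/cgroups) */
    hwloc_get_cpubind(topo, bound, HWLOC_CPUBIND_PROCESS);

    ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    for (i = 0; i < ncores; i++) {
        hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, i);
        /* core->cpuset lists all PUs of the core; the core is "owned"
         * only if the binding includes every one of them */
        if (hwloc_bitmap_isincluded(core->cpuset, bound))
            fully_owned++;
    }

    printf("cores fully covered by the binding: %d of %d\n", fully_owned, ncores);

    hwloc_bitmap_free(bound);
    hwloc_topology_destroy(topo);
    return 0;
}

Against the compute-node topology above (each core holds PUs i and i+48), a 0-19 binding covers only the first PU of cores 0-19, so this would report 0 fully covered cores, matching the "Out of resource" result.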

@gkatev
Contributor Author

gkatev commented Feb 6, 2023

Ah, that makes sense. I thought it might have been my -N 1 -n 48 alloc command that was problematic, but with -N 1 -n 96 and some other tests there was no difference. I suppose the 0-19 set (which only includes one hwthread per core) takes precedence, and what you describe happens. No clue where it comes from at the moment (I'll let the admins know and see if they can help).

@gkatev
Contributor Author

gkatev commented Feb 8, 2023

I realized where the 20 comes from: the login node, on which I run salloc, has 20 cores. I believe it's also a VM, in case that's relevant.

Does this ring any new bells? Is the problem that all this mapping/binding code is running in the context of the login node while it should be running in the context of the allocated node?

@rhc54
Contributor

rhc54 commented Feb 8, 2023

That is actually a rather typical scenario (minus the VM). We don't use the topology from the node where mpirun is executing - we launch our daemons across the allocation, and then use the topology they discover for performing the process placement. So it shouldn't matter what is on the login node.

That said, I'm happy to take another look and verify that we aren't making a mistake somewhere in sensing an external binding.

@gkatev
Contributor Author

gkatev commented Feb 8, 2023

Perfect, that's as expected. My thought is that the daemon is seeing the login node's topology instead of that of the allocated one -- for some reason. Or is it something like the launched daemon "retaining" the affinity of the login node?

I added some debug prints to prte_hwloc_base_setup_summary:

diff --git a/src/hwloc/hwloc_base_util.c b/src/hwloc/hwloc_base_util.c
index d1bdfb6940..b8f9c130a3 100644
--- a/src/hwloc/hwloc_base_util.c
+++ b/src/hwloc/hwloc_base_util.c
@@ -164,14 +164,52 @@ hwloc_cpuset_t prte_hwloc_base_generate_cpuset(hwloc_topology_t topo,
     return avail;
 }
 
+static void affinity(void) {
+    cpu_set_t mask;
+    long nproc, i;
+    
+    if(sched_getaffinity(0, sizeof(cpu_set_t), &mask) == -1) {
+        perror("sched_getaffinity");
+        assert(false);
+    }
+    
+    nproc = sysconf(_SC_NPROCESSORS_ONLN);
+    printf("sched_getaffinity = ");
+    
+    long n_aff = 0;
+    
+    for(i = 0; i < nproc; i++) {
+        printf("%d ", CPU_ISSET(i, &mask));
+        
+        if(CPU_ISSET(i, &mask) == 1)
+            n_aff++;
+    }
+    
+    printf("(%ld)\n", n_aff);
+}
+
+#include <unistd.h>
+#include <limits.h>
+
 hwloc_cpuset_t prte_hwloc_base_setup_summary(hwloc_topology_t topo)
 {
     hwloc_cpuset_t avail = NULL;
+    char *tmp;
 
     avail = hwloc_bitmap_alloc();
     /* get the cpus we are bound to */
     if (!prte_hwloc_synthetic_topo &&
         0 <= hwloc_get_cpubind(topo, avail, HWLOC_CPUBIND_PROCESS)) {
+        tmp = NULL;
+        hwloc_bitmap_list_asprintf(&tmp, avail);
+        pmix_output(0, "WE ARE BOUND: %s", tmp);
+        
+        char hostname[HOST_NAME_MAX + 1];
+        gethostname(hostname, HOST_NAME_MAX + 1);
+        printf("hostname: %s\n", hostname);
+        
+        affinity();
+        
         return avail;
     }

Produces:

$ mpirun -n 1 --map-by slot --bind-to core echo 
[deepv:13833] NO DEFAULT CPU LIST
[deepv:13833] WE ARE BOUND: 0-19
hostname: deepv
sched_getaffinity = 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 (20)
[deepv:13833] ESS: 0-19
[deepv:13833] NO DEFAULT CPU LIST
[deepv:13833] WE ARE BOUND: 0-19
hostname: deepv
sched_getaffinity = 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 (20)
[deepv:13833] JOBCPUSET ATTR: NULL
[deepv:13833] AVAILABLECPUSET: 0-19
[deepv:13833] CPUSET MAPPER: NULL
[deepv:13833] JOBCPUSET MAPPER: 0-19
[deepv:13833] JOBCPUSET: 0-19
[deepv:13833] ALLOWEDCPUSET: 0-95
[deepv:13833] OBJ: NULL
[deepv:13833] GETNCPUS: 0-19
[deepv:13833] JOBCPUSET: 0-19
[deepv:13833] ALLOWEDCPUSET: 0-95
[deepv:13833] OBJ: NULL
[deepv:13833] GETNCPUS: 0-19
[deepv:13833] AVAILABLECPUSET: 0-19
[deepv:13833] CPUSET MAPPER: NULL
[deepv:13833] JOBCPUSET MAPPER: 0-19
[deepv:13833] JOBCPUSET: 0-19
[deepv:13833] ALLOWEDCPUSET: 0-95
[deepv:13833] OBJ: NULL
[deepv:13833] GETNCPUS: 0-19
[deepv:13833] JOBCPUSET: 0-19
[deepv:13833] ALLOWEDCPUSET: 0-95
[deepv:13833] OBJ: NULL
[deepv:13833] GETNCPUS: 0-19
--------------------------------------------------------------------------
Your job failed to map. Either no mapper was available, or none
of the available mappers was able to perform the requested
mapping operation.

  Mapper result:       Out of resource
  Application:         echo
  #procs to be mapped: 1
  Mapping policy:      BYSLOT
  Binding policy:      CORE

--------------------------------------------------------------------------

It seems like this code is running on the login node (hostname: deepv) -- but maybe it's supposed to? That is, the daemons collect the topology information and send it back to the mapper, which then maps the processes?

Maybe you could point me to the code where the topology information is collected? Or to some part of the daemon's code where I could check the affinity or other information?

@rhc54
Contributor

rhc54 commented Feb 8, 2023

I'll take a look at it - seems like something may be off there. Out of curiosity, though - why have you bound mpirun in your VM?? At least, hwloc thinks you are bound, seemingly to all processors in that VM?

@rhc54
Contributor

rhc54 commented Feb 8, 2023

I suspect this diff will fix it, though I'm a tad concerned it will break others:

diff --git a/src/hwloc/hwloc_base_util.c b/src/hwloc/hwloc_base_util.c
index d1bdfb6940..b342651d23 100644
--- a/src/hwloc/hwloc_base_util.c
+++ b/src/hwloc/hwloc_base_util.c
@@ -169,11 +169,6 @@ hwloc_cpuset_t prte_hwloc_base_setup_summary(hwloc_topology_t topo)
     hwloc_cpuset_t avail = NULL;
 
     avail = hwloc_bitmap_alloc();
-    /* get the cpus we are bound to */
-    if (!prte_hwloc_synthetic_topo &&
-        0 <= hwloc_get_cpubind(topo, avail, HWLOC_CPUBIND_PROCESS)) {
-        return avail;
-    }
 
     /* get the root available cpuset */
 #if HWLOC_API_VERSION < 0x20000
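
For reference, here is a rough sketch (my paraphrase under stated assumptions, not the actual PRRTE source) of what the summary cpuset falls back to once the cpubind shortcut is removed: the available set comes from the topology's root/allowed cpuset rather than from the launcher's own binding.

/* sketch only -- illustrates the fallback path, not the real function */
#include <hwloc.h>

static hwloc_cpuset_t setup_summary_sketch(hwloc_topology_t topo)
{
    hwloc_cpuset_t avail = hwloc_bitmap_alloc();

#if HWLOC_API_VERSION < 0x20000
    /* hwloc 1.x: the allowed cpuset hangs off the root object */
    hwloc_bitmap_copy(avail, hwloc_get_root_obj(topo)->allowed_cpuset);
#else
    /* hwloc 2.x: topology-wide allowed cpuset */
    hwloc_bitmap_copy(avail, hwloc_topology_get_allowed_cpuset(topo));
#endif
    return avail;
}

With that change, a launcher that is itself bound to a subset of the node (as mpirun is here) no longer narrows the set of CPUs considered available for mapping.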

@rhc54
Contributor

rhc54 commented Feb 8, 2023

Hmmm...when you say you are running mpirun "in a VM", do you actually mean a virtual machine? Or are you talking about a container - and if so, which container tech are you using? I think that might be the root of the issue here - your container allows hwloc to see the entire node topology, with the container simply bound to some subset of the cpus.

Perhaps one data point is: when you earlier provided me with your topology, what node did you use to get it? The login node? Or a compute node?

@gkatev
Contributor Author

gkatev commented Feb 8, 2023

The login node itself apparently looks like a VM or something similar -- I don't know what exactly. I noticed it in lscpu (a somewhat odd CPU name, and a hypervisor flag), and thought it might be relevant, as I imagined VMs are more likely to have weird topologies. So both salloc and mpirun are run on the same "system".

The topology XML above was from the allocated compute node, and here is the one from the login node: login_topo.xml.txt. This node has 20 cores as seen in htop and lstopo, so this must be where the "20" we see is coming from. But we shouldn't be seeing any aspect of the login node's topology, as it's not part of the allocation (?).

Based on this I thought that perhaps one of two effects was taking place: 1. The "number of available cores" was bleeding through from the login node to the allocated node -- if that makes sense -- I'm not sure in what form that would happen, and I imagine Slurm could also play a role here. Or 2, code that should be running on the allocated node instead runs on the login node and collects a topology that shows 20 cores available instead of 48. Or something along these lines -- this is how I perceive it with my uneducated eyes!

I haven't done any kind of binding to mpirun -- what exactly did you mean?
(thanks for the diff, will test it in a bit)

@rhc54
Contributor

rhc54 commented Feb 8, 2023

I haven't done any kind of binding to mpirun -- what exactly did you mean?

I mean that mpirun has been constrained to execute on a subset of the available physical processors on the login node. From the topology, it looks like that is what has happened here as I very much doubt that your login node consists of a CPU package with 20 single-thread physical cores in it. Historically, allocation constraints on mpirun were carried over to compute nodes on clusters. I think hwloc has done a better job recently in including such things in their various cpuset fields, and so that code can probably be deprecated. Need to think more about it to ensure we aren't creating additional problems for others, but I think the diff I provided is probably correct.

@gkatev
Contributor Author

gkatev commented Feb 8, 2023

The way I understand it is that it's not mpirun that's getting constrained, but rather the whole VM that makes up the login node is given just 20 cores from the physical host system that it runs on. Or did you mean that it's possible for the constraint of the VM on the host system to pass over to mpirun, which runs on the guest, and affect the job?

Indeed, with the above diff that removes the hwloc_get_cpubind call/check, the problems go away! (I don't think I would argue for removing this check -- I'm still not sure where this binding comes from or why we are seeing it in the job.)

@rhc54
Contributor

rhc54 commented Feb 8, 2023

The way I understand it is that it's not mpirun that's getting constrained, but rather the whole VM that makes up the login node is given just 20 cores from the physical host system that it runs on.

I think we are drifting down into terminology hell, so let's just drop it.

Indeed, with the above diff that removes the hwloc_get_cpubind call/check, the problems go away!

Yeah, I expected that would be the case. I need to do some checking to ensure we aren't breaking other people. This scenario is pretty unusual/odd, so I don't want to break the norm just to support it. I'll play around with it a bit.

@rhc54
Contributor

rhc54 commented Feb 8, 2023

@bgoglin Can you please clarify something for me? When you say "allowed" cpuset, are you talking about the cpus that this process is allowed to utilize? Effectively the online cpus that the process has been bound to?

So if I've been externally bound and I discover my topology, calling hwloc_topology_get_allowed_cpuset on the root object of the topology will return the binding envelope - correct?

@bgoglin
Contributor

bgoglin commented Feb 8, 2023

allowed = what's available in your cgroup on Linux, but not necessarily your current binding. You may be bound to fewer CPUs than the cgroup allows. And you'd be able to rebind to a larger set, but always inside the allowed list.
If you have CPU#0-#7 in your machine, with only #0-#3 in the cgroup and you're bound to #0-#1, then your binding is #0-#1 while hwloc_topology_get_allowed_cpuset() would return #0-#3.
Not sure what you call the "binding envelope". If that's the current binding, then it's #0-#1. If that's the maximal binding, then it's #0-#3 (= allowed).
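
To make the distinction concrete, here is a small standalone example (my own, for illustration only) that prints both sets, mirroring the #0-#1 vs #0-#3 distinction above: the current binding can be strictly smaller than what the cgroup allows.

/* standalone illustration: current binding vs. the cgroup-allowed cpuset */
#include <hwloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_bitmap_t bound = hwloc_bitmap_alloc();
    char *bound_str = NULL, *allowed_str = NULL;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* what this process is currently bound to (may be a subset of allowed) */
    hwloc_get_cpubind(topo, bound, HWLOC_CPUBIND_PROCESS);
    hwloc_bitmap_list_asprintf(&bound_str, bound);

    /* what the cgroup permits, whether or not we are currently using it */
    hwloc_bitmap_list_asprintf(&allowed_str,
                               hwloc_topology_get_allowed_cpuset(topo));

    printf("current binding: %s\n", bound_str);
    printf("allowed cpuset:  %s\n", allowed_str);

    free(bound_str);
    free(allowed_str);
    hwloc_bitmap_free(bound);
    hwloc_topology_destroy(topo);
    return 0;
}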

@rhc54
Contributor

rhc54 commented Feb 8, 2023

Guess it may just be me, but I have always considered a cgroup to be an "externally applied binding". Sure, you can subdivide it if you want - but you can't get outside it (well, unless you have privilege - but that's another story).

You've given me the answer I needed - the "get_allowed_cpuset" is returning the externally applied binding envelope, which means I can get rid of the outdated code that is causing the problem here. Thanks!

@rhc54
Contributor

rhc54 commented Feb 9, 2023

@gkatev Thanks for your patience and assistance in tracking this down - much appreciated!

@gkatev
Contributor Author

gkatev commented Feb 9, 2023

Thanks @rhc54 for the support and for the fix! Confirming that the linked PR does fix the original issue. Feel free to close the issue whenever -- I imagine we might want it open until the commit in PRRTE makes its way into 5.0.x (for milestone tracking purposes).

@awlauria awlauria closed this as completed May 1, 2023
@awlauria
Contributor

awlauria commented May 1, 2023

This is in v5.0.x
