Error launching under slurm (Out of resource) #11371
Try this while under the Slurm allocation:
Do we expect this command to hang? (It does.) I'm using my cluster's system-wide version, which as far as I can see is 22.05.7.
I realized I had Revised effects without any env vars:
It still stands that it appears in main and 5.0.x but not 5.0.0rc8. Edit: when running the second command above (
If you specify the binding, then we require that the binding be done - i.e., if you say

My best guess here is that Slurm is assigning you to some place where we cannot bind your procs to cores. Perhaps there are only hwthreads and no cores? If you do

I don't have access to a Slurm machine, but we aren't hearing of any problems from people who do - so I'm thinking there might be something about this Slurm setup that is causing the problem. I asked for the version because I was just contacted by SchedMD about a bug in one of their releases that causes OMPI some issues, but your version doesn't match, so that isn't the cause.
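For concreteness, a rough sketch of the hwthread-level launch being suggested (option spellings are an assumption based on OMPI 5.x mpirun and may differ on your build; check mpirun --help):

# try mapping/binding at the hwthread level inside the existing allocation
# (./my_app and the process count are placeholders for your own job)
mpirun --bind-to hwthread -np 48 ./my_app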
I see. Perhaps I also have to contact the system's maintainers; I'm not super familiar with slurm either. The hwthread command does yield an improvement, but I can only get at most 20 processes (seems arbitrary; the system has 48/96 cores/threads across 2 sockets):
I will also try out a couple of different salloc parameters (e.g. cpus instead of tasks, or something like that) to see if I spot an improvement.
Could be some kind of cgroup setting - we are only seeing 20 hwthreads in the allocation. Sounds suspicious.
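A rough way to check for such a constraint from a shell inside the salloc allocation (these are stock Linux/hwloc tools, nothing OMPI-specific, and assume they are installed on the node):

grep Cpus_allowed_list /proc/self/status   # affinity/cgroup window of this shell
taskset -cp $$                             # same information via util-linux
nproc                                      # honors the affinity mask
hwloc-bind --get                           # current binding as hwloc sees it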
Not sure this fits all of your issue, but here are a few items we check/use for running prte/ompi within a local slurm system.
ensure you configured with `--enable-slurm`
export PRTE_MCA_ras=slurm
export PRTE_MCA_plm=slurm
# *** or directly on command-line ***
mpirun \
--prtemca plm slurm \
--prtemca ras slurm \
...
# Useful to ensure can use all cores in allocation
# when running mpirun in allocation
export PRTE_MCA_ras_slurm_use_entire_allocation=1
# Needed for VNI enabled Cray XE SS11 system
export PRTE_MCA_ras_base_launch_orted_on_hn=1
# *** or directly on command-line ***
mpirun \
--prtemca ras_slurm_use_entire_allocation 1 \
--prtemca ras_base_launch_orted_on_hn 1 \
...
salloc -S 0 ... # do not reserve any cores for system noise
salloc -S 1 ... # reserve 1 core
salloc -S 8 ... # reserve 8 cores (one per l3cache) [default]

# Example allocation, and mpirun verbosity options to look at:
salloc -N 2 -t 10 -A $MYACCT -S 0 --threads-per-core 2

--prtemca ras_base_verbose 20 # See that #slots matches SLURM_JOB_CPUS_PER_NODE
--prtemca plm_base_verbose 20 # Ensure slurm selected
Regarding the initial claim about the versions, there was a mistake in my 5.0.0rc8 installation. So after clearing that up, it is in fact 5.0.0rc7 that works fine. 5.0.0rc8 also errors out, even without the bind-to param (maybe the default there was to bind to core?), but with a different message:
5.0.x and main also error out, as described above (no change). Do we know how this variable (5.0.0rc7 vs 5.0.0rc8) might play a role in this? I see a bunch of changes in
I know that those release candidates are quite stale (in terms of PRRTE), and so I can't be sure you aren't just hitting old problems that have already been fixed. Likewise, OMPI main was just updated yesterday (IIRC - it was in the last day or two). Are you using the head of OMPI main as of today? If not, it might be worth updating so we know we are all looking at the same thing.

You might also want to go into the 3rd-party openpmix and prrte submodule directories and do "git checkout master" followed by "git pull" on each of them, just to ensure you have the latest of both code bases. Please be sure to configure with

The error message is really suspicious to me. For whatever reason, you don't seem to have a usable allocation. Let's try with an updated OMPI main (as per above) and see what you get. If it fails, then add "--prtemca rmaps_base_verbose 100" to the mpirun cmd line.
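A sketch of that refresh, assuming a standard git clone of ompi with the 3rd-party submodule layout; everything other than --enable-debug is a placeholder for your usual options:

cd 3rd-party/openpmix && git checkout master && git pull && cd ../..
cd 3rd-party/prrte && git checkout master && git pull && cd ../..
./autogen.pl
./configure --enable-debug ...   # plus your normal options/prefix
make -j 8 install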
I see. In the above I had prrte @ dc6ccf6 and openpmix @ 415d704 (a few days old). I made a new build with the latest prrte/openpmix master. I'm now at prrte @ 081890a, openpmix @ 0818181 (debug build); the problem does remain.
Again, with the hwthreads I can spawn up to 20 procs but not more. Could you elaborate a bit on the cgroup thing, with which I don't have experience? Can I easily check that it's not in effect? Also, thanks @naughtont3 for the debug info; generally things seem in order. Under my
Do we think this looks like a slurm installation issue or like a prrte issue? Should
Really not sure at this point. Your Slurm envars look correct, but it is also clear that we are not seeing the allocation.
If "get_ncpus" returns 0, then we can't assign a process to that location as we don't have any available cpus. It uses hwloc to find the cpus, but the number of available cpus is impacted by Slurm, which can control what hwloc sees by setting a "window" of allowed cpus (cgroups is the mechanism by which that is done). Basically, think of it as Slurm "binding" anything you run to a specified set of cpus. In this case, the allocation output from Try applying the following patch to the 3rd-party/prrte directory: diff --git a/src/mca/rmaps/base/rmaps_base_support_fns.c b/src/mca/rmaps/base/rmaps_base_support_fns.c
index c7d04044ab..2fa6b7d6bc 100644
--- a/src/mca/rmaps/base/rmaps_base_support_fns.c
+++ b/src/mca/rmaps/base/rmaps_base_support_fns.c
@@ -674,6 +674,7 @@ int prte_rmaps_base_get_ncpus(prte_node_t *node,
prte_rmaps_options_t *options)
{
int ncpus;
+ char *tmp;
#if HWLOC_API_VERSION < 0x20000
hwloc_obj_t root;
@@ -687,15 +688,30 @@ int prte_rmaps_base_get_ncpus(prte_node_t *node,
hwloc_bitmap_and(prte_rmaps_base.available, prte_rmaps_base.available, obj->allowed_cpuset);
}
#else
+ if (NULL != options->job_cpuset) {
+ tmp = NULL;
+ hwloc_bitmap_list_asprintf(&tmp, options->job_cpuset);
+ pmix_output(0, "JOBCPUSET: %s", (NULL == tmp) ? "NO-CPUS" : tmp);
+ } else {
+ pmix_output(0, "JOBCPUSET IS NULL");
+ }
+ tmp = NULL;
+ hwloc_bitmap_list_asprintf(&tmp, hwloc_topology_get_allowed_cpuset(node->topology->topo));
+ pmix_output(0, "ALLOWEDCPUSET: %s", (NULL == tmp) ? "NULL" : tmp);
if (NULL == options->job_cpuset) {
hwloc_bitmap_copy(prte_rmaps_base.available, hwloc_topology_get_allowed_cpuset(node->topology->topo));
} else {
hwloc_bitmap_and(prte_rmaps_base.available, hwloc_topology_get_allowed_cpuset(node->topology->topo), options->job_cpuset);
}
+ pmix_output(0, "OBJ: %s", (NULL == obj) ? "NULL" : "NON-NULL");
if (NULL != obj) {
hwloc_bitmap_and(prte_rmaps_base.available, prte_rmaps_base.available, obj->cpuset);
}
#endif
+ tmp = NULL;
+ hwloc_bitmap_list_asprintf(&tmp, prte_rmaps_base.available);
+ pmix_output(0, "GETNCPUS: %s", (NULL == tmp) ? "NULL" : tmp);
+
if (options->use_hwthreads) {
ncpus = hwloc_bitmap_weight(prte_rmaps_base.available);
} else {

Should hopefully provide a little more insight into what is going on.
I see, sounds reasonable. Let me also note that I'm not 100% confident in my salloc params, but I did also try other ones, e.g. with -t or -c instead of -n, but no dice. With the above patch applied:
Hmmm...well, that certainly wasn't what I expected to see! It looks like you have something that is setting the job cpuset (like an MCA param for "hwloc_default_cpu_list") that is restricting the available cpus to 0-19 (a quick way to scan for stray MCA params is sketched after the diff below). Could you add the following diff:

diff --git a/src/mca/rmaps/base/rmaps_base_map_job.c b/src/mca/rmaps/base/rmaps_base_map_job.c
index 4ae7df6493..a81267e8fa 100644
--- a/src/mca/rmaps/base/rmaps_base_map_job.c
+++ b/src/mca/rmaps/base/rmaps_base_map_job.c
@@ -312,6 +312,7 @@ void prte_rmaps_base_map_job(int fd, short args, void *cbdata)
/* set some convenience params */
prte_get_attribute(&jdata->attributes, PRTE_JOB_CPUSET, (void**)&options.cpuset, PMIX_STRING);
+ pmix_output(0, "JOBCPUSET ATTR: %s", (NULL == options.cpuset) ? "NULL" : options.cpuset);
if (prte_get_attribute(&jdata->attributes, PRTE_JOB_PES_PER_PROC, (void **) &u16ptr, PMIX_UINT16)) {
options.cpus_per_rank = u16;
} else {
diff --git a/src/mca/rmaps/round_robin/rmaps_rr_mappers.c b/src/mca/rmaps/round_robin/rmaps_rr_mappers.c
index 1ba0053f17..df791071b1 100644
--- a/src/mca/rmaps/round_robin/rmaps_rr_mappers.c
+++ b/src/mca/rmaps/round_robin/rmaps_rr_mappers.c
@@ -53,6 +53,7 @@ int prte_rmaps_rr_byslot(prte_job_t *jdata,
prte_proc_t *proc;
bool second_pass = false;
prte_binding_policy_t savebind = options->bind;
+ char *tmp;
pmix_output_verbose(2, prte_rmaps_base_framework.framework_output,
"mca:rmaps:rr: mapping by slot for job %s slots %d num_procs %lu",
@@ -84,6 +85,14 @@ pass:
"mca:rmaps:rr:slot working node %s", node->name);
prte_rmaps_base_get_cpuset(jdata, node, options);
+ pmix_output(0, "CPUSET MAPPER: %s", (NULL == options->cpuset) ? "NULL" : options->cpuset);
+ if (NULL == options->job_cpuset) {
+ pmix_output(0, "JOBCPUSET MAPPER: NULL");
+ } else {
+ tmp = NULL;
+ hwloc_bitmap_list_asprintf(&tmp, options->job_cpuset);
+ pmix_output(0, "JOBCPUSET MAPPER: %s", (NULL == tmp) ? "NULL" : tmp);
+ }
/* compute the number of procs to go on this node */
if (second_pass) {
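As a side note, a quick way to scan for a stray MCA param of that kind; the file locations are typical defaults and may differ on your installation (the <install-prefix> path is a placeholder):

env | grep -E 'PRTE_MCA_|PMIX_MCA_|OMPI_MCA_'
cat $HOME/.openmpi/mca-params.conf 2>/dev/null
cat $HOME/.prte/mca-params.conf 2>/dev/null
grep -rs hwloc_default_cpu_list <install-prefix>/etc/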
Hmm, my env seems clear; here's the new output:
Weird - okay, let's try the following diff. This includes all the prior ones as they are now going to interleave, so go into the prrte directory and do a

diff --git a/src/hwloc/hwloc_base_util.c b/src/hwloc/hwloc_base_util.c
index d1bdfb6940..31d53829f7 100644
--- a/src/hwloc/hwloc_base_util.c
+++ b/src/hwloc/hwloc_base_util.c
@@ -167,6 +167,7 @@ hwloc_cpuset_t prte_hwloc_base_generate_cpuset(hwloc_topology_t topo,
hwloc_cpuset_t prte_hwloc_base_setup_summary(hwloc_topology_t topo)
{
hwloc_cpuset_t avail = NULL;
+ char *tmp;
avail = hwloc_bitmap_alloc();
/* get the cpus we are bound to */
@@ -194,6 +195,9 @@ hwloc_cpuset_t prte_hwloc_base_setup_summary(hwloc_topology_t topo)
#else
hwloc_bitmap_copy(avail, hwloc_topology_get_allowed_cpuset(topo));
#endif
+ tmp = NULL;
+ hwloc_bitmap_list_asprintf(&tmp, avail);
+ pmix_output(0, "SETUPSUMMARY: %s", (NULL == tmp) ? "NULL" : tmp);
return avail;
}
@@ -209,9 +213,11 @@ hwloc_cpuset_t prte_hwloc_base_filter_cpus(hwloc_topology_t topo)
if (NULL == prte_hwloc_default_cpu_list) {
PMIX_OUTPUT_VERBOSE((5, prte_hwloc_base_output,
"hwloc:base: no cpus specified - using root available cpuset"));
+ pmix_output(0, "NO DEFAULT CPU LIST");
avail = prte_hwloc_base_setup_summary(topo);
} else {
PMIX_OUTPUT_VERBOSE((5, prte_hwloc_base_output, "hwloc:base: filtering cpuset"));
+ pmix_output(0, "FILTERING CPUSET: %s", prte_hwloc_default_cpu_list);
avail = prte_hwloc_base_generate_cpuset(topo, prte_hwloc_default_use_hwthread_cpus,
prte_hwloc_default_cpu_list);
}
diff --git a/src/mca/ess/hnp/ess_hnp_module.c b/src/mca/ess/hnp/ess_hnp_module.c
index a6e330342b..49931c0ce0 100644
--- a/src/mca/ess/hnp/ess_hnp_module.c
+++ b/src/mca/ess/hnp/ess_hnp_module.c
@@ -393,6 +393,10 @@ static int rte_init(int argc, char **argv)
t->index = pmix_pointer_array_add(prte_node_topologies, t);
node->topology = t;
node->available = prte_hwloc_base_filter_cpus(prte_hwloc_topology);
+ error = NULL;
+ hwloc_bitmap_list_asprintf(&error, node->available);
+ pmix_output(0, "ESS: %s", (NULL == error) ? "NULL" : error);
+
if (15 < pmix_output_get_verbosity(prte_ess_base_framework.framework_output)) {
char *output = NULL;
pmix_output(0, "%s Topology Info:", PRTE_NAME_PRINT(PRTE_PROC_MY_NAME));
diff --git a/src/mca/rmaps/base/rmaps_base_map_job.c b/src/mca/rmaps/base/rmaps_base_map_job.c
index 4ae7df6493..a81267e8fa 100644
--- a/src/mca/rmaps/base/rmaps_base_map_job.c
+++ b/src/mca/rmaps/base/rmaps_base_map_job.c
@@ -312,6 +312,7 @@ void prte_rmaps_base_map_job(int fd, short args, void *cbdata)
/* set some convenience params */
prte_get_attribute(&jdata->attributes, PRTE_JOB_CPUSET, (void**)&options.cpuset, PMIX_STRING);
+ pmix_output(0, "JOBCPUSET ATTR: %s", (NULL == options.cpuset) ? "NULL" : options.cpuset);
if (prte_get_attribute(&jdata->attributes, PRTE_JOB_PES_PER_PROC, (void **) &u16ptr, PMIX_UINT16)) {
options.cpus_per_rank = u16;
} else {
diff --git a/src/mca/rmaps/base/rmaps_base_support_fns.c b/src/mca/rmaps/base/rmaps_base_support_fns.c
index c7d04044ab..b78411b2bc 100644
--- a/src/mca/rmaps/base/rmaps_base_support_fns.c
+++ b/src/mca/rmaps/base/rmaps_base_support_fns.c
@@ -674,6 +674,7 @@ int prte_rmaps_base_get_ncpus(prte_node_t *node,
prte_rmaps_options_t *options)
{
int ncpus;
+ char *tmp;
#if HWLOC_API_VERSION < 0x20000
hwloc_obj_t root;
@@ -687,15 +688,30 @@ int prte_rmaps_base_get_ncpus(prte_node_t *node,
hwloc_bitmap_and(prte_rmaps_base.available, prte_rmaps_base.available, obj->allowed_cpuset);
}
#else
+ if (NULL != options->job_cpuset) {
+ tmp = NULL;
+ hwloc_bitmap_list_asprintf(&tmp, options->job_cpuset);
+ pmix_output(0, "JOBCPUSET: %s", (NULL == tmp) ? "NO-CPUS" : tmp);
+ } else {
+ pmix_output(0, "JOBCPUSET IS NULL");
+ }
+ tmp = NULL;
+ hwloc_bitmap_list_asprintf(&tmp, hwloc_topology_get_allowed_cpuset(node->topology->topo));
+ pmix_output(0, "ALLOWEDCPUSET: %s", (NULL == tmp) ? "NULL" : tmp);
if (NULL == options->job_cpuset) {
hwloc_bitmap_copy(prte_rmaps_base.available, hwloc_topology_get_allowed_cpuset(node->topology->topo));
} else {
hwloc_bitmap_and(prte_rmaps_base.available, hwloc_topology_get_allowed_cpuset(node->topology->topo), options->job_cpuset);
}
+ pmix_output(0, "OBJ: %s", (NULL == obj) ? "NULL" : "NON-NULL");
if (NULL != obj) {
hwloc_bitmap_and(prte_rmaps_base.available, prte_rmaps_base.available, obj->cpuset);
}
#endif
+ tmp = NULL;
+ hwloc_bitmap_list_asprintf(&tmp, prte_rmaps_base.available);
+ pmix_output(0, "GETNCPUS: %s", (NULL == tmp) ? "NULL" : tmp);
+
if (options->use_hwthreads) {
ncpus = hwloc_bitmap_weight(prte_rmaps_base.available);
} else {
@@ -788,6 +804,7 @@ void prte_rmaps_base_get_cpuset(prte_job_t *jdata,
prte_node_t *node,
prte_rmaps_options_t *options)
{
+ char *tmp;
PRTE_HIDE_UNUSED_PARAMS(jdata);
if (NULL != options->cpuset) {
@@ -795,7 +812,10 @@ void prte_rmaps_base_get_cpuset(prte_job_t *jdata,
options->use_hwthreads,
options->cpuset);
} else {
- options->job_cpuset = hwloc_bitmap_dup(node->available);
+ tmp = NULL;
+ hwloc_bitmap_list_asprintf(&tmp, node->available);
+ pmix_output(0, "AVAILABLECPUSET: %s", (NULL == tmp) ? "NULL" : tmp);
+ options->job_cpuset = hwloc_bitmap_dup(node->available);
}
}
diff --git a/src/mca/rmaps/round_robin/rmaps_rr_mappers.c b/src/mca/rmaps/round_robin/rmaps_rr_mappers.c
index 69080d3d06..25cfd419a6 100644
--- a/src/mca/rmaps/round_robin/rmaps_rr_mappers.c
+++ b/src/mca/rmaps/round_robin/rmaps_rr_mappers.c
@@ -53,6 +53,7 @@ int prte_rmaps_rr_byslot(prte_job_t *jdata,
prte_proc_t *proc;
bool second_pass = false;
prte_binding_policy_t savebind = options->bind;
+ char *tmp;
pmix_output_verbose(2, prte_rmaps_base_framework.framework_output,
"mca:rmaps:rr: mapping by slot for job %s slots %d num_procs %lu",
@@ -84,6 +85,14 @@ pass:
"mca:rmaps:rr:slot working node %s", node->name);
prte_rmaps_base_get_cpuset(jdata, node, options);
+ pmix_output(0, "CPUSET MAPPER: %s", (NULL == options->cpuset) ? "NULL" : options->cpuset);
+ if (NULL == options->job_cpuset) {
+ pmix_output(0, "JOBCPUSET MAPPER: NULL");
+ } else {
+ tmp = NULL;
+ hwloc_bitmap_list_asprintf(&tmp, options->job_cpuset);
+ pmix_output(0, "JOBCPUSET MAPPER: %s", (NULL == tmp) ? "NULL" : tmp);
+ }
/* compute the number of procs to go on this node */
if (second_pass) {
Well, here's the problem:
A little light is beginning to show. Try adding this:

diff --git a/src/hwloc/hwloc_base_util.c b/src/hwloc/hwloc_base_util.c
index d1bdfb6940..065eeb5f27 100644
--- a/src/hwloc/hwloc_base_util.c
+++ b/src/hwloc/hwloc_base_util.c
@@ -167,11 +167,15 @@ hwloc_cpuset_t prte_hwloc_base_generate_cpuset(hwloc_topology_t topo,
hwloc_cpuset_t prte_hwloc_base_setup_summary(hwloc_topology_t topo)
{
hwloc_cpuset_t avail = NULL;
+ char *tmp;
avail = hwloc_bitmap_alloc();
/* get the cpus we are bound to */
if (!prte_hwloc_synthetic_topo &&
0 <= hwloc_get_cpubind(topo, avail, HWLOC_CPUBIND_PROCESS)) {
+ tmp = NULL;
+ hwloc_bitmap_list_asprintf(&tmp, avail);
+ pmix_output(0, "WE ARE BOUND: %s", tmp);
return avail;
}
What appears to be happening (and this diff should confirm it) is that Slurm is binding
Interesting, and weird. Any idea how this constraint is happening, or where the source of the problem is in general? From what I understand, it's a result of a buggy/misconfigured slurm environment? (There's also perhaps a question of why it worked with rc7, but I suppose anything is possible; it could even be failing silently.) With the latest diff:
Yeah, you're hitting an external binding constraint. Could be your admins made a change to the Slurm config - have you retried that prior rc to see if it currently works? Also possible that we were ignoring something that we now pay attention to. What kind of machine is this? If you run
I see, thanks. I'll also contact the admins and see what insight they can contribute. The machine is a 2x Xeon 8260 (but it's also happening on another partition w/ different CPUs). All looks in order in lstopo, with all the expected cores per socket and 2 PUs per core. Yes, rc7 still works. Here are some logs for completeness' sake:
Oh my - that is a very old version of PRRTE in that rc! We haven't used that binding algo in quite some time. Can you generate an xml output from lstopo and post it for me? I'd like to see if I can locally reproduce the problem.
Yes, here it is: topo.xml.txt
Yeah, there is something funny going on with respect to Slurm - everything works fine for me with your topology. It's that externally applied binding that is causing the problem. Key is to figure out where that is coming from.
A thought hit me, and I checked your topology to confirm. The reason we cannot

This looks increasingly like a Slurm configuration issue.
Ah, that makes sense. I thought it might have been my
I realized where the 20 comes from... The login node, on which I run salloc, has 20 cores. I believe it's also a VM, in case that's relevant. Does this ring any new bells? Is the problem that all this mapping/binding code is running in the context of the login node, while it should be running in the context of the allocated node?
That is actually a rather typical scenario (minus the VM). We don't use the topology from the node where

That said, I'm happy to take another look and verify that we aren't making a mistake somewhere in sensing an external binding.
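One quick cross-check along those lines, run from inside the same allocation (a sketch using standard Slurm/Linux/hwloc tools, assuming they are available on the compute node):

srun -N 1 -n 1 hostname                                  # confirm where it runs
srun -N 1 -n 1 grep Cpus_allowed_list /proc/self/status  # affinity a task launched by srun gets there
srun -N 1 -n 1 hwloc-bind --get                          # same, as hwloc reports it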
Perfect, that's as expected. My thought is that the daemon is seeing the login node's topology instead of that of the allocated one -- for some reason. Or something like the launched daemon "retaining" the affinity of the login node? I added some debug prints to

diff --git a/src/hwloc/hwloc_base_util.c b/src/hwloc/hwloc_base_util.c
index d1bdfb6940..b8f9c130a3 100644
--- a/src/hwloc/hwloc_base_util.c
+++ b/src/hwloc/hwloc_base_util.c
@@ -164,14 +164,52 @@ hwloc_cpuset_t prte_hwloc_base_generate_cpuset(hwloc_topology_t topo,
return avail;
}
+static void affinity(void) {
+ cpu_set_t mask;
+ long nproc, i;
+
+ if(sched_getaffinity(0, sizeof(cpu_set_t), &mask) == -1) {
+ perror("sched_getaffinity");
+ assert(false);
+ }
+
+ nproc = sysconf(_SC_NPROCESSORS_ONLN);
+ printf("sched_getaffinity = ");
+
+ long n_aff = 0;
+
+ for(i = 0; i < nproc; i++) {
+ printf("%d ", CPU_ISSET(i, &mask));
+
+ if(CPU_ISSET(i, &mask) == 1)
+ n_aff++;
+ }
+
+ printf("(%d)\n", n_aff);
+}
+
+#include <unistd.h>
+#include <limits.h>
+
hwloc_cpuset_t prte_hwloc_base_setup_summary(hwloc_topology_t topo)
{
hwloc_cpuset_t avail = NULL;
+ char *tmp;
avail = hwloc_bitmap_alloc();
/* get the cpus we are bound to */
if (!prte_hwloc_synthetic_topo &&
0 <= hwloc_get_cpubind(topo, avail, HWLOC_CPUBIND_PROCESS)) {
+ tmp = NULL;
+ hwloc_bitmap_list_asprintf(&tmp, avail);
+ pmix_output(0, "WE ARE BOUND: %s", tmp);
+
+ char hostname[HOST_NAME_MAX + 1];
+ gethostname(hostname, HOST_NAME_MAX + 1);
+ printf("hostname: %s\n", hostname);
+
+ affinity();
+
return avail;
}

Produces:
It seems like this code is running on the login node (

Maybe you could point me to the code where the topology information is collected? Or to some part of the daemon's code, so I could check the affinity or other information there?
I'll take a look at it - seems like something may be off there. Out of curiosity, though - why have you bound
I suspect this diff will fix it, though I'm a tad concerned it will break others:

diff --git a/src/hwloc/hwloc_base_util.c b/src/hwloc/hwloc_base_util.c
index d1bdfb6940..b342651d23 100644
--- a/src/hwloc/hwloc_base_util.c
+++ b/src/hwloc/hwloc_base_util.c
@@ -169,11 +169,6 @@ hwloc_cpuset_t prte_hwloc_base_setup_summary(hwloc_topology_t topo)
hwloc_cpuset_t avail = NULL;
avail = hwloc_bitmap_alloc();
- /* get the cpus we are bound to */
- if (!prte_hwloc_synthetic_topo &&
- 0 <= hwloc_get_cpubind(topo, avail, HWLOC_CPUBIND_PROCESS)) {
- return avail;
- }
/* get the root available cpuset */
#if HWLOC_API_VERSION < 0x20000
Hmmm...when you say you are running

Perhaps one data point is: when you earlier provided me with your topology, what node did you use to get it? The login node? Or a compute node?
The login node itself apparently looks like a VM or something similar; I don't know what exactly. I noticed it in

The topology XML above was from the allocated compute node, and here is the one of the login node: login_topo.xml.txt. This node has 20 cores as seen in htop and lstopo, so this must be where the "20" we see is coming from. But we shouldn't be seeing any aspect of the login node's topology, as it's not part of the allocation (?). Based on this I thought that perhaps one of two effects was taking place:

1. The "number of available cores" was bleeding through from the login node to the allocated node -- if that makes sense -- though I'm not sure in what form that would happen; I imagine slurm could also play a role here.
2. Code that should be running on the allocated node instead runs on the login node and collects a topology that shows 20 cores available instead of 48.

Or something along these lines; this is how I perceive it with my uneducated eyes! I haven't done any kind of binding of mpirun -- what exactly did you mean?
I mean that
The way I understand it, it's not mpirun that's getting constrained; rather, the whole VM that makes up the login node is given just 20 cores from the physical host system it runs on. Or did you mean that it's possible for the VM's constraint on the host system to pass over to mpirun, which runs on the guest, and affect the job?

Indeed, with the above diff that removes the hwloc_get_cpubind call/check, the problems go away! (I don't think I would argue that this check should go away -- I'm still not sure where this bind comes from or why we are seeing it in the job.)
I think we are drifting down into terminology hell, so let's just drop it.
Yeah, I expected that would be the case. I need to do some checking to ensure we aren't breaking other people. This scenario is pretty unusual/odd, so I don't want to break the norm just to support it. I'll play around with it a bit.
@bgoglin Can you please clarify something for me? When you say "allowed" cpuset, are you talking about the cpus that this process is allowed to utilize? Effectively the online cpus that the process has been bound to? So if I've been externally bound and I discover my topology, calling
allowed = what's available in your cgroup on Linux, but not necessarily your current binding. You may be bound to fewer CPUs than the cgroup allows, and you'd be able to rebind to a larger set, but always inside the allowed list.
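To make that distinction concrete, here is a small standalone hwloc sketch (not PRRTE code; assumes hwloc >= 2.0 and linking with -lhwloc) that prints both the allowed set and the current binding:

#include <stdio.h>
#include <stdlib.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_bitmap_t bound;
    char *str = NULL;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* "allowed" cpuset: the cgroup/cpuset envelope this process may ever use */
    hwloc_bitmap_list_asprintf(&str, hwloc_topology_get_allowed_cpuset(topo));
    printf("allowed cpuset (cgroup envelope): %s\n", str);
    free(str);

    /* current binding: may be a strict subset of the allowed set */
    bound = hwloc_bitmap_alloc();
    if (0 <= hwloc_get_cpubind(topo, bound, HWLOC_CPUBIND_PROCESS)) {
        str = NULL;
        hwloc_bitmap_list_asprintf(&str, bound);
        printf("current process binding:          %s\n", str);
        free(str);
    }

    hwloc_bitmap_free(bound);
    hwloc_topology_destroy(topo);
    return 0;
}

Running it once on the login node and once via srun on a compute node illustrates why get_allowed_cpuset and get_cpubind can legitimately differ.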
Guess it may just be me, but I have always considered a cgroup to be an "externally applied binding". Sure, you can subdivide it if you want - but you can't get outside it (well, unless you have privilege - but that's another story). You've given me the answer I needed - the "get_allowed_cpuset" is returning the externally applied binding envelope, which means I can get rid of the outdated code that is causing the problem here. Thanks!
@gkatev Thanks for your patience and assistance in tracking this down - much appreciated!
Thanks @rhc54 for the support and for the fix! Confirming that the linked PR does fix the original issue. Feel free to close the issue whenever -- I imagine we might want it open until the commit in PRRTE makes its way into 5.0.x (for milestone tracking purposes).
This is in v5.0.x.
Hi, I've been unable to start mpi jobs under slurm reservations with the latest main.
I'm under `salloc -N 1 -n 48`, and the message is:

This happens with 1 as well as with 2 nodes in the reservation. It also doesn't work in 5.0.x, but in 5.0.0rc8 all is well. It doesn't happen when not under slurm.
I tried to chase it down a bit:
It looked to me like the failure starts happening because `prte_rmaps_base_get_ncpus()` returned 0. These debug prints:

Produce: