Investigate and document current behavior of "aggressive" mode #11735

Closed
Tracked by #10480
qkoziol opened this issue Jun 5, 2023 · 12 comments

qkoziol (Contributor) commented Jun 5, 2023

Is "aggressive" mode really determined by the slot count provided by PRRTE? Or is it determined by a query to hwloc with a reference to the number of processes per node. It just surprises me that this part of OMPI is controlled by PRRTE instead of something more generic that might work with, say, Slurm direct launch via srun. (from @jjhursey)

qkoziol (Contributor, Author) commented Jun 5, 2023

This is an old link to section 10.8.21 -- it may or may not be correct: https://ompi--8329.org.readthedocs.build/en/8329/faq/running-mpi-apps.html

qkoziol changed the title from "From Josh Hursey" to "Investigate and document current behavior of "aggressive" mode" on Jun 5, 2023
qkoziol (Contributor, Author) commented Jun 5, 2023

Split out from #10480

edgargabriel (Member) commented:

Here is what I found so far by looking into the ompi source code:

  1. The MCA parameter that controls the yield_when_idle behavior is mpi_yield_when_idle (registered in ompi/runtime/ompi_mpi_params.c)

  2. The only location that I could find that overrides the value of the MCA parameter is in ompi/instance/instance.c https://github.com/open-mpi/ompi/blob/main/ompi/instance/instance.c#L415

    /* if we are oversubscribed, then set yield_when_idle
     * accordingly */
    if (ompi_mpi_oversubscribed) {
        ompi_mpi_yield_when_idle = true;
    }

This value is then used to set the opal_progress_yield_when_idle value https://github.com/open-mpi/ompi/blob/main/ompi/instance/instance.c#L765

    /* see if yield_when_idle was specified - if so, use it */
    opal_progress_set_yield_when_idle (ompi_mpi_yield_when_idle);

  3. ompi_mpi_oversubscribed is set in ompi/runtime/ompi_rte.c https://github.com/open-mpi/ompi/blob/main/ompi/runtime/ompi_rte.c#L935

#ifdef PMIX_NODE_OVERSUBSCRIBED
    pname.jobid = opal_process_info.my_name.jobid;
    pname.vpid = OPAL_VPID_WILDCARD;
    OPAL_MODEX_RECV_VALUE_OPTIONAL(ret, PMIX_NODE_OVERSUBSCRIBED, &pname,
                                   NULL, PMIX_BOOL);
    if (PMIX_SUCCESS == ret) {
        ompi_mpi_oversubscribed = true;
    }
#endif

I could not find another place where either ompi_mpi_oversubscribed or ompi_mpi_yield_when_idle is changed (or opal_progress_yield_when_idle, outside of opal_progress_set_yield_when_idle()). I will try to understand the PRRTE part next, but I think it is fair to say that OMPI does whatever PRRTE returns here.
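
To make the effect of that flag concrete, here is a minimal conceptual sketch (not the actual OMPI progress engine; the names progress_iteration and made_progress are made up for illustration) of what yield_when_idle changes in a polling loop:

    #include <sched.h>
    #include <stdbool.h>

    /* Conceptual illustration only.  With yield_when_idle == false
     * ("aggressive" mode) the loop polls at full speed; with it set to
     * true ("degraded" mode) it calls sched_yield() on every pass that
     * made no progress, so co-located processes can get the CPU. */
    static bool yield_when_idle = false;   /* mirrors mpi_yield_when_idle */

    static void progress_iteration(bool made_progress)
    {
        if (!made_progress && yield_when_idle) {
            sched_yield();   /* give up the CPU instead of busy-waiting */
        }
    }

A user can also force this behavior independently of the oversubscription detection, e.g. with mpirun --mca mpi_yield_when_idle 1 ./app or by exporting OMPI_MCA_mpi_yield_when_idle=1.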

edgargabriel (Member) commented Jun 29, 2023

After looking into this a bit more, here is how I understand the logic:

  • the yield_when_idle behavior (which distinguishes "aggressive" from "degraded" mode) is controlled by the ompi_mpi_oversubscribed flag or the MCA parameter mpi_yield_when_idle

  • the flag is set as a result of a PMIx_Get operation on PMIX_NODE_OVERSUBSCRIBED. In the case of a direct launch, it is up to the PMIx implementation whether that flag is supported and provided. Hence, the flag could be provided e.g. by Slurm (see the sketch after this list).

  • In the case of PRRTE, the flag is set in src/mca/rmaps/base/rmaps_base_support_fns.c in prte_rmaps_base_check_oversubscribed(). More specifically, it compares the number of slots on a node to the number of processes assigned. The number of slots in turn might be retrieved from hwloc (prte_hwloc_base_get_nbobjs_by_type()) but can also come from a variety of RAS components or from a hostfile. I don't think it is necessary to detail this path to answer the question on this ticket and update the documentation, though.
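
For reference, here is a hedged sketch of what such a direct PMIx query could look like (this is not OMPI's code; it simply mirrors the OPAL_MODEX_RECV_VALUE_OPTIONAL call quoted above and assumes the launcher actually publishes the attribute; the helper name node_is_oversubscribed is made up):

    #include <pmix.h>
    #include <stdbool.h>

    /* Sketch: ask the PMIx server whether this node is oversubscribed.
     * "me" is the caller's own pmix_proc_t as returned by PMIx_Init().
     * If the host environment never registered PMIX_NODE_OVERSUBSCRIBED,
     * the Get simply fails and we report "not oversubscribed". */
    static bool node_is_oversubscribed(const pmix_proc_t *me)
    {
        pmix_proc_t wildcard;
        pmix_value_t *val = NULL;
        bool result = false;

        /* job-level query: our namespace, wildcard rank */
        PMIX_PROC_LOAD(&wildcard, me->nspace, PMIX_RANK_WILDCARD);

        if (PMIX_SUCCESS == PMIx_Get(&wildcard, PMIX_NODE_OVERSUBSCRIBED,
                                     NULL, 0, &val) && NULL != val) {
            if (PMIX_BOOL == val->type) {
                result = val->data.flag;
            }
            PMIX_VALUE_RELEASE(val);
        }
        return result;
    }

Note that the ompi_rte.c snippet quoted earlier sets ompi_mpi_oversubscribed as soon as the Get returns PMIX_SUCCESS, without inspecting the returned boolean.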

edgargabriel (Member) commented:

I read through the documentation of v5.0.x, and I don't see the documentation making wrong statements with respect to oversubscription. In addition, I am honestly not sure whether the 'explanation' of why something behaves a certain way (vs. how the user can influence the setting) warrants a 'blocking' label.

Unless there are objections, I would like to i) remove the blocking label and change it to a lower level, and ii) close this ticket sometime next week.

jsquyres (Member) commented Jul 7, 2023

@edgargabriel If the current statements in the docs are correct, cool -- I agree: remove the blocker label. The issue was that there were so many things that had changed about run-time behavior that it warranted a check to ensure that the docs were a) correct, and/or b) should be expanded/clarified/whatever.

If we want to additionally expand the current docs (with more explanations, examples, ...etc.) and those aren't critical / could be added at any time after v5.0.0, cool -- we can do that, too.

edgargabriel (Member) commented:

@jsquyres I agree. I am 99.9% sure that we don't make an outrageously wrong statement related to oversubscription and the 'aggressive' mode control. If nothing else, I would argue that this issue does not warrant holding up the 5.0 release. Improvements are always possible.

jsquyres (Member) commented Jul 7, 2023

Hahaha -- I just removed critical, but I see that you were the one who downgraded from blocker to critical.

When you read through the text, did you see anything that you would obviously improve? If so, could you just jot down a few bullet points for someone in the future to come through and make those changes?

Otherwise, if the text is currently ok and/or you don't see any obvious improvements, then I think we should close this issue as completed.

edgargabriel (Member) commented Jul 7, 2023

I think the only item that stood out to me was that the description focuses on mpirun/PRRTE-related behavior and makes only a few statements about direct launch. This was, however, also the reason I thought that just dropping a line about oversubscription + direct launch into one of the paragraphs would not help, and would just disturb the current flow.

rhc54 (Contributor) commented Jul 10, 2023

> The number of slots again might be retrieved from hwloc

This is a common misconception, so I'll just explain it a bit here (it is included in the PRRTE documentation as well). HWLOC cannot provide any info on the number of slots as "slot" is not a hardware-related concept. It is simply a number indicating the number of processes that are allowed to be executed within this allocation on the given node by the user. The number can be set to equal the number of CPUs (cores or hwts) on the node for dedicated allocations (i.e., where nodes are not shared) if the sys admin elects to do so. This is often the case, which is why people conflate the two concepts.

Oversubscribed therefore has nothing to do with the number of CPUs on the node. It simply indicates that you are running more processes on the node than the allocated limit. Unmanaged systems have no mechanism for detecting and/or controlling such behavior, but even managed systems can be oversubscribed when mpirun is used to start the job, as the mpirun daemons "shield" the actual number of application procs from the host environment.

This is why PRRTE has to provide the "oversubscribed" flag. We chose to have PMIx convey it so that there would be a "standardized" way of getting the info. However, note that it would be highly unlikely that you would be "oversubscribed" during a direct launch - Slurm would definitely refuse to start more procs than allocated slots, thereby preventing even the possibility for operating oversubscribed. I don't know about other environments, but I very much doubt that any of them would allow you to direct launch an oversubscribed job.
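
In other words, the check boils down to comparing the mapped process count against the allocation limit, not against the hardware. A hedged sketch (the struct and function names here are invented; this is not the actual prte_rmaps_base_check_oversubscribed() code):

    #include <stdbool.h>

    /* "slots" is the per-node process limit granted by the allocation
     * (resource manager, hostfile, or admin policy); it need not equal
     * the number of CPUs on the node. */
    struct node_alloc {
        int slots;       /* processes allowed on this node */
        int num_procs;   /* processes actually mapped onto it */
    };

    static bool is_oversubscribed(const struct node_alloc *n)
    {
        return n->num_procs > n->slots;
    }

So a node listed in a hostfile as node01 slots=4 counts as oversubscribed as soon as a fifth process is mapped onto it, even if it has 64 idle cores.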

edgargabriel (Member) commented:

As far as I can see, the documentation is correct. I am therefore closing this ticket as complete.

edgargabriel (Member) commented:

> This is a common misconception, so I'll just explain it a bit here (it is included in the PRRTE documentation as well). […]

Thank you Ralph for the explanation!
