mpi_yield_when_idle setting from etc/openmpi-mca-params.conf is ignored in 4.0.0 #6433
only sets the OMPI_MCA_mpi_yield_when_idle environment variable if this directive was not given (for example via a config file). Refs. open-mpi#6433 Signed-off-by: Gilles Gouaillardet <[email protected]>
You may not have gotten an email about it, but @ggouaillardet proposed a PR about this -- see #6440.
in schizo/ompi, sets the new OMPI_MCA_mpi_oversubscribe environment variable according to the node oversubscription state. This MCA parameter is used to set the default value of the mpi_yield_when_idle parameter. This two-step tango is needed so the mpi_yield_when_idle setting is always honored when set in a config file. Refs. open-mpi#6433 Signed-off-by: Gilles Gouaillardet <[email protected]>
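To illustrate the precedence logic these commits describe, here is a minimal C sketch of my own (not Open MPI's actual internals; the helper names are made up): the launcher exports only an oversubscription hint, and that hint is consulted for the yield default only when the user has not set mpi_yield_when_idle explicitly, so a config-file or environment setting always wins.

```c
/* Illustrative sketch only -- not Open MPI's real internals.
 * Step 1 (launcher): export an oversubscription *hint*, never the
 *                    yield_when_idle directive itself.
 * Step 2 (MPI proc): use the hint only as the *default*, so an explicit
 *                    mpi_yield_when_idle setting (config file, environment,
 *                    command line) always wins.                            */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Launcher side: publish the node's oversubscription state. */
static void launcher_export_oversubscribe(bool node_oversubscribed)
{
    setenv("OMPI_MCA_mpi_oversubscribe", node_oversubscribed ? "1" : "0", 1);
}

/* MPI-process side: explicit_setting is whatever the MCA system collected
 * from config files / env / mpirun, or NULL if the user never set it.    */
static bool effective_yield_when_idle(const char *explicit_setting)
{
    if (explicit_setting != NULL) {
        return strcmp(explicit_setting, "0") != 0;   /* user's choice wins */
    }
    const char *hint = getenv("OMPI_MCA_mpi_oversubscribe");
    return hint != NULL && strcmp(hint, "1") == 0;   /* fall back to hint */
}

int main(void)
{
    launcher_export_oversubscribe(true);

    /* User set "mpi_yield_when_idle = 0" in a params file: honored. */
    printf("explicit 0 -> %d\n", effective_yield_when_idle("0"));
    /* No explicit setting: default follows the oversubscription hint. */
    printf("unset      -> %d\n", effective_yield_when_idle(NULL));
    return 0;
}
```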
Greetings jsquyres, thanks! Could you please let me know which future minor version of OMPI will have this fix?
Issue open-mpi#6433 notes that although yield_when_idle was fixed, the fix had not actually been pushed to v4.0.x. To remedy this I cherry-picked the fix from user ggouaillardet and pushed it to v4.0.x. Signed-off-by: William Bailey <[email protected]>
@hppritcha should just cherry-pick the original #6440 fix to v4.0.x for inclusion into v4.0.3?
in schizo/ompi, sets the new OMPI_MCA_mpi_oversubscribe environment variable according to the node oversubscription state. This MCA parameter is used to set the default value of the mpi_yield_when_idle parameter. This two-step tango is needed so the mpi_yield_when_idle setting is always honored when set in a config file. Refs. open-mpi#6433 Signed-off-by: Gilles Gouaillardet <[email protected]> (cherry-picked from cc97c0f)
Sorry for reviving an old issue, but as far as I know yield does nothing on a modern Linux kernel; or rather, it immediately returns to the calling process, at least with the modern CFS. Considering that, why would Open MPI need to use yield? The man page for sched_yield says:
Can you point out documentation for this behavior? There is plenty of noise out there; the only useful thing I found was https://www.kernel.org/doc/html/latest/scheduler/sched-design-CFS.html, which states:
I would expect that any runnable task with a similar priority gets scheduled for execution before the task calling sched_yield().
Consider two threads/processes sending each other messages on an oversubscribed node. The receiver is scheduled first; the sender is waiting for its timeslice to become available. The alternative to calling sched_yield() is for the receiver to keep spinning until it is preempted, burning time the sender could have used to actually send the message.
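To make that trade-off concrete, here is a small self-contained sketch (my own illustration, not Open MPI code): a main thread polls a flag that a sender thread will eventually set, either busy-polling or calling sched_yield() between polls. Run it pinned to a single core (e.g. with taskset -c 3) to emulate oversubscription.

```c
/* Sketch only: a polling "receiver" that optionally yields while waiting
 * for a flag set by a "sender" thread sharing the same core.
 * Build: cc -O2 -pthread poll_yield.c -o poll_yield
 * Run pinned to one core: taskset -c 3 ./poll_yield                      */
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_bool msg_ready = false;

static void *sender(void *arg)
{
    (void)arg;
    /* pretend to do some work, then deliver the "message" */
    for (volatile long i = 0; i < 50 * 1000 * 1000; i++) { }
    atomic_store(&msg_ready, true);
    return NULL;
}

int main(void)
{
    pthread_t t;
    int yield_when_idle = 1;          /* flip to 0 to busy-poll instead */

    pthread_create(&t, NULL, sender, NULL);
    while (!atomic_load(&msg_ready)) {
        if (yield_when_idle) {
            sched_yield();            /* give the sender a chance to run */
        }
        /* else: burn the rest of our timeslice polling */
    }
    pthread_join(t, NULL);
    puts("message received");
    return 0;
}
```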
@devreal I contacted a colleague who is more knowledgeable in CFS than I am. Here is a quote:
Basically it means that if a thread yields but has not yet used up its allocated time, the kernel will immediately return to the same thread. I decided to make a crude experiment by comparing the number of context switches when running a small test program. For me, $ taskset -c 3 ./test produces output showing stats from getrusage: USER is user CPU time, SYS is system CPU time, CTXSW is the number of voluntary context switches, and INVCTXSW is the number of involuntary context switches. I checked with strace that both threads actually make the corresponding system calls.
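For reference, here is a rough sketch of that kind of experiment (my reconstruction, not the original ./test program): spin for a fixed wall-clock time, with or without sched_yield(), and report CPU time and context-switch counts from getrusage().

```c
/* Crude sketch of the experiment: run for a fixed wall-clock time,
 * optionally calling sched_yield() in the loop, then report CPU time and
 * context-switch counts from getrusage().
 * Build: cc -O2 yield_test.c -o yield_test
 * Run pinned to one core: taskset -c 3 ./yield_test 1   # with yield
 *                         taskset -c 3 ./yield_test 0   # busy spin     */
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <time.h>

int main(int argc, char **argv)
{
    int use_yield = (argc > 1) ? atoi(argv[1]) : 1;
    time_t start = time(NULL);

    while (time(NULL) - start < 5) {      /* fixed wall-clock duration */
        if (use_yield) {
            sched_yield();
        }
    }

    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    printf("USER %ld.%06lds  SYS %ld.%06lds  CTXSW %ld  INVCTXSW %ld\n",
           (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec,
           (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec,
           ru.ru_nvcsw, ru.ru_nivcsw);
    return 0;
}
```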
I think this interpretation is not quite correct. AFAIU, there is no fixed time quantum that is allocated. Instead, CFS tracks each task's virtual runtime and always picks the runnable task with the smallest virtual runtime to run next.
Two points:
1. It appears that the calls to sched_yield()
2. You are right that
[1] https://www.kernel.org/doc/html/latest/scheduler/sched-design-CFS.html
Last I checked, OMPI stopped using sched_yield().
@rhc54 It's still alive and kicking, if OMPI detects oversubscription or the user enables it through an MCA parameter.
Hmmm... I do see it got put back into
From what I see,
It doesn't, right now - that logic is missing from PRRTE. Probably should be added back to the
I remember what this was all about now. We had received some complaints about CPU usage at 100% while the process was in finalize or some other quasi-idle state. Investigating, we found that we were indeed calling sched_yield(). The only solution was to use a
I thought we decided a while ago (years at this point) that
I believe MPICH lets the user choose which strategy to use. I don't think this has been implemented in OMPI yet. Could be a nifty feature (and low-hanging fruit) for 5.0...
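For illustration, here is a hypothetical wait-policy switch of the kind described; the names and structure are mine, not an existing Open MPI or MPICH interface.

```c
/* Hypothetical user-selectable wait policy -- illustrative only. */
#include <sched.h>
#include <time.h>

typedef enum { WAIT_SPIN, WAIT_YIELD, WAIT_SLEEP } wait_policy_t;

/* Called once per pass through an idle progress loop. */
static void idle_backoff(wait_policy_t policy)
{
    switch (policy) {
    case WAIT_SPIN:
        /* burn cycles: lowest latency, 100% CPU while idle */
        break;
    case WAIT_YIELD:
        /* step aside if another runnable task shares the core */
        sched_yield();
        break;
    case WAIT_SLEEP: {
        /* short sleep: actually releases the CPU, at a latency cost */
        struct timespec ts = { 0, 100 * 1000 };   /* 100 microseconds */
        nanosleep(&ts, NULL);
        break;
    }
    }
}

int main(void)
{
    for (int i = 0; i < 1000; i++) {
        idle_backoff(WAIT_YIELD);   /* e.g. what a yield-when-idle loop does */
    }
    return 0;
}
```

The design point is simply that spin, yield, and sleep trade idle CPU usage against wake-up latency, which is presumably why MPICH leaves the choice to the user.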
@devreal Thanks for catching a bug. But if you look at the logs, you will notice that user+sys time is equal for both the yielding and the hostname tasks.
@planetA That is not surprising given that each thread runs for a fixed amount of time. So yes, the yielding threads will call into the kernel more often and thus have a higher share of sys time.
@devreal I fixed the wall-clock time, not the CPU time. It rather indicates that CFS equalises user+system time, rather than user time. And considering that, sched_yield gave nothing to the other threads.
Ahh yes, my mistake. That is surprising indeed (well, to me at least...). Thanks for pointing that out! If I replace
Some more thoughts: it's not surprising that the yielding tasks get as much runtime as the hostname tasks; that's the nature of the completely fair scheduler ^^ In a more realistic scenario, however, Some HPC systems may not use the CFS, in which case
Hi,
I have set "yield when idle" in etc/openmpi-mca-params.conf
tail openmpi-mca-params.conf
plm_rsh_agent = rsh
rmaps_base_oversubscribe = 1
hwloc_base_binding_policy = none
mpi_yield_when_idle = 1
However, the setting seems to be ignored.
I get 100% us, 0% sys for processes waiting on MPI calls.
But if I set the environment variable:
setenv OMPI_MCA_mpi_yield_when_idle 1
or
mpirun --mca mpi_yield_when_idle true
I get the more expected 15% us, 85% sys.
Could someone please check that the setting is read from the configuration file as expected.
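One way to check which value actually took effect at runtime is the MPI tool information interface (MPI_T), which exposes Open MPI's MCA parameters as control variables. Below is a hedged sketch; the assumptions are that the parameter shows up as a control variable named "mpi_yield_when_idle" and that its value is readable as an int.

```c
/* Sketch: read the effective value of an Open MPI MCA parameter through the
 * MPI_T control-variable interface.  Assumptions: the parameter is exposed
 * as a cvar named "mpi_yield_when_idle" and its value fits in an int.
 * Build/run: mpicc check_yield.c -o check_yield && mpirun -n 1 ./check_yield */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    int provided, num_cvars;

    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_Init(&argc, &argv);
    MPI_T_cvar_get_num(&num_cvars);

    for (int i = 0; i < num_cvars; i++) {
        char name[256], desc[1024];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, binding, scope;
        MPI_Datatype dtype;
        MPI_T_enum enumtype;

        MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dtype,
                            &enumtype, desc, &desc_len, &binding, &scope);
        if (strcmp(name, "mpi_yield_when_idle") == 0) {
            MPI_T_cvar_handle handle;
            int count, value = 0;     /* assumes an int-compatible flag */
            MPI_T_cvar_handle_alloc(i, NULL, &handle, &count);
            MPI_T_cvar_read(handle, &value);
            printf("mpi_yield_when_idle = %d\n", value);
            MPI_T_cvar_handle_free(&handle);
        }
    }

    MPI_Finalize();
    MPI_T_finalize();
    return 0;
}
```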