mpi_yield_when_idle setting from etc/openmpi-mca-params.conf is ignored in 4.0.0 #6433

Closed
iassiour opened this issue Feb 25, 2019 · 21 comments

@iassiour

Hi,

I have set "yield when idle" in etc/openmpi-mca-params.conf

tail openmpi-mca-params.conf

plm_rsh_agent = rsh
rmaps_base_oversubscribe = 1
hwloc_base_binding_policy = none
mpi_yield_when_idle = 1

However, the setting seems to be ignored.

I get 100% us, 0% sys for processes waiting in MPI calls.

But if I set the environment variable:

setenv OMPI_MCA_mpi_yield_when_idle 1

or
mpirun --mca mpi_yield_when_idle true

I get the more expected 15% us, 85% sys.

Could someone please check whether the setting is read from the configuration file as expected?
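
For illustration, a minimal hypothetical reproducer of this kind of busy waiting (not code from our application): rank 1 blocks in MPI_Recv while rank 0 sleeps, so the waiting rank's CPU usage can be watched in top.

/* spin_recv.c - hypothetical reproducer: rank 1 waits in MPI_Recv while
 * rank 0 sleeps, so the waiting rank's CPU usage can be observed in top. */
#include <mpi.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, msg = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        sleep(60);                                   /* keep rank 1 waiting */
        MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}

Compiled with mpicc and launched with mpirun -np 2, the rank-1 process can then be watched in top with and without mpi_yield_when_idle = 1 in the config file.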

ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Feb 27, 2019
only sets the OMPI_MCA_mpi_yield_when_idle environment variable
if this directive was not given (for example via a config file).

Refs. open-mpi#6433

Signed-off-by: Gilles Gouaillardet <[email protected]>
@jsquyres
Member

You may not have gotten an email about it, but @ggouaillardet proposed a PR about this -- see #6440.

ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Feb 28, 2019
in schizo/ompi, sets the new OMPI_MCA_mpi_oversubscribe environment
variable according to the node oversubscription state.

This MCA parameter is used to set the default value of the
mpi_yield_when_idle parameter.

This two steps tango is needed so the mpi_yield_when_idle setting
is always honored when set in a config file.

Refs. open-mpi#6433

Signed-off-by: Gilles Gouaillardet <[email protected]>
@iassiour
Author

Greetings @jsquyres,

Thanks. Could you please let me know which future minor version of Open MPI will include this fix?

wbailey2 pushed a commit to wbailey2/ompi that referenced this issue Nov 10, 2019
wbailey2 added a commit to wbailey2/ompi that referenced this issue Nov 10, 2019
Issue open-mpi#6433 notes that although yield_when_idle was fixed, it had
not actually been pushed to v4.0.x. To remedy this, I cherry-picked the
fix from user ggouaillardet and pushed it to v4.0.x.

Signed-off-by: William Bailey <[email protected]>
@gpaulsen
Member

@hppritcha Should we just cherry-pick the original #6440 fix to v4.0.x for inclusion in v4.0.3?

@gpaulsen gpaulsen self-assigned this Nov 11, 2019
wbailey2 pushed a commit to wbailey2/ompi that referenced this issue Nov 14, 2019
wbailey2 pushed a commit to wbailey2/ompi that referenced this issue Dec 2, 2019 (cherry-picked from cc97c0f)
@hppritcha
Member

Closed via #7168 and #6440.

cniethammer pushed a commit to cniethammer/ompi that referenced this issue May 10, 2020 (cherry-picked from cc97c0f)
@planetA

planetA commented Sep 4, 2020

@iassiour @ggouaillardet

Sorry for reviving an old issue, but as far as I know, yield does nothing on modern Linux kernels. Or rather, it immediately returns to the process. At least with the modern CFS.

Considering that, why would Open MPI need to use yield?

The man page for sched_yield says:

Use of sched_yield() with nondeterministic scheduling policies such as SCHED_OTHER is unspecified and very likely means your application design is broken.

@devreal
Contributor

devreal commented Sep 4, 2020

@planetA

Sorry for reviving an old issue, but as far as I know, yield does nothing on modern Linux kernels. Or rather, it immediately returns to the process. At least with the modern CFS.

Can you point out documentation for this behavior? There is plenty of noise out there; the only useful thing I found was https://www.kernel.org/doc/html/latest/scheduler/sched-design-CFS.html, which states:

yield_task(…)

This function is basically just a dequeue followed by an enqueue [...]

I would expect that any runnable task with a similar priority gets scheduled for execution before the task calling sched_yield gets a chance to run again. I'm not deep into CFS specifics though...

Considering that, why would Open MPI need to use yield?

Consider two threads/processes sending each other messages on an oversubscribed node. The receiver is scheduled first, while the sender is waiting for its timeslice to become available. The alternative to calling sched_yield (or I guess usleep would do too) is to burn through all the cycles in the receiver's timeslice before the sender can run. It's simply meant to reduce the wasted time when a peer is eligible to run but not scheduled, which can happen on oversubscribed nodes.
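
To illustrate the idea (a sketch only, not Open MPI's actual progress loop):

#include <sched.h>
#include <stdbool.h>

/* Hypothetical spin-wait: poll for a completion flag and, if configured,
 * give the CPU back so an oversubscribed peer can run and produce the
 * event we are waiting for, instead of us burning the whole timeslice. */
static void wait_for_completion(volatile int *flag, bool yield_when_idle)
{
    while (!*flag) {
        /* ... poll the network / drive progress here ... */
        if (yield_when_idle) {
            sched_yield();   /* hint: let another runnable task in */
        }
    }
}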

@planetA

planetA commented Sep 4, 2020

@devreal I contacted a colleague who is more knowledgeable in CFS than I am. Here is a quote:

And [sched_yield] can actually do something, even in CFS. But it does not necessarily yield the CPU. What it does is check whether yielding the CPU at this point would be fair, i.e. yield only when there is another thread that had less CPU time (considering the weights from compute intensiveness and nice levels) than the current one.

Basically it means that if a thread yields but has not yet used up its allotted time, the kernel will immediately return to the same thread.

I decided to do a crude experiment comparing the number of context switches when running gethostname and sched_yield in a busy loop. See the code here: https://gist.github.com/planetA/10738756412cc411a7f9002fcb2639f4

For me, sched_yield changes neither the user time nor the number of context switches:

$ taskset -c 3 ./test
YIELD: USER: 3815710 SYS: 6179530 CTXSW: 1 INVCTXSW: 1301
HOSTN: USER: 3815753 SYS: 6179601 CTXSW: 1 INVCTXSW: 1302

The output shows statistics from getrusage: USER is user CPU time, SYS is system CPU time, CTXSW is the number of voluntary context switches, and INVCTXSW is the number of involuntary context switches.

I checked with strace that both threads actually make the corresponding system calls.
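
For reference, a minimal single-loop sketch of that kind of measurement (an approximation written from the description above, not the exact gist code):

#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>
#include <time.h>
#include <unistd.h>

/* Busy-loop for a fixed wall-clock time calling either sched_yield() or
 * gethostname(), then report CPU time (in microseconds here) and context
 * switch counts from getrusage(). */
static void spin(int use_yield, double seconds)
{
    char buf[256];
    time_t start = time(NULL);
    while (difftime(time(NULL), start) < seconds) {
        if (use_yield)
            sched_yield();
        else
            gethostname(buf, sizeof(buf));
    }
}

int main(int argc, char **argv)
{
    int use_yield = (argc > 1 && strcmp(argv[1], "yield") == 0);
    struct rusage ru;

    spin(use_yield, 10.0);
    getrusage(RUSAGE_SELF, &ru);
    printf("%s: USER: %ld SYS: %ld CTXSW: %ld INVCTXSW: %ld\n",
           use_yield ? "YIELD" : "HOSTN",
           (long)ru.ru_utime.tv_sec * 1000000L + (long)ru.ru_utime.tv_usec,
           (long)ru.ru_stime.tv_sec * 1000000L + (long)ru.ru_stime.tv_usec,
           ru.ru_nvcsw, ru.ru_nivcsw);
    return 0;
}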

@devreal
Contributor

devreal commented Sep 4, 2020

And [sched_yield] can actually do something, even in CFS. But it does not necessarily yield the CPU. What it does is check whether yielding the CPU at this point would be fair, i.e. yield only when there is another thread that had less CPU time (considering the weights from compute intensiveness and nice levels) than the current one.

Basically it means that if a thread yields but has not yet used up its allotted time, the kernel will immediately return to the same thread.

I think this interpretation is not quite correct. AFAIU, there is no fixed time quantum that is allocated. Instead, sched_yield gives the scheduler the chance to check whether there is a task that has used less CPU time than the calling task and make a scheduling decision based on the p->se.vruntime value [§3, 1]. Scheduling is done relative to the time used by other tasks, not in terms of absolute timeslices (something I learned after I made my earlier comment ^^)

For me, sched_yield changes neither the user time nor the number of context switches:

$ taskset -c 3 ./test
YIELD: USER: 3815710 SYS: 6179530 CTXSW: 1 INVCTXSW: 1301
HOSTN: USER: 3815753 SYS: 6179601 CTXSW: 1 INVCTXSW: 1302

Two points:

  1. You should use RUSAGE_THREAD instead of RUSAGE_SELF (which, according to the man page, includes all threads); see the snippet at the end of this comment.
  2. Increasing the number of threads (I tried twenty instead of two) yields the desired effect:
$ taskset -c 0 ./test_sched_yield
YIELD: USER: 166768 SYS: 333536 CTXSW: 0 INVCTXSW: 474200
YIELD: USER: 150754 SYS: 349475 CTXSW: 0 INVCTXSW: 477002
YIELD: USER: 177729 SYS: 322545 CTXSW: 0 INVCTXSW: 474802
YIELD: USER: 202645 SYS: 298008 CTXSW: 0 INVCTXSW: 477600
HOSTN: USER: 348952 SYS: 151594 CTXSW: 0 INVCTXSW: 121
YIELD: USER: 207146 SYS: 293457 CTXSW: 0 INVCTXSW: 476302
HOSTN: USER: 295636 SYS: 205537 CTXSW: 0 INVCTXSW: 130
YIELD: USER: 256279 SYS: 244630 CTXSW: 0 INVCTXSW: 477404
HOSTN: USER: 360715 SYS: 140904 CTXSW: 0 INVCTXSW: 121
HOSTN: USER: 326283 SYS: 175909 CTXSW: 0 INVCTXSW: 124
HOSTN: USER: 288554 SYS: 213530 CTXSW: 0 INVCTXSW: 125
YIELD: USER: 197841 SYS: 303828 CTXSW: 0 INVCTXSW: 478504
YIELD: USER: 261765 SYS: 241099 CTXSW: 0 INVCTXSW: 478304
HOSTN: USER: 328843 SYS: 174258 CTXSW: 0 INVCTXSW: 128

It appears that the calls to sched_yield are treated as involuntary context switches, but the difference between hostname and sched_yield is clearly visible.

You are right that sched_yield does not force a task switch, but it appears to lead to more task switches happening, which is what you want in a scenario where tasks wait for each other on an oversubscribed system.

[1] https://www.kernel.org/doc/html/latest/scheduler/sched-design-CFS.html
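
For point 1 above, the per-thread numbers come from getrusage(RUSAGE_THREAD, ...); a minimal sketch (Linux-specific, requires _GNU_SOURCE):

#define _GNU_SOURCE            /* RUSAGE_THREAD is a Linux extension */
#include <stdio.h>
#include <sys/resource.h>

/* Report CPU time and context switches for the calling thread only,
 * instead of the sum over all threads that RUSAGE_SELF returns. */
static void report_thread_usage(const char *tag)
{
    struct rusage ru;
    getrusage(RUSAGE_THREAD, &ru);
    printf("%s: USER: %ld SYS: %ld CTXSW: %ld INVCTXSW: %ld\n", tag,
           (long)ru.ru_utime.tv_sec * 1000000L + (long)ru.ru_utime.tv_usec,
           (long)ru.ru_stime.tv_sec * 1000000L + (long)ru.ru_stime.tv_usec,
           ru.ru_nvcsw, ru.ru_nivcsw);
}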

@rhc54
Contributor

rhc54 commented Sep 4, 2020

Last I checked, OMPI stopped using sched_yield in the code base several generations ago because the Linux folks reported that it didn't accomplish what we wanted. I don't believe we call it any more, unless someone added it back in again.

@devreal
Contributor

devreal commented Sep 4, 2020

@rhc54 It's still alive and kicking, if OMPI detects oversubscription or the user enables it through an MCA parameter.

@rhc54
Contributor

rhc54 commented Sep 4, 2020

Hmmm...I do see it got put back into opal_progress again, but the runtime isn't setting anything for oversubscription, at least not in the master branch

@devreal
Contributor

devreal commented Sep 4, 2020

From what I see, ompi_mpi_yield_when_idle is passed to opal_progress_set_yield_when_idle; its value defaults to the ompi_mpi_oversubscribe value and is overridden by the mpi_yield_when_idle MCA parameter if that is set. The MCA parameter ompi_mpi_oversubscribe is described as "Internal MCA parameter set by the runtime environment when oversubscribing nodes". Not sure how exactly that works, though...
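
Schematically, my reading of that precedence looks like this (simplified to the environment-variable path only; mca_bool below is a made-up helper, not an actual OMPI function):

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Made-up helper: read a boolean MCA-style setting from the environment
 * (OMPI_MCA_<name>), falling back to a default when it is not set. */
static bool mca_bool(const char *name, bool dflt)
{
    char var[128];
    snprintf(var, sizeof(var), "OMPI_MCA_%s", name);
    const char *v = getenv(var);
    return v ? (atoi(v) != 0) : dflt;
}

int main(void)
{
    /* The runtime exports mpi_oversubscribe when it detects oversubscription;
     * that becomes the default for mpi_yield_when_idle, which an explicit
     * user setting still overrides. */
    bool oversubscribed  = mca_bool("mpi_oversubscribe", false);
    bool yield_when_idle = mca_bool("mpi_yield_when_idle", oversubscribed);
    printf("yield_when_idle = %d\n", (int)yield_when_idle);
    return 0;
}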

@rhc54
Contributor

rhc54 commented Sep 4, 2020

It doesn't, right now - that logic is missing from PRRTE. It probably should be added back to the schizo/ompi component, based on the older ORTE code.

@rhc54
Contributor

rhc54 commented Sep 4, 2020

I remember what this was all about now. We had received some complaints about CPU usage sitting at 100% while the process was in finalize or some other quasi-idle state. Investigating, we found that we were indeed calling sched_yield, but since no other process was ready/anxious to take the CPU, the kernel scheduler just put us right back into play...and so we cycled at 100% CPU.

The only solution was to use nanosleep (or something equivalent) to force a context switch, so we do that in a few places. This lets us back off the CPU when we know we are just idling.
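
A minimal sketch of that kind of backoff (illustrative only, not the exact code we use; the sleep length here is arbitrary):

#include <time.h>

/* Back off the CPU while idling: unlike sched_yield(), a short nanosleep
 * deschedules us even when no other task is waiting for the core. */
static void idle_backoff(void)
{
    struct timespec ts = { .tv_sec = 0, .tv_nsec = 1000 };   /* ~1 microsecond */
    nanosleep(&ts, NULL);
}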

@hjelmn
Member

hjelmn commented Sep 4, 2020

I thought we decided a while ago (years at this point) that sched_yield was a bad idea on Linux. It is discouraged at the very least. nanosleep is probably the best replacement.

@devreal
Contributor

devreal commented Sep 4, 2020

I believe MPICH lets the user choose which strategy to use. I don't think this has been implemented in OMPI yet. It could be a nifty feature (and low-hanging fruit) for 5.0...
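
Something like this could expose the choice (a sketch; the names below are made up and not an existing OMPI or MPICH interface):

#include <sched.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical runtime-selectable idle policy: how a waiting process
 * gives up (or keeps) the CPU between progress polls. */
typedef enum { IDLE_SPIN, IDLE_YIELD, IDLE_NANOSLEEP, IDLE_USLEEP } idle_policy_t;

static void idle_once(idle_policy_t policy)
{
    switch (policy) {
    case IDLE_YIELD:
        sched_yield();                           /* scheduler hint only        */
        break;
    case IDLE_NANOSLEEP: {
        struct timespec ts = { 0, 1 };           /* 1 ns request               */
        nanosleep(&ts, NULL);
        break;
    }
    case IDLE_USLEEP:
        usleep(1);                               /* coarser backoff            */
        break;
    case IDLE_SPIN:
    default:
        break;                                   /* keep polling at full speed */
    }
}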

@planetA

planetA commented Sep 4, 2020

@devreal Thanks for catching that bug. But if you look at the logs, you will notice that the user+sys time is equal for sched_yield and gethostname. This rather indicates that sched_yield just spends more time in the kernel, although both get an equal share of CPU resources.

@devreal
Contributor

devreal commented Sep 4, 2020

@planetA That is not surprising given that each thread runs for a fixed amount of time. So yes, the yielding threads will call into the kernel more often and thus have a higher share of sys time.

@planetA

planetA commented Sep 4, 2020

@devreal I fix the wall-clock time, not the CPU time. It rather indicates that CFS equalises user+system time rather than user time. And considering that, sched_yield gave nothing to the other threads.

Here are the results of a run with 40 threads (plots of user time and of user+system time).

@devreal
Contributor

devreal commented Sep 4, 2020

Ahh yes, my mistake. That is surprising, indeed (well, to me at least...) Thanks for pointing that out!

If I replace sched_yield with a nanosleep() of 1 ns, things look much better:

$ taskset -c 1 ./test_sched_yield
HOSTN: USER: 460763 SYS: 237935 CTXSW: 0 INVCTXSW: 17141
SLEEP: USER: 133467 SYS: 165823 CTXSW: 171590 INVCTXSW: 7
HOSTN: USER: 494181 SYS: 203252 CTXSW: 0 INVCTXSW: 17080
SLEEP: USER: 98858 SYS: 208700 CTXSW: 171489 INVCTXSW: 8
SLEEP: USER: 89434 SYS: 217754 CTXSW: 171586 INVCTXSW: 8
SLEEP: USER: 102674 SYS: 200682 CTXSW: 171593 INVCTXSW: 8
HOSTN: USER: 432745 SYS: 264688 CTXSW: 0 INVCTXSW: 17128
SLEEP: USER: 92480 SYS: 202576 CTXSW: 171791 INVCTXSW: 7
SLEEP: USER: 127337 SYS: 172280 CTXSW: 171791 INVCTXSW: 16
HOSTN: USER: 462106 SYS: 237187 CTXSW: 0 INVCTXSW: 17160
HOSTN: USER: 452250 SYS: 246682 CTXSW: 0 INVCTXSW: 17173
HOSTN: USER: 470889 SYS: 227596 CTXSW: 0 INVCTXSW: 17100
HOSTN: USER: 462504 SYS: 236986 CTXSW: 0 INVCTXSW: 17036
HOSTN: USER: 452352 SYS: 248193 CTXSW: 0 INVCTXSW: 17177
SLEEP: USER: 115417 SYS: 186157 CTXSW: 171890 INVCTXSW: 18
SLEEP: USER: 104919 SYS: 200300 CTXSW: 171789 INVCTXSW: 10
SLEEP: USER: 96452 SYS: 209680 CTXSW: 171791 INVCTXSW: 9
HOSTN: USER: 451047 SYS: 249686 CTXSW: 0 INVCTXSW: 17239
SLEEP: USER: 120125 SYS: 186511 CTXSW: 171787 INVCTXSW: 19
HOSTN: USER: 509340 SYS: 237106 CTXSW: 0 INVCTXSW: 17488

@devreal
Contributor

devreal commented Sep 4, 2020

Some more thoughts: it's not surprising that the yielding tasks get as much runtime as the gethostname tasks; that's the nature of the completely fair scheduler ^^ In a more realistic scenario, however, sched_yield might give some time to a process that is runnable but not scheduled yet. Calling it in a loop probably doesn't yield much, though.

Some HPC systems may not use CFS, in which case sched_yield might behave differently. I don't think we should abandon it, but instead keep it as an option so that users can switch between usleep, nanosleep, and sched_yield (preferably at runtime). FWIW, the yielding needs some fixups for 5.0 anyway to accommodate the ULT integrations, see #7702.
