mpi_yield_when_idle setting from etc/openmpi-mca-params.conf is ignored in 4.0.0 #6433

Closed
iassiour opened this issue Feb 25, 2019 · 21 comments

@iassiour

Hi,

I have set "yield when idle" in etc/openmpi-mca-params.conf

tail openmpi-mca-params.conf

plm_rsh_agent = rsh
rmaps_base_oversubscribe = 1
hwloc_base_binding_policy = none
mpi_yield_when_idle = 1

However, the setting seems to be ignored.

I get 100% us, 0% sys for processes waiting in MPI calls.

But if I set the environment variable:

setenv OMPI_MCA_mpi_yield_when_idle 1

or
mpirun --mca mpi_yield_when_idle true

I get the more expected 15% us, 85% sys.

Could someone please check whether the setting is read from the configuration file as expected?
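
For illustration, a minimal hypothetical reproducer of this kind of busy waiting (not code from our application): rank 1 blocks in MPI_Recv while rank 0 sleeps, so the waiting rank's CPU usage can be watched in top.

/* spin_recv.c - hypothetical reproducer: rank 1 waits in MPI_Recv while
 * rank 0 sleeps, so the waiting rank's CPU usage can be observed in top. */
#include <mpi.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, msg = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        sleep(60);                                   /* keep rank 1 waiting */
        MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}

Compiled with mpicc and launched with mpirun -np 2, the rank-1 process can then be watched in top with and without mpi_yield_when_idle = 1 in the config file.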

ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Feb 27, 2019
only sets the OMPI_MCA_mpi_yield_when_idle environment variable
if this directive was not given (for example via a config file).

Refs. open-mpi#6433

Signed-off-by: Gilles Gouaillardet <[email protected]>
@jsquyres
Member

You may not have gotten an email about it, but @ggouaillardet proposed a PR about this -- see #6440.

ggouaillardet added a commit to ggouaillardet/ompi that referenced this issue Feb 28, 2019
in schizo/ompi, sets the new OMPI_MCA_mpi_oversubscribe environment
variable according to the node oversubscription state.

This MCA parameter is used to set the default value of the
mpi_yield_when_idle parameter.

This two steps tango is needed so the mpi_yield_when_idle setting
is always honored when set in a config file.

Refs. open-mpi#6433

Signed-off-by: Gilles Gouaillardet <[email protected]>
@iassiour
Author

Greetings @jsquyres,

Thanks. Could you please let me know which future minor version of Open MPI will include this fix?

wbailey2 pushed a commit to wbailey2/ompi that referenced this issue Nov 10, 2019
wbailey2 added a commit to wbailey2/ompi that referenced this issue Nov 10, 2019
Issue open-mpi#6433 notes that although yield_when_idle was fixed, it had
not actually been pushed to v4.0.x. To remedy this, I cherry-picked the
fix from user ggouaillardet and pushed it to v4.0.x.

Signed-off-by: William Bailey <[email protected]>
@gpaulsen
Member

@hppritcha Should we just cherry-pick the original #6440 fix to v4.0.x for inclusion in v4.0.3?

@gpaulsen gpaulsen self-assigned this Nov 11, 2019
wbailey2 pushed a commit to wbailey2/ompi that referenced this issue Nov 14, 2019
wbailey2 pushed a commit to wbailey2/ompi that referenced this issue Dec 2, 2019 (cherry-picked from cc97c0f)
@hppritcha
Member

Closed via #7168 and #6440.

cniethammer pushed a commit to cniethammer/ompi that referenced this issue May 10, 2020 (cherry-picked from cc97c0f)
@planetA

planetA commented Sep 4, 2020

@iassiour @ggouaillardet

Sorry for reviving an old issue, but as far as I know, yield does nothing on modern Linux kernels. Or rather, it immediately returns to the process. At least with the modern CFS.

Considering that, why would Open MPI need to use yield?

The man page for sched_yield says:

Use of sched_yield() with nondeterministic scheduling policies such as SCHED_OTHER is unspecified and very likely means your application design is broken.

@devreal
Contributor

devreal commented Sep 4, 2020

@planetA

Sorry for reviving an old issue, but as far as I know, yield does nothing on modern Linux kernels. Or rather, it immediately returns to the process. At least with the modern CFS.

Can you point out documentation for this behavior? There is plenty of noise out there; the only useful thing I found was https://www.kernel.org/doc/html/latest/scheduler/sched-design-CFS.html, which states:

yield_task(…)

This function is basically just a dequeue followed by an enqueue [...]

I would expect that any runnable task with a similar priority gets scheduled for execution before the task calling sched_yield gets a chance to run again. I'm not deep into CFS specifics though...

Considering that, why would Open MPI need to use yield?

Consider two threads/processes sending each other messages on an oversubscribed node. The receiver is scheduled first, while the sender is waiting for its timeslice to become available. The alternative to calling sched_yield (or I guess usleep would do too) is to burn through all the cycles in the receiver's timeslice before the sender can run. It's simply meant to reduce the wasted time when a peer is eligible to run but not scheduled, which can happen on oversubscribed nodes.
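
To illustrate the idea (a sketch only, not Open MPI's actual progress loop):

#include <sched.h>
#include <stdbool.h>

/* Hypothetical spin-wait: poll for a completion flag and, if configured,
 * give the CPU back so an oversubscribed peer can run and produce the
 * event we are waiting for, instead of us burning the whole timeslice. */
static void wait_for_completion(volatile int *flag, bool yield_when_idle)
{
    while (!*flag) {
        /* ... poll the network / drive progress here ... */
        if (yield_when_idle) {
            sched_yield();   /* hint: let another runnable task in */
        }
    }
}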

@planetA

planetA commented Sep 4, 2020

@devreal I contacted a colleague who is more knowledgeable in CFS than I am. Here is a quote:

And [sched_yield] can actually do something, even in CFS. But it does not necessarily yield the CPU. What it does is check whether yielding the CPU at this point would be fair, i.e. yield only when there is another thread that had less CPU time (considering the weights from compute intensiveness and nice levels) than the current one.

Basically it means that if a thread yields but has not yet used up its allotted time, the kernel will immediately return to the same thread.

I decided to do a crude experiment comparing the number of context switches when running gethostname and sched_yield in a busy loop. See the code here: https://gist.github.com/planetA/10738756412cc411a7f9002fcb2639f4

For me, sched_yield changes neither the user time nor the number of context switches:

$ taskset -c 3 ./test
YIELD: USER: 3815710 SYS: 6179530 CTXSW: 1 INVCTXSW: 1301
HOSTN: USER: 3815753 SYS: 6179601 CTXSW: 1 INVCTXSW: 1302

The output shows statistics from getrusage: USER is user CPU time, SYS is system CPU time, CTXSW is the number of voluntary context switches, and INVCTXSW is the number of involuntary context switches.

I checked with strace that both threads actually make the corresponding system calls.
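
For reference, a minimal single-loop sketch of that kind of measurement (an approximation written from the description above, not the exact gist code):

#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>
#include <time.h>
#include <unistd.h>

/* Busy-loop for a fixed wall-clock time calling either sched_yield() or
 * gethostname(), then report CPU time (in microseconds here) and context
 * switch counts from getrusage(). */
static void spin(int use_yield, double seconds)
{
    char buf[256];
    time_t start = time(NULL);
    while (difftime(time(NULL), start) < seconds) {
        if (use_yield)
            sched_yield();
        else
            gethostname(buf, sizeof(buf));
    }
}

int main(int argc, char **argv)
{
    int use_yield = (argc > 1 && strcmp(argv[1], "yield") == 0);
    struct rusage ru;

    spin(use_yield, 10.0);
    getrusage(RUSAGE_SELF, &ru);
    printf("%s: USER: %ld SYS: %ld CTXSW: %ld INVCTXSW: %ld\n",
           use_yield ? "YIELD" : "HOSTN",
           (long)ru.ru_utime.tv_sec * 1000000L + (long)ru.ru_utime.tv_usec,
           (long)ru.ru_stime.tv_sec * 1000000L + (long)ru.ru_stime.tv_usec,
           ru.ru_nvcsw, ru.ru_nivcsw);
    return 0;
}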

@devreal
Contributor

devreal commented Sep 4, 2020

And [sched_yield] can actually do something, even in CFS. But it does not necessarily yield the CPU. What it does is check whether yielding the CPU at this point would be fair, i.e. yield only when there is another thread that had less CPU time (considering the weights from compute intensiveness and nice levels) than the current one.

Basically it means that if a thread yields but has not yet used up its allotted time, the kernel will immediately return to the same thread.

I think this interpretation is not quite correct. AFAIU, there is no fixed time quantum that is allocated. Instead, sched_yield gives the scheduler the chance to check whether there is a task that has used less CPU time than the calling task and make a scheduling decision based on the p->se.vruntime value [§3, 1]. Scheduling is done relative to the time used by other tasks, not in terms of absolute timeslices (something I learned after I made my earlier comment ^^)

For me, sched_yield changes neither the user time nor the number of context switches:

$ taskset -c 3 ./test
YIELD: USER: 3815710 SYS: 6179530 CTXSW: 1 INVCTXSW: 1301
HOSTN: USER: 3815753 SYS: 6179601 CTXSW: 1 INVCTXSW: 1302

Two points:

  1. You should use RUSAGE_THREAD instead of RUSAGE_SELF (which, according to the man page, includes all threads); see the snippet at the end of this comment.
  2. Increasing the number of threads (I tried twenty instead of two) yields the desired effect:
$ taskset -c 0 ./test_sched_yield
YIELD: USER: 166768 SYS: 333536 CTXSW: 0 INVCTXSW: 474200
YIELD: USER: 150754 SYS: 349475 CTXSW: 0 INVCTXSW: 477002
YIELD: USER: 177729 SYS: 322545 CTXSW: 0 INVCTXSW: 474802
YIELD: USER: 202645 SYS: 298008 CTXSW: 0 INVCTXSW: 477600
HOSTN: USER: 348952 SYS: 151594 CTXSW: 0 INVCTXSW: 121
YIELD: USER: 207146 SYS: 293457 CTXSW: 0 INVCTXSW: 476302
HOSTN: USER: 295636 SYS: 205537 CTXSW: 0 INVCTXSW: 130
YIELD: USER: 256279 SYS: 244630 CTXSW: 0 INVCTXSW: 477404
HOSTN: USER: 360715 SYS: 140904 CTXSW: 0 INVCTXSW: 121
HOSTN: USER: 326283 SYS: 175909 CTXSW: 0 INVCTXSW: 124
HOSTN: USER: 288554 SYS: 213530 CTXSW: 0 INVCTXSW: 125
YIELD: USER: 197841 SYS: 303828 CTXSW: 0 INVCTXSW: 478504
YIELD: USER: 261765 SYS: 241099 CTXSW: 0 INVCTXSW: 478304
HOSTN: USER: 328843 SYS: 174258 CTXSW: 0 INVCTXSW: 128

It appears that the calls to sched_yield are treated as involuntary context switches, but the difference between hostname and sched_yield is clearly visible.

You are right that sched_yield does not force a task switch, but it appears to lead to more task switches happening, which is what you want in a scenario where tasks wait for each other on an oversubscribed system.

[1] https://www.kernel.org/doc/html/latest/scheduler/sched-design-CFS.html
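
For point 1 above, the per-thread numbers come from getrusage(RUSAGE_THREAD, ...); a minimal sketch (Linux-specific, requires _GNU_SOURCE):

#define _GNU_SOURCE            /* RUSAGE_THREAD is a Linux extension */
#include <stdio.h>
#include <sys/resource.h>

/* Report CPU time and context switches for the calling thread only,
 * instead of the sum over all threads that RUSAGE_SELF returns. */
static void report_thread_usage(const char *tag)
{
    struct rusage ru;
    getrusage(RUSAGE_THREAD, &ru);
    printf("%s: USER: %ld SYS: %ld CTXSW: %ld INVCTXSW: %ld\n", tag,
           (long)ru.ru_utime.tv_sec * 1000000L + (long)ru.ru_utime.tv_usec,
           (long)ru.ru_stime.tv_sec * 1000000L + (long)ru.ru_stime.tv_usec,
           ru.ru_nvcsw, ru.ru_nivcsw);
}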

@rhc54
Contributor

rhc54 commented Sep 4, 2020

Last I checked, OMPI stopped using sched_yield in the code base several generations ago because the Linux folks reported that it didn't accomplish what we wanted. I don't believe we call it any more, unless someone added it back in again.

@devreal
Contributor

devreal commented Sep 4, 2020

@rhc54 It's still alive and kicking, if OMPI detects oversubscription or the user enables it through an MCA parameter.

@rhc54
Contributor

rhc54 commented Sep 4, 2020

Hmmm...I do see it got put back into opal_progress again, but the runtime isn't setting anything for oversubscription, at least not in the master branch

@devreal
Contributor

devreal commented Sep 4, 2020

From what I see, ompi_mpi_yield_when_idle is passed to opal_progress_set_yield_when_idle; its value defaults to the ompi_mpi_oversubscribe value and is overridden by the mpi_yield_when_idle MCA parameter if that is set. The MCA parameter ompi_mpi_oversubscribe is described as "Internal MCA parameter set by the runtime environment when oversubscribing nodes". Not sure how exactly that works, though...
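
Schematically, my reading of that precedence looks like this (simplified to the environment-variable path only; mca_bool below is a made-up helper, not an actual OMPI function):

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Made-up helper: read a boolean MCA-style setting from the environment
 * (OMPI_MCA_<name>), falling back to a default when it is not set. */
static bool mca_bool(const char *name, bool dflt)
{
    char var[128];
    snprintf(var, sizeof(var), "OMPI_MCA_%s", name);
    const char *v = getenv(var);
    return v ? (atoi(v) != 0) : dflt;
}

int main(void)
{
    /* The runtime exports mpi_oversubscribe when it detects oversubscription;
     * that becomes the default for mpi_yield_when_idle, which an explicit
     * user setting still overrides. */
    bool oversubscribed  = mca_bool("mpi_oversubscribe", false);
    bool yield_when_idle = mca_bool("mpi_yield_when_idle", oversubscribed);
    printf("yield_when_idle = %d\n", (int)yield_when_idle);
    return 0;
}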

@rhc54
Contributor

rhc54 commented Sep 4, 2020

It doesn't, right now - that logic is missing from PRRTE. It probably should be added back to the schizo/ompi component, based on the older ORTE code.

@rhc54
Contributor

rhc54 commented Sep 4, 2020

I remember what this was all about now. We had received some complaints about CPU usage sitting at 100% while the process was in finalize or some other quasi-idle state. Investigating, we found that we were indeed calling sched_yield, but since no other process was ready/anxious to take the CPU, the kernel scheduler just put us right back into play...and so we cycled at 100% CPU.

The only solution was to use nanosleep (or something equivalent) to force a context switch, so we do that in a few places. This lets us back off the CPU when we know we are just idling.
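
A minimal sketch of that kind of backoff (illustrative only, not the exact code we use; the sleep length here is arbitrary):

#include <time.h>

/* Back off the CPU while idling: unlike sched_yield(), a short nanosleep
 * deschedules us even when no other task is waiting for the core. */
static void idle_backoff(void)
{
    struct timespec ts = { .tv_sec = 0, .tv_nsec = 1000 };   /* ~1 microsecond */
    nanosleep(&ts, NULL);
}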

@hjelmn
Member

hjelmn commented Sep 4, 2020

I thought we decided a while ago (years at this point) that sched_yield was a bad idea on Linux. It is discouraged at the very least. nanosleep is probably the best replacement.

@devreal
Contributor

devreal commented Sep 4, 2020

I believe MPICH lets the user choose which strategy to use. I don't think this has been implemented in OMPI yet. It could be a nifty feature (and low-hanging fruit) for 5.0...
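
Something like this could expose the choice (a sketch; the names below are made up and not an existing OMPI or MPICH interface):

#include <sched.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical runtime-selectable idle policy: how a waiting process
 * gives up (or keeps) the CPU between progress polls. */
typedef enum { IDLE_SPIN, IDLE_YIELD, IDLE_NANOSLEEP, IDLE_USLEEP } idle_policy_t;

static void idle_once(idle_policy_t policy)
{
    switch (policy) {
    case IDLE_YIELD:
        sched_yield();                           /* scheduler hint only        */
        break;
    case IDLE_NANOSLEEP: {
        struct timespec ts = { 0, 1 };           /* 1 ns request               */
        nanosleep(&ts, NULL);
        break;
    }
    case IDLE_USLEEP:
        usleep(1);                               /* coarser backoff            */
        break;
    case IDLE_SPIN:
    default:
        break;                                   /* keep polling at full speed */
    }
}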

@planetA

planetA commented Sep 4, 2020

@devreal Thanks for catching that bug. But if you look at the logs, you will notice that the user+sys time is equal for sched_yield and gethostname. This rather indicates that sched_yield just spends more time in the kernel, although both get an equal share of CPU resources.

@devreal
Contributor

devreal commented Sep 4, 2020

@planetA That is not surprising given that each thread runs for a fixed amount of time. So yes, the yielding threads will call into the kernel more often and thus have a higher share of sys time.

@planetA

planetA commented Sep 4, 2020

@devreal I fix the wall-clock time, not the CPU time. It rather indicates that CFS equalises user+system time rather than user time. And considering that, sched_yield gave nothing to the other threads.

Here are the results of a run with 40 threads (plots of user time and of user+system time).

@devreal
Contributor

devreal commented Sep 4, 2020

Ahh yes, my mistake. That is surprising, indeed (well, to me at least...) Thanks for pointing that out!

If I replace sched_yield with a nanosleep() of 1 ns, things look much better:

$ taskset -c 1 ./test_sched_yield
HOSTN: USER: 460763 SYS: 237935 CTXSW: 0 INVCTXSW: 17141
SLEEP: USER: 133467 SYS: 165823 CTXSW: 171590 INVCTXSW: 7
HOSTN: USER: 494181 SYS: 203252 CTXSW: 0 INVCTXSW: 17080
SLEEP: USER: 98858 SYS: 208700 CTXSW: 171489 INVCTXSW: 8
SLEEP: USER: 89434 SYS: 217754 CTXSW: 171586 INVCTXSW: 8
SLEEP: USER: 102674 SYS: 200682 CTXSW: 171593 INVCTXSW: 8
HOSTN: USER: 432745 SYS: 264688 CTXSW: 0 INVCTXSW: 17128
SLEEP: USER: 92480 SYS: 202576 CTXSW: 171791 INVCTXSW: 7
SLEEP: USER: 127337 SYS: 172280 CTXSW: 171791 INVCTXSW: 16
HOSTN: USER: 462106 SYS: 237187 CTXSW: 0 INVCTXSW: 17160
HOSTN: USER: 452250 SYS: 246682 CTXSW: 0 INVCTXSW: 17173
HOSTN: USER: 470889 SYS: 227596 CTXSW: 0 INVCTXSW: 17100
HOSTN: USER: 462504 SYS: 236986 CTXSW: 0 INVCTXSW: 17036
HOSTN: USER: 452352 SYS: 248193 CTXSW: 0 INVCTXSW: 17177
SLEEP: USER: 115417 SYS: 186157 CTXSW: 171890 INVCTXSW: 18
SLEEP: USER: 104919 SYS: 200300 CTXSW: 171789 INVCTXSW: 10
SLEEP: USER: 96452 SYS: 209680 CTXSW: 171791 INVCTXSW: 9
HOSTN: USER: 451047 SYS: 249686 CTXSW: 0 INVCTXSW: 17239
SLEEP: USER: 120125 SYS: 186511 CTXSW: 171787 INVCTXSW: 19
HOSTN: USER: 509340 SYS: 237106 CTXSW: 0 INVCTXSW: 17488

@devreal
Contributor

devreal commented Sep 4, 2020

Some more thoughts: it's not surprising that the yielding tasks get as much runtime as the gethostname tasks; that's the nature of the completely fair scheduler ^^ In a more realistic scenario, however, sched_yield might give some time to a process that is runnable but not scheduled yet. Calling it in a loop probably doesn't yield much, though.

Some HPC systems may not use CFS, in which case sched_yield might behave differently. I don't think we should abandon it, but instead keep it as an option so that users can switch between usleep, nanosleep, and sched_yield (preferably at runtime). FWIW, the yielding needs some fixups for 5.0 anyway to accommodate the ULT integrations, see #7702.
