
Replace sched_yield with usleep(1) on Linux #1051


Closed
wants to merge 2 commits

Conversation

martin-frbg
Collaborator

To avoid the massive overhead of the sched_yield call on Linux kernels, whose semantics were changed in early 2003 (late 2.5 series) to include reordering of the thread queue. Ref. #900, #923

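(For context, the change amounts to redefining the YIELDING macro used in OpenBLAS's busy-wait loops, normally defined in common.h; a minimal sketch, with the per-platform #ifdefs omitted and the wait loop simplified for illustration:)

    /* Sketch of what the PR changes, not the literal OpenBLAS source. */
    #include <sched.h>   /* sched_yield */
    #include <unistd.h>  /* usleep */

    /* current default: hand the CPU back to the scheduler */
    /* #define YIELDING sched_yield() */

    /* this PR (first version): sleep for one microsecond instead */
    #define YIELDING usleep(1)

    /* illustrative wait loop: spin until another thread raises the flag,
     * yielding on every iteration so the waiting core is not a pure hog */
    static void wait_for_flag(volatile int *flag) {
        while (!*flag) {
            YIELDING;
        }
    }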
@xianyi
Collaborator

xianyi commented Jan 9, 2017

@martin-frbg Did you measure the overhead of sched_yield versus usleep(1)?

@martin-frbg
Collaborator Author

Unfortunately I do not have access to any serious HPC system (and I have to agree that the two issue threads I referred to are a bit confusing, with attempts to modify the GEMM thresholds intermixed with the changes to YIELDING). On my lowly dual-core Kaby Lake laptop I do see a significant increase in throughput with the (improved) deig.R benchmark (2k x 2k going from 8000 MFlops/28 sec to 24000 MFlops/9 sec), though most of that is probably due to avoiding thermal throttling.
Note also that wernsaar already replaced sched_yield with asm(nop) for a number of AMD CPUs in the past, while I do not think the underlying issue lies in the CPU hardware. Should I open a separate issue to discuss this?
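(For reference, the asm(nop)-style definition mentioned here amounts to roughly the following sketch; the exact nop count used for the AMD targets may differ:)

    /* burn a few pipeline cycles instead of entering the kernel at all */
    #define YIELDING __asm__ __volatile__ ("nop;nop;nop;nop;nop;nop;nop;nop;\n")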

@brada4
Contributor

brada4 commented Jan 9, 2017

@xianyi just start an idle CPU hog, e.g.:
nice yes > /dev/null &
Then see how much spare time you get running the same sample with the different YIELDING options.
It either leaves more CPU for competing users, or generates less heat for nothing in the absence of such competition...
I don't get a significant boost in wall time, but system CPU time consumption drops noticeably; somebody with a watt-meter would be better positioned to measure.
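(A quick way to see the user/system split described here, outside of R, is something along these lines; a sketch only, where the busy loop merely stands in for the benchmarked BLAS call:)

    /* Report wall, user and system time around a workload, to see how much
     * of a run is burned inside the kernel.  work() is a placeholder for
     * the real benchmark, e.g. a large dgemm. */
    #include <stdio.h>
    #include <sys/resource.h>
    #include <sys/time.h>
    #include <time.h>

    static double to_seconds(struct timeval tv) {
        return tv.tv_sec + tv.tv_usec / 1e6;
    }

    static void work(void) {
        volatile double x = 0.0;               /* placeholder workload */
        for (long i = 0; i < 100000000L; i++) x += 1e-9;
        (void)x;
    }

    int main(void) {
        struct timespec t0, t1;
        struct rusage ru;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        work();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        getrusage(RUSAGE_SELF, &ru);

        printf("wall   %.2f s\n",
               (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
        printf("user   %.2f s\n", to_seconds(ru.ru_utime));  /* useful flops */
        printf("system %.2f s\n", to_seconds(ru.ru_stime));  /* kernel overhead */
        return 0;
    }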

@martin-frbg
Collaborator Author

I'll see if I can retask at least a 4-core Haswell to benchmarking tomorrow.

@martin-frbg
Collaborator Author

First results (still with deig.R, on a quad-core Haswell with turboboost disabled to ensure repeatability) show a slight slowdown rather than the 10 percent speedup reported for fenrus75's test case. The change does however allow the cores to drop to "idle" intermittently, whereas with sched_yield the core allocation was split between userspace and system time - this is probably what helps thermal management, and it would make the calculation run faster with dynamic overclocking enabled.

@brada4
Contributor

brada4 commented Jan 10, 2017

For me, an empty YIELDING {} gave the best result, with a 0.1um speedup.

@martin-frbg
Collaborator Author

Interestingly an empty YIELDING is what was used in early versions of libGoto2 up to at least 1.08, while 1.13 had sched_yield.

@martin-frbg
Collaborator Author

Some preliminary data (more to come next weekend if time and workload permit). deig.R benchmark modified for matrix size 10240x10240, four threads on an otherwise idle quad-core Haswell at 3.4GHz w/o turboboost, with YIELDING defined as:

    sched_yield   62626.0 MFlops   457.1 sec
    usleep(1)     56996.8 MFlops   502.2 sec
    nothing       62442.8 MFlops   458.4 sec
    8x nop        62759.1 MFlops   456.1 sec

So usleep(1) appears to be significantly slower; the other alternatives are basically on par in terms of speed, but as mentioned above sched_yield keeps the cores from going idle. Results for 2 and 8 threads are similar to this (though I only have them for small matrix sizes up to 2048x2048).

@brada4
Contributor

brada4 commented Jan 11, 2017

You can cat() the whole z: z[1] would be user time (that makes flops) and z[2] system time (that generates heat).
2k x 2k x 8 bytes = 32 MB, so probably don't try this on a NUMA system...
I think the actual purpose of sched_yield is to make cores idle, but I am not sure.
My Google research also shows that the heavy sched_yield is a product of kernel 2.6, which is the oldest in the wild for all practical purposes.

@brada4
Contributor

brada4 commented Jan 11, 2017

From sched_yield(2) manual page:
Strategic calls to sched_yield() can improve performance by giving other threads or processes a chance to run when (heavily) contended resources (e.g., mutexes) have been released by the caller. Avoid calling sched_yield() unnecessarily or inappropriately
(e.g., when resources needed by other schedulable threads are still held by the caller), since doing so will result in unnecessary context switches, which will degrade system performance.

@martin-frbg
Collaborator Author

martin-frbg commented Jan 11, 2017

Yeah, we can all read manpages - my problem is when a call to sched_yield would clearly be considered unnecessary, inappropriate and indecent, and why K. Goto and all who came after him decided this was not the case here. (Note libGoto2-1.13 was released well after the Linux kernel got the current sched_yield semantics - of course it could be that the man was using some flavor of *BSD or something else with a lightweight sched_yield at the time.)

@martin-frbg
Collaborator Author

PR updated to the current favorite, just to prevent accidental merging of a known bad idea.

@martin-frbg
Collaborator Author

Did not get around to much more yet, so no complete picture. Quick comparison of a deig.R run with 4 threads and a 10240x10240 matrix, i7-4770 with and without turboboost enabled:

                  fixed freq                turboboost
    sched_yield   62625 MFlops   457 sec    64515 MFlops   443 sec
    asm(8x nop)   62759 MFlops   456 sec    65407 MFlops   437 sec

@brada4
Contributor

brada4 commented Jan 17, 2017

It seems that the slack shows up more for small samples, without impacting overall time.

@jeffhammond

Just remember that sched_yield has a significant upside over nop when oversubscribing. It would be a lot better to allow users to select the more forgiving option if they intend to run in a multi-tenant or desktop environment.

@martin-frbg
Collaborator Author

martin-frbg commented Jan 19, 2017

Thanks for that comment - indeed I am still looking for reasons why sched_yield was preferred so far. If I am not mistaken, a default build of OpenBLAS will have a built-in thread limit of two times the number of cores, on x86 at least, and so far the behaviour of a "lightly" overloaded system without sched_yield did not appear to be worse. (And I did see a clear improvement in terms of thermal management with my - apparently poorly designed - early Kaby Lake laptop.) So perhaps it would be sufficient to provide a choice of implementations for "YIELDING" in Makefile.rule, or to couple it to the NUM_THREADS value in relation to the physical cores detected? (Lastly, there is always the option to limit the number of threads to some manageable quantity for the given system at runtime.)
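(A build-time switch along these lines is sketched below; note that neither the YIELDING_MODE flag nor its values exist in OpenBLAS, they are purely illustrative of the Makefile.rule idea, with the flag passed down as a -D define:)

    /* Hypothetical build-time selectable YIELDING, e.g. driven by a
     * -DYIELDING_MODE=... define exported from Makefile.rule. */
    #ifndef YIELDING_MODE
    #define YIELDING_MODE 0
    #endif

    #if YIELDING_MODE == 0
      #include <sched.h>
      #define YIELDING sched_yield()    /* forgiving on oversubscribed hosts */
    #elif YIELDING_MODE == 1
      /* cheapest when threads <= physical cores */
      #define YIELDING __asm__ __volatile__ ("nop;nop;nop;nop;nop;nop;nop;nop;\n")
    #else
      #define YIELDING do {} while (0)  /* empty, as in early libGoto2 */
    #endif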

@martin-frbg
Collaborator Author

Closing this for now, as the results so far are a bit inconclusive and I need to get rid of my current fork, since doing subsequent PRs from it seems to have led to an unintended revert of #988.
