Replace sched_yield with usleep(1) on Linux #1051
Conversation
@martin-frbg Do you measure the overhead of sched_yield or usleep(1)?
Unfortunately I do not have access to any serious HPC system (and I have to agree the two issue threads I referred to are a bit confusing, with attempts to modify the GEMM thresholds intermixed with the changes to YIELDING). On my lowly dual-core Kaby Lake laptop I do see a significant increase in throughput with the (improved) deig.R benchmark (2k x 2k going from 8000 Mflops/28 sec to 24000/9 sec), though most of that will probably be due to avoiding thermal throttling.
@xianyi just start an idle CPU hog, e.g.:
I'll see if I can retask at least a 4-core Haswell to benchmarking tomorrow.
First results (still with deig.R, on a quad-core Haswell with turboboost disabled to ensure repeatability) show a slight slowdown rather than the 10 percent speedup reported for fenrus75's test case. The change does however allow the cores to drop to "idle" intermittently, where with sched_yield the core allocation was split between userspace and system time - this is probably what helps thermal management, and it would make the calculation run faster with dynamic overclocking enabled.
For me an empty YIELDING {} gave the best result, with a 0.1um speedup.
Interestingly an empty YIELDING is what was used in early versions of libGoto2 up to at least 1.08, while 1.13 had sched_yield. |
Some preliminary data (more to come next weekend if time and workload permit). deig.R benchmark
You can cat() the whole of z: z[1] would be the user time (that makes flops) and z[2] the system time (that generates heat).
From sched_yield(2) manual page: |
Yeah, we can all read manpages - my problem is: when would a call to sched_yield clearly be considered unnecessary, inappropriate and indecent, and why would K. Goto and all who came after him decide this was not the case here? (Note that libGoto2-1.13 was released well after the Linux kernel got the current sched_yield semantics - of course it could be that the man was using some flavor of *BSD, or something else with a lightweight sched_yield, at the time.)
PR updated to current favorite just to prevent accidental merging of known bad idea. |
Did not get around to much more yet, so no complete picture. Quick comparison of a deig.R run with 4 threads and a 10240x10240 matrix on an i7-4770, with and without turboboost enabled:
It seems that the slack shows up more for small samples, without impacting overall time.
Just remember that |
Thanks for that comment - indeed I am still looking for reasons why sched_yield was preferred so far. If I am not mistaken, a default build of OpenBLAS will have a built-in thread limit of two times the number of cores, on x86 at least, and so far the behaviour of a "lightly" overloaded system without sched_yield did not appear to be worse. (And I did see a clear improvement in terms of thermal management with my - apparently poorly designed - early Kaby Lake laptop.) So perhaps it would be sufficient to provide a choice of implementations for YIELDING in Makefile.rule, or to couple it to the NUM_THREADS value in relation to the physical cores detected? (Lastly, there is always the option to limit the number of threads to some manageable quantity for the given system at runtime.)
Closing this for now, as the results so far are a bit inconclusive and I need to get rid of my current fork - doing subsequent PRs from it seems to have led to an unintended revert of #988.
This replaces sched_yield with usleep(1) on Linux, to avoid the massive overhead of the sched_yield call on Linux kernels since its semantics were changed in early 2003 (late 2.5 series) to include reordering of the thread queue. Ref. #900, #923