Optimize busy loop for modern linux by default #2041

Closed
wants to merge 2 commits into from

brada4
Contributor

@brada4 brada4 commented Mar 3, 2019

Use a plain timed wait (a short sleep) in place of sched_yield, which is expensive on the kernel side in modern Linux, including when called from lightweight virtualisation such as LXC or chroots.
I managed to make 8000 sched_yields per second without pti and 5000 with it, so this is already an improvement where a zero threading threshold is in force and, because the kernel is no longer bombarded with calls, also for CPUs with turbo.
It cut the time spent in gemm by about 5% on a huge data set; I did not test much beyond that.
The explanation comes from a different point in the file.
This does not solve the underlying problem that a busy loop is used where some light IPC could work; it just makes life easier for the current code.
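
For readers who want to picture the change, here is a minimal sketch of a wait loop that naps instead of yielding. It is not the patch from this PR: `wait_for_flag` and `nap_ns` are hypothetical names chosen for illustration, and the nap length is an assumption that would need measuring.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdatomic.h>
#include <time.h>

/* Hypothetical helper, not OpenBLAS code: block until *flag becomes
   non-zero, napping between checks instead of calling sched_yield().
   Returns the number of naps taken. */
static long wait_for_flag(atomic_int *flag, long nap_ns)
{
    /* nanosleep() requires 0 <= tv_nsec < 1e9. */
    assert(nap_ns >= 0 && nap_ns < 1000000000L);

    const struct timespec nap = { .tv_sec = 0, .tv_nsec = nap_ns };
    long naps = 0;

    while (atomic_load_explicit(flag, memory_order_acquire) == 0) {
        /* The thread is descheduled here, so the core can run other
           work (or idle) until the nap expires. */
        nanosleep(&nap, NULL);
        naps++;
    }
    return naps;
}
```

A worker would publish completion with `atomic_store_explicit(flag, 1, memory_order_release)`. The trade-off is latency: the waiter notices the flag at most one nap (plus timer slack) after it is set.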

@brada4
Contributor Author

brada4 commented Mar 3, 2019

The Bulldozer spinning code right above gives questionable benefit on Piledriver: no turbo, though in theory it reacts immediately when threads are done.

@martin-frbg
Collaborator

We have been there several times without conclusive results (most recently #1861, I think), and your PR does look similar to my #1600, which I threw out again in #1613. So what is new?

@brada4
Contributor Author

brada4 commented Mar 4, 2019

Nothing is new; I just spell out the cause and the consequence.
Linux 2.6 has a compatibility switch to enable the 2.4 behaviour, and I am pretty sure we do target other systems.
Otherwise it is fine, the same as the earlier patches.

@martin-frbg
Collaborator

So why sleep again, like my earlier "failed" PR, rather than nop or pause?

@brada4
Contributor Author

brada4 commented Mar 4, 2019

A nop does not un-schedule the process; sleep does. That eases the LXC scenario: the core that spins in the busy loop actually gives its cycles to whatever else is running on the system, since it does not contribute to the computation at that point anyway.

@brada4
Contributor Author

brada4 commented Mar 4, 2019

E.g. run an idle-priority process or thread per core outside the OpenBLAS process, then count how much CPU time it accumulated; with sleep it gets a few seconds that would otherwise be spent bombarding the kernel.
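
A simpler, single-process variant of that measurement can be sketched as follows: instead of watching what an outside idle-priority process gains, it records how much CPU time the waiting thread itself burns for the same wall-clock wait. The names `cpu_relax`, `now_s`, and `cpu_cost_of_waiting` and the 200 µs nap are assumptions for illustration, not anything from the OpenBLAS tree.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Spin hint: "pause" on x86, a plain compiler barrier elsewhere. */
static inline void cpu_relax(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__("pause" ::: "memory");
#else
    __asm__ __volatile__("" ::: "memory");
#endif
}

/* Read the given clock as seconds in a double. */
static double now_s(clockid_t id)
{
    struct timespec ts;
    clock_gettime(id, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Wait for wall_s seconds of wall-clock time and return the CPU
   seconds this thread consumed while waiting.
   use_sleep != 0: nap with nanosleep(); use_sleep == 0: spin. */
static double cpu_cost_of_waiting(double wall_s, int use_sleep)
{
    assert(wall_s > 0.0);

    const struct timespec nap = { .tv_sec = 0, .tv_nsec = 200000 }; /* 200 us */
    double cpu_start = now_s(CLOCK_THREAD_CPUTIME_ID);
    double deadline  = now_s(CLOCK_MONOTONIC) + wall_s;

    while (now_s(CLOCK_MONOTONIC) < deadline) {
        if (use_sleep)
            nanosleep(&nap, NULL);   /* thread leaves the run queue */
        else
            cpu_relax();             /* thread keeps the core busy */
    }
    return now_s(CLOCK_THREAD_CPUTIME_ID) - cpu_start;
}
```

Comparing `cpu_cost_of_waiting(0.05, 1)` with `cpu_cost_of_waiting(0.05, 0)` shows how much of the waiting core is left for other processes; the absolute numbers depend on the kernel, timer slack, and whether the code runs in a VM or container.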

@brada4
Contributor Author

brada4 commented Mar 5, 2019

A nop chain should be good if it lets the SIMD part of the CPU sleep: the core is not completely idle, but power goes down and turbo goes up. It is also possible that one user in a hundred disables idle mwait/hlt in the kernel, so that short naps are no naps. As long as this is in generic code, no single approach will be best for every case, but at least the default should not be very bad. I mean, assuming that nanosleep ends up as hlt or mwait when no other process is runnable should be safe for the common case.

@brada4
Contributor Author

brada4 commented Jun 8, 2019

Sleep is wildly better on a virtual machine, while on a real CPU it makes largely no difference. I don't know.

@brada4
Contributor Author

brada4 commented Aug 1, 2019

I will make a new PR with a top-level option that leaves the defaults intact; packagers can then try to measure.
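
One way such an opt-in could look is a compile-time switch whose default keeps the current behaviour. This is only a sketch under assumptions: `WAIT_STRATEGY`, `WAIT_YIELD`, `WAIT_SLEEP`, `WAIT_SPIN`, `wait_step`, and `wait_strategy_name` are hypothetical names, not options that exist in the OpenBLAS build system.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sched.h>
#include <time.h>

/* Hypothetical strategy identifiers. */
#define WAIT_YIELD 0   /* sched_yield(), the unchanged default */
#define WAIT_SLEEP 1   /* short nanosleep() */
#define WAIT_SPIN  2   /* stay on the core, no syscall */

#ifndef WAIT_STRATEGY
#define WAIT_STRATEGY WAIT_YIELD
#endif

static_assert(WAIT_STRATEGY >= WAIT_YIELD && WAIT_STRATEGY <= WAIT_SPIN,
              "unknown WAIT_STRATEGY");

/* One step of the wait loop, selected at build time. */
static inline void wait_step(void)
{
#if WAIT_STRATEGY == WAIT_SLEEP
    const struct timespec nap = { .tv_sec = 0, .tv_nsec = 1000 };
    nanosleep(&nap, NULL);
#elif WAIT_STRATEGY == WAIT_SPIN
    __asm__ __volatile__("" ::: "memory");
#else
    sched_yield();
#endif
}

/* Lets a benchmark log which variant was compiled in. */
static inline const char *wait_strategy_name(void)
{
#if WAIT_STRATEGY == WAIT_SLEEP
    return "sleep";
#elif WAIT_STRATEGY == WAIT_SPIN
    return "spin";
#else
    return "yield";
#endif
}
```

A packager could then build once per variant, for example with `-DWAIT_STRATEGY=1` for the sleeping version, and compare timings on their own hardware or VM.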

@brada4 brada4 closed this Aug 1, 2019
@brada4 brada4 deleted the wait branch August 1, 2019 07:12