Optimize busy loop for modern linux by default #2041
Conversation
The Bulldozer spin code right above gives questionable benefit on Piledriver: you get no turbo, though in theory the reaction to threads finishing is immediate.
Nothing here is new; I am just pointing out cause and consequence.
So why sleep again, like in my earlier "failed" PR, rather than nop or pause?
A nop does not un-schedule the process, while a sleep does. That eases the LXC scenario: the core spinning the busy loop actually gives cycles to whatever else is on the system, since it is not contributing to the computation at that point anyway.
E.g. run an idle-priority process or thread per core outside the OpenBLAS process, then check how much CPU time it accumulated; with the sleep version it gets a few seconds instead of those cycles being burned inside the kernel.
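A rough way to try that measurement (my own sketch, not part of the PR; it assumes Linux and the `SCHED_IDLE` policy): run one of these spinners per core outside the OpenBLAS process while a threaded gemm runs, then compare how much CPU time each spinner picked up under the spinning yield versus the sleeping yield.

```c
#define _GNU_SOURCE          /* for SCHED_IDLE */
#include <sched.h>
#include <stdio.h>
#include <time.h>

/* Idle-priority spinner: it only runs when nothing else wants this core. */
int main(void)
{
    struct sched_param sp = { .sched_priority = 0 };
    if (sched_setscheduler(0, SCHED_IDLE, &sp) != 0)
        perror("sched_setscheduler");          /* falls back to normal priority */

    struct timespec wall0, wall, cpu;
    clock_gettime(CLOCK_MONOTONIC, &wall0);
    do {                                       /* spin for ~30 s of wall time */
        clock_gettime(CLOCK_MONOTONIC, &wall);
    } while (wall.tv_sec - wall0.tv_sec < 30);

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu);
    printf("CPU time this idle spinner got: %ld.%03ld s\n",
           (long)cpu.tv_sec, cpu.tv_nsec / 1000000L);
    return 0;
}
```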
A nop chain should be fine in the cases where it lets the SIMD part of the CPU sleep: the core is not completely idle, but power goes down and turbo goes up. It is also possible that one system in a hundred disables idle mwait/hlt in the kernel, so that short naps are not naps at all. As long as this lives in generic code, no single choice will be best for every case, but it should at least not be very bad by default; assuming nanosleep ends up in hlt or mwait when no other process runs, it should be safe for the common case.
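For comparison, the nop/pause alternative discussed above looks roughly like this (illustrative only, `SPIN_YIELD` is not the actual OpenBLAS macro): the thread stays runnable the whole time, it just tells the core to relax for a few cycles per iteration.

```c
/* Spin-yield via a pause hint (a nop chain is the same idea): the thread
 * never leaves the CPU, so nothing else can run on that core meanwhile. */
#if defined(__x86_64__) || defined(__i386__)
#define SPIN_YIELD() __asm__ __volatile__("pause")
#else
#define SPIN_YIELD() __asm__ __volatile__("nop;nop;nop;nop;nop;nop;nop;nop")
#endif
```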
Sleep is wildly better on a virtual machine, while on a real CPU it makes little difference either way. I don't know why.
I will make a new PR with a top-level option, leaving the defaults intact; packagers can then try to measure.
Use a plain clocked wait in place of sched_yield, which is heavy on the modern kernel side, also when called from light virtualisations like LXC or chroots.
I managed to get 8000 sched_yields per second without PTI and 5000 with it, so this is already better for cases where a zero threading threshold is in force, and, since the kernel is no longer being hammered, also for turbo CPUs.
It dropped about 5% of the time spent in gemm on a huge problem set; I did not test much more.
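A minimal sketch of the kind of change described (the function name, the build switch and the delay value are illustrative, not the exact patch): the worker's busy-wait yield does a short clocked sleep instead of calling sched_yield(), so the spinning core is really descheduled.

```c
#include <sched.h>
#include <time.h>

static inline void busy_loop_yield(void)
{
#ifdef USE_SLEEP_YIELD                        /* hypothetical build switch */
    struct timespec ts = { 0, 4000 };         /* a few microseconds; tune per system */
    clock_nanosleep(CLOCK_MONOTONIC, 0, &ts, NULL);
#else
    sched_yield();                            /* old default: expensive with PTI, LXC */
#endif
}
```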
Explanation repeated from a different point in the file:
This does not solve the problem of a busy loop being employed where some light IPC could work; it just eases the life of the current code.