ggml: refactor compute thread: merge three spin variables into one #816
Conversation
I had implemented it. Observations:
Not sure these observations are right, because performance varies across platforms.
I've tested it on my Intel macOS over 20 times with 7B/13B model data, without corruption.
I'll see if this affects #813.
@mqy: can you please remove the disable flag? I think this is more confusing for reviewers than simply switching between commits with git. Thanks!
disclaimer: this information is all through the lens of the x86 arch, since I know virtually nothing about the inner workings of other archs like ARM. Regarding the current implementation: these three are all different things:

#ifdef _WIN32
#include <windows.h>
#include <intrin.h>    // _mm_pause on MSVC
#else
#include <sched.h>
#include <immintrin.h> // _mm_pause on GCC/Clang
#endif
volatile int lock = 1;
void spinlock() {
while (lock) {
// no-op, spinlock
}
}
void spinlock_with_hint() {
while (lock) {
_mm_pause(); // generate a F3 90 (PAUSE) instruction to hint processor of spinlock
}
}
void threadyield() {
while (lock) {
#ifdef _WIN32
Sleep(0); // yield thread's timeslice - not a spinlock
#else
sched_yield(); // same thing but unix
#endif
}
}

It should be noted though that yielding allows for thread context switching, which can potentially have a huge impact on CPU caching / branch prediction, especially with more aggressive inlining and longer functions; this should be reviewed when that changes. So using the yield method can currently decrease performance, however since the ggml funcs aren't aggressively inlined currently, the impact on cache/prediction is probably limited.

Spinlocking does consume more energy as it locks the thread to 100%, but using the PAUSE instruction can mitigate some energy use (on x86, I'm not familiar with other archs). A spinlock is faster, but I don't think that will have any measurable impact in this case. What could have an impact is that locking the thread keeps it on the same processor and context switching is not allowed, so there is a potential cache/prediction effect here. However, I'm unsure how processors actually do caching/branch prediction at the bare-metal level while waiting for a spinlock to be released.

There is also the possibility to use signaling methods like mutexes/semaphores ("Always Use a Lightweight Mutex"). In these cases the platform-specific options are usually the best, but they also decrease portability and have to be implemented specifically for each platform, so they are not the best option.

However, all of those waiting methods are very fast, ultimate timing precision isn't an issue here, and we aren't spawning thousands of threads, so I think the end result doesn't actually depend so much on which lock method is the fastest or most precise, but comes down to how they affect the calculations that follow them, in terms of the invisible black-magic portion of CPU branch prediction/caching at the bare-metal level. I don't think this is something easily solved logically (at least by a mere mortal; some semiconductor fab expert could), but rather by testing different methods and measuring results.

Even inside just x86, the different methods could have wildly varying results depending on the CPU (Intel/AMD, high/low power, new/old, etc.) and also on stuff like whether you're losing performance by getting downclocked from hitting the thermal/power limit or not. For example, even if a spinlock increases performance in theory, if the power use leads to a downclock, the real-world performance would actually be lower. Pretty much any x86 laptop will not have adequate cooling to reach max performance, so lowering power use will in almost all cases increase performance too. For adequately cooled desktops this can be hit-or-miss, since most processors can run at 100% without getting throttled, and in that case the more-energy, more-performance option would be the best. Then again, the latest Intel 13th-gen processors run so hot that there doesn't exist a cooling solution on the planet which could give them max perf; they always end up getting thermally throttled no matter what.

And if this wasn't complicated enough yet, there is the matter of Intel's 12th/13th-gen P/E core architecture, along with thread scheduling and P/E work triage happening in tandem between the hardware "Intel Thread Director" and the OS thread scheduler. It appears that the selling point of having a "smart thread scheduler" isn't so smart after all in real-world applications, and there is potential for huge performance gains in optimizing this: #572 #842

It could be useful to have multiple options for perf testing with something like:

__forceinline static void lockthread() {
while (lock) {
#if GGML_WAIT_MODE == GGML_WAIT_MODE_YIELD
Sleep(0); // yield thread's timeslice - not a spinlock
#elif GGML_WAIT_MODE == GGML_WAIT_MODE_SLEEP
Sleep(1); // sleep for ~1ms
#elif GGML_WAIT_MODE == GGML_WAIT_MODE_SPIN_HINT
_mm_pause();
#elif GGML_WAIT_MODE == GGML_WAIT_MODE_SPIN
// no-op
#endif
}
}

Because even something slower like Sleep(1) could actually result in an overall improvement if it prevents thermal/power throttling.

Sorry for the wall of text, but the topic of threading is pretty complex and cannot really be properly condensed into just a few sentences.
According to C11 they are not. I know that on x86, aligned loads and stores are atomic, but that isn't portable.
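For reference, C11's <stdatomic.h> provides portable atomic loads and stores; a minimal sketch of what that could look like (illustrative names, not the PR's code):

#include <stdatomic.h>

// Sketch only: an atomic flag with explicit memory ordering, instead of
// relying on a volatile int happening to be atomic on x86.
static atomic_int flag;   // static storage, zero-initialized

static void wait_for_flag(void) {
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0) {
        // spin; a pause hint or yield could go here
    }
}

static void set_flag(void) {
    atomic_store_explicit(&flag, 1, memory_order_release);
}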
I thought that PAUSE is a legacy instruction and has limited effect on modern CPUs.
I think this is a good idea. It's POSIX, Windows (and macOS, if not POSIX?). Unless there are other platforms to be supported.
As I proposed in the original PR, I think that the compute graph function should iterate over tasks, not nodes, and that could bring a real performance improvement.
This is why I think using a coordinated wait of some kind like a mutex is a better option - we leave all this to the OS.
True.
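For illustration, a minimal sketch of the coordinated-wait idea with pthreads (names are illustrative, not the PR's code): workers block on a condition variable and the main thread wakes them when work is published, so the waiting is left to the OS.

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool work_available = false;

// Worker side: sleep until the main thread publishes work.
static void worker_wait_for_work(void) {
    pthread_mutex_lock(&lock);
    while (!work_available) {            // loop guards against spurious wakeups
        pthread_cond_wait(&cond, &lock);
    }
    pthread_mutex_unlock(&lock);
}

// Main-thread side: publish work and wake all waiting workers.
static void publish_work(void) {
    pthread_mutex_lock(&lock);
    work_available = true;
    pthread_mutex_unlock(&lock);
    pthread_cond_broadcast(&cond);
}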
The most destructive to performance is the C++ std:: library though, since it litters the compiled code with exception handlers, constructors/destructors, memory (de)allocations, security checks, enter/leave critical sections, mutexes, etc. Doing away with those and replacing them with C primitives would increase performance by a huge margin and allow for much better branch prediction and instruction caching, but it would be a rather huge undertaking to convert the whole codebase. While the current computing code itself is based on primitives and pointers (fast), when you use std:: constructs to load/store stuff in between, it destroys the performance and branch prediction/caching that could occur. Just compare the compiled output of a simple std::string vs a primitive char array, then multiply that a thousand times over the scope of the whole codebase. That is compiled with
Well, it depends on how the standard library is used. C++ isn't bad for performance per se - it's simply harder to use properly.

Regarding exception handlers, this is what

Regarding constructors, copies are rather aggressively elided when compiling with newer compilers, some even with optimizations disabled. Constructors and destructors are code that would have to be written anyway. Memory allocations and deallocations happen only if small objects are constantly created and destroyed.

Regarding security, none of the C++ containers does any kind of bounds checking unless that's explicitly requested.

I strongly disagree that converting to C will improve performance. Instead, that C++ code should be optimized. For example, regarding strings, the current practice is to pass string views instead, as these do not incur any unnecessary copies. C++ is at least as performant as C, provided that it is used properly.
Hm, I never heard that. In fact, this Intel C++ Compiler Guide from 2021 instructs:

So maybe you're thinking about the

A quick Google search led me to this discussion thread where it's claimed that the

I could be wrong, but I don't think it's deprecated or anything. Anyway, it's probably better to test rather than try to come to a logical conclusion about it.
Ok, thanks for clearing this up. I must have misheard something, then. I agree that it all comes down to performance tests.
ggml.c (outdated)
/*.n_ready =*/ 0,
/*.has_work =*/ false,
/*.stop =*/ false,
/*.flag =*/ 0,
Use designated initializers instead of comments, they are allowed in C11 (but not in ISO C++11).
Otherwise, this is certainly simpler and works fine for me, but I'm no expert on multithreading.
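For illustration, with a hypothetical struct shaped like the fields above, the designated-initializer form would look like this (a sketch, not the actual ggml.c struct):

#include <stdbool.h>

// Hypothetical struct, only to show the syntax; the real one in ggml.c differs.
struct compute_state_shared {
    int  n_ready;
    bool has_work;
    bool stop;
    int  flag;
};

// C11 designated initializers name each field explicitly, no comments needed:
struct compute_state_shared state = {
    .n_ready  = 0,
    .has_work = false,
    .stop     = false,
    .flag     = 0,
};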
Ok, committed, thanks.
If we want to support multi-sessions, the busy-spin mode has to be changed. I found that every worker task runs from a few microseconds to around a hundred, and over half of the nodes can't be parallelized across workers. The computation time is so short that any heavy thread-scheduling policy would suffer a noticeable perf drop, or gain little energy saving, due to frequent context switching.

I've tried spin + cond-wait: the detailed control introduced tens of lines of code with complicated logic, and it sometimes deadlocks. But spin + pause + usleep is quite simple to implement and looks promising (a rough sketch follows below).

If we agree that it's hard to balance perf and energy savings implicitly, we could let users make their own decisions. Suppose we define several performance modes; we can map these modes (and corresponding levels) to various implementations + configs. For example, we could define two modes with their levels (bigger number means better):

Perf:
Energy saving:

A pause ref: pause techniques on many architectures
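For illustration, a rough sketch of the spin + pause + usleep idea (the threshold and sleep time below are made-up numbers, not values from this PR):

#include <unistd.h>      // usleep
#include <emmintrin.h>   // _mm_pause (x86; other archs would need their own hint)

// Spin with a pause hint for a bounded number of iterations, then back off
// to a short sleep so a long wait stops burning a full core.
static void wait_for_work(volatile int *has_work) {
    int spins = 0;
    while (!*has_work) {
        if (spins++ < 10000) {   // arbitrary threshold, tune by measurement
            _mm_pause();         // CPU hint: this is a spin-wait loop
        } else {
            usleep(100);         // ~0.1 ms; trades wakeup latency for energy
        }
    }
}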
Interesting point :) But is that possible? I'll take a deep look into this approach.
Nope, it's a PURE spinlock, in both.
From the sched_yield man page: use of sched_yield() with nondeterministic scheduling policies such as SCHED_OTHER "is unspecified and very likely means your application design is broken".
@mqy Here is what I proposed regarding iterating over tasks. As far as I understand the code, the current work scheduling is less than ideal. The main thread launches some worker threads, then proceeds roughly like this:

Main thread:

compute_graph G; // topologically-sorted
multithreaded_queue<task> Q;
for (node& n : G) {
// The number of incoming edges
// ie. the number of dependencies
if (n.dependency_count.nonatomic_load() > 0)
break;
Q.batch_enqueue(n.tasks);
}
Q.start_working();
execute_work()
// cleanup
return [the result]

Worker threads execute execute_work():

Q.wait_for_start_working_blocking();
while (!Q.done()) {
task to_do = Q.pop_blocking();
execute(to_do);
// if this was the last task for this node, the node has completed
if(to_do.node.task_count.atomic_fetch_sub(1) == 1) {
// so, all the node's dependents have one dependency less
for (node& n : to_do.node.dependents) {
// if the current node was the last dependency of this node
// we can enqueue this node's tasks for execution
if (n.dependency_count.atomic_fetch_sub(1) == 1) {
Q.batch_enqueue(n.tasks);
}
}
}
}

This design should eliminate all the blocking and waiting and maximize the amount of time spent by the threads on executing useful work.
Thanks! From the map-reduce view, the main thread dispatches tasks to workers, then waits for them all to finish.

If we could map nodes instead of tensors, there would be huge room for the wait-notify way. It may also be possible to parallelize both the graph and some heavy nodes; on my machine, ggml_vec_dot_q4_0 takes over 30% of the time.

BTW, I'm looking deeper into this. The first step is moving out the node bench code.

[EDIT] my perf test branch: master...mqy:llama.cpp:ggml-thread-stat
@mqy
idk why the manual would say that; using Sleep(0) is perfectly fine on Windows, and I think sched_yield is the same thing on Linux and should be OK too. To be honest, the entry manages to explain nothing and sound patronizing at the same time, so I'm not sure how much weight I would put on "is unspecified" and "likely means your application design is broken".
Perhaps a bunch of continuous pauses? Here is the definition of _mm_pause (the PowerPC emulation from GCC's compatibility header):

/* The execution of the next instruction is delayed by an implementation
specific amount of time. The instruction does not modify the
architectural state. This is after the pop_options pragma because
it does not require SSE support in the processor--the encoding is a
nop on processors that do not support it. */
extern __inline void
__attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_pause(void) {
/* There is no exact match with this construct, but the following is
close to the desired effect. */
#if _ARCH_PWR8
/* On power8 and later processors we can depend on Program Priority
(PRI) and associated "very low" PPI setting. Since we don't know
what PPI this thread is running at we: 1) save the current PRI
from the PPR SPR into a local GRP, 2) set the PRI to "very low*
via the special or 31,31,31 encoding. 3) issue an "isync" to
insure the PRI change takes effect before we execute any more
instructions.
Now we can execute a lwsync (release barrier) while we execute
this thread at "very low" PRI. Finally we restore the original
PRI and continue execution. */
unsigned long __PPR;
__asm__ volatile(" mfppr %0;"
" or 31,31,31;"
" isync;"
" lwsync;"
" isync;"
" mtppr %0;"
: "=r"(__PPR)
:
: "memory");
#else
/* For older processor where we may not even have Program Priority
controls we can only depend on Heavy Weight Sync. */
__atomic_thread_fence(__ATOMIC_SEQ_CST);
#endif
}

It's quite straightforward; it should expand to a single line of assembly. From my testing, it can slow down execution time by hundreds of times on an Intel Core i7 (8th gen). But

[EDIT] I'm looking forward to somebody running spin_hint.c on aarch64 to test the wfe instruction.
@mqy idk why you changed to using inline assembly. To be able to use inline assembly truly portably you'd need to resort to dirty hacks like this, which wouldn't work in this case because it's not truly inline (there's an extra function call), so you can't just put "F3 90" into a char array. Anyway, like I said, _mm_pause for x86 is supported by clang/gcc/msvc. I really cannot say anything about the macOS PPC part as I simply don't know enough about archs other than x86 to be qualified to even make a guesstimate.
@anzz1 thanks for the kind comments.
I agree with you now that "the pause is not a sleep replacement". spin_hint is modified as follows (temporarily keeping the dirty wfe as is):

#if defined(__x86_64__)
#include <emmintrin.h>   // _mm_pause
#endif

static inline void spin_hint(void) {
#if defined(__x86_64__)
    _mm_pause();
#elif defined(__aarch64__)
    __asm__ __volatile__ ("wfe");
#endif
}
Closing this PR because it is incomplete.
This PR simplified the spin logic in graph compute. Benefits:
No obvious logic change and no code deletion; the changes are protected by a compile-time feature flag `DISABLE_GGML_COMPUTE_SPIN_V2`. The feature can be disabled by adding `-DDISABLE_GGML_COMPUTE_SPIN_V2` to `CFLAGS` in the Makefile.
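For reference, the guard presumably looks something like this (a sketch of the pattern only, not the actual diff):

// Hypothetical shape of the guard: defining DISABLE_GGML_COMPUTE_SPIN_V2
// falls back to the original code path.
#ifdef DISABLE_GGML_COMPUTE_SPIN_V2
    // original logic: three separate spin variables (n_ready / has_work / stop)
#else
    // V2 logic from this PR: the three spin variables merged into one flag
#endif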