ggml: refactor compute thread: merge three spin variables into one #816


Closed
wants to merge 3 commits into from

Conversation

mqy
Contributor

@mqy mqy commented Apr 6, 2023

This PR simplifies the spin logic in graph compute. Benefits:

  1. It replaces three sync primitives with just one, which should make the related code easier to read and removes the potential for deadlock.
  2. Fewer spin loops, thus a slight speedup and some energy savings.

There is no obvious logic change and no code deletion; the changes are guarded by a compile-time feature flag. The new code is enabled by default and can be disabled by adding `-DDISABLE_GGML_COMPUTE_SPIN_V2` to `CFLAGS` in the Makefile.
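A minimal sketch of the idea (illustrative only, not the exact code in the diff; the field names follow the existing n_ready/has_work/stop variables):

#include <stdatomic.h>

// One atomic counter replaces n_ready/has_work/stop: the main thread
// publishes work by storing the worker count, workers signal completion
// by decrementing, and 0 means "no work pending".
static atomic_int n_active;

static void main_thread_dispatch(int n_workers) {
    atomic_store(&n_active, n_workers); // publish work for the current node
    // ... compute the main thread's own slice of the node ...
    while (atomic_load(&n_active) > 0) {
        // spin until every worker is done (a hint/yield could go here)
    }
}

static void worker_thread_step(void) {
    // ... compute this worker's slice of the current node ...
    atomic_fetch_sub(&n_active, 1); // signal completion
}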

@mqy mqy marked this pull request as draft April 6, 2023 18:07
@mqy mqy changed the title ggml: refactor compute thread: use only one sync var; add spin hint [WIP]ggml: refactor compute thread: use only one sync var; add spin hint Apr 6, 2023
@mqy
Contributor Author

mqy commented Apr 6, 2023

In this PR I had implemented a spin hint, and spin hint + sched_yield, to balance speed and
energy. I had also implemented a global per-compute-graph worker group.

Observations:

  1. In my local tests, both spin hint only and spin hint (100 times) + sched_yield
    result in a slight slowdown in per-token time, with imperceptible energy savings.
    The underlying fact is: almost all threads are involved in computing node by node.
    So sched_yield may be pure overhead, but the spin hint may deserve a try?
  2. Every pthread_create takes only about 10 ns (Intel Core i7, macOS). I think, even
    if we use 16 threads, the total 32 create+join syscalls would not take more than
    0.5 ms. Compared to the ~100 ms per-token time, the benefit is negligible, so there
    may be no need for a global thread pool? (A quick measurement sketch follows below.)

I'm not sure these observations are right, because performance varies across platforms.
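A micro-benchmark sketch for the create+join cost in observation 2 (illustrative; compile with -pthread, numbers will vary by platform):

#include <pthread.h>
#include <stdio.h>
#include <time.h>

// no-op worker: we only measure thread creation + join overhead
static void * noop(void * arg) { (void) arg; return NULL; }

int main(void) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < 16; i++) {
        pthread_t t;
        pthread_create(&t, NULL, noop, NULL);
        pthread_join(t, NULL);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    long ns = (t1.tv_sec - t0.tv_sec) * 1000000000L + (t1.tv_nsec - t0.tv_nsec);
    printf("16 create+join pairs: %ld ns total\n", ns);
    return 0;
}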

@mqy mqy changed the title [WIP]ggml: refactor compute thread: use only one sync var; add spin hint ggml: refactor compute thread: merge three spin variables into one Apr 6, 2023
@mqy mqy marked this pull request as ready for review April 6, 2023 20:51
@mqy
Contributor Author

mqy commented Apr 6, 2023

I've tested this on my Intel macOS machine over 20 times with 7B/13B model data, without corruption.
So this PR is now ready for review and testing.

@janekb04

janekb04 commented Apr 7, 2023

I'll see if this affects #813

@sw
Contributor

sw commented Apr 9, 2023

@mqy : can you please remove the disable flag? I think this is more confusing for reviewers than simply switching between commits with git. Thanks!

@anzz1
Contributor

anzz1 commented Apr 9, 2023

disclaimer: this information is all through the lens of x86 arch since i know virtually nothing about the inner workings of other archs like arm.

The current implementation in ggml.c isn't actually a spinlock but a thread-yield so it should be pretty okay energy-savings-wise as-is. edit: as pointed out by @mqy below, this is wrong, as only this ggml_critical_section_start function is a thread-yield, while spinlocks are used elsewhere

https://github.com/ggerganov/llama.cpp/blob/aaf3b23debc1fe1a06733c8c6468fb84233cc44f/ggml.c#L2767-L2785

these three are all different things:
https://godbolt.org/z/qz9or71bh

#ifdef _WIN32
#include <windows.h>
#include <intrin.h>     // _mm_pause on MSVC
#else
#include <sched.h>
#include <immintrin.h>  // _mm_pause on GCC/Clang (x86)
#endif

volatile int lock = 1;
void spinlock() {
    while (lock) {
        // no-op, spinlock
    }
}

void spinlock_with_hint() {
    while (lock) {
        _mm_pause(); // generate a F3 90 (PAUSE) instruction to hint processor of spinlock
    }
}

void threadyield() {
    while (lock) {
#ifdef _WIN32
        Sleep(0); // yield thread's timeslice - not a spinlock
#else
        sched_yield(); // same thing but unix
#endif
    }
}

it should be reviewed whether operations like atomic_load are actually required, since primitive read operations in C are always atomic (and thus thread-safe) by nature (provided they are properly aligned, stored atomically in memory, and not optimized out by the compiler, which can be guaranteed with the volatile keyword).

it should be noted though that yielding allows for thread context switching which can potentially have a huge impact on cpu caching / branch prediction especially with more aggressive inlining and longer functions. so using the yield method can currently decrease performance, however since the ggml funcs aren't aggressively inlined currently the impact on cache/prediction is probably limited.

spinlocking does consume more energy as it locks the thread to 100%, but using the PAUSE instruction can mitigate some energy use (on x86, i'm not familiar with other archs). spinlock is faster but i don't think that will have any measurable impact in this case; however, what could have an impact is that locking the thread keeps it locked to the same processor and context switching is not allowed, so there is a potential cache/prediction impact here. however i'm unsure how processors actually do caching/branch prediction at the bare metal level while waiting for a spinlock to be released.

there is also the possibility to use signaling methods like mutexes/semaphores. In these cases the platform-specific options are usually the best, but they also decrease portability and have to be implemented specifically for each platform, so they are not the best option. Always Use a Lightweight Mutex

however all of those waiting methods are very fast and ultimate timing precision isn't an issue here, and we aren't spawning thousands of threads, so I think the end result doesn't actually rely so much on which lock method is the fastest or more precise but comes down to how they affect the calculations that follow them in terms of the invisible black magic portion of cpu branch prediction/caching at the bare metal level. i don't think this is something easily solved logically (at least by a mere mortal; some semiconductor fab expert could) but rather by testing different methods and measuring results.

even inside just x86 the different methods could have wildly varying results depending on the cpu (intel/amd, high/low power, new/old, etc.) and also stuff like whether you're losing performance by getting downclocked from hitting the thermal/power limit or not. for example, even if a spinlock increases performance in theory but the power use leads to a downclock, the real-world performance would actually be lower. for example, any x86 laptop will not have adequate cooling to reach max performance, so lowering power use will in pretty much all cases increase performance too. for adequately cooled desktops this can be hit-or-miss, since most processors can run at 100% without getting throttled, and in that case the more-energy more-performance option would be the best. then again the latest intel 13th gen processors run so hot that there doesn't exist a cooling solution on the planet which could give them max perf; they always end up getting thermal throttled no matter what.

and if this wasn't complicated enough yet, there is the matter of intel's 12th/13th gen P/E core architecture, along with thread scheduling and P/E work triage happening in tandem between the hardware "intel thread director" and the OS thread scheduler. it appears that the selling point of having a "smart thread scheduler" isn't so smart after all in real-world applications and there is potential for huge performance gains in optimizing this: #572 #842

it could be useful to have multiple options for perf testing by something like:

// Windows-centric sketch (__forceinline and Sleep are MSVC/Win32;
// use sched_yield()/usleep() on unix)
__forceinline static void lockthread() {
    while (lock) {
#if GGML_WAIT_MODE == GGML_WAIT_MODE_YIELD
        Sleep(0);    // yield thread's timeslice - not a spinlock
#elif GGML_WAIT_MODE == GGML_WAIT_MODE_SLEEP
        Sleep(1);    // sleep for ~1ms
#elif GGML_WAIT_MODE == GGML_WAIT_MODE_SPIN_HINT
        _mm_pause(); // PAUSE hint
#elif GGML_WAIT_MODE == GGML_WAIT_MODE_SPIN
        // no-op, pure spin
#endif
    }
}

because even something slower like Sleep(1) could actually result in an overall improvement if it prevents thermal/power throttling.

sorry for the wall-of-text but the topic of threading is pretty complex and cannot really be properly condensed into just a few sentences.

@janekb04

janekb04 commented Apr 9, 2023

@anzz1

it should be reviewed whether operations like atomic_load are actually required, since primitive read operations in C are always atomic (and thus thread-safe) by nature (provided they are properly aligned, stored atomically in memory, and not optimized out by the compiler, which can be guaranteed with the volatile keyword).

According to C11 they are not. I know that on x86, aligned loads and stores are atomic, but that isn't portable.
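A minimal illustration of the portable alternative (illustrative names, not ggml code):

#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool lock_flag;

void wait_portably(void) {
    // volatile only stops the compiler from caching the load; it guarantees
    // neither atomicity nor ordering. C11 atomics turn the x86 behavior
    // ("aligned loads are atomic") into a portable guarantee.
    while (atomic_load_explicit(&lock_flag, memory_order_acquire)) {
        // spin
    }
}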

spinlocking does consume more energy as it locks the thread to 100%, but using the PAUSE instruction can mitigate some energy use (on x86, i'm not familiar with other archs). spinlock is faster but i don't think that will have any measurable impact in this case; however, what could have an impact is that locking the thread keeps it locked to the same processor and context switching is not allowed, so there is a potential cache/prediction impact here. however i'm unsure how processors actually do caching/branch prediction at the bare metal level while waiting for a spinlock to be released.

I thought that PAUSE is a legacy instruction and has limited effect on modern CPUs.

there is also the possibility to use signaling methods like mutexes/semaphores. In these cases the platform-specific options are usually the best, but they also decrease portability and have to be implemented specifically for each platform, so they are not the best option. Always Use a Lightweight Mutex

I think this is a good idea. It's POSIX and Windows (and macOS, if not POSIX?). Unless there are other platforms to be supported.

however all of those waiting methods are very fast and ultimate timing precision isn't an issue here, so I think the end result doesn't actually rely so much on which lock method is the fastest or more precise but comes down to how they affect the calculations that follow them in terms of the invisible black magic portion of cpu branch prediction/caching at the bare metal level. i don't think this is something easily solved logically but rather by testing different methods and measuring results.

As I proposed in the original PR, I think that the compute graph function should iterate over tasks - not nodes - and that could bring a real performance improvement.

even inside just x86 the different methods could have wildly varying results depending on the cpu (intel/amd, high/low power, new/old, etc.) and also stuff like whether you're losing performance by getting downclocked from hitting the thermal/power limit or not. for example, even if a spinlock increases performance in theory but the power use leads to a downclock, the real-world performance would actually be lower. for example, any x86 laptop will not have adequate cooling to reach max performance, so lowering power use will in pretty much all cases increase performance too. for adequately cooled desktops this can be hit-or-miss, since most processors can run at 100% without getting throttled, and in that case the more-energy more-performance option would be the best. then again the latest intel 13th gen processors run so hot that there doesn't exist a cooling solution on the planet which could give them max perf; they always end up getting thermal throttled no matter what.

This is why I think using a coordinated wait of some kind like a mutex is a better option - we leave all this to the OS.
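As a sketch of what leaving it to the OS looks like with POSIX primitives (illustrative, not code from this PR):

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv  = PTHREAD_COND_INITIALIZER;
static bool work_ready = false;

void wait_for_work(void) {
    pthread_mutex_lock(&mtx);
    while (!work_ready) {
        pthread_cond_wait(&cv, &mtx); // sleeps in the kernel instead of spinning
    }
    pthread_mutex_unlock(&mtx);
}

void publish_work(void) {
    pthread_mutex_lock(&mtx);
    work_ready = true;
    pthread_mutex_unlock(&mtx);
    pthread_cond_broadcast(&cv); // wake all waiting workers
}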

and if this wasn't complicated enough yet, there is the matter of intel's 12th/13th gen P/E core architecture, along with thread scheduling and P/E work triage happening in tandem between the hardware "intel thread director" and the OS thread scheduler. it appears that the selling point of having a "smart thread scheduler" isn't so smart after all in real-world applications and there is potential for huge performance gains in optimizing this: #572 #842

True

@anzz1
Contributor

anzz1 commented Apr 9, 2023

The most destructive to performance is the C++ std:: library though, since it litters the compiled code with exception handlers, constructors/destructors, memory (de)allocations, security checks, enter/leave critical sections, mutexes, etc.

Doing away with those and replacing them with C primitives would increase performance by a huge margin and allow for much better branch prediction and instruction caching, but it would be a rather huge undertaking to convert the whole codebase. While the current computing code itself is based on primitives and pointers (fast), when you use std:: constructs to load/store stuff in between, it destroys the performance and branch prediction/caching that could occur.

Just compare the compiled output of a simple std::string vs a primitive char array, but a thousand times over in the scope of the whole codebase. That is compiled with /MT on msvc, so it doesn't even include the cost of having to call a library multiple times using the normal /MD compilation. (yes, that example isn't strictly an apples-to-apples comparison but you get the point)

@janekb04

janekb04 commented Apr 9, 2023

Well, it depends on how the standard library is used. C++ isn't bad for performance per se - it's simply harder to use properly.

Regarding exception handlers, this is what noexcept is for. Exceptions are nearly zero-cost on the non-exceptional path. They do emit code which impacts cache locality, but it is located in cold code sections. There is also the possibility of disabling them with a flag.

Regarding constructors, copies are rather aggressively elided when compiling with newer compilers, some even with optimizations disabled.

Constructors and destructors are code that would have to be written anyway. Memory allocations and deallocations happen only if small objects are being constantly created and destroyed.

Regarding security, none of the C++ containers do any kind of bounds checking unless that's explicitly requested.

I strongly disagree that converting to C will improve performance. Instead, that C++ code should be optimized.

For example, regarding strings, the current practice is to pass string views instead as these do not incur any unnecessary copies.

C++ is at least as performant as C, provided that it is used properly.

@anzz1
Contributor

anzz1 commented Apr 9, 2023

I thought that PAUSE is a legacy instruction and has limited effect on modern CPUs.

Hm, I never heard that. In fact this Intel C++ Compiler Guide from 2021 instructs

All legacy processors execute PAUSE instruction as a NOP, but in processors that use the PAUSE instruction as a hint there can be significant performance benefit.

So maybe you're thinking of the fact that the PAUSE instruction behaves differently on older processors.

A quick google search led me to this discussion thread where it's claimed that the PAUSE instruction is used heavily in the linux kernel.

I could be wrong but I don't think it's deprecated or anything. Anyway it's probably better to test rather than try to come to a logical conclusion about it.

@janekb04

janekb04 commented Apr 9, 2023

Ok, thanks for clearing this up. I must have misheard something, then. I agree that it all comes down to performance tests.

ggml.c Outdated
/*.n_ready =*/ 0,
/*.has_work =*/ false,
/*.stop =*/ false,
/*.flag =*/ 0,
Contributor

@sw sw Apr 9, 2023

Use designated initializers instead of comments; they are allowed in C11 (but not in ISO C++11).

Otherwise, this is certainly simpler and works fine for me, but I'm no expert on multithreading.
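For example (a sketch; the struct name is assumed from ggml.c's compute state, and .flag is the field added by this PR):

struct ggml_compute_state_shared shared = {
    .n_ready  = 0,
    .has_work = false,
    .stop     = false,
    .flag     = 0,
};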

@mqy
Contributor Author

mqy commented Apr 9, 2023

@mqy : can you please remove the disable flag? I think this is more confusing for reviewers than simply switching between commits with git. Thanks!

Ok, committed, thanks.

@mqy
Contributor Author

mqy commented Apr 10, 2023

If we want to support multiple sessions, the busy-spin mode has to be changed to cond-wait (or something similar) so the code goes into the kernel, giving other sessions more chances to get CPU time.

I found that every worker task runs from a few µs to about a hundred µs, and over half of the nodes can't be parallelized across workers. The computation time is too short, so any heavyweight thread-scheduling policy would suffer a noticeable perf drop, or gain little energy saving, due to frequent context switching.

I've tried spin + cond-wait; the detailed control introduced tens of lines of code with complicated logic, and it sometimes deadlocks. But spin + pause + usleep is quite simple to implement and the code looks promising. If we agree that it's hard to balance perf and energy savings implicitly, we could let users make their own decisions.

Suppose we define several performance modes; we can map these modes (and their corresponding levels) to various implementations + configs.

For example, we could define two modes with their levels (a bigger number means better).

  • perf, with levels 0-9
  • energy saving, with levels 0-9

Perf:

  • level (9) run spin always
  • level (5) spin with 5 pauses
  • level (0) spin with 10 pauses

Energy saving:

  • level (9) usleep(100)
  • level (5) usleep(50)
  • level (0) usleep(10)

A pause ref: pause techniques on many architectures
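A hypothetical sketch of the mapping (function names and the exact scaling are illustrative only):

#include <unistd.h>     // usleep
#if defined(__x86_64__)
#include <emmintrin.h>  // _mm_pause
#endif

// perf mode: a higher level means fewer pauses per spin iteration;
// level 9 spins with no pause at all.
static void wait_perf(int level) {
    int n_pause = (level >= 9) ? 0 : 10 - level; // level 5 -> 5, level 0 -> 10
    for (int i = 0; i < n_pause; i++) {
#if defined(__x86_64__)
        _mm_pause();
#endif
    }
}

// energy-saving mode: a higher level means a longer sleep per iteration
// (roughly level 9 -> usleep(100), level 0 -> usleep(10)).
static void wait_energy_saving(int level) {
    usleep(10 + level * 10);
}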

@mqy
Contributor Author

mqy commented Apr 10, 2023

As I proposed in the original PR, I think that the compute graph function should iterate over tasks - not nodes - and that could bring a real performance improvement.

Interesting point :) But is that possible? I'll take a deeper look into this approach.

@mqy
Contributor Author

mqy commented Apr 10, 2023

The current implementation in ggml.c isn't actually a spinlock but a thread-yield so it should be pretty okay energy-savings-wise as-is.

Nope, it's a PURE spinlock, in both ggml_graph_compute and ggml_graph_compute_thread.

it should be noted though that yielding allows for thread context switching which can potentially have a huge impact on cpu caching / branch prediction especially with more aggressive inlining and longer functions. so using the yield method can currently decrease performance, however since the ggml funcs aren't aggressively inlined currently the impact on cache/prediction is probably limited.

From the sched_yield man page:
sched_yield() is intended for use with real-time scheduling
policies (i.e., SCHED_FIFO or SCHED_RR). Use of sched_yield()
with nondeterministic scheduling policies such as SCHED_OTHER is
unspecified and very likely means your application design is
broken.

@janekb04

@mqy Here is what I proposed regarding iterating over tasks:

As far as I understand the code, the current work scheduling is less than ideal. The main thread launches some N threads. Then, it creates the "compute graph". I assume that it is a DAG with each node representing some computation, and that it is topologically sorted before the main for (int i = 0; i < cgraph->n_nodes; i++) loop. The loop goes through all the nodes sequentially. If the "compute graph" is indeed a sorted DAG, then here comes the optimization: instead of going "for node in graph: for task in node:", the tasks from nondependent nodes could be run independently. This would mean that, fundamentally, the code would work like:

Main thread:

compute_graph G; // topologically-sorted
multithreaded_queue<task> Q;
for (node& n : G) {
    // The number of incoming edges
    // ie. the number of dependencies
    if (n.dependency_count.nonatomic_load() > 0)
        continue; // skip non-ready nodes (a break would miss later independent nodes)
    Q.batch_enqueue(n.tasks);
}
Q.start_working();
execute_work()
// cleanup
return [the result]

Worker threads execute execute_work function:

Q.wait_for_start_working_blocking();
while (!Q.done()) {
    task to_do = Q.pop_blocking();
    execute(to_do);
    
    // if this was the last task for this node, the node has completed
    if(to_do.node.task_count.atomic_fetch_sub(1) == 1) {
        // so, all the node's dependents have one dependency less
        for (node& n : to_do.node.dependents) {
             // if the current node was the last dependency of this node
             // we can enqueue this node's tasks for execution
             if (n.dependency_count.atomic_fetch_sub(1) == 1) {
                 Q.batch_enqueue(n.tasks);
             }
        }
    }
}

This design should eliminate all the blocking and waiting and maximize the amount of time spent by the threads on executing useful work.

@mqy
Contributor Author

mqy commented Apr 10, 2023

the tasks from nondependent nodes could be run independently.

Thanks!

From the map-reduce view, the main thread dispatches tasks to workers, then waits for them all to finish. So the thread pool + parallel compute graph design falls into the map-reduce pattern as well, no exception. I'm sure that more in-depth study of the model would lead to better model-aware designs. Are you trying this solution now?

If we could map nodes instead of tensors, there would be huge room for the wait-notify approach. It may also be possible to parallelize both the graph and some heavy nodes. On my machine, ggml_vec_dot_q4_0 takes over 30% of the time.

BTW, I'm looking deeply into ggml_graph_print now.

The first step is moving the node bench code from ggml_graph_compute to ggml_compute_forward; this will help get accurate per-tensor (object) bench data. By comparing single-thread vs multi-thread times, for nodes that gain little from multi-threading, it's OK to run with fewer threads. If this collect-stats-and-adapt approach could be done at runtime, we hope to be able to balance perf and energy in a clever way.

[EDIT] my perf test branch master...mqy:llama.cpp:ggml-thread-stat
From the bottom of perf.txt we can see that the top-N ops are MUL_MAT (3270 ms), ROPE (25 ms), CPY (7 ms), ...; many other ops take 0 ms and thus may not deserve parallel computation at all.
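A hypothetical sketch of the collect-stats-and-adapt idea (the function name and thresholds are illustrative, not part of ggml):

#include <stdint.h>

// Pick fewer threads for ops whose measured per-node time is too small to
// amortize synchronization overhead; heavy ops (e.g. MUL_MAT) keep the
// full thread count.
static int adapt_n_threads(int64_t node_time_us, int n_threads_max) {
    if (node_time_us < 100)  return 1;                 // e.g. the "0 ms" ops
    if (node_time_us < 1000) return n_threads_max / 2; // medium ops
    return n_threads_max;                              // heavy ops
}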

@anzz1
Contributor

anzz1 commented Apr 10, 2023

@mqy
I think the perf/energysaving idea is fantastic.
I don't think you're supposed to do multiple pause hint instructions though, only one per loop. I don't have any facts to back this up right now, I just remember it being so; might be wrong.

From the sched_yield man page: sched_yield() is intended for use with real-time scheduling policies (i.e., SCHED_FIFO or SCHED_RR). Use of sched_yield() with nondeterministic scheduling policies such as SCHED_OTHER is unspecified and very likely means your application design is broken.

idk why the manual would say that; using Sleep(0) is perfectly fine in Windows, and I think sched_yield is the same thing on linux and should be ok too. to be honest the entry manages to explain nothing and sound patronizing at the same time, so i'm not sure how much weight I would put into "is unspecified" and "likely means your application design is broken".

@mqy
Contributor Author

mqy commented Apr 10, 2023

I don't think you're supposed to do multiple pause hint instructions though, only one per loop. I don't have any facts to back this up right now, I just remember it being so; might be wrong.

Perhaps a bunch of consecutive pause hint instructions COULD put the core in low-power mode for a while. Have a look at this file https://github.com/mqy/llama.cpp/blob/ggml-thread-stat/tests/spin_hint.c.

Here is the definition of _mm_pause() on macOS:

/* The execution of the next instruction is delayed by an implementation
   specific amount of time.  The instruction does not modify the
   architectural state.  This is after the pop_options pragma because
   it does not require SSE support in the processor--the encoding is a
   nop on processors that do not support it.  */
extern __inline void
    __attribute__((__gnu_inline__, __always_inline__, __artificial__))
    _mm_pause(void) {
  /* There is no exact match with this construct, but the following is
     close to the desired effect.  */
#if _ARCH_PWR8
  /* On power8 and later processors we can depend on Program Priority
     (PRI) and associated "very low" PPI setting.  Since we don't know
     what PPI this thread is running at we: 1) save the current PRI
     from the PPR SPR into a local GRP, 2) set the PRI to "very low*
     via the special or 31,31,31 encoding. 3) issue an "isync" to
     insure the PRI change takes effect before we execute any more
     instructions.
     Now we can execute a lwsync (release barrier) while we execute
     this thread at "very low" PRI.  Finally we restore the original
     PRI and continue execution.  */
  unsigned long __PPR;

  __asm__ volatile("	mfppr	%0;"
                   "   or 31,31,31;"
                   "   isync;"
                   "   lwsync;"
                   "   isync;"
                   "   mtppr	%0;"
                   : "=r"(__PPR)
                   :
                   : "memory");
#else
  /* For older processor where we may not even have Program Priority
     controls we can only depend on Heavy Weight Sync.  */
  __atomic_thread_fence(__ATOMIC_SEQ_CST);
#endif
}

It's quite straightforward and should expand to one line of assembly code. From my testing, it can slow down execution time by hundreds of times on an Intel Core i7 (8th gen). But __asm__ __volatile__ ("pause") doesn't work as expected.

[EDIT] I'm looking forward to somebody running spin_hint.c on aarch64 to test the wfe instruction.

@anzz1
Contributor

anzz1 commented Apr 10, 2023

@mqy
Sorry but I don't quite follow you, if you want to sleep then use sleep, the pause is not a sleep replacement and afaik should only be used as a spinlock hint (once per loop) and not multiple times. PAUSE

idk why you changed to using __asm__ __volatile__ ("pause") instead of the _mm_pause() intrinsic for x86 from my working example posted earlier, since gcc/clang/msvc all emit the proper F3 90 aka "PAUSE" / "REP NOP" instruction for it. yes, i also think it would be lovely to be able to use inline assembly, but unfortunately the compilers are finicky with it (and x64 inline assembly isn't supported at all on msvc, ugh) and the _mm_pause intrinsic seems well supported and gets the job done.

to be able to use inline assembly truly portably you'd need to resort to dirty hacks like this, which wouldn't work in this case because it's not truly inline as there's an extra function call, so you can't just put "F3 90" into a char array. anyway, as said, _mm_pause for x86 is supported by clang/gcc/msvc

i really cannot say anything about the macos ppc part as i simply don't know enough about archs other than x86 to be qualified to even make a guesstimate.

@mqy
Contributor Author

mqy commented Apr 10, 2023

@anzz1 thanks for the kind comments.

Sorry but I don't quite follow you, if you want to sleep then use sleep, the pause is not a sleep replacement and afaik should only be used as a spinlock hint (once per loop) and not multiple times.

I now agree with you that "the pause is not a sleep replacement".

spin_hint is modified as follows (temporarily keeping the dirty wfe as-is):

#if defined(__x86_64__)
#include <emmintrin.h> // _mm_pause
#endif

static inline void spin_hint(void) {
#if defined(__x86_64__)
    _mm_pause();
#elif defined(__aarch64__)
    __asm__ __volatile__ ("wfe");
#endif
}

@mqy mqy marked this pull request as draft April 13, 2023 11:11
@mqy
Contributor Author

mqy commented Apr 13, 2023

Closing this PR because it is incomplete.
