Race condition in memory.c while running dgemm many times from omp parallel region #2444
This looks a lot like the scenario described in #1536, where NUM_PARALLEL was introduced as a workaround. (It is not clear to me how the search could reach the end of the array when there are still slots/cells available in it, so perhaps it is actually two bugs working together?)
My scenario is different, if I understand that issue correctly. I also noticed that with OMP_PROC_BIND=TRUE or with OMP_PLACES=... the error disappears, while the computation still takes the same amount of time as the version with an infinite loop over the memory array. With OMP_PROC_BIND=TRUE I see that the memory array is almost always nearly empty (only about 10 slots are in use at any time).
I still think this is strange - granted that two threads can fight over who gets the slot, but wouldn't it be guaranteed that one of them succeeds, so that from the viewpoint of the other the situation would be exactly the same as if it had arrived a millisecond later and seen that slot already occupied? Thus it should never be possible to advance across slots that then remain empty and have the seeking thread fall off the end of the memory array instead. Unless the fine-grained locking is totally not working and both threads run away sulking after their clash...
I'm not sure how exactly that situation can happen, and not sure how to debug it. I have tried using the default lock (based on CAS) instead of the arm64 asm lock, but the crash still remains.
Realized a bit late that our Drone.io CI appears to allow access to all 96 cores on their ThunderX systems, so I will try again to reproduce the problem. Chances are this is specific to architectures with weak memory ordering, and maybe it is as simple as a missing barrier instruction in memory.c where memory[position].used gets set to 1 (at least I wonder why we have a WMB for releasing the segment, but none for marking it in use).
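(For reference, a rough illustration of the spot being described - simplified from memory.c, not the actual commit; only the .used field, the per-slot lock, and the WMB macro are taken from the discussion:)

```c
/* simplified sketch of blas_memory_alloc() marking a slot as in use */
blas_lock(&memory[position].lock);
if (!memory[position].used) {
  memory[position].used = 1;   /* mark the buffer as taken */
  WMB;                         /* the suspected missing piece: publish the store
                                  before other threads inspect this slot */
}
blas_unlock(&memory[position].lock);
```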
This simple change appears to have actually solved the issue - could you retest with the current develop branch?
Does not help, unfortunately. Still the same error.
Pity, it looked so good and logical. I'll run some more tests.
I cannot reproduce this with 69f277f on ThunderX - I can let the dgemm benchmark loop for an hour (until it hits the CI time limit) without problems, where it would easily show the symptoms before the fix.
I've just double-checked on TSV110: cloned the develop branch and modified benchmark/gemm.c so that the main loop looks roughly like the sketch below.
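(The code block from the original comment was not preserved in this copy; the following is a reconstruction of the described change - buffer names and loop bounds are assumptions, and the real benchmark code differs in detail.)

```c
/* reconstruction, not the original snippet: 96 OpenMP threads each call
   dgemm repeatedly on small matrices set up earlier in benchmark/gemm.c */
#pragma omp parallel for num_threads(96)
for (int t = 0; t < 96; t++) {
  for (long i = 0; i < loops; i++) {
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                10, 10, 10, 1.0, a, 10, b, 10, 0.0, c, 10);
  }
}
```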
I added -fopenmp to benchmark/Makefile and compiled with USE_OPENMP=1 TARGET=TSV110 BINARY=64. Running it still produces the same error. I've also tried the test you modified in your commit (dgemm_thread_safety) and it works without any error. Then I removed the line you added (the line with the WMB in memory.c) and reran dgemm_thread_safety - it still passes without any error. So maybe I am doing something wrong with benchmark/gemm.c? Also, please make sure you have NUM_BUFFERS=OMP_NUM_THREADS*2 on your machine, otherwise this error is hard to reproduce.
The drone.io runs appear to alternate between 96-core ThunderX and 44-core Falkor. I have checked that OMP_NUM_THREADS is identical to the number of cores and that NUM_BUFFERS is twice that. Still, I cannot reproduce the problem with the earlier fix in place.
That's interesting. I added a WMB inside the slot-search loop in memory.c, as sketched below.
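(Sketch only - the original snippet was not preserved here; the surrounding code is simplified from memory.c and the exact placement in the actual change may differ.)

```c
/* simplified slot-search loop in blas_memory_alloc(), with a barrier
   issued on every iteration before the slot's used flag is inspected */
position = 0;
do {
  WMB;                                     /* barrier added on each iteration */
  if (!memory[position].used) {            /* slot looks free */
    blas_lock(&memory[position].lock);
    if (!memory[position].used) {
      memory[position].used = 1;           /* claim it while holding the per-slot lock */
      blas_unlock(&memory[position].lock);
      break;                               /* found a buffer */
    }
    blas_unlock(&memory[position].lock);   /* lost the race, keep searching */
  }
  position++;
} while (position < NUM_BUFFERS);
/* position == NUM_BUFFERS here triggers the "too many memory regions" error */
```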
And it worked for me! It fixed the issue. Actually, after adding that WMB I can even remove your WMB added in 69f277f and no error happens. I'm not sure whether it is correct to add a WMB on each iteration, but at least it is a working solution. So there is something definitely wrong with that loop in memory.c.
Thanks for the feedback, glad that my hunch was not completely wrong. I must admit that my knowledge of memory barriers is still a bit sketchy (and I am more familiar with platforms that have strong memory ordering). It would be interesting to see whether the barrier can be pulled out of the loop, and/or whether a read barrier (assuming that might be less expensive) could be sufficient here.
Putting just an MB or WMB before the do loop does appear to work fine on both Falkor and ThunderX; I have not come across a TSV110 or EMAG8180 on drone.io, however. I have not tried a "dmb ishld", for which there is no predefined macro in OpenBLAS yet, but from what I could find on the net the cost of either seems to be nearly the same in most benchmarks (?). (I had originally assumed our "MB" to be a read barrier, but it is actually read+write, i.e. "dmb ish".)
Putting an MB or WMB before the loop does not work for me on TSV110. The only working solution for me is to issue a barrier on each loop iteration, before memory[position].used is read. Using a load memory barrier also works. I think adding a memory barrier on each iteration is reasonable here, because without it the loop looks roughly like the sketch below.
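(Again a sketch rather than the lost snippet - a simplified view of the search loop with no barrier at all:)

```c
/* simplified: the same loop without any barrier - nothing constrains how or
   when the core performs the repeated loads of memory[position].used */
do {
  if (!memory[position].used) {
    /* ...lock the slot, re-check, and claim it if it is still free... */
  }
  position++;
} while (position < NUM_BUFFERS);
```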
So I guess the CPU can apply aggressive optimizations in this case, assuming the comparison is usually false (for example, loading multiple elements simultaneously), while adding a barrier there forces the CPU to load each element of memory[] sequentially, one after another. I also have a simple reproducer (just two self-contained files: a main and a simplified version of memory.c). I can share it if that helps.
OK, so let's do that then - the only drawback of adding "dmb ishld" is that it needs to be added to the headers of all other platforms (but on the other hand it could be an empty definition on systems with strong memory ordering). A 4% performance loss seems a small price to pay for correct results on the systems that need it. Thank you very much again for testing and providing the actual working solution.
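(For illustration, the kind of per-architecture definitions this implies - a hypothetical sketch: the "dmb" spellings follow the discussion above, the RMB name and the exact placement are assumptions, and strong-memory-order platforms would get empty definitions.)

```c
/* hypothetical sketch of per-arch barrier macros (names illustrative only) */
#if defined(__aarch64__)
#define MB   __asm__ __volatile__ ("dmb ish"   : : : "memory")  /* read+write barrier */
#define WMB  __asm__ __volatile__ ("dmb ishst" : : : "memory")  /* write (store) barrier */
#define RMB  __asm__ __volatile__ ("dmb ishld" : : : "memory")  /* read (load) barrier */
#else
#define MB                      /* strong memory ordering: no instruction needed */
#define WMB
#define RMB
#endif
```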
I figured out that dgemm fails when I call it many times from an omp parallel region with many threads. The error is:
BLAS: Program is Terminated. Because you tried to allocate too many memory regions.
Here is a simple reproducer (actually you can just add a pragma to benchmark/gemm.c, so that it looks like the code below):
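(The code block from the original issue did not survive this copy. The standalone program below is a reconstruction of the setup described in the following paragraphs - 96 threads, tiny matrices, many iterations; the buffer handling and the loops value are assumptions.)

```c
/* standalone reconstruction of the reproducer (not the original snippet);
   build e.g. with: gcc -O2 -fopenmp repro.c -lopenblas */
#include <stdlib.h>
#include <cblas.h>

int main(void) {
  const int m = 10, n = 10, k = 10;     /* m=n=k=10: any small matrices work */
  const long loops = 100000;            /* 10^5 iterations per thread */

  #pragma omp parallel num_threads(96)  /* 96 == number of cores on the machine */
  {
    /* per-thread buffers, so only OpenBLAS-internal state is shared */
    double *a = malloc(sizeof(double) * m * k);
    double *b = malloc(sizeof(double) * k * n);
    double *c = malloc(sizeof(double) * m * n);
    for (int i = 0; i < m * k; i++) a[i] = 1.0;
    for (int i = 0; i < k * n; i++) b[i] = 1.0;

    for (long i = 0; i < loops; i++)
      cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                  m, n, k, 1.0, a, m, b, k, 0.0, c, m);

    free(a); free(b); free(c);
  }
  return 0;
}
```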
dgemm params: m=n=k=10 (actually any small matrices work); the others are the defaults from benchmark/gemm.c. For n in num_threads(n) I use 96 (which equals my machine's core count). The loops parameter can be small, but with a big value (for example 10^5) the error happens immediately. I build it on a TSV110 machine with the flags BINARY=64 TARGET=TSV110 USE_OPENMP=1 (I also add CFLAGS="-fopenmp" to the benchmark builds). With these parameters I get the error in less than 1 second; if I change num_threads(96) to num_threads(95) the error still fires, but after about 2-3 seconds.
Some debugging showed that the root cause of this problem is the memory[NUM_BUFFERS] array in memory.c. The NUM_BUFFERS define in my case is 96 * 2 = 192.
Many threads try to find an empty cell in the memory[NUM_BUFFERS] array. Sometimes (actually quite often) two threads try to acquire the same cell; one of them gets it while the other continues its search. While that thread keeps searching, the same situation can happen to it again (I added a printf to count failed acquisitions, and it showed that during one pass through the array a single thread can skip 4-5 empty cells this way). So if a thread is unlucky enough, it can easily reach the end of the memory array, and then we get the error (even though gdb shows there are still about 20 genuinely empty cells behind the thread's position at the moment it reaches the end).
So changing NUM_BUFFERS from 192 to a bigger number (for example 200) postpones the error, but it can still happen.
Another possible fix is making the slot-acquisition loop in memory.c infinite; then the error disappears, but I guess performance may be unstable.
Do you have ideas on how this should be fixed?