Race condition in memory.c while running dgemm many times from omp parallel region #2444

Closed
mistomin opened this issue Feb 25, 2020 · 17 comments · Fixed by #2553
@mistomin

I figured out that dgemm fails when I call it many times from an omp parallel region with many threads. The error is "BLAS: Program is Terminated. Because you tried to allocate too many memory regions."

Here is a simple reproducer (you can just add a pragma to benchmark/gemm.c so that it looks like the code below):

#pragma omp parallel num_threads(n)
{
    for (j = 0; j < loops; j++)
        GEMM(...)
}

dgemm parameters: m=n=k=10 (any small matrices work). The others are the defaults from benchmark/gemm.c.
For n in num_threads I use 96 (equal to the number of cores on my machine).
The loops parameter can be small, but with large values (for example 10^5) the error happens immediately.
I build on a TSV110 machine with flags BINARY=64 TARGET=TSV110 USE_OPENMP=1 (I also add CFLAGS="-fopenmp" for the benchmark build).

With these parameters I get the error in less than 1 second. If I change num_threads(96) to num_threads(95), the error still fires, but only after about 2-3 seconds.
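
For reference, a fully self-contained version of what I am running looks roughly like this (my own condensation of the modified benchmark, calling cblas_dgemm directly instead of the benchmark's GEMM wrapper; built with something like gcc -fopenmp repro.c -lopenblas):

#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

int main(void) {
    const int n = 10;           /* m = n = k = 10, as above              */
    const int loops = 100000;   /* a large loop count triggers it fast   */
    double *a = calloc((size_t)n * n, sizeof(double));
    double *b = calloc((size_t)n * n, sizeof(double));
    double *c = calloc((size_t)n * n, sizeof(double));

    /* All threads share the same buffers; the numerical result does not
       matter here, only the internal buffer allocation inside OpenBLAS
       is being stressed. */
    #pragma omp parallel num_threads(96)
    {
        for (int j = 0; j < loops; j++)
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        n, n, n, 1.0, a, n, b, n, 0.0, c, n);
    }

    printf("done\n");
    free(a); free(b); free(c);
    return 0;
}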

Some debugging showed that the root cause of this problem is the memory[NUM_BUFFERS] array in memory.c. In my case NUM_BUFFERS is defined as 96 * 2 = 192.
Many threads try to find an empty cell in the memory[NUM_BUFFERS] array. Sometimes (quite often, actually) two threads try to acquire the same cell: one of them gets it while the other continues its search. And while that thread keeps searching, the same thing can happen to it again (I added a printf to count failed acquisitions, and during a single pass through the array one thread can skip 4-5 empty cells). So if a thread is unlucky enough, it can easily reach the end of the memory array, and then we get the error (even though gdb shows about 20 genuinely empty cells behind the thread's current position at the moment it reaches the end).
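
To make the pattern clear, the search looks roughly like this (a simplified paraphrase of the allocation loop in memory.c for the OpenMP build, not the exact code). Two threads can both see the same cell as free; one of them wins blas_lock() and claims it, while the loser simply advances to the next cell:

position = 0;
do {
    if (!memory[position].used) {              /* unsynchronized peek at the flag   */
        blas_lock(&memory[position].lock);     /* only one contender wins the lock  */
        if (!memory[position].used) {          /* re-check under the lock           */
            memory[position].used = 1;         /* claim the slot                    */
            blas_unlock(&memory[position].lock);
            break;                             /* success                           */
        }
        blas_unlock(&memory[position].lock);   /* lost the race, keep searching     */
    }
    position++;
} while (position < NUM_BUFFERS);

if (position >= NUM_BUFFERS) goto error;       /* "too many memory regions"         */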

So changing NUM_BUFFERS from 192 to some bigger number (for example 200) postpones the error, but it can still happen.
Another possible fix is to make the loop over the memory array in memory.c infinite (then the error disappears, but I guess performance may become unstable).

Do you have any ideas on how this should be fixed?

@martin-frbg
Collaborator

This looks a lot like the scenario described in #1536, where NUM_PARALLEL was introduced as a workaround. (It is not clear to me how the search could reach the end of the array while there are still slots/cells available in it, so perhaps it is actually two bugs working together?)

@mistomin
Author

My scenario is different, if I understand that issue correctly.
I do not run more than n=MAX_CPU_NUMBER threads at the same time. CPU utilization shows that only n threads run simultaneously when I call dgemm from omp parallel num_threads(n), and I see only n threads in gdb. So I would expect only n slots of the memory array to be in use. But somehow gdb shows that nearly all of the slots (2*n of them) are in use during the computation (I also double-checked this with printf(position)).

I also noticed that with OMP_PROC_BIND=TRUE or with OMP_PLACES=... the error disappears, while the computation still takes the same amount of time as the version with the infinite loop over the memory array. With OMP_PROC_BIND=TRUE I see that the memory array is almost always nearly empty (only about 10 slots are in use at any time).

@martin-frbg
Collaborator

I still find this strange: granted that two threads can fight over who gets a slot, but wouldn't it be guaranteed that one of them succeeds, so that from the viewpoint of the other the situation is exactly the same as if it had arrived a millisecond later and seen the slot already occupied? Thus it should never be possible to advance across slots that then remain empty and have the seeking thread fall off the end of the memory array instead. Unless the fine-grained locking is totally not working and both threads run away sulking after their clash...

@mistomin
Author

mistomin commented Mar 5, 2020

I'm not sure exactly how that situation can happen, and I'm not sure how to debug it. I've tried using the default lock (based on CAS) instead of the arm64 asm lock, but the crash remains.
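
For reference, by the "default lock based on CAS" I mean something along these lines (a rough sketch using GCC builtins with illustrative names, not the exact OpenBLAS code):

#include <sched.h>

/* Spin until we atomically flip the word from 0 (free) to 1 (held). */
static void cas_lock(volatile unsigned long *address) {
    while (__sync_val_compare_and_swap(address, 0UL, 1UL) != 0UL)
        sched_yield();          /* back off while someone else holds it */
}

/* Release: store 0 with release semantics. */
static void cas_unlock(volatile unsigned long *address) {
    __sync_lock_release(address);
}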

@martin-frbg
Collaborator

I realized a bit late that our Drone.io CI appears to allow access to all 96 cores on their ThunderX systems, so I will try again to reproduce the problem. Chances are this is specific to architectures with weak memory ordering, and maybe it is as simple as a missing barrier instruction in memory.c where memory[position].used gets set to 1 (at least I wonder why we have a WMB for releasing the segment, but none for marking it in use).
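
Roughly what I have in mind (a sketch only, not an actual diff; WMB is our existing write-barrier macro):

  memory[position].used = 1;   /* claim the slot ...                        */
  WMB;                         /* ... and make the store visible to other
                                  cores, mirroring the WMB already issued
                                  when the segment is released              */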

@martin-frbg
Collaborator

This simple change appears to have actually solved the issue. Could you retest with the current develop branch on your platform?

@mistomin
Author

mistomin commented Apr 8, 2020

It does not help, unfortunately. Still the same error.

@martin-frbg
Collaborator

Pity, it looked so good and logical. I'll run some more tests.

@martin-frbg
Collaborator

I cannot reproduce this with 69f277f on ThunderX; I can let the dgemm benchmark loop for an hour (until it hits the CI time limit) without problems, where it would easily show the symptoms before the fix.

@mistomin
Author

mistomin commented Apr 9, 2020

I've just double-checked on TSV110: clone the develop branch, modify benchmark/gemm.c so that the main loop looks like

#pragma omp parallel
{
    for (int j = 0; j < loops; j++)
        GEMM(...)
}

add -fopenmp to benchmark/Makefile and compile with USE_OPENMP=1 TARGET=TSV110 BINARY=64. Running OPENBLAS_LOOPS=10000 OMP_NUM_THREADS=96 ./benchmark/dgemm.goto 10 10 0 gives me the same error within 1 second.

I've also tried the test you modified in your commit (dgemm_thread_safety) and it runs without any error.

Then I removed the line you added (the WMB line in memory.c) and reran dgemm_thread_safety. It still passes without any error.

So maybe I am doing something wrong with benchmark/gemm.c?

Also, please make sure you have NUM_BUFFERS = OMP_NUM_THREADS * 2 on your machine; otherwise it is hard to reproduce this error.

@martin-frbg
Collaborator

martin-frbg commented Apr 9, 2020

Wondering if there are actually two bugs, and whether the NUM_PARALLEL=2 workaround from #1536 plays a role.
NUM_PARALLEL is still 1 by default, so NUM_BUFFERS should be 2*MAX_CPU_NUMBER, which should correspond to 2*96 here automatically. And yes, my benchmark gets compiled with USE_OPENMP/-fopenmp defined.
Probably CPU speed plays a role as well; ThunderX is likely to be much slower than your TSV110 (and unfortunately drone.io has pushed my latest tests onto a system with just 44 cores, so I need to remove the nthreads argument from the pragma). I definitely saw the problem before I added the WMB.

@martin-frbg
Collaborator

The drone.io runs appear to alternate between a 96-core ThunderX and a 44-core Falkor. I have checked that OMP_NUM_THREADS is identical to the number of cores and that NUM_BUFFERS is twice that. Still, I cannot reproduce the problem with the earlier fix in place.
(There is another spot in blas_memory_alloc(), near line 2760 of memory.c, where memory[position].used is read in an if() and a lock is acquired only afterwards; perhaps that one needs a barrier too?)

@mistomin
Author

That's interesting. I added WMB as the first line of that do loop, so I have

...
  LOCK_COMMAND(&alloc_lock);
#endif
  do {
    WMB;
#if defined(USE_OPENMP)
    if (!memory[position].used) {
      blas_lock(&memory[position].lock);
#endif
...

And it worked for me! It fixed the issue. In fact, after adding that WMB I can even remove the WMB you added in 69f277f and no error occurs. I'm not sure whether it is correct to add a WMB on each iteration, but at least it is a working solution. So there is definitely something wrong with the used variable: some threads can read a stale value of it. I will do some more debugging.

@martin-frbg
Collaborator

martin-frbg commented Apr 10, 2020

Thanks for the feedback, glad that my hunch was not completely wrong. I must admit that my knowledge of memory barriers is still a bit sketchy (and I am more familiar with platforms that have strong memory ordering). It would be interesting to see whether the barrier can be pulled out of the loop, and/or whether a read barrier (assuming that might be less expensive) would be sufficient here.

@martin-frbg
Collaborator

martin-frbg commented Apr 10, 2020

Putting just an MB or WMB before the do loop does appear to work fine on both Falkor and ThunderX; I have not come across a TSV110 or EMAG8180 on drone.io, however. I have not tried a "dmb ishld", for which there is no predefined macro in OpenBLAS yet, but from what I could find online the cost of either seems to be nearly the same in most benchmarks (?). (I had originally assumed our "MB" to be a read barrier, but it is actually read+write, i.e. "dmb ish".)

martin-frbg added this to the 0.3.10 milestone Apr 10, 2020
@mistomin
Author

Putting an MB or WMB before the loop does not work for me on TSV110. The only working solution for me is to issue a barrier on each loop iteration, before if (memory[position].used). It does not matter which barrier; all of them work.

Using the load memory barrier dmb ishld gives a speedup of about 15% over dmb ishst in my example. But it still gives about a 4% performance degradation compared to the version without any barrier (however, my example is quite far from real workloads, so in the real world the degradation should not be so noticeable).

I think adding a memory barrier on each iteration is reasonable here, because without a memory barrier this loop looks like

load memory[position].used into register
cmp register 0
branch equal

So I guess the CPU can apply aggressive optimizations in this case, assuming the comparison is often false (for example, loading multiple elements simultaneously), while adding a barrier there forces the CPU to load each element of memory[] sequentially, one after another.
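
An equivalent way to express the same requirement would be an acquire load of the flag instead of a dmb on every iteration (a rough self-contained sketch using C11 atomics with illustrative names; the actual memory.c keeps its MB/WMB macro style):

#include <stdatomic.h>

struct slot { atomic_int used; };   /* stand-in for an entry of memory[] */

/* Return the index of the first free slot, or -1 if none was found.
   The acquire load keeps the compiler/CPU from hoisting the read or
   reusing a stale value across iterations, which is the same effect
   the per-iteration barrier achieves in memory.c. */
static int find_free_slot(struct slot *slots, int nslots) {
    for (int i = 0; i < nslots; i++) {
        if (atomic_load_explicit(&slots[i].used, memory_order_acquire) == 0)
            return i;   /* the caller still takes the slot lock and re-checks */
    }
    return -1;
}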

I also have a simple reproducer (just two self-contained files: a main and a simplified version of memory.c). I can share it if that would help.

@martin-frbg
Collaborator

OK, so let's do that then. The only drawback of adding "dmb ishld" is that it needs to be added to the headers of all other platforms (but on the other hand it can be an empty definition on systems with strong memory ordering). A 4% performance loss seems a small price to pay for correct results on the systems that need it. Thank you very much again for testing and providing an actual working solution.
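
For illustration, what I have in mind is roughly along these lines (the macro name RMB and the exact placement in the per-platform headers are placeholders, not the final naming):

#if defined(__aarch64__)
#define RMB   __asm__ __volatile__ ("dmb ishld" : : : "memory")
#elif defined(__x86_64__) || defined(__i386__)
/* x86 does not reorder loads with other loads, so a compiler barrier suffices */
#define RMB   __asm__ __volatile__ ("" : : : "memory")
#else
#define RMB   /* strongly ordered (or handled elsewhere): empty definition */
#endif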
