
locking performance with OpenBLAS #2247


Closed
cparrott73 opened this issue Sep 3, 2019 · 7 comments

@cparrott73

Greetings OpenBLAS developers:

One of my colleagues has been working on an application that uses OpenBLAS, and he noticed some performance issues involving locking in OpenBLAS. He made some changes to OpenBLAS that seemed to improve performance and passed them along to me, and I wanted to mention them here for your review. I would like to get your thoughts on these changes - particularly whether you think they are 100% safe. We have not encountered any issues with them so far, but our testing has not been exhaustive.

Here is what my colleague had to say:

  1. Check to see whether memory has been initialized before acquiring the semaphore (lock) for memory initialization. No need to get the lock if memory is already initialized.

In the blas_memory_alloc() routine, change this:

    LOCK_COMMAND(&alloc_lock);

    if (!memory_initialized) {  

To this:

  if (!memory_initialized) {
    LOCK_COMMAND(&alloc_lock);

And then also move the corresponding UNLOCK_COMMAND(&alloc_lock) inside the code block, at the end:

#if defined(SMP) && !defined(USE_OPENMP)
      UNLOCK_COMMAND(&alloc_lock);
#endif
    }
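
Put together, the net effect of this first change would look roughly like the following. This is a simplified sketch only, not the actual memory.c code: pthread mutex calls stand in for LOCK_COMMAND/UNLOCK_COMMAND, the SMP/OPENMP preprocessor guards are omitted, and init_memory_tables() is a made-up stand-in for the real one-time setup.

/* Simplified sketch of the resulting flow (not the actual memory.c code). */
#include <pthread.h>

static pthread_mutex_t alloc_lock = PTHREAD_MUTEX_INITIALIZER;
static volatile int memory_initialized = 0;

static void init_memory_tables(void) {
  /* stand-in for the real one-time setup done in blas_memory_alloc() */
}

static void ensure_memory_initialized(void) {
  if (!memory_initialized) {              /* cheap unlocked check first   */
    pthread_mutex_lock(&alloc_lock);      /* lock only on the slow path   */
    if (!memory_initialized) {            /* re-check under the lock      */
      init_memory_tables();
      memory_initialized = 1;
    }
    pthread_mutex_unlock(&alloc_lock);
  }
}

The key detail is the second check of memory_initialized under the lock, so that two threads racing past the first check cannot both run the initialization.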
  2. Acquire the lock once for the whole scan of the "memory" array, instead of acquiring and releasing it on every iteration.

In the blas_memory_alloc() routine, change this:

  do {
    if (!memory[position].used && (memory[position].pos == mypos)) {
#if defined(SMP) && !defined(USE_OPENMP)
      LOCK_COMMAND(&alloc_lock);
#else      
      blas_lock(&memory[position].lock);
#endif
      if (!memory[position].used) goto allocation;
#if defined(SMP) && !defined(USE_OPENMP)
      UNLOCK_COMMAND(&alloc_lock);
#else
      blas_unlock(&memory[position].lock);
#endif      
    }

To this:

#if defined(SMP) && !defined(USE_OPENMP)
      LOCK_COMMAND(&alloc_lock);
#else      
      blas_lock(&memory[position].lock);
#endif
  do {
    if (!memory[position].used && (memory[position].pos == mypos)) {
      if (!memory[position].used) goto allocation;
    }
#if defined(SMP) && !defined(USE_OPENMP)
      UNLOCK_COMMAND(&alloc_lock);
#else
      blas_unlock(&memory[position].lock);
#endif      

In essence, he has observed that fine-grained locking and unlocking of the individual elements of the memory[] array while scanning them is actually slower than taking the lock once around the entire scan.
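
As a simplified, self-contained illustration of what scanning under a single lock looks like (again a sketch only - the struct, array size and names here are stand-ins, not the real memory[] bookkeeping in memory.c):

/* Simplified sketch: one lock held across the whole scan of memory[],
 * instead of a lock/unlock pair per candidate slot.                    */
#include <pthread.h>
#include <stddef.h>

#define NUM_BUFFERS 64

struct alloc_slot { int used; void *addr; };

static struct alloc_slot memory[NUM_BUFFERS];
static pthread_mutex_t alloc_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns the index of a newly claimed slot, or -1 if none is free. */
static int claim_slot(void) {
  int position, found = -1;
  pthread_mutex_lock(&alloc_lock);          /* one lock for the whole scan */
  for (position = 0; position < NUM_BUFFERS; position++) {
    if (!memory[position].used) {
      memory[position].used = 1;            /* check and claim while still */
      found = position;                     /* holding the lock            */
      break;
    }
  }
  pthread_mutex_unlock(&alloc_lock);
  return found;
}

The trade-off is that the lock is now held for the whole scan, but only one lock/unlock pair is paid per allocation instead of one per candidate slot.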

Let us know what you think about these changes. Thanks in advance!

@martin-frbg
Collaborator

Thank you for these suggestions. I certainly do not consider myself an expert in thread-safe programming; most if not all of these locks were added in response to valgrind/helgrind warnings and with the intention of keeping code changes minimal.
Regarding (1), if I remember correctly the "memory_initialized" flag itself needed to be protected against races. For (2), I am unsure if I ran any benchmarks back then - it may have been just a gut feeling that holding a lock for longer than absolutely necessary would be bad for performance (in particular as this code had existed for so many years without locks...).

@brada4
Contributor

brada4 commented Sep 4, 2019

The 2nd part looks more transactional, in the sense that all the read-only checks are done in a locked context before the read-write update; currently it looks like [].used can race and change between the two checks....

The initialisation could also race between two threads, in the case where the static library did not initialize it and two threads call in at the same time.

Could you make a PR on the 2nd part?

@brada4
Contributor

brada4 commented Sep 5, 2019

Which version are you working on?
It looks like the code in your snippet was dropped a while ago:
https://github.com/xianyi/OpenBLAS/blob/fde8a8e6a02b1186a178f8bbe3b3e5d84c8786e1/driver/others/memory.c#L2685

@martin-frbg
Collaborator

Indeed, the code section that your item (2) apparently refers to has been commented out since 0.3.4 (brada4's #1814; the "WHEREAMI" code on which it depended was deemed obsolete).
Item (1) is, I believe, addressed by an outer if (!memory_initialized) { in the OPENMP case - without OPENMP, the access to the initialized flag itself is racy.
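
(For reference, the textbook way to make that unlocked fast-path check well-defined would be a C11 atomic for the flag with acquire/release ordering. The following is only a sketch of the general pattern, not what OpenBLAS currently does; the names and the init helper are stand-ins:)

/* Sketch: double-checked initialization with a C11 atomic flag, so the
 * unlocked fast-path read is well-defined.  Not the current OpenBLAS code. */
#include <pthread.h>
#include <stdatomic.h>

static pthread_mutex_t alloc_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int memory_initialized;          /* zero-initialized */

static void init_memory_tables(void) {
  /* stand-in for the real one-time setup */
}

static void ensure_memory_initialized(void) {
  if (!atomic_load_explicit(&memory_initialized, memory_order_acquire)) {
    pthread_mutex_lock(&alloc_lock);
    if (!memory_initialized) {                 /* re-check under the lock */
      init_memory_tables();
      atomic_store_explicit(&memory_initialized, 1, memory_order_release);
    }
    pthread_mutex_unlock(&alloc_lock);
  }
}

With this, the unlocked first read is no longer a data race in the C11 sense, and only the earliest callers ever touch the mutex.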

@brada4
Contributor

brada4 commented Sep 5, 2019

The idea is sort of good - the code peeks at the variable without locks, then locks if it has a chance and updates; it is just that the lock syscall is more expensive (and gets worse with the Spectre etc. fixes) than plainly iterating over the whole structure in a loop under one lock.

@cparrott73
Author

Sorry about that - my colleague was working with 0.3.3. I appreciate the feedback.

@brada4
Contributor

brada4 commented Sep 6, 2019

No worries, you are welcome to share if you find any avoidable excess syscalls anywhere in the code.
Actually, the #1785 + #1814 patches should address this concern in 0.3.3, but 0.3.7 has further improvements even in the same source file.
