Skip to content

improve _syrk for small data #757

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
brada4 opened this issue Jan 23, 2016 · 6 comments
Closed

improve _syrk for small data #757

brada4 opened this issue Jan 23, 2016 · 6 comments

Comments

@brada4
Copy link
Contributor

brada4 commented Jan 23, 2016

Motivation in #751
interface/syrk.c

   args.common = NULL;
   args.nthreads = num_cpu_avail(3);

+  if(trans) {
+    if (sizeof(FLOAT)*(args.lda*args.k+args.ldc*args.n)<2*1024*1024) args.nthreads = 1;
+  } else {
+    if (sizeof(FLOAT)*args.n*(args.ldc+args.lda)<2*1024*1024) args.nthreads = 1;
+  }
+
   if (args.nthreads == 1) {
 #endif
@martin-frbg
Copy link
Collaborator

Wouldn't it make more sense to keep these in a single bug ticket (and eventually pull request) rather than flooding the system with serially created tickets ? (Seeing a benchmark for at least some of these might be instructive as well, not all the low hanging fruit may be worth picking)

@brada4
Copy link
Contributor Author

brada4 commented Jan 23, 2016

It would take 3x more tickets to deduce same result. I agree on length of reading and make one more patch for the rest.
Basically - approximate memory involved from netlib arguments and dont thrash outer caches with 2 threads. Yes, it is lightly miscalculated by not considering real cache size, existence of shared caches, or buffers allocated deep in calls. But whatever the approximation error, it completely avoids touching extra threads for small DSP cases, which are much smaller than my assumed cache.
As a bonus <2*1024*1024 acts as a todo marker in case it smells like rotten fruit.

@brada4
Copy link
Contributor Author

brada4 commented Jan 23, 2016

diff for rest of s* d*. somebody else should take care of complex as I have close to zero interest in them

serially_antibenchmark_diff.txt

@jeromerobert
Copy link
Contributor

@brada4 you should close that bugs because:

  • They are flooding the bug tracker
  • They are code contribution so they should be a (unique) pull request (nobody will merge them if you let them as free patch).
  • None of your threshold depends on GEMM_MULTITHREAD_THRESHOLD
  • finally, and this is the main reason, all your threshold are at least 100 time too high (did you do any measurements ?)

@brada4
Copy link
Contributor Author

brada4 commented Jan 24, 2016

Sorry for knocking at the wrong door. I dont use or want to use GIT. You can read that inside diff.
I think model of GEMM_MULTITHREAD_THRESHOLD is flawed, and will get painfully messy when you calibrate it against one single sandy bridge CPU.
You see measurements in #730 illustrating cache thrashing by concurrent threads and super small data.
They are all bugs with 4-10x performance regression for small matrices (in order of <X kilobytes, where X is likely 128k)

@martin-frbg
Copy link
Collaborator

closing as implemented in #3292 and followup PRs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants