improve _syrk for small data #757

brada4 · 2016-01-23T13:37:00Z

Motivation in #751
interface/syrk.c

   args.common = NULL;
   args.nthreads = num_cpu_avail(3);

+  if(trans) {
+    if (sizeof(FLOAT)*(args.lda*args.k+args.ldc*args.n)<2*1024*1024) args.nthreads = 1;
+  } else {
+    if (sizeof(FLOAT)*args.n*(args.ldc+args.lda)<2*1024*1024) args.nthreads = 1;
+  }
+
   if (args.nthreads == 1) {
 #endif


martin-frbg · 2016-01-23T14:10:22Z

Wouldn't it make more sense to keep these in a single bug ticket (and eventually a pull request) rather than flooding the system with serially created tickets? (Seeing a benchmark for at least some of these might be instructive as well; not all the low-hanging fruit may be worth picking.)

brada4 · 2016-01-23T14:46:36Z

It would take three times as many tickets to reach the same result. I agree about the amount of reading and will make one more patch for the rest.
Basically: approximate the memory involved from the netlib arguments and don't thrash the outer caches with 2 threads. Yes, it is slightly miscalculated, since it does not consider the real cache size, the existence of shared caches, or buffers allocated deep in the calls. But whatever the approximation error, it completely avoids touching extra threads for small DSP cases, which are much smaller than my assumed cache.
As a bonus, <2*1024*1024 acts as a TODO marker in case it smells like rotten fruit.
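The heuristic described above can be sketched in isolation roughly as follows. This is a minimal sketch, not the OpenBLAS implementation: the names `SMALL_SYRK_BYTES`, `syrk_footprint_bytes` and `syrk_pick_threads` are hypothetical, and the 2 MiB cutoff is simply the constant from the patch, not a measured value. The footprint follows the reference BLAS storage convention (A occupies lda*k elements when not transposed and lda*n when transposed, C occupies ldc*n), so the two branches may not match the patch line for line.

```c
#include <stddef.h>

/* Hypothetical cutoff copied from the patch: 2 MiB of operand data. */
#define SMALL_SYRK_BYTES ((size_t)2 * 1024 * 1024)

/* Approximate number of bytes SYRK touches, using only the BLAS arguments.
 * Assumes column-major storage as in reference BLAS:
 *   not transposed: A is lda x k;  transposed: A is lda x n;  C is ldc x n.
 * Internal buffers and the real cache hierarchy are deliberately ignored. */
static size_t syrk_footprint_bytes(size_t elem_size, int trans,
                                   size_t n, size_t k,
                                   size_t lda, size_t ldc)
{
    size_t a_elems = trans ? lda * n : lda * k;
    size_t c_elems = ldc * n;
    return elem_size * (a_elems + c_elems);
}

/* Return 1 thread for small problems, otherwise the available count. */
static int syrk_pick_threads(size_t elem_size, int trans,
                             size_t n, size_t k,
                             size_t lda, size_t ldc, int available)
{
    if (syrk_footprint_bytes(elem_size, trans, n, k, lda, ldc)
            < SMALL_SYRK_BYTES)
        return 1;
    return available;
}
```

For example, a double-precision call with n = 100, k = 50, lda = ldc = 100 touches about 120000 bytes under this estimate and would stay single-threaded, while n = k = lda = ldc = 1000 (about 16 MB) would keep all available threads.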

brada4 · 2016-01-23T21:27:00Z

Diff for the rest of s* and d*. Somebody else should take care of the complex variants, as I have close to zero interest in them.

serially_antibenchmark_diff.txt

jeromerobert · 2016-01-24T11:17:08Z

@brada4 you should close these bugs because:

They are flooding the bug tracker.
They are code contributions, so they should be a (single) pull request (nobody will merge them if you leave them as loose patches).
None of your thresholds depends on GEMM_MULTITHREAD_THRESHOLD.
Finally, and this is the main reason, all your thresholds are at least 100 times too high (did you do any measurements?).

brada4 · 2016-01-24T11:35:02Z

Sorry for knocking at the wrong door. I don't use or want to use Git. You can read that inside the diff.
I think the model behind GEMM_MULTITHREAD_THRESHOLD is flawed, and it will get painfully messy when you calibrate it against one single Sandy Bridge CPU.
You can see measurements in #730 illustrating cache thrashing by concurrent threads on very small data.
They are all bugs with a 4-10x performance regression for small matrices (on the order of <X kilobytes, where X is likely 128k).

martin-frbg · 2024-01-09T13:59:28Z

Closing as implemented in #3292 and follow-up PRs.

martin-frbg closed this as completed Jan 9, 2024
