Poor performance on Pentium N3540 using 64-bit OpenBLAS #1300

Closed
AllinCottrell opened this issue Sep 10, 2017 · 35 comments

@AllinCottrell

I'm not quite certain this is an OpenBLAS issue -- in principle, it could be a compiler
problem, but...

I'm a developer of an open-source econometrics package that uses OpenBLAS, and
we have come across a numerical optimization problem on which OpenBLAS generally
does a good job, with the sole exception of 64-bit OpenBLAS running on Intel Pentium
N3540. We have compared results across multiple platforms, CPUs and word-lengths,
and have also compared results between OpenBLAS 64-bit and plain Netlib blas/lapack
64-bit.

By "poor performance" I mean that relative to all other cases, 64-bit OpenBLAS on
Pentium N3540 takes over twice as many iterations and ends up with a value of the
maximand that is about 1 percent worse than other cases. And, to be clear, "other cases"
includes 32-bit OpenBLAS and 64-bit Netlib blas/lapack on the same machine.

I'm not yet able to produce a minimal test case, but I can say that the OpenBLAS
functions called by the optimization routine include dpotrf/dpotri and dgetrf/dgetrs.
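For anyone wanting to probe those four routines in isolation, a minimal sketch via SciPy's LAPACK wrappers might look like the following (the matrix size, seed, and use of SciPy are my choices for illustration, not taken from the actual application code):

```python
# Minimal probe of dpotrf/dpotri and dgetrf/dgetrs through SciPy's
# low-level LAPACK wrappers; inputs are synthetic, sized like the
# "less than 50 in dimension" matrices mentioned in this thread.
import numpy as np
from scipy.linalg import lapack

rng = np.random.default_rng(0)
n = 50
m = rng.standard_normal((n, n))
a = m @ m.T + n * np.eye(n)          # well-conditioned SPD matrix

# Cholesky factorization and inversion (dpotrf + dpotri)
c, info = lapack.dpotrf(a, lower=1)
assert info == 0
ainv, info = lapack.dpotri(c, lower=1)
assert info == 0
ainv = np.tril(ainv) + np.tril(ainv, -1).T   # dpotri fills one triangle

# LU factorization and solve (dgetrf + dgetrs)
b = rng.standard_normal(n)
lu, piv, info = lapack.dgetrf(a)
x, info = lapack.dgetrs(lu, piv, b)

# Residual checks: a large residual here would implicate the BLAS build
print(np.max(np.abs(a @ ainv - np.eye(n))))
print(np.max(np.abs(a @ x - b)))
```

Running the same four calls against the suspect 64-bit Atom-kernel build should give residuals of comparable (tiny) size; a markedly larger residual would point at the BLAS build rather than the optimizer.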

In all cases entering the comparison, OpenBLAS is compiled using gcc. In the case
of 64-bit OpenBLAS on Pentium N3540 the compiler is x86_64-w64-mingw32-gcc
(gcc 5.4.0). The same compiler was used to produce the Netlib blas/lapack DLLs that
gave "normal" results on the same target machine, and the corresponding 32-bit
compiler, i686-w64-mingw32-gcc (gcc 5.4.0) was used to produce the 32-bit
openblas.dll that also gave normal results.

@AllinCottrell
Author

I should add: we are using version 0.2.20 of OpenBLAS but have also tried 0.3.0.dev.
Updating to the latter made no difference to the comparison.

@brada4
Contributor

brada4 commented Sep 10, 2017

How big are the inputs?
Which functions are they passed to?
Which CPU are you comparing this low-cost, low-power CPU against?

@AllinCottrell
Author

The inputs are of moderate size: matrices of less than 50 in dimension.

We're comparing with Nehalem, Sandybridge and Haswell, but the most relevant
point is that we're comparing with both 32-bit OpenBLAS and 64-bit standard
Netlib blas/lapack on the very same low-cost low-power CPU.

@AllinCottrell
Author

"To what functions?" See my first post: dpotrf/dpotri and dgetrf/dgetrs.

@martin-frbg
Collaborator

What build options did you use? (In particular, is this a single- or multithreaded build? Issues #1270 and #1253 may be somewhat related, if not entirely understood yet.)

@AllinCottrell
Author

I used these (relevant) build options:

DYNAMIC_ARCH = 1 # core detected as "Atom" on Pentium N3540
USE_OPENMP = 1 # since we use OpenMP in our caller code at some points
NUM_THREADS = 24

So it's a multithreaded build. Thanks for the refs to the other issues; I'm scanning them in search of possible commonalities.
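For reference, a cross-build with those options would be invoked roughly like this (the Makefile variables are OpenBLAS's own; the toolchain triplet is the one quoted elsewhere in this thread, so treat this as a sketch rather than the author's actual command line):

```shell
# Hypothetical cross-compile of a 64-bit Windows OpenBLAS DLL with the
# options quoted above, using the mingw-w64 toolchain named in this thread.
make CC=x86_64-w64-mingw32-gcc FC=x86_64-w64-mingw32-gfortran HOSTCC=gcc \
     DYNAMIC_ARCH=1 USE_OPENMP=1 NUM_THREADS=24
```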

@AllinCottrell
Author

@martin-frbg : given the discussion under #1253 maybe I should try rebuilding with a
more recent cross-gcc (such as 7.2.0).

@brada4
Contributor

brada4 commented Sep 11, 2017

Are you saying 32-bit OpenBLAS is faster than 64-bit OpenBLAS on this particular CPU?

@AllinCottrell
Author

@brada4 : the 32-bit OpenBLAS is not necessarily faster, but it takes fewer
iterations to reach convergence and it reaches a higher maximum than the
64-bit build, on our test problem.

@brada4
Contributor

brada4 commented Sep 11, 2017

Care to share a quick sample? I have the exact same CPU as you, and many more.

@brada4
Contributor

brada4 commented Sep 11, 2017

Try the suggestion from #1237, though I don't believe recent Atoms/Pentiums are at the low end of the gene pool.
Can you profile how much time is spent in each BLAS call (as opposed to the LAPACK wrappers)? I don't have a Windows recipe; the Linux one would be perf record / perf report. A sample would help me learn the Linux part of this.

@AllinCottrell
Author

@brada4 Sorry, can't produce a minimal test case since I don't have access to a Pentium N3540 myself; I'm working from (good, detailed) information sent by a colleague in Ukraine. But we do have some more information: besides getting good results on the N3540 with 32-bit OpenBLAS (that's under Windows 10), we also get good results when the machine is booted into Ubuntu and runs 64-bit OpenBLAS 0.2.19: in that case the core selected dynamically is Prescott, not Atom.

It looks as if CORE_ATOM is not suitable for 64-bit operation of the N3540 Silvermont. I tried making a 64-bit Windows build of 0.3.0.dev with gcc 7.2.0 and that produced even worse results than gcc 5.4.0 (missed the maximum by a big margin) using the Atom core.

@martin-frbg
Collaborator

That would be quite surprising, as the N3540 is certainly part of the "Atom" line. To reduce the number of variables, would it be possible for you to repeat the Ubuntu check with a 64-bit 0.3.0dev ?

@AllinCottrell
Author

@martin-frbg I'm afraid that would be difficult: I'm on Arch (plus Fedora on another machine) and I'm not sure I could build a working drop-in replacement for the Debian/Ubuntu libopenblas. (Debian factors out the common blas/lapack functions and links to a modified libopenblas for the optimized functions, as I understand it.)

@martin-frbg
Collaborator

I see. Perhaps your colleague could try the Ubuntu 0.2.19 with export OPENBLAS_CORETYPE=ATOM then, to see if it is actually a Prescott vs. Atom code problem rather than a compiler or Windows issue?
(I only have an older N3150 box running Linux, and as I understand it I would probably need your code (gretl ?) and data to be able to investigate the problem, unless it also shows up with the included benchmark/dpotrf.goto testcase)
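The runtime override being suggested is a one-liner; the test binary name below is a placeholder, not something from this thread:

```shell
# Force the Atom kernels on a DYNAMIC_ARCH OpenBLAS build at runtime,
# to separate a kernel-code problem from a compiler/OS problem.
export OPENBLAS_CORETYPE=ATOM
./dpotrf_testcase   # hypothetical placeholder for the actual program
```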

@brada4
Contributor

brada4 commented Sep 12, 2017

Is the power profile set to "performance" for the measurement? Windows 10 parks cores on laptops just like mobile phones do.

@AllinCottrell
Author

@martin-frbg OK, I can certainly suggest that (OPENBLAS_CORETYPE=ATOM).

@brada4 OK, that's another thing we can check.

@AllinCottrell
Author

General point: OpenBLAS is fantastically good software and I hope we can help sort out an odd case of less than optimal performance.

@AllinCottrell
Author

Well, my colleague tried installing, on Ubuntu, a build of OpenBLAS 0.2.20 from
https://launchpad.net/ubuntu/artful/+package/libopenblas-base . It turns out this (a) sets blascore to Atom on his Pentium N3540 but (b) produces correct results. So the problem we've noted seems to be specific to (64-bit) Windows.

@martin-frbg
Collaborator

I have only wild guesses to offer now - setting CONSISTENT_FPCSR=1 (as I think was what brada4 was suggesting with his reference to 1237) and/or USE_SIMPLE_THREADED_LEVEL3=1 in case there is some kind of thread contention even with OPENMP on that platform.

@brada4
Contributor

brada4 commented Sep 13, 2017

Yup, that parameter.
MKL isn't fast on this CPU either.

@AllinCottrell
Author

We've now tested a 64-bit build of openblas on Windows with CONSISTENT_FPCSR=1. I'm afraid this modification did not improve the results.
Speed is not the issue, it's accuracy (in particular missing a known maximum). The excess number
of iterations on 64-bit Windows is of interest primarily because it indicates the math is not being done right.

@brada4
Contributor

brada4 commented Sep 15, 2017

BLAS results should agree to within a few bits of accuracy/rounding error.
Can you compare the interim matrices returned from the BLAS calls along the way against:
single-threaded OpenBLAS
Netlib BLAS built with -O2 or lower
If you post a sample (or describe it: a convex surface, random data, some test function with multiple near-minima, or so), we could try too.
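The interim-matrix comparison suggested here can be sketched as follows: factor the same input along two independent code paths and report the largest element-wise discrepancy. NumPy's cholesky stands in for the "reference" build in this illustration; in the real test the second path would be single-threaded OpenBLAS or Netlib BLAS loaded as a separate library:

```python
# Compare a dpotrf factorization against an independent reference
# factorization of the same matrix; a discrepancy much larger than
# rounding error would localize the bad BLAS call.
import numpy as np
from scipy.linalg import lapack

rng = np.random.default_rng(1)
n = 40
m = rng.standard_normal((n, n))
a = m @ m.T + n * np.eye(n)          # SPD input

c, info = lapack.dpotrf(a, lower=1)  # path 1: LAPACK dpotrf
ref = np.linalg.cholesky(a)          # path 2: reference factorization

diff = np.max(np.abs(np.tril(c) - ref))
print(f"max |dpotrf - reference| = {diff:.3e}")
```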

@AllinCottrell
Author

@brada4 : "can you compare interim matrices returned from BLAS..." Yes, that's exactly what I'd like to do, to pinpoint the problem. And I've put some debugging spew into my code to enable that. Right now my colleague with the Pentium N has lost patience (or is just taking a breather!) so I'll have to wait a bit.

@brada4
Contributor

brada4 commented Sep 15, 2017

Let's start with the small-effort, big-impact options: alternate BLAS libs. (You can find prebuilt, maybe old, Netlib blas/lapack around the web; just rename the DLLs and test. It may take a lot longer to finish.)

@martin-frbg
Collaborator

This makes me wonder if/how your code sets up handling of denormals, i.e. the motivation behind the CONSISTENT_FPCSR option as outlined in the first message of #1237. I imagine a difference in default behaviour between the platforms could have a significant influence on calculation time and accuracy.
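To illustrate the point about denormals (a general illustration, not something measured on the N3540): under IEEE 754 gradual underflow, subnormal doubles keep tiny intermediate quantities representable, whereas a flush-to-zero FPU mode, the kind of per-thread control-register state CONSISTENT_FPCSR propagates, would turn them into exact zeros mid-iteration:

```python
# Smallest normal double and a subnormal derived from it. Under gradual
# underflow (the default here) the subnormal survives exactly; under a
# flush-to-zero mode it would become 0.0 and the ratio would be lost.
import sys

tiny = sys.float_info.min   # smallest normal double, about 2.2e-308
sub = tiny / 4              # exactly representable as a subnormal

print(sub > 0.0)            # survives under gradual underflow
print(sub * 4 == tiny)      # no information was lost
```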

@brada4
Contributor

brada4 commented Sep 16, 2017

@AllinCottrell please come forward with some numbers. It doesn't work to say it is slow, then inaccurate, then slow again.

@AllinCottrell
Author

I've never said it's slow. It might be, but the calculation completes so quickly that you'd have to run it thousands of times to get a meaningful timing. My concern is that it's not producing the same answers (and the fact that it takes a lot more iterations is an indicator that the calculations are diverging). To summarize, all of the following produce a log-likelihood of -62.56261 (agreeing to at least 7 significant digits) after 51 or 52 iterations on a certain maximum-likelihood problem:

32-bit openblas on Pentium N3540, Windows 10 (Atom core)
64-bit openblas 0.2.19 on Pentium N3540, Windows 10 (Prescott core)
Current 64-bit openblas on Pentium N3540, Linux (Atom core)
64-bit netlib blas, Pentium N3540, Windows 10
Current 64-bit openblas on all other machines we've tried (includes Nehalem, Sandybridge, Haswell)

The odd man out is newer 64-bit openblas on Pentium N3540, Windows 10 (Atom core): in this case we're getting a maximized log-likelihood of -62.75480 after 132 iterations. This is well beyond the sort of marginal difference that one expects across different compilers or platforms on nonlinear optimization problems.

@brada4
Contributor

brada4 commented Sep 17, 2017

Can you print the presumed optimum at each step, to see whether it diverges or just floats around?

@susilehtola
Contributor

If your results differ that much, I'd recommend just comparing the value and gradient at the initial guess vector; the wrong results should be already obvious there.

(This is not a question of poor performance, but of an incorrect result.)
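A concrete version of that check might look like this (the quadratic objective is a hypothetical stand-in; the real test would evaluate the actual log-likelihood and its gradient at the starting point under each BLAS build and diff those numbers before any iteration happens):

```python
# Evaluate the objective and a central-difference gradient at the
# initial guess. Comparing these two numbers across builds catches a
# wrong BLAS result before the optimizer has a chance to wander.
import numpy as np

def objective(x):
    return -0.5 * x @ x + np.sum(x)      # stand-in, not the real model

def fd_gradient(f, x, h=1e-6):
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

x0 = np.zeros(5)                         # initial guess
print(objective(x0))                     # value at the starting point
print(fd_gradient(objective, x0))        # analytic gradient here is all ones
```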

@brianborchers

It seems from the description of the problem that some OpenBLAS function could be producing incorrect results that are causing the iterative algorithm to converge more slowly than usual. It would be painful, but you could track this down by checking the results from each call to a BLAS routine to see when the results first diverge from correct results obtained using a different build of OpenBLAS that does work correctly.

@AllinCottrell
Author

@brianborchers : that's exactly it, and if I possessed a machine with the CPU in question I'd be doing as you say. But I'm afraid I don't, and the correspondent who was feeding me information on this point has apparently lost interest in the matter.

@martin-frbg
Collaborator

Probably any recent machine would do, with TARGET=ATOM set for the build (and Windows, with the gcc 5.4 mingw mentioned earlier). This may actually be (or have been) a mingw bug, for all we know.

@brada4
Contributor

brada4 commented Jun 26, 2018

MKL does not shine on this CPU either. Is it worth comparing, in case there is no case here?

@martin-frbg
Collaborator

Quoting AllinCottrell from September, "speed is not the issue, it's accuracy" (#1300 (comment)). So the question seems to be why OpenBLAS compiled with mingw gives wrong results on at least this specific model of Atom. Earlier tests already excluded a general bug in the code, as the same hardware worked fine under Linux. I must say I am tempted to shrug it off as some Atom-specific mingw code-generation bug (in what is by now probably an outdated version of mingw) if nobody has the hardware+OS combination to reproduce it.
