Poor performance on Pentium N3540 using 64-bit OpenBLAS #1300

Closed
AllinCottrell opened this issue Sep 10, 2017 · 35 comments

@AllinCottrell

I'm not quite certain this is an OpenBLAS issue -- in principle, it could be a compiler
problem, but...

I'm a developer of an open-source econometrics package that uses OpenBLAS, and
we have come across a numerical optimization problem on which OpenBLAS generally
does a good job, with the sole exception of 64-bit OpenBLAS running on Intel Pentium
N3540. We have compared results across multiple platforms, CPUs and word-lengths,
and have also compared results between OpenBLAS 64-bit and plain Netlib blas/lapack
64-bit.

By "poor performance" I mean that relative to all other cases, 64-bit OpenBLAS on
Pentium N3540 takes over twice as many iterations and ends up with a value of the
maximand that is about 1 percent worse than other cases. And, to be clear, "other cases"
includes 32-bit OpenBLAS and 64-bit Netlib blas/lapack on the same machine.

I'm not yet able to produce a minimal test case, but I can say that the OpenBLAS
functions called by the optimization routine include dpotrf/dpotri and dgetrf/dgetrs.
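For anyone wanting to probe those four routines in isolation, a minimal sketch via SciPy's LAPACK wrappers might look like the following (the matrix size, seed, and use of SciPy are my choices for illustration, not taken from the actual application code):

```python
# Minimal probe of dpotrf/dpotri and dgetrf/dgetrs through SciPy's
# low-level LAPACK wrappers; inputs are synthetic, sized like the
# "less than 50 in dimension" matrices mentioned in this thread.
import numpy as np
from scipy.linalg import lapack

rng = np.random.default_rng(0)
n = 50
m = rng.standard_normal((n, n))
a = m @ m.T + n * np.eye(n)          # well-conditioned SPD matrix

# Cholesky factorization and inversion (dpotrf + dpotri)
c, info = lapack.dpotrf(a, lower=1)
assert info == 0
ainv, info = lapack.dpotri(c, lower=1)
assert info == 0
ainv = np.tril(ainv) + np.tril(ainv, -1).T   # dpotri fills one triangle

# LU factorization and solve (dgetrf + dgetrs)
b = rng.standard_normal(n)
lu, piv, info = lapack.dgetrf(a)
x, info = lapack.dgetrs(lu, piv, b)

# Residual checks: a large residual here would implicate the BLAS build
print(np.max(np.abs(a @ ainv - np.eye(n))))
print(np.max(np.abs(a @ x - b)))
```

Running the same four calls against the suspect 64-bit Atom-kernel build should give residuals of comparable (tiny) size; a markedly larger residual would point at the BLAS build rather than the optimizer.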

In all cases entering the comparison, OpenBLAS is compiled using gcc. In the case
of 64-bit OpenBLAS on Pentium N3540 the compiler is x86_64-w64-mingw32-gcc
(gcc 5.4.0). The same compiler was used to produce the Netlib blas/lapack DLLs that
gave "normal" results on the same target machine, and the corresponding 32-bit
compiler, i686-w64-mingw32-gcc (gcc 5.4.0) was used to produce the 32-bit
openblas.dll that also gave normal results.

@AllinCottrell
Author

I should add: we are using version 0.2.20 of OpenBLAS but have also tried 0.3.0.dev.
Updating to the latter made no difference to the comparison.

@brada4
Contributor

brada4 commented Sep 10, 2017

How big are the inputs?
Which functions are they passed to?
Which CPU are you comparing this low-cost, low-power CPU against?

@AllinCottrell
Author

The inputs are of moderate size: matrices of less than 50 in dimension.

We're comparing with Nehalem, Sandybridge and Haswell, but the most relevant
point is that we're comparing with both 32-bit OpenBLAS and 64-bit standard
Netlib blas/lapack on the very same low-cost low-power CPU.

@AllinCottrell
Author

"To what functions?" See my first post: dpotrf/dpotri and dgetrf/dgetrs.

@martin-frbg
Collaborator

What build options did you use? (In particular, is this a single- or multithreaded build? Issues #1270 and #1253 may be somewhat related, if not entirely understood yet.)

@AllinCottrell
Author

I used these (relevant) build options:

DYNAMIC_ARCH = 1 # core detected as "Atom" on Pentium N3540
USE_OPENMP = 1 # since we use OpenMP in our caller code at some points
NUM_THREADS = 24

So it's a multithreaded build. Thanks for the refs to the other issues; I'm scanning them in search of possible commonalities.
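For reference, a cross-build with those options would be invoked roughly like this (the Makefile variables are OpenBLAS's own; the toolchain triplet is the one quoted elsewhere in this thread, so treat this as a sketch rather than the author's actual command line):

```shell
# Hypothetical cross-compile of a 64-bit Windows OpenBLAS DLL with the
# options quoted above, using the mingw-w64 toolchain named in this thread.
make CC=x86_64-w64-mingw32-gcc FC=x86_64-w64-mingw32-gfortran HOSTCC=gcc \
     DYNAMIC_ARCH=1 USE_OPENMP=1 NUM_THREADS=24
```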

@AllinCottrell
Author

@martin-frbg : given the discussion under #1253 maybe I should try rebuilding with a
more recent cross-gcc (such as 7.2.0).

@brada4
Contributor

brada4 commented Sep 11, 2017

Are you saying 32-bit OpenBLAS is faster than 64-bit OpenBLAS on this particular CPU?

@AllinCottrell
Author

@brada4 : the 32-bit OpenBLAS is not necessarily faster, but it takes fewer
iterations to reach convergence and it reaches a higher maximum than the
64-bit build, on our test problem.

@brada4
Contributor

brada4 commented Sep 11, 2017

Care to share a quick sample? I have the exact same CPU as you, and many more.

@brada4
Contributor

brada4 commented Sep 11, 2017

Try the suggestion from #1237, though I don't believe recent Atoms/Pentiums are at the low end of the gene pool.
Can you profile how much time is spent in each BLAS call (as opposed to the LAPACK wrappers)? I don't have a Windows recipe; the Linux one would be perf record / perf report. A sample would help me learn the Linux part of this.

@AllinCottrell
Author

@brada4 Sorry, can't produce a minimal test case since I don't have access to a Pentium N3540 myself; I'm working from (good, detailed) information sent by a colleague in Ukraine. But we do have some more information: besides getting good results on the N3540 with 32-bit OpenBLAS (that's under Windows 10), we also get good results when the machine is booted into Ubuntu and runs 64-bit OpenBLAS 0.2.19: in that case the core selected dynamically is Prescott, not Atom.

It looks as if CORE_ATOM is not suitable for 64-bit operation of the N3540 Silvermont. I tried making a 64-bit Windows build of 0.3.0.dev with gcc 7.2.0 and that produced even worse results than gcc 5.4.0 (missed the maximum by a big margin) using the Atom core.

@martin-frbg
Collaborator

That would be quite surprising, as the N3540 is certainly part of the "Atom" line. To reduce the number of variables, would it be possible for you to repeat the Ubuntu check with a 64-bit 0.3.0dev ?

@AllinCottrell
Author

@martin-frbg I'm afraid that would be difficult: I'm on Arch (plus Fedora on another machine) and I'm not sure I could build a working drop-in replacement for the Debian/Ubuntu libopenblas. (Debian factors out the common blas/lapack functions and links to a modified libopenblas for the optimized functions, as I understand it.)

@martin-frbg
Collaborator

I see. Perhaps your colleague could try the Ubuntu 0.2.19 with export OPENBLAS_CORETYPE=ATOM then, to see if it is actually a Prescott vs. Atom code problem rather than a compiler or Windows issue?
(I only have an older N3150 box running Linux, and as I understand it I would probably need your code (gretl ?) and data to be able to investigate the problem, unless it also shows up with the included benchmark/dpotrf.goto testcase)
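The runtime override being suggested is a one-liner; the test binary name below is a placeholder, not something from this thread:

```shell
# Force the Atom kernels on a DYNAMIC_ARCH OpenBLAS build at runtime,
# to separate a kernel-code problem from a compiler/OS problem.
export OPENBLAS_CORETYPE=ATOM
./dpotrf_testcase   # hypothetical placeholder for the actual program
```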

@brada4
Contributor

brada4 commented Sep 12, 2017

Is the power profile set to "performance" for the measurement? Windows 10 parks cores on laptops just like mobile phones do.

@AllinCottrell
Author

@martin-frbg OK, I can certainly suggest that (OPENBLAS_CORETYPE=ATOM).

@brada4 OK, that's another thing we can check.

@AllinCottrell
Author

General point: OpenBLAS is fantastically good software and I hope we can help sort out an odd case of less than optimal performance.

@AllinCottrell
Author

Well, my colleague tried installing, on Ubuntu, a build of OpenBLAS 0.2.20 from
https://launchpad.net/ubuntu/artful/+package/libopenblas-base . It turns out this (a) sets blascore to Atom on his Pentium N3540 but (b) produces correct results. So the problem we've noted seems to be specific to (64-bit) Windows.

@martin-frbg
Collaborator

I have only wild guesses to offer now - setting CONSISTENT_FPCSR=1 (as I think was what brada4 was suggesting with his reference to 1237) and/or USE_SIMPLE_THREADED_LEVEL3=1 in case there is some kind of thread contention even with OPENMP on that platform.

@brada4
Contributor

brada4 commented Sep 13, 2017

Yup, that parameter.
MKL isn't fast on this CPU either.

@AllinCottrell
Author

We've now tested a 64-bit build of openblas on Windows with CONSISTENT_FPCSR=1. I'm afraid this modification did not improve the results.
Speed is not the issue, it's accuracy (in particular missing a known maximum). The excess number
of iterations on 64-bit Windows is of interest primarily because it indicates the math is not being done right.

@brada4
Contributor

brada4 commented Sep 15, 2017

BLAS results should agree to within a few bits of accuracy/rounding error.
Can you compare the interim matrices returned from the BLAS calls along the way against:
single-threaded OpenBLAS
Netlib BLAS built with -O2 or lower
If you post a sample (or describe it: a convex surface, random data, some test function with multiple near-minima, or so), we could try too.
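The interim-matrix comparison suggested here can be sketched as follows: factor the same input along two independent code paths and report the largest element-wise discrepancy. NumPy's cholesky stands in for the "reference" build in this illustration; in the real test the second path would be single-threaded OpenBLAS or Netlib BLAS loaded as a separate library:

```python
# Compare a dpotrf factorization against an independent reference
# factorization of the same matrix; a discrepancy much larger than
# rounding error would localize the bad BLAS call.
import numpy as np
from scipy.linalg import lapack

rng = np.random.default_rng(1)
n = 40
m = rng.standard_normal((n, n))
a = m @ m.T + n * np.eye(n)          # SPD input

c, info = lapack.dpotrf(a, lower=1)  # path 1: LAPACK dpotrf
ref = np.linalg.cholesky(a)          # path 2: reference factorization

diff = np.max(np.abs(np.tril(c) - ref))
print(f"max |dpotrf - reference| = {diff:.3e}")
```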

@AllinCottrell
Author

@brada4 : "can you compare interim matrices returned from BLAS..." Yes, that's exactly what I'd like to do, to pinpoint the problem. And I've put some debugging spew into my code to enable that. Right now my colleague with the Pentium N has lost patience (or is just taking a breather!) so I'll have to wait a bit.

@brada4
Contributor

brada4 commented Sep 15, 2017

Let's start with the small-effort, big-impact options: alternate BLAS libs. (You can find prebuilt, maybe old, Netlib blas/lapack around the web; just rename the DLLs and test. It may take a lot longer to finish.)

@martin-frbg
Collaborator

This makes me wonder if/how your code sets up handling of denormals, i.e. the motivation behind the CONSISTENT_FPCSR option as outlined in the first message of #1237. I imagine a difference in default behaviour between the platforms could have a significant influence on calculation time and accuracy.
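To illustrate the point about denormals (a general illustration, not something measured on the N3540): under IEEE 754 gradual underflow, subnormal doubles keep tiny intermediate quantities representable, whereas a flush-to-zero FPU mode, the kind of per-thread control-register state CONSISTENT_FPCSR propagates, would turn them into exact zeros mid-iteration:

```python
# Smallest normal double and a subnormal derived from it. Under gradual
# underflow (the default here) the subnormal survives exactly; under a
# flush-to-zero mode it would become 0.0 and the ratio would be lost.
import sys

tiny = sys.float_info.min   # smallest normal double, about 2.2e-308
sub = tiny / 4              # exactly representable as a subnormal

print(sub > 0.0)            # survives under gradual underflow
print(sub * 4 == tiny)      # no information was lost
```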

@brada4
Contributor

brada4 commented Sep 16, 2017

@AllinCottrell please come forward with some numbers. It doesn't work to say it is slow, then inaccurate, then slow again.

@AllinCottrell
Author

I've never said it's slow. It might be, but the calculation completes so quickly that you'd have to run it thousands of times to get a meaningful timing. My concern is that it's not producing the same answers (and the fact that it takes a lot more iterations is an indicator that the calculations are diverging). To summarize, all of the following produce a log-likelihood of -62.56261 (agreeing to at least 7 significant digits) after 51 or 52 iterations on a certain maximum-likelihood problem:

32-bit openblas on Pentium N3540, Windows 10 (Atom core)
64-bit openblas 0.2.19 on Pentium N3540, Windows 10 (Prescott core)
Current 64-bit openblas on Pentium N3540, Linux (Atom core)
64-bit netlib blas, Pentium N3540, Windows 10
Current 64-bit openblas on all other machines we've tried (includes Nehalem, Sandybridge, Haswell)

The odd man out is newer 64-bit openblas on Pentium N3540, Windows 10 (Atom core): in this case we're getting a maximized log-likelihood of -62.75480 after 132 iterations. This is well beyond the sort of marginal difference that one expects across different compilers or platforms on nonlinear optimization problems.

@brada4
Contributor

brada4 commented Sep 17, 2017

Can you print the presumed optimum at each step, to see whether it diverges or just floats around?

@susilehtola
Contributor

If your results differ that much, I'd recommend just comparing the value and gradient at the initial guess vector; the wrong results should be already obvious there.

(This is not a question of poor performance, but of an incorrect result.)
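A concrete version of that check might look like this (the quadratic objective is a hypothetical stand-in; the real test would evaluate the actual log-likelihood and its gradient at the starting point under each BLAS build and diff those numbers before any iteration happens):

```python
# Evaluate the objective and a central-difference gradient at the
# initial guess. Comparing these two numbers across builds catches a
# wrong BLAS result before the optimizer has a chance to wander.
import numpy as np

def objective(x):
    return -0.5 * x @ x + np.sum(x)      # stand-in, not the real model

def fd_gradient(f, x, h=1e-6):
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

x0 = np.zeros(5)                         # initial guess
print(objective(x0))                     # value at the starting point
print(fd_gradient(objective, x0))        # analytic gradient here is all ones
```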

@brianborchers

It seems from the description of the problem that some OpenBLAS function could be producing incorrect results that are causing the iterative algorithm to converge more slowly than usual. It would be painful, but you could track this down by checking the results from each call to a BLAS routine to see when the results first diverge from correct results obtained using a different build of OpenBLAS that does work correctly.

@AllinCottrell
Author

@brianborchers : that's exactly it, and if I possessed a machine with the CPU in question I'd be doing as you say. But I'm afraid I don't, and the correspondent who was feeding me information on this point has apparently lost interest in the matter.

@martin-frbg
Collaborator

Probably any recent machine would do, with TARGET=ATOM set for the build (and Windows, with the gcc 5.4 mingw mentioned earlier). This may actually be (or have been) a mingw bug, for all we know.

@brada4
Contributor

brada4 commented Jun 26, 2018

MKL does not shine on this CPU either. Is it worth comparing, in case there is no case here?

@martin-frbg
Collaborator

Quoting AllinCottrell from September, "speed is not the issue, it's accuracy" (#1300 (comment)). So the question seems to be why OpenBLAS compiled with mingw gives wrong results on at least this specific model of Atom. Earlier tests already excluded a general bug in the code, as the same hardware worked fine under Linux. I must say I am tempted to shrug it off as some Atom-specific mingw code-generation bug (in what is by now probably an outdated version of mingw) if nobody has the hardware+OS combination to reproduce it.
