
Lapack test failures for OpenBLAS 0.3.21 on ARM Neoverse_v1 #4187


Closed
casparvl opened this issue Aug 8, 2023 · 13 comments

Comments

@casparvl

casparvl commented Aug 8, 2023

I've compiled OpenBLAS 0.3.21 on an ARM Neoverse_v1 CPU. Running the lapack test suite, I see some tests failing

SUMMARY                 nb test run     numerical error         other error
================        ===========     =================       ================
REAL                    1315683         107     (0.008%)        0       (0.000%)
DOUBLE PRECISION        1314777         66      (0.005%)        0       (0.000%)
COMPLEX                 773609          97      (0.013%)        0       (0.000%)
COMPLEX16               776246          74      (0.010%)        0       (0.000%)

--> ALL PRECISIONS      4180315         344     (0.008%)        0       (0.000%)


== 2023-07-19 12:16:15,506 openblas.py:113 INFO 4180315 LAPACK tests run - 344 failed due to numerical errors - 0 failed due to other errors

To give some idea, here is an excerpt from the test report showing some (not all) of the failures:

Excerpt
...
Matrix order=   15, type=10, seed= 798,1691,2423, 745, result  5 is 8.389E+06
 Matrix order=   15, type=11, seed= 931,1787, 557,2429, result  5 is 8.389E+06
 Matrix order=   15, type=17, seed= 529,3615,1764,1221, result  5 is 8.389E+06
 Matrix order=   15, type=18, seed=3991, 625,3539,2581, result  5 is 8.389E+06
 Matrix order=   15, type=19, seed=2400,1821, 218,1365, result  5 is 8.389E+06
 Matrix order=   15, type=20, seed=2155,2073,3686, 149, result  5 is 8.389E+06
 Matrix order=   15, type=21, seed=3823,2194,3510,3029, result  5 is 8.389E+06
 Matrix order=   15, type=22, seed=2951, 608,1256,3481, result  5 is 8.389E+06
 Matrix order=   15, type=23, seed=1243,2927, 263, 357, result  5 is 8.389E+06
 Matrix order=   15, type=24, seed=1487,3956,2976,3649, result  5 is 8.389E+06
 Matrix order=   15, type=25, seed=2605,2211,3982,2721, result  5 is 8.389E+06
 Matrix order=   15, type=26, seed=   8,1600,1726,2817, result  5 is 8.389E+06
 Matrix order=   20, type= 8, seed= 624, 693,1681,   9, result  5 is 8.389E+06
 Matrix order=   20, type=10, seed= 563,2705, 428, 633, result  5 is 8.389E+06
...
 SGV drivers:     64 out of   1092 tests failed to pass the threshold
...
 Matrix order=   20, type=10, seed= 563,2705, 428, 633, result  5 is 4.504D+15
 Matrix order=   20, type=12, seed= 112,1059,2403,2121, result  5 is 4.504D+15
 Matrix order=   20, type=13, seed=3448,3524, 410,2697, result  5 is 4.504D+15
 Matrix order=   20, type=17, seed= 723,4024,2193,2265, result  5 is 4.504D+15
...
 DGV drivers:     34 out of   1092 tests failed to pass the threshold
...
 Matrix order=    6, type=16, seed= 852,1170,4018,2529, result  5 is 4.504D+15
 Matrix order=    8, type=20, seed= 202,3818,2830,2373, result  5 is 4.504D+15
 Matrix order=    8, type=25, seed= 582,1680, 289,1745, result  5 is 4.504D+15
 Matrix order=   10, type= 8, seed= 573,2931,1057, 529, result  5 is 4.504D+15
 Matrix order=   10, type=10, seed= 308, 212,3333,  73, result  5 is 4.504D+15
...
 DGV drivers:     18 out of   1092 tests failed to pass the threshold
...
 N=  30 M=   0, P=   5, type  8, test  2, ratio=  555331.
 N=   3 M=   0, P=  20, type  8, test  2, ratio= 0.889025E+07
 N=  30 M=   0, P=  20, type  8, test  2, ratio= 0.219820E+07
 M=   3 P=   0, N=  30, type  7, test  1, ratio= 0.800042E+07
...
 GQR:     14 out of   1728 tests failed to pass the threshold
...
 Matrix order=   20, type=16, seed= 647,2328,1944, 557, result  5 is 8.389E+06
 CGV drivers:      1 out of   1092 tests failed to pass the threshold
...
 CHESV , UPLO='U', N =    3, type  2, test  1, ratio = 0.10874E+07
 CHESVX, FACT='N', UPLO='U', N =    3, type  2, test  1, ratio = 0.10874E+07
 CHESV , UPLO='U', N =    5, type  9, test  1, ratio = 0.32042E+07
 CHESVX, FACT='N', UPLO='U', N =    5, type  9, test  1, ratio = 0.32042E+07
 CHESV , UPLO='U', N =    5, type 10, test  1, ratio = 0.35394E+07
 CHESVX, FACT='N', UPLO='U', N =    5, type 10, test  1, ratio = 0.35394E+07
 CHESV , UPLO='U', N =   10, type  2, test  1, ratio = 0.16458E+07
 CHESVX, FACT='N', UPLO='U', N =   10, type  2, test  1, ratio = 0.16458E+07
...
 CHE drivers:     24 out of   1072 tests failed to pass the threshold
...
 Matrix order=   12, type=24, seed=2693,2404,3046,2957, result  5 is 4.504D+15
 Matrix order=   12, type=25, seed= 578, 723, 929,1637, result  5 is 4.504D+15
 Matrix order=   12, type=26, seed=2061,1512,1968, 125, result  5 is 4.504D+15
 Matrix order=   20, type=16, seed= 647,2328,1944, 557, result  5 is 4.504D+15
 Matrix order=   20, type=17, seed=1585,3902,3906,1293, result  5 is 4.504D+15
...
 ZGV drivers:     53 out of   1092 tests failed to pass the threshold

The full test report can be found in this gist.

Now, admittedly, I don't have much experience in interpreting the results of the Lapack test suite.

Are these failures cause for concern? I.e. are they small deviations (possibly due to slightly different numerical roundoff etc on this architecture)? Or do they point to a real bug? The numbers sure don't look small, but again... I don't really know what I'm looking at, so hoping to get some help here :)

@martin-frbg
Collaborator

0.3.23 or better yet current develop branch would probably be more useful - unfortunately we're only just starting to test the NeoverseV1 code in CI (AWS) and it is not yet hooked up to the main repository. (Contributed kernels have come from Arm-affiliated folks though...)
One problem with the LAPACK testsuite is that a lot of tests set the return matrix to some absurdly high values to signal that an error occurred, as the options for less dramatic feedback are limited.

@casparvl
Author

casparvl commented Aug 8, 2023

Thanks for the quick response!

0.3.23 or better yet current develop branch would probably be more useful

Hm, let's say for the sake of argument that I don't have that flexibility. Some minor context as to why: this issue popped up in a project where a group of HPC centers is trying to build a common software stack to offer to their scientific users. This software stack is built for a wide range of architectures (various generations of AMD, Intel, and ARM chips) so that it can be used on a variety of systems.

In the (scientific) HPC world, fixed versions are generally appreciated in that context of reproducibility. Typically, we offer multiple versions of each software package and let the user pick. We will no doubt have newer versions of OpenBLAS in newer versions of our software stacks, but 0.3.21 is one of the ones we are trying to offer currently.

unfortunately we're only just starting to test the NeoverseV1 code in CI (AWS) and it is not yet hooked up to the main repository

Great that you are setting this up. I'm wondering, given that this CI is not there currently: for 0.3.21 (or any of the release versions, really), what would be your expected behaviour for Neoverse_v1 architectures? Does OpenBLAS just fall back to some generic kernel (i.e. potentially poor performance), but should it at least produce the correct result? Or would you say all bets are off and it might produce incorrect results as well? Of course, I know that without testing, you can never really be sure, just wondering what your expectation is on this.

One problem with the LAPACK testsuite is that a lot of tests set the return matrix to some absurdly high values to signal that an error occured, as the options for less dramatic feedback are limited

Does this mean that even though the testsuite summary qualifies this as a numerical error, the underlying error that occurred might not have been numerical at all? It just returns a large number, for lack of other signalling methods?

I guess on our side, we have two choices: either we don't offer it on neoverse_v1, or we ignore the test results, offer it (potentially with a big fat warning), and hope that it doesn't cause issues down the line.

That's also why I'm so interested in what these errors mean: if it simply means the numerical error is slightly larger than expected, that should not be a huge deal for most scientific codes. Also, if you're saying "use 0.3.23 or develop" simply for performance reasons, I'd still be comfortable offering 0.3.21 to users - in fact I'd love for our users to be able to experiment with both and see the potential performance benefits of newer versions, especially for something low level like OpenBLAS on a new architecture. But, if it means the result is really wrong (i.e. beyond numerical precision issues) with 0.3.21, that is an issue: we clearly don't want users to end up with incorrect scientific results :)

@martin-frbg
Collaborator

In the (scientific) HPC world, fixed versions are generally appreciated in that context of reproducibility

I understand the desire for version stability; on the other hand, that may result in your fixed version having unfixed bugs that have been fixed since. (Though I have to admit that sometimes bugs are not only getting fixed but replaced with bigger and better ones elsewhere.)

I'm wondering, given that this CI is not there currently: for 0.3.21 (or any of the release versions, really), what would be your expected behaviour for Neoverse_v1 architectures?

Not having this in CI does not mean that it was not distributed as part of the codebase - just that I trusted the contributor (and the occasional qemu run). NeoverseV1 support (as in cpu id recognition) appeared in 0.3.20; any version before that will have fallen back to very basic ARMV8 code. 0.3.22 and the hopefully-soon-to-be 0.3.24 (current develop) got further improvements in NRM2 and GEMM that may play a role w.r.t. the test errors you are getting.

Does this mean that even though the testsuite summary qualifies this as numerical error, the underlying error that has occured might not have been numerical at all?

No, still a numerical error but not of the magnitude returned in the result matrix.

What you can do is (a) repeat the test (or any other) on NeoverseV1 with the environment variable OPENBLAS_VERBOSE set to 2; this will make OpenBLAS report which cpu it detected at runtime (maybe 0.3.21 is not even selecting the V1 kernels), and/or (b) run with OPENBLAS_CORETYPE set to something else, like ARMV8 or NEOVERSEN1, to force it to go with these non-SVE kernels instead (and see if the errors are gone).
Only I would rather not spend time debugging an issue in 0.3.21 that may already have been fixed since.
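For reference, a minimal sketch of both suggestions, assuming a DYNAMIC_ARCH build and the lapack-netlib/TESTING layout; the xeigtsts binary and sgd.in input are just examples, pick any test you like:

```shell
cd lapack-netlib/TESTING

# (a) Ask OpenBLAS which core type it detected: with OPENBLAS_VERBOSE=2
# it prints the selected core on startup.
OPENBLAS_VERBOSE=2 ./EIG/xeigtsts < sgd.in

# (b) Force a different (non-SVE) kernel set and compare failure counts.
OPENBLAS_CORETYPE=NEOVERSEN1 ./EIG/xeigtsts < sgd.in
OPENBLAS_CORETYPE=ARMV8      ./EIG/xeigtsts < sgd.in
```

Note that both variables only have an effect when OpenBLAS was built with DYNAMIC_ARCH; a single-target build ignores them.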

@casparvl
Author

casparvl commented Aug 9, 2023

Not having this in CI does not mean that it was not distributed as part of the codebase - just that I trusted the contributor (and the occasional qemu run). NeoverseV1 support (as in cpu id recognition) appeared in 0.3.20; any version before that will have fallen back to very basic ARMV8 code. 0.3.22 and the hopefully-soon-to-be 0.3.24 (current develop) got further improvements in NRM2 and GEMM that may play a role w.r.t. the test errors you are getting.

Very clear. Ok, so my takeaway from this is: OpenBLAS is supposed to produce the correct result with any version, though possibly at poor performance (basic ARMV8 code). That's encouraging. From 0.3.21 there may even be (some) optimized kernels for NeoverseV1.

No, still a numerical error but not of the magnitude returned in the result matrix.

Ok, clear, thanks! I'm guessing there is no way for me to easily dig deeper and figure out if the original error was small or large, right?

What you can do is (a) repeat the test (or any other) on NeoverseV1 with the environment variable OPENBLAS_VERBOSE set to 2; this will make OpenBLAS report which cpu it detected at runtime (maybe 0.3.21 is not even selecting the V1 kernels), and/or (b) run with OPENBLAS_CORETYPE set to something else, like ARMV8 or NEOVERSEN1, to force it to go with these non-SVE kernels instead (and see if the errors are gone).

Ok, I'll try that.

Only I would rather not spend time debugging an issue in 0.3.21 that may already have been fixed since.

That actually makes a lot of sense :) I'll have a look at how easy it is for me to bump the version here to the latest release or develop - maybe not to offer it to our users, but I understand why that would make it more relevant for you to look into.

@martin-frbg
Collaborator

Ok, clear, thanks! I'm guessing there is no way for me to easily dig deeper and figure out if the original error was small or large, right?

I don't remember offhand, but an analysis was posted in one of the earlier issue tickets about suspicious floods of test errors, either here or in Reference-LAPACK (where the increasing fragility of the testsuite in the face of FMA and compiler optimizations has been acknowledged, but it is basically the same number of developers keeping things alive).

Reference-LAPACK/lapack#732 could also be related

@casparvl
Author

Just tried 0.3.23. I see way fewer failures, but it also applies some extra patches (I'm using EasyBuild to install this). See here for the 'build recipe' that EasyBuild uses - it lists the patches (which are included in the same directory in the repo).

                        -->   LAPACK TESTING SUMMARY  <--
SUMMARY                 nb test run     numerical error         other error
================        ===========     =================       ================
REAL                    1328283         0       (0.000%)        0       (0.000%)
DOUBLE PRECISION        1327377         14      (0.001%)        0       (0.000%)
COMPLEX                 788035          0       (0.000%)        0       (0.000%)
COMPLEX16               789102          0       (0.000%)        0       (0.000%)

--> ALL PRECISIONS      4232797         14      (0.000%)        0       (0.000%)

I think this patch might actually be responsible for the lower fail count. I'll try to backport that to my OpenBLAS-0.3.21 build and see if that also reduces the failure count there. (N.B. that patch is written by the same person who created the issue you linked in your previous post - he's active in the EasyBuild community :))

@martin-frbg
Collaborator

Thanks, interesting to see that he took that approach after discussion on the LAPACK issue stalled. I'm a bit hesitant to patch the testsuite to fit OpenBLAS' needs - after all it could be tempting to just hide actual bugs in our code that way. On the other hand there's little incentive for the Reference-LAPACK project to fix what is working for them in the realm of their un-optimized, non-FMA implementation of LAPACK and BLAS. (And I'm sitting on the fence as a contributor to both projects)

@casparvl
Author

Still exactly the same errors, also with the patch. Taking a closer look, it makes sense: the patch modifies the DGD and SGD tests. I'm not seeing errors in those two, but in DGV, SGV, etc.

Anyway, I guess the positive news is that there are fewer failures with 0.3.23 (even if that was with the aforementioned patch - I don't know how many errors we'd see without it for this version). In case you're interested in the 14 failures, see this gist. I think I'll still try to follow up on your suggestion to run the test on 0.3.21 with OPENBLAS_VERBOSE and setting OPENBLAS_CORETYPE, just to see if it makes a difference.

@martin-frbg
Collaborator

martin-frbg commented Aug 10, 2023

Must be the fix-order-vectorization patch (which adds parentheses in the C/ZLAHQR sources, as quoted in the LAPACK issue); any other changes will be due to the newer kernels in 0.3.23. And I notice that your 14 remaining failures are in QR functions as well, though double precision real. Maybe worth looking for a similar spot that needs parentheses to prevent reordering...

@casparvl
Author

You mean to say that it would explain the difference? Because this patch is applied to both our 0.3.21 and 0.3.23 builds, so it shouldn't be causing any difference. I guess it might be the newer kernels in 0.3.23 after all that cause the tests to pass with 0.3.23.

I tried to re-run the test suite with OPENBLAS_VERBOSE=2, but I didn't see any output. I was afraid that stderr might not have been redirected to one of the logfiles and was just ignored somehow, but even running a single set of tests manually, e.g. OPENBLAS_VERBOSE=2 EIG/xeigtsts < sgd.in in the lapack-netlib/TESTING directory, didn't show any extra output. This is a runtime environment variable, isn't it? The same seems to be true for OPENBLAS_CORETYPE: even if I set this to a value that should definitely lead to errors on an ARM architecture (e.g. OPENBLAS_CORETYPE=Zen), a run of OPENBLAS_CORETYPE=Zen2 OPENBLAS_VERBOSE=2 EIG/xeigtsts < sgd.in completes fine (well, with the same numerical errors as before, but nothing else). Any idea what I might be doing wrong here?

@martin-frbg
Collaborator

Are you testing with a DYNAMIC_ARCH build (as I assume the EasyBuild ones are) at all? Both options will do nothing in a library purpose-built for a single cpu model.
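One quick way to check (a sketch; the install prefix is hypothetical, adjust to your tree): a single-target OpenBLAS build typically installs libopenblas.so as a link to a core-specific library, whereas a DYNAMIC_ARCH build bundles all kernels in one file without a per-core suffix in the name.

```shell
# Hypothetical install prefix - adjust to your EasyBuild tree.
ls -l /path/to/OpenBLAS/lib/libopenblas.so
# A single-target build typically resolves to something like
#   libopenblas_neoversev1p-r0.3.23.so
# while a DYNAMIC_ARCH build carries no core name in the filename.
```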

@casparvl
Author

Sorry, completely dropped the ball on this one when I got interrupted with other responsibilities. I think you are completely right. My libopenblas.so and .a just point to their equivalent libopenblas_neoversev1p-r0.3.23.so and .a respectively. So that explains why the flags don't do anything.

For now, I think we'll just go ahead and build the rest of our software stack with this. That might help us figure out if there is any effect of these issues 'down the line' in higher-level applications. Since the BLAS test suite itself passes without errors (it is really only the Lapack test suite that shows issues), we are hopeful that it does not cause issues down the line. I'm hoping that @bartoldeman can have another look at this at some point, since he has way more expertise in assessing these issues than me :)

@martin-frbg
Collaborator

Seeing only a single error (in SGGES3 - probably related to test shortcomings discussed in #4032) with current develop and TARGET=NEOVERSEV1 or ARMV8SVE.
