Lapack test failures for OpenBLAS 0.3.21 on ARM Neoverse_v1 #4187
0.3.23, or better yet, current develop.
Thanks for the quick response!
Hm, let's say for the sake of argument that I don't have that flexibility. Some minor context as to why: this issue popped up in a project where a group of HPC centers is trying to build a common software stack to offer to their scientific users. This software stack is built for a wide range of architectures (various generations of AMD, Intel, and ARM chips) so that it can be used on a variety of systems. In the (scientific) HPC world, fixed versions are generally appreciated in the context of reproducibility. Typically, we offer multiple versions of each software and let the user pick. We will no doubt have newer versions of OpenBLAS in newer versions of our software stacks, but for the current release of the stack we are tied to 0.3.21.
Great that you are setting this up. I'm wondering, given that this CI is not there currently: for which versions has the Neoverse V1 code actually been tested?
Does this mean that even though the testsuite summary qualifies these as failures, the results are not necessarily badly wrong? I guess on our side we have two choices: either we don't offer it on this architecture, or we offer it knowing that part of the Lapack testsuite fails. That's also why I'm so interested in what these errors mean: if it simply means the numerical error is slightly larger than expected, that should not be a huge deal for most scientific codes. Also, if you're saying "use 0.3.23 or develop" simply for performance reasons, I'd still be comfortable offering 0.3.21.
I understand the desire for version stability; on the other hand, it may result in your fixed version carrying bugs that have since been fixed. (Though I have to admit that sometimes bugs are not only getting fixed but replaced with bigger and better ones elsewhere.)
Not having this in CI does not mean that it was not distributed as part of the codebase - just that I trusted the contributor (and the occasional qemu run). NeoverseV1 support (as in cpu id recognition) appeared in 0.3.20; any version before that will have fallen back to very basic ARMV8 code. 0.3.22 and the hopefully-soon-to-be 0.3.24 that is current develop contain newer kernels and fixes on top of that.
No, still a numerical error, but not of the magnitude returned in the result matrix. What you can do is (a) repeat the test (or any other) on NeoverseV1 with the environment variable OPENBLAS_VERBOSE set to 2; this will make OpenBLAS report which cpu it detected at runtime (maybe 0.3.21 is not even selecting the V1 kernels), and/or (b) run with OPENBLAS_CORETYPE set to something else, like ARMV8 or NEOVERSEN1, to force it to use these non-SVE kernels instead and see if the errors are gone.
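For reference, the runtime detection can also be checked programmatically; a minimal sketch, assuming the test program is compiled and linked against the OpenBLAS build in question (openblas_get_config() and openblas_get_corename() are the library's own introspection helpers):

```c
/* Print which core OpenBLAS selected at runtime and how it was built.
 * Build against the library under test, e.g.:
 *   gcc check_core.c -o check_core -lopenblas
 * Run it plain, then again with OPENBLAS_VERBOSE=2 or
 * OPENBLAS_CORETYPE=ARMV8 set in the environment, and compare.
 */
#include <stdio.h>
#include <cblas.h>   /* declares openblas_get_config() and openblas_get_corename() */

int main(void) {
    printf("config:   %s\n", openblas_get_config());   /* version, build options, core */
    printf("corename: %s\n", openblas_get_corename()); /* kernel set chosen at runtime */
    return 0;
}
```

If the reported core name is not NEOVERSEV1, or changes when OPENBLAS_CORETYPE is set, that already narrows down whether the SVE kernels are involved in the failures at all.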
Very clear. Ok, so my takeaway from this is: OpenBLAS is supposed to produce the correct result with any version, though possibly at poor performance (basic ARMV8 code). That's encouraging. From 0.3.21 there may even be (some) optimized kernels for NeoverseV1.
Ok, clear, thanks! I'm guessing there is no way for me to easily dig deeper and figure out if the original error was small or large, right?
Ok, I'll try that.
That actually makes a lot of sense :) I'll have a look at how easy it is for me to bump the version here to the latest release or develop - maybe not to offer it to our users, but I understand why that would make it more relevant for you to look into.
I don't remember offhand, but an analysis of suspicious floods of test errors was posted in one of the earlier issue tickets, either here or in Reference-LAPACK (where the increasing fragility of the testsuite in the face of FMA and compiler optimizations has been acknowledged, but it is basically the same small number of developers keeping things alive). Reference-LAPACK/lapack#732 could also be related.
Just tried 0.3.23. I see way fewer failures, but it also applies some extra patches (I'm using EasyBuild to install this). See here for the 'build recipe' that EasyBuild uses - it lists the patches (which are included in the same directory in the repo).
I think this patch might actually be responsible for the lower fail count. I'll try to backport that to my OpenBLAS-0.3.21 build and see if that also reduces the failure count there. (N.B. that patch is written by the same person who created the issue you linked in your previous post - he's active in the EasyBuild community :))
Thanks, interesting to see that he took that approach after discussion on the LAPACK issue stalled. I'm a bit hesitant to patch the testsuite to fit OpenBLAS' needs - after all, it could be tempting to just hide actual bugs in our code that way. On the other hand, there's little incentive for the Reference-LAPACK project to fix what is working for them in the realm of their un-optimized, non-FMA implementation of LAPACK and BLAS. (And I'm sitting on the fence as a contributor to both projects.)
Still exactly the same errors, also with the patch. Taking a close look, it makes sense: the patch doesn't touch the routines that are failing here. Anyway, I guess the positive news is that there are fewer failures with 0.3.23.
Must be the fix-order-vectorization patch (which adds parentheses in the C/ZLAHQR sources as quoted in the LAPACK issue); any other changes will be due to the newer kernels in 0.3.23. And I notice that your 14 remaining failures are in QR functions as well, though double precision real. Maybe worth looking for a similar spot that needs parentheses to prevent reordering...
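To illustrate why a pair of parentheses, or more generally the order in which the compiler is allowed to evaluate and contract an expression, can push a result across a test threshold: a fused multiply-add performs a single rounding where separately evaluated operations perform two. A minimal sketch in C (not the actual C/ZLAHQR code; the constants are chosen purely to make the last-bit difference visible):

```c
/* Demonstrate that FMA contraction changes the last bits of a result.
 * Compile with e.g.:  gcc fma_demo.c -o fma_demo -lm
 * and try -ffp-contract=off vs -ffp-contract=fast for the first line.
 */
#include <math.h>
#include <stdio.h>

int main(void) {
    double a = 1.0 + ldexp(1.0, -30);     /* 1 + 2^-30                         */
    double c = -(1.0 + ldexp(1.0, -29));  /* -(1 + 2^-29), minus the rounded a*a */

    double plain = a * a + c;             /* may or may not be contracted      */
    double fused = fma(a, a, c);          /* one rounding: keeps the 2^-60 bit */

    printf("a*a + c   : %g\n", plain);    /* 0 if a*a is rounded first         */
    printf("fma(a,a,c): %g\n", fused);    /* 2^-60, about 8.7e-19              */
    return 0;
}
```

Whether the plain expression gets contracted into an FMA depends on the compiler and its -ffp-contract setting, and similar last-bit shifts appear when a compiler re-associates sums while vectorizing; the parentheses added by the patch are about constraining that kind of reordering, since a Fortran compiler has to honour them.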
You mean to say that it would explain the difference? Because this patch is applied to both our 0.3.21 and 0.3.23 builds.

I tried to re-run the test suite with OPENBLAS_VERBOSE=2 and with OPENBLAS_CORETYPE set to ARMV8, but neither seemed to make any difference.
Are you testing with a DYNAMIC_ARCH build (as I assume the easybuild ones are) at all? Both options will do nothing in a library purpose-built for a single cpu model.
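A quick way to check which kind of build is actually under test (again a sketch, assuming the program is linked against the library in question): the banner returned by openblas_get_config() lists the build options, so a DYNAMIC_ARCH build can be spotted programmatically.

```c
/* Check whether the OpenBLAS in use is a DYNAMIC_ARCH build, i.e. whether
 * OPENBLAS_CORETYPE can have any effect at runtime.
 * Build e.g. with:  gcc check_dynarch.c -o check_dynarch -lopenblas
 */
#include <stdio.h>
#include <string.h>
#include <cblas.h>   /* declares openblas_get_config() */

int main(void) {
    const char *cfg = openblas_get_config();
    printf("%s\n", cfg);
    if (strstr(cfg, "DYNAMIC_ARCH") != NULL)
        printf("DYNAMIC_ARCH build: OPENBLAS_CORETYPE overrides are honoured\n");
    else
        printf("single-target build: the kernel set was fixed at compile time\n");
    return 0;
}
```

In a single-target build (e.g. TARGET=NEOVERSEV1) there is nothing for OPENBLAS_CORETYPE to switch between, which is consistent with the environment variables appearing to do nothing.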
Sorry, completely dropped the ball on this one when I got interrupted with other responsibilities. I think you are completely right: my build targets a single cpu model rather than using DYNAMIC_ARCH, which would explain why those environment variables had no effect.

For now, I think we'll just go ahead and build the rest of our software stack with this. That might help us figure out if there is any effect of these issues 'down the line' in higher level applications. Since the BLAS test suite itself passes without errors (it is really only the Lapack test suite that shows issues), we are hopeful that it does not cause issues down the line. I'm hoping that @bartoldeman can have another look at this at some point, since he has way more expertise in assessing these issues than me :)
Seeing only a single error (in SGGES3 - probably related to test shortcomings discussed in #4032) with current develop and TARGET=NEOVERSEV1 or ARMV8SVE.
I've compiled OpenBLAS 0.3.21 on an ARM Neoverse_v1 CPU. Running the lapack test suite, I see some tests failing.
To give some idea, here is an excerpt from the test report showing some (not all) of the failures:
Excerpt (test-report output not reproduced here; the full report is in the gist linked below)
The full test report can be found in this gist.
Now, admittedly, I don't have much experience in interpreting the results of the Lapack test suite.
Are these failures cause for concern? I.e. are they small deviations (possibly due to slightly different numerical roundoff etc on this architecture)? Or do they point to a real bug? The numbers sure don't look small, but again... I don't really know what I'm looking at, so hoping to get some help here :)