scipy test failures on x86_64 with openblas 0.3.6 #2137
Comments
This is for SciPy 1.2.1 or the just-released 1.3.0, or another version? Speaking from the SciPy side, we often have to be rather selective about the exact OpenBLAS version pairing for everything to work. Sometimes a patch or two is needed after a release. This will hopefully improve over time though! |
Is " |
This is with 1.2.1. We are also seeing segfaults with 1.3.0 - http://koji.fedoraproject.org/koji/taskinfo?taskID=34942727 - I haven't yet reported that to scipy. |
If I read the overview correctly, the build failure on x86_64 coincided with the OpenBLAS update (but apparently a few other packages were updated in the same timeframe ?), but other platforms started to fail much later, despite also linking against OpenBLAS ? Can you tell what cpu model your build host uses ? |
I don't immediately recognize the error as something I've seen recently. Maybe @rgommers has seen it before. I'm pretty sure that for SciPy 1.2.1 wheels we're using an OpenBLAS commit on the 0.3.5.dev line prior to the 0.3.6 release point. For SciPy 1.3.0, wheels are using an OpenBLAS commit from the 0.3.7.dev line to deal with recent SkylakeX kernel issues. I suppose we'll have to do a 1.2.2 eventually with the 0.3.7.dev OpenBLAS as well for the same reason. See here for example: scipy/scipy#10145 Sadly, the Azure CI logs seem to have been purged already (!) so I can't check if the same failure was reported there. |
So can anybody please clarify what is the relation between "SciPy wheels" and whatever Fedora uses to build their packages. Is there any, or is Fedora compiling some sort of bare bones SciPy source with whatever OpenBLAS they see fit ? The lengthy build log originally linked above (and now gone) looked quite confusing to me, with what looked like a fortran-based BLAS getting built in between and a ton of warnings about out-of-bounds accesses in some fft library test code but no clear reference to a particular version of OpenBLAS. |
I have never seen that before. Very likely specific to OpenBLAS, even more so because the report says it's introduced with the upgrade to OpenBLAS 0.3.6. @opoplawski if you're able to bisect that to an OpenBLAS commit, that would probably be very useful. |
I gather this is a python error message, apparently either from using a wrong function prototype or a function actually not returning anything. Now in the case of OpenBLAS, I'd think it much more likely to receive an incorrect result rather than no result at all. |
The code in question is here: |
FWIW, line 280 in decomp_cholesky.py calls dpbtrf, which would imply any of SYRK, TRSM, GEMV or GEMM. If the build host for this was actually SkylakeX, this would indeed be due to the incomplete reversion of the AVX512 DGEMM kernel in 0.3.6 (#1955), and accordingly would only affect AVX512 hardware. |
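Since the SkylakeX theory only applies to AVX512 hardware, a quick check of the build host's CPU flags can rule it in or out. The following is a minimal sketch, assuming a Linux host that exposes `/proc/cpuinfo`; the helper names `cpu_flags` and `has_avx512` are hypothetical and not part of OpenBLAS or SciPy.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text):
    """Return the set of CPU feature flags from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # The first "flags" entry is enough; all cores normally report the same set.
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def has_avx512(cpuinfo_text):
    """True if the AVX512 foundation flag (avx512f) is listed."""
    return "avx512f" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")  # assumption: Linux only
    if path.exists():
        print("AVX512 capable:", has_avx512(path.read_text()))
```

If this reports no AVX512 support on the build host, the AVX512 DGEMM kernel is unlikely to be involved and the cause has to be looked for elsewhere.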
I did manage to dig up recent SciPy master branch test failures (5) under SkylakeX Intel SDE emulation on MacOS, with logs available from here: https://dev.azure.com/tylerjereddy/tyler-scipy-fork/_build/results?buildId=841
Not sure how helpful that is, but we never saw the Linux failures on SKX because our manylinux1 glibc is too old to allow the AVX512 to go through. The PR I linked above can be used as a starting point to check if that's the same problem on Linux as well, although checking the machine hardware should be faster! There's also the useful |
Unfortunately we're dealing with a couple of different build issues. The openblas 0.3.6 update appears to have broken the x86_64 build with test failures, but a later numpydoc 0.8.0 -> 0.9.1 update broke the documentation build on all arches. Fedora builds scipy (or any package) against whatever libraries are in that Fedora release. Rawhide is the rolling development for the next Fedora release, so updates of any library can occur at any time. scipy's wheels are built with specific versions of libraries that they specify. The builder reports:
Unfortunately OPENBLAS_VERBOSE=2 does not appear to produce any usable output that I can discern. Scratch build here - https://koji.fedoraproject.org/koji/taskinfo?taskID=34966483 I don't think I'm going to have any time to be able to do a bisect of the openblas commits. |
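For anyone who wants to repeat the `OPENBLAS_VERBOSE=2` experiment outside of a package build, here is a minimal sketch that runs an arbitrary command with the variable set and collects its stderr. The helper names are hypothetical, and the `Core:` line prefix is an assumption about what a DYNAMIC_ARCH build of OpenBLAS prints when it selects a kernel set; other builds may print nothing at all.

```python
import os
import subprocess
import sys


def run_with_openblas_verbose(command, level=2):
    """Run `command` with OPENBLAS_VERBOSE set; return (returncode, stderr)."""
    env = dict(os.environ)
    env["OPENBLAS_VERBOSE"] = str(level)
    result = subprocess.run(
        command,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.returncode, result.stderr


def core_lines(stderr_text):
    """Pick out lines that look like a "Core: ..." report (assumed format)."""
    return [line for line in stderr_text.splitlines() if line.startswith("Core:")]
```

A call such as `run_with_openblas_verbose([sys.executable, "-c", "import scipy.linalg"])` would then show whether the library reports anything when it is loaded; an empty result is consistent with the observation above that no usable output appeared.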
If you search for the error message, the explanation is about a function returning NULL. The only value returned here is INFO, which is zero on success (and might be interpreted as NULL in most cases)... |
Not sure what to make of this - Haswell is probably the most used platform, so any errors e.g. in the recent rewrite of assembly constraints following #2009 (that I just reviewed again) should have had more than enough time to show up in less complex environments. And I still do not understand how a numerical error in OpenBLAS could lead to that particular error message. |
Not reproducible by building scipy-1.2.1 with 0.3.6 on openSUSE 15 and running the provided testsuite via |
No problems with a gcc 9.1.0 build either (as far as the build tests and BLAS-Tester are concerned at least, I do not plan to redo the whole scipy thing unless there is additional evidence of an actual problem in OpenBLAS) |
Managed to redo the scipy build & test now as well, still no SystemError. (I did get four "ValueError" test failures from einsum with both compiler setups - if I read the report correctly these were all of the type "size of label ... for operands ... does not match previous terms" and may actually originate in the version of numpy installed on the system.) |
Any updates from either scipy or Fedora ? As I cannot reproduce the problem in my environment there is nothing for me to do except perhaps doubt that OpenBLAS is the culprit. (Though in view of #2154 you could try adding -fno-optimize-sibling-calls to your gfortran flags - for OpenBLAS' LAPACK as well as any other fortran code that scipy calls from a C interface.) |
I'm not sure what to advise here from the SciPy side other than to suggest using the same version of OpenBLAS that we use for "official" SciPy 1.2.1 wheel builds. Otherwise, it would not be hugely surprising that there could be issues that aren't patched by one of the two projects. Is there a Docker image that can be provided/used to reproduce this from the Fedora side? That might give us the traction we need, though I'm still not sure how much sense it would make to backport a fix to, e.g., SciPy 1.2.x so that it works with another version of OpenBLAS. The usual approach is to discuss a problem with the OpenBLAS team & if they find that a fix is appropriate we simply bump the OpenBLAS version (commit hash) that we build / test / release binaries with, even if not at a stable release point just yet. I'm planning to release SciPy 1.2.2 very soon (test wheel builds already under way), the next release in the LTS support series, but it will use a more recent OpenBLAS 0.3.7.dev commit for wheels because of SkylakeX AVX kernel stuff. |
Hello, I have created an environment in a Docker container, where it is fairly easy to reproduce the bug. |
May I ask for this issue to be prioritized? |
Please check #2154. Note that the reproducer should use locked versions: "rawhide" is a moving target, while, say, a "26 ISO" is not. The _pbtrf function, behind the scenes, uses an argument that is affected by a particular ABI breakage; basically all arguments after it are displaced, so essentially garbage is passed to the call.
|
@brada4 unlikely as it built fine for me with 9.1 on opensuse. Offhand I see nothing suspicious in the Fedora openblas.spec file either (their build uses TARGET=CORE2 DYNAMIC_ARCH=1 while my test above was native HASWELL but I doubt that matters). I find it a bit odd that the scipy build logs show it linking against both their pthreads (-lopenblasp) and single-threaded (-lopenblas) versions of OpenBLAS but I assume all external references would already be satisfied by the former so the latter should get ignored (?) |
I think it links to both at times, at least did so with tatlas.so and satlas.so a while ago |
@Dormouse759 with your Dockerfile, I get a number of package conflicts involving libgit, python3, python3-rpm and libgomp, and the scipy build subsequently bombs out somewhere around the messagestream.map creation with what looks like python version conflicts, culminating in several error messages "object of type 'type' has no len() in evaluating len(list)". |
@martin-frbg Thank you for letting me know, this uses rawhide and a copr repo, where we try to port things to 3.8, conflicts emerge eventually. I will do my best to provide a container with locked versions, so this doesn't happen. I am aware that asking does not create dev time magically, I don't expect everyone to magically start working on it, and I respect your decision to work on other issues first. I only want to point out that this is not a small issue and blocks many packages in Fedora. |
@Dormouse759 could you try in your CI to build the latest combination with f30 (gcc901) and f29 (gcc8) so that we all know if gcc9 is the culprit or not? |
FYI, there are no developers working on this project as a full-time job, this project is maintained by a handful of users/volunteers/passersby. Mostly people just fix whatever affects or annoys them, if they have the time and competence. |
I have updated the reproducer so it's less prone to breakage. |
I have tested with gcc-8.3.1-2.fc29.x86_64 and the issue still persists. |
Not sure about this statement. The Fedora CI log shows success with OpenBLAS 0.3.5 on f29 and failures with netlib lapack 3.8.0 on f30 and rawhide. OpenBLAS includes a great deal of that code too. |
@brada4 I have only used an older version of gcc. All other packages were used from rawhide. That means: |
So - make a wild guess if it was that lapack or same lapack code copied inside OpenBLAS..... |
@brada4 I don't think I understand what you are pointing out. |
The GCC9 fortran ABI issue breaks LAPACK 3.8.0, the reference implementation and openblas copy thereof alike. |
Trouble with your theory is that #2154 (actually Reference-LAPACK issue 339) is not new to 3.8.0 but as old as LAPACKE, only GCC has slowly (and probably at first inadvertently) become much more strict in its enforcements of formal standards. Actually the post-9.1 GCC now has a workaround in place to avoid breaking all the legacy codes out there. |
It will take some months to land in released gcc |
Certainly. The point is that this particular problem must have existed with any netlib lapack and any gcc version since gcc7 or so, and from what I read it seems to have led to sporadic, hard to reproduce errors only. |
The thing is, all the components involved gradually changed for the worse; there is no case, with everything else frozen, where 0.3.5 -> 0.3.6 (or Netlib -> 0.3.6) is the breaking change. |
Unfortunately the build still fails in the same place for me, although this time there were no package conflicts. (I do see a warning about an invalid include path for "python3.7m" before the "object of type type has no len" error). |
Note that if I ignore the apparent scipy build error from the docker build phase and try to run the reproducer by |
I see the problem of linking against both the serial and parallel version of OpenBLAS has been noted in https://bugzilla.redhat.com/show_bug.cgi?id=1709161 already (which was opened before this ticket here) but I do not see any indication there whether fixing this solved the "returned NULL". And I think I have got the docker setup to work now - however, as soon as I replace the fedora-supplied libopenblas.so and libopenblasp.so with a locally built copy of libopenblasp.so (no matter if 0.3.5, 0.3.6 or develop) the repro.py executes without errors (checked by adding print statements inside the script). Just copying the fedora libopenblasp.so over its libopenblas.so counterpart or vice versa does not make the error go away. |
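To see which OpenBLAS shared objects actually end up mapped into a running process, and whether the serial and pthreads variants are both present, the memory map can be inspected. This is a minimal sketch, assuming a Linux `/proc/<pid>/maps` layout and the `libopenblas` / `libopenblasp` file naming mentioned in this thread; the function names are hypothetical.

```python
import os


def openblas_libraries(maps_text):
    """Return sorted unique paths of mapped files named libopenblas*."""
    found = set()
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, optional pathname.
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no path
        path = fields[5].strip()
        if os.path.basename(path).startswith("libopenblas"):
            found.add(path)
    return sorted(found)


def mixes_serial_and_threaded(paths):
    """True if both a serial (libopenblas) and a pthreads (libopenblasp) copy appear."""
    names = [os.path.basename(p) for p in paths]
    serial = any(n.startswith(("libopenblas.", "libopenblas-")) for n in names)
    threaded = any(n.startswith("libopenblasp") for n in names)
    return serial and threaded
```

In practice one would import the module under test first (for example `scipy.linalg`), then pass the contents of `/proc/self/maps` to `openblas_libraries`. This only shows what is loaded, not which copy satisfies a given symbol.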
Building my own 0.3.6 inside the container (with just |
Thank you for your help with the issue. |
I think this issue should stay tracked downstream; there is nothing an OpenBLAS code change could help with. |
I intend to keep this open until we can be certain there is no significant (legitimate) difference in build options between the fedora package(s) and my build. My bet however is on this being some fundamental incompatibility that prevented loading of "their" libopenblas at a very early stage. |
I'll try the attached docker setup. The rpm bundle gets built with the system's hardening/optimizing flags, so I'll kind of bisect those. |
F30 adds |
Was this fixed? I can't reproduce with the Dockerfile |
@isuruf I do not think so - at least the related fedora tickets are unchanged, as is the build status on koschei. So you can run the reproduce.sh inside the docker container without getting the error output ? (You will not see any test run during the docker build, as the scipy build does not complete) |
Yes. |
Curious. I could reproduce it on every invocation (when I set up the environment to try different cpu kernels) until I replaced the library. And to me it looked as if it failed as soon as it tried to load. What hardware did you use for your test ? |
i7-8750H |
Seems like there has been some hiccup in the Fedora systems. This means the Fedora issues have most likely been resolved. I will test it and report back.
|
At least the koschei builds seem to be failing for other reasons (such as missing input files for some plot routines apparently) since Jul 02... |
This issue has been resolved for Fedora. |
Thanks for providing the Docker container, which was very instrumental. |
Since the introduction of openblas 0.3.6 to Fedora rawhide, we're seeing test failures on scipy builds on x86_64. See https://apps.fedoraproject.org/koschei/package/scipy
and in particular: https://kojipkgs.fedoraproject.org/work/tasks/3609/34623609/build.log
Example:
Perhaps affected some other packages as well: https://apps.fedoraproject.org/koschei/affected-by/openblas-devel?epoch1=0&version1=0.3.5&release1=5.fc31&epoch2=0&version2=0.3.6&release2=1.fc31&collection=f31