scipy test failures on x86_64 with openblas 0.3.6 #2137

Closed
opoplawski opened this issue May 19, 2019 · 58 comments

@opoplawski

Since the introduction of openblas 0.3.6 to Fedora rawhide, we're seeing test failures on scipy builds on x86_64. See https://apps.fedoraproject.org/koschei/package/scipy
and in particular: https://kojipkgs.fedoraproject.org/work/tasks/3609/34623609/build.log

Example:

______________________________ TestLSQ.test_lstsq ______________________________
self = <scipy.interpolate.tests.test_bsplines.TestLSQ object at 0x7ff594251cf8>
    def test_lstsq(self):
        # check LSQ construction vs a full matrix version
        x, y, t, k = self.x, self.y, self.t, self.k
    
        c0, AY = make_lsq_full_matrix(x, y, t, k)
>       b = make_lsq_spline(x, y, t, k)
scipy/interpolate/tests/test_bsplines.py:1182: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
scipy/interpolate/_bsplines.py:1017: in make_lsq_spline
    check_finite=check_finite)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
ab = array([[1.00189868e+00, 7.52101832e-01, 5.38951295e-01, 6.89409012e-01,
        6.23893499e-01, 7.79037069e-01, 6.1263...e-04, 5.93585425e-04,
        2.05099058e-03, 2.58394962e-03, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00]])
overwrite_ab = True, lower = True, check_finite = True
    def cholesky_banded(ab, overwrite_ab=False, lower=False, check_finite=True):
        """
        Cholesky decompose a banded Hermitian positive-definite matrix
    
        The matrix a is stored in ab either in lower diagonal or upper
        diagonal ordered form::
    
            ab[u + i - j, j] == a[i,j]        (if upper form; i <= j)
            ab[    i - j, j] == a[i,j]        (if lower form; i >= j)
    
        Example of ab (shape of a is (6,6), u=2)::
    
            upper form:
            *   *   a02 a13 a24 a35
            *   a01 a12 a23 a34 a45
            a00 a11 a22 a33 a44 a55
    
            lower form:
            a00 a11 a22 a33 a44 a55
            a10 a21 a32 a43 a54 *
            a20 a31 a42 a53 *   *
    
        Parameters
        ----------
        ab : (u + 1, M) array_like
            Banded matrix
        overwrite_ab : bool, optional
            Discard data in ab (may enhance performance)
        lower : bool, optional
            Is the matrix in the lower form. (Default is upper form)
        check_finite : bool, optional
            Whether to check that the input matrix contains only finite numbers.
            Disabling may give a performance gain, but may result in problems
            (crashes, non-termination) if the inputs do contain infinities or NaNs.
    
        Returns
        -------
        c : (u + 1, M) ndarray
            Cholesky factorization of a, in the same banded format as ab
    
        See also
        --------
        cho_solve_banded : Solve a linear set equations, given the Cholesky factorization
                    of a banded hermitian.
    
        Examples
        --------
        >>> from scipy.linalg import cholesky_banded
        >>> from numpy import allclose, zeros, diag
        >>> Ab = np.array([[0, 0, 1j, 2, 3j], [0, -1, -2, 3, 4], [9, 8, 7, 6, 9]])
        >>> A = np.diag(Ab[0,2:], k=2) + np.diag(Ab[1,1:], k=1)
        >>> A = A + A.conj().T + np.diag(Ab[2, :])
        >>> c = cholesky_banded(Ab)
        >>> C = np.diag(c[0, 2:], k=2) + np.diag(c[1, 1:], k=1) + np.diag(c[2, :])
        >>> np.allclose(C.conj().T @ C - A, np.zeros((5, 5)))
        True
    
        """
        if check_finite:
            ab = asarray_chkfinite(ab)
        else:
            ab = asarray(ab)
    
        pbtrf, = get_lapack_funcs(('pbtrf',), (ab,))
>       c, info = pbtrf(ab, lower=lower, overwrite_ab=overwrite_ab)
E       SystemError: <fortran object> returned NULL without setting an error
scipy/linalg/decomp_cholesky.py:280: SystemError

Perhaps affected some other packages as well: https://apps.fedoraproject.org/koschei/affected-by/openblas-devel?epoch1=0&version1=0.3.5&release1=5.fc31&epoch2=0&version2=0.3.6&release2=1.fc31&collection=f31
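
For anyone trying to reproduce this outside the full test suite, here is a minimal sketch that exercises the same code path (cholesky_banded with lower=True, which dispatches to pbtrf). The helper to_lower_banded and the 4x4 tridiagonal test matrix are made up for illustration and are not taken from the scipy tests; the scipy call is skipped if scipy is not installed. On an affected setup the call would be expected to raise the SystemError shown above, though that is an assumption and not something verified here.

```python
def to_lower_banded(a, u):
    """Pack a symmetric matrix (list of lists) into lower banded form.

    Follows the layout quoted in the docstring above:
    ab[i - j][j] == a[i][j] for i >= j, keeping u sub-diagonals.
    """
    m = len(a)
    ab = [[0.0] * m for _ in range(u + 1)]
    for j in range(m):
        for i in range(j, min(j + u + 1, m)):
            ab[i - j][j] = a[i][j]
    return ab


def try_cholesky_banded(ab):
    """Call scipy.linalg.cholesky_banded if scipy is available, else return None."""
    try:
        import numpy as np
        from scipy.linalg import cholesky_banded
    except ImportError:
        return None
    return cholesky_banded(np.array(ab), lower=True)


# Hypothetical test matrix: symmetric, diagonally dominant, tridiagonal.
a = [
    [4.0, 1.0, 0.0, 0.0],
    [1.0, 4.0, 1.0, 0.0],
    [0.0, 1.0, 4.0, 1.0],
    [0.0, 0.0, 1.0, 4.0],
]
ab = to_lower_banded(a, u=1)
print(ab)
print(try_cholesky_banded(ab))
```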

@tylerjereddy
Contributor

This is for SciPy 1.2.1 or the just-released 1.3.0, or another version? Speaking from the SciPy side, we often have to be rather selective about the exact OpenBLAS version pairing for everything to work. Sometimes a patch or two is needed after a release. This will hopefully improve over time though!

@martin-frbg
Collaborator

martin-frbg commented May 20, 2019

Is "SystemError: <fortran object> returned NULL without setting an error" indicative of an OpenBLAS error at all, or just a more general build failure ? (If it is, it would be helpful to know the cpu or OpenBLAS cpu target this is observed with )

@opoplawski
Author

This is with 1.2.1. We are also seeing segfaults with 1.3.0 - http://koji.fedoraproject.org/koji/taskinfo?taskID=34942727 - I haven't yet reported that to scipy.

@martin-frbg
Collaborator

If I read the overview correctly, the build failure on x86_64 coincided with the OpenBLAS update (but apparently a few other packages were updated in the same timeframe ?), but other platforms started to fail much later, despite also linking against OpenBLAS ? Can you tell what cpu model your build host uses ?

@tylerjereddy
Contributor

I don't immediately recognize the error as something I've seen recently. Maybe @rgommers has seen it before.

I'm pretty sure that for SciPy 1.2.1 wheels we're using an OpenBLAS commit on the 0.3.5.dev line prior to the 0.3.6 release point. For SciPy 1.3.0, wheels are using an OpenBLAS commit from the 0.3.7.dev line to deal with recent SkylakeX kernel issues.

I suppose we'll have to do a 1.2.2 eventually with the 0.3.7.dev OpenBLAS as well for the same reason. See here for example: scipy/scipy#10145 Sadly, the Azure CI logs seem to have been purged already (!) so I can't check if the same failure was reported there.

@martin-frbg
Collaborator

I'm pretty sure that for SciPy 1.2.1 wheels we

So can anybody please clarify what is the relation between "SciPy wheels" and whatever Fedora uses to build their packages. Is there any, or is Fedora compiling some sort of bare bones SciPy source with whatever OpenBLAS they see fit ? The lengthy build log originally linked above (and now gone) looked quite confusing to me, with what looked like a fortran-based BLAS getting built in between and a ton of warnings about out-of-bounds accesses in some fft library test code but no clear reference to a particular version of OpenBLAS.

@rgommers
Contributor

Is "SystemError: <fortran object> returned NULL without setting an error" indicative of an OpenBLAS error at all, or just a more general build failure

I have never seen that before. Very likely specific to OpenBLAS, even more so because the report says it's introduced with the upgrade to OpenBLAS 0.3.6

@opoplawski if you're able to bisect that to an OpenBLAS commit, that would probably be very useful.

@martin-frbg
Collaborator

I gather this is a python error message, apparently either from using a wrong function prototype or a function actually not returning anything. Now in the case of OpenBLAS, I'd think it much more likely to receive an incorrect result rather than no result at all.

@brada4
Contributor

brada4 commented May 20, 2019

The code in question is here:
https://github.com/scipy/scipy/blob/02b0001af4d7125a390556c227578d6cfd06d4e2/scipy/linalg/decomp_cholesky.py#L280
Does it work with older python, it seems mainline python gets about as picky as pypy regarding external call prototypes.

@martin-frbg
Collaborator

FWIW, line 280 in decomp_cholesky.py calls dpbtrf, which would imply any of SYRK, TRSM, GEMV or GEMM. If the build host for this was actually SkylakeX, this would indeed be due to the incomplete reversion of the AVX512 DGEMM kernel in 0.3.6 (#1955, and accordingly only affect AVX512 hardware).

@tylerjereddy
Contributor

I did manage to dig up recent SciPy master branch test failures (5) under SkylakeX Intel SDE emulation on MacOS, with logs available from here: https://dev.azure.com/tylerjereddy/tyler-scipy-fork/_build/results?buildId=841

  1. test_solve_discrete_are in scipy/linalg/tests/test_solvers.py
  2. TestQR.test_random_tall_right in scipy/linalg/tests/test_decomp.py
  3. TestQR.test_random_tall_left in scipy/linalg/tests/test_decomp.py
  4. test_pinv_pinv2_comparison in scipy/linalg/tests/test_basic.py
  5. TestLstsq.test_random_exact in scipy/linalg/tests/test_basic.py

Not sure how helpful that is, but we never saw the Linux failures on SKX because our manylinux1 glibc is too old to allow the AVX512 to go through. The PR I linked above can be used as a starting point to check if that's the same problem on Linux as well, although checking the machine hardware should be faster!

There's also the useful OPENBLAS_VERBOSE=2 and, e.g., OPENBLAS_CORETYPE=haswell to try to avoid the kernel issue if you're bound to the same hardware & OpenBLAS version for some reason.
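
A rough sketch of how those two variables could be applied when running a test snippet. They need to be in the environment before OpenBLAS is loaded, so this launches a fresh interpreter rather than setting them after import. The helper names openblas_env and run_with_openblas_env are hypothetical, and OPENBLAS_CORETYPE is only expected to have an effect with a DYNAMIC_ARCH build of OpenBLAS.

```python
import os
import subprocess
import sys


def openblas_env(coretype=None, verbose=2, base=None):
    """Return a copy of the environment with the OpenBLAS variables set."""
    env = dict(os.environ if base is None else base)
    env["OPENBLAS_VERBOSE"] = str(verbose)
    if coretype is not None:
        env["OPENBLAS_CORETYPE"] = coretype
    return env


def run_with_openblas_env(code, coretype=None, verbose=2):
    """Run a Python snippet in a child process with the variables applied."""
    return subprocess.run(
        [sys.executable, "-c", code],
        env=openblas_env(coretype, verbose),
        capture_output=True,
        text=True,
    )


if __name__ == "__main__":
    # The snippet is only an example; any code that loads OpenBLAS would do.
    result = run_with_openblas_env("import scipy.linalg", coretype="Haswell")
    # With OPENBLAS_VERBOSE=2, OpenBLAS reports the selected core on stderr.
    print(result.stderr)
```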

@opoplawski
Author

opoplawski commented May 21, 2019

Unfortunately we're dealing with a couple of different build issues. The openblas 0.3.6 update appears to have broken the x86_64 build with test failures, but a later numpydoc 0.8.0 -> 0.9.1 update broke the documentation build on all arches.

Fedora builds scipy (or any package) against whatever libraries are in that Fedora release. Rawhide is the rolling development for the next Fedora release, so updates of any library can occur at any time. scipy's wheels are built with specific versions of libraries that they specify.

The builder reports:

model name	: Intel Core Processor (Haswell, no TSX, IBRS)
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm invpcid_single pti ibrs ibpb fsgsbase bmi1 avx2 smep bmi2 erms invpcid xsaveopt arat
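
As a quick check of whether the SkylakeX (AVX512) kernel question applies to a given host, something like the following sketch could be used. It looks for the avx512f flag in /proc/cpuinfo-style text; the function names are made up for illustration and the sample string is an abbreviated version of the builder output above.

```python
def parse_cpu_flags(cpuinfo_text):
    """Return the set of CPU flags from the first 'flags' line, or an empty set."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def has_avx512(flags):
    """AVX512 foundation support is advertised as 'avx512f'."""
    return "avx512f" in flags


# Abbreviated from the builder report above (most flags omitted).
sample = (
    "model name\t: Intel Core Processor (Haswell, no TSX, IBRS)\n"
    "flags\t\t: fpu vme sse sse2 fma avx f16c bmi1 avx2 bmi2\n"
)
flags = parse_cpu_flags(sample)
print(has_avx512(flags))  # False: no avx512f among the flags
```

On a live Linux system the same functions can be fed the contents of /proc/cpuinfo.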

Unfortunately OPENBLAS_VERBOSE=2 does not appear to produce any usable output that I can discern. Scratch build here - https://koji.fedoraproject.org/koji/taskinfo?taskID=34966483

I don't think I'm going to have any time to be able to do a bisect of the openblas commits.

@brada4
Contributor

brada4 commented May 21, 2019

If you search Google for SystemError: <fortran object> returned NULL you get a lot of hits against python 3.7, also in non-numeric libs, all around foreign library imports. I am just wondering whether the problem is the same in python 3.6 or not.

The message is about returning NULL - the only return value is INFO, which is zero (and might be interpreted as NULL in most cases) in case of success...

@martin-frbg
Collaborator

Not sure what to make of this - Haswell is probably the most used platform, so any errors e.g. in the recent rewrite of assembly constraints following #2009 (that I just reviewed again) should have had more than enough time to show up in less complex environments. And I still do not understand how a numerical error in OpenBLAS could lead to that particular error message.

@martin-frbg
Collaborator

Not reproducible by building scipy-1.2.1 with 0.3.6 on openSUSE 15 and running the provided testsuite via python runtests.py. (I do note that this build was done with gcc 7.3.1 while rawhide seems to be using some flavor of 9.1. I am not aware of any miscompilations with 9.x since #2009 was fixed - though you had seen some transient failures there back in February.)

@martin-frbg
Collaborator

martin-frbg commented May 22, 2019

No problems with a gcc 9.1.0 build either (as far as the build tests and BLAS-Tester are concerned at least, I do not plan to redo the whole scipy thing unless there is additional evidence of an actual problem in OpenBLAS)
(One temporary distraction was caused by my first 0.3.6 build somehow still picking the older gfortran,
leading to a mixed gcc7.3/9.1 build that linked to libgfortran4 instead of libgfortran5 and promptly bombed out on the first lapack test. I wonder if a segfault in a library call would lead to this "SystemError - returned NULL" effect in python ?)

@martin-frbg
Collaborator

Managed to redo the scipy build&test now as well, still no SystemError. (I did get four "ValueError" test failures from einsum with both compiler setups - if I read the report correctly these were all of the type "size of label ... for operands ... does not match previous terms" and may actually originate in the version of numpy installed on the system.)

@martin-frbg
Collaborator

Any updates from either scipy or Fedora ? As I cannot reproduce the problem in my environment there is nothing for me to do except perhaps doubt that OpenBLAS is the culprit. (Though in view of #2154 you could try adding -fno-optimize-sibling-calls to your gfortran flags - for OpenBLAS' LAPACK as well as any other fortran code that scipy calls from a C interface.)

@tylerjereddy
Contributor

I'm not sure what to advise here from the SciPy side other than to suggest using the same version of OpenBLAS that we use for "official" SciPy 1.2.1 wheel builds. Otherwise, it would not be hugely surprising that there could be issues that aren't patched by one of the two projects.

Is there a Docker image that can be provided/used to reproduce this from the Fedora side? That might give us the traction we need, though I'm still not sure how much sense it would make to backport a fix to, e.g., SciPy 1.2.x so that it works with another version of OpenBLAS.

The usual approach is to discuss a problem with OpenBLAS team & if they find that a fix is appropriate we simply bump the OpenBLAS version (commit hash) that we build/ test / release binaries with, even if not at a stable release point just yet.

I'm planning to release SciPy 1.2.2 very soon (test wheel builds already under way), the next release in the LTS support series, but it will use a more recent OpenBLAS 0.3.7.dev commit for wheels because of SkylakeX AVX kernel stuff.

@ghost

ghost commented Jun 18, 2019

Hello, I have created an environment in a Docker container, where it is fairly easy to reproduce the bug.
If you need help and faster communication via a side channel, please find me on IRC Freenode/#fedora-python. (mplch)
The Dockerfile can be found here: https://github.com/Dormouse759/fedora-OpenBLAS-reproducer

@ghost

ghost commented Jun 19, 2019

May I ask for this issue to be prioritized?
This is blocking rather a large set of packages in Fedora depending on SciPy to be rebuilt against Python 3.8 and we are not sure what problems will emerge after SciPy gets fixed.
Thank you for any help provided.

@brada4
Contributor

brada4 commented Jun 19, 2019

Please check #2154

Note that the reproducer should use locked versions: "rawhide" is a moving target, whereas, say, "26 ISO" is not.

The _pbtrf function, behind the scenes, uses an argument that is affected by this particular ABI breakage: basically all arguments after it are displaced, so essentially garbage is passed to the call.

      UPLO is CHARACTER*1
      = 'U':  Upper triangle of A is stored;
      = 'L':  Lower triangle of A is stored.

@martin-frbg
Collaborator

@brada4 unlikely as it built fine for me with 9.1 on opensuse. Offhand I see nothing suspicious in the Fedora openblas.spec file either (their build uses TARGET=CORE2 DYNAMIC_ARCH=1 while my test above was native HASWELL but I doubt that matters). I find it a bit odd that the scipy build logs show it linking against both their pthreads (-lopenblasp) and single-threaded (-lopenblas) versions of OpenBLAS but I assume all external references would already be satisfied by the former so the latter should get ignored (?)
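
One way to see which OpenBLAS flavors actually end up mapped into the Python process (Linux only) is to scan /proc/self/maps after importing scipy. This is just a sketch; the function name loaded_openblas_libs is made up, and matching on "openblas" in the file name is an assumption about how the libraries are named on the system.

```python
import os


def loaded_openblas_libs(maps_text):
    """Return sorted paths of mapped files whose name contains 'openblas'."""
    libs = set()
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) == 6:
            path = fields[5].strip()
            if "openblas" in os.path.basename(path):
                libs.add(path)
    return sorted(libs)


if __name__ == "__main__":
    try:
        import scipy.linalg  # noqa: F401  (pulls in the BLAS/LAPACK library)
    except ImportError:
        pass
    if os.path.exists("/proc/self/maps"):
        with open("/proc/self/maps") as fh:
            for lib in loaded_openblas_libs(fh.read()):
                print(lib)
```

If both libopenblas and libopenblasp show up in the output, both copies were loaded into the same process.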

@brada4
Contributor

brada4 commented Jun 20, 2019

I think it links to both at times, at least did so with tatlas.so and satlas.so a while ago

@martin-frbg
Collaborator

@Dormouse759 with your Dockerfile, I get a number of package conflicts involving libgit, python3, python3-rpm and libgomp, and the scipy build subsequently bombs out somewhere around the messagestream.map creation with what looks like python version conflicts, culminating in several error messages "object of type 'type' has no len() in evaluating len(list)".
(NB you may ask for priority all you like, but that does not magically create developer time out of thin air.)

@ghost

ghost commented Jun 20, 2019

@martin-frbg Thank you for letting me know, this uses rawhide and a copr repo, where we try to port things to 3.8, conflicts emerge eventually. I will do my best to provide a container with locked versions, so this doesn't happen.

I am aware that asking does not create dev time magically, I don't expect everyone to magically start working on it, and I respect your decision to work on other issues first. I only want to point out that this is not a small issue and blocks many packages in Fedora.

@brada4
Contributor

brada4 commented Jun 20, 2019

@Dormouse759 could you try in your CI to build latest combination with f30 (gcc901) and f29 (gcc8) so that we all know if gcc9 is culprit or not?

@TiborGY
Contributor

TiborGY commented Jun 21, 2019

I am aware that asking does not create dev time magically, I don't expect everyone to magically start working on it, and I respect your decision to work on other issues first. I only want to point out that this is not a small issue and blocks many packages in Fedora.

FYI, there are no developers working on this project as a full-time job, this project is maintained by a handful of users/volunteers/passersby. Mostly people just fix whatever affects or annoys them, if they have the time and competence.

@ghost

ghost commented Jun 26, 2019

I have updated the reproducer so it's less prone to breakage.
Note that it is now using Python 3.7, where the issue is reproducible too.
I still haven't tested different gcc versions. I will do so and report the result here.

@ghost

ghost commented Jun 27, 2019

I have tested with gcc-8.3.1-2.fc29.x86_64 and the issue still persists.

@brada4
Contributor

brada4 commented Jun 28, 2019

Not sure about this statement. Fedora CI log shows success with OpenBLAS 0.3.5 f29 and failures with netlib lapack 3.8.0 on f30 and rawhide. OpenBLAS includes great deal of that code too.

@ghost

ghost commented Jun 28, 2019

@brada4 I only have used older version of gcc. All other packages were used from rawhide. That means:
openblas-0.3.6-1.fc31.x86_64
lapack-3.8.0-12.fc31.x86_64

@brada4
Contributor

brada4 commented Jun 28, 2019

So - make a wild guess if it was that lapack or same lapack code copied inside OpenBLAS.....

@ghost

ghost commented Jun 28, 2019

@brada4 I don't think I understand what you are pointing out.

@brada4
Contributor

brada4 commented Jun 28, 2019

The GCC9 fortran ABI issue breaks LAPACK 3.8.0, the reference implementation and openblas copy thereof alike.
Fedora CI shows breakage without involving OpenBLAS.
Could you make a situation where scipy tests pass with reference LAPACK but breaks with OpenBLAS? Current koji logs show the opposite.

@martin-frbg
Collaborator

martin-frbg commented Jun 28, 2019

Trouble with your theory is that #2154 (actually Reference-LAPACK issue 339) is not new to 3.8.0 but as old as LAPACKE, only GCC has slowly (and probably at first inadvertently) become much more strict in its enforcements of formal standards. Actually the post-9.1 GCC now has a workaround in place to avoid breaking all the legacy codes out there.

@brada4
Contributor

brada4 commented Jun 28, 2019

It will take some months to land in released gcc

@martin-frbg
Collaborator

Certainly. The point is that this particular problem must have existed with any netlib lapack and any gcc version since gcc7 or so, and from what I read it seems to have led to sporadic, hard to reproduce errors only.
If it was the culprit here, the same faults should already have occurred with 0.3.5.

@brada4
Contributor

brada4 commented Jun 29, 2019

The thing is all involved components gradually changed to no good, there is no case with rest frozen where 0.3.5->0.3.6 is the breaking change (or Netlib-> 0.3.6)
We do not have clairvoyance to see if all future import will be new or old ABI, I suspect it is the case for changing function naming in ABI or even of libblas.so.4

@martin-frbg
Collaborator

martin-frbg commented Jun 29, 2019

Unfortunately the build still fails in the same place for me, although this time there were no package conflicts. (I do see a warning about an invalid include path for "python3.7m" before the "object of type type has no len" error.)

@martin-frbg
Collaborator

Note that if I ignore the apparent scipy build error from the docker build phase and try to run the reproducer by docker run _imagename_ /reproducer.sh I do get the "<fortran object> returned NULL" no matter which OPENBLAS_CORETYPE I pass via the --env parameter. But any attempt to load scipy or individual modules in basically the same way as the repro scripts ends with a "not found".

@martin-frbg
Collaborator

martin-frbg commented Jun 30, 2019

I see the problem of linking against both the serial and parallel version of OpenBLAS has been noted in https://bugzilla.redhat.com/show_bug.cgi?id=1709161 already (which was opened before this ticket here) but I do not see any indication there whether fixing this solved the "returned NULL".

And I think I have got the docker setup to work now - however, as soon as I replace the fedora-supplied libopenblas.so and libopenblasp.so with a locally built copy of libopenblasp.so (no matter if 0.3.5, 0.3.6 or develop) the repro.py executes without errors (checked by adding print statements inside the script). Just copying the fedora libopenblasp.so over its libopenblas.so counterpart or vice versa does not make the error go away.

@martin-frbg
Collaborator

Building my own 0.3.6 inside the container (with just make DYNAMIC_ARCH=1 DEBUG=1, optionally also with USE_OPENMP=1) and copying that over the original /usr/lib64/libopenblasp.so and libopenblas.so removes the error as well. Suggest you consult with the fedora openblas maintainer (if there is such a person)...
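
When swapping libraries around like this, it can help to ask each copy how it was built. The sketch below loads a shared library with ctypes and calls openblas_get_config() and openblas_get_corename(), which OpenBLAS exports; the helper name query_openblas and the library file names in the usage part are assumptions and may differ between distributions.

```python
import ctypes


def query_openblas(libname="libopenblas.so.0"):
    """Return build config and detected core of an OpenBLAS library.

    Returns None if the library cannot be loaded; a value is None if the
    corresponding symbol is missing from the library.
    """
    try:
        lib = ctypes.CDLL(libname)
    except OSError:
        return None
    info = {}
    for key, symbol in (
        ("config", "openblas_get_config"),
        ("corename", "openblas_get_corename"),
    ):
        try:
            func = getattr(lib, symbol)
        except AttributeError:
            info[key] = None
            continue
        func.restype = ctypes.c_char_p
        func.argtypes = []
        value = func()
        info[key] = value.decode() if value else None
    return info


if __name__ == "__main__":
    # File names are assumptions based on the libraries mentioned above.
    for name in ("libopenblas.so.0", "libopenblasp.so.0"):
        print(name, query_openblas(name))
```

Comparing the reported config strings of the distribution package and a local build would be one way to spot differences in build options.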

@ghost

ghost commented Jul 1, 2019

Thank you for your help with the issue.
I have contacted OpenBLAS maintainer in the downstream issue: https://bugzilla.redhat.com/show_bug.cgi?id=1606315

@brada4
Contributor

brada4 commented Jul 1, 2019

I think this issue stays tracked downstream; there is nothing an openblas code change could help with.

@martin-frbg
Collaborator

I intend to keep this open until we can be certain there is no significant (legitimate) difference in build options between the fedora package(s) and my build. My bet however is on this being some fundamental incompatibility that prevented loading of "their" libopenblas at a very early stage.

@brada4
Contributor

brada4 commented Jul 1, 2019

I'll try the attached docker; the rpm bundle gets built with the hardening/optimizing flags of the system, so I will kind of bisect those.

@brada4
Contributor

brada4 commented Jul 3, 2019

F30 adds -Wl,--as-needed to LD flags, no symbol found missing.

@isuruf
Contributor

isuruf commented Jul 3, 2019

Was this fixed? I can't reproduce with the Dockerfile

@martin-frbg
Collaborator

@isuruf I do not think so - at least the related fedora tickets are unchanged, as is the build status on koschei. So you can run the reproduce.sh inside the docker container without getting the error output ? (You will not see any test run during the docker build, as the scipy build does not complete)

@isuruf
Contributor

isuruf commented Jul 3, 2019

So you can run the reproduce.sh inside the docker container without getting the error output ?

Yes.

@martin-frbg
Collaborator

Curious. I could reproduce it on every invocation (when I set up the environment to try different cpu kernels) until I replaced the library. And to me it looked as if it failed as soon as it tried to load. What hardware did you use for your test?

@isuruf
Contributor

isuruf commented Jul 3, 2019

i7-8750H

@ghost

ghost commented Jul 4, 2019

Seems like there has been some hiccup in Fedora systems.

This means Fedora issues have been most likely resolved. I will test it and report back.

  • Rebuild since older build doesn't show up in updates system.

@martin-frbg
Collaborator

At least the koschei builds seem to be failing for other reasons (such as missing input files for some plot routines apparently) since Jul 02...

@ghost

ghost commented Jul 8, 2019

This issue has been resolved for Fedora.
Thank you all for your help, you guys are amazing.

@martin-frbg
Collaborator

Thanks for providing the Docker container, which was very instrumental.
