Non-deterministically corrupted results from np.dot (conditional on large imports) #12394


Closed
1pakch opened this issue Nov 15, 2018 · 12 comments


1pakch commented Nov 15, 2018

I am getting randomly corrupted results from np.dot if the first argument is an array of doubles with FORTRAN layout.
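For readers unfamiliar with the term, "FORTRAN layout" means column-major storage. The following is a minimal sketch with a made-up 3x2 matrix (the shape and values are illustrative and not taken from the report). It shows how such an array can be created and recognized, and that both layouts should give the same product on a correctly working BLAS:

```python
import numpy as np

# Hypothetical small example; shape and values are illustrative only.
A = np.arange(6, dtype=np.float64).reshape(3, 2)  # C (row-major) layout by default
A_f = np.asfortranarray(A)                        # same values, column-major layout

# The flags tell the two memory layouts apart.
print(A.flags["C_CONTIGUOUS"], A.flags["F_CONTIGUOUS"])      # True False
print(A_f.flags["C_CONTIGUOUS"], A_f.flags["F_CONTIGUOUS"])  # False True

b = np.array([1.0, 2.0])

# Both arrays represent the same matrix, so the products must agree.
same = np.allclose(np.dot(A, b), np.dot(A_f, b))
print(same)
```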

The behavior is non-deterministic and does not occur unless I import a large module beforehand (e.g. seaborn or theano does the trick).

Reproducing code example:

import numpy as np

random = dict(
    int = lambda shape: np.random.randint(low=0, high=10, size=shape),
    float = lambda shape: np.random.randn(*shape)
)

def check_dot(A, b, attempts=10, eps=1e-1):
    expected = (A * b).sum(axis=1)
    for i in range(attempts):
        actual = np.dot(A, b)
        if not np.allclose(expected, actual, rtol=eps):
            return False
    return True

def test(n_cols=2):
    for n_rows in 2**np.arange(1, 16, 1):
        for type, f in random.items():
            A = f((n_rows, n_cols))
            b = f((n_cols,))
            for order in 'CF':
                Ac = np.copy(A, order=order)
                if not check_dot(Ac, b):
                    print(type, n_rows, order)
                    return Ac, b
    return None


assert test() is None
import seaborn # or import theano
Ac, b = test()

prints

float 8192 F

The error reliably occurs for matrices of larger sizes.

Examples of corrupted results (random data)

If the failure occurs and I evaluate np.dot(Ac, b) a few times

import matplotlib.pyplot as plt

n_eval = 4
f, axes = plt.subplots(n_eval, 1, sharex=True, sharey=True, figsize=(8, 1.5*n_eval))

expected = (Ac*b).sum(axis=1)
for ax in axes:
    actual  = np.dot(Ac, b)
    ax.plot(actual - expected)

the output is non-deterministic and the errors look like this:
[image: errors-randn]

Examples of corrupted results (structured data)

If I replace the normal random generator with linspace:

random = dict(
    int = lambda shape: np.random.randint(low=0, high=10, size=shape),
    float = lambda shape: np.linspace(0, 1, np.prod(shape)).reshape(shape)
)

I get error plots looking like this:
[image: errors-linspace]

Numpy/Python version information:

Since I am using numpy packaged for the nix package manager, it is easy to reproduce exactly the same Python environment on a different machine. I have not done this yet.

>> sys.version
3.6.6 (default, Jun 27 2018, 05:47:41) 
[GCC 7.3.0]

>> np.__version__
1.15.1

>> np.show_config()
blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/nix/store/blk28p4cr6r2nc7fi1c4gggiqpd7pkqy-openblas-0.3.1/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/nix/store/blk28p4cr6r2nc7fi1c4gggiqpd7pkqy-openblas-0.3.1/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/nix/store/blk28p4cr6r2nc7fi1c4gggiqpd7pkqy-openblas-0.3.1/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/nix/store/blk28p4cr6r2nc7fi1c4gggiqpd7pkqy-openblas-0.3.1/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]

Member

charris commented Nov 15, 2018

Hmm, OpenBLAS 0.3.1 is known buggy, that is why we don't use it. Where did you get NumPy?

Author

1pakch commented Nov 15, 2018

Thanks a lot!

I'm using the nix package manager. The latest stable release of its packages (18.09) happens to have OpenBLAS 0.3.1 as the default version (upstream is already at 0.3.3).

I created an issue there where I am arguing that by default nix should package numpy/scipy using exactly the same versions of dependencies as you do in your binary builds. Do I understand correctly that you are building numpy 1.15.x with OpenBLAS 0.3.0 as reported in the release notes?

Member

charris commented Nov 15, 2018

Yes, we built the wheels with 0.3.0. See the comments in MacPython/numpy-wheels@cd53070. Note that we have discovered that 0.3.0 is not thread safe, so we will be looking for 0.3.4.

Member

charris commented Nov 15, 2018

See OpenMathLib/OpenBLAS#1851 for 0.3.0 threading problem.

Member

charris commented Nov 15, 2018

See also OpenMathLib/OpenBLAS#1844.

Author

1pakch commented Nov 15, 2018

Thanks. Apparently it is indeed multithreading-related: I get no errors if I force OpenBLAS to use only one thread. I don't think seaborn creates any threads, so it's not clear why the import matters.
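The comment does not say how the single-thread setting was applied. One common way, assumed here, is the `OPENBLAS_NUM_THREADS` environment variable, which has to be set before NumPy is first imported in the process. A minimal sketch of that workaround, reusing the 8192x2 Fortran-ordered shape that failed in the report:

```python
import os

# Assumption: the thread count is limited through OpenBLAS's environment
# variable. It must be set before NumPy (and thus OpenBLAS) is first loaded.
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import numpy as np

# Same shape and layout as the failing case in the report (float, 8192 rows, F order).
A = np.asfortranarray(np.random.randn(8192, 2))
b = np.random.randn(2)

# Reference result computed without a matrix-vector BLAS call.
expected = (A * b).sum(axis=1)

# Repeat the product a few times, since the reported corruption was intermittent.
ok = all(np.allclose(expected, np.dot(A, b)) for _ in range(10))
print(ok)
```

Setting the variable in the shell before starting Python has the same effect and avoids any dependence on import order.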


brada4 commented Nov 15, 2018

You could try to apply the diff from the PR referenced in the linked issue, which is supposed to fix it. The code being modified is quite old, so the diff should apply cleanly even to not-so-recent versions. The fix is due in 0.3.4 if you can wait a bit and upgrade the binary.
There is not much point in upgrading to 0.3.3 now; everything from 0.2.15 to 0.3.3 is equally buggy.
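The version range named in this comment can be turned into a quick check. The sketch below is only an illustration: the helper names `parse_version` and `in_reported_buggy_range` are hypothetical, and the bounds are simply the 0.2.15 and 0.3.3 quoted above, not an authoritative list of affected releases. The version string itself can be read off the `library_dirs` path in the `np.show_config()` output earlier in the thread.

```python
import re


def parse_version(version):
    """Return the leading numeric components of a version string as a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"not a version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def in_reported_buggy_range(version, low="0.2.15", high="0.3.3"):
    """Hypothetical helper: True if `version` lies in the range quoted in the thread."""
    return parse_version(low) <= parse_version(version) <= parse_version(high)


# 0.3.1 is the version from the reporter's np.show_config() output.
print(in_reported_buggy_range("0.3.1"))  # True
# 0.3.4 is the release expected to carry the fix.
print(in_reported_buggy_range("0.3.4"))  # False
```

Comparing tuples of integers rather than raw strings matters here: as strings, "0.2.9" would sort after "0.2.15".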

Member

charris commented Nov 15, 2018

@brada4 What sort of schedule are you looking at for 0.3.4? I'm looking to delay NumPy 1.16 until it comes out.


brada4 commented Nov 15, 2018

@martin-frbg has a plan exactly because of this issue:
OpenMathLib/OpenBLAS#1865 (comment)
EDIT: test code uses DGEMV, thus related to OpenMathLib/OpenBLAS#1844 and not the distinct "equally ugly" issue with _GEMM OpenMathLib/OpenBLAS#1851

@charris charris added this to the 1.16.0 release milestone Nov 20, 2018
Member

charris commented Nov 20, 2018

I put a 1.16 milestone on this for tracking purposes.

Member

charris commented Dec 4, 2018

I think this has been fixed in OpenBLAS 0.3.4, so I am removing the milestone. @ilya-kolpakov you can test this now by downloading the latest numpy wheel builds from https://7933911d6844c6c53a7d-47bd50c35cd79bd838daf386af554a83.ssl.cf2.rackcdn.com/ (look for files beginning with numpy-1.16.0.dev0+20181204), or you can wait for 1.16.0rc1.

Member

seberg commented Sep 22, 2019

Considering that this is an OpenBLAS issue from more than half a year ago, which apparently was fixed upstream, I am closing this issue.

@seberg seberg closed this as completed Sep 22, 2019