nan in svd N for Haswell core #3318

matthew-brett · 2021-07-16T12:39:15Z

Please forgive me for this bug-report - coming from Numpy - but I think we've hit a bug in the singular value calculation for the Haswell core type - see numpy/numpy#19473

Specifically, only on the Haswell core, we are getting unexpected nan values in the case where we compute the singular values without the UV matrices. The files to reproduce in Numpy are here:
svd_check.zip

The check file is svd_check.py in the archive - and contains:

import numpy as np
from numpy.linalg._umath_linalg import svd_n

mat = np.loadtxt('1.csv', delimiter=',')
s = svd_n(mat, signature='d->d')
print(np.count_nonzero(np.isnan(s)))

We run with many cores using the test_svds.sh script:

#!/bin/bash
export OPENBLAS_VERBOSE=2
for core in prescott dunnington penryn core2 nehalem sandybridge haswell; do
    export OPENBLAS_CORETYPE=$core
    python svd_check.py
done

This gives 5 nan values for the Haswell core, and 0 otherwise. If we compute the UV matrices at the same time, we never get nan in the SVs, regardless of core.

Could this be a bug in the SV-only computation in OpenBLAS?

The text was updated successfully, but these errors were encountered:

martin-frbg · 2021-07-16T12:48:53Z

which OpenBLAS version is this please ?

matthew-brett · 2021-07-16T13:33:50Z

@mattip - I'm very sorry - I lost track of the versions - this is for Numpy 1.21.0 - what version of OpenBLAS is this? What is the best way to test against OpenBLAS 0.3.17?

martin-frbg · 2021-07-16T13:59:59Z

Guessing 1.21 would be using 0.3.15, so probably no improvement from going to 0.3.17. If it only happens on Haswell it is less likely to be a bug in LAPACK but the key to the different behaviour with and without UV (if that is what I think it is) may be in LAPACK DGESDD using a different algorithm when the singular vectors are needed.

mattip · 2021-07-16T14:02:13Z

What is the best way to test against OpenBLAS 0.3.17?

Wait for a build with PR numpy/numpy#19492 to appear on https://anaconda.org/scipy-wheels-nightly/numpy/files. Since we are revising the wheel building repo to use 64 bit interfaces, the builds are quite active now. Usually they only update once a week or so on Sun. You can check the OpenBLAS version with this code https://github.com/numpy/numpy/blob/main/tools/openblas_support.py#L294

matthew-brett · 2021-07-16T14:19:47Z

@martin-frbg - from @mattip's hint, the version tested above was:

OpenBLAS 0.3.13.dev DYNAMIC_ARCH NO_AFFINITY Haswell MAX_THREADS=64

Testing with the latest Numpy development build as per Matti's instructions (pip install -i https://pypi.anaconda.org/scipy-wheels-nightly/simple numpy), I get:

OpenBLAS 0.3.17  USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell MAX_THREADS=64

This gives the same outputs for the SVDs (5 nan values for Haswell, 0 for the others).

martin-frbg · 2021-07-16T14:50:22Z

So no recent regression at least ... last relevant changes to Haswell (GEMM kernels) were around 0.3.8 I think... Will try to swap them back as soon as I have reproduced the problem on my hardware. (Probably not today though)

martin-frbg · 2021-07-17T21:01:09Z

Bisected to abef2ea "Move -fma option setting to kernel/Makefile.L1", i.e. it is a problem with where and when to apply a compiler option that is (theoretically) required only for one or two BLAS kernels that use FMA in the form of intrinsiscs. (Also means broken since 0.3.13.dev snapshot of December 17). No immediate idea how to handle this, as using gcc -mfma unconditionally led to other unexpected breakage that prompted that particular commit. (May just nuke the new srot and drot kernel, or try -mfma -mno-fma4 to see if this makes any difference. Too tired now)

martin-frbg · 2021-07-18T13:53:32Z

Interim conclusion is that the Haswell
DGEMV kernels want to be built with -mfma to avoid the miscalculation. Reason still unclear, the only somewhat recent change to them was an extension of their clobber lists to include all ymm registers

matthew-brett · 2021-07-19T08:04:45Z

Thanks very much for tracking that one down.

carlkl · 2021-07-20T19:51:57Z

wild guess: what happens if you compile (GCC) with -mfma AND -ffp-contract=off. In this case use of FMA should globally disallowed but FMA for the BLAS kernels that use FMA in the form of intrinsiscs should be supported.

martin-frbg · 2021-07-21T09:28:30Z

Thanks for the suggestion. Unfortunately this does not address the issue of FMA inexplicably being needed for some files (possibly just one) that do not use intrinsics. (My PR fixes that case for gmake builds, but changing the cmake build is much less straightforward)

carlkl · 2021-07-21T15:21:06Z

Maybe: https://stackoverflow.com/questions/13638408/override-compile-flags-for-single-files ?

martin-frbg · 2021-07-21T16:21:31Z

Thanks, I know that option but the cmake rules for the kernel files are themselves generated by a cmake script...

carlkl · 2021-07-21T18:10:12Z

Just a question: why is FMA inexplicably being needed ? Performance reasons?

martin-frbg · 2021-07-21T18:25:13Z

See earlier in this thread... #3318 (comment) #3318 (comment)

carlkl · 2021-07-21T19:30:40Z

Which file do you expect to compile with -mfma to be correct?

dgemv_n_4.c
dgemv_n_microk_haswell-4.c

martin-frbg · 2021-07-21T20:39:08Z

DGEMV_T not DGEMV_N. And the "microk" is #include'd by the other, so I did not try to apply the option separately, the patch applies it to dgemv_t_4.c A survey of older versions suggests that the NaN issue has already been present at least as far back as 0.2.20, and got "fixed" by serendipity when the -mfma` option was added to enable use of intrinsics in other kernels.

martin-frbg · 2021-07-22T19:36:42Z

This problem was already present in 0.2.15, the version that introduced the dgemv_t microkernel for Haswell. I fail to identify any formal problems with the inline assembly, and unfortunately I do not see where -mfma makes a difference in its compilation (.s and .o outputs for it with or without -mfma are identical)

matthew-brett mentioned this issue Jul 16, 2021

np.linalg.matrix_rank calculates the rank of the matrix and gives the wrong result numpy/numpy#19473

Open

martin-frbg mentioned this issue Jul 18, 2021

Empirical workaround for numpy SVD NaN problem from issue 3318 #3320

Merged

martin-frbg mentioned this issue Jul 20, 2021

BLAS memory allocation error in Scikit-learn KMeans & kNN & DBSCAN #3321

Closed

martin-frbg closed this as completed in #3320 Jul 22, 2021

martin-frbg mentioned this issue Oct 20, 2021

dgemv_n_BARCELONA segfault (openblas 0.3.18) #3421

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

nan in svd N for Haswell core #3318

nan in svd N for Haswell core #3318

matthew-brett commented Jul 16, 2021 •

edited

Loading

martin-frbg commented Jul 16, 2021

matthew-brett commented Jul 16, 2021

martin-frbg commented Jul 16, 2021

mattip commented Jul 16, 2021

matthew-brett commented Jul 16, 2021

martin-frbg commented Jul 16, 2021

martin-frbg commented Jul 17, 2021 •

edited

Loading

martin-frbg commented Jul 18, 2021

matthew-brett commented Jul 19, 2021

carlkl commented Jul 20, 2021

martin-frbg commented Jul 21, 2021

carlkl commented Jul 21, 2021

martin-frbg commented Jul 21, 2021

carlkl commented Jul 21, 2021

martin-frbg commented Jul 21, 2021

carlkl commented Jul 21, 2021

martin-frbg commented Jul 21, 2021

martin-frbg commented Jul 22, 2021

nan in svd N for Haswell core #3318

nan in svd N for Haswell core #3318

Comments

matthew-brett commented Jul 16, 2021 • edited Loading

martin-frbg commented Jul 16, 2021

matthew-brett commented Jul 16, 2021

martin-frbg commented Jul 16, 2021

mattip commented Jul 16, 2021

matthew-brett commented Jul 16, 2021

martin-frbg commented Jul 16, 2021

martin-frbg commented Jul 17, 2021 • edited Loading

martin-frbg commented Jul 18, 2021

matthew-brett commented Jul 19, 2021

carlkl commented Jul 20, 2021

martin-frbg commented Jul 21, 2021

carlkl commented Jul 21, 2021

martin-frbg commented Jul 21, 2021

carlkl commented Jul 21, 2021

martin-frbg commented Jul 21, 2021

carlkl commented Jul 21, 2021

martin-frbg commented Jul 21, 2021

martin-frbg commented Jul 22, 2021

matthew-brett commented Jul 16, 2021 •

edited

Loading

martin-frbg commented Jul 17, 2021 •

edited

Loading