Skip to content

nan in svd N for Haswell core #3318

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
matthew-brett opened this issue Jul 16, 2021 · 18 comments · Fixed by #3320
Closed

nan in svd N for Haswell core #3318

matthew-brett opened this issue Jul 16, 2021 · 18 comments · Fixed by #3320

Comments

@matthew-brett
Copy link
Contributor

matthew-brett commented Jul 16, 2021

Please forgive me for this bug-report - coming from Numpy - but I think we've hit a bug in the singular value calculation for the Haswell core type - see numpy/numpy#19473

Specifically, only on the Haswell core, we are getting unexpected nan values in the case where we compute the singular values without the UV matrices. The files to reproduce in Numpy are here:
svd_check.zip

The check file is svd_check.py in the archive - and contains:

import numpy as np
from numpy.linalg._umath_linalg import svd_n

mat = np.loadtxt('1.csv', delimiter=',')
s = svd_n(mat, signature='d->d')
print(np.count_nonzero(np.isnan(s)))

We run with many cores using the test_svds.sh script:

#!/bin/bash
export OPENBLAS_VERBOSE=2
for core in prescott dunnington penryn core2 nehalem sandybridge haswell; do
    export OPENBLAS_CORETYPE=$core
    python svd_check.py
done

This gives 5 nan values for the Haswell core, and 0 otherwise. If we compute the UV matrices at the same time, we never get nan in the SVs, regardless of core.

Could this be a bug in the SV-only computation in OpenBLAS?

@martin-frbg
Copy link
Collaborator

which OpenBLAS version is this please ?

@matthew-brett
Copy link
Contributor Author

@mattip - I'm very sorry - I lost track of the versions - this is for Numpy 1.21.0 - what version of OpenBLAS is this? What is the best way to test against OpenBLAS 0.3.17?

@martin-frbg
Copy link
Collaborator

Guessing 1.21 would be using 0.3.15, so probably no improvement from going to 0.3.17. If it only happens on Haswell it is less likely to be a bug in LAPACK but the key to the different behaviour with and without UV (if that is what I think it is) may be in LAPACK DGESDD using a different algorithm when the singular vectors are needed.

@mattip
Copy link
Contributor

mattip commented Jul 16, 2021

What is the best way to test against OpenBLAS 0.3.17?

Wait for a build with PR numpy/numpy#19492 to appear on https://anaconda.org/scipy-wheels-nightly/numpy/files. Since we are revising the wheel building repo to use 64 bit interfaces, the builds are quite active now. Usually they only update once a week or so on Sun. You can check the OpenBLAS version with this code https://github.com/numpy/numpy/blob/main/tools/openblas_support.py#L294

@matthew-brett
Copy link
Contributor Author

@martin-frbg - from @mattip's hint, the version tested above was:

OpenBLAS 0.3.13.dev DYNAMIC_ARCH NO_AFFINITY Haswell MAX_THREADS=64

Testing with the latest Numpy development build as per Matti's instructions (pip install -i https://pypi.anaconda.org/scipy-wheels-nightly/simple numpy), I get:

OpenBLAS 0.3.17  USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell MAX_THREADS=64

This gives the same outputs for the SVDs (5 nan values for Haswell, 0 for the others).

@martin-frbg
Copy link
Collaborator

So no recent regression at least ... last relevant changes to Haswell (GEMM kernels) were around 0.3.8 I think... Will try to swap them back as soon as I have reproduced the problem on my hardware. (Probably not today though)

@martin-frbg
Copy link
Collaborator

martin-frbg commented Jul 17, 2021

Bisected to abef2ea "Move -fma option setting to kernel/Makefile.L1", i.e. it is a problem with where and when to apply a compiler option that is (theoretically) required only for one or two BLAS kernels that use FMA in the form of intrinsiscs. (Also means broken since 0.3.13.dev snapshot of December 17). No immediate idea how to handle this, as using gcc -mfma unconditionally led to other unexpected breakage that prompted that particular commit. (May just nuke the new srot and drot kernel, or try -mfma -mno-fma4 to see if this makes any difference. Too tired now)

@martin-frbg
Copy link
Collaborator

Interim conclusion is that the Haswell
DGEMV kernels want to be built with -mfma to avoid the miscalculation. Reason still unclear, the only somewhat recent change to them was an extension of their clobber lists to include all ymm registers

@matthew-brett
Copy link
Contributor Author

Thanks very much for tracking that one down.

@carlkl
Copy link

carlkl commented Jul 20, 2021

wild guess: what happens if you compile (GCC) with -mfma AND -ffp-contract=off. In this case use of FMA should globally disallowed but FMA for the BLAS kernels that use FMA in the form of intrinsiscs should be supported.

@martin-frbg
Copy link
Collaborator

Thanks for the suggestion. Unfortunately this does not address the issue of FMA inexplicably being needed for some files (possibly just one) that do not use intrinsics. (My PR fixes that case for gmake builds, but changing the cmake build is much less straightforward)

@carlkl
Copy link

carlkl commented Jul 21, 2021

@martin-frbg
Copy link
Collaborator

Thanks, I know that option but the cmake rules for the kernel files are themselves generated by a cmake script...

@carlkl
Copy link

carlkl commented Jul 21, 2021

Just a question: why is FMA inexplicably being needed ? Performance reasons?

@martin-frbg
Copy link
Collaborator

See earlier in this thread... #3318 (comment) #3318 (comment)

@carlkl
Copy link

carlkl commented Jul 21, 2021

Which file do you expect to compile with -mfma to be correct?

  • dgemv_n_4.c
  • dgemv_n_microk_haswell-4.c

@martin-frbg
Copy link
Collaborator

DGEMV_T not DGEMV_N. And the "microk" is #include'd by the other, so I did not try to apply the option separately, the patch applies it to dgemv_t_4.c A survey of older versions suggests that the NaN issue has already been present at least as far back as 0.2.20, and got "fixed" by serendipity when the -mfma` option was added to enable use of intrinsics in other kernels.

@martin-frbg
Copy link
Collaborator

This problem was already present in 0.2.15, the version that introduced the dgemv_t microkernel for Haswell. I fail to identify any formal problems with the inline assembly, and unfortunately I do not see where -mfma makes a difference in its compilation (.s and .o outputs for it with or without -mfma are identical)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants