Multi-arch OpenBLAS with DYNAMIC_ARCH=1 yields wrong result when compiling on ivybridge and running on skylake #3454

Closed
mlell opened this issue Nov 19, 2021 · 9 comments

Comments

@mlell

mlell commented Nov 19, 2021

I am running OpenBLAS-enabled R in a Singularity container. I used version 0.3.18 as packaged by Debian, but I noticed that on some of our SLURM nodes R gives a wrong result: a genomic kinship matrix is reported as not positive definite when it actually is. I also saw clearly invalid PCA (principal component analysis) results on those nodes.

More specifically, on nodes that have the CPU architecture "skylake" the result was wrong whereas on nodes with architectures "nehalem", "haswell" and "ivybridge" the result was correct.

To follow up on this, I compiled OpenBLAS myself. The problem only occurs with the make argument DYNAMIC_ARCH=1. Adding TARGET=GENERIC did not change anything, but with make TARGET=GENERIC alone (without DYNAMIC_ARCH) the problem does not occur. The problem also only occurs when I compile on ivybridge and run on skylake, not vice versa.

I did a git bisect using make TARGET=GENERIC DYNAMIC_ARCH=1 and found that the first commit that shows this problem is d71fe4e.
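For context: DYNAMIC_ARCH=1 builds kernels for several CPU targets into one library and selects among them at runtime by probing the executing CPU, so a library built on one machine can run code paths that the build host itself cannot execute. A minimal sketch of that dispatch pattern (this is not OpenBLAS's actual code; the kernel and selector names below are made up):

/* Minimal illustration of runtime kernel dispatch, similar in spirit to what
 * DYNAMIC_ARCH=1 does. Not OpenBLAS code; all names here are hypothetical. */
#include <stdio.h>

static void dgemm_kernel_generic(void)     { puts("generic kernel"); }
static void dgemm_kernel_sandybridge(void) { puts("AVX (Sandybridge) kernel"); }
static void dgemm_kernel_haswell(void)     { puts("AVX2/FMA (Haswell) kernel"); }
static void dgemm_kernel_skylakex(void)    { puts("AVX-512 (SkylakeX) kernel"); }

/* All kernels are compiled into the binary; the choice is made on the CPU
 * that runs it, not on the machine that built it. */
static void (*pick_dgemm_kernel(void))(void) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    if (__builtin_cpu_supports("avx512f")) return dgemm_kernel_skylakex;
    if (__builtin_cpu_supports("avx2"))    return dgemm_kernel_haswell;
    if (__builtin_cpu_supports("avx"))     return dgemm_kernel_sandybridge;
#endif
    return dgemm_kernel_generic;
}

int main(void) {
    pick_dgemm_kernel()();  /* runs the most capable kernel for this CPU */
    return 0;
}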

@martin-frbg
Collaborator

Probably a duplicate of #2986 (Ivybridge is the same as Sandybridge as far as OpenBLAS is concerned), but back then my bisect went nowhere and the problem was said to have appeared in much earlier versions. So far I have no idea why Skylake kernels would get miscompiled on Sandybridge specifically (doing the same DYNAMIC_ARCH build with the same gcc version on Haswell did not result in any problems on SkylakeX).

@mlell
Author

mlell commented Nov 19, 2021

I first did a "manual" bisect: I looked for the first version tag that showed the problem and then bisected from there. 0.3.10 was the first version I tried, because 0.3.0 did not compile (I think there was a note about this in v0.3.9 or so). After narrowing it down to somewhere between 0.3.12 and 0.3.13, I bisected from there. That may be why I got different bisect results than you...

Because you noted that the problem seems to have occurred in earlier versions, to be sure that I did not make an error during bisecting, I just checked out d71fe4e^ (which is commit a554712) and confirmed that the problem indeed does not occur for that commit.

@brada4
Contributor

brada4 commented Nov 19, 2021

Please set OPENBLAS_NUM_THREADS=1 and OMP_NUM_THREADS=1 and retry the failing sample; the code you point to concerns threading.
In another issue report there is a discussion about replacing the BLAS used by Debian's R. A somewhat outdated guide (Ubuntu 16.04 / Debian 9) is here: https://github.com/xianyi/OpenBLAS/wiki/Faq#debianlts
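If it helps to rule threading in or out, the same single-thread setting can also be forced from code; a minimal C check assuming the library is linked with -lopenblas (the matrix values are arbitrary):

/* Force single-threaded OpenBLAS and run a tiny DGEMM as a sanity check.
 * openblas_set_num_threads() has the same effect as OPENBLAS_NUM_THREADS=1
 * and is declared in OpenBLAS's cblas.h. */
#include <stdio.h>
#include <cblas.h>

int main(void) {
    openblas_set_num_threads(1);

    double A[4] = {1, 2, 3, 4};   /* 2x2 matrices, row-major */
    double B[4] = {5, 6, 7, 8};
    double C[4] = {0, 0, 0, 0};

    /* C = 1.0 * A * B + 0.0 * C */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);

    printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);  /* expect 19 22 / 43 50 */
    return 0;
}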

@martin-frbg
Collaborator

The trouble is that d71fe4e (should have) affected parameters only for the unrelated Haswell and Ryzen CPUs, neither Sandybridge nor SkylakeX.

@martin-frbg
Collaborator

Hmm. Part of the problem may be that the Ivybridge-compiled OpenBLAS may actually be using Haswell kernels or worse on SkylakeX (due to a logic bug between build platform capability and runtime platform capability). However, specifically undoing the changes from d71fe4e did not make the test case from #2986 work, and given that your code already ran fine on actual Haswell before you reverted them, I still think that commit should have been unrelated - at worst a heisenbug that does not show up on every run.
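The distinction at play is between what gets decided when the library is compiled and what the CPU running it actually supports. A small illustration of the two kinds of check (not the OpenBLAS build logic itself):

/* Compile-time vs. runtime capability checks. In a DYNAMIC_ARCH build the
 * driver code is compiled only once, so anything behind an #ifdef is fixed
 * by the build configuration, while the CPU executing the binary may differ. */
#include <stdio.h>

int main(void) {
#ifdef __AVX512F__
    /* Decided by the compiler flags used on the build machine. */
    puts("compiled with AVX-512 code paths enabled");
#else
    puts("compiled without AVX-512 code paths");
#endif

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    /* Decided by the CPU the binary is running on. */
    if (__builtin_cpu_supports("avx512f"))
        puts("running on an AVX-512 capable CPU");
    else
        puts("running on a CPU without AVX-512");
#endif
    return 0;
}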

@martin-frbg
Collaborator

Bisecting again with the test case from #2986, but with an earlier starting point, now puts the blame on 081b188, which was part of PR #2384 by @wjc404. At least this is now something that targeted SKYLAKEX, and I can already see one design flaw that was not clear to me a year ago: the PR introduces CPU-specific code in the BLAS3 driver functions, but in DYNAMIC_ARCH builds these are only built once, with the settings of the designated TARGET (or build CPU).
On the face of it this should only lead to a performance problem, but perhaps the 16x2 DGEMM kernel introduced by the PR can actually mishandle corner cases when it is fed smaller chunks of data than expected:

#ifdef SKYLAKEX
	/* the current AVX512 s/d/c/z GEMM kernel requires n>=6*GEMM_UNROLL_N to achieve the best performance */
	if (min_jj >= 6*GEMM_UNROLL_N) min_jj = 6*GEMM_UNROLL_N;
#else
	if (min_jj >= 3*GEMM_UNROLL_N) min_jj = 3*GEMM_UNROLL_N;
	else

(min_jj here is the N argument of a subsequent GEMM_ONCOPY or ...OTCOPY(M,N,..) call.) For a TARGET=GENERIC build, min_jj would be capped at 4 instead of the intended 12, as my #3026 removed the 3*GEMM_UNROLL_N line, which had been seen to cause SYRK performance problems on Haswell. For a Sandybridge/Ivybridge build, GEMM_UNROLL_N would be 4, so min_jj would get capped at 8 instead of 24 - and it would have been 12 without my removal of the 3*GEMM_UNROLL_N line, which was the second half of the PR to which d71fe4e belonged. So perhaps this is why reverting that part worked for you (assuming you reverted the entire PR and not just the two unrelated GEMM_UNROLL_MN lines in param.h).
Changing KERNEL.SKYLAKEX back to using dgemm_kernel_4x8_skylakex_2.c (and changing the related M,N parameters in param.h) fixes at least the related #2986 for me.
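To make the capping mechanism concrete: the cap on min_jj is computed from whichever GEMM_UNROLL_N the driver happened to be compiled with, not from the value the runtime kernel expects. A small sketch of that arithmetic (the else branch, truncated in the quote above, is approximated here, and the unroll values in main() are placeholders rather than values read from param.h):

/* Sketch of the min_jj capping logic from the driver snippet quoted above,
 * parameterised by the GEMM_UNROLL_N that was compiled in. The unroll values
 * used below are placeholders to show the mechanism, not OpenBLAS constants. */
#include <stdio.h>

static int cap_min_jj(int min_jj, int gemm_unroll_n, int skylakex_branch) {
    if (skylakex_branch) {
        /* SKYLAKEX branch: cap at 6*GEMM_UNROLL_N */
        if (min_jj >= 6 * gemm_unroll_n) min_jj = 6 * gemm_unroll_n;
    } else {
        /* generic branch: cap at 3*GEMM_UNROLL_N, otherwise at GEMM_UNROLL_N */
        if (min_jj >= 3 * gemm_unroll_n) min_jj = 3 * gemm_unroll_n;
        else if (min_jj > gemm_unroll_n) min_jj = gemm_unroll_n;
    }
    return min_jj;
}

int main(void) {
    int request = 64;  /* an arbitrarily large block width */
    /* The same request is capped very differently depending on which
     * target's parameters the driver was built with. */
    printf("SKYLAKEX branch, unroll 8: %d\n", cap_min_jj(request, 8, 1));  /* 48 */
    printf("generic branch,  unroll 4: %d\n", cap_min_jj(request, 4, 0));  /* 12 */
    printf("generic branch,  unroll 2: %d\n", cap_min_jj(request, 2, 0));  /*  6 */
    return 0;
}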

@martin-frbg
Collaborator

Small correction - it is always the GEMM_UNROLL_N applicable for the build host that gets inserted into the level3 gemm driver code, regardless of TARGET. And the problem is reproducible with Intel SDE, so probably unrelated to BIOS/microcode versions of physical hardware.

@martin-frbg
Collaborator

Probably fixed with 0.3.19 through #3469.

@mlell
Author

mlell commented Dec 20, 2021

Checking out v0.3.19 and compiling with the same flags as in the OP solved this problem for me. Setting OMP_NUM_THREADS and OPENBLAS_NUM_THREADS to 1 with v0.3.18 did not affect the issue. Thank you for all your work!

@mlell mlell closed this as completed Dec 20, 2021
raspbian-autopush pushed a commit to raspbian-packages/openblas that referenced this issue Oct 11, 2023
…achine

Origin: upstream, OpenMathLib/OpenBLAS#3579
Bug: OpenMathLib/OpenBLAS#2986
     OpenMathLib/OpenBLAS#3454
     OpenMathLib/OpenBLAS#3557
Bug-Debian: https://bugs.debian.org/1025480
Applied-Upstream: 0.3.21
Reviewed-by: Sébastien Villemot <[email protected]>
Last-Update: 2023-06-26

When building OpenBLAS with dynamic arch selection on x86-64 hardware
that does not support AVX2 (e.g. Intel Ivybridge or earlier), then
the AVX512 (SkylakeX) kernel for DGEMM would produce incorrect
results (of course when run on AVX512-capable hardware).

The problem was that the check for determining whether the compiler
is able to understand AVX512 assembly/intrinsics was doubly
incorrect: it would test the build machine capabilities (instead of
the compiler capabilities); and it would check for AVX2 instead of
AVX512. As a consequence, on pre-AVX2 hardware, the build system
would conclude that the compiler is not able to understand AVX512
primitives, and would create a broken AVX512 (SkylakeX) DGEMM kernel
(essentially a Haswell kernel, but with some wrong assumptions, hence
leading to incorrect numerical results).
Gbp-Pq: Name avx512-dgemm.patch
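The fix described in this commit amounts to testing whether the compiler can build AVX-512 code, independently of what the build host's own CPU supports. Such a probe is simply a tiny translation unit that either compiles or does not; the file below is an illustrative sketch, not the actual test source used by the OpenBLAS build system:

/* Illustrative compiler-capability probe: if this compiles (e.g. with
 * -march=skylake-avx512), the compiler understands AVX-512 intrinsics,
 * regardless of whether the machine doing the compiling can execute them.
 * Running the resulting binary, of course, still requires AVX-512 hardware. */
#include <immintrin.h>

int main(void) {
    __m512d a = _mm512_set1_pd(1.0);
    __m512d b = _mm512_set1_pd(2.0);
    __m512d c = _mm512_fmadd_pd(a, b, a);  /* c = a*b + a = 3.0 in every lane */
    double out[8];
    _mm512_storeu_pd(out, c);
    return (out[0] == 3.0) ? 0 : 1;
}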