Skip to content

Incorrect GEMM results with armv8 #1870

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
mdenna-nviso opened this issue Nov 14, 2018 · 16 comments · Fixed by #1904
Closed

Incorrect GEMM results with armv8 #1870

mdenna-nviso opened this issue Nov 14, 2018 · 16 comments · Fixed by #1904
Milestone

Comments

@mdenna-nviso
Copy link

Hi,

it seems there is a regression in cblas_sgemm() when compiled for aarch64 (single thread).
Version 5a6a2be was working correctly but the latest one gives wrong results even for simple test cases:

NMK = 4,4,4

A = B =
0.0000, 1.0000, 2.0000, 3.0000,
4.0000, 5.0000, 6.0000, 7.0000,
8.0000, 9.0000, 10.0000, 11.0000,
12.0000, 13.0000, 14.0000, 15.0000,

Expected result:
56.0000, 62.0000, 68.0000, 74.0000,
152.0000, 174.0000, 196.0000, 218.0000,
248.0000, 286.0000, 324.0000, 362.0000,
344.0000, 398.0000, 452.0000, 506.0000,

Got:

28.0000, 34.0000, 76.0000, 82.0000,
76.0000, 98.0000, 252.0000, 274.0000,
124.0000, 162.0000, 428.0000, 466.0000,
172.0000, 226.0000, 604.0000, 658.0000,

Any suggestion?

@martin-frbg
Copy link
Collaborator

Could be related to #1821 (fastest way to check would probably be to replace just kernel/arm64/KERNEL.ARMV8 with an older version)

@rengolin
Copy link
Contributor

I'm fixing some of the ARM issues on my fork:
https://github.com/rengolin/OpenBLAS/tree/armv8-cleanup

See if that branch works for you.

@martin-frbg
Copy link
Collaborator

Can you please post your test code ? I cannot reproduce this on ARMV8 with current develop branch using this quick-n-dirty test case

using namespace std;
#include <iostream>
#include "cblas.h"
int main()
{
const enum CBLAS_ORDER Order=CblasRowMajor;
const enum CBLAS_TRANSPOSE TransA=CblasNoTrans;
const enum CBLAS_TRANSPOSE TransB=CblasNoTrans;
const int M=4;
const int N=4;
const int K=4;
const float alpha=1;
const float beta=0;
const int lda=K;
const int ldb=N;
const int ldc=N;
const float A[M*K]={0.,1.,2.,3.,4.,5.,6.,7.,8.,9.,10.,11.,12.,13.,14.,15.};
const float B[K*N]={0.,1.,2.,3.,4.,5.,6.,7.,8.,9.,10.,11.,12.,13.,14.,15.};
float C[M*N];

cblas_sgemm(Order, TransA, TransB, M, N, K, alpha, A, lda, B, ldb, beta, C, ldc);
for (int i = 0; i < M; i++)
{
    for (int j = 0; j < N; j++)
    {
        cout << C[i*M + j] << " ";
    }
    cout << endl;
}
}    

@brada4
Copy link
Contributor

brada4 commented Nov 18, 2018

@mdenna-nviso does it happen that you use one input as output by chance?

@mdenna-nviso
Copy link
Author

Hi all,

thanks for the prompt support!
The test code I used has tree separate matrix for input and output, and looks quite similar to the test code posted by Martin.
To avoid doubts I replaced my test code with Martin test code, but the result I get is still incorrect:

28 34 76 82 
76 98 252 274 
124 162 428 466 
172 226 604 658 

I repeated the test with the latest develop branch and origin/armv8-cleanup from Renato's fork, same result.
I'm compiling with aarch64-linux-gnu-g++ (Linaro GCC 6.4-2018.05) 6.4.1, and running the test on a Raspberry Pi 3B with 64 bits Debian OS.

Thanks

@mdenna-nviso
Copy link
Author

I'm wondering if I'm doing something wrong, but the same test program compiled with version 5a6a2be gives me correct results:

56 62 68 74 
152 174 196 218 
248 286 324 362 
344 398 452 506

Any test I could to help clarify the issue?

@rengolin
Copy link
Contributor

That is odd, I cannot reproduce your results either (develop or cleanup branches).

I don't have access to a (free) RPi3, would you mind doing a git bisect between 5a6a2be and some known bad commit? A simple script to compile, run and grep for "56 62 68 74" would do for git-bisect.

At least we know what change altered the results for you, so it's easier to start guessing what's going wrong.

@rengolin
Copy link
Contributor

A few questions:

  • What's the target you're building for?
  • What are the contents of config.h?
  • How are you compiling the example?

@martin-frbg
Copy link
Collaborator

Both seconded :-)
(I posted a quick rundown of git bisect at #1425 (comment) )

@mdenna-nviso
Copy link
Author

I run bisect, looks like there is a commit that breaks the library (bus fault) and from that point on I always get bus fault or wrong results.

BUS FAULT:
a399d00 is the first bad commit
commit a399d00
Author: oon3m0oo [email protected]
Date: Wed Jun 20 21:04:03 2018 +0100

Further improvements to memory.c. (#1625)

This is a commit somewhere in between the bad commit and the latest develop where the bus fault has been fixed but the results are incorrect:

[5d42b6e] Merge pull request #1756 from martin-frbg/issue1754

74 92 218 236 
122 156 394 428 
98 124 306 332 
146 188 482 524 

FAIL

I'm compiling the example using cmake.
During the build I use the following options:

CMAKE_SYSTEM_PROCESSOR aarch64
    -DUSE_THREAD=OFF
    -DNUM_THREADS=64
    -DBUILD_WITHOUT_LAPACK=ON
    -DBUILD_RELAPACK=OFF
    -DNOFORTRAN=ON
    -DBUILD_SHARED_LIBS=OFF
    -DCMAKE_INSTALL_PREFIX=${CMAKE_BINARY_DIR}/deps
    -DCMAKE_INSTALL_MESSAGE=LAZY
    -DDYNAMIC_ARCH=OFF

Thanks

@martin-frbg
Copy link
Collaborator

Too bad, it looks like the (intermittent use of the) new thread-local storage allocation code messes up the bisect. Current develop should be back to the "old" version of memory.c unless you specify USE_TLS=1 so the actual error should be elsewhere (unless something went wrong with the cmake build rules). My test above was using plain make, I'll see if I can repeat with cmake. (BTW my testbed is Cortex A57, Pi3B seems to be A53 but as far as I understand this should not lead to bad results)

@brada4
Copy link
Contributor

brada4 commented Nov 19, 2018

@mdenna-nviso
USE_TLS slipped into 0.3.3 release as default, it should not be apparently.

Small few-line patches highly recommended over 0.3.3 and older to improve old/good serializing allocator
NUM_THREADS would be considered at least 25 where it specifies buffer allocation, #1858
#1785 is a reduction of number of locks tried during allocation.

@martin-frbg
Copy link
Collaborator

@brada4 does not explain why it still goes wrong with current develop. I suspect the cmake environment may have started misdetecting some feature recently (plus I should have added that my tests are with an Android device, not Debian). Trying to build with cmake, I now get a build failure as the OS is assumed to be Linux and NO_SYSV_IPC does not get set. Need to fix that first, but perhaps something similar is happening on the Pi, leading to a "non-standard" build.

@martin-frbg
Copy link
Collaborator

CMAKE seems to be picking a different (and apparently) broken sgemm_kernel for ARMV8 (generic 2x2 as far as I can tell by now), possibly all the data obtained from the call to getarch_2nd goes unused. I cannot correlate that with any change since 5a6a2be though.

@martin-frbg martin-frbg added this to the 0.3.5 milestone Dec 2, 2018
@martin-frbg
Copy link
Collaborator

martin-frbg commented Dec 3, 2018

For TARGET=ARMV8, 5a6a2be would choose sgemm_kernel_4x4.S for inclusion in kernel/CMakeFiles , more recent versions went to generic sgemm_kernel_2x2.c . (Both pick sgemm_kernel_16x4.S when allowed to autodetect a CortexA53 system, so the microkernel selection mechanism is not completely broken).
Bisecting on this symptom alone leads to my 1cb7b90 "conditional compilation of assembly files that IOS does not like" from #1749 (somewhat predictably as that PR introduced a switch to a fallback C kernel when the assembler does not handle the dialect in the .S file. Apparently my carefully crafted ifneq ($(OS_DARWIN)$(CROSS),11) in KERNEL.ARMV8 does not work as planned in a cmake environment - probably the KERNEL file is parsed for key-value pairs without conditionals, so the last match wins. (This may have further implications for other ARM64 cpus, as their KERNEL files also make use of ifeq)
(Yes - the ParseMakefileVars macro in cmake/utils.cmake does match key-value pairs only, not logic.
Inverting the condition in KERNEL.ARMV8 to make the last definition win probably does not help, as it would leave extraneous modules defined)

@martin-frbg
Copy link
Collaborator

martin-frbg commented Dec 4, 2018

Actually reversing the order of the if branches does look like it could provide an acceptable (interim) solution, except that on its own this change creates spurious references to nonexisting sources "cgemm_kernel.S" and "zgemm_kernel.S" with corresponding ghost objects in the generated Makefiles, build and link.txt
Update: this seems to be a general (mis)feature of the kernel CMakefile, it creates spurious c/zgemm_kernel source files and build instructions for them in addition to the desired c/zgemm_kernel_b,_l,_n,_r files. On ARMV8 these extra files just happen to fail to build as some macros in the assembly are declared only for the expected cases with either of NT,NN,CT,etc. defined.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants