ILP64 OpenBLAS gives different result from regular OpenBLAS #2779
Is it the same CPU core type?
Make sure that you use the right flavor of include directory and LD_LIBRARY_PATH for your tests as well. I cannot reproduce your problem on Haswell-class hardware, at least.
Yes, they are the same machine type.

```cpp
#include <cstdint>
#include <iostream>
#include <cblas.h>

int main(int argc, char* argv[]) {
    const int64_t n = ((int64_t)1 << 31) - 1;
    std::cout << n << std::endl;
    float* A = new float[n];
    float* B = new float[n];
    float* C = new float[1];
    for (long i = 0; i < n; i++) {
        A[i] = 1;
        B[i] = 1;
    }
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 1, 1, n, 1.0, A, n, B, 1, 0.0, C, 1);
    std::cout << *C << std::endl;
    float acc = 0.0;
    for (long i = 0; i < n; i++)
        acc += A[i] * B[i];
    std::cout << acc << std::endl;
    delete[] A;
    delete[] B;
    delete[] C;
    return 0;
}
```

The results are:
Vanilla OpenBLAS
Do I need to add any flags?
@Zha0q1 you can check which library gets loaded by setting
I am puzzled why the ILP64 and "normal" versions do not use the same blocking algorithm and thus get similar results, i.e. if you add 64 elements together in the same loop before adding them to the accumulator, you get further before saturating, like in the "normal" example.
After changing acc to double I got:
Also ldd gives:
Sorry, I still don't quite understand. Is there a way to fix this?
That's a property of floating-point numbers; depending on the algorithm you may saturate the significand or get the arithmetically correct result.
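To make the significand saturation concrete, here is a minimal standalone sketch (not from the thread): a single float accumulator stops growing once it reaches 2^24 = 16777216, while a double accumulator counts all 2^31 - 1 additions exactly.

```cpp
#include <cstdint>
#include <iostream>

int main() {
    const int64_t n = (int64_t{1} << 31) - 1;
    float  acc_f = 0.0f;
    double acc_d = 0.0;
    for (int64_t i = 0; i < n; i++) {
        acc_f += 1.0f;   // stops increasing at 16777216 (2^24): 2^24 + 1 rounds back to 2^24
        acc_d += 1.0;    // exact up to 2^53
    }
    std::cout << acc_f << "\n";  // prints 1.67772e+07
    std::cout << acc_d << "\n";  // prints 2.14748e+09
    return 0;
}
```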
But when I use vanilla OpenBLAS it does give the arithmetically correct result on the same input. Is there a way to make ILP64 OpenBLAS consistent with that?
Reference BLAS gives the sixteen-megs (16777216) result; that's the absolute truth in this case (it needs 32+ GB of RAM to get through).
The fix you submitted in numpy gets the arithmetically correct result; its C++ equivalent is using dgemm.
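For reference, a hedged sketch of what that dgemm equivalent could look like (assumed: same shape as the reproducer above, just switched to double precision; the two double input buffers are ~16 GB each, which is where the 32+ GB RAM requirement comes from):

```cpp
#include <cstdint>
#include <iostream>
#include <cblas.h>

int main() {
    const int64_t n = (int64_t{1} << 31) - 1;
    double* A = new double[n];   // ~16 GB
    double* B = new double[n];   // ~16 GB
    double* C = new double[1];
    for (int64_t i = 0; i < n; i++) { A[i] = 1.0; B[i] = 1.0; }
    // 1 x n times n x 1, accumulated in double: the 53-bit significand holds
    // the exact count of 2^31 - 1 ones.
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                1, 1, n, 1.0, A, n, B, 1, 0.0, C, 1);
    std::cout << *C << std::endl;   // 2.14748e+09
    delete[] A; delete[] B; delete[] C;
    return 0;
}
```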
Yeah right, but with vanilla OpenBLAS even sgemm gives the correct result.
Yep, but I don't have a clue how ;-)
Are you trying with the ILP64 build?
Actually sgemv clamps in the same place, but sdot gets a bit further thanks to the blocking that is implied.
Ohh, I see what you mean now, so it's similar to loop unrolling? So can we control the blocking by some means...?
Not really, it is just unrolled once in the source file. Knowingly deviating from the reference is neither a fix nor an improvement. LAPACK solvers sort of expect close-to-standard behaviour in such border cases.
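To illustrate why unrolling/blocking moves the saturation point, here is a standalone sketch (not OpenBLAS kernel code; dot_naive and dot_unrolled4 are made-up names): splitting the sum across several partial float accumulators lets each one grow further before it stops changing.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// One accumulator: once the running sum reaches 2^24, adding 1.0f no longer
// changes it, so the result clamps at 16777216.
float dot_naive(const float* a, const float* b, int64_t n) {
    float acc = 0.0f;
    for (int64_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

// Four partial accumulators (a hand-written stand-in for an unrolled kernel):
// each partial sum only has to hold n/4 terms, so it saturates much later.
float dot_unrolled4(const float* a, const float* b, int64_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int64_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)
        s0 += a[i] * b[i];
    return (s0 + s1) + (s2 + s3);
}

int main() {
    const int64_t n = int64_t{1} << 26;          // 64M ones (~512 MB of input)
    std::vector<float> a(n, 1.0f), b(n, 1.0f);
    std::cout << dot_naive(a.data(), b.data(), n) << "\n";      // 1.67772e+07
    std::cout << dot_unrolled4(a.data(), b.data(), n) << "\n";  // 6.71089e+07
    return 0;
}
```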
Report from (Fedora EPEL int64 0.3.3, sandybridge): sixteen megs.
@brada4 I don't understand how you can suggest that we will lose precision when int64 indexing of the array/tensor is used and the data type still remains the same. How can changing the indexing type affect the precision of the output?
The expected precision is that of standard netlib BLAS, with a single addition per loop. If you need to sum values that are more than significand+1 bits apart, you need a bigger floating-point variable with an obviously bigger significand range. Take another mxnet-ish example: we need the dot product of a huge vector; ok, we get one. I think there is plainly no way to increase the robustness of the naive reference algorithm in the face of marginal input. The additional precision coming from FMA or blocking is surprising, but don't trust it too much. It won't work that way when actual adjacent inputs are a significand range apart.
Folks, can we get some information on what hardware (OpenBLAS TARGET) you are running this code on? As I wrote earlier today, I get identical results on Kaby Lake (Haswell TARGET) when I set both the include path and LD_LIBRARY_PATH to the install location of the desired OpenBLAS build. (Also make sure that you actually include the cblas.h created by
Right. I am using an EC2 instance with 64 cores. With
Further back... ldd has to be run against the binary executable whose imports you don't know, not against the loader itself.
I bet the difference is not 32-bit vs. 64-bit integers, but a different number of cores leading to a different splitting of the workload and
I was also able to reproduce the same difference on the same machine, with the only difference being the 32-bit vs. 64-bit (ILP64) OpenBLAS.
Are you sure that you #included the right version of cblas.h, and linked the appropriate libopenblas in each case? Unfortunately I do not have access to any bigger x86_64 architecture hardware than the Epyc7401 at drone.io (or rather packet.com), but I do not see offhand why going to 64 cores would make a difference (nor why the larger addressable range would matter).
Update: The difference I described earlier comes from the fact that my EC2 instance ships with a special, possibly optimized, OpenBLAS binary. EC2 OpenBLAS (LP64, int32): no loss of precision on my example, 2.14748e+09. The EC2 OpenBLAS version is:
My openblas dev build is:
Do you think this numerical difference is due to a different OpenBLAS version, a different build config, or something else?
The SGEMM kernel for Haswell targets was changed earlier this year (#2361) to increase performance; perhaps this introduced different rounding due to FMA operations. You could try reverting that change (undo the changes in both kernel/x86_64/KERNEL.HASWELL and param.h) or do a quick test with 0.3.7.
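As a standalone illustration of how FMA can change rounding (an assumed toy example, unrelated to the actual kernel code): a fused multiply-add rounds only once, so it can keep a low-order bit that a separate multiply and add would discard.

```cpp
#include <cmath>
#include <iostream>

int main() {
    // a*b is exactly 1 + 2^-11 + 2^-24, which does not fit in a 24-bit float
    // significand, so a separate multiply rounds it to 1 + 2^-11.
    float a = 1.0f + std::ldexp(1.0f, -12);    //  1 + 2^-12
    float b = a;
    float c = -(1.0f + std::ldexp(1.0f, -11)); // -(1 + 2^-11)

    // volatile keeps the compiler from contracting the two steps into an FMA
    // (alternatively, build with -ffp-contract=off).
    volatile float prod = a * b;
    float separate = prod + c;            // two roundings: 0
    float fused    = std::fmaf(a, b, c);  // single rounding: 2^-24 ~ 5.96046e-08

    std::cout << separate << "\n" << fused << "\n";
    return 0;
}
```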
ILP64 OpenBLAS built with
Compiled with
g++ gemm.cpp -I /usr/local/include/ -lopenblas
On ILP64 OpenBLAS machine:
On vanilla OpenBLAS machine:
I think I am linking correctly, because if I change n to ((long)1 << 31) + 1, then:
ILP64
Vanilla