Blaze (in some cases) 2x faster than OpenBLAS #821


Closed
theoractice opened this issue Mar 24, 2016 · 4 comments

Comments

@theoractice
Contributor

theoractice commented Mar 24, 2016

Just tried Blaze library and was shocked.

Matrix/vector multiplication test code (built with VS2013 and linked against the 0.2.15 MinGW binary for maximum performance):

#include <Windows.h>
#include <iostream>   // needed for std::cout
#include <opencv2/opencv.hpp>
#include <blaze/blaze.h>
#include <cblas.h>

using namespace std;
using namespace blaze;
using namespace cv;

int main(int argc, char* argv[])
{
    LARGE_INTEGER freq, start, end;
    QueryPerformanceFrequency(&freq);

    float blazeTime = 0;
    DynamicMatrix<float> bA(25, 10000);
    DynamicVector<float, columnVector> bB(10000);
    for (size_t round = 0; round < 1000; round++)
    {
        for (size_t i = 0; i < 25; ++i)
            for (size_t j = 0; j < 10000; ++j)
                bA(i, j) = blaze::rand<float>(0, 1);
        for (size_t j = 0; j < 10000; ++j)
            bB[j] = blaze::rand<float>(0, 1);
        QueryPerformanceCounter(&start);
        DynamicVector<float, columnVector> bAtB = bA*bB;
        QueryPerformanceCounter(&end);
        blazeTime += (end.QuadPart - start.QuadPart) / (float)freq.QuadPart * 1000.f;
    }
    cout << "blaze: " << blazeTime / 1000.f << " ms " << endl;

    cv::RNG &rng = theRNG();
    rng.state = getTickCount();
    float alpha = 1.f, beta = 0;
    int m = 25, p = 10000, n = 1;
    cv::Mat_<float> A(m, p), B(p, n);  // Mat_<float> already implies CV_32F; a third argument would be read as a fill value
    cv::Mat C1 = cv::Mat(m, n, CV_32F, 0.0);
    cv::Mat C2 = cv::Mat(m, n, CV_32F, 0.0);
    float *src1, *src2, *src3, *src4;
    src1 = (float *)A.data;
    src2 = (float *)B.data;
    src3 = (float *)C1.data;
    src4 = (float *)C2.data;

    float openblasTime = 0;
    openblas_set_num_threads(0);
    for (size_t round = 0; round < 1000; round++)
    {
        rng.fill(A, cv::RNG::NORMAL, 1, 100);
        rng.fill(B, cv::RNG::NORMAL, 2, 50);
        QueryPerformanceCounter(&start);
        cblas_sgemv(CblasRowMajor, CblasNoTrans, m, p,
            alpha, src1, p, src2, n, beta, src4, n);
        QueryPerformanceCounter(&end);
        openblasTime += (end.QuadPart - start.QuadPart) / (float)freq.QuadPart * 1000.f;
    }
    cout << "openblas: " << openblasTime / 1000.f << " ms " << endl;

    return 0;
}

Result:

blaze: 0.107732 ms
openblas: 0.205365 ms

I tested this on all the computers mentioned in JuliaLang/julia#810 and got similar results. In fact, this is quite consistent with Blaze's own benchmark results. Blaze seems to be purely header-only C++ (I might be wrong), yet it is REALLY fast.

What's their trick? Could their technique benefit OpenBLAS? Or did I use OpenBLAS incorrectly?

@martin-frbg
Collaborator

Number of threads used may have played a role (OpenBLAS may have picked too many by default), and of the three systems you mentioned in JuliaLang/julia#810 I suspect only the Xeon E3 will have a substantially optimized sgemv kernel in OpenBLAS. (Does this result carry over to other functions as well, seeing that your benchmark calls only sgemv while your verdict concerns all of OpenBLAS ?)
Also possibly related to JuliaLang/LinearAlgebra.jl#72 and the (partial?) fix 6e7be06 for it that went in after 0.2.15 ?
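To rule out the thread count, OpenBLAS can be pinned to a fixed number of threads without recompiling, via its documented environment variable (`./bench` below is an assumed name for the benchmark binary):

```shell
# Single-threaded run, for an apples-to-apples comparison with Blaze
OPENBLAS_NUM_THREADS=1 ./bench

# Or a fixed small count matching the physical cores
OPENBLAS_NUM_THREADS=4 ./bench
```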

@theoractice
Contributor Author

I use the CPU core count as the number of threads; I don't know whether that's appropriate.
From my own observations, this 100% (2x) boost occurred in sgemv/sgemm (but not in dgemv/dgemm), which is why I added "in some cases" to the title; I wasn't saying that Blaze beats OpenBLAS across the board.
There are more tests in their benchmarks that show other speedups. It's a surprise that a C++ library can do so well (in some cases).
The Julia issue seems relevant; I also get the same segfaults on the E3 :(

@martin-frbg
Collaborator

Sorry for misreading earlier. Perhaps it would be useful to compare single-thread performance as well. And as far as I can tell, the segfaults are thought to be fixed post-0.2.15 (#697).

@theoractice theoractice changed the title Blaze (in some cases) 100% faster than OpenBLAS Blaze (in some cases) 2x faster than OpenBLAS Mar 24, 2016
@brada4
Contributor

brada4 commented Mar 29, 2016

Multiplying by alpha = 1.0 and adding beta = (float)0 are still math operations, and they are simply absent from the Blaze part of the example.
A more accurate title would be that Blaze does less work in less time than OpenBLAS. It is a draw, not an outright victory.
