Blaze (in some cases) 2x faster than OpenBLAS #821


Closed
theoractice opened this issue Mar 24, 2016 · 4 comments

Comments

@theoractice
Contributor

theoractice commented Mar 24, 2016

Just tried Blaze library and was shocked.

Matrix/vector multiplication test code (built with VS2013 and linked against the 0.2.15 MinGW binary for maximum performance):

#include <Windows.h>
#include <iostream>   // needed for std::cout
#include <opencv2/opencv.hpp>
#include <blaze/blaze.h>
#include <cblas.h>

using namespace std;
using namespace blaze;
using namespace cv;

int main(int argc, char* argv[])
{
    LARGE_INTEGER freq, start, end;
    QueryPerformanceFrequency(&freq);

    float blazeTime = 0;
    DynamicMatrix<float> bA(25, 10000);
    DynamicVector<float, columnVector> bB(10000);
    for (size_t round = 0; round < 1000; round++)
    {
        for (size_t i = 0; i < 25; ++i)
            for (size_t j = 0; j < 10000; ++j)
                bA(i, j) = blaze::rand<float>(0, 1);
        for (size_t j = 0; j < 10000; ++j)
            bB[j] = blaze::rand<float>(0, 1);
        QueryPerformanceCounter(&start);
        DynamicVector<float, columnVector> bAtB = bA*bB;
        QueryPerformanceCounter(&end);
        blazeTime += (end.QuadPart - start.QuadPart) / (float)freq.QuadPart * 1000.f;
    }
    cout << "blaze: " << blazeTime / 1000.f << " ms " << endl;

    cv::RNG &rng = theRNG();
    rng.state = getTickCount();
    float alpha = 1.f, beta = 0;
    int m = 25, p = 10000, n = 1;
    cv::Mat_<float> A(m, p), B(p, n);  // Mat_<float> already implies CV_32F; a third argument would be read as a fill value
    cv::Mat C1 = cv::Mat(m, n, CV_32F, 0.0);
    cv::Mat C2 = cv::Mat(m, n, CV_32F, 0.0);
    float *src1, *src2, *src3, *src4;
    src1 = (float *)A.data;
    src2 = (float *)B.data;
    src3 = (float *)C1.data;
    src4 = (float *)C2.data;

    float openblasTime = 0;
    openblas_set_num_threads(0);
    for (size_t round = 0; round < 1000; round++)
    {
        rng.fill(A, cv::RNG::NORMAL, 1, 100);
        rng.fill(B, cv::RNG::NORMAL, 2, 50);
        QueryPerformanceCounter(&start);
        cblas_sgemv(CblasRowMajor, CblasNoTrans, m, p,
            alpha, src1, p, src2, n, beta, src4, n);
        QueryPerformanceCounter(&end);
        openblasTime += (end.QuadPart - start.QuadPart) / (float)freq.QuadPart * 1000.f;
    }
    cout << "openblas: " << openblasTime / 1000.f << " ms " << endl;

    return 0;
}

Result:

blaze: 0.107732 ms
openblas: 0.205365 ms

I tested this on all the computers mentioned in JuliaLang/julia#810 and got similar results. In fact, this is quite consistent with Blaze's own benchmark results. Blaze seems to be purely header-only C++ (I might be wrong), yet it is REALLY fast.

What's their trick? Could their technique benefit OpenBLAS? Or did I use OpenBLAS incorrectly?

@martin-frbg
Collaborator

Number of threads used may have played a role (OpenBLAS may have picked too many by default), and of the three systems you mentioned in JuliaLang/julia#810 I suspect only the Xeon E3 will have a substantially optimized sgemv kernel in OpenBLAS. (Does this result carry over to other functions as well, seeing that your benchmark calls only sgemv while your verdict concerns all of OpenBLAS ?)
Also possibly related to JuliaLang/LinearAlgebra.jl#72 and the (partial?) fix 6e7be06 for it that went in after 0.2.15 ?
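To rule out the thread count, OpenBLAS can be pinned to a fixed number of threads without recompiling, via its documented environment variable (`./bench` below is an assumed name for the benchmark binary):

```shell
# Single-threaded run, for an apples-to-apples comparison with Blaze
OPENBLAS_NUM_THREADS=1 ./bench

# Or a fixed small count matching the physical cores
OPENBLAS_NUM_THREADS=4 ./bench
```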

@theoractice
Contributor Author

I use the CPU core count as the number of threads; I don't know whether that's appropriate.
From my own observations, this 100% (2x) boost occurred in sgemv/sgemm (but not in dgemv/dgemm), which is why I added "in some cases" to the title; I wasn't saying that Blaze beats OpenBLAS across the board.
There are more tests in their benchmarks that show other speedups. It's a surprise that a C++ library can do so well (in some cases).
The Julia issue seems relevant; I also get the same segfaults on the E3 :(

@martin-frbg
Collaborator

Sorry for misreading earlier. Perhaps it would be useful to compare single-thread performance as well. And as far as I can tell, the segfaults are thought to be fixed post-0.2.15 (#697).

@theoractice theoractice changed the title Blaze (in some cases) 100% faster than OpenBLAS Blaze (in some cases) 2x faster than OpenBLAS Mar 24, 2016
@brada4
Contributor

brada4 commented Mar 29, 2016

Multiplying by alpha = 1.0 and adding beta = (float)0 are still math operations, and they are simply absent from the Blaze part of the example.
A more accurate title would be that Blaze does less work in less time than OpenBLAS. It is a draw, not an outright victory.
