Multithread complex dot product #2221


Closed
antoine-levitt opened this issue Aug 14, 2019 · 7 comments

@antoine-levitt
On my machine (linux x86_64), zdotc is not multithreaded. ddot is, though.

@martin-frbg
Collaborator

The dot functions are a bit special in that they are not multithreaded at the interface level like almost everything else; if I remember correctly, the opinion of the earlier developers was that these were bound by the system I/O bandwidth limit already. A select few machines do have multithreaded ddot kernels: for x86_64, I stole the idea and implementation from the arm64 ThunderX kernels in #1491 to satisfy another Julia request. Not sure if zdot(c) would be just as easy...

@antoine-levitt
Author

if I remember correctly, the opinion of the earlier developers was that these were bound by the system I/O bandwidth limit already

At least on my laptop that is not the case - I see a definite speedup in ddot.

@martin-frbg
Collaborator

PR #2222 now, but it may need some tuning; for now it is just a straight copy of the ARM server CPU code, including its n=10000 threshold.
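For readers curious what "kernel-level" multithreading of a complex dot product amounts to, here is a minimal illustrative sketch in Julia (hypothetical function names; the actual PR #2222 kernel is C, this only mirrors the idea): below a length threshold run the serial loop, above it give each thread its own chunk and sum the partial results.

```julia
# Serial reference: what zdotc computes, sum(conj(x[i]) * y[i]).
function zdotc_serial(x::Vector{ComplexF64}, y::Vector{ComplexF64})
    acc = zero(ComplexF64)
    @inbounds for i in eachindex(x, y)
        acc += conj(x[i]) * y[i]
    end
    return acc
end

# Threaded sketch: short vectors stay serial (the PR reuses the ARM
# kernel's n=10000 threshold); long ones are split into per-thread
# chunks whose partial sums are reduced at the end.
function zdotc_threaded(x, y; threshold=10_000)
    n = length(x)
    n < threshold && return zdotc_serial(x, y)
    nt = Threads.nthreads()
    partials = zeros(ComplexF64, nt)
    Threads.@threads for t in 1:nt
        lo = div((t - 1) * n, nt) + 1
        hi = div(t * n, nt)
        acc = zero(ComplexF64)
        @inbounds for i in lo:hi
            acc += conj(x[i]) * y[i]
        end
        partials[t] = acc
    end
    return sum(partials)
end
```

The two versions agree up to floating-point reassociation, since the chunked reduction sums the elements in a different order.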

@brada4
Contributor

brada4 commented Aug 15, 2019

Can you give some numbers, like the CPU model name? How big is the input? Is it cache-line aligned (e.g. starting at a fresh malloc)? Does it help to double the processing block size in the source file?
One cache line, 4 values, should get fetched from memory to the CPU, then 8 FLOPS done on them (which fits in 1-2 FMA instructions), and then hopefully the next batch is prefetched right away. It is certainly not FPU speed at fault, but something else micro-architectural along the way.
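To make the bandwidth argument concrete, a back-of-the-envelope calculation for zdotc (my numbers, assuming 64-byte cache lines and 16-byte double-precision complex values, so 4 values per line as above):

```julia
# Memory traffic per element pair: one ComplexF64 from each input vector.
bytes_per_pair = 2 * sizeof(ComplexF64)      # 2 * 16 = 32 bytes

# conj(x)*y is 4 multiplies + 2 adds, plus 2 adds to accumulate into
# the real/imaginary running sums: 8 FLOPS per pair.
flops_per_pair = 8

# Arithmetic intensity is very low, so long vectors that spill out of
# cache are limited by memory bandwidth rather than FPU throughput.
intensity = flops_per_pair / bytes_per_pair  # 0.25 FLOP/byte
println("arithmetic intensity: $intensity FLOP/byte")
```

At 0.25 FLOP/byte even a single modern core can saturate its share of memory bandwidth, which is the original rationale for leaving the dot functions single-threaded; whether extra threads still help then depends on how much bandwidth one core can actually draw.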

@antoine-levitt
Author

I'm confused by most of your post, but here are some numbers. This is with Julia 1.1, on an Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz:

julia> using LinearAlgebra, BenchmarkTools   # BLAS and dot live in LinearAlgebra

julia> BLAS.openblas_get_config()
"USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell MAX_THREADS=16"

julia> for N in (5_000, 20_000, 100_000)
           for i in (1, 2, 4)
               BLAS.set_num_threads(i)
               aa = randn(N)
               bb = randn(N)
               @btime dot($aa, $bb)
           end
       end
  815.453 ns (0 allocations: 0 bytes)   # N=5_000,   1 thread
  816.198 ns (0 allocations: 0 bytes)   # N=5_000,   2 threads
  810.475 ns (0 allocations: 0 bytes)   # N=5_000,   4 threads
  4.070 μs (0 allocations: 0 bytes)     # N=20_000,  1 thread
  2.408 μs (0 allocations: 0 bytes)     # N=20_000,  2 threads
  2.349 μs (0 allocations: 0 bytes)     # N=20_000,  4 threads
  24.012 μs (0 allocations: 0 bytes)    # N=100_000, 1 thread
  13.054 μs (0 allocations: 0 bytes)    # N=100_000, 2 threads
  8.458 μs (0 allocations: 0 bytes)     # N=100_000, 4 threads
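Reading the speedups off those timings (a rough calculation; the flat result at N=5_000 is consistent with a minimum-length threshold below which extra threads are not used):

```julia
# 1-thread time divided by 4-thread time, from the @btime results above.
speedup_5k   = 815.453 / 810.475   # ~1.01x: no threading for short vectors
speedup_20k  = 4.070 / 2.349       # ~1.73x
speedup_100k = 24.012 / 8.458      # ~2.84x
println((speedup_5k, speedup_20k, speedup_100k))
```

So on this laptop, ddot scales noticeably with threads once the vectors are long enough, which is the counterexample to the "already bandwidth-bound" assumption.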

@brada4
Contributor

brada4 commented Aug 15, 2019

Thanks, so it is a pretty standard desktop CPU. I was a bit afraid of one of those high-end parts with NUMA inside the package and many memory controllers.

martin-frbg added a commit that referenced this issue Aug 15, 2019
* Add multithreading support

copied from the ThunderX2T99 kernel. For #2221
@martin-frbg
Collaborator

Multithreading (at the kernel level, as already done on the ARM server CPUs) was added in #2222.
