
eigvals performs faster for Matrix{ComplexF64} than Matrix{Float64} on Windows #960


Closed
yakovbraver opened this issue Oct 18, 2022 · 9 comments
Labels
external dependencies Involves LLVM, OpenBLAS, or other linked libraries performance Must go faster system:windows Affects only Windows upstream The issue is with an upstream dependency, e.g. LLVM

Comments

@yakovbraver

(Cross-posting from Discourse)
I’ve noticed that when diagonalising real symmetric matrices using the default OpenBLAS, eigvals may perform faster if the input matrix is complex, i.e. Matrix{ComplexF64} rather than Matrix{Float64}. Here is my test code:

using LinearAlgebra, BenchmarkTools

n = 50             # matrix dimension
F = rand(n, n)     # a random real Float64 matrix
F += F'            # make `F` symmetric
C = ComplexF64.(F) # a copy of `F` stored as a `Matrix{ComplexF64}`

@benchmark eigvals($F)
@benchmark eigvals($C)

For n = 50, the complex matrix is diagonalised ~5 times faster than the real one:
(screenshot: benchmark results, 2022-10-17)
For n = 230, both calculations take the same amount of time, and for larger matrices the complex calculation becomes slower than the real one, as expected.
I could reproduce these results on four different machines running Windows 10, while on macOS 10.14.6 the issue is not present (the real calculation performs faster than the complex, as expected). The outputs of versioninfo() are available in a gist.
The issue does not seem to appear on Linux either (see Discourse).
When I switch to MKL.jl, the issue disappears.
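For anyone reproducing this, it can help to first confirm which BLAS backend and how many BLAS threads are in use. A minimal sketch (assuming Julia ≥ 1.7, where `BLAS.get_config` is available):

```julia
using LinearAlgebra

# Which BLAS library is Julia linked against?
# (libopenblas by default; libmkl_rt after `using MKL`.)
println(BLAS.get_config())

# How many threads will OpenBLAS use for LAPACK calls
# such as dsytrd (real) / zhetrd (complex)?
println(BLAS.get_num_threads())
```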

@oscardssmith oscardssmith changed the title eigvals performs faster for Matrix{ComplexF64} than Matrix{Float64} eigvals performs faster for Matrix{ComplexF64} than Matrix{Float64} on Windows Oct 18, 2022
@andreasnoack
Member

Could you please try to time just LAPACK.hetrd!('U', copy(F)) and LAPACK.hetrd!('U', copy(C))?

@yakovbraver
Author

Here are the results on the same Windows machine #1 as above:
(screenshot: hetrd! benchmark results)
This is revealing. I've obtained similar results on the other Windows machines, while on macOS the Float64 calculation was faster than ComplexF64, as expected.

@andreasnoack
Member

So the problem is in the reduction to symmetric tridiagonal form. Could you please try again with OPENBLAS_NUM_THREADS=1, and also share the output of versioninfo()?
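For reference, a minimal sketch of limiting OpenBLAS to a single thread from within a running session, which should be equivalent to launching Julia with the environment variable already set:

```julia
using LinearAlgebra

# Equivalent to launching Julia with OPENBLAS_NUM_THREADS=1 in the environment:
BLAS.set_num_threads(1)
println(BLAS.get_num_threads())  # prints 1
```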

@yakovbraver
Author

Setting OPENBLAS_NUM_THREADS=1 fixes it; here is the output, again from the same Windows machine #1 as above:
(screenshot: hetrd! benchmark results with OPENBLAS_NUM_THREADS=1)
I tried this on all four Windows 10 test machines (versioninfo() here), and the results are the same.
It turns out that setting OPENBLAS_NUM_THREADS=1 on an Intel Mac gives a speedup as well:
(screenshot: benchmark results, 2022-10-25)
I've also found that launching Julia in multithreaded mode (e.g. julia -t8) has no effect on this benchmark.

@andreasnoack andreasnoack added performance Must go faster upstream The issue is with an upstream dependency, e.g. LLVM system:windows Affects only Windows external dependencies Involves LLVM, OpenBLAS, or other linked libraries labels Oct 25, 2022
@andreasnoack
Member

This looks like a threading issue in OpenBLAS on Windows, so it would be great if you could file the issue at https://github.com/xianyi/OpenBLAS. Usually it's necessary to create a reproducer in Fortran or C before they can make progress on the issue.

@yakovbraver
Author

OK, let me see if I can reproduce this issue using the C interface of LAPACK bundled with OpenBLAS.
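A hedged sketch of what such a C reproducer might look like, using the LAPACKE interface bundled with OpenBLAS (compile with e.g. `gcc repro.c -o repro -lopenblas`). The matrix contents are irrelevant to the timing, so repeated calls simply reuse the already-overwritten buffer:

```c
/* Sketch of a C reproducer for the dsytrd/zhetrd timing gap,
 * using the LAPACKE interface bundled with OpenBLAS. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <lapacke.h>

#define N 50
#define REPS 100

int main(void) {
    double *a = malloc((size_t)N * N * sizeof *a);
    lapack_complex_double *c = malloc((size_t)N * N * sizeof *c);
    double d[N], e[N - 1], tau[N - 1];
    lapack_complex_double ztau[N - 1];

    /* Random matrix, symmetrized in place, plus a complex copy. */
    for (int i = 0; i < N * N; i++)
        a[i] = (double)rand() / RAND_MAX;
    for (int i = 0; i < N; i++)
        for (int j = i; j < N; j++) {
            double s = a[i * N + j] + a[j * N + i];
            a[i * N + j] = a[j * N + i] = s;
        }
    for (int i = 0; i < N * N; i++)
        c[i] = lapack_make_complex_double(a[i], 0.0);

    /* Time the real vs. complex reduction to tridiagonal form.
     * The buffers are overwritten on the first call, but their
     * contents do not matter for the timing. */
    clock_t t0 = clock();
    for (int k = 0; k < REPS; k++)
        LAPACKE_dsytrd(LAPACK_COL_MAJOR, 'U', N, a, N, d, e, tau);
    clock_t t1 = clock();
    for (int k = 0; k < REPS; k++)
        LAPACKE_zhetrd(LAPACK_COL_MAJOR, 'U', N, c, N, d, e, ztau);
    clock_t t2 = clock();

    printf("dsytrd: %.3f ms/call\n", 1e3 * (t1 - t0) / CLOCKS_PER_SEC / REPS);
    printf("zhetrd: %.3f ms/call\n", 1e3 * (t2 - t1) / CLOCKS_PER_SEC / REPS);
    free(a);
    free(c);
    return 0;
}
```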

@wheeheee

wheeheee commented Dec 6, 2023

Seems to be fixed going from 1.9.4 to 1.10-rc2. There is a small performance regression for ComplexF64, though.

julia> @btime eigvals($F);
  79.600 μs (11 allocations: 38.70 KiB)  # 1.10-rc2
  968.000 μs (11 allocations: 38.70 KiB) # 1.9.4

julia> @btime eigvals($C);
  189.800 μs (15 allocations: 119.70 KiB) # 1.10-rc2
  171.700 μs (15 allocations: 119.70 KiB) # 1.9.4

julia> @btime LAPACK.hetrd!('U', copy($F));
  34.000 μs (7 allocations: 33.59 KiB)  # 1.10-rc2
  932.900 μs (7 allocations: 33.59 KiB) # 1.9.4

julia> @btime LAPACK.hetrd!('U', copy($C));
  127.300 μs (7 allocations: 66.05 KiB) # 1.10-rc2
  113.600 μs (7 allocations: 66.05 KiB) # 1.9.4

versioninfo:

julia> versioninfo()
Julia Version 1.10.0-rc2
Commit dbb9c46795 (2023-12-03 15:25 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, tigerlake)
  Threads: 11 on 8 virtual cores
Environment:
  JULIA_CONDAPKG_BACKEND = Null
  JULIA_NUM_THREADS = auto

@yakovbraver
Author

Indeed, I can replicate @wheeheee's results (n = 50):

julia> @btime eigvals($F);
  59.300  μs (11 allocations: 38.70 KiB) # 1.10-rc2
  688.900 μs (11 allocations: 38.70 KiB) # 1.9.4

julia> @btime eigvals($C);
  102.900 μs (15 allocations: 119.70 KiB) # 1.10-rc2
  146.900 μs (15 allocations: 119.70 KiB) # 1.9.4

julia> @btime LAPACK.hetrd!('U', copy($F));
  24.100  μs (7 allocations: 33.59 KiB) # 1.10-rc2
  533.800 μs (7 allocations: 33.59 KiB) # 1.9.4

julia> @btime LAPACK.hetrd!('U', copy($C));
  57.400 μs (7 allocations: 66.05 KiB) # 1.10-rc2
  74.600 μs (7 allocations: 66.05 KiB) # 1.9.4

In my case, there is an apparent performance improvement for ComplexF64, but this might just be noise.

Also checked for n = 230:

julia> @btime eigvals($F);
  2.700 ms (11 allocations: 498.77 KiB) # 1.10-rc2
  7.615 ms (11 allocations: 498.77 KiB) # 1.9.4

julia> @btime eigvals($C);
  7.704 ms (15 allocations: 1.80 MiB) # 1.10-rc2
  7.803 ms (15 allocations: 1.80 MiB) # 1.9.4
versioninfo:
julia> versioninfo()
Julia Version 1.10.0-rc2
Commit dbb9c46795 (2023-12-03 15:25 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 16 × 12th Gen Intel(R) Core(TM) i7-1260P
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, alderlake)
  Threads: 23 on 16 virtual cores

@yakovbraver
Author

I have just checked that the released Julia 1.10.0 yields the same results as 1.10-rc2 (see above), so this is fixed and the issue can be closed.

@KristofferC KristofferC transferred this issue from JuliaLang/julia Nov 26, 2024