
eigvals performs faster for Matrix{ComplexF64} than Matrix{Float64} on Windows #960


Closed
yakovbraver opened this issue Oct 18, 2022 · 9 comments
Labels
external dependencies Involves LLVM, OpenBLAS, or other linked libraries performance Must go faster system:windows Affects only Windows upstream The issue is with an upstream dependency, e.g. LLVM

Comments

@yakovbraver

(Cross-posting from Discourse)
I’ve noticed that when diagonalising real symmetric matrices using the default OpenBLAS, eigvals may perform faster if the input matrix is complex, i.e. Matrix{ComplexF64} rather than Matrix{Float64}. Here is my test code:

using LinearAlgebra, BenchmarkTools

n = 50             # matrix dimension
F = rand(n, n)     # a random real Float64 matrix
F += F'            # make `F` symmetric
C = ComplexF64.(F) # a copy of `F` stored as a `Matrix{ComplexF64}`

@benchmark eigvals($F)
@benchmark eigvals($C)

For n = 50, the complex matrix is diagonalised ~5 times faster than the real one:
(screenshot: benchmark results, 2022-10-17)
For n = 230, both calculations take the same amount of time, and for larger matrices the complex calculation becomes slower than the real one, as expected.
I could reproduce these results on four different machines running Windows 10, while on macOS 10.14.6 the issue is not present (the real calculation performs faster than the complex, as expected). The outputs of versioninfo() are available in a gist.
The issue does not seem to appear on Linux either (see Discourse).
When I switch to MKL.jl, the issue disappears.
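For anyone reproducing this, it can help to first confirm which BLAS backend and how many BLAS threads are in use. A minimal sketch (assuming Julia ≥ 1.7, where `BLAS.get_config` is available):

```julia
using LinearAlgebra

# Which BLAS library is Julia linked against?
# (libopenblas by default; libmkl_rt after `using MKL`.)
println(BLAS.get_config())

# How many threads will OpenBLAS use for LAPACK calls
# such as dsytrd (real) / zhetrd (complex)?
println(BLAS.get_num_threads())
```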

@oscardssmith oscardssmith changed the title eigvals performs faster for Matrix{ComplexF64} than Matrix{Float64} eigvals performs faster for Matrix{ComplexF64} than Matrix{Float64} on Windows Oct 18, 2022
@andreasnoack
Member

Could you please try to time just LAPACK.hetrd!('U', copy(F)) and LAPACK.hetrd!('U', copy(C))?

@yakovbraver
Author

Here are the results on the same Windows machine #1 as above:
(screenshot: hetrd! benchmark results)
This is revealing. I've obtained similar results on the other Windows machines, while on macOS the Float64 calculation was faster than ComplexF64, as expected.

@andreasnoack
Member

So the problem is in the reduction to symmetric tridiagonal form. Could you please try again with OPENBLAS_NUM_THREADS=1, and also share the output of versioninfo()?
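For reference, a minimal sketch of limiting OpenBLAS to a single thread from within a running session, which should be equivalent to launching Julia with the environment variable already set:

```julia
using LinearAlgebra

# Equivalent to launching Julia with OPENBLAS_NUM_THREADS=1 in the environment:
BLAS.set_num_threads(1)
println(BLAS.get_num_threads())  # prints 1
```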

@yakovbraver
Author

Setting OPENBLAS_NUM_THREADS=1 fixes it; here is the output, again from the same Windows machine #1 as above:
(screenshot: hetrd! benchmark results with OPENBLAS_NUM_THREADS=1)
I tried this on all four Windows 10 test machines (versioninfo() here), and the results are the same.
It turns out that setting OPENBLAS_NUM_THREADS=1 on an Intel Mac gives a speedup as well:
(screenshot: benchmark results, 2022-10-25)
I've also found that launching Julia in multithreaded mode (e.g. julia -t8) has no effect on this benchmark.

@andreasnoack andreasnoack added performance Must go faster upstream The issue is with an upstream dependency, e.g. LLVM system:windows Affects only Windows external dependencies Involves LLVM, OpenBLAS, or other linked libraries labels Oct 25, 2022
@andreasnoack
Member

This looks like a threading issue in OpenBLAS on Windows, so it would be great if you could file the issue at https://github.com/xianyi/OpenBLAS. Usually it's necessary to create a reproducer in Fortran or C before they can make progress on the issue.

@yakovbraver
Author

OK, let me see if I can reproduce this issue using the C interface of LAPACK bundled with OpenBLAS.
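A hedged sketch of what such a C reproducer might look like, using the LAPACKE interface bundled with OpenBLAS (compile with e.g. `gcc repro.c -o repro -lopenblas`). The matrix contents are irrelevant to the timing, so repeated calls simply reuse the already-overwritten buffer:

```c
/* Sketch of a C reproducer for the dsytrd/zhetrd timing gap,
 * using the LAPACKE interface bundled with OpenBLAS. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <lapacke.h>

#define N 50
#define REPS 100

int main(void) {
    double *a = malloc((size_t)N * N * sizeof *a);
    lapack_complex_double *c = malloc((size_t)N * N * sizeof *c);
    double d[N], e[N - 1], tau[N - 1];
    lapack_complex_double ztau[N - 1];

    /* Random matrix, symmetrized in place, plus a complex copy. */
    for (int i = 0; i < N * N; i++)
        a[i] = (double)rand() / RAND_MAX;
    for (int i = 0; i < N; i++)
        for (int j = i; j < N; j++) {
            double s = a[i * N + j] + a[j * N + i];
            a[i * N + j] = a[j * N + i] = s;
        }
    for (int i = 0; i < N * N; i++)
        c[i] = lapack_make_complex_double(a[i], 0.0);

    /* Time the real vs. complex reduction to tridiagonal form.
     * The buffers are overwritten on the first call, but their
     * contents do not matter for the timing. */
    clock_t t0 = clock();
    for (int k = 0; k < REPS; k++)
        LAPACKE_dsytrd(LAPACK_COL_MAJOR, 'U', N, a, N, d, e, tau);
    clock_t t1 = clock();
    for (int k = 0; k < REPS; k++)
        LAPACKE_zhetrd(LAPACK_COL_MAJOR, 'U', N, c, N, d, e, ztau);
    clock_t t2 = clock();

    printf("dsytrd: %.3f ms/call\n", 1e3 * (t1 - t0) / CLOCKS_PER_SEC / REPS);
    printf("zhetrd: %.3f ms/call\n", 1e3 * (t2 - t1) / CLOCKS_PER_SEC / REPS);
    free(a);
    free(c);
    return 0;
}
```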

@wheeheee

wheeheee commented Dec 6, 2023

Seems to be fixed going from 1.9.4 to 1.10-rc2. There is a small performance regression for ComplexF64, though.

julia> @btime eigvals($F);
  79.600 μs (11 allocations: 38.70 KiB)  # 1.10-rc2
  968.000 μs (11 allocations: 38.70 KiB) # 1.9.4

julia> @btime eigvals($C);
  189.800 μs (15 allocations: 119.70 KiB) # 1.10-rc2
  171.700 μs (15 allocations: 119.70 KiB) # 1.9.4

julia> @btime LAPACK.hetrd!('U', copy($F));
  34.000 μs (7 allocations: 33.59 KiB)  # 1.10-rc2
  932.900 μs (7 allocations: 33.59 KiB) # 1.9.4

julia> @btime LAPACK.hetrd!('U', copy($C));
  127.300 μs (7 allocations: 66.05 KiB) # 1.10-rc2
  113.600 μs (7 allocations: 66.05 KiB) # 1.9.4

versioninfo:

julia> versioninfo()
Julia Version 1.10.0-rc2
Commit dbb9c46795 (2023-12-03 15:25 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, tigerlake)
  Threads: 11 on 8 virtual cores
Environment:
  JULIA_CONDAPKG_BACKEND = Null
  JULIA_NUM_THREADS = auto

@yakovbraver
Author

Indeed, I can replicate @wheeheee's results (n = 50):

julia> @btime eigvals($F);
  59.300  μs (11 allocations: 38.70 KiB) # 1.10-rc2
  688.900 μs (11 allocations: 38.70 KiB) # 1.9.4

julia> @btime eigvals($C);
  102.900 μs (15 allocations: 119.70 KiB) # 1.10-rc2
  146.900 μs (15 allocations: 119.70 KiB) # 1.9.4

julia> @btime LAPACK.hetrd!('U', copy($F));
  24.100  μs (7 allocations: 33.59 KiB) # 1.10-rc2
  533.800 μs (7 allocations: 33.59 KiB) # 1.9.4

julia> @btime LAPACK.hetrd!('U', copy($C));
  57.400 μs (7 allocations: 66.05 KiB) # 1.10-rc2
  74.600 μs (7 allocations: 66.05 KiB) # 1.9.4

In my case, there is an apparent performance improvement for ComplexF64, but this might just be noise.

Also checked for n = 230:

julia> @btime eigvals($F);
  2.700 ms (11 allocations: 498.77 KiB) # 1.10-rc2
  7.615 ms (11 allocations: 498.77 KiB) # 1.9.4

julia> @btime eigvals($C);
  7.704 ms (15 allocations: 1.80 MiB) # 1.10-rc2
  7.803 ms (15 allocations: 1.80 MiB) # 1.9.4
versioninfo:
julia> versioninfo()
Julia Version 1.10.0-rc2
Commit dbb9c46795 (2023-12-03 15:25 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 16 × 12th Gen Intel(R) Core(TM) i7-1260P
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, alderlake)
  Threads: 23 on 16 virtual cores

@yakovbraver
Author

I have just checked that the released Julia 1.10.0 yields the same results as 1.10-rc2 (see above), so this is fixed and the issue can be closed.

@KristofferC KristofferC transferred this issue from JuliaLang/julia Nov 26, 2024