Closed
Labels: performance (Must go faster), regression (Regression in behavior compared to a previous version)
Description
Although I haven't identified the cause, I've noticed a ~10x slowdown in simple broadcasting operations on nightly.
```julia
julia> versioninfo() # just the last version I had cached, not bisected
Julia Version 1.6.0-DEV.1117
Commit 36effbe10a (2020-10-02 17:38 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-10.0.1 (ORCJIT, skylake)

julia> xu8 = rand(UInt8, 1000, 1000); yu8 = rand(UInt8, 1000, 1000);

julia> @btime $xu8 .+ $yu8;
  85.700 μs (2 allocations: 976.70 KiB)
```

```julia
julia> versioninfo()
Julia Version 1.6.0-DEV.1274
Commit 444aa87348 (2020-10-17 22:11 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.0 (ORCJIT, skylake)

julia> @btime $xu8 .+ $yu8;
  795.400 μs (2 allocations: 976.70 KiB)

julia> xu8a = rand(UInt8, 1000 * 10, 1000); yu8a = rand(UInt8, 1000 * 10, 1000);

julia> @btime $xu8a .+ $yu8a; # roughly proportional to the number of elements
  11.190 ms (2 allocations: 9.54 MiB)

julia> xu8b = rand(UInt8, 1000 ÷ 10, 1000); yu8b = rand(UInt8, 1000 ÷ 10, 1000);

julia> @btime $xu8b .+ $yu8b;
  86.100 μs (2 allocations: 97.77 KiB)
```
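One way to narrow this down might be to time the same addition written as an explicit `@inbounds @simd` loop next to the broadcast. If the loop stays fast on the slow build while the broadcast does not, that would point at the broadcast code path rather than at `UInt8` vector code generation in general. The sketch below is only an illustration: `add_loop!` and `min_time` are hypothetical helper names, and it uses Base's `@elapsed` instead of `@btime` so that it has no package dependencies, which makes the numbers noisier than the BenchmarkTools results above.

```julia
# Explicit-loop baseline for comparison with the broadcast `x .+ y`.
# `add_loop!` and `min_time` are hypothetical helpers, not part of Base.
function add_loop!(out::Matrix{UInt8}, x::Matrix{UInt8}, y::Matrix{UInt8})
    size(out) == size(x) == size(y) || throw(DimensionMismatch("sizes must match"))
    @inbounds @simd for i in eachindex(out, x, y)
        out[i] = x[i] + y[i]   # UInt8 addition wraps around, as in the broadcast
    end
    return out
end

# Minimum wall-clock time in seconds of `f()` over `n` runs, after one warm-up call.
function min_time(f, n = 20)
    f()  # warm-up so that compilation is not timed
    return minimum(@elapsed(f()) for _ in 1:n)
end

xu8 = rand(UInt8, 1000, 1000); yu8 = rand(UInt8, 1000, 1000)
out = similar(xu8)

t_bc      = min_time(() -> xu8 .+ yu8)            # allocating broadcast, as in the report
t_inplace = min_time(() -> (out .= xu8 .+ yu8))   # in-place broadcast, no output allocation
t_loop    = min_time(() -> add_loop!(out, xu8, yu8))

println("broadcast (allocating): ", round(t_bc * 1e6, digits = 1), " μs")
println("broadcast (in-place):   ", round(t_inplace * 1e6, digits = 1), " μs")
println("explicit loop:          ", round(t_loop * 1e6, digits = 1), " μs")
```

Running the same script on both nightly builds and comparing the three timings would show which of the variants regressed.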
`@code_native` result for "1-D" arrays (It's somewhat misleading. See comments below.)
```julia
julia> @code_native broadcast(+, UInt8[], UInt8[]) # (**Edit:** FOR 1-D ARRAYS) at least it's SIMD vectorized.
...snip...
; ││││┌ @ simdloop.jl:77 within `macro expansion' @ broadcast.jl:932
; │││││┌ @ broadcast.jl:575 within `getindex'
; ││││││┌ @ broadcast.jl:620 within `_broadcast_getindex'
; │││││││┌ @ broadcast.jl:644 within `_getindex' @ broadcast.jl:645
; ││││││││┌ @ broadcast.jl:614 within `_broadcast_getindex'
; │││││││││┌ @ array.jl:809 within `getindex'
L1216:
        vmovdqu (%rcx,%rbx), %ymm0
        vmovdqu 32(%rcx,%rbx), %ymm1
        vmovdqu 64(%rcx,%rbx), %ymm2
        vmovdqu 96(%rcx,%rbx), %ymm3
; ││││││└└└└
; ││││││┌ @ broadcast.jl:621 within `_broadcast_getindex'
; │││││││┌ @ broadcast.jl:648 within `_broadcast_getindex_evalf'
; ││││││││┌ @ int.jl:87 within `+'
        vpaddb (%r9,%rbx), %ymm0, %ymm0
        vpaddb 32(%r9,%rbx), %ymm1, %ymm1
        vpaddb 64(%r9,%rbx), %ymm2, %ymm2
        vpaddb 96(%r9,%rbx), %ymm3, %ymm3
; │││││└└└└
; │││││┌ @ array.jl:847 within `setindex!'
        vmovdqu %ymm0, (%rdx,%rbx)
        vmovdqu %ymm1, 32(%rdx,%rbx)
        vmovdqu %ymm2, 64(%rdx,%rbx)
        vmovdqu %ymm3, 96(%rdx,%rbx)
; ││││└└
; ││││┌ @ simdloop.jl:78 within `macro expansion'
; │││││┌ @ int.jl:87 within `+'
        subq $-128, %rbx
        cmpq %rbx, %rdi
        jne L1216
; ││││└└
...snip...
```
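Because the listing above covers only 1-D arrays while the benchmarks use 1000×1000 matrices, it may help to inspect the 2-D method directly. The following is a rough sketch, assuming the broadcast loop is inlined into the `broadcast` method so that it shows up in its IR (which is not guaranteed): it captures the optimized LLVM IR for `Matrix{UInt8}` arguments and checks whether any `<N x i8>` vector type appears. The variable names `ir` and `has_i8_vectors` are illustrative only.

```julia
using InteractiveUtils  # stdlib; provides code_llvm and code_native

# Capture the optimized LLVM IR for the 2-D case that the benchmarks exercise.
io = IOBuffer()
code_llvm(io, broadcast, Tuple{typeof(+), Matrix{UInt8}, Matrix{UInt8}};
          debuginfo = :none)
ir = String(take!(io))

# Heuristic only: vectorized byte arithmetic shows up as types such as `<32 x i8>`.
# A `false` here may also mean the loop lives in a callee that was not inlined.
has_i8_vectors = occursin(r"<\d+ x i8>", ir)
println("IR mentions <N x i8> vectors: ", has_i8_vectors)
```

Comparing the result of this check between the two builds would indicate whether the 2-D loop is still vectorized on the slower one.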
The most noticeable difference between the two builds is the LLVM version (10 vs. 11), but at the moment I have no evidence that LLVM 11 is the cause.