Performance regression in broadcasting with CartesianIndices on v1.6.0-DEV #38086

@kimikage

Although I haven't identified the cause, I've noticed a ~10x slowdown in simple broadcasting operations on nightly.

julia> versioninfo() # just the last version I had cached, not bisected
Julia Version 1.6.0-DEV.1117
Commit 36effbe10a (2020-10-02 17:38 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-10.0.1 (ORCJIT, skylake)

julia> xu8 = rand(UInt8, 1000, 1000); yu8 = rand(UInt8, 1000, 1000);

julia> @btime $xu8 .+ $yu8;
  85.700 μs (2 allocations: 976.70 KiB)

julia> versioninfo()
Julia Version 1.6.0-DEV.1274
Commit 444aa87348 (2020-10-17 22:11 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.0 (ORCJIT, skylake)

julia> @btime $xu8 .+ $yu8;
  795.400 μs (2 allocations: 976.70 KiB)

julia> xu8a = rand(UInt8, 1000 * 10, 1000); yu8a = rand(UInt8, 1000 * 10, 1000);

julia> @btime $xu8a .+ $yu8a; # roughly proportional to the number of elements
  11.190 ms (2 allocations: 9.54 MiB)

julia> xu8b = rand(UInt8, 1000 ÷ 10, 1000); yu8b = rand(UInt8, 1000 ÷ 10, 1000);

julia> @btime $xu8b .+ $yu8b;
  86.100 μs (2 allocations: 97.77 KiB)
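
The timings above are easier to compare once normalized to a per-element cost. The sketch below only reuses the figures reported in this post; the `timings` table, its labels, and the `ns_per_element` name are illustrative and not part of any API.

```julia
# Timings copied from the @btime results reported above (in microseconds),
# paired with the number of elements in each array.
# The labels and variable names are illustrative only.
timings = [
    ("DEV.1117, 1000x1000",  85.7,    1000 * 1000),
    ("DEV.1274, 1000x1000",  795.4,   1000 * 1000),
    ("DEV.1274, 10000x1000", 11190.0, 10000 * 1000),
    ("DEV.1274, 100x1000",   86.1,    100 * 1000),
]

# Nanoseconds per element = (microseconds * 1000) / number of elements.
ns_per_element = [(label, t_us * 1000 / n) for (label, t_us, n) in timings]

for (label, ns) in ns_per_element
    println(rpad(label, 24), round(ns, digits = 3), " ns/element")
end

# Slowdown of the newer build on the identical 1000x1000 workload.
slowdown = 795.4 / 85.7
println("slowdown: ", round(slowdown, digits = 2), "x")
```

On these numbers the newer build spends roughly 0.8 to 1.1 ns per element regardless of array size, against about 0.09 ns per element on the older build, which is consistent with a per-element cost rather than a fixed overhead.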

`@code_native` result for 1-D arrays (somewhat misleading; see the comments below):
julia> @code_native broadcast(+, UInt8[], UInt8[]) # (**Edit:** FOR 1-D ARRAYS) at least it's SIMD vectorized.
...snip...
; ││││┌ @ simdloop.jl:77 within `macro expansion' @ broadcast.jl:932
; │││││┌ @ broadcast.jl:575 within `getindex'
; ││││││┌ @ broadcast.jl:620 within `_broadcast_getindex'
; │││││││┌ @ broadcast.jl:644 within `_getindex' @ broadcast.jl:645
; ││││││││┌ @ broadcast.jl:614 within `_broadcast_getindex'
; │││││││││┌ @ array.jl:809 within `getindex'
L1216:
        vmovdqu (%rcx,%rbx), %ymm0
        vmovdqu 32(%rcx,%rbx), %ymm1
        vmovdqu 64(%rcx,%rbx), %ymm2
        vmovdqu 96(%rcx,%rbx), %ymm3
; ││││││└└└└
; ││││││┌ @ broadcast.jl:621 within `_broadcast_getindex'
; │││││││┌ @ broadcast.jl:648 within `_broadcast_getindex_evalf'
; ││││││││┌ @ int.jl:87 within `+'
        vpaddb  (%r9,%rbx), %ymm0, %ymm0
        vpaddb  32(%r9,%rbx), %ymm1, %ymm1
        vpaddb  64(%r9,%rbx), %ymm2, %ymm2
        vpaddb  96(%r9,%rbx), %ymm3, %ymm3
; │││││└└└└
; │││││┌ @ array.jl:847 within `setindex!'
        vmovdqu %ymm0, (%rdx,%rbx)
        vmovdqu %ymm1, 32(%rdx,%rbx)
        vmovdqu %ymm2, 64(%rdx,%rbx)
        vmovdqu %ymm3, 96(%rdx,%rbx)
; ││││└└
; ││││┌ @ simdloop.jl:78 within `macro expansion'
; │││││┌ @ int.jl:87 within `+'
        subq    $-128, %rbx
        cmpq    %rbx, %rdi
        jne     L1216
; ││││└└
...snip...

The most noticeable difference is the LLVM version (10 vs. 11), but at the moment I have no evidence that LLVM 11 is the cause.
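
Because the `@code_native` listing above covers only the 1-D case, one possible way to narrow the problem down is to time the 2-D broadcast against the same data viewed as 1-D vectors and against a hand-written linear-index loop. This is a minimal sketch, not a diagnosis: `add_linear`, `best_time`, and `compare` are hypothetical helpers, Base's `@elapsed` stands in for `@btime` to avoid a package dependency, and it assumes (as the issue title suggests) that the slow path is specific to multi-dimensional indexing.

```julia
# Hypothetical helper: element-wise addition with an explicit linear-index loop,
# so that no multi-dimensional (Cartesian) index iteration is involved.
function add_linear(x::Array{T}, y::Array{T}) where {T}
    size(x) == size(y) || throw(DimensionMismatch("x and y must have the same size"))
    out = similar(x)
    @inbounds @simd for i in eachindex(out, x, y)
        out[i] = x[i] + y[i]
    end
    return out
end

# Hypothetical helper: minimum wall-clock time over n runs, after one warm-up call
# so that compilation time is not measured.
function best_time(f; n = 20)
    f()
    best = Inf
    for _ in 1:n
        t = @elapsed f()
        best = min(best, t)
    end
    return best
end

# Compare the three code paths on the same data.
function compare(x::Matrix, y::Matrix)
    xv, yv = vec(x), vec(y)  # 1-D views sharing memory with x and y
    t2d  = best_time(() -> x .+ y)          # 2-D broadcast
    t1d  = best_time(() -> xv .+ yv)        # 1-D broadcast
    tlin = best_time(() -> add_linear(x, y)) # explicit linear loop
    return (broadcast_2d = t2d, broadcast_1d = t1d, linear_loop = tlin)
end

xu8 = rand(UInt8, 1000, 1000); yu8 = rand(UInt8, 1000, 1000)
result = compare(xu8, yu8)
println(result)
```

If `broadcast_2d` is much slower than the other two on an affected build, that would point at the multi-dimensional indexing path rather than at the element-wise addition itself.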

Labels: performance (Must go faster), regression (Regression in behavior compared to a previous version)