
Conversation

@mbauman (Member) commented Apr 29, 2025

This method intends to be a SIMD-able optimization for reductions with `min` and `max`, but it fails to achieve those goals on nearly every architecture. For example, on my Mac M1 the generic implementation is more than **6x** faster than this method:

```
julia> A = rand(Float64, 10000);

julia> @btime reduce(max, $A);
  4.673 μs (0 allocations: 0 bytes)

julia> @btime reduce((x,y)->max(x,y), $A);
  718.441 ns (0 allocations: 0 bytes)
```

I asked for some crowd-sourced multi-architecture benchmarks for the above test case on Slack, and only on AMD Ryzen (znver2) and an old Intel i5 Haswell chip did this method help — and then it was only by about 20%.

Worse, this method is bugged for signed zeros; if the "wrong" signed zero is produced, then it _re-runs_ the entire reduction (but without calling `f`) to see if the "right" zero should be returned.

Fixes #45932, fixes #36412, fixes #36081.
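For context on the signed-zero bug: Julia's `min` and `max` order `-0.0` below `0.0`, and a correct reduction must preserve that. A minimal sketch of the semantics at stake:

```julia
# Julia's min/max order signed zeros: -0.0 < 0.0 (checked with ===,
# which distinguishes the two zeros while == does not).
@assert max(0.0, -0.0) === 0.0
@assert min(0.0, -0.0) === -0.0

# Consequently a max-reduction over only negative zeros must return -0.0;
# the removed method could produce the "wrong" zero and then re-scan the
# whole array to patch it up.
@assert reduce(max, [-0.0, -0.0, -0.0]) === -0.0
```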

@mbauman added the `performance` (Must go faster), `bugfix` (This change fixes an existing bug), and `fold` (sum, maximum, reduce, foldl, etc.) labels Apr 29, 2025
@mbauman changed the title from "Remove bugged and typically slower implementation of `minimum` and `m…" to "Remove bugged and typically slower minimum/maximum method" Apr 29, 2025
@giordano (Member)

> Worse, this method is bugged for signed zeros; if the "wrong" signed zero is produced, then it re-runs the entire reduction (but without calling `f`) to see if the "right" zero should be returned.

Sounds like that'd make a good test (if not there already)?
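Such a test might look like the following (a hypothetical sketch; the exact tests added in the PR are not shown here):

```julia
using Test

# Hypothetical regression tests for the signed-zero case: each reduction
# must return the zero with the correct sign bit, checked with === since
# -0.0 == 0.0 would hide the bug.
@test maximum([-0.0, -0.0]) === -0.0
@test minimum([0.0, 0.0]) === 0.0
@test maximum([-0.0, 0.0]) === 0.0
@test minimum([-0.0, 0.0]) === -0.0
```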

@mikmoore (Contributor)

Looks like this eliminates some of the anti-performance specializations cited in #45581. Thanks!

@mbauman (Member, Author) commented Apr 29, 2025

Yeah, looks like this is better than the naive loop #36412 suggested, too:

```julia
julia> function f_minimum(A::Array{<:Integer})
           result = A[1]
           @inbounds for i=2:length(A)
               if A[i] < result
                   result = A[i]
               end
           end
           return result
       end
f_minimum (generic function with 1 method)

julia> x = rand(Int, 10_000);

julia> using BenchmarkTools

julia> @btime minimum($x)
  3.057 μs (0 allocations: 0 bytes)
-9222765095448017038

julia> @btime f_minimum($x)
  1.613 μs (0 allocations: 0 bytes)
-9222765095448017038

julia> @btime reduce((x,y)->min(x,y), $x)
  1.450 μs (0 allocations: 0 bytes)
-9222765095448017038
```
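Part of why the anonymous-function spelling is fast, presumably, is that the wrapper is a distinct function object from `min`, so it does not hit the specialized method that dispatches on `typeof(min)` and instead falls through to the generic reduction loop. A small sketch of that dispatch distinction:

```julia
# An anonymous wrapper around min is a different function object, so any
# method specialized on typeof(min) is bypassed for it:
g = (x, y) -> min(x, y)
@assert g !== min

# Both spellings still agree on the result, of course:
x = [3, 1, 4, 1, 5]
@assert reduce(g, x) == reduce(min, x) == 1
```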

@mbauman mentioned this pull request Apr 29, 2025
oscardssmith pushed a commit that referenced this pull request May 2, 2025
This method is no longer an optimization over what Julia can do with the
naive definition on most (if not all) architectures.

Like #58267, I asked for a smattering of crowdsourced multi-architecture
benchmarking of this simple example:

```julia
using BenchmarkTools
A = rand(10000);
b1 = @benchmark extrema($A)
b2 = @benchmark mapreduce(x->(x,x),((min1, max1), (min2, max2))->(min(min1, min2), max(max1, max2)), $A)
println("$(Sys.CPU_NAME): $(round(median(b1).time/median(b2).time, digits=1))x faster")
```

With results:

```txt
cortex-a72: 13.2x faster
cortex-a76: 15.8x faster
neoverse-n1: 16.4x faster
neoverse-v2: 23.4x faster
a64fx: 46.5x faster

apple-m1: 54.9x faster
apple-m4*: 43.7x faster

znver2: 8.6x faster
znver4: 12.8x faster
znver5: 16.7x faster

haswell (32-bit): 3.5x faster
skylake-avx512: 7.4x faster
rocketlake: 7.8x faster
alderlake: 5.2x faster
cascadelake: 8.8x faster
cascadelake: 7.1x faster
```

The results are even more dramatic for Float32s, here on my M1:

```julia
julia> A = rand(Float32, 10000);

julia> @benchmark extrema($A)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  49.083 μs … 151.750 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     49.375 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   49.731 μs ±   2.350 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅██▅▁       ▁▂▂       ▁▂▁                                    ▂
  ██████▇▇▇▇█▇████▇▆▆▆▇▇███▇▇▆▆▆▅▆▅▃▄▃▄▅▄▄▆▆▅▃▁▄▃▅▄▅▄▄▁▄▄▅▃▄▁▄ █
  49.1 μs       Histogram: log(frequency) by time      56.8 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark mapreduce(x->(x,x),((min1, max1), (min2, max2))->(min(min1, min2), max(max1, max2)), $A)
BenchmarkTools.Trial: 10000 samples with 191 evaluations per sample.
 Range (min … max):  524.435 ns …  1.104 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     525.089 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   529.323 ns ± 20.876 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▃      ▁ ▃▃                                                 ▁
  █████▇███▇███▇▇▇▇▇▇▇▇▅▆▆▆▆▆▅▅▄▆▃▄▄▃▅▅▄▃▅▄▄▄▅▅▅▃▅▄▄▁▄▄▅▆▄▄▅▄▅ █
  524 ns        Histogram: log(frequency) by time       609 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

Closes #34790, closes #31442, closes #44606.

---------

Co-authored-by: Mosè Giordano <[email protected]>
@giordano (Member) left a comment

Similarly to #58280, I'm not familiar with the reduction stuff, but deleting code while getting correct results (with tests) and better performance looks like a no-brainer yes to me.

@oscardssmith (Member)

On 11th-gen Intel, I see a much smaller, but still real, performance win. It's likely worth noting for posterity that on 1.11 and older, the generic method is ~5x slower. I believe #56371 is what makes this PR an improvement rather than a massive regression.

@JeffBezanson merged commit 99ba3c7 into master May 7, 2025
7 checks passed
@JeffBezanson deleted the mb/minimize-reduce branch May 7, 2025 19:01
charleskawczynski pushed a commit to charleskawczynski/julia that referenced this pull request May 12, 2025
charleskawczynski pushed a commit to charleskawczynski/julia that referenced this pull request May 12, 2025
Joel-Dahne added a commit to kalmarek/Arblib.jl that referenced this pull request May 17, 2025
The special implementation of minimum and maximum was removed in
JuliaLang/julia#58267. This means we no longer
need the workarounds for versions after this.
@mbauman added the `backport 1.12` (Change should be backported to release-1.12) label Jul 3, 2025
@mbauman (Member, Author) commented Jul 3, 2025

Consistent with what @oscardssmith notes, this PR sees the big perf improvement on v1.12 but is relatively perf-neutral on v1.11 and v1.10 on my M1 arch. Marking for backport to 1.12 given that it also fixes some bugs.

KristofferC pushed a commit that referenced this pull request Jul 8, 2025
(cherry picked from commit 99ba3c7)
@KristofferC mentioned this pull request Jul 8, 2025
@KristofferC removed the `backport 1.12` (Change should be backported to release-1.12) label Jul 9, 2025