
Conversation

@mbauman (Member) commented Apr 29, 2025

This method intends to be a SIMD-able optimization for reductions with `min` and `max`, but it fails to achieve those goals on nearly every architecture. For example, on my Mac M1 the generic implementation is more than **6x** faster than this method:

```
julia> A = rand(Float64, 10000);

julia> @btime reduce(max, $A);
  4.673 μs (0 allocations: 0 bytes)

julia> @btime reduce((x,y)->max(x,y), $A);
  718.441 ns (0 allocations: 0 bytes)
```

I asked for some crowd-sourced multi-architecture benchmarks for the above test case on Slack, and only on AMD Ryzen (znver2) and an old Intel i5 Haswell chip did this method help — and then it was only by about 20%.

Worse, this method is bugged for signed zeros; if the "wrong" signed zero is produced, then it _re-runs_ the entire reduction (but without calling `f`) to see if the "right" zero should be returned.

Fixes #45932, fixes #36412, fixes #36081.
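For context on the signed-zero bug: Julia's `min` and `max` order `-0.0` below `0.0`, and a correct reduction must preserve that. A minimal sketch of the semantics at stake:

```julia
# Julia's min/max order signed zeros: -0.0 < 0.0 (checked with ===,
# which distinguishes the two zeros while == does not).
@assert max(0.0, -0.0) === 0.0
@assert min(0.0, -0.0) === -0.0

# Consequently a max-reduction over only negative zeros must return -0.0;
# the removed method could produce the "wrong" zero and then re-scan the
# whole array to patch it up.
@assert reduce(max, [-0.0, -0.0, -0.0]) === -0.0
```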

@mbauman added the `performance` (Must go faster), `bugfix` (This change fixes an existing bug), and `fold` (sum, maximum, reduce, foldl, etc.) labels Apr 29, 2025
@mbauman changed the title from "Remove bugged and typically slower implementation of `minimum` and `m…" to "Remove bugged and typically slower minimum/maximum method" Apr 29, 2025
@giordano (Member)

> Worse, this method is bugged for signed zeros; if the "wrong" signed zero is produced, then it re-runs the entire reduction (but without calling `f`) to see if the "right" zero should be returned.

Sounds like that'd make a good test (if not there already)?
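Such a test might look like the following (a hypothetical sketch; the exact tests added in the PR are not shown here):

```julia
using Test

# Hypothetical regression tests for the signed-zero case: each reduction
# must return the zero with the correct sign bit, checked with === since
# -0.0 == 0.0 would hide the bug.
@test maximum([-0.0, -0.0]) === -0.0
@test minimum([0.0, 0.0]) === 0.0
@test maximum([-0.0, 0.0]) === 0.0
@test minimum([-0.0, 0.0]) === -0.0
```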

@mikmoore (Contributor)

Looks like this eliminates some of the anti-performance specializations cited in #45581. Thanks!

@mbauman (Member, Author) commented Apr 29, 2025

Yeah, looks like this is better than the naive loop #36412 suggested, too:

```julia
julia> function f_minimum(A::Array{<:Integer})
           result = A[1]
           @inbounds for i=2:length(A)
               if A[i] < result
                   result = A[i]
               end
           end
           return result
       end
f_minimum (generic function with 1 method)

julia> x = rand(Int, 10_000);

julia> using BenchmarkTools

julia> @btime minimum($x)
  3.057 μs (0 allocations: 0 bytes)
-9222765095448017038

julia> @btime f_minimum($x)
  1.613 μs (0 allocations: 0 bytes)
-9222765095448017038

julia> @btime reduce((x,y)->min(x,y), $x)
  1.450 μs (0 allocations: 0 bytes)
-9222765095448017038
```
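Part of why the anonymous-function spelling is fast, presumably, is that the wrapper is a distinct function object from `min`, so it does not hit the specialized method that dispatches on `typeof(min)` and instead falls through to the generic reduction loop. A small sketch of that dispatch distinction:

```julia
# An anonymous wrapper around min is a different function object, so any
# method specialized on typeof(min) is bypassed for it:
g = (x, y) -> min(x, y)
@assert g !== min

# Both spellings still agree on the result, of course:
x = [3, 1, 4, 1, 5]
@assert reduce(g, x) == reduce(min, x) == 1
```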

@mbauman mentioned this pull request Apr 29, 2025
oscardssmith pushed a commit that referenced this pull request May 2, 2025
This method is no longer an optimization over what Julia can do with the
naive definition on most (if not all) architectures.

Like #58267, I asked for a smattering of crowdsourced multi-architecture
benchmarking of this simple example:

```julia
using BenchmarkTools
A = rand(10000);
b1 = @benchmark extrema($A)
b2 = @benchmark mapreduce(x->(x,x),((min1, max1), (min2, max2))->(min(min1, min2), max(max1, max2)), $A)
println("$(Sys.CPU_NAME): $(round(median(b1).time/median(b2).time, digits=1))x faster")
```

With results:

```txt
cortex-a72: 13.2x faster
cortex-a76: 15.8x faster
neoverse-n1: 16.4x faster
neoverse-v2: 23.4x faster
a64fx: 46.5x faster

apple-m1: 54.9x faster
apple-m4*: 43.7x faster

znver2: 8.6x faster
znver4: 12.8x faster
znver5: 16.7x faster

haswell (32-bit): 3.5x faster
skylake-avx512: 7.4x faster
rocketlake: 7.8x faster
alderlake: 5.2x faster
cascadelake: 8.8x faster
cascadelake: 7.1x faster
```

The results are even more dramatic for Float32s, here on my M1:

```julia
julia> A = rand(Float32, 10000);

julia> @benchmark extrema($A)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  49.083 μs … 151.750 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     49.375 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   49.731 μs ±   2.350 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅██▅▁       ▁▂▂       ▁▂▁                                    ▂
  ██████▇▇▇▇█▇████▇▆▆▆▇▇███▇▇▆▆▆▅▆▅▃▄▃▄▅▄▄▆▆▅▃▁▄▃▅▄▅▄▄▁▄▄▅▃▄▁▄ █
  49.1 μs       Histogram: log(frequency) by time      56.8 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark mapreduce(x->(x,x),((min1, max1), (min2, max2))->(min(min1, min2), max(max1, max2)), $A)
BenchmarkTools.Trial: 10000 samples with 191 evaluations per sample.
 Range (min … max):  524.435 ns …  1.104 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     525.089 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   529.323 ns ± 20.876 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▃      ▁ ▃▃                                                 ▁
  █████▇███▇███▇▇▇▇▇▇▇▇▅▆▆▆▆▆▅▅▄▆▃▄▄▃▅▅▄▃▅▄▄▄▅▅▅▃▅▄▄▁▄▄▅▆▄▄▅▄▅ █
  524 ns        Histogram: log(frequency) by time       609 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
```

Closes #34790, closes #31442, closes #44606.

---------

Co-authored-by: Mosè Giordano <[email protected]>
@giordano (Member) left a comment

Similarly to #58280, I'm not familiar with the reduction stuff, but deleting code while getting correct results (with tests) and better performance looks like a no-brainer yes to me.

@oscardssmith (Member)

On 11th-gen Intel, I see a much smaller, but still real, performance win. It's likely worth noting for posterity that on 1.11 and older, the generic method is ~5x slower. I believe #56371 is what makes this PR an improvement rather than a massive regression.

@JeffBezanson merged commit 99ba3c7 into master May 7, 2025
7 checks passed
@JeffBezanson deleted the mb/minimize-reduce branch May 7, 2025 19:01
charleskawczynski pushed a commit to charleskawczynski/julia that referenced this pull request May 12, 2025
charleskawczynski pushed a commit to charleskawczynski/julia that referenced this pull request May 12, 2025
Joel-Dahne added a commit to kalmarek/Arblib.jl that referenced this pull request May 17, 2025
The special implementation of minimum and maximum was removed in
JuliaLang/julia#58267. This means we no longer
need the workarounds for versions after this.
@mbauman added the `backport 1.12` (Change should be backported to release-1.12) label Jul 3, 2025
@mbauman (Member, Author) commented Jul 3, 2025

Consistent with what @oscardssmith notes, this PR sees the big perf improvement on v1.12 but is relatively perf-neutral on v1.11 and v1.10 on my M1 arch. Marking for backport to 1.12 given that it also fixes some bugs.

KristofferC pushed a commit that referenced this pull request Jul 8, 2025
(cherry picked from commit 99ba3c7)
@KristofferC mentioned this pull request Jul 8, 2025
@KristofferC removed the `backport 1.12` (Change should be backported to release-1.12) label Jul 9, 2025