Remove bugged and typically slower `minimum`/`maximum` method #58267
Conversation
This method intends to be a SIMD-able optimization for reductions with `min` and `max`, but it fails to achieve those goals on nearly every architecture. For example, on my Mac M1 the generic implementation is more than **6x** faster than this method:

```julia
julia> A = rand(Float64, 10000);

julia> @btime reduce(max, $A);
  4.673 μs (0 allocations: 0 bytes)

julia> @btime reduce((x,y)->max(x,y), $A);
  718.441 ns (0 allocations: 0 bytes)
```

I asked for some crowd-sourced multi-architecture benchmarks for the above test case on Slack, and only on AMD Ryzen (znver2) and an old Intel i5 Haswell chip did this method help — and then it was only by about 20%.

Worse, this method is bugged for signed zeros: if the "wrong" signed zero is produced, it _re-runs_ the entire reduction (but without calling `f`) to see if the "right" zero should be returned.
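For context, the signed-zero hazard comes from IEEE 754 treating `-0.0 == 0.0`. A minimal sketch (with a hypothetical `naive_min`, not the removed method):

```julia
# Base's min orders signed zeros: min(0.0, -0.0) === -0.0.
# A plain `<` comparison cannot, because IEEE 754 says -0.0 == 0.0.
function naive_min(A)
    r = A[1]
    for x in A
        x < r && (r = x)   # -0.0 < 0.0 is false, so 0.0 survives
    end
    return r
end

naive_min([0.0, -0.0])  # 0.0 — the "wrong" zero
minimum([0.0, -0.0])    # -0.0
```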
Sounds like that'd make a good test (if not there already)?
Looks like this eliminates some of the anti-performance specializations cited in #45581. Thanks!
Yeah, looks like this is better than the naive loop #36412 suggested, too:

```julia
julia> function f_minimum(A::Array{<:Integer})
           result = A[1]
           @inbounds for i=2:length(A)
               if A[i] < result
                   result = A[i]
               end
           end
           return result
       end
f_minimum (generic function with 1 method)

julia> x = rand(Int, 10_000);

julia> using BenchmarkTools

julia> @btime minimum($x)
  3.057 μs (0 allocations: 0 bytes)
-9222765095448017038

julia> @btime f_minimum($x)
  1.613 μs (0 allocations: 0 bytes)
-9222765095448017038

julia> @btime reduce((x,y)->min(x,y), $x)
  1.450 μs (0 allocations: 0 bytes)
-9222765095448017038
```
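The anonymous-function trick works because `reduce(min, x)` dispatched to the specialized method, while any other reducer hits the generic, SIMD-able `mapreduce` path. A branchless variant of the naive loop (a sketch, not code from this PR) should also vectorize:

```julia
# Sketch: using a select (`ifelse`) instead of a branch lets LLVM
# vectorize the running minimum; exact speedups vary by architecture.
function f_minimum_branchless(A::AbstractVector{<:Integer})
    result = A[1]
    @inbounds @simd for i in 2:length(A)
        result = ifelse(A[i] < result, A[i], result)
    end
    return result
end

x = rand(Int, 10_000)
f_minimum_branchless(x) == minimum(x)  # should agree
```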
This method is no longer an optimization over what Julia can do with the naive definition on most (if not all) architectures. Like #58267, I asked for a smattering of crowdsourced multi-architecture benchmarking of this simple example:

```julia
using BenchmarkTools
A = rand(10000);
b1 = @benchmark extrema($A)
b2 = @benchmark mapreduce(x->(x,x), ((min1, max1), (min2, max2))->(min(min1, min2), max(max1, max2)), $A)
println("$(Sys.CPU_NAME): $(round(median(b1).time/median(b2).time, digits=1))x faster")
```

With results:

```txt
cortex-a72: 13.2x faster
cortex-a76: 15.8x faster
neoverse-n1: 16.4x faster
neoverse-v2: 23.4x faster
a64fx: 46.5x faster
apple-m1: 54.9x faster
apple-m4*: 43.7x faster
znver2: 8.6x faster
znver4: 12.8x faster
znver5: 16.7x faster
haswell (32-bit): 3.5x faster
skylake-avx512: 7.4x faster
rocketlake: 7.8x faster
alderlake: 5.2x faster
cascadelake: 8.8x faster
cascadelake: 7.1x faster
```

The results are even more dramatic for Float32s, here on my M1:

```julia
julia> A = rand(Float32, 10000);

julia> @benchmark extrema($A)
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  49.083 μs … 151.750 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     49.375 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   49.731 μs ±   2.350 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted]
 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark mapreduce(x->(x,x), ((min1, max1), (min2, max2))->(min(min1, min2), max(max1, max2)), $A)
BenchmarkTools.Trial: 10000 samples with 191 evaluations per sample.
 Range (min … max):  524.435 ns …   1.104 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     525.089 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   529.323 ns ±  20.876 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted]
 Memory estimate: 0 bytes, allocs estimate: 0.
```

Closes #34790, closes #31442, closes #44606.

---------

Co-authored-by: Mosè Giordano <[email protected]>
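For reference, the `mapreduce` one-liner benchmarked above can be packaged as a small helper (`fast_extrema` is a hypothetical name, not an API from this PR):

```julia
# Hypothetical wrapper around the benchmarked reduction; returns the
# same (min, max) tuple as `extrema` for NaN-free input.
fast_extrema(A) = mapreduce(
    x -> (x, x),
    ((lo1, hi1), (lo2, hi2)) -> (min(lo1, lo2), max(hi1, hi2)),
    A)

A = rand(Float32, 10_000)
fast_extrema(A) == extrema(A)  # expected to hold for NaN-free input
```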
Similarly to #58280, I'm not familiar with the reduction stuff, but deleting code, getting correct results (with tests), and getting better performance looks like a no-brainer yes to me.
On 11th gen Intel, I see a much smaller, but still real, performance win. It's likely worth noting for posterity that on 1.11 and older, the generic method is ~5x slower. I believe #56371 is what makes this PR an improvement rather than a massive regression.
The special implementation of `minimum` and `maximum` was removed in JuliaLang/julia#58267, so we no longer need the workarounds on versions after that change.
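The downstream cleanup can be sketched as a `VERSION` gate (the version bound below is an assumption; check which release actually includes the change):

```julia
# Sketch: keep the workaround only on Julia versions that still ship
# the specialized minimum/maximum method; the bound here is assumed.
@static if VERSION < v"1.13.0-DEV"
    # ... existing workaround for the specialized method ...
else
    # the generic reduction is now both fast and correct for signed zeros
end
```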
Consistent with what @oscardssmith notes, this PR sees the big perf improvement on v1.12 but is relatively perf-neutral on v1.11 and v1.10 on my M1 arch. Marking for backport to 1.12 given that it also fixes some bugs.
(cherry picked from commit 99ba3c7)