reinterpret(reshape, T, A) is (still) slow on Windows #39382

Closed
kimikage opened this issue Jan 24, 2021 · 1 comment · Fixed by #39636
Labels
compiler:codegen Generation of LLVM IR and native code performance Must go faster system:windows Affects only Windows

Comments

@kimikage
Contributor

kimikage commented Jan 24, 2021

PR #37559 introduced reinterpret(reshape, T, A) (ReshapedReinterpretArray) to improve performance.
However, on Windows it does not always speed things up, and in some cases it is slower. Because the problem is related to vectorization, the generated code can vary greatly depending on how the iterator is used, so here is a very simple example.

using BenchmarkTools # provides `@btime`

function f!(B::AbstractArray{T}) where T
    @inbounds for I in eachindex(B)
        B[I] += T(0.5)
    end
end

sz = (1000, 1000);
A = zeros(ComplexF64, sz);
Br = reinterpret(Float64, zeros(UInt64, 2, sz...)); # `NonReshapedReinterpretArray`
Bsr = reshape(reinterpret(Float64, A), 2, sz...); # `ReshapedArray`
Brs = reinterpret(reshape, Float64, A); # `ReshapedReinterpretArray`
julia> @btime f!($Br);
  1.306 ms (0 allocations: 0 bytes) # v1.6.0-beta1 Debian on WSL2
  1.318 ms (0 allocations: 0 bytes) # v1.6.0-beta1 Windows
  1.323 ms (0 allocations: 0 bytes) # v1.7.0-DEV.361 Windows
  4.251 ms (0 allocations: 0 bytes) # v1.5.3 Windows

julia> @btime f!($Bsr);
  12.003 ms (0 allocations: 0 bytes) # v1.6.0-beta1 Debian on WSL2
  15.315 ms (0 allocations: 0 bytes) # v1.6.0-beta1 Windows
  15.277 ms (0 allocations: 0 bytes) # v1.7.0-DEV.361 Windows
  17.291 ms (0 allocations: 0 bytes) # v1.5.3 Windows

julia> @btime f!($Brs);
  1.048 ms (0 allocations: 0 bytes)  # v1.6.0-beta1 Debian on WSL2
  10.148 ms (0 allocations: 0 bytes) # v1.6.0-beta1 Windows
  10.143 ms (0 allocations: 0 bytes) # v1.7.0-DEV.361 Windows
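
Before comparing timings, it can help to confirm that the two reshaped views do the same work on the same memory. The following is a minimal sketch, assuming Julia 1.6 or later (where `reinterpret(reshape, T, A)` is available); the small size `(4, 3)` is chosen only to keep the check fast and is not part of the original benchmark.

```julia
# Same kernel as in the report above.
function f!(B::AbstractArray{T}) where T
    @inbounds for I in eachindex(B)
        B[I] += T(0.5)
    end
end

sz = (4, 3)                                       # deliberately tiny, for checking only
A = zeros(ComplexF64, sz)
Bsr = reshape(reinterpret(Float64, A), 2, sz...)  # ReshapedArray view of A
Brs = reinterpret(reshape, Float64, A)            # ReshapedReinterpretArray view of A

same_shape = size(Bsr) == size(Brs) == (2, sz...)

f!(Brs)  # adds 0.5 to every Float64 component stored in A

# Both wrappers alias A, so the update is visible through each of them.
parent_updated = all(==(0.5 + 0.5im), A)
views_agree = Bsr == Brs
```

If the three flags are `true`, any timing gap between `f!(Bsr)` and `f!(Brs)` comes from the generated code, not from a difference in the amount of data touched.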

The result of @code_typed is the same on Linux and Windows, but the result of @code_llvm is very different.
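
To reproduce this comparison on a given machine, the IR can be captured as a string and searched. This is only a sketch: it assumes Julia 1.6 or later, and the two substring checks are heuristics based on the dumps in this report ("vector.body" marks a vectorized loop, "inttoptr" marks the call through a raw function address), not a stable interface.

```julia
using InteractiveUtils  # provides code_llvm

function f!(B::AbstractArray{T}) where T
    @inbounds for I in eachindex(B)
        B[I] += T(0.5)
    end
end

A = zeros(ComplexF64, 4, 4)
Brs = reinterpret(reshape, Float64, A)

# Capture the optimized LLVM IR as a String instead of printing it.
ir = sprint(io -> code_llvm(io, f!, Tuple{typeof(Brs)}; debuginfo=:none))

# Heuristic checks; results depend on the platform and the Julia version.
has_vector_body = occursin("vector.body", ir)
has_raw_call = occursin("inttoptr", ir)

println("vectorized loop found: ", has_vector_body)
println("call through raw address found: ", has_raw_call)
```

No expected output is given here, since the point of the issue is that the result differs between platforms.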

LLVM IR on v1.6.0-beta1 Linux
;  @ REPL[1]:1 within `f!'
define void @"julia_f!_436"({ {}*, i8, i8 }* nocapture nonnull readonly align 8 dereferenceable(16) %0) {
top:
;  @ REPL[1]:2 within `f!'
; ┌ @ abstractarray.jl:301 within `eachindex' @ reinterpretarray.jl:208
; │┌ @ reinterpretarray.jl:277 within `parent'
; ││┌ @ Base.jl:33 within `getproperty'
     %1 = bitcast { {}*, i8, i8 }* %0 to { i8*, i64, i16, i16, i32 }**
     %2 = load atomic { i8*, i64, i16, i16, i32 }*, { i8*, i64, i16, i16, i32 }** %1 unordered, align 8
; │└└
; │ @ abstractarray.jl:301 within `eachindex' @ reinterpretarray.jl:208 @ abstractarray.jl:311
; │┌ @ array.jl:197 within `length'
    %3 = getelementptr inbounds { i8*, i64, i16, i16, i32 }, { i8*, i64, i16, i16, i32 }* %2, i64 0, i32 1
    %4 = load i64, i64* %3, align 8
; └└
; ┌ @ reinterpretarray.jl:227 within `iterate' @ range.jl:670
; │┌ @ range.jl:519 within `isempty'
; ││┌ @ operators.jl:305 within `>'
; │││┌ @ int.jl:83 within `<'
      %.not.not.not = icmp eq i64 %4, 0
; └└└└
  br i1 %.not.not.not, label %L128, label %L26.preheader

L26.preheader:                                    ; preds = %top
  %5 = bitcast { i8*, i64, i16, i16, i32 }* %2 to [2 x double]**
;  @ REPL[1]:3 within `f!'
; ┌ @ reinterpretarray.jl:334 within `getindex' @ array.jl:0
   %6 = load [2 x double]*, [2 x double]** %5, align 8
; └
; ┌ @ reinterpretarray.jl:234 within `iterate'
   %min.iters.check = icmp ult i64 %4, 4
   br i1 %min.iters.check, label %scalar.ph, label %vector.ph

vector.ph:                                        ; preds = %L26.preheader
   %n.vec = and i64 %4, 9223372036854775804
   %ind.end = or i64 %n.vec, 1
   br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
   %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
   %7 = getelementptr inbounds [2 x double], [2 x double]* %6, i64 %index, i64 0
   %8 = bitcast double* %7 to <8 x double>*
   %wide.vec = load <8 x double>, <8 x double>* %8, align 8
   %strided.vec = shufflevector <8 x double> %wide.vec, <8 x double> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
   %strided.vec98 = shufflevector <8 x double> %wide.vec, <8 x double> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
; └
; ┌ @ float.jl:326 within `+'
   %9 = fadd <4 x double> %strided.vec, <double 5.000000e-01, double 5.000000e-01, double 5.000000e-01, double 5.000000e-01>
   %10 = fadd <4 x double> %strided.vec98, <double 5.000000e-01, double 5.000000e-01, double 5.000000e-01, double 5.000000e-01>
; └
; ┌ @ reinterpretarray.jl within `setindex!'
   %interleaved.vec = shufflevector <4 x double> %9, <4 x double> %10, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
   store <8 x double> %interleaved.vec, <8 x double>* %8, align 8
   %index.next = add i64 %index, 4
   %11 = icmp eq i64 %index.next, %n.vec
   br i1 %11, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body
; └
  %cmp.n = icmp eq i64 %4, %n.vec
  br i1 %cmp.n, label %L128, label %scalar.ph

scalar.ph:                                        ; preds = %middle.block, %L26.preheader
  %bc.resume.val = phi i64 [ %ind.end, %middle.block ], [ 1, %L26.preheader ]
; ┌ @ reinterpretarray.jl:234 within `iterate'
   br label %L26.outer

L26.outer:                                        ; preds = %L26.outer, %scalar.ph
   %value_phi19.ph = phi i64 [ %bc.resume.val, %scalar.ph ], [ %17, %L26.outer ]
; └
; ┌ @ reinterpretarray.jl:334 within `getindex' @ array.jl:0
   %12 = add nsw i64 %value_phi19.ph, -1
   %13 = getelementptr inbounds [2 x double], [2 x double]* %6, i64 %12, i64 0
   %14 = bitcast double* %13 to <2 x double>*
   %15 = load <2 x double>, <2 x double>* %14, align 8
; └
; ┌ @ float.jl:326 within `+'
   %16 = fadd <2 x double> %15, <double 5.000000e-01, double 5.000000e-01>
; └
; ┌ @ reinterpretarray.jl within `setindex!'
   store <2 x double> %16, <2 x double>* %14, align 8
; └
; ┌ @ reinterpretarray.jl:238 within `iterate' @ range.jl:674
; │┌ @ promotion.jl:410 within `=='
    %.not = icmp eq i64 %value_phi19.ph, %4
; │└
   %17 = add nuw nsw i64 %value_phi19.ph, 1
; └
  br i1 %.not, label %L128, label %L26.outer

L128:                                             ; preds = %L26.outer, %middle.block, %top
  ret void
}
LLVM IR on v1.6.0-beta1 Windows
;  @ REPL[1]:1 within `f!'
; Function Attrs: uwtable
define void @"julia_f!_389"({ {}*, i8, i8 }* nocapture nonnull readonly align 8 dereferenceable(16) %0) #0 {
top:
  %1 = alloca i128, align 16
  %2 = bitcast i128* %1 to i8*
  %3 = alloca <2 x i64>, align 16
  %4 = bitcast <2 x i64>* %3 to i8*
  %5 = alloca i64, align 16
  %6 = bitcast i64* %5 to i8*
  %7 = alloca <2 x i64>, align 16
  %8 = bitcast <2 x i64>* %7 to i8*
;  @ REPL[1]:2 within `f!'
; ┌ @ abstractarray.jl:301 within `eachindex' @ reinterpretarray.jl:208
; │┌ @ reinterpretarray.jl:277 within `parent'
; ││┌ @ Base.jl:33 within `getproperty'
     %9 = bitcast { {}*, i8, i8 }* %0 to { i8*, i64, i16, i16, i32 }**
     %10 = load atomic { i8*, i64, i16, i16, i32 }*, { i8*, i64, i16, i16, i32 }** %9 unordered, align 8
; │└└
; │ @ abstractarray.jl:301 within `eachindex' @ reinterpretarray.jl:208 @ abstractarray.jl:311
; │┌ @ array.jl:197 within `length'
    %11 = getelementptr inbounds { i8*, i64, i16, i16, i32 }, { i8*, i64, i16, i16, i32 }* %10, i64 0, i32 1
    %12 = load i64, i64* %11, align 8
; └└
; ┌ @ reinterpretarray.jl:227 within `iterate' @ range.jl:670
; │┌ @ range.jl:519 within `isempty'
; ││┌ @ operators.jl:305 within `>'
; │││┌ @ int.jl:83 within `<'
      %.not.not.not = icmp eq i64 %12, 0
; └└└└
  br i1 %.not.not.not, label %L128, label %L26.preheader

L26.preheader:                                    ; preds = %top
  %13 = bitcast { i8*, i64, i16, i16, i32 }* %10 to [2 x double]**
;  @ REPL[1]:3 within `f!'
; ┌ @ reinterpretarray.jl:334 within `getindex' @ array.jl:0
   %14 = load [2 x double]*, [2 x double]** %13, align 8
; │ @ reinterpretarray.jl:334 within `getindex'
; │┌ @ refpointer.jl:136 within `Ref'
; ││┌ @ refvalue.jl within `RefValue'
     %15 = bitcast <2 x i64>* %3 to [2 x double]*
     %16 = getelementptr inbounds <2 x i64>, <2 x i64>* %3, i64 0, i64 0
     %.repack60 = getelementptr inbounds [2 x double], [2 x double]* %15, i64 0, i64 1
     %17 = bitcast double* %.repack60 to i64*
; └└└
; ┌ @ reinterpretarray.jl:336 within `getindex'
; │┌ @ refpointer.jl:172 within `unsafe_convert' @ refvalue.jl:40
; ││┌ @ pointer.jl within `pointer_from_objref'
     %18 = ptrtoint i128* %1 to i64
; └└└
; ┌ @ reinterpretarray.jl:337 within `getindex'
; │┌ @ refpointer.jl:101 within `unsafe_convert' @ refvalue.jl:40
; ││┌ @ pointer.jl within `pointer_from_objref'
     %19 = ptrtoint <2 x i64>* %3 to i64
; └└└
; ┌ @ reinterpretarray.jl:340 within `getindex' @ refvalue.jl:56
; │┌ @ Base.jl within `getproperty'
    %20 = bitcast i128* %1 to [2 x double]*
    %.elt64 = getelementptr inbounds [2 x double], [2 x double]* %20, i64 0, i64 1
; └└
; ┌ @ reinterpretarray.jl:457 within `setindex!'
; │┌ @ refpointer.jl:136 within `Ref'
; ││┌ @ refvalue.jl within `RefValue'
     %21 = bitcast i64* %5 to double*
; └└└
; ┌ @ reinterpretarray.jl:458 within `setindex!'
; │┌ @ refpointer.jl:136 within `Ref'
; ││┌ @ refvalue.jl within `RefValue'
     %22 = bitcast <2 x i64>* %7 to [2 x double]*
     %23 = getelementptr inbounds <2 x i64>, <2 x i64>* %7, i64 0, i64 0
     %.repack70 = getelementptr inbounds [2 x double], [2 x double]* %22, i64 0, i64 1
     %24 = bitcast double* %.repack70 to i64*
; └└└
; ┌ @ reinterpretarray.jl:460 within `setindex!'
; │┌ @ refpointer.jl:101 within `unsafe_convert' @ refvalue.jl:40
; ││┌ @ pointer.jl within `pointer_from_objref'
     %25 = ptrtoint i64* %5 to i64
; └└└
; ┌ @ reinterpretarray.jl:234 within `iterate'
   br label %L26.outer

L26.outer:                                        ; preds = %L26.outer, %L26.preheader
   %value_phi19.ph = phi i64 [ 1, %L26.preheader ], [ %41, %L26.outer ]
; └
; ┌ @ reinterpretarray.jl:334 within `getindex' @ array.jl:0
   %26 = add nsw i64 %value_phi19.ph, -1
   %27 = getelementptr inbounds [2 x double], [2 x double]* %14, i64 %26
   %28 = bitcast [2 x double]* %27 to i64*
   %.elt58 = getelementptr inbounds [2 x double], [2 x double]* %14, i64 %26, i64 1
   %29 = bitcast double* %.elt58 to i64*
   call void @llvm.lifetime.start.p0i8(i64 16, i8* nonnull %2)
; │ @ reinterpretarray.jl:334 within `getindex' @ array.jl:801
   %30 = bitcast [2 x double]* %27 to <2 x i64>*
   %31 = load <2 x i64>, <2 x i64>* %30, align 8
   call void @llvm.lifetime.start.p0i8(i64 16, i8* nonnull %4)
; │ @ reinterpretarray.jl:334 within `getindex'
; │┌ @ refpointer.jl:136 within `Ref'
; ││┌ @ refvalue.jl:8 within `RefValue'
     store <2 x i64> %31, <2 x i64>* %3, align 16
; └└└
; ┌ @ reinterpretarray.jl:338 within `getindex'
; │┌ @ reinterpretarray.jl:343 within `_memcpy!'
    call void inttoptr (i64 140721606831488 to void (i64, i64, i64)*)(i64 %18, i64 %19, i64 16)
; └└
; ┌ @ reinterpretarray.jl:340 within `getindex' @ refvalue.jl:56
; │┌ @ Base.jl:33 within `getproperty'
    %32 = bitcast i128* %1 to double*
    %.unpack6680 = load double, double* %32, align 16
; └└
; ┌ @ float.jl:326 within `+'
   %33 = fadd double %.unpack6680, 5.000000e-01
   call void @llvm.lifetime.end.p0i8(i64 16, i8* nonnull %2)
   call void @llvm.lifetime.end.p0i8(i64 16, i8* nonnull %4)
   call void @llvm.lifetime.start.p0i8(i64 8, i8* nonnull %6)
; └
; ┌ @ reinterpretarray.jl:457 within `setindex!'
; │┌ @ refpointer.jl:136 within `Ref'
; ││┌ @ refvalue.jl:8 within `RefValue'
     store double %33, double* %21, align 16
; └└└
; ┌ @ reinterpretarray.jl:458 within `setindex!'
; │┌ @ array.jl:801 within `getindex'
    %34 = load <2 x i64>, <2 x i64>* %30, align 8
    call void @llvm.lifetime.start.p0i8(i64 16, i8* nonnull %8)
; │└
; │┌ @ refpointer.jl:136 within `Ref'
; ││┌ @ refvalue.jl:8 within `RefValue'
     store <2 x i64> %34, <2 x i64>* %7, align 16
; └└└
; ┌ @ reinterpretarray.jl:462 within `setindex!'
; │┌ @ pointer.jl:159 within `+'
    %35 = ptrtoint <2 x i64>* %7 to i64
; │└
; │┌ @ reinterpretarray.jl:343 within `_memcpy!'
    call void inttoptr (i64 140721606831488 to void (i64, i64, i64)*)(i64 %35, i64 %25, i64 8)
; └└
; ┌ @ reinterpretarray.jl:464 within `setindex!'
; │┌ @ refvalue.jl:56 within `getindex'
; ││┌ @ Base.jl:33 within `getproperty'
     %.unpack78 = load i64, i64* %23, align 16
     %.unpack7579 = load i64, i64* %24, align 8
; │└└
; │ @ reinterpretarray.jl:464 within `setindex!' @ array.jl:839
   store i64 %.unpack78, i64* %28, align 8
   store i64 %.unpack7579, i64* %29, align 8
   call void @llvm.lifetime.start.p0i8(i64 16, i8* nonnull %2)
   call void @llvm.lifetime.start.p0i8(i64 16, i8* nonnull %4)
; └
; ┌ @ reinterpretarray.jl:334 within `getindex'
; │┌ @ refpointer.jl:136 within `Ref'
; ││┌ @ refvalue.jl:8 within `RefValue'
     store i64 %.unpack78, i64* %16, align 16
     store i64 %.unpack7579, i64* %17, align 8
; └└└
; ┌ @ reinterpretarray.jl:338 within `getindex'
; │┌ @ reinterpretarray.jl:343 within `_memcpy!'
    call void inttoptr (i64 140721606831488 to void (i64, i64, i64)*)(i64 %18, i64 %19, i64 16)
; └└
; ┌ @ reinterpretarray.jl:340 within `getindex' @ refvalue.jl:56
; │┌ @ Base.jl:33 within `getproperty'
    %.unpack6567.181 = load double, double* %.elt64, align 8
; └└
; ┌ @ float.jl:326 within `+'
   %36 = fadd double %.unpack6567.181, 5.000000e-01
   call void @llvm.lifetime.end.p0i8(i64 16, i8* nonnull %2)
   call void @llvm.lifetime.end.p0i8(i64 16, i8* nonnull %4)
   call void @llvm.lifetime.start.p0i8(i64 8, i8* nonnull %6)
; └
; ┌ @ reinterpretarray.jl:457 within `setindex!'
; │┌ @ refpointer.jl:136 within `Ref'
; ││┌ @ refvalue.jl:8 within `RefValue'
     store double %36, double* %21, align 16
; └└└
; ┌ @ reinterpretarray.jl:458 within `setindex!'
; │┌ @ array.jl:801 within `getindex'
    %37 = load <2 x i64>, <2 x i64>* %30, align 8
    call void @llvm.lifetime.start.p0i8(i64 16, i8* nonnull %8)
; │└
; │┌ @ refpointer.jl:136 within `Ref'
; ││┌ @ refvalue.jl:8 within `RefValue'
     store <2 x i64> %37, <2 x i64>* %7, align 16
; └└└
; ┌ @ reinterpretarray.jl:462 within `setindex!'
; │┌ @ pointer.jl:159 within `+'
    %38 = getelementptr inbounds i8, i8* %8, i64 8
    %39 = ptrtoint i8* %38 to i64
; │└
; │┌ @ reinterpretarray.jl:343 within `_memcpy!'
    call void inttoptr (i64 140721606831488 to void (i64, i64, i64)*)(i64 %39, i64 %25, i64 8)
; └└
; ┌ @ reinterpretarray.jl:464 within `setindex!'
; │┌ @ refvalue.jl:56 within `getindex'
; ││┌ @ Base.jl:33 within `getproperty'
     %40 = load <2 x i64>, <2 x i64>* %7, align 16
; │└└
; │ @ reinterpretarray.jl:464 within `setindex!' @ array.jl:839
   store <2 x i64> %40, <2 x i64>* %30, align 8
; └
; ┌ @ reinterpretarray.jl:238 within `iterate' @ range.jl:674
; │┌ @ promotion.jl:410 within `=='
    %.not = icmp eq i64 %value_phi19.ph, %12
; │└
   %41 = add nuw nsw i64 %value_phi19.ph, 1
; └
  br i1 %.not, label %L128, label %L26.outer

L128:                                             ; preds = %L26.outer, %top
  call void @llvm.lifetime.end.p0i8(i64 8, i8* nonnull %6)
  call void @llvm.lifetime.end.p0i8(i64 16, i8* nonnull %8)
  ret void
}

I don't understand why the two OSs differ, but in this case the memcpy call is not eliminated on Windows, which causes a stall (see #38751).

@inline _memcpy!(dst, src, n) = ccall(:memcpy, Cvoid, (Ptr{UInt8}, Ptr{UInt8}, Csize_t), dst, src, n)

The direct cause of the stall is the memcpy, but I don't know if it is the root cause.
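
For readers unfamiliar with the pattern visible in the Windows IR (two stack slots, a memcpy between them, then a load), here is a simplified sketch of a memcpy-based reinterpretation of a single value. The helper name `reinterpret_via_memcpy` is hypothetical and this is not the code in reinterpretarray.jl; it only illustrates the Ref-plus-memcpy round trip that the optimizer is expected to remove.

```julia
@inline _memcpy!(dst, src, n) =
    ccall(:memcpy, Cvoid, (Ptr{UInt8}, Ptr{UInt8}, Csize_t), dst, src, n)

# Hypothetical helper: copy the bytes of `x` into a value of type `T`.
function reinterpret_via_memcpy(::Type{T}, x::S) where {T,S}
    (isbitstype(T) && isbitstype(S)) || throw(ArgumentError("both types must be isbits"))
    sizeof(T) == sizeof(S) || throw(ArgumentError("sizes must match"))
    src = Ref{S}(x)   # source slot
    dst = Ref{T}()    # uninitialized destination slot
    GC.@preserve src dst begin
        dptr = Ptr{UInt8}(Base.unsafe_convert(Ptr{T}, dst))
        sptr = Ptr{UInt8}(Base.unsafe_convert(Ptr{S}, src))
        _memcpy!(dptr, sptr, sizeof(T))
    end
    return dst[]
end

parts = reinterpret_via_memcpy(NTuple{2,Float64}, 1.0 + 2.0im)
bits = reinterpret_via_memcpy(UInt64, 1.0)
```

When the memcpy is folded away, the whole function reduces to a register move or a single load; when it stays as an opaque call, as in the Windows IR above, both slots must live in memory and the loop cannot be vectorized.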

This issue comes from the discussion in JuliaGraphics/ColorTypes.jl#220.

cc: @timholy

@KristofferC
Member

> The direct cause of the stall is the memcpy, but I don't know if it is the root cause.

Ref #38751

@JeffBezanson JeffBezanson added compiler:codegen Generation of LLVM IR and native code performance Must go faster system:windows Affects only Windows labels Jan 25, 2021