`reinterpret(reshape, T, A)` is (still) slow on Windows #39382

kimikage · 2021-01-24T16:06:43Z

PR #37559 introduced `reinterpret(reshape, T, A)` (`ReshapedReinterpretArray`) to improve performance.
However, it does not always improve speed on Windows, and in some cases it is slower. Because this issue is related to vectorization, the generated code can vary greatly depending on how the iterator is used, so here is a very simple example.

using BenchmarkTools  # provides `@btime`, used below

function f!(B::AbstractArray{T}) where T
    @inbounds for I in eachindex(B)
        B[I] += T(0.5)
    end
end

sz = (1000, 1000);
A = zeros(ComplexF64, sz);
Br = reinterpret(Float64, zeros(UInt64, 2, sz...)); # `NonReshapedReinterpretArray`
Bsr = reshape(reinterpret(Float64, A), 2, sz...); # `ReshapedArray`
Brs = reinterpret(reshape, Float64, A); # `ReshapedReinterpretArray`

julia> @btime f!($Br);
  1.306 ms (0 allocations: 0 bytes) # v1.6.0-beta1 Debian on WSL2
  1.318 ms (0 allocations: 0 bytes) # v1.6.0-beta1 Windows
  1.323 ms (0 allocations: 0 bytes) # v1.7.0-DEV.361 Windows
  4.251 ms (0 allocations: 0 bytes) # v1.5.3 Windows

julia> @btime f!($Bsr);
  12.003 ms (0 allocations: 0 bytes) # v1.6.0-beta1 Debian on WSL2
  15.315 ms (0 allocations: 0 bytes) # v1.6.0-beta1 Windows
  15.277 ms (0 allocations: 0 bytes) # v1.7.0-DEV.361 Windows
  17.291 ms (0 allocations: 0 bytes) # v1.5.3 Windows

julia> @btime f!($Brs);
  1.048 ms (0 allocations: 0 bytes)  # v1.6.0-beta1 Debian on WSL2
  10.148 ms (0 allocations: 0 bytes) # v1.6.0-beta1 Windows
  10.143 ms (0 allocations: 0 bytes) # v1.7.0-DEV.361 Windows

The output of `@code_typed` is the same on Linux and Windows, but the output of `@code_llvm` is very different.
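For anyone who wants to compare platforms without reading the full dumps, the IR can be captured as a string and searched. The following is only a rough sketch: the two pattern checks are heuristics chosen for this illustration (a vector `fadd` suggests the loop body operates on packed doubles, and `inttoptr` suggests a call through a raw address such as an unelided `memcpy`), and the exact IR text depends on the Julia and LLVM versions.

```julia
using InteractiveUtils  # stdlib; provides code_llvm (loaded automatically in the REPL)

function f!(B::AbstractArray{T}) where T
    @inbounds for I in eachindex(B)
        B[I] += T(0.5)
    end
end

A = zeros(ComplexF64, 8, 8)
Brs = reinterpret(reshape, Float64, A)  # requires Julia 1.6 or later

# Capture the optimized LLVM IR as a string so it can be searched or diffed.
ir = sprint(io -> code_llvm(io, f!, (typeof(Brs),); debuginfo=:none))

# Heuristic checks (assumptions of this sketch, not an official diagnostic).
has_vector_fadd = occursin(r"fadd <\d+ x double>", ir)
has_opaque_call = occursin("inttoptr", ir)
println("vector fadd: ", has_vector_fadd, ", call via raw address: ", has_opaque_call)
```

Saving `ir` to a file on each machine and diffing the two files is another way to spot where the generated code starts to diverge.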

LLVM IR on v1.6.0-beta1 Linux

;  @ REPL[1]:1 within `f!'
define void @"julia_f!_436"({ {}*, i8, i8 }* nocapture nonnull readonly align 8 dereferenceable(16) %0) {
top:
;  @ REPL[1]:2 within `f!'
; ┌ @ abstractarray.jl:301 within `eachindex' @ reinterpretarray.jl:208
; │┌ @ reinterpretarray.jl:277 within `parent'
; ││┌ @ Base.jl:33 within `getproperty'
     %1 = bitcast { {}*, i8, i8 }* %0 to { i8*, i64, i16, i16, i32 }**
     %2 = load atomic { i8*, i64, i16, i16, i32 }*, { i8*, i64, i16, i16, i32 }** %1 unordered, align 8
; │└└
; │ @ abstractarray.jl:301 within `eachindex' @ reinterpretarray.jl:208 @ abstractarray.jl:311
; │┌ @ array.jl:197 within `length'
    %3 = getelementptr inbounds { i8*, i64, i16, i16, i32 }, { i8*, i64, i16, i16, i32 }* %2, i64 0, i32 1
    %4 = load i64, i64* %3, align 8
; └└
; ┌ @ reinterpretarray.jl:227 within `iterate' @ range.jl:670
; │┌ @ range.jl:519 within `isempty'
; ││┌ @ operators.jl:305 within `>'
; │││┌ @ int.jl:83 within `<'
      %.not.not.not = icmp eq i64 %4, 0
; └└└└
  br i1 %.not.not.not, label %L128, label %L26.preheader

L26.preheader:                                    ; preds = %top
  %5 = bitcast { i8*, i64, i16, i16, i32 }* %2 to [2 x double]**
;  @ REPL[1]:3 within `f!'
; ┌ @ reinterpretarray.jl:334 within `getindex' @ array.jl:0
   %6 = load [2 x double]*, [2 x double]** %5, align 8
; └
; ┌ @ reinterpretarray.jl:234 within `iterate'
   %min.iters.check = icmp ult i64 %4, 4
   br i1 %min.iters.check, label %scalar.ph, label %vector.ph

vector.ph:                                        ; preds = %L26.preheader
   %n.vec = and i64 %4, 9223372036854775804
   %ind.end = or i64 %n.vec, 1
   br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
   %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
   %7 = getelementptr inbounds [2 x double], [2 x double]* %6, i64 %index, i64 0
   %8 = bitcast double* %7 to <8 x double>*
   %wide.vec = load <8 x double>, <8 x double>* %8, align 8
   %strided.vec = shufflevector <8 x double> %wide.vec, <8 x double> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
   %strided.vec98 = shufflevector <8 x double> %wide.vec, <8 x double> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
; └
; ┌ @ float.jl:326 within `+'
   %9 = fadd <4 x double> %strided.vec, <double 5.000000e-01, double 5.000000e-01, double 5.000000e-01, double 5.000000e-01>
   %10 = fadd <4 x double> %strided.vec98, <double 5.000000e-01, double 5.000000e-01, double 5.000000e-01, double 5.000000e-01>
; └
; ┌ @ reinterpretarray.jl within `setindex!'
   %interleaved.vec = shufflevector <4 x double> %9, <4 x double> %10, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
   store <8 x double> %interleaved.vec, <8 x double>* %8, align 8
   %index.next = add i64 %index, 4
   %11 = icmp eq i64 %index.next, %n.vec
   br i1 %11, label %middle.block, label %vector.body

middle.block:                                     ; preds = %vector.body
; └
  %cmp.n = icmp eq i64 %4, %n.vec
  br i1 %cmp.n, label %L128, label %scalar.ph

scalar.ph:                                        ; preds = %middle.block, %L26.preheader
  %bc.resume.val = phi i64 [ %ind.end, %middle.block ], [ 1, %L26.preheader ]
; ┌ @ reinterpretarray.jl:234 within `iterate'
   br label %L26.outer

L26.outer:                                        ; preds = %L26.outer, %scalar.ph
   %value_phi19.ph = phi i64 [ %bc.resume.val, %scalar.ph ], [ %17, %L26.outer ]
; └
; ┌ @ reinterpretarray.jl:334 within `getindex' @ array.jl:0
   %12 = add nsw i64 %value_phi19.ph, -1
   %13 = getelementptr inbounds [2 x double], [2 x double]* %6, i64 %12, i64 0
   %14 = bitcast double* %13 to <2 x double>*
   %15 = load <2 x double>, <2 x double>* %14, align 8
; └
; ┌ @ float.jl:326 within `+'
   %16 = fadd <2 x double> %15, <double 5.000000e-01, double 5.000000e-01>
; └
; ┌ @ reinterpretarray.jl within `setindex!'
   store <2 x double> %16, <2 x double>* %14, align 8
; └
; ┌ @ reinterpretarray.jl:238 within `iterate' @ range.jl:674
; │┌ @ promotion.jl:410 within `=='
    %.not = icmp eq i64 %value_phi19.ph, %4
; │└
   %17 = add nuw nsw i64 %value_phi19.ph, 1
; └
  br i1 %.not, label %L128, label %L26.outer

L128:                                             ; preds = %L26.outer, %middle.block, %top
  ret void
}

LLVM IR on v1.6.0-beta1 Windows

;  @ REPL[1]:1 within `f!'
; Function Attrs: uwtable
define void @"julia_f!_389"({ {}*, i8, i8 }* nocapture nonnull readonly align 8 dereferenceable(16) %0) #0 {
top:
  %1 = alloca i128, align 16
  %2 = bitcast i128* %1 to i8*
  %3 = alloca <2 x i64>, align 16
  %4 = bitcast <2 x i64>* %3 to i8*
  %5 = alloca i64, align 16
  %6 = bitcast i64* %5 to i8*
  %7 = alloca <2 x i64>, align 16
  %8 = bitcast <2 x i64>* %7 to i8*
;  @ REPL[1]:2 within `f!'
; ┌ @ abstractarray.jl:301 within `eachindex' @ reinterpretarray.jl:208
; │┌ @ reinterpretarray.jl:277 within `parent'
; ││┌ @ Base.jl:33 within `getproperty'
     %9 = bitcast { {}*, i8, i8 }* %0 to { i8*, i64, i16, i16, i32 }**
     %10 = load atomic { i8*, i64, i16, i16, i32 }*, { i8*, i64, i16, i16, i32 }** %9 unordered, align 8
; │└└
; │ @ abstractarray.jl:301 within `eachindex' @ reinterpretarray.jl:208 @ abstractarray.jl:311
; │┌ @ array.jl:197 within `length'
    %11 = getelementptr inbounds { i8*, i64, i16, i16, i32 }, { i8*, i64, i16, i16, i32 }* %10, i64 0, i32 1
    %12 = load i64, i64* %11, align 8
; └└
; ┌ @ reinterpretarray.jl:227 within `iterate' @ range.jl:670
; │┌ @ range.jl:519 within `isempty'
; ││┌ @ operators.jl:305 within `>'
; │││┌ @ int.jl:83 within `<'
      %.not.not.not = icmp eq i64 %12, 0
; └└└└
  br i1 %.not.not.not, label %L128, label %L26.preheader

L26.preheader:                                    ; preds = %top
  %13 = bitcast { i8*, i64, i16, i16, i32 }* %10 to [2 x double]**
;  @ REPL[1]:3 within `f!'
; ┌ @ reinterpretarray.jl:334 within `getindex' @ array.jl:0
   %14 = load [2 x double]*, [2 x double]** %13, align 8
; │ @ reinterpretarray.jl:334 within `getindex'
; │┌ @ refpointer.jl:136 within `Ref'
; ││┌ @ refvalue.jl within `RefValue'
     %15 = bitcast <2 x i64>* %3 to [2 x double]*
     %16 = getelementptr inbounds <2 x i64>, <2 x i64>* %3, i64 0, i64 0
     %.repack60 = getelementptr inbounds [2 x double], [2 x double]* %15, i64 0, i64 1
     %17 = bitcast double* %.repack60 to i64*
; └└└
; ┌ @ reinterpretarray.jl:336 within `getindex'
; │┌ @ refpointer.jl:172 within `unsafe_convert' @ refvalue.jl:40
; ││┌ @ pointer.jl within `pointer_from_objref'
     %18 = ptrtoint i128* %1 to i64
; └└└
; ┌ @ reinterpretarray.jl:337 within `getindex'
; │┌ @ refpointer.jl:101 within `unsafe_convert' @ refvalue.jl:40
; ││┌ @ pointer.jl within `pointer_from_objref'
     %19 = ptrtoint <2 x i64>* %3 to i64
; └└└
; ┌ @ reinterpretarray.jl:340 within `getindex' @ refvalue.jl:56
; │┌ @ Base.jl within `getproperty'
    %20 = bitcast i128* %1 to [2 x double]*
    %.elt64 = getelementptr inbounds [2 x double], [2 x double]* %20, i64 0, i64 1
; └└
; ┌ @ reinterpretarray.jl:457 within `setindex!'
; │┌ @ refpointer.jl:136 within `Ref'
; ││┌ @ refvalue.jl within `RefValue'
     %21 = bitcast i64* %5 to double*
; └└└
; ┌ @ reinterpretarray.jl:458 within `setindex!'
; │┌ @ refpointer.jl:136 within `Ref'
; ││┌ @ refvalue.jl within `RefValue'
     %22 = bitcast <2 x i64>* %7 to [2 x double]*
     %23 = getelementptr inbounds <2 x i64>, <2 x i64>* %7, i64 0, i64 0
     %.repack70 = getelementptr inbounds [2 x double], [2 x double]* %22, i64 0, i64 1
     %24 = bitcast double* %.repack70 to i64*
; └└└
; ┌ @ reinterpretarray.jl:460 within `setindex!'
; │┌ @ refpointer.jl:101 within `unsafe_convert' @ refvalue.jl:40
; ││┌ @ pointer.jl within `pointer_from_objref'
     %25 = ptrtoint i64* %5 to i64
; └└└
; ┌ @ reinterpretarray.jl:234 within `iterate'
   br label %L26.outer

L26.outer:                                        ; preds = %L26.outer, %L26.preheader
   %value_phi19.ph = phi i64 [ 1, %L26.preheader ], [ %41, %L26.outer ]
; └
; ┌ @ reinterpretarray.jl:334 within `getindex' @ array.jl:0
   %26 = add nsw i64 %value_phi19.ph, -1
   %27 = getelementptr inbounds [2 x double], [2 x double]* %14, i64 %26
   %28 = bitcast [2 x double]* %27 to i64*
   %.elt58 = getelementptr inbounds [2 x double], [2 x double]* %14, i64 %26, i64 1
   %29 = bitcast double* %.elt58 to i64*
   call void @llvm.lifetime.start.p0i8(i64 16, i8* nonnull %2)
; │ @ reinterpretarray.jl:334 within `getindex' @ array.jl:801
   %30 = bitcast [2 x double]* %27 to <2 x i64>*
   %31 = load <2 x i64>, <2 x i64>* %30, align 8
   call void @llvm.lifetime.start.p0i8(i64 16, i8* nonnull %4)
; │ @ reinterpretarray.jl:334 within `getindex'
; │┌ @ refpointer.jl:136 within `Ref'
; ││┌ @ refvalue.jl:8 within `RefValue'
     store <2 x i64> %31, <2 x i64>* %3, align 16
; └└└
; ┌ @ reinterpretarray.jl:338 within `getindex'
; │┌ @ reinterpretarray.jl:343 within `_memcpy!'
    call void inttoptr (i64 140721606831488 to void (i64, i64, i64)*)(i64 %18, i64 %19, i64 16)
; └└
; ┌ @ reinterpretarray.jl:340 within `getindex' @ refvalue.jl:56
; │┌ @ Base.jl:33 within `getproperty'
    %32 = bitcast i128* %1 to double*
    %.unpack6680 = load double, double* %32, align 16
; └└
; ┌ @ float.jl:326 within `+'
   %33 = fadd double %.unpack6680, 5.000000e-01
   call void @llvm.lifetime.end.p0i8(i64 16, i8* nonnull %2)
   call void @llvm.lifetime.end.p0i8(i64 16, i8* nonnull %4)
   call void @llvm.lifetime.start.p0i8(i64 8, i8* nonnull %6)
; └
; ┌ @ reinterpretarray.jl:457 within `setindex!'
; │┌ @ refpointer.jl:136 within `Ref'
; ││┌ @ refvalue.jl:8 within `RefValue'
     store double %33, double* %21, align 16
; └└└
; ┌ @ reinterpretarray.jl:458 within `setindex!'
; │┌ @ array.jl:801 within `getindex'
    %34 = load <2 x i64>, <2 x i64>* %30, align 8
    call void @llvm.lifetime.start.p0i8(i64 16, i8* nonnull %8)
; │└
; │┌ @ refpointer.jl:136 within `Ref'
; ││┌ @ refvalue.jl:8 within `RefValue'
     store <2 x i64> %34, <2 x i64>* %7, align 16
; └└└
; ┌ @ reinterpretarray.jl:462 within `setindex!'
; │┌ @ pointer.jl:159 within `+'
    %35 = ptrtoint <2 x i64>* %7 to i64
; │└
; │┌ @ reinterpretarray.jl:343 within `_memcpy!'
    call void inttoptr (i64 140721606831488 to void (i64, i64, i64)*)(i64 %35, i64 %25, i64 8)
; └└
; ┌ @ reinterpretarray.jl:464 within `setindex!'
; │┌ @ refvalue.jl:56 within `getindex'
; ││┌ @ Base.jl:33 within `getproperty'
     %.unpack78 = load i64, i64* %23, align 16
     %.unpack7579 = load i64, i64* %24, align 8
; │└└
; │ @ reinterpretarray.jl:464 within `setindex!' @ array.jl:839
   store i64 %.unpack78, i64* %28, align 8
   store i64 %.unpack7579, i64* %29, align 8
   call void @llvm.lifetime.start.p0i8(i64 16, i8* nonnull %2)
   call void @llvm.lifetime.start.p0i8(i64 16, i8* nonnull %4)
; └
; ┌ @ reinterpretarray.jl:334 within `getindex'
; │┌ @ refpointer.jl:136 within `Ref'
; ││┌ @ refvalue.jl:8 within `RefValue'
     store i64 %.unpack78, i64* %16, align 16
     store i64 %.unpack7579, i64* %17, align 8
; └└└
; ┌ @ reinterpretarray.jl:338 within `getindex'
; │┌ @ reinterpretarray.jl:343 within `_memcpy!'
    call void inttoptr (i64 140721606831488 to void (i64, i64, i64)*)(i64 %18, i64 %19, i64 16)
; └└
; ┌ @ reinterpretarray.jl:340 within `getindex' @ refvalue.jl:56
; │┌ @ Base.jl:33 within `getproperty'
    %.unpack6567.181 = load double, double* %.elt64, align 8
; └└
; ┌ @ float.jl:326 within `+'
   %36 = fadd double %.unpack6567.181, 5.000000e-01
   call void @llvm.lifetime.end.p0i8(i64 16, i8* nonnull %2)
   call void @llvm.lifetime.end.p0i8(i64 16, i8* nonnull %4)
   call void @llvm.lifetime.start.p0i8(i64 8, i8* nonnull %6)
; └
; ┌ @ reinterpretarray.jl:457 within `setindex!'
; │┌ @ refpointer.jl:136 within `Ref'
; ││┌ @ refvalue.jl:8 within `RefValue'
     store double %36, double* %21, align 16
; └└└
; ┌ @ reinterpretarray.jl:458 within `setindex!'
; │┌ @ array.jl:801 within `getindex'
    %37 = load <2 x i64>, <2 x i64>* %30, align 8
    call void @llvm.lifetime.start.p0i8(i64 16, i8* nonnull %8)
; │└
; │┌ @ refpointer.jl:136 within `Ref'
; ││┌ @ refvalue.jl:8 within `RefValue'
     store <2 x i64> %37, <2 x i64>* %7, align 16
; └└└
; ┌ @ reinterpretarray.jl:462 within `setindex!'
; │┌ @ pointer.jl:159 within `+'
    %38 = getelementptr inbounds i8, i8* %8, i64 8
    %39 = ptrtoint i8* %38 to i64
; │└
; │┌ @ reinterpretarray.jl:343 within `_memcpy!'
    call void inttoptr (i64 140721606831488 to void (i64, i64, i64)*)(i64 %39, i64 %25, i64 8)
; └└
; ┌ @ reinterpretarray.jl:464 within `setindex!'
; │┌ @ refvalue.jl:56 within `getindex'
; ││┌ @ Base.jl:33 within `getproperty'
     %40 = load <2 x i64>, <2 x i64>* %7, align 16
; │└└
; │ @ reinterpretarray.jl:464 within `setindex!' @ array.jl:839
   store <2 x i64> %40, <2 x i64>* %30, align 8
; └
; ┌ @ reinterpretarray.jl:238 within `iterate' @ range.jl:674
; │┌ @ promotion.jl:410 within `=='
    %.not = icmp eq i64 %value_phi19.ph, %12
; │└
   %41 = add nuw nsw i64 %value_phi19.ph, 1
; └
  br i1 %.not, label %L128, label %L26.outer

L128:                                             ; preds = %L26.outer, %top
  call void @llvm.lifetime.end.p0i8(i64 8, i8* nonnull %6)
  call void @llvm.lifetime.end.p0i8(i64 16, i8* nonnull %8)
  ret void
}

I don't understand why the two OSs differ, but in this case the `memcpy` call is not optimized away on Windows, which causes the stall (see #38751).

julia/base/reinterpretarray.jl, line 346 in 69d2453:

    @inline _memcpy!(dst, src, n) = ccall(:memcpy, Cvoid, (Ptr{UInt8}, Ptr{UInt8}, Csize_t), dst, src, n)

The direct cause of the stall is the memcpy, but I don't know if it is the root cause.
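To make the pattern visible in the trace easier to follow (`Ref`, `unsafe_convert`, `_memcpy!`, then reading the `Ref`), here is a simplified sketch of the same idea. `pick_component` is a hypothetical helper written for this illustration; it is not the Base implementation and only mimics the copy-through-temporaries approach for a 2-tuple of `Float64`.

```julia
# Hypothetical helper, not the Base implementation: read the `i`-th Float64
# out of a 2-tuple by copying bytes between two `Ref`s with memcpy.
function pick_component(x::NTuple{2,Float64}, i::Int)
    1 <= i <= 2 || throw(BoundsError(x, i))
    src = Ref(x)            # temporary holding the parent element
    dst = Ref{Float64}()    # temporary receiving the reinterpreted value
    GC.@preserve src dst begin
        psrc = convert(Ptr{UInt8}, Base.unsafe_convert(Ptr{NTuple{2,Float64}}, src))
        pdst = convert(Ptr{UInt8}, Base.unsafe_convert(Ptr{Float64}, dst))
        offset = (i - 1) * sizeof(Float64)  # byte offset of the requested component
        ccall(:memcpy, Cvoid, (Ptr{UInt8}, Ptr{UInt8}, Csize_t),
              pdst, psrc + offset, sizeof(Float64))
    end
    return dst[]
end

pick_component((1.0, 2.0), 2)
```

If the compiler can see through the `memcpy`, the temporaries can in principle be folded away into a plain load. If the call stays opaque, as in the Windows IR above, both temporaries have to be written to and read back from memory on every access.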

This issue comes from the discussion in JuliaGraphics/ColorTypes.jl#220.

cc: @timholy


KristofferC · 2021-01-24T16:21:57Z

> The direct cause of the stall is the memcpy, but I don't know if it is the root cause.

Ref #38751

JeffBezanson added the compiler:codegen (Generation of LLVM IR and native code), performance (Must go faster), and system:windows (Affects only Windows) labels Jan 25, 2021

kimikage mentioned this issue Feb 13, 2021

Change Windows CRT func to be considered as libjulia func #39636

Merged

vtjnash closed this as completed in #39636 Feb 22, 2021
