[InnerProductSpace] Perf improvement for dimensions that are not a multiple of 4 or 16
Currently SIMD (SSE or AVX) is used only when the dimension is a
multiple of 4 or 16. When the dimension is not an exact multiple of
4 or 16, a slower non-vectorized method is used.
To improve performance in these cases, new methods are added:
`InnerProductSIMD(4|16)ExtResidual` relies on the existing
`InnerProductSIMD(4|16)Ext` to process the largest prefix whose length
is a multiple of 4 or 16, then finishes the residual elements with the
non-vectorized method `InnerProduct`.
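
The split described above can be sketched roughly as follows. This is an
illustrative sketch, not the code from this commit: the function names
`inner_product_scalar`, `inner_product_block4` and
`inner_product_block4_residual` are hypothetical, the sketch returns a raw
dot product, and it assumes `float` vectors with unaligned loads. Only the
4-wide SSE case is shown; a 16-wide variant would follow the same pattern.

```cpp
#include <cstddef>

#if defined(__SSE__)
#include <xmmintrin.h>
#endif

// Hypothetical scalar kernel: plain inner product, valid for any length.
static float inner_product_scalar(const float *a, const float *b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

// Hypothetical vectorized kernel: n4 must be a multiple of 4.
// Falls back to the scalar loop when SSE is not available at compile time.
static float inner_product_block4(const float *a, const float *b, std::size_t n4) {
#if defined(__SSE__)
    __m128 acc = _mm_setzero_ps();
    for (std::size_t i = 0; i < n4; i += 4) {
        // Unaligned loads, so the inputs need no special alignment.
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
#else
    return inner_product_scalar(a, b, n4);
#endif
}

// Residual variant: vectorized over the largest multiple-of-4 prefix,
// scalar over the remaining 0 to 3 elements.
static float inner_product_block4_residual(const float *a, const float *b, std::size_t n) {
    const std::size_t n4 = n & ~static_cast<std::size_t>(3);  // round down to a multiple of 4
    const float head = inner_product_block4(a, b, n4);
    const float tail = inner_product_scalar(a + n4, b + n4, n - n4);
    return head + tail;
}
```

With dimension 65, for example, this sketch would process 64 elements in
the vectorized loop and a single element in the scalar tail.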
Performance improvement compared to baseline is roughly 3.6x to 4.6x,
increasing with dimension. Benchmark results:
Run on (4 X 3300 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x2)
L1 Instruction 32 KiB (x2)
L2 Unified 256 KiB (x2)
L3 Unified 4096 KiB (x1)
Load Average: 2.10, 2.25, 2.46
----------------------------------------------------------
Benchmark           Time           CPU          Iterations
----------------------------------------------------------
TstDim65         14.0 ns       14.0 ns       20 * 48676012
RefDim65         50.3 ns       50.2 ns       20 * 12907985
TstDim101        23.8 ns       23.8 ns       20 * 27976276
RefDim101        91.4 ns       91.3 ns        20 * 7364003
TstDim129        30.0 ns       30.0 ns       20 * 23413955
RefDim129         123 ns        123 ns        20 * 5656383
TstDim257        57.8 ns       57.7 ns       20 * 11263073
RefDim257         268 ns        267 ns        20 * 2617478