[L2Space] Perf improvement for dimensions that are not multiples of 4 or 16
Currently SIMD (SSE or AVX) is used only when the dimension is a
multiple of 4 or 16; for any other dimension a slower, non-vectorized
method is used.
To improve performance for these cases, new methods are added:
`L2SqrSIMD(4|16)ExtResidual` - it relies on the existing `L2SqrSIMD(4|16)Ext`
to compute the distance over the largest multiple of 4 or 16 dimensions,
and finishes the computation over the remaining (residual) dimensions
with `L2Sqr`.
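A minimal C++ sketch of the residual split, assuming float vectors and plain pointer-plus-length signatures. The names `l2sqr_scalar`, `l2sqr_blocks16`, `l2sqr_blocks16_residual` and `pick_distance` are hypothetical stand-ins for `L2Sqr`, `L2SqrSIMD16Ext` and `L2SqrSIMD16ExtResidual` (the real functions may use different signatures). The SSE kernel is illustrative only and falls back to a scalar loop when SSE is unavailable; the dispatch in `pick_distance` is also an assumption and omits the 4-wide variant, which would follow the same pattern with `n / 4 * 4`.

```cpp
// Hypothetical sketch of the "vectorized head + scalar residual" split.
// Names are illustrative stand-ins, not the library's actual functions.
#include <cassert>
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define SKETCH_HAVE_SSE 1
#endif

// Scalar squared L2 distance over n floats (stand-in for L2Sqr).
static float l2sqr_scalar(const float* a, const float* b, std::size_t n) {
    float res = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        float d = a[i] - b[i];
        res += d * d;
    }
    return res;
}

// Squared L2 distance over n floats, n a multiple of 16 (stand-in for
// L2SqrSIMD16Ext). Uses SSE, 4 floats per register and 16 per outer
// iteration, when available; otherwise falls back to the scalar loop.
static float l2sqr_blocks16(const float* a, const float* b, std::size_t n) {
    assert(n % 16 == 0);
#ifdef SKETCH_HAVE_SSE
    __m128 sum = _mm_setzero_ps();
    for (std::size_t i = 0; i < n; i += 16) {
        for (std::size_t j = 0; j < 16; j += 4) {
            __m128 d = _mm_sub_ps(_mm_loadu_ps(a + i + j), _mm_loadu_ps(b + i + j));
            sum = _mm_add_ps(sum, _mm_mul_ps(d, d));
        }
    }
    alignas(16) float lanes[4];
    _mm_store_ps(lanes, sum);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
#else
    return l2sqr_scalar(a, b, n);
#endif
}

// Residual variant (stand-in for L2SqrSIMD16ExtResidual): vectorized pass
// over the largest multiple of 16, scalar pass over the n % 16 leftovers.
static float l2sqr_blocks16_residual(const float* a, const float* b, std::size_t n) {
    std::size_t n16 = n / 16 * 16;  // e.g. 65 -> 64
    float head = l2sqr_blocks16(a, b, n16);
    float tail = l2sqr_scalar(a + n16, b + n16, n - n16);  // e.g. 1 element
    return head + tail;
}

using DistFn = float (*)(const float*, const float*, std::size_t);

// Hypothetical dispatch: exact multiples of 16 use the pure block kernel,
// larger non-multiples use the residual variant, small ones stay scalar.
static DistFn pick_distance(std::size_t dim) {
    if (dim % 16 == 0) return l2sqr_blocks16;
    if (dim > 16) return l2sqr_blocks16_residual;
    return l2sqr_scalar;
}
```

For a 65-dimensional vector the residual variant runs the vectorized kernel over 64 elements and the scalar loop over the last one, so dimensions such as 65, 101, 129 and 257 no longer fall back entirely to the scalar path.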
Performance improvement compared to the baseline is roughly 3.4x to 4.5x,
depending on the dimension. Benchmark results:
Run on (4 X 3300 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x2)
L1 Instruction 32 KiB (x2)
L2 Unified 256 KiB (x2)
L3 Unified 4096 KiB (x1)
Load Average: 2.18, 2.35, 3.88
-----------------------------------------------------------
Benchmark       Time        CPU   Iterations
-----------------------------------------------------------
TstDim65     14.7 ns    14.7 ns   20 * 47128209
RefDim65     50.2 ns    50.1 ns   20 * 10373751
TstDim101    24.7 ns    24.7 ns   20 * 28064436
RefDim101    90.4 ns    90.2 ns   20 * 7592191
TstDim129    31.4 ns    31.3 ns   20 * 22397921
RefDim129     125 ns     124 ns   20 * 5548862
TstDim257    59.3 ns    59.2 ns   20 * 10856753
RefDim257     266 ns     266 ns   20 * 2630926
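The speedup per dimension can be checked directly from the Time column of the benchmark table. A small C++ sketch that divides each reference time by the corresponding test time; the `Row` struct and the `speedup` and `speedup_range` helpers are hypothetical, and the numbers are copied from the table.

```cpp
// Speedup = reference (RefDimN) time / test (TstDimN) time, Time column.
#include <cassert>

struct Row {
    int dim;
    double tst_ns;  // TstDimN wall time in ns
    double ref_ns;  // RefDimN wall time in ns
};

static const Row kRows[] = {
    {65, 14.7, 50.2}, {101, 24.7, 90.4}, {129, 31.4, 125.0}, {257, 59.3, 266.0},
};

static double speedup(const Row& r) {
    assert(r.tst_ns > 0.0);  // guard against division by zero
    return r.ref_ns / r.tst_ns;
}

// Smallest and largest speedup across all benchmarked dimensions.
static void speedup_range(double* lo, double* hi) {
    *lo = *hi = speedup(kRows[0]);
    for (const Row& r : kRows) {
        double s = speedup(r);
        if (s < *lo) *lo = s;
        if (s > *hi) *hi = s;
    }
}
```

The ratios grow with dimension: about 3.4x at 65, 3.7x at 101, 4.0x at 129 and 4.5x at 257.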