Perf improvement for dimension not of factor 4 and 16 #211

Conversation
Hello @yurymalkov, thanks for the feedback and apologies for the late response. I've closed the previous PR and taken into account your comment about strict equality.

@2ooom Thank you for the update! I'll check it within this week.
I've looked through the PR. Thanks!
Some comments:
- One file is missing, so it does not compile.
- I think it makes sense to separate the benchmark into a folder, e.g. `./benchs`, to keep the library isolated.
- Also please do the PR against the develop branch.
Hi @yurymalkov. Apologies for the messy PR, I didn't expect my last 2 commits (bench + .gitignore) would end up here. I've excluded both of them (it's WIP and I might push them in a separate request to the dev branch).

@2ooom Thank you! I'll take a look.
Force-pushed from 0959e00 to 1daca5e
Hello @yurymalkov, I've updated the InnerProduct computation with the same approach. Performance improvement compared to baseline is 3-4x depending on dimension.
Currently SIMD (SSE or AVX) is used for the cases when the dimension is a multiple of 4 or 16; when the dimension is not a multiple of 4 or 16, a slower non-vectorized method is used. To improve performance for these cases new methods are added: `L2SqrSIMD(4|16)ExtResidual` - they rely on the existing `L2SqrSIMD(4|16)Ext` to compute the dimensions up to the nearest lower multiple of 4 or 16 and finish the residual computation with the non-vectorized method `L2Sqr`. Performance improvement compared to baseline is 3-4x depending on dimension.

Benchmark results (Run on 4 X 3300 MHz CPUs; CPU Caches: L1 Data 32 KiB (x2), L1 Instruction 32 KiB (x2), L2 Unified 256 KiB (x2), L3 Unified 4096 KiB (x1); Load Average: 2.18, 2.35, 3.88):

| Benchmark | Time    | CPU     | Iterations    |
|-----------|---------|---------|---------------|
| TstDim65  | 14.7 ns | 14.7 ns | 20 * 47128209 |
| RefDim65  | 50.2 ns | 50.1 ns | 20 * 10373751 |
| TstDim101 | 24.7 ns | 24.7 ns | 20 * 28064436 |
| RefDim101 | 90.4 ns | 90.2 ns | 20 * 7592191  |
| TstDim129 | 31.4 ns | 31.3 ns | 20 * 22397921 |
| RefDim129 | 125 ns  | 124 ns  | 20 * 5548862  |
| TstDim257 | 59.3 ns | 59.2 ns | 20 * 10856753 |
| RefDim257 | 266 ns  | 266 ns  | 20 * 2630926  |
Perf improvement for dimension not of factor 4 and 16

Currently SIMD (SSE or AVX) is used for the cases when the dimension is a multiple of 4 or 16; when the dimension is not a multiple of 4 or 16, a slower non-vectorized method is used. To improve performance for these cases new methods are added: `InnerProductSIMD(4|16)ExtResidual` - they rely on the existing `InnerProductSIMD(4|16)Ext` to compute the dimensions up to the nearest lower multiple of 4 or 16 and finish the residual computation with the non-vectorized method `InnerProduct`. Performance improvement compared to baseline is 3-4x depending on dimension.

Benchmark results (Run on 4 X 3300 MHz CPUs; CPU Caches: L1 Data 32 KiB (x2), L1 Instruction 32 KiB (x2), L2 Unified 256 KiB (x2), L3 Unified 4096 KiB (x1); Load Average: 2.10, 2.25, 2.46):

| Benchmark | Time    | CPU     | Iterations    |
|-----------|---------|---------|---------------|
| TstDim65  | 14.0 ns | 14.0 ns | 20 * 48676012 |
| RefDim65  | 50.3 ns | 50.2 ns | 20 * 12907985 |
| TstDim101 | 23.8 ns | 23.8 ns | 20 * 27976276 |
| RefDim101 | 91.4 ns | 91.3 ns | 20 * 7364003  |
| TstDim129 | 30.0 ns | 30.0 ns | 20 * 23413955 |
| RefDim129 | 123 ns  | 123 ns  | 20 * 5656383  |
| TstDim257 | 57.8 ns | 57.7 ns | 20 * 11263073 |
| RefDim257 | 268 ns  | 267 ns  | 20 * 2617478  |
@2ooom Impressive results! Thank you!
Thanks @yurymalkov, I can change the destination branch, but there is not
@2ooom Sorry, I've missed your response :( Also, it turns out I can switch the base myself :) Thank you so much for the PR!
Thank you @yurymalkov, happy to contribute. |
That would be nice! Looking forward to it.
Thanks for sharing. I'll try SIFT. For now I've been doing micro-benchmarking on the distance functions (l2, inner product), which are quite well isolated already, but it ought to have an end-to-end comparison too.
@2ooom Got it. BTW, the sift test should be already in the repository. |
Perf improvement for dimension not of factor 4 and 16

Currently SIMD (SSE or AVX) is used for the cases when the dimension is
a multiple of 4 or 16, while when the dimension is not a multiple of 4
or 16 a slower non-vectorized method is used.
To improve performance for these cases new methods are added:
`L2SqrSIMD(4|16)ExtResidual` - they rely on the existing
`L2SqrSIMD(4|16)Ext` to compute the dimensions up to the nearest lower
multiple of 4 or 16 and finish the residual computation with the
non-vectorized method `L2Sqr`.