
Perf improvement for dimension not of factor 4 and 16 #211


Merged
merged 2 commits into nmslib:develop on Apr 25, 2020

Conversation

@2ooom (Contributor) commented Apr 5, 2020

Currently SIMD (SSE or AVX) is used only when the dimension is a
multiple of 4 or 16; when the dimension is not a multiple of 4 or 16,
a slower non-vectorized method is used.

To improve performance for these cases, new methods are added:
`L2SqrSIMD(4|16)ExtResidual` - they rely on the existing `L2SqrSIMD(4|16)Ext`
to compute the largest prefix that is a multiple of 4 or 16 and finish
the residual computation with the non-vectorized method `L2Sqr`.
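The prefix/tail split described above can be sketched roughly as follows. This is a minimal illustration, not hnswlib's actual code: the real functions use the library's distance-function signature and AVX variants, and the names below are simplified stand-ins.

```cpp
#include <immintrin.h>  // SSE intrinsics (x86)
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Scalar L2 distance -- the slow path, used here only for leftover dimensions.
static float L2Sqr(const float *a, const float *b, size_t qty) {
    float res = 0;
    for (size_t i = 0; i < qty; i++) {
        float d = a[i] - b[i];
        res += d * d;
    }
    return res;
}

// Vectorized L2 distance; requires qty to be a multiple of 4.
static float L2SqrSIMD4Ext(const float *a, const float *b, size_t qty) {
    __m128 sum = _mm_setzero_ps();
    const float *end = a + qty;
    while (a < end) {
        __m128 diff = _mm_sub_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
        sum = _mm_add_ps(sum, _mm_mul_ps(diff, diff));
        a += 4;
        b += 4;
    }
    float tmp[4];
    _mm_storeu_ps(tmp, sum);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}

// Residual wrapper: vectorize the largest multiple-of-4 prefix and
// finish the tail (at most 3 elements) with the scalar routine.
static float L2SqrSIMD4ExtResidual(const float *a, const float *b, size_t qty) {
    size_t qty4 = qty >> 2 << 2;  // round qty down to a multiple of 4
    float res = L2SqrSIMD4Ext(a, b, qty4);
    return res + L2Sqr(a + qty4, b + qty4, qty - qty4);
}
```

With this split, a 65-dimensional vector, for example, is handled as 64 vectorized dimensions plus a 1-element scalar tail, instead of falling back to the scalar loop for all 65 dimensions.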

@2ooom (Contributor, Author) commented Apr 5, 2020

Hello @yurymalkov, thanks for the feedback and apologies for the late response. I've closed the previous PR and taken your comment about strict equality into account.
Indeed you were right: the residual method call adds some overhead. So I've added a wrapping method for this specific case and kept the existing ones; I hope that addresses your concerns.

@yurymalkov (Member)

@2ooom Thank you for the update! I'll check it within this week.

@yurymalkov (Member) left a comment


I've looked through the PR. Thanks!
Some comments:

  1. One file is missing, so it does not compile.
  2. I think it makes sense to move the benchmark into a separate folder, e.g. ./benchs, to keep the library isolated.
  3. Also, please open the PR against the develop branch.

@2ooom (Contributor, Author) commented Apr 13, 2020

Hi @yurymalkov. Apologies for the messy PR; I didn't expect my last 2 commits (bench + .gitignore) to end up here. I've excluded both of them (they're WIP and I might push them in a separate request to the dev branch).
Could you please have another look and let me know if there are any comments about the space_l2.h changes?

@yurymalkov (Member)

@2ooom Thank you! I'll take a look.

@2ooom 2ooom force-pushed the master branch 3 times, most recently from 0959e00 to 1daca5e Compare April 19, 2020 07:41
@2ooom 2ooom changed the title [L2Space] Perf improvement for dimension not of factor 4 and 16 Perf improvement for dimension not of factor 4 and 16 Apr 19, 2020
@2ooom (Contributor, Author) commented Apr 19, 2020

Hello @yurymalkov, I've updated the InnerProduct computation with the same approach.
New methods are added in space_ip: `InnerProductSIMD(4|16)ExtResidual` - they rely
on the existing `InnerProductSIMD(4|16)Ext` to compute the largest prefix that is
a multiple of 4 or 16 and finish the residual computation with the non-vectorized
method `InnerProduct`.
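The same split for the inner product might look roughly like this. Again, this is an illustrative sketch under assumed simplified signatures, not hnswlib's actual implementation (which uses the library's distance-function interface and AVX when available):

```cpp
#include <immintrin.h>  // SSE intrinsics (x86)
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Scalar dot product for the leftover dimensions.
static float InnerProduct(const float *a, const float *b, size_t qty) {
    float res = 0;
    for (size_t i = 0; i < qty; i++) res += a[i] * b[i];
    return res;
}

// Vectorized dot product; requires qty to be a multiple of 16
// (16 floats per outer iteration, processed as four 4-wide SSE ops).
static float InnerProductSIMD16Ext(const float *a, const float *b, size_t qty) {
    __m128 sum = _mm_setzero_ps();
    const float *end = a + qty;
    while (a < end) {
        for (int k = 0; k < 4; k++) {
            sum = _mm_add_ps(sum, _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
            a += 4;
            b += 4;
        }
    }
    float tmp[4];
    _mm_storeu_ps(tmp, sum);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}

// Residual wrapper: largest multiple-of-16 prefix vectorized,
// up to 15 trailing elements handled by the scalar routine.
static float InnerProductSIMD16ExtResidual(const float *a, const float *b,
                                           size_t qty) {
    size_t qty16 = qty >> 4 << 4;  // round qty down to a multiple of 16
    float res = InnerProductSIMD16Ext(a, b, qty16);
    return res + InnerProduct(a + qty16, b + qty16, qty - qty16);
}
```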

Performance improvement compared to the baseline is 3-4x depending on the
dimension. Benchmark results:

Run on (4 X 3300 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x2)
  L1 Instruction 32 KiB (x2)
  L2 Unified 256 KiB (x2)
  L3 Unified 4096 KiB (x1)
Load Average: 2.10, 2.25, 2.46

----------------------------------------------------------
Benchmark          Time             CPU        Iterations
----------------------------------------------------------
TstDim65        14.0 ns         14.0 ns     20 * 48676012
RefDim65        50.3 ns         50.2 ns     20 * 12907985
TstDim101       23.8 ns         23.8 ns     20 * 27976276
RefDim101       91.4 ns         91.3 ns     20 *  7364003
TstDim129       30.0 ns         30.0 ns     20 * 23413955
RefDim129        123 ns          123 ns     20 *  5656383
TstDim257       57.8 ns         57.7 ns     20 * 11263073
RefDim257        268 ns          267 ns     20 *  2617478

@2ooom 2ooom requested a review from yurymalkov April 19, 2020 07:49
2ooom added 2 commits April 19, 2020 09:50
Currently SIMD (SSE or AVX) is used only when the dimension is a
multiple of 4 or 16; when the dimension is not a multiple of 4 or 16,
a slower non-vectorized method is used.

To improve performance for these cases, new methods are added:
`L2SqrSIMD(4|16)ExtResidual` - they rely on the existing `L2SqrSIMD(4|16)Ext`
to compute the largest prefix that is a multiple of 4 or 16 and finish
the residual computation with the method `L2Sqr`.

Performance improvement compared to the baseline is 3-4x depending on the
dimension. Benchmark results:

Run on (4 X 3300 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x2)
  L1 Instruction 32 KiB (x2)
  L2 Unified 256 KiB (x2)
  L3 Unified 4096 KiB (x1)
Load Average: 2.18, 2.35, 3.88
-----------------------------------------------------------
Benchmark          Time             CPU        Iterations
-----------------------------------------------------------
TstDim65        14.7 ns         14.7 ns     20 * 47128209
RefDim65        50.2 ns         50.1 ns     20 * 10373751
TstDim101       24.7 ns         24.7 ns     20 * 28064436
RefDim101       90.4 ns         90.2 ns     20 *  7592191
TstDim129       31.4 ns         31.3 ns     20 * 22397921
RefDim129        125 ns          124 ns     20 *  5548862
TstDim257       59.3 ns         59.2 ns     20 * 10856753
RefDim257        266 ns          266 ns     20 *  2630926
…d 16

Currently SIMD (SSE or AVX) is used only when the dimension is a
multiple of 4 or 16; when the dimension is not a multiple of 4 or 16,
a slower non-vectorized method is used.

To improve performance for these cases, new methods are added:
`InnerProductSIMD(4|16)ExtResidual` - they rely on the existing
`InnerProductSIMD(4|16)Ext` to compute the largest prefix that is a multiple
of 4 or 16 and finish the residual computation with the non-vectorized
method `InnerProduct`.

Performance improvement compared to the baseline is 3-4x depending on the
dimension. Benchmark results:

Run on (4 X 3300 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x2)
  L1 Instruction 32 KiB (x2)
  L2 Unified 256 KiB (x2)
  L3 Unified 4096 KiB (x1)
Load Average: 2.10, 2.25, 2.46

----------------------------------------------------------
Benchmark          Time             CPU        Iterations
----------------------------------------------------------
TstDim65        14.0 ns         14.0 ns     20 * 48676012
RefDim65        50.3 ns         50.2 ns     20 * 12907985
TstDim101       23.8 ns         23.8 ns     20 * 27976276
RefDim101       91.4 ns         91.3 ns     20 *  7364003
TstDim129       30.0 ns         30.0 ns     20 * 23413955
RefDim129        123 ns          123 ns     20 *  5656383
TstDim257       57.8 ns         57.7 ns     20 * 11263073
RefDim257        268 ns          267 ns     20 *  2617478
@yurymalkov (Member)

@2ooom Impressive results! Thank you!
Can you please redo the PR to the develop branch?

@2ooom (Contributor, Author) commented Apr 20, 2020

Thanks @yurymalkov, I can change the destination branch, but there is no dev or develop branch in nmslib/hnswlib and I'm not sure I can create one. Could you please help?

@yurymalkov yurymalkov changed the base branch from master to develop April 25, 2020 22:38
@yurymalkov (Member)

@2ooom Sorry, I missed your response :(
I've created the branch, looked through the code, and tested the performance.

It also turns out I can switch the base myself :)

Thank you so much for the PR!

@yurymalkov yurymalkov merged commit a3ef160 into nmslib:develop Apr 25, 2020
@2ooom (Contributor, Author) commented Apr 26, 2020

Thank you @yurymalkov, happy to contribute.
I'm planning to submit a couple more PRs related to distance computation (further perf improvements and float16 support), so I was wondering if you could share your testing and benchmarking scenarios?

@yurymalkov (Member)

That would be nice! Looking forward to it.
Regarding the benchmarks, a simple benchmark is to run 1M SIFT.
For more reliable results, when I was developing HNSW I compared two search algorithms by implementing both and running them on the same index (checking the results and timing), usually on low-dimensional data (d=4) and high-dimensional data (d=1024), since reloading the index from disk leads to performance deviations.
Maybe it is worth abstracting the search method in the library for easy experimentation. I can do that within a week or two.

@2ooom (Contributor, Author) commented Apr 27, 2020

Thanks for sharing. I'll try SIFT. For now I've been doing micro-benchmarking on the distance functions (l2, inner product), which are already quite well isolated, but an end-to-end comparison is still needed.
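A distance-function micro-benchmark of the kind mentioned here can be sketched with a plain std::chrono timing loop. This is an assumed setup for illustration only (the Time/CPU/Iterations tables above look like Google Benchmark output, which is the more robust tool); `BenchNsPerCall` and the scalar `L2Sqr` kernel are hypothetical names:

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <random>
#include <vector>

// Scalar L2 used as the benchmarked kernel (stand-in for any distance function).
static float L2Sqr(const float *a, const float *b, size_t qty) {
    float res = 0;
    for (size_t i = 0; i < qty; i++) {
        float d = a[i] - b[i];
        res += d * d;
    }
    return res;
}

// Time the kernel on random vectors of the given dimension and
// return the average nanoseconds per call.
static double BenchNsPerCall(size_t dim, size_t iters) {
    std::mt19937 rng(42);
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    std::vector<float> a(dim), b(dim);
    for (size_t i = 0; i < dim; i++) { a[i] = dist(rng); b[i] = dist(rng); }

    volatile float sink = 0;  // keeps the compiler from removing the loop
    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < iters; i++)
        sink = sink + L2Sqr(a.data(), b.data(), dim);
    auto t1 = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
}
```

Usage: `printf("dim=65: %.1f ns/call\n", BenchNsPerCall(65, 100000));`. One caveat of such micro-benchmarks is that the two vectors stay resident in L1 cache, so the measured speedup can overstate the gain in an end-to-end search workload.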

@yurymalkov (Member)

@2ooom Got it. BTW, the SIFT test should already be in the repository.

sjwsl pushed a commit to sjwsl/hnswlib that referenced this pull request May 6, 2021
Perf improvement for dimension not of factor 4 and 16