
Perf improvement for dimension not of factor 4 and 16 #211


Merged
merged 2 commits into nmslib:develop on Apr 25, 2020

Conversation

@2ooom (Contributor) commented Apr 5, 2020

Currently SIMD (SSE or AVX) is used only when the dimension is a
multiple of 4 or 16; when the dimension is not a multiple of 4 or 16,
a slower non-vectorized method is used.

To improve performance for these cases, new methods are added:
`L2SqrSIMD(4|16)ExtResidual` - they rely on the existing `L2SqrSIMD(4|16)Ext`
to compute the largest prefix that is a multiple of 4 or 16 and finish
the residual computation with the non-vectorized method `L2Sqr`.
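The prefix/tail split described above can be sketched roughly as follows. This is a minimal illustration, not hnswlib's actual code: the real functions use the library's distance-function signature and AVX variants, and the names below are simplified stand-ins.

```cpp
#include <immintrin.h>  // SSE intrinsics (x86)
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Scalar L2 distance -- the slow path, used here only for leftover dimensions.
static float L2Sqr(const float *a, const float *b, size_t qty) {
    float res = 0;
    for (size_t i = 0; i < qty; i++) {
        float d = a[i] - b[i];
        res += d * d;
    }
    return res;
}

// Vectorized L2 distance; requires qty to be a multiple of 4.
static float L2SqrSIMD4Ext(const float *a, const float *b, size_t qty) {
    __m128 sum = _mm_setzero_ps();
    const float *end = a + qty;
    while (a < end) {
        __m128 diff = _mm_sub_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
        sum = _mm_add_ps(sum, _mm_mul_ps(diff, diff));
        a += 4;
        b += 4;
    }
    float tmp[4];
    _mm_storeu_ps(tmp, sum);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}

// Residual wrapper: vectorize the largest multiple-of-4 prefix and
// finish the tail (at most 3 elements) with the scalar routine.
static float L2SqrSIMD4ExtResidual(const float *a, const float *b, size_t qty) {
    size_t qty4 = qty >> 2 << 2;  // round qty down to a multiple of 4
    float res = L2SqrSIMD4Ext(a, b, qty4);
    return res + L2Sqr(a + qty4, b + qty4, qty - qty4);
}
```

With this split, a 65-dimensional vector, for example, is handled as 64 vectorized dimensions plus a 1-element scalar tail, instead of falling back to the scalar loop for all 65 dimensions.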

@2ooom (Contributor, Author) commented Apr 5, 2020

Hello @yurymalkov, thanks for the feedback and apologies for the late response. I've closed the previous PR and taken your comment about strict equality into account.
Indeed you were right: the residual method call adds some overhead. So I've added a wrapping method for this specific case and kept the existing ones; I hope that addresses your concerns.

@yurymalkov (Member)

@2ooom Thank you for the update! I'll check it within this week.

@yurymalkov (Member) left a comment


I've looked through the PR. Thanks!
Some comments:

  1. One file is missing, so it does not compile.
  2. I think it makes sense to move the benchmark into a separate folder, e.g. ./benchs, to keep the library isolated.
  3. Also, please open the PR against the develop branch.

@2ooom (Contributor, Author) commented Apr 13, 2020

Hi @yurymalkov. Apologies for the messy PR; I didn't expect my last 2 commits (bench + .gitignore) to end up here. I've excluded both of them (they're WIP and I might push them in a separate request to the dev branch).
Could you please have another look and let me know if there are any comments about the space_l2.h changes?

@yurymalkov (Member)

@2ooom Thank you! I'll take a look.

@2ooom 2ooom force-pushed the master branch 3 times, most recently from 0959e00 to 1daca5e Compare April 19, 2020 07:41
@2ooom 2ooom changed the title [L2Space] Perf improvement for dimension not of factor 4 and 16 Perf improvement for dimension not of factor 4 and 16 Apr 19, 2020
@2ooom (Contributor, Author) commented Apr 19, 2020

Hello @yurymalkov, I've updated the InnerProduct computation with the same approach.
New methods are added in space_ip: `InnerProductSIMD(4|16)ExtResidual` - they rely
on the existing `InnerProductSIMD(4|16)Ext` to compute the largest prefix that is
a multiple of 4 or 16 and finish the residual computation with the non-vectorized
method `InnerProduct`.
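The same split for the inner product might look roughly like this. Again, this is an illustrative sketch under assumed simplified signatures, not hnswlib's actual implementation (which uses the library's distance-function interface and AVX when available):

```cpp
#include <immintrin.h>  // SSE intrinsics (x86)
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Scalar dot product for the leftover dimensions.
static float InnerProduct(const float *a, const float *b, size_t qty) {
    float res = 0;
    for (size_t i = 0; i < qty; i++) res += a[i] * b[i];
    return res;
}

// Vectorized dot product; requires qty to be a multiple of 16
// (16 floats per outer iteration, processed as four 4-wide SSE ops).
static float InnerProductSIMD16Ext(const float *a, const float *b, size_t qty) {
    __m128 sum = _mm_setzero_ps();
    const float *end = a + qty;
    while (a < end) {
        for (int k = 0; k < 4; k++) {
            sum = _mm_add_ps(sum, _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
            a += 4;
            b += 4;
        }
    }
    float tmp[4];
    _mm_storeu_ps(tmp, sum);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}

// Residual wrapper: largest multiple-of-16 prefix vectorized,
// up to 15 trailing elements handled by the scalar routine.
static float InnerProductSIMD16ExtResidual(const float *a, const float *b,
                                           size_t qty) {
    size_t qty16 = qty >> 4 << 4;  // round qty down to a multiple of 16
    float res = InnerProductSIMD16Ext(a, b, qty16);
    return res + InnerProduct(a + qty16, b + qty16, qty - qty16);
}
```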

Performance improvement compared to the baseline is 3-4x depending on the
dimension. Benchmark results:

Run on (4 X 3300 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x2)
  L1 Instruction 32 KiB (x2)
  L2 Unified 256 KiB (x2)
  L3 Unified 4096 KiB (x1)
Load Average: 2.10, 2.25, 2.46

----------------------------------------------------------
Benchmark          Time             CPU        Iterations
----------------------------------------------------------
TstDim65        14.0 ns         14.0 ns     20 * 48676012
RefDim65        50.3 ns         50.2 ns     20 * 12907985
TstDim101       23.8 ns         23.8 ns     20 * 27976276
RefDim101       91.4 ns         91.3 ns     20 *  7364003
TstDim129       30.0 ns         30.0 ns     20 * 23413955
RefDim129        123 ns          123 ns     20 *  5656383
TstDim257       57.8 ns         57.7 ns     20 * 11263073
RefDim257        268 ns          267 ns     20 *  2617478

@2ooom 2ooom requested a review from yurymalkov April 19, 2020 07:49
2ooom added 2 commits April 19, 2020 09:50
Currently SIMD (SSE or AVX) is used only when the dimension is a
multiple of 4 or 16; when the dimension is not a multiple of 4 or 16,
a slower non-vectorized method is used.

To improve performance for these cases, new methods are added:
`L2SqrSIMD(4|16)ExtResidual` - they rely on the existing `L2SqrSIMD(4|16)Ext`
to compute the largest prefix that is a multiple of 4 or 16 and finish
the residual computation with the method `L2Sqr`.

Performance improvement compared to the baseline is 3-4x depending on the
dimension. Benchmark results:

Run on (4 X 3300 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x2)
  L1 Instruction 32 KiB (x2)
  L2 Unified 256 KiB (x2)
  L3 Unified 4096 KiB (x1)
Load Average: 2.18, 2.35, 3.88
-----------------------------------------------------------
Benchmark          Time             CPU        Iterations
-----------------------------------------------------------
TstDim65        14.7 ns         14.7 ns     20 * 47128209
RefDim65        50.2 ns         50.1 ns     20 * 10373751
TstDim101       24.7 ns         24.7 ns     20 * 28064436
RefDim101       90.4 ns         90.2 ns     20 *  7592191
TstDim129       31.4 ns         31.3 ns     20 * 22397921
RefDim129        125 ns          124 ns     20 *  5548862
TstDim257       59.3 ns         59.2 ns     20 * 10856753
RefDim257        266 ns          266 ns     20 *  2630926
…d 16

Currently SIMD (SSE or AVX) is used only when the dimension is a
multiple of 4 or 16; when the dimension is not a multiple of 4 or 16,
a slower non-vectorized method is used.

To improve performance for these cases, new methods are added:
`InnerProductSIMD(4|16)ExtResidual` - they rely on the existing
`InnerProductSIMD(4|16)Ext` to compute the largest prefix that is a multiple
of 4 or 16 and finish the residual computation with the non-vectorized
method `InnerProduct`.

Performance improvement compared to the baseline is 3-4x depending on the
dimension. Benchmark results:

Run on (4 X 3300 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x2)
  L1 Instruction 32 KiB (x2)
  L2 Unified 256 KiB (x2)
  L3 Unified 4096 KiB (x1)
Load Average: 2.10, 2.25, 2.46

----------------------------------------------------------
Benchmark          Time             CPU        Iterations
----------------------------------------------------------
TstDim65        14.0 ns         14.0 ns     20 * 48676012
RefDim65        50.3 ns         50.2 ns     20 * 12907985
TstDim101       23.8 ns         23.8 ns     20 * 27976276
RefDim101       91.4 ns         91.3 ns     20 *  7364003
TstDim129       30.0 ns         30.0 ns     20 * 23413955
RefDim129        123 ns          123 ns     20 *  5656383
TstDim257       57.8 ns         57.7 ns     20 * 11263073
RefDim257        268 ns          267 ns     20 *  2617478
@yurymalkov (Member)

@2ooom Impressive results! Thank you!
Can you please redo the PR to the develop branch?

@2ooom (Contributor, Author) commented Apr 20, 2020

Thanks @yurymalkov, I can change the destination branch, but there is no dev or develop branch in nmslib/hnswlib and I'm not sure I can create one. Could you please help?

@yurymalkov yurymalkov changed the base branch from master to develop April 25, 2020 22:38
@yurymalkov (Member)

@2ooom Sorry, I missed your response :(
I've created the branch, looked through the code, and tested the performance.

It also turns out I can switch the base myself :)

Thank you so much for the PR!

@yurymalkov yurymalkov merged commit a3ef160 into nmslib:develop Apr 25, 2020
@2ooom (Contributor, Author) commented Apr 26, 2020

Thank you @yurymalkov, happy to contribute.
I'm planning to submit a couple more PRs related to distance computation (further perf improvements and float16 support), so I was wondering if you could share your testing and benchmarking scenarios?

@yurymalkov (Member)

That would be nice! Looking forward to it.
Regarding the benchmarks, a simple benchmark is to run 1M SIFT.
For more reliable results, when I was developing HNSW I compared two search algorithms by implementing both and running them on the same index (checking the results and timing), usually on low-dimensional data (d=4) and high-dimensional data (d=1024), since reloading the index from disk leads to performance deviations.
Maybe it is worth abstracting the search method in the library for easy experimentation. I can do that within a week or two.

@2ooom (Contributor, Author) commented Apr 27, 2020

Thanks for sharing. I'll try SIFT. For now I've been doing micro-benchmarking on the distance functions (l2, inner product), which are already quite well isolated, but an end-to-end comparison is still needed.
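A distance-function micro-benchmark of the kind mentioned here can be sketched with a plain std::chrono timing loop. This is an assumed setup for illustration only (the Time/CPU/Iterations tables above look like Google Benchmark output, which is the more robust tool); `BenchNsPerCall` and the scalar `L2Sqr` kernel are hypothetical names:

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <random>
#include <vector>

// Scalar L2 used as the benchmarked kernel (stand-in for any distance function).
static float L2Sqr(const float *a, const float *b, size_t qty) {
    float res = 0;
    for (size_t i = 0; i < qty; i++) {
        float d = a[i] - b[i];
        res += d * d;
    }
    return res;
}

// Time the kernel on random vectors of the given dimension and
// return the average nanoseconds per call.
static double BenchNsPerCall(size_t dim, size_t iters) {
    std::mt19937 rng(42);
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    std::vector<float> a(dim), b(dim);
    for (size_t i = 0; i < dim; i++) { a[i] = dist(rng); b[i] = dist(rng); }

    volatile float sink = 0;  // keeps the compiler from removing the loop
    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < iters; i++)
        sink = sink + L2Sqr(a.data(), b.data(), dim);
    auto t1 = std::chrono::steady_clock::now();
    (void)sink;
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / iters;
}
```

Usage: `printf("dim=65: %.1f ns/call\n", BenchNsPerCall(65, 100000));`. One caveat of such micro-benchmarks is that the two vectors stay resident in L1 cache, so the measured speedup can overstate the gain in an end-to-end search workload.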

@yurymalkov (Member)

@2ooom Got it. BTW, the SIFT test should already be in the repository.

sjwsl pushed a commit to sjwsl/hnswlib that referenced this pull request May 6, 2021
Perf improvement for dimension not of factor 4 and 16