[L2 Space] Improving performance when dimension is not a factor of 4 or 16 #131

Closed
wants to merge 33 commits into from

Conversation

Contributor

@2ooom 2ooom commented Jul 30, 2019

Process 8 values at once and finish the computation with non-vectorized instructions when dim % 8 != 0.

@yurymalkov
Member

Hi @2ooom,
Thanks for the PR!

A long time ago I ran performance tests, and it turned out that computing the residual contributes to the total run time even when dim % 16 == 0 (no residual). So I would like to keep the exact-multiple code paths (no tail) alongside the new ones, for the best performance in that case.

This can probably be done without much code duplication by using templates with bool parameters in L2SqrSIMD16Ext (e.g. if dim % 16 == 0, set the bool so the tail is skipped at compile time; if dim % 16 != 0, process the tail).

Can you please update the PR to accommodate that?
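
For illustration, the bool-template idea might look roughly like the sketch below. The names `l2_sqr_blocked16`, `DistFunc` and `select_l2` are hypothetical and not part of hnswlib, and the inner 16-element loop is only a portable stand-in for the real SSE/AVX body. The point is that `HasTail` is known at compile time, so the `dim % 16 == 0` instantiation carries no residual code.

```cpp
#include <cstddef>

// Hypothetical sketch, not the hnswlib implementation.
// HasTail is a compile-time constant, so the compiler can fold the
// tail branch away entirely in the HasTail == false instantiation.
template <bool HasTail>
float l2_sqr_blocked16(const float* a, const float* b, std::size_t dim) {
    const std::size_t dim16 = dim - dim % 16;  // largest multiple of 16 <= dim
    float sum = 0.0f;
    // Main body: 16 floats per iteration (stand-in for the SIMD loop).
    for (std::size_t i = 0; i < dim16; i += 16) {
        for (std::size_t j = 0; j < 16; ++j) {
            const float d = a[i + j] - b[i + j];
            sum += d * d;
        }
    }
    if (HasTail) {
        // Residual: the remaining dim % 16 values, processed one by one.
        for (std::size_t i = dim16; i < dim; ++i) {
            const float d = a[i] - b[i];
            sum += d * d;
        }
    }
    return sum;
}

using DistFunc = float (*)(const float*, const float*, std::size_t);

// Choose the instantiation once, e.g. when the space object is constructed,
// so the per-call hot path never tests dim % 16.
inline DistFunc select_l2(std::size_t dim) {
    return (dim % 16 == 0) ? &l2_sqr_blocked16<false>
                           : &l2_sqr_blocked16<true>;
}
```

The selection happens once per index rather than once per distance call, which is what keeps the exact-multiple case free of tail overhead.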

piem and others added 28 commits April 5, 2020 12:53
* temp debug state

* fix bug in loading index with deleted elements

* adjust condition in test

* add check for file existence

* cleanup
addPoint(void*, args...) to addPoint(const void*, args...).

The changes include the interface in bruteforce.h, and all interfaces
related to addPoint in hnswlib.

I tested the code on 1 million SIFT vectors; the result is OK.
1. searchKnn will first check if the graph is empty
2. searchKnn will return a min-heap

The test code in sift_1b is changed and tested.
…be virtual.

I modified sift_1b (not committed) to test the new interface; the result is OK.

Test result of sift_1b on 1 million data.
Loading GT:
Loading queries:
Loading index from sift1b_1m_ef_40_M_16.bin:
Actual memory usage: 417 Mb
Parsing gt:
10000
Loaded gt
1	0.2371	13.319 us
2	0.3712	15.691 us
3	0.4615	18.5166 us
4	0.5273	20.6371 us
5	0.5758	22.1235 us
6	0.6179	24.4141 us
7	0.6502	25.9906 us
8	0.6796	28.2004 us
9	0.7042	29.8559 us
10	0.7243	31.3286 us
11	0.7432	36.0276 us
12	0.7605	34.9448 us
13	0.7754	36.4176 us
14	0.7874	37.7606 us
15	0.8013	44.6698 us
16	0.8116	47.4424 us
17	0.8239	46.9154 us
18	0.8312	45.9322 us
19	0.8379	49.3406 us
20	0.8442	49.124 us
21	0.8507	52.1223 us
22	0.8566	52.4161 us
23	0.8622	56.9665 us
24	0.8675	71.5782 us
25	0.8731	72.4451 us
26	0.8768	57.0935 us
27	0.8812	58.3525 us
28	0.8845	59.5751 us
29	0.889	61.7516 us
30	0.8935	62.6091 us
40	0.9224	76.8735 us
50	0.9412	92.5431 us
60	0.9541	107.141 us
70	0.9632	121.24 us
80	0.9708	135.862 us
90	0.9756	163.516 us
100	0.9792	180.539 us
140	0.9883	228.747 us
180	0.9921	281.199 us
220	0.9942	338.32 us
260	0.9956	388.501 us
300	0.9962	445.776 us
340	0.9968	477.474 us
380	0.9975	534.054 us
420	0.9982	582.327 us
460	0.9983	625.824 us
Actual memory usage: 419 Mb
xiejianqiao and others added 4 commits April 5, 2020 12:53
L2 SIMD methods are split into two groups:
 1. `L2SqrSIMD(4|16)Ext` - uses SSE or AVX to compute the distance on
 dimensions that are multiples of 4 or 16
 2. `L2SqrSIMD(4|16)ExtResidual` - relies on (1) for the largest
 multiple of 4 or 16 dimensions and finishes the residual with the
 non-SIMD method `L2Sqr`.
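
A minimal sketch of the split described in this commit message, for the 4-wide case only. The function names are hypothetical stand-ins (not the actual hnswlib signatures), and the scalar fallback under `#else` is an assumption added so the sketch also compiles on targets without SSE.

```cpp
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define SKETCH_USE_SSE 1
#endif

// Scalar distance; plays the role of the non-SIMD method.
static float l2_sqr_scalar(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        const float d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Group (1): n must be a multiple of 4; no tail handling here.
static float l2_sqr_mult4(const float* a, const float* b, std::size_t n) {
#ifdef SKETCH_USE_SSE
    __m128 acc = _mm_setzero_ps();
    for (std::size_t i = 0; i < n; i += 4) {
        const __m128 d = _mm_sub_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i));
        acc = _mm_add_ps(acc, _mm_mul_ps(d, d));
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);  // horizontal sum of the four partial sums
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
#else
    return l2_sqr_scalar(a, b, n);  // assumed fallback for non-SSE targets
#endif
}

// Group (2): any dimension. SIMD on the largest multiple of 4,
// then the scalar method on the remaining dim % 4 values.
static float l2_sqr_mult4_residual(const float* a, const float* b,
                                   std::size_t dim) {
    const std::size_t dim4 = dim - dim % 4;
    return l2_sqr_mult4(a, b, dim4) +
           l2_sqr_scalar(a + dim4, b + dim4, dim - dim4);
}
```

Keeping the two groups separate means a caller whose dimension is an exact multiple can use the first function directly and never pay for the residual call.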

8 participants