Question about the recall performance #3


Closed · willard-yuan opened this issue Mar 30, 2018 · 13 comments

@willard-yuan

Hi yurymalkov,

Sorry to disturb you again. I have a question about the recall evaluation in sift_1b.cpp. There, each query returns a single sample, and that sample counts as a hit if it appears among the top-K samples (top@K) of the ground truth. I think another method may be more appropriate:

Retrieve K samples per query (top@K), and count a hit if the top-1 ground-truth sample appears among those K retrieved samples.
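To make the difference concrete, here is a minimal C++ sketch of the two counting rules (all names are illustrative, not taken from sift_1b.cpp):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Rule used in sift_1b.cpp (with k = 1): the single returned label counts
// as a hit if it appears among the K nearest ground-truth labels.
bool hit_sift1b_style(std::size_t returned_top1,
                      const std::vector<std::size_t>& gt_topk) {
    return std::find(gt_topk.begin(), gt_topk.end(), returned_top1)
           != gt_topk.end();
}

// Proposed rule: the top-1 ground-truth label counts as a hit if it
// appears among the K returned labels (top@K retrieval).
bool hit_topk_style(const std::vector<std::size_t>& returned_topk,
                    std::size_t gt_top1) {
    return std::find(returned_topk.begin(), returned_topk.end(), gt_top1)
           != returned_topk.end();
}
```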

I think this evaluation is better than the one adopted in sift_1b.cpp. Take face retrieval as an example: we want a high recall rate, so if the matching face is not recalled within the top 10, we can simply enlarge the number of retrieved samples, e.g. to top@50.

What do you think of the evaluation method in sift_1b.cpp compared to the one described above?

Looking forward to your reply.

@yurymalkov (Member)

I am not sure I've understood you correctly.
In sift_1b.cpp the search quality is measured by intersecting the k results from the algorithm with the k closest elements from the ground truth (gt). k is set to 1 for comparison with the PQ-method results in the paper.
The 1B SIFT dataset does not have any information about element classes; the gt is just the k nearest elements. So, if the top-1 gt element is in the result of a k-NN query, it always comes first (since the results are sorted).
I think what you have suggested is useful when there are gt classes, i.e. when the best answer is not always the element nearest to the query (which is the case for face recognition).

@willard-yuan (Author)

Hi yurymalkov,

Recently I evaluated the top@N performance: the number of ground-truth (gt) nearest neighbors is fixed at 1, and the number of retrieved samples N ranges from 1 to 1000 (ef is set to 1000). The performance is as follows:

| N    | ef   | recall   | time per query (µs) |
|------|------|----------|---------------------|
| 1    | 1000 | 0.983400 | 5235.528809         |
| 2    | 1000 | 0.983600 | 4660.152344         |
| 3    | 1000 | 0.983600 | 4583.465332         |
| 4    | 1000 | 0.983600 | 4617.218750         |
| 5    | 1000 | 0.983600 | 4531.222168         |
| 10   | 1000 | 0.983600 | 4606.251953         |
| 50   | 1000 | 0.983600 | 4626.477051         |
| 100  | 1000 | 0.983600 | 4567.841309         |
| 300  | 1000 | 0.983600 | 4668.483887         |
| 500  | 1000 | 0.983600 | 4685.919922         |
| 1000 | 1000 | 0.983600 | 4748.951660         |

Increasing the number of retrieved samples has little effect on performance. I think the reason is that recall is already very high (close to 100%) when N is 1, so increasing N adds little. I will test it on large-scale face features.

@yurymalkov (Member)

Hi @willard-yuan,

Thank you for the tests! I think such behavior is to be expected.

During a search the algorithm internally keeps a sorted list of the ef current approximate nearest neighbors. The list is updated by iterating through the graph.

When the stop condition is met (all neighbors of the elements in the list have been evaluated), the list is shrunk to the N best elements (or 'k' in k-NN notation) to produce the output for the user: https://github.com/nmslib/hnsw/blob/master/hnswlib/hnswalg.h#L713. A simplified sketch of this loop is below.

So, in our case the recall should not depend on N (unless ef is changed), because the initial internal list is already sorted. It does depend slightly in our case (N=1 vs. N=2), though, probably because several elements in the dataset are equally close to at least some query.
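Roughly, the search loop looks like this. This is a simplified sketch, not the actual hnswlib code; `dist` stands in for precomputed distances from the query, and `neighbors` for the graph adjacency lists:

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

using Cand = std::pair<float, std::size_t>;  // (distance, element id)

std::vector<Cand> search_sketch(std::size_t entry, std::size_t ef, std::size_t k,
                                const std::vector<std::vector<std::size_t>>& neighbors,
                                const std::vector<float>& dist) {
    // max-heap of the ef best candidates found so far (top = worst of them)
    std::priority_queue<Cand> top;
    // min-heap of candidates whose neighbors still need to be expanded
    std::priority_queue<Cand, std::vector<Cand>, std::greater<>> frontier;
    std::vector<bool> visited(dist.size(), false);

    frontier.emplace(dist[entry], entry);
    top.emplace(dist[entry], entry);
    visited[entry] = true;

    while (!frontier.empty()) {
        auto [d, v] = frontier.top();
        frontier.pop();
        // stop condition: the closest unexpanded candidate is already worse
        // than the worst of the ef current results
        if (d > top.top().first) break;
        for (std::size_t u : neighbors[v]) {
            if (visited[u]) continue;
            visited[u] = true;
            if (top.size() < ef || dist[u] < top.top().first) {
                frontier.emplace(dist[u], u);
                top.emplace(dist[u], u);
                if (top.size() > ef) top.pop();  // keep only ef candidates
            }
        }
    }

    while (top.size() > k) top.pop();  // shrink the internal list to k results
    std::vector<Cand> out;
    while (!top.empty()) { out.push_back(top.top()); top.pop(); }
    return out;  // ordered from the k-th best down to the best
}
```

The internal list of ef candidates is built the same way regardless of N; only the final shrink differs, which is why recall is essentially flat for all N up to ef.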

@willard-yuan (Author)

@yurymalkov Thanks for your explanation. I think I've got the main idea of the method. By the way, are there any other English slides or documents that describe the method in more detail, besides the paper?

@yurymalkov (Member)

@willard-yuan Unfortunately, I am not aware of other detailed documents. A somewhat different description can be found in nmslib's manual (https://github.com/nmslib/nmslib/blob/master/manual/manual.pdf), but it is very short.
By the way, the paper is currently being updated. I would appreciate feedback on what needs to be improved.

@willard-yuan (Author)

Hi @yurymalkov, I have finished the experiment on 128-dimensional CNN features. The recall is great:

(plot: recall vs. top@K)

M: 40, memory: 211533 MB, index file: cnn2b_199485332m_ef_40_M_40.bin, number of queries: 10000, ef: 1000, distance: inner product:

| top@K | recall   | time per query (µs) |
|-------|----------|---------------------|
| 1     | 0.928600 | 6448.714355         |
| 2     | 0.948300 | 7658.459961         |
| 3     | 0.952600 | 7674.244629         |
| 4     | 0.954000 | 7659.506348         |
| 5     | 0.954700 | 7679.874023         |
| 10    | 0.955800 | 7709.812500         |
| 50    | 0.957400 | 7720.283691         |
| 100   | 0.957800 | 7722.512695         |
| 300   | 0.958000 | 7763.615234         |
| 500   | 0.958100 | 7779.351562         |
| 1000  | 0.958100 | 7859.372559         |

Thank you for your great work.

@yurymalkov (Member)

@willard-yuan Cool!
By the way, how did you set efConstruction?

@Neltherion (May 4, 2018)

@willard-yuan Thanks for the plot and results. I'm also curious to know the parameters you've used for the benchmark and the time it took to build the index.

@willard-yuan (Author)

@yurymalkov @Neltherion Sorry for the late reply. efConstruction is set to 80. Indexing the 2B face vectors takes an hour or more (I forget the exact time). The machine I used is as follows; a rough build sketch with these parameters is shown after the spec:

```
vendor_id	: GenuineIntel
cpu family	: 6
model name	: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
stepping	: 1
microcode	: 0xb00001f
cpu MHz		: 2040.242
cache size	: 51200 KB
physical id	: 1
siblings	: 40
core id		: 28
cpu cores	: 20
apicid		: 121
initial apicid	: 121
fpu		: yes
fpu_exception	: yes
cpuid level	: 20
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts
bugs		:
bogomips	: 4392.21
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
```
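For reference, setting these parameters with hnswlib's C++ API would look roughly like this (a sketch against the current API, which may differ slightly from the version used here; `dim`, `max_elements`, and the data arrays are placeholders):

```cpp
#include <cstddef>
#include <vector>
#include "hnswlib/hnswlib.h"

int main() {
    const int dim = 128;                    // feature dimension used above
    const std::size_t max_elements = 1000;  // placeholder; ~2B in the real run
    const std::size_t M = 40;
    const std::size_t ef_construction = 80;

    hnswlib::InnerProductSpace space(dim);  // distance: inner product
    hnswlib::HierarchicalNSW<float> index(&space, max_elements, M, ef_construction);

    std::vector<float> data(max_elements * dim, 0.0f);  // placeholder vectors
    for (std::size_t i = 0; i < max_elements; i++)
        index.addPoint(data.data() + i * dim, i);

    index.setEf(1000);                    // query-time ef, as in the tests above
    std::vector<float> query(dim, 0.0f);  // placeholder query
    auto result = index.searchKnn(query.data(), 10);  // top@10
    (void)result;
    return 0;
}
```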

@yurymalkov (Member)

@willard-yuan Thanks for the update. The search speed/accuracy can probably be improved significantly if efConstruction is increased.

@Neltherion (May 6, 2018)

@willard-yuan Thanks so much.
@yurymalkov After reading your comment, I found myself wishing there were a proper way to determine the optimal efConstruction (considering speed, accuracy, and build time).

I always see "you can pump up efConstruction" comments in the issues section, but I don't know how much is standard and how much is overkill. Is there a benchmark of efConstruction's effect on a standard machine, or should we just tune the parameters ourselves to get a feel for their effect?

@yurymalkov
Copy link
Member

@Neltherion efConstruction can be auto-tuned for optimal query speed, up to the point where increasing it further no longer helps. This feature will be added to the library at some point.
efConstruction and ef are basically the same parameter. So, when I see that getting a high recall (0.95 in this case) requires ef=1000 while efConstruction is only 80, it means the construction-time searches were not accurate, and thus the search time can be improved.
I.e. a good choice of efConstruction for optimal query time is one where a search with ef=efConstruction and k=M has recall close to 1.0 (say, 0.9-0.95).
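As a sketch of that rule (`build_index` and `measure_recall` are hypothetical helpers standing in for your own build and evaluation code, not library functions):

```cpp
#include <cstddef>

struct Index;  // placeholder for whatever index type you build

// Hypothetical helpers for illustration only: build an index with the given
// parameters, and measure k-NN recall at the given ef on a held-out query set.
Index* build_index(std::size_t M, std::size_t efConstruction);
double measure_recall(Index* index, std::size_t ef, std::size_t k);

// Grow efConstruction until a search with ef = efConstruction and k = M
// reaches recall in the 0.9-0.95 range, as suggested above.
std::size_t tune_ef_construction(std::size_t M, double target = 0.92) {
    for (std::size_t efC = M; efC <= 2000; efC *= 2) {
        Index* idx = build_index(M, efC);
        if (measure_recall(idx, /*ef=*/efC, /*k=*/M) >= target)
            return efC;
    }
    return 2000;  // past this, extra construction effort rarely pays off
}
```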

@searchivarius (Member) commented May 6, 2018 via email
