Skip to content

Does hnsw support adding items one by one in real-time? #2

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
willard-yuan opened this issue Mar 27, 2018 · 10 comments
Closed

Does hnsw support adding items one by one in real-time? #2

willard-yuan opened this issue Mar 27, 2018 · 10 comments

Comments

@willard-yuan
Copy link

Hi yurymalkov,

Thank you for your great work, and the performance shows in the paper is interesting. The python bindings shows the index can be built with incremental construction. Does the hnsw support adding items one by one? i.e. adding items one by one in real-time to the index dynamically.

@yurymalkov
Copy link
Member

@willard-yuan Yes, the binding supports adding elements one by one, and should work in parallel with python threads.

At the moment you need to add a dummy dimension for data and labels (i.e. call np.expand_dims(..,axis=0) on your vector and labels before passing them), as binding currently expects the input to be a matrix.
Also you should specify the parameter num_threads=1 to avoid creating threads, i.e.
int_labels = p.add_items(np.expand_dims(vector,axis=0), np.expand_dims(label,axis=0), num_threads=1)
I think I'll have time to fix both of these inconveniences within a week.

@willard-yuan
Copy link
Author

@yurymalkov Thank you for your reply. I plan to use hnsw for face retrieval.

@lzuwei
Copy link

lzuwei commented Mar 30, 2018

Hi @yurymalkov, thank you for the your great work as well,
I have a few questions as well for this implementation.

  1. is the performance of this similar to nmslib?
  2. based on the above discussion, am I right to say this implementation supports incremental indexing?
  3. in the example provided, you explicitly mention a need to specify max number of elements, what is this for?
# Initing index - the maximum number of elements should be known beforehand
p.init_index(max_elements = num_elements, ef_construction = 200, M = 16)

@yurymalkov
Copy link
Member

Hi @lzuwei

  1. The search performance is generally on par with the nmslib's implementation. On typical test datasets (i.e. sift, glove) it is a bit slower at high recalls compared to using nmslib with post='2' (not present in hnswlib because of the incremental construction), but also a bit faster at low recalls. Note that there are clear ways how to improve, but they require some coding work.
    See below a sift 1M plot with the closest competitors. Note that I am not completely sure if I have set all of the faiss parameters right.
    sift-128-euclidean

The good news is that the index time and indexing memory consumption are way better than in the nmslib, especially compared to post='2' case (the improvement in this case is ~3-4X).
2. Yes
3. The index stores the data in a single chunk. The chunk is allocated during the index initialization. If the number of data elements exceeds the predefined maximum, the chunk has to be extended (i.e. reallocated), but this functionality has not been implemented yet.

@danieloliveirabrazil
Copy link

how to use "string" as label in "add_items" ?

@willard-yuan
Copy link
Author

@danieloliveirabrazil You can use a map <int, string> to record the label when an item is added to the index.

@394781865
Copy link

when we can use Hamming distance? sorry! I'm english language is not well.

@394781865
Copy link

if hnsw use Squared L2 distance and lsh use Hamming distance ,who will be faster ? sorry! I'm Chinese student!

@yurymalkov
Copy link
Member

Hi @394781865,
The library does not currently support Hamming distance.
You should probably use https://github.com/nmslib/nmslib which supports it (at least, in C++).
Concerning your problem - usually LSH performs much worse, but I am not sure this is universal.

@searchivarius
Copy link
Member

@394781865 yes, NMSLIB supports the hamming distance, you can use the following example https://github.com/nmslib/nmslib/blob/master/python_bindings/tests/legacy_test.py#L159
with a few changes:

  1. The space name should be bit_hamming
  2. Data points are fed as strings containing space-separated zeros and ones.
  3. All bit vectors should have equal number of dimensions, i.e., you can't compare 100-d binary vectors with 200-d binary vectors.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants