[L2 Space] Improving performance when dimension is not a factor of 4 or 16 #131

Closed
wants to merge 33 commits into from
33 commits
2bf4d13
[L2 Space] Improving performance when dimension is not a factor of 4 …
2ooom Jul 30, 2019
1bdbe2d
setExternalLabel returns void
piem Jul 24, 2019
5f10af8
use size_t counters to avoid size_t to int comparisons
piem Jul 24, 2019
e180326
Update README.md
yurymalkov Jul 31, 2019
34b142d
fix bug in sift test
Aug 1, 2019
ce80e99
update bruteforce to support element updates, add locks for multi-thr…
Aug 1, 2019
af0007c
pypi package
louisabraham Aug 20, 2019
bc91a93
travis installation
louisabraham Aug 20, 2019
2195112
include hsnwlib in sdist
louisabraham Aug 21, 2019
d58b9a8
removebdist_wheel from distribution
louisabraham Aug 23, 2019
a6b87f2
use symlink
louisabraham Aug 24, 2019
077b041
remove unuseful cp (and relaunch tests)
louisabraham Aug 24, 2019
86188b1
Update README.md
yurymalkov Aug 28, 2019
f47e853
Update README.md
yurymalkov Aug 28, 2019
d6d204f
fix/improve tests, #142
yurymalkov Aug 28, 2019
76db8ae
Fix load bugs/messages, update test, deprecate old indices (#148)
yurymalkov Sep 16, 2019
6c4ab29
The interface addPoint is changed from
orrorcol Sep 25, 2019
0334c8c
Two main changes:
orrorcol Sep 25, 2019
16c1175
Remove unneeded header <queue>
orrorcol Sep 25, 2019
2b400f3
Fix typos
PWhids Sep 26, 2019
eb8dc98
A new interface taking a template comparator is added, so it can not …
orrorcol Sep 26, 2019
5e037e2
Using std::sort to sort the result according to the comparator provid…
orrorcol Sep 28, 2019
dc25836
fix missing deletion initialization
Nov 5, 2019
65f35ac
Expose current-count and max-elements in Python
vimal-mathew Nov 6, 2019
d5f6aad
fix python tests
Nov 6, 2019
bf94915
Throw exception on malloc fails
vimal-mathew Nov 11, 2019
ef1d4e0
Update README.md
yurymalkov Nov 11, 2019
b75c713
updated path from static to dynamic
ashfaq92 Nov 20, 2019
83b635d
bump version
Dec 16, 2019
aae3be9
fix,overflow in getIdsList
Jan 20, 2020
679903c
include one more other implementation
Feb 12, 2020
fd4ebf4
Update README.md
yurymalkov Mar 4, 2020
4bd853c
[L2Space] Perf improvement for dimension of factor 4 and 16
2ooom Apr 5, 2020
3 changes: 2 additions & 1 deletion .travis.yml
@@ -3,6 +3,7 @@ language: python
matrix:
include:
- python: 3.6
- python: 3.7
install:
- |
cd python_bindings
@@ -12,4 +13,4 @@ install:
script:
- |
cd python_bindings
-python setup.py test
+python setup.py test
4 changes: 2 additions & 2 deletions ALGO_PARAMS.md
@@ -20,10 +20,10 @@ The range ```M```=12-48 is ok for the most of the use cases. When ```M``` is cha
Nonetheless, ef and ef_construction parameters can be roughly estimated by assuming that ```M```*```ef_{construction}``` is
a constant.

-* ```ef_constrution``` - the parameter has the same meaning as ```ef```, but controls the index_time/index_accuracy. Bigger
+* ```ef_construction``` - the parameter has the same meaning as ```ef```, but controls the index_time/index_accuracy. Bigger
ef_construction leads to longer construction, but better index quality. At some point, increasing ef_construction does
not improve the quality of the index. One way to check if the selection of ef_construction was ok is to measure a recall
-for M nearest neighbor search when ```ef``` =```ef_constuction```: if the recall is lower than 0.9, than there is room
+for M nearest neighbor search when ```ef``` =```ef_construction```: if the recall is lower than 0.9, than there is room
for improvement.
* ```num_elements``` - defines the maximum number of elements in the index. The index can be extended by saving/loading (the load_index
function has a parameter which defines the new maximum number of elements).
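
The recall check described for `ef_construction` above can be sketched in plain Python. `recall_at_k` is a hypothetical helper, not part of hnswlib; the assumption is that `found_labels` would come from an approximate `M`-nearest-neighbor query run with `ef` equal to `ef_construction`, and `true_labels` from an exact search over the same data. The label lists below are made-up illustration data.

```python
def recall_at_k(found_labels, true_labels):
    """Fraction of true nearest neighbors that the approximate search returned.

    Both arguments hold one list of labels per query.
    """
    if len(found_labels) != len(true_labels):
        raise ValueError("need one result list per ground-truth list")
    hits = 0
    total = 0
    for found, truth in zip(found_labels, true_labels):
        hits += len(set(found) & set(truth))
        total += len(truth)
    return hits / total if total else 0.0


# Two queries with 3 true neighbors each; the approximate search missed label 6.
approx = [[1, 2, 3], [4, 5, 9]]
exact = [[1, 2, 3], [4, 5, 6]]
recall = recall_at_k(approx, exact)  # 5 of 6 true neighbors were found
print("recall:", recall)
if recall < 0.9:
    print("below 0.9: a larger ef_construction may still improve the index")
```

A value below 0.9 is the threshold the text above names as leaving room for improvement.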
21 changes: 18 additions & 3 deletions README.md
@@ -1,6 +1,10 @@
# Hnswlib - fast approximate nearest neighbor search
Header-only C++ HNSW implementation with python bindings. Paper code for the HNSW 200M SIFT experiment

**NEWS:**

**Thanks to Louis Abraham ([@louisabraham](https://github.com/louisabraham)) hnswlib can now be installed via pip!**

Highlights:
1) Lightweight, header-only, no dependencies other than C++ 11.
2) Interfaces for C++, python and R (https://github.com/jlmelville/rcpphnsw).
@@ -26,7 +30,7 @@ Note that inner product is not an actual metric. An element can be closer to som

For other spaces use the nmslib library https://github.com/nmslib/nmslib.

-#### short API description
+#### Short API description
* `hnswlib.Index(space, dim)` creates a non-initialized HNSW index in space `space` with integer dimension `dim`.

Index methods:
@@ -45,7 +49,7 @@ Index methods:
* `resize_index(new_size)` - changes the maximum capacity of the index. Not thread safe with `add_items` and `knn_query`.

* `set_ef(ef)` - sets the query time accuracy/speed trade-off, defined by the `ef` parameter (
-[ALGO_PARAMS.md](ALGO_PARAMS.md)).
+[ALGO_PARAMS.md](ALGO_PARAMS.md)). Note that the parameter is currently not saved along with the index, so you need to set it manually after loading.

* `knn_query(data, k = 1, num_threads = -1)` make a batch query for `k` closest elements for each element of the
`data` (shape:`N*dim`). Returns a numpy array of (shape:`N*k`).
@@ -59,11 +63,13 @@ Index methods:

* `set_num_threads(num_threads)` set the default number of cpu threads used during data insertion/querying.

-* `get_items(ids)` - returns a numpy array (shape:`N*dim`) of vectors that have integer identifiers specified in `ids` numpy vector (shape:`N`).
+* `get_items(ids)` - returns a numpy array (shape:`N*dim`) of vectors that have integer identifiers specified in `ids` numpy vector (shape:`N`). Note that for cosine similarity it currently returns **normalized** vectors.

* `get_ids_list()` - returns a list of all elements' ids.

* `get_max_elements()` - returns the current capacity of the index

* `get_current_count()` - returns the current number of elements stored in the index
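
To make the note on `get_items` concrete: for cosine similarity, "normalized" is taken here to mean scaled to unit Euclidean length, so the returned vectors can differ from the ones passed to `add_items`. A minimal pure-Python sketch under that assumption (`normalize` is a hypothetical helper, not hnswlib's API):

```python
import math


def normalize(vector):
    """Scale a vector to unit Euclidean length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


original = [3.0, 4.0]          # length 5
stored = normalize(original)   # roughly [0.6, 0.8], length 1
print(original, "->", stored)
```

Code that needs the original magnitudes would have to keep its own copy of the input vectors.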



@@ -166,13 +172,18 @@ print("Recall for two batches:", np.mean(labels.reshape(-1) == np.arange(len(dat
```

### Bindings installation

You can install from sources:
```bash
apt-get install -y python-setuptools python-pip
pip3 install pybind11 numpy setuptools
cd python_bindings
python3 setup.py install
```

or you can install via pip:
`pip install hnswlib`

### Other implementations
* Non-metric space library (nmslib) - main library(python, C++), supports exotic distances: https://github.com/nmslib/nmslib
* Faiss library by facebook, uses own HNSW implementation for coarse quantization (python, C++):
@@ -186,9 +197,13 @@ https://github.com/dbaranchuk/ivf-hnsw
* Go implementation: https://github.com/Bithack/go-hnsw
* Python implementation (as a part of the clustering code by Matteo Dell'Amico): https://github.com/matteodellamico/flexible-clustering
* Java implementation: https://github.com/jelmerk/hnswlib
* Java bindings using Java Native Access: https://github.com/stepstone-tech/hnswlib-jna
* .Net implementation: https://github.com/microsoft/HNSW.Net

### Contributing to the repository
Contributions are highly welcome!

Please make pull requests against the `develop` branch.

### 200M SIFT test reproduction
To download and extract the bigann dataset:
2 changes: 1 addition & 1 deletion examples/example.py
@@ -45,7 +45,7 @@
# Serializing and deleting the index:
index_path='first_half.bin'
print("Saving index to '%s'" % index_path)
-p.save_index("first_half.bin")
+p.save_index(index_path)
del p

# Reiniting, loading the index
64 changes: 52 additions & 12 deletions hnswlib/bruteforce.h
@@ -1,6 +1,8 @@
#pragma once
#include <unordered_map>
#include <fstream>
#include <mutex>
#include <algorithm>

namespace hnswlib {
template<typename dist_t>
@@ -20,6 +22,8 @@ namespace hnswlib {
dist_func_param_ = s->get_dist_func_param();
size_per_element_ = data_size_ + sizeof(labeltype);
data_ = (char *) malloc(maxElements * size_per_element_);
if (data_ == nullptr)
throw std::runtime_error("Not enough memory: BruteforceSearch failed to allocate data");
cur_element_count = 0;
}

@@ -35,22 +39,37 @@ namespace hnswlib {
size_t data_size_;
DISTFUNC <dist_t> fstdistfunc_;
void *dist_func_param_;
std::mutex index_lock;

std::unordered_map<labeltype,size_t > dict_external_to_internal;

-void addPoint(void *datapoint, labeltype label) {
-if(dict_external_to_internal.count(label))
-throw std::runtime_error("Ids have to be unique");
+void addPoint(const void *datapoint, labeltype label) {
+
+int idx;
+{
+std::unique_lock<std::mutex> lock(index_lock);
+
+
+
+auto search=dict_external_to_internal.find(label);
+if (search != dict_external_to_internal.end()) {
+idx=search->second;
+}
+else{
+if (cur_element_count >= maxelements_) {
+throw std::runtime_error("The number of elements exceeds the specified limit\n");
+}
+idx=cur_element_count;
+dict_external_to_internal[label] = idx;
+cur_element_count++;
+}
+}
+memcpy(data_ + size_per_element_ * idx + data_size_, &label, sizeof(labeltype));
+memcpy(data_ + size_per_element_ * idx, datapoint, data_size_);



-if (cur_element_count >= maxelements_) {
-throw std::runtime_error("The number of elements exceeds the specified limit\n");
-};
-memcpy(data_ + size_per_element_ * cur_element_count + data_size_, &label, sizeof(labeltype));
-memcpy(data_ + size_per_element_ * cur_element_count, datapoint, data_size_);
-dict_external_to_internal[label]=cur_element_count;

-cur_element_count++;
};

void removePoint(labeltype cur_external) {
@@ -68,8 +87,10 @@ namespace hnswlib {
}


-std::priority_queue<std::pair<dist_t, labeltype >> searchKnn(const void *query_data, size_t k) const {
+std::priority_queue<std::pair<dist_t, labeltype >>
+searchKnn(const void *query_data, size_t k) const {
std::priority_queue<std::pair<dist_t, labeltype >> topResults;
if (cur_element_count == 0) return topResults;
for (int i = 0; i < k; i++) {
dist_t dist = fstdistfunc_(query_data, data_ + size_per_element_ * i, dist_func_param_);
topResults.push(std::pair<dist_t, labeltype>(dist, *((labeltype *) (data_ + size_per_element_ * i +
@@ -90,6 +111,24 @@ namespace hnswlib {
return topResults;
};

template <typename Comp>
std::vector<std::pair<dist_t, labeltype>>
searchKnn(const void* query_data, size_t k, Comp comp) {
std::vector<std::pair<dist_t, labeltype>> result;
if (cur_element_count == 0) return result;

auto ret = searchKnn(query_data, k);

while (!ret.empty()) {
result.push_back(ret.top());
ret.pop();
}

std::sort(result.begin(), result.end(), comp);

return result;
}

void saveIndex(const std::string &location) {
std::ofstream output(location, std::ios::binary);
std::streampos position;
@@ -118,12 +157,13 @@ namespace hnswlib {
dist_func_param_ = s->get_dist_func_param();
size_per_element_ = data_size_ + sizeof(labeltype);
data_ = (char *) malloc(maxelements_ * size_per_element_);
if (data_ == nullptr)
throw std::runtime_error("Not enough memory: loadIndex failed to allocate data");

input.read(data_, maxelements_ * size_per_element_);

input.close();

return;
}

};
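
The two behavioral changes in this file, `addPoint` updating an existing label in place under a lock and `searchKnn` returning results ordered by a caller-supplied comparator, can be sketched in Python. `BruteForceSketch` and its methods are hypothetical stand-ins rather than hnswlib's API; squared Euclidean distance is an assumption, and a `key` function plays the role of the C++ comparator.

```python
import heapq
import threading


class BruteForceSketch:
    """Toy model of the patched brute-force index (hypothetical, not hnswlib's API)."""

    def __init__(self, max_elements):
        self.max_elements = max_elements
        self.slots = [None] * max_elements   # slot -> (vector, label)
        self.label_to_slot = {}              # external label -> slot number
        self.count = 0
        self.lock = threading.Lock()

    def add_point(self, vector, label):
        # As in the diff, only slot lookup/allocation is done under the lock.
        with self.lock:
            slot = self.label_to_slot.get(label)
            if slot is None:
                if self.count >= self.max_elements:
                    raise RuntimeError("The number of elements exceeds the specified limit")
                slot = self.count
                self.label_to_slot[label] = slot
                self.count += 1
        # A known label reuses its slot, so this overwrites the old vector.
        self.slots[slot] = (list(vector), label)

    def search_knn(self, query, k, key=None):
        if self.count == 0:
            return []                        # empty index: nothing to return
        scored = []
        for vector, label in self.slots[:self.count]:
            dist = sum((q - v) ** 2 for q, v in zip(query, vector))
            scored.append((dist, label))
        top = heapq.nsmallest(k, scored)     # k closest, nearest first
        if key is not None:
            top.sort(key=key)                # caller-chosen result order
        return top


index = BruteForceSketch(max_elements=3)
index.add_point([0.0, 0.0], label=10)
index.add_point([1.0, 1.0], label=20)
index.add_point([5.0, 5.0], label=10)        # same label: slot 0 is overwritten
farthest_first = index.search_knn([1.0, 1.0], k=2, key=lambda pair: -pair[0])
print(index.count, farthest_first)
```

Re-adding label 10 leaves the element count at 2, which mirrors the change from rejecting duplicate ids to treating them as updates. The sketch ignores the case where a reader sees a newly allocated slot before its data is written.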