
Merge 0.5.0 changes to master #279

Merged
merged 73 commits into from
Jan 29, 2021
fa32338
currObj should be updated as the closest from all candidated.
orrorcol Jun 27, 2020
b6b338e
1. Replace the template interface searchKnn with virtual interface
orrorcol Jun 27, 2020
8987188
minor fix
orrorcol Jun 27, 2020
a35fcb5
Adding cassert include in header to fix compilation error on Ubuntu 1…
Jul 9, 2020
4a4689c
Small patch to enable compilation with sign_compare and reorder warni…
Jul 9, 2020
40f31da
Merge pull request #231 from jwimberley/cassert_beaver_gcc730_fix
yurymalkov Jul 14, 2020
ab012ae
Merge pull request #233 from jwimberley/gcc_warning_fixes
yurymalkov Jul 15, 2020
6f2c3fb
L2SqrI: add fallback if the dimension is not a multiple of 4
fabiencastan Aug 19, 2020
21b54fe
Merge pull request #243 from alicevision/dev/l2sqrI4x
yurymalkov Aug 31, 2020
cb7b398
New methods loadIndexFromStream and saveIndexToStream expose de-/seri…
dbespalov Oct 12, 2020
e161db8
Implement __getstate__ and __setstate__ to allow pickling of hnswlib.…
dbespalov Oct 12, 2020
e0eacad
Verify knn_query results match before/after pickling hnswlib.Index ob…
dbespalov Oct 12, 2020
ec4f4b1
add documeentation
dbespalov Oct 12, 2020
a3646cc
clean-up readme
dbespalov Oct 12, 2020
a1ba4e5
clean-up readme
dbespalov Oct 12, 2020
cf3846c
clean-up readme
dbespalov Oct 12, 2020
27471cd
clean-up readme
dbespalov Oct 12, 2020
4220956
Update bindings_test_pickle.py
dbespalov Oct 12, 2020
72b6501
Revert "New methods loadIndexFromStream and saveIndexToStream expose …
Oct 23, 2020
3a62b41
use python's buffer protocol to avoid making copies of ann data (stat…
Oct 23, 2020
fe6d2fa
replace tab characters with spaces
Oct 23, 2020
c9fb60d
test each space (ip/cosine/l2) as a separate unittest
Oct 23, 2020
3c4510d
return array_t pointers
dbespalov Oct 25, 2020
64c5154
expose static method of Index class as copy constructor in python
dbespalov Oct 25, 2020
7b445c8
do not waste space when returning serialized appr_alg->linkLists_
dbespalov Oct 25, 2020
c02f1dc
serialize element_lookup_ and element_level_ as array_t arrays; pass …
dbespalov Oct 26, 2020
1f25102
warn that serialization is not thread safe with add_items
dbespalov Nov 3, 2020
1165370
warn that serialization is not thread safe with add_items; add todo b…
dbespalov Nov 3, 2020
2c040e6
remove camel casing
dbespalov Nov 3, 2020
6298996
add static const int data member to class Index that stores serializa…
dbespalov Nov 6, 2020
c8276d8
add todo block to convert parameter tuple to dicts
dbespalov Nov 6, 2020
345f71d
add todo block to convert parameter tuple to dicts
dbespalov Nov 6, 2020
a64a001
Fixes of some typos in readme
dyashuni Nov 6, 2020
cee0e99
Merge pull request #251 from dbespalov/python_bindings_pickle_io
yurymalkov Nov 9, 2020
1c97b5d
Merge pull request #253 from dyashuni/patch-1
yurymalkov Nov 9, 2020
ec38db1
Rename space_name to space on the python side
Nov 18, 2020
a0c2076
Add gitignore file to ignore build folders
Nov 18, 2020
8cc442d
Merge pull request #255 from dyashuni/develop
yurymalkov Nov 22, 2020
ded26fc
use dict for Index serialization
dbespalov Nov 23, 2020
e845d8a
debugging; have to wrap state dict into a tuple
dbespalov Nov 25, 2020
6425deb
Move setup.py into root folder to fix bindings build when symlink doe…
Nov 28, 2020
376c8cd
Update gitignore
Nov 28, 2020
68a8a36
Update Makefile to clean tmp folder
Nov 28, 2020
19abf9b
Update readme
Nov 28, 2020
2799aab
clean assert error message
dbespalov Nov 30, 2020
4c002bc
fix compilation error on osx
dbespalov Nov 30, 2020
5b2585d
Revert symlink to hnswlib and add windows to build matrix
Dec 1, 2020
dda9b31
Fix symlink
Dec 1, 2020
b1994a5
Update travis
Dec 1, 2020
afd18d2
Update travis
Dec 1, 2020
334cc6c
Merge pull request #258 from dbespalov/python_bindings_state_dict
yurymalkov Dec 1, 2020
b4b7b86
Merge pull request #224 from uestc-lfs/fix-update-ep
yurymalkov Dec 7, 2020
6efa48c
Add symlink to setup.py instead of hnswlib
Dec 8, 2020
4cf279b
Merge remote-tracking branch 'upstream/develop' into fix-interface
orrorcol Dec 10, 2020
9fe639d
fix interface
orrorcol Dec 12, 2020
21c1ad7
minor fix
orrorcol Dec 13, 2020
52da3d2
Merge pull request #225 from uestc-lfs/fix-interface
yurymalkov Dec 14, 2020
21b908f
Update README.md
js1010 Jan 4, 2021
d2e5a18
Update README.md
js1010 Jan 4, 2021
6449e64
Merge pull request #273 from js1010/patch-2
yurymalkov Jan 5, 2021
6ae02a5
Run sift test from separate directory
Jan 6, 2021
6d3b29f
Merge pull request #260 from dyashuni/develop
yurymalkov Jan 7, 2021
68b6257
PEP-517 support
groodt Jan 10, 2021
e94c5dc
Simplify include_dirs
groodt Jan 10, 2021
467c98f
Remove deprecated `setup.py test`
groodt Jan 13, 2021
2248ab4
pybind11 isn't needed at runtime, only build time
groodt Jan 13, 2021
8fe02c0
Support for packaging sdist
groodt Jan 15, 2021
73134a7
https git clone in README
groodt Jan 15, 2021
a9153e9
Add license file to pypi package
Jan 16, 2021
65a5f28
Merge pull request #274 from groodt/groodt-pyproject-toml
yurymalkov Jan 17, 2021
215526c
Merge pull request #276 from dyashuni/develop
yurymalkov Jan 17, 2021
1469702
bump version
yurymalkov Jan 25, 2021
e03162b
Merge pull request #278 from nmslib/upd0.5
yurymalkov Jan 25, 2021
8 changes: 8 additions & 0 deletions .gitignore
@@ -0,0 +1,8 @@
hnswlib.egg-info/
build/
dist/
tmp/
python_bindings/tests/__pycache__/
*.pyd
hnswlib.cpython*.so
var/
37 changes: 29 additions & 8 deletions .travis.yml
@@ -1,16 +1,37 @@
language: python

matrix:
jobs:
include:
- python: 3.6
- python: 3.7
- name: Linux Python 3.6
os: linux
python: 3.6

- name: Linux Python 3.7
os: linux
python: 3.7

- name: Windows Python 3.6
os: windows
language: shell # 'language: python' is an error on Travis CI Windows
before_install:
- choco install python --version 3.6.0
- python -m pip install --upgrade pip
- python --version
env: PATH=/c/Python36:/c/Python36/Scripts:$PATH

- name: Windows Python 3.7
os: windows
language: shell # 'language: python' is an error on Travis CI Windows
before_install:
- choco install python --version 3.7.0
- python -m pip install --upgrade pip
- python --version
env: PATH=/c/Python37:/c/Python37/Scripts:$PATH

install:
- |
cd python_bindings
pip install -r requirements.txt
python setup.py install
python -m pip install .

script:
- |
cd python_bindings
python setup.py test
python -m unittest discover --start-directory python_bindings/tests --pattern "*_test*.py"
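The `--pattern "*_test*.py"` argument in the new `script` step is a shell-style glob that `unittest` discovery applies to file names under the start directory. A minimal sketch of which names such a pattern selects, using the standard-library `fnmatch` module; `bindings_test_pickle.py` comes from this PR's commit list, while the other two file names are hypothetical and only illustrate non-matching cases:

```python
from fnmatch import fnmatch

PATTERN = "*_test*.py"

# bindings_test_pickle.py appears in this PR's commit list; the other
# names are hypothetical and only show what the pattern rejects.
candidates = ["bindings_test_pickle.py", "test_bindings.py", "helpers.py"]

# Keep only the file names that the discovery pattern would accept.
matched = [name for name in candidates if fnmatch(name, PATTERN)]
print(matched)
```

Note that a file named with a leading `test_` prefix is not picked up by this pattern, because the glob requires `_test` somewhere in the name.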
2 changes: 2 additions & 0 deletions CMakeLists.txt
@@ -23,4 +23,6 @@ endif()

add_executable(test_updates examples/updates_test.cpp)

add_executable(searchKnnCloserFirst_test examples/searchKnnCloserFirst_test.cpp)

target_link_libraries(main sift_test)
2 changes: 2 additions & 0 deletions MANIFEST.in
@@ -0,0 +1,2 @@
include hnswlib/*.h
include LICENSE
15 changes: 15 additions & 0 deletions Makefile
@@ -0,0 +1,15 @@
pypi: dist
	twine upload dist/*

dist:
	-rm dist/*
	pip install build
	python3 -m build --sdist

test:
	python3 -m unittest discover --start-directory python_bindings/tests --pattern "*_test*.py"

clean:
	rm -rf *.egg-info build dist tmp var tests/__pycache__ hnswlib.cpython*.so

.PHONY: dist
74 changes: 56 additions & 18 deletions README.md
@@ -3,9 +3,12 @@ Header-only C++ HNSW implementation with python bindings. Paper's code for the H

**NEWS:**

* **Thanks to Apoorv Sharma [@apoorv-sharma](https://github.com/apoorv-sharma), hnswlib now supports true element updates (the interface remained the same, but when you the perfromance/memory should not degrade as you update the element embeddinds).**

* **Thanks to Dmitry [@2ooom](https://github.com/2ooom), hnswlib got a boost in performance for vector dimensions that are not mutiple of 4**
* **hnswlib is now 0.5.0. Added support for pickling indices, support for PEP-517 and PEP-518 building, small speedups, bug and documentation fixes. Many thanks to [@dbespalov](https://github.com/dbespalov), [@dyashuni](https://github.com/dyashuni), [@groodt](https://github.com/groodt), [@uestc-lfs](https://github.com/uestc-lfs), [@vinnitu](https://github.com/vinnitu), [@fabiencastan](https://github.com/fabiencastan), [@JinHai-CN](https://github.com/JinHai-CN), [@js1010](https://github.com/js1010)!**

* **Thanks to Apoorv Sharma [@apoorv-sharma](https://github.com/apoorv-sharma), hnswlib now supports true element updates (the interface remained the same, but the performance/memory should not degrade as you update the element embeddings).**

* **Thanks to Dmitry [@2ooom](https://github.com/2ooom), hnswlib got a boost in performance for vector dimensions that are not a multiple of 4**

* **Thanks to Louis Abraham ([@louisabraham](https://github.com/louisabraham)) hnswlib can now be installed via pip!**

@@ -37,7 +40,7 @@ For other spaces use the nmslib library https://github.com/nmslib/nmslib.
#### Short API description
* `hnswlib.Index(space, dim)` creates a non-initialized HNSW index in space `space` with integer dimension `dim`.

Index methods:
`hnswlib.Index` methods:
* `init_index(max_elements, ef_construction = 200, M = 16, random_seed = 100)` initializes the index with no elements.
* `max_elements` defines the maximum number of elements that can be stored in the structure (can be increased/shrunk).
* `ef_construction` defines a construction time/accuracy trade-off (see [ALGO_PARAMS.md](ALGO_PARAMS.md)).
@@ -49,14 +52,14 @@ Index methods:
* `data_labels` specifies the labels for the data. If index already has the elements with the same labels, their features will be updated. Note that update procedure is slower than insertion of a new element, but more memory- and query-efficient.
* Thread-safe with other `add_items` calls, but not with `knn_query`.

* `mark_deleted(data_label)` - marks the element as deleted, so it will be ommited from search results.
* `mark_deleted(data_label)` - marks the element as deleted, so it will be omitted from search results.

* `resize_index(new_size)` - changes the maximum capacity of the index. Not thread safe with `add_items` and `knn_query`.

* `set_ef(ef)` - sets the query time accuracy/speed trade-off, defined by the `ef` parameter (
[ALGO_PARAMS.md](ALGO_PARAMS.md)). Note that the parameter is currently not saved along with the index, so you need to set it manually after loading.

* `knn_query(data, k = 1, num_threads = -1)` make a batch query for `k` closests elements for each element of the
* `knn_query(data, k = 1, num_threads = -1)` make a batch query for `k` closest elements for each element of the
* `data` (shape:`N*dim`). Returns a numpy array of (shape:`N*k`).
* `num_threads` sets the number of cpu threads to use (-1 means use default).
* Thread-safe with other `knn_query` calls, but not with `add_items`.
@@ -76,14 +79,34 @@ Index methods:

* `get_current_count()` - returns the current number of elements stored in the index



Read-only properties of `hnswlib.Index` class:

* `space` - name of the space (can be one of "l2", "ip", or "cosine").

* `dim` - dimensionality of the space.

* `M` - parameter that defines the maximum number of outgoing connections in the graph.

* `ef_construction` - parameter that controls speed/accuracy trade-off during the index construction.

* `max_elements` - current capacity of the index. Equivalent to `p.get_max_elements()`.

* `element_count` - number of items in the index. Equivalent to `p.get_current_count()`.

Properties of `hnswlib.Index` that support reading and writing:

* `ef` - parameter controlling query time/accuracy trade-off.

* `num_threads` - default number of threads to use in `add_items` or `knn_query`. Note that calling `p.set_num_threads(3)` is equivalent to `p.num_threads=3`.




#### Python bindings examples
```python
import hnswlib
import numpy as np
import pickle

dim = 128
num_elements = 10000
@@ -95,7 +118,7 @@ data_labels = np.arange(num_elements)
# Declaring index
p = hnswlib.Index(space = 'l2', dim = dim) # possible options are l2, cosine or ip

# Initing index - the maximum number of elements should be known beforehand
# Initializing index - the maximum number of elements should be known beforehand
p.init_index(max_elements = num_elements, ef_construction = 200, M = 16)

# Element insertion (can be called several times):
@@ -106,6 +129,18 @@ p.set_ef(50) # ef should always be > k

# Query dataset, k - number of closest elements (returns 2 numpy arrays)
labels, distances = p.knn_query(data, k = 1)

# Index objects support pickling
# WARNING: serialization via pickle.dumps(p) or p.__getstate__() is NOT thread-safe with p.add_items method!
# Note: ef parameter is included in serialization; random number generator is initialized with random_seed on Index load
p_copy = pickle.loads(pickle.dumps(p)) # creates a copy of index p using pickle round-trip

### Index parameters are exposed as class properties:
print(f"Parameters passed to constructor: space={p_copy.space}, dim={p_copy.dim}")
print(f"Index construction: M={p_copy.M}, ef_construction={p_copy.ef_construction}")
print(f"Index size is {p_copy.element_count} and index capacity is {p_copy.max_elements}")
print(f"Search speed/quality trade-off parameter: ef={p_copy.ef}")

```
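The pickling support shown above relies on Python's `__getstate__` / `__setstate__` protocol. The following is a minimal sketch of that protocol on a toy class, assuming nothing about hnswlib's internals: `ToyIndex`, `add_item`, and the `_lock` attribute are hypothetical names used only for illustration, and the real bindings serialize their state in C++.

```python
import pickle
import threading


class ToyIndex:
    """Hypothetical stand-in for an index object; not hnswlib's implementation."""

    def __init__(self, space, dim):
        self.space = space
        self.dim = dim
        self.items = {}                 # label -> vector
        self._lock = threading.Lock()   # runtime-only state; locks cannot be pickled

    def add_item(self, label, vector):
        with self._lock:
            self.items[label] = list(vector)

    def __getstate__(self):
        # Export only what is needed to rebuild the object, as a plain dict.
        return {"space": self.space, "dim": self.dim, "items": self.items}

    def __setstate__(self, state):
        # Rebuild from the dict and recreate the runtime-only lock.
        self.__init__(state["space"], state["dim"])
        self.items = dict(state["items"])


p = ToyIndex("l2", 3)
p.add_item(0, [1.0, 2.0, 3.0])

# Round-trip through pickle yields an independent copy with the same contents.
p_copy = pickle.loads(pickle.dumps(p))
print(p_copy.space, p_copy.dim, p_copy.items)
```

As in the README warning, a snapshot taken this way is only consistent if no other thread is adding items while `__getstate__` runs.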

An example with updates after serialization/deserialization:
@@ -126,7 +161,7 @@ data2 = data[num_elements // 2:]
# Declaring index
p = hnswlib.Index(space='l2', dim=dim) # possible options are l2, cosine or ip

# Initing index
# Initializing index
# max_elements - the maximum number of elements (capacity). Will throw an exception if exceeded
# during insertion of an element.
# The capacity can be increased by saving/loading the index, see below.
@@ -160,7 +195,7 @@ print("Saving index to '%s'" % index_path)
p.save_index("first_half.bin")
del p

# Reiniting, loading the index
# Re-initializing, loading the index
p = hnswlib.Index(space='l2', dim=dim) # the space can be changed - keeps the data, alters the distance function.

print("\nLoading index from 'first_half.bin'\n")
@@ -181,17 +216,17 @@ print("Recall for two batches:", np.mean(labels.reshape(-1) == np.arange(len(dat
You can install from sources:
```bash
apt-get install -y python-setuptools python-pip
pip3 install pybind11 numpy setuptools
cd python_bindings
python3 setup.py install
git clone https://github.com/nmslib/hnswlib.git
cd hnswlib
pip install .
```

or you can install via pip:
`pip install hnswlib`

### Other implementations
* Non-metric space library (nmslib) - main library(python, C++), supports exotic distances: https://github.com/nmslib/nmslib
* Faiss libary by facebook, uses own HNSW implementation for coarse quantization (python, C++):
* Faiss library by facebook, uses own HNSW implementation for coarse quantization (python, C++):
https://github.com/facebookresearch/faiss
* Code for the paper
["Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors"](https://arxiv.org/abs/1802.02422)
@@ -203,21 +238,24 @@ https://github.com/dbaranchuk/ivf-hnsw
* Python implementation (as a part of the clustering code by Matteo Dell'Amico): https://github.com/matteodellamico/flexible-clustering
* Java implementation: https://github.com/jelmerk/hnswlib
* Java bindings using Java Native Access: https://github.com/stepstone-tech/hnswlib-jna
* .Net implementation: https://github.com/microsoft/HNSW.Net
* .Net implementation: https://github.com/microsoft/HNSW.Net
* CUDA implementation: https://github.com/js1010/cuhnsw

### Contributing to the repository
Contributions are highly welcome!

Please make pull requests against the `develop` branch.

### 200M SIFT test reproduction
To download and extract the bigann dataset:
To download and extract the bigann dataset (from root directory):
```bash
python3 download_bigann.py
```
To compile:
```bash
cmake .
mkdir build
cd build
cmake ..
make all
```

@@ -226,7 +264,7 @@ To run the test on 200M SIFT subset:
./main
```

The size of the bigann subset (in millions) is controlled by the variable **subset_size_milllions** hardcoded in **sift_1b.cpp**.
The size of the BigANN subset (in millions) is controlled by the variable **subset_size_millions** hardcoded in **sift_1b.cpp**.

### Updates test
To generate testing data (from root directory):
84 changes: 84 additions & 0 deletions examples/searchKnnCloserFirst_test.cpp
@@ -0,0 +1,84 @@
// This is a test file for testing the interface
//  >>> virtual std::vector<std::pair<dist_t, labeltype>>
//  >>>    searchKnnCloserFirst(const void* query_data, size_t k) const;
// of class AlgorithmInterface

#include "../hnswlib/hnswlib.h"

#include <assert.h>

#include <vector>
#include <iostream>

namespace
{

using idx_t = hnswlib::labeltype;

void test() {
    int d = 4;
    idx_t n = 100;
    idx_t nq = 10;
    size_t k = 10;

    std::vector<float> data(n * d);
    std::vector<float> query(nq * d);

    std::mt19937 rng;
    rng.seed(47);
    std::uniform_real_distribution<> distrib;

    for (idx_t i = 0; i < n * d; ++i) {
        data[i] = distrib(rng);
    }
    for (idx_t i = 0; i < nq * d; ++i) {
        query[i] = distrib(rng);
    }


    hnswlib::L2Space space(d);
    hnswlib::AlgorithmInterface<float>* alg_brute = new hnswlib::BruteforceSearch<float>(&space, 2 * n);
    hnswlib::AlgorithmInterface<float>* alg_hnsw = new hnswlib::HierarchicalNSW<float>(&space, 2 * n);

    for (size_t i = 0; i < n; ++i) {
        alg_brute->addPoint(data.data() + d * i, i);
        alg_hnsw->addPoint(data.data() + d * i, i);
    }

    // test searchKnnCloserFirst of BruteforceSearch
    for (size_t j = 0; j < nq; ++j) {
        const void* p = query.data() + j * d;
        auto gd = alg_brute->searchKnn(p, k);
        auto res = alg_brute->searchKnnCloserFirst(p, k);
        assert(gd.size() == res.size());
        size_t t = gd.size();
        while (!gd.empty()) {
            assert(gd.top() == res[--t]);
            gd.pop();
        }
    }
    for (size_t j = 0; j < nq; ++j) {
        const void* p = query.data() + j * d;
        auto gd = alg_hnsw->searchKnn(p, k);
        auto res = alg_hnsw->searchKnnCloserFirst(p, k);
        assert(gd.size() == res.size());
        size_t t = gd.size();
        while (!gd.empty()) {
            assert(gd.top() == res[--t]);
            gd.pop();
        }
    }

    delete alg_brute;
    delete alg_hnsw;
}

}  // namespace

int main() {
    std::cout << "Testing ..." << std::endl;
    test();
    std::cout << "Test ok" << std::endl;

    return 0;
}
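The test above checks that the top of the queue returned by `searchKnn` always equals the last remaining element of the vector returned by `searchKnnCloserFirst`; in other words, the queue yields the farthest of the `k` results first, and the vector holds the same pairs nearest first. A rough Python stand-in for that relationship, assuming squared L2 distance and a brute-force scan; the function names `search_knn` and `search_knn_closer_first` are hypothetical analogues, not part of any hnswlib API:

```python
import heapq
import random


def search_knn(points, query, k):
    """Return a heap of the k nearest points whose top is the farthest of them.

    heapq is a min-heap, so distances are stored negated as (-dist, label).
    """
    heap = []
    for label, point in enumerate(points):
        dist = sum((a - b) ** 2 for a, b in zip(point, query))
        heapq.heappush(heap, (-dist, label))
        if len(heap) > k:
            heapq.heappop(heap)  # discard the current farthest candidate
    return heap


def search_knn_closer_first(points, query, k):
    """Return the same results as a list of (dist, label), nearest first."""
    heap = search_knn(points, query, k)
    result = [None] * len(heap)
    # Pop farthest first and fill the list from the back.
    for i in range(len(heap) - 1, -1, -1):
        neg_dist, label = heapq.heappop(heap)
        result[i] = (-neg_dist, label)
    return result


rng = random.Random(47)
d, n, k = 4, 100, 10
data = [[rng.random() for _ in range(d)] for _ in range(n)]
query = [rng.random() for _ in range(d)]

res = search_knn_closer_first(data, query, k)
dists = [dist for dist, _ in res]
print(dists == sorted(dists))
```

Filling the output from the back avoids a separate sort, since popping the heap already yields the results in farthest-to-nearest order.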
18 changes: 0 additions & 18 deletions hnswlib/bruteforce.h
@@ -111,24 +111,6 @@ namespace hnswlib {
return topResults;
};

template <typename Comp>
std::vector<std::pair<dist_t, labeltype>>
searchKnn(const void* query_data, size_t k, Comp comp) {
std::vector<std::pair<dist_t, labeltype>> result;
if (cur_element_count == 0) return result;

auto ret = searchKnn(query_data, k);

while (!ret.empty()) {
result.push_back(ret.top());
ret.pop();
}

std::sort(result.begin(), result.end(), comp);

return result;
}

void saveIndex(const std::string &location) {
std::ofstream output(location, std::ios::binary);
std::streampos position;