nmslib · 2ooom · Jul 30, 2019 · Jul 24, 2019 · Jul 24, 2019 · Jul 31, 2019
diff --git a/.travis.yml b/.travis.yml
@@ -3,6 +3,7 @@ language: python
 matrix:
   include:
     - python: 3.6
+    - python: 3.7
 install:
   - |
     cd python_bindings
@@ -12,4 +13,4 @@ install:
 script:
   - |
     cd python_bindings
-    python setup.py test
+    python setup.py test
diff --git a/ALGO_PARAMS.md b/ALGO_PARAMS.md
@@ -20,10 +20,10 @@ The range ```M```=12-48 is ok for the most of the use cases. When ```M``` is cha
 Nonetheless, ef and ef_construction parameters can be roughly estimated by assuming that ```M```*```ef_{construction}``` is 
 a constant.
 
-* ```ef_constrution``` - the parameter has the same meaning as ```ef```, but controls the index_time/index_accuracy. Bigger 
+* ```ef_construction``` - the parameter has the same meaning as ```ef```, but controls the index_time/index_accuracy. Bigger 
 ef_construction leads to longer construction, but better index quality. At some point, increasing ef_construction does
 not improve the quality of the index. One way to check if the selection of ef_construction was ok is to measure a recall 
-for M nearest neighbor search when ```ef``` =```ef_constuction```: if the recall is lower than 0.9, than there is room 
+for M nearest neighbor search when ```ef``` =```ef_construction```: if the recall is lower than 0.9, then there is room 
 for improvement.
 * ```num_elements``` - defines the maximum number of elements in the index. The index can be extened by saving/loading(load_index
 function has a parameter which defines the new maximum number of elements).
diff --git a/README.md b/README.md
@@ -1,6 +1,10 @@
 # Hnswlib - fast approximate nearest neighbor search
 Header-only C++ HNSW implementation with python bindings. Paper code for the HNSW 200M SIFT experiment
 
+**NEWS:**
+
+**Thanks to Louis Abraham ([@louisabraham](https://github.com/louisabraham)), hnswlib can now be installed via pip!**
+
 Highlights:
 1) Lightweight, header-only, no dependencies other than C++ 11.
 2) Interfaces for C++, python and R (https://github.com/jlmelville/rcpphnsw).
@@ -26,7 +30,7 @@ Note that inner product is not an actual metric. An element can be closer to som
 
 For other spaces use the nmslib library https://github.com/nmslib/nmslib. 
 
-#### short API description
+#### Short API description
 * `hnswlib.Index(space, dim)` creates a non-initialized index an HNSW in space `space` with integer dimension `dim`.
 
 Index methods:
@@ -45,7 +49,7 @@ Index methods:
 * `resize_index(new_size)` - changes the maximum capacity of the index. Not thread safe with `add_items` and `knn_query`.
 
 * `set_ef(ef)` - sets the query time accuracy/speed trade-off, defined by the `ef` parameter (
-[ALGO_PARAMS.md](ALGO_PARAMS.md)).
+[ALGO_PARAMS.md](ALGO_PARAMS.md)). Note that the parameter is currently not saved along with the index, so you need to set it manually after loading.
 
 * `knn_query(data, k = 1, num_threads = -1)` make a batch query for `k` closests elements for each element of the 
     * `data` (shape:`N*dim`). Returns a numpy array of (shape:`N*k`).
@@ -59,11 +63,13 @@ Index methods:
 
 * `set_num_threads(num_threads)` set the default number of cpu threads used during data insertion/querying.
 
-* `get_items(ids)` - returns a numpy array (shape:`N*dim`) of vectors that have integer identifiers specified in `ids` numpy vector (shape:`N`).
+* `get_items(ids)` - returns a numpy array (shape:`N*dim`) of vectors that have integer identifiers specified in `ids` numpy vector (shape:`N`). Note that for cosine similarity it currently returns **normalized** vectors.
 
 * `get_ids_list()`  - returns a list of all elements' ids.
 
+* `get_max_elements()` - returns the current capacity of the index.
 
+* `get_current_count()` - returns the current number of elements stored in the index.
 
 
 
@@ -166,13 +172,18 @@ print("Recall for two batches:", np.mean(labels.reshape(-1) == np.arange(len(dat
 ```
 
 ### Bindings installation
+
+You can install from sources:
 ```bash
 apt-get install -y python-setuptools python-pip
 pip3 install pybind11 numpy setuptools
 cd python_bindings
 python3 setup.py install
 ```
 
+or you can install via pip:
+`pip install hnswlib`
+
 ### Other implementations
 * Non-metric space library (nmslib) - main library(python, C++), supports exotic distances: https://github.com/nmslib/nmslib
 * Faiss libary by facebook, uses own HNSW  implementation for coarse quantization (python, C++):
@@ -186,9 +197,13 @@ https://github.com/dbaranchuk/ivf-hnsw
 * Go implementation: https://github.com/Bithack/go-hnsw
 * Python implementation (as a part of the clustering code by by Matteo Dell'Amico): https://github.com/matteodellamico/flexible-clustering
 * Java implementation: https://github.com/jelmerk/hnswlib
+* Java bindings using Java Native Access: https://github.com/stepstone-tech/hnswlib-jna
 * .Net implementation:  https://github.com/microsoft/HNSW.Net
 
+### Contributing to the repository
+Contributions are highly welcome!
 
+Please make pull requests against the `develop` branch.
 
 ### 200M SIFT test reproduction 
 To download and extract the bigann dataset:

diff --git a/examples/example.py b/examples/example.py
@@ -45,7 +45,7 @@
 # Serializing and deleting the index:
 index_path='first_half.bin'
 print("Saving index to '%s'" % index_path)
-p.save_index("first_half.bin")
+p.save_index(index_path)
 del p
 
 # Reiniting, loading the index

diff --git a/hnswlib/bruteforce.h b/hnswlib/bruteforce.h
@@ -1,6 +1,8 @@
 #pragma once
 #include <unordered_map>
 #include <fstream>
+#include <mutex>
+#include <algorithm>
 
 namespace hnswlib {
     template<typename dist_t>
@@ -20,6 +22,8 @@ namespace hnswlib {
             dist_func_param_ = s->get_dist_func_param();
             size_per_element_ = data_size_ + sizeof(labeltype);
             data_ = (char *) malloc(maxElements * size_per_element_);
+            if (data_ == nullptr)
+                throw std::runtime_error("Not enough memory: BruteforceSearch failed to allocate data");
             cur_element_count = 0;
         }
 
@@ -35,22 +39,37 @@ namespace hnswlib {
         size_t data_size_;
         DISTFUNC <dist_t> fstdistfunc_;
         void *dist_func_param_;
+        std::mutex index_lock;
 
         std::unordered_map<labeltype,size_t > dict_external_to_internal;
 
-        void addPoint(void *datapoint, labeltype label) {
-            if(dict_external_to_internal.count(label))
-                throw std::runtime_error("Ids have to be unique");
+        void addPoint(const void *datapoint, labeltype label) {
+
+            size_t idx;
+            {
+                std::unique_lock<std::mutex> lock(index_lock);
+
+
+
+                auto search=dict_external_to_internal.find(label);
+                if (search != dict_external_to_internal.end()) {
+                    idx=search->second;
+                }
+                else{
+                    if (cur_element_count >= maxelements_) {
+                        throw std::runtime_error("The number of elements exceeds the specified limit\n");
+                    }
+                    idx=cur_element_count;
+                    dict_external_to_internal[label] = idx;
+                    cur_element_count++;
+                }
+            }
+            memcpy(data_ + size_per_element_ * idx + data_size_, &label, sizeof(labeltype));
+            memcpy(data_ + size_per_element_ * idx, datapoint, data_size_);
+
 
 
-            if (cur_element_count >= maxelements_) {
-                throw std::runtime_error("The number of elements exceeds the specified limit\n");
-            };
-            memcpy(data_ + size_per_element_ * cur_element_count + data_size_, &label, sizeof(labeltype));
-            memcpy(data_ + size_per_element_ * cur_element_count, datapoint, data_size_);
-            dict_external_to_internal[label]=cur_element_count;
 
-            cur_element_count++;
         };
 
         void removePoint(labeltype cur_external) {
@@ -68,8 +87,10 @@ namespace hnswlib {
         }
 
 
-        std::priority_queue<std::pair<dist_t, labeltype >> searchKnn(const void *query_data, size_t k) const {
+        std::priority_queue<std::pair<dist_t, labeltype >>
+        searchKnn(const void *query_data, size_t k) const {
             std::priority_queue<std::pair<dist_t, labeltype >> topResults;
+            if (cur_element_count == 0) return topResults;
             for (int i = 0; i < k; i++) {
                 dist_t dist = fstdistfunc_(query_data, data_ + size_per_element_ * i, dist_func_param_);
                 topResults.push(std::pair<dist_t, labeltype>(dist, *((labeltype *) (data_ + size_per_element_ * i +
@@ -90,6 +111,24 @@ namespace hnswlib {
             return topResults;
         };
 
+        template <typename Comp>
+        std::vector<std::pair<dist_t, labeltype>>
+        searchKnn(const void* query_data, size_t k, Comp comp) {
+            std::vector<std::pair<dist_t, labeltype>> result;
+            if (cur_element_count == 0) return result;
+
+            auto ret = searchKnn(query_data, k);
+
+            while (!ret.empty()) {
+                result.push_back(ret.top());
+                ret.pop();
+            }
+
+            std::sort(result.begin(), result.end(), comp);
+
+            return result;
+        }
+
         void saveIndex(const std::string &location) {
             std::ofstream output(location, std::ios::binary);
             std::streampos position;
@@ -118,12 +157,13 @@ namespace hnswlib {
             dist_func_param_ = s->get_dist_func_param();
             size_per_element_ = data_size_ + sizeof(labeltype);
             data_ = (char *) malloc(maxelements_ * size_per_element_);
+            if (data_ == nullptr)
+                throw std::runtime_error("Not enough memory: loadIndex failed to allocate data");
 
             input.read(data_, maxelements_ * size_per_element_);
 
             input.close();
 
-            return;
         }
 
     };
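
The comparator-taking `searchKnn` overload added in `bruteforce.h` drains the priority queue returned by the existing `searchKnn` into a vector and then sorts it with the caller's comparator. A minimal standalone sketch of that drain-and-sort step follows. It does not include hnswlib: `dist_t`, `labeltype`, and `Result` are local stand-in aliases, and `drain_sorted` is a hypothetical helper name used only for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Local stand-ins for hnswlib's dist_t and labeltype (assumed for this sketch).
using dist_t = float;
using labeltype = std::size_t;
using Result = std::pair<dist_t, labeltype>;

// Hypothetical helper: empty a max-heap of (distance, label) pairs into a
// vector, then order the vector with a caller-supplied comparator.
// The heap is taken by value, so the caller's queue is left untouched.
template <typename Comp>
std::vector<Result> drain_sorted(std::priority_queue<Result> heap, Comp comp) {
    std::vector<Result> result;
    result.reserve(heap.size());
    while (!heap.empty()) {
        result.push_back(heap.top());  // top() is the largest remaining pair
        heap.pop();
    }
    std::sort(result.begin(), result.end(), comp);
    return result;
}
```

Passing a comparator such as `[](const Result& a, const Result& b) { return a.first < b.first; }` yields results ordered from nearest to farthest, which is probably the most common use of the new overload.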