Handling missing values #90
Hi @maxstrobel, Yes, you can do that using the inner product distance (which just seems natural for classification). If you zero all dimensions in the query that are missing (or unknown), they effectively will not be used during the search. But if you zero too many dimensions of the query, the accuracy/speed ratio of the search may degrade (since the index was built using a different metric than it is searched with), so do not forget to increase the …
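As a rough illustration of this zero-out approach (not taken from the thread), here is a minimal sketch using the hnswlib Python bindings with the inner product space; the toy data, parameter values, and the list of missing dimensions are made up for illustration.
import numpy as np
import hnswlib

dim = 16
data = np.random.rand(1000, dim).astype(np.float32)  # toy base vectors (assumed)

# Build the index with the inner product space, as suggested above
index = hnswlib.Index(space='ip', dim=dim)
index.init_index(max_elements=data.shape[0], ef_construction=200, M=16)
index.add_items(data)
index.set_ef(50)

# Zero out the missing (unknown) dimensions of the query so that they
# contribute nothing to the inner product
query = np.random.rand(dim).astype(np.float32)
missing = [3, 7]                  # hypothetical indices of unknown values
query[missing] = 0.0
labels, distances = index.knn_query(query, k=10)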
Hi @yurymalkov, sorry, I was a little bit imprecise. I am currently using the 'Squared L2' distance. Is there some workaround to handle missing values here? I tried your suggestion and used the inner product as the distance metric, but sadly the accuracy of my application drops to a fraction of what it is with the L2 distance. Thanks,
@maxstrobel I doubt it. IMHO this requires keeping an inverse of the missing-value mask together with each vector. Then, when comparing two vectors, you would have to mask out (zero out) all the differences where a dimension in x or y is missing.
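A rough NumPy sketch of this per-vector masking idea (not hnswlib code; the function name, the boolean masks, and the toy values are invented for illustration):
import numpy as np

def masked_sq_l2(x, y, present_x, present_y):
    # Squared L2 over the dimensions present in BOTH vectors; dimensions
    # missing in either x or y are excluded from the sum.
    both = present_x & present_y
    diff = x[both] - y[both]
    return float(np.dot(diff, diff))

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.5, 0.0, 3.5, 4.0])
px = np.array([True, False, True, True])   # x is missing dimension 1
py = np.array([True, True, True, False])   # y is missing dimension 3
print(masked_sq_l2(x, y, px, py))          # uses dimensions 0 and 2 only -> 0.5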
@maxstrobel This can be done only via the C++ interface. As @searchivarius has mentioned, you would need to use query-specific masks (there are several ways to do that, e.g. change the distance function before every query, or modify the distance to contain masks).
@yurymalkov @searchivarius Thanks for your kind replies! @yurymalkov I thought about your second suggestion, modifying the distance to contain a mask.
I could mask missing features with a diagonal weight matrix: start from W = I (I := identity matrix) and zero out the diagonal entries that correspond to missing features. Do you have an estimate of how this would affect query performance? All indexed samples would have no missing features, but at test time queries with missing features would sometimes occur.
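For concreteness, a minimal sketch of this weighted squared L2, i.e. (x - y)^T W (x - y) with W = diag(w) and w_i in {0, 1}; the helper names and the NaN encoding of missing values are assumptions for illustration, not hnswlib API.
import numpy as np

def weighted_sq_l2(x, y, w):
    # (x - y)^T diag(w) (x - y): w_i = 0 disables dimension i, w_i = 1 keeps it
    diff = x - y
    return float(np.dot(w * diff, diff))

def weights_from_query(q):
    # Build the 0/1 weight vector from the query's missing entries
    # (NaN is just one possible encoding of "missing")
    w = np.ones_like(q)
    w[np.isnan(q)] = 0.0
    return w, np.nan_to_num(q)  # replace NaNs so the arithmetic stays finite

q = np.array([0.2, np.nan, 0.7])
w, q_clean = weights_from_query(q)
y = np.array([0.1, 0.9, 0.5])
print(weighted_sq_l2(q_clean, y, w))   # dimension 1 is ignored -> 0.05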
@maxstrobel in my experience, this is a valid approach.
@yurymalkov Nice to hear that this procedure may work. If I find time during the next week, I will give it a try.
Would that be sufficient, or am I missing some steps?
@maxstrobel Yes, this should be sufficient.
@yurymalkov @searchivarius Summing up:
I also ran some small benchmarks to check the behaviour. As the baseline I used sklearn's exact kNN, fitted on the subset of available features. The adapted hnswlib kNN is fitted only once on the whole dataset (i.e. all features).
As you can see, your algorithm works really well with the feature weighting as an extension to handle missing features (I hope I made no silly mistakes and the results are correct...). Maybe it would make sense to incorporate this in the master branch? What do you think? Feel free to ask questions if something is unclear.
Wow @maxstrobel, looks like this warrants an implementation of a special space.
Hi @maxstrobel,
@yurymalkov I will try to code it in a clean way in the next days. The accuracy in the plots is calculated as follows:
I hope you can follow my explanations. TL;DR: it might be useful to look at the code snippet; from there it should be clear what I did.
# Imports added for completeness; X_train, y_train, X_test, dim, k, ef, M
# are assumed to be defined earlier (e.g. in the surrounding notebook)
import numpy as np
import hnswlib
from sklearn.neighbors import KNeighborsClassifier

# Approx kNN - fitted with the whole feature space
approx_knn = hnswlib.Index(space='l2', dim=dim)
approx_knn.init_index(max_elements=len(X_train), ef_construction=ef, M=M)
approx_knn.add_items(X_train, num_threads=-1)
approx_knn.set_ef(ef)
accuracy_weighting, accuracy_no_weighting = [], []
for d in range(dim):
    # Randomly disable features
missing_features = np.random.permutation(dim)[:d]
available_features = np.setdiff1d(np.arange(dim), missing_features)
###########
    # Exact kNN - fitted & queried with the current subset of features
exact_knn = KNeighborsClassifier(n_neighbors=k).fit(X_train[:, available_features], y_train)
exact_query = exact_knn.kneighbors(X_test[:, available_features])[1]
###########
# HNSWLIB
weights = np.ones(dim)
weights[missing_features] = 0 # disable features via 0-weighting
# Manipulate test data in order to use it with default HNSWLIB & no weighting
X_test_ = X_test.copy()
X_test_[:, missing_features] = 0
# Query data with imputed values (only zeros...) or disabled features
approx_query = approx_knn.knn_query(X_test_, k=k)[0]
approx_query_weight = approx_knn.weighted_knn_query(X_test_, weights, k=k)[0]
accuracy_no_weighting.append(np.mean(exact_query == approx_query))
    accuracy_weighting.append(np.mean(exact_query == approx_query_weight))
@maxstrobel Thank you for a comprehensive explanation!
template<typename MTYPE>
using DISTFUNC = MTYPE(*)(const void *, const void *, const void *);
template<typename MTYPE>
using WDISTFUNC = MTYPE(*)(const void *, const void *, const void *, const void *);
// Standard distance
std::priority_queue<std::pair<dist_t, labeltype >> searchKnn(const void *query_data, size_t k)
// Weighted distance
std::priority_queue<std::pair<dist_t, labeltype >> searchWeightedKnn(const void *query_data, const void *weights, size_t k)
// Standard distance
std::priority_queue<std::pair<dist_t, tableint>, std::vector<std::pair<dist_t, tableint>>, CompareByFirst>
searchBaseLayerST(tableint ep_id, const void *data_point, size_t ef)
// Weighted distance
std::priority_queue<std::pair<dist_t, tableint>, std::vector<std::pair<dist_t, tableint>>, CompareByFirst>
searchBaseLayerSTWeighted(tableint ep_id, const void *data_point, const void *weights, size_t ef)
I was thinking about it in too complicated a way; now I have a simple solution... A pull request with some first code for review comes soon. It seems to work now and the code looks good. Now I have to touch the processor-dependent code.
@yurymalkov @searchivarius The build passes and the Python tests are also successful. I hope the code is acceptable. Feel free to ask if you have any questions.
Hi @maxstrobel, Many thanks for the update! The pull request should be against the develop branch. I've looked through the commit. One thing that grabbed my attention was that the base search algorithm was modified (e.g. the additional weights argument). We can test the speed on benchmarks (this would take time, as I have to re-establish the internal benchmark) and decide what to do next. Another option is to make a new space specific to query time, which uses the mask via a different …
Hi @yurymalkov, thanks for the response! Regarding your performance concerns, I fully agree with you that only performance tests show the truth. However, I think the additional argument should not drastically impact the timing. The only thing that has really changed, compared to the current state, is the additional weights parameter.
The encoding of the weight vector at the end of the query data seems to be a reasonable approach, but I think that is not too easy to implement. I would implement the weightings also for InnerSpace (I think that should work equally well); then the two spaces are balanced and generic again. However, even if it should not work, it would still be as generic as before, only with one more parameter that is used only partially. I hope I got your understanding of the generic modular structure right.
So I would really propose to do the performance tests, since this would keep the code clean and simple. I will open a pull request against the develop branch, and then you can decide in which direction we continue.
@maxstrobel Thanks!
Hi @maxstrobel! I found that there is about a 2-3% slowdown on low-dimensional data (0.5M, d=4). Not much, but I would like to avoid it. I can help with the code by moving the weights logic inside the space implementations, thus keeping the general part intact. Shouldn't be a problem.
Hello @yurymalkov! Sounds good so far! The slowdown is only perceptible on low-dimensional data, isn't it? Then I can use this state for my own tests until we have a clean version. However, I totally agree that we should tackle the slowdown. I am not sure I understand 100% correctly where you would place the weights logic. Could you please explain that in a little more detail?
I have a simple notebook where I tested the influence of the weights. However, these are only toy examples... I ran a test on a bigger dataset with 100,000 train samples and 10,000 test queries. Note (probably again): the recall is calculated with respect to exact kNN queries on the reduced feature space, not against an exact query on the whole feature space. If we compare the queries against an exact query on the full feature space, we get the following results: as expected, a strong drop in accuracy compared to the original query, but still way better than mean imputation...
@maxstrobel Thank you for the toy example! It should be enough to make simple tests. The logic for the weights can be encapsulated inside the distance function. To change the spaces between different user queries (which potentially can be done with and without weights in parallel), the space should be passed throughout the query processing (e.g. up to …
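As a loose Python analogy of encapsulating the weights inside the distance function (the real change would live in the C++ space implementation; this sketch only illustrates the idea of swapping in a weight-capturing distance per query, and all names here are made up):
import numpy as np

def make_weighted_l2(w):
    # Returns a distance with the weights baked in, so the caller still sees a
    # plain dist(x, y) and only the distance function is swapped before a query
    def dist(x, y):
        diff = x - y
        return float(np.dot(w * diff, diff))
    return dist

dist_for_this_query = make_weighted_l2(np.array([1.0, 0.0, 1.0]))  # dim 1 masked
print(dist_for_this_query(np.array([0.2, 0.0, 0.7]),
                          np.array([0.1, 0.9, 0.5])))              # -> 0.05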
Hi @yurymalkov, … Best,
Hi @maxstrobel,
Hi @yurymalkov,
So we would have to introduce a setter function (…). Does this procedure represent your thoughts, or am I completely off? Best,
Hi @maxstrobel, Sorry for the late reply. Yes, it seems correct in general. I would prefer to concentrate all of the weighted-distance logic inside the Python bindings, without touching the C++ module much. Does that sound OK?
Yes, that sounds fine. I have started to make the changes. When I am done, I'll update the pull request. Best,
@maxstrobel
@yurymalkov
I think the second problem can be solved easily. The first one seems to be more difficult... What do you think about it?
Hi @maxstrobel, One can also redirect this problem to the user (e.g. this can be done in Python), but this seems like a dirtier solution.
Hi @yurymalkov, I had a hard time figuring out how to reorganize the memory and decided to stay with my fork, because the few percent of performance do not matter for my application. However, I want to thank you for your support and patience while working through this issue. Thanks,
Hi @maxstrobel,
Hi @yurymalkov, Has the weighted_knn_query function been added to hnswlib already? I would really love to see this function. Unfortunately, I'm not good enough with C to fix the challenges that @maxstrobel was facing. I'm pretty sure this functionality would benefit a lot of people. Thanks,
Hi @parashardhapola,
Hi,
first of all, thanks for this great library. It works great.
My question refers to missing values in feature vectors. Say, we have used a set of N d-dimensional vectors to create the classifier.
Is it possible to query neighbours if our query vector has fewer than d dimensions, e.g. missing values in one or more dimensions?
Thanks in advance & Best,
Max