
Handling missing values #90


Closed
maxstrobel opened this issue Jan 7, 2019 · 33 comments


@maxstrobel

Hi,

at first thanks for this great library. It works great.

My question refers to missing values in feature vectors. Say, we have used a set of N d-dimensional vectors to create the classifier.
Is it possible to query neighbours if our query vector has less than d dimension, e.g. a missing values in one or more dimensions?

Thanks in advance & Best,
Max

@yurymalkov
Member

Hi @maxstrobel,
Thank you for the kind words!

Yes, you can do that using the inner product distance (which seems natural for classification). If you zero out all the missing (or unknown) dimensions in the query, they effectively will not be used during the search.

However, if you zero out too many dimensions of the query, the search accuracy/speed tradeoff may degrade (since the index was built using a different metric than the one used for searching), so do not forget to increase the ef parameter.
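The effect of zeroing can be seen in a small NumPy sketch (illustrative only, not hnswlib code): with the inner product, a zeroed query dimension drops out of the score entirely, while with squared L2 the stored value still contributes to the distance.

```python
import numpy as np

# A stored vector and an identical query whose dimension 1 is missing.
stored = np.array([0.3, 0.8, 0.5])
query = np.array([0.3, 0.8, 0.5])

q_zeroed = query.copy()
q_zeroed[1] = 0.0  # zero out the missing dimension

# Inner product: the zeroed dimension contributes nothing,
# so only the remaining dimensions are compared.
full_ip = np.dot(stored, query)
masked_ip = np.dot(stored, q_zeroed)
assert np.isclose(full_ip - masked_ip, stored[1] * query[1])

# Squared L2: the zeroed dimension still contributes stored[1]**2,
# so missing values distort the distance instead of dropping out.
masked_l2 = np.sum((stored - q_zeroed) ** 2)
assert np.isclose(masked_l2, stored[1] ** 2)
```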

@maxstrobel
Author

Hi @yurymalkov,

sorry, I was a little imprecise. I am currently using the 'Squared L2' distance. Is there a workaround to handle missing values there?

I tried your suggestion and used the inner product as the distance metric, but sadly the accuracy of my application drops to a fraction of what it is with the L2 distance.

Thanks,
Max

@searchivarius
Member

@maxstrobel I doubt it. IMHO this requires keeping the inverse of a missing-value mask together with each vector. Then, when comparing two vectors, you would have to mask out (zero out) all the differences where a dimension in x or y is missing.
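That masking idea can be sketched in NumPy (illustrative only; the function name is made up and hnswlib does not support this out of the box): encode missing values as NaN, derive the validity mask from both vectors, and zero out the masked differences.

```python
import numpy as np

def masked_l2(x, y):
    """Squared L2 that ignores dimensions missing (NaN) in either x or y."""
    valid = ~(np.isnan(x) | np.isnan(y))  # inverse of the missing-value mask
    diff = np.where(valid, x - y, 0.0)    # zero out differences at missing dims
    return np.sum(diff ** 2)

x = np.array([1.0, np.nan, 3.0])
y = np.array([1.0, 5.0, 1.0])
assert masked_l2(x, y) == 4.0  # only the last dimension differs: (3-1)^2
```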

@yurymalkov
Member

@maxstrobel This can be done only via the C++ interface. As @searchivarius has mentioned, you would need query-specific masks (there are several ways to do that, e.g. change the distance function before every query, or modify the distance to contain masks).

@maxstrobel
Author

maxstrobel commented Jan 9, 2019

@yurymalkov @searchivarius Thanks for your kind replies!

@yurymalkov I thought about your second advice to modify the distance to contain a mask.
If I added a mask to the distance function via a diagonal matrix W, e.g.:

Distance            | parameter | Equation
Squared L2          | 'l2'      | d = sum((Ai-Bi)^2) = (Ai-Bi)^T * W * (Ai-Bi) with W = I
Weighted Squared L2 | 'Wl2'     | d = sum(Wi*(Ai-Bi)^2) = (Ai-Bi)^T * W * (Ai-Bi) with arbitrary diagonal W

then I could mask missing features by starting from W = I (I := identity matrix) and zeroing out the diagonal entries that correspond to the missing features.

Do you have an estimate of how this would affect query performance? I would index all samples without missing features, but at test time queries with missing features would sometimes occur.
In general, is this a valid approach at all, or would this procedure fail?

@yurymalkov
Member

@maxstrobel In my experience, this is a valid approach.
Although I wouldn't use matrices (which have quadratic complexity), but just multiply Ai and Bi by the binary mask before computing the distance.
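For binary weights the two formulations agree, as a quick NumPy check (illustrative, not library code) shows; the elementwise version avoids the O(d^2) matrix product entirely.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.random(8), rng.random(8)
w = np.array([1, 1, 0, 1, 0, 1, 1, 1], dtype=float)  # 0 = missing feature

# Quadratic-form version with a diagonal matrix W (O(d^2) per distance)
W = np.diag(w)
d_matrix = (a - b) @ W @ (a - b)

# Elementwise version: weight the squared differences (O(d))
d_mask = np.sum(w * (a - b) ** 2)

# Masking the vectors themselves first also agrees for binary w (w^2 = w)
d_premask = np.sum((w * a - w * b) ** 2)

assert np.isclose(d_matrix, d_mask)
assert np.isclose(d_premask, d_mask)
```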

@maxstrobel
Author

@yurymalkov Nice to hear that this procedure may work.

If I find time next week, I will give it a try.
I would use hnswlib/space_l2.h as a boilerplate and follow these steps:

  1. Adapt the distance functions (L2Sqr, L2SqrSIMD16Ext, L2SqrSIMD4Ext, L2SqrI)
  2. Adjust the space classes (L2Space, L2SpaceI) to use the new weighted versions
  3. Extend hnswlib.h to make the new distance available
  4. Extend python_bindings to support the new distance

Would that be sufficient or am I missing some steps?

@yurymalkov
Member

@maxstrobel Yes, this should be sufficient.

@maxstrobel
Author

@yurymalkov @searchivarius
It seems that handling missing features via feature weighting works. I have implemented only a very basic prototype (copy & paste, no clean code...) to check the general behaviour.

Summing up:

  1. Added a new distance function to the space interface: get_weighted_dist_func
  2. Implemented the weighted distance functions (not the CPU optimized ones...)
  3. Added new search methods to hnswalg (i.e. just a copy using the weighted distance) : searchWeightedKnn and searchBaseLayerSTWeighted
  4. Added weighted_knn_query in python bindings

I also ran some small benchmarks to check the behaviour. As the exact reference I used sklearn's exact kNN, fitted on the subset of available features. The adapted hnswlib kNN is fitted only once on the whole dataset (i.e. all features).
Then I increased the number of (random) missing features and compared the queries of the exact kNN against those of the weighted hnswlib kNN with the missing features disabled. (As a baseline I also included standard hnswlib with zeros imputed for the missing features.)
I used three toy datasets and got the following results:

dataset: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html#sklearn.datasets.load_digits

dataset: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html

dataset: human postures from http://users.eecs.northwestern.edu/~jwa368/my_data.html

As you can see, your algorithm works really well with feature weighting as an extension to handle missing features (I hope I made no silly mistakes and the results are correct...). Maybe it would make sense to incorporate this into the master branch? What do you think?

Feel free to ask questions if something is unclear.

@searchivarius
Member

Wow @maxstrobel looks like this warrants an implementation of a special space.

@yurymalkov
Member

Hi @maxstrobel,
Sure, this can be added to the library, but it has to be clean code.
Also, can you explain what the accuracy on the plots is?

@maxstrobel
Author

@yurymalkov I will try to code it cleanly in the next few days.

The accuracy in the plots is calculated as follows:

  1. Fit HNSW on training data with full feature set
  2. Iterate from zero to number of features. At each iteration, I eliminate one more feature. The current set of features is randomly sampled at each iteration. Then during each iteration:
    1. Fit an exact KNN and query from a test set with the current set of features.
    2. Use the current set of features to build a weight vector ('1' for every available feature; '0' for missing features)
    3. Use the weight vector and the modified HNSW to query from the test set with the current set of features.
    4. As comparison: Impute zeros in all missing features and query with HNSW from 'imputed original space'
    5. Accuracy is computed as the overlap (the order of elements counts) between the exact query and the approximate query using the weights (and, separately, using the imputed zeros)

I hope you can follow my explanations.

TL;DR: It might be useful to look at the code snippet. From it, it should be clear what I did.

import numpy as np
import hnswlib
from sklearn.neighbors import KNeighborsClassifier

# X_train, y_train, X_test, dim, k, ef, M are assumed to be defined beforehand

# Approx kNN - fitted with the whole feature space
approx_knn = hnswlib.Index(space='l2', dim=dim)
approx_knn.init_index(max_elements=len(X_train), ef_construction=ef, M=M)
approx_knn.add_items(X_train, num_threads=-1)
approx_knn.set_ef(ef)

accuracy_weighting, accuracy_no_weighting = [], []
for d in range(dim):    
    # Disable randomly features
    missing_features = np.random.permutation(dim)[:d]
    available_features = np.setdiff1d(np.arange(dim), missing_features)
    
    ###########
    # Exact kNN - fitted & queried with the current subset of features
    exact_knn = KNeighborsClassifier(n_neighbors=k).fit(X_train[:, available_features], y_train)
    exact_query = exact_knn.kneighbors(X_test[:, available_features])[1]
    
    ###########
    # HNSWLIB
    weights = np.ones(dim)
    weights[missing_features] = 0  # disable features via 0-weighting
    # Manipulate test data in order to use it with default HNSWLIB & no weighting
    X_test_ = X_test.copy()
    X_test_[:, missing_features] = 0
    # Query data with imputed values (only zeros...) or disabled features
    approx_query = approx_knn.knn_query(X_test_, k=k)[0]
    approx_query_weight = approx_knn.weighted_knn_query(X_test_, weights, k=k)[0]

    accuracy_no_weighting.append(np.mean(exact_query == approx_query))
    accuracy_weighting.append(np.mean(exact_query == approx_query_weight))

@yurymalkov
Member

@maxstrobel Thank you for the comprehensive explanation!
Looking forward to the pull request.

@maxstrobel
Author

maxstrobel commented Jan 25, 2019

@yurymalkov

It turned out that including a weighted distance is a little tricky.
The problem is that a weighted distance needs one more parameter for the weights.
The standard distance is defined as:

   template<typename MTYPE>
   using DISTFUNC = MTYPE(*)(const void *, const void *, const void *);

a weighted distance would require:

    template<typename MTYPE>
    using WDISTFUNC = MTYPE(*)(const void *, const void *, const void *, const void *);

Do you have a proposal for how this can be integrated cleanly?
In a first pass I implemented a parallel branch for the weighted distances. This means a copy of:

// Standard distance
std::priority_queue<std::pair<dist_t, labeltype >> searchKnn(const void *query_data, size_t k)

// Weighted distance
std::priority_queue<std::pair<dist_t, labeltype >> searchWeightedKnn(const void *query_data, const void *weights, size_t k)


// Standard distance
std::priority_queue<std::pair<dist_t, tableint>, std::vector<std::pair<dist_t, tableint>>, CompareByFirst>
searchBaseLayerST(tableint ep_id, const void *data_point, size_t ef)

// Weighted distance
std::priority_queue<std::pair<dist_t, tableint>, std::vector<std::pair<dist_t, tableint>>, CompareByFirst>
searchBaseLayerSTWeighted(tableint ep_id, const void *data_point, const void *weights, size_t ef)

This means a lot of replicated code... Splitting the functions into distance-independent and distance-dependent parts seems to be a bad idea, too.

Maybe you can have a look at maxstrobel@307b676 to see what I have done so far.

I was overthinking it; now I have a simple solution... A pull request with some first code for review is coming soon. It seems to work now and the code looks good. Next I have to touch the processor-dependent code.

@maxstrobel
Author

@yurymalkov @searchivarius
I pushed the first part of the pull request: maxstrobel@f1a95a4 (For which branch should I do a pull request? master?)

The build passes and the Python tests are also successful.
Currently I have implemented only the weighted version of the squared L2 distance. When the code passes your review, I will also implement InnerSpace.
btw: Are there any performance issues from the duplicated code inside the while loops of the CPU-optimized functions in space_l2.h?

I hope the code is acceptable. Feel free to ask if you have some questions.

@yurymalkov
Member

Hi @maxstrobel,

Many thanks for the update! The pull request should be against the develop branch.

I've looked through the commit. One thing that caught my attention was that the base search algorithm was modified (e.g. the additional pWeights argument in DISTFUNC). I am not sure, but this can lead to performance degradation (only performance tests can show it) and makes the C++ code a bit less generic (as the feature is specific to vector spaces).

We can test the speed on benchmarks (this will take time, as I have to re-establish the internal benchmark) and decide what to do next.

Another option is to make a new space specific to query time, which uses the mask via a different DISTFUNC method; concatenate the mask vector data to the end of the query data in RAM (so the weight mask can be extracted from the first DISTFUNC argument during the query-time distance calculation); and add a separate Python call which takes the mask as input. This would allow keeping the generic modular structure.
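The proposed layout can be sketched in Python (the names pack_query and weighted_l2 are hypothetical; in hnswlib this logic would live in a C++ space implementation): the weights are appended after the query values in one flat buffer, and the query-time distance function splits them off again.

```python
import numpy as np

dim = 4

def pack_query(query, weights):
    """Concatenate the weight mask after the query vector, as one flat array."""
    return np.concatenate([query, weights]).astype(np.float32)

def weighted_l2(packed_query, stored, qty):
    """Distance that recovers the weights from the tail of its first argument."""
    q, w = packed_query[:qty], packed_query[qty:]
    return np.sum(w * (q - stored) ** 2)

q = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([1.0, 0.0, 1.0, 1.0])  # dimension 1 is missing
s = np.array([1.0, 9.0, 3.0, 5.0])
packed = pack_query(q, w)
assert weighted_l2(packed, s, dim) == 1.0  # masked dim ignored; (4-5)^2 = 1
```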

@maxstrobel
Author

Hi @yurymalkov,

thanks for the response!

Regarding your performance concerns, I fully agree that only performance tests show the truth. However, I think the additional argument should not drastically impact the timing. The only things that have really changed, compared to the current state, are an additional if and some pointer instantiations in the distance calculation, which should also not lead to performance degradation.

Encoding the weight vector at the end of the query data seems a reasonable approach, but I think it is not too easy to implement. I would implement the weightings for InnerSpace as well (I think that should work equally well); then the two spaces are balanced and generic again. However, even if it should not work, it would still be as generic as before, only with one more parameter that is used only partially. I hope I got your notion of the generic modular structure right.

So I would really propose doing the performance tests, since this would keep the code clean and simple. I will open a pull request against the develop branch, and then you can decide in which direction we continue.

@yurymalkov
Member

@maxstrobel Thanks!
I'll run performance tests next week, hopefully tomorrow.

@yurymalkov
Member

Hi @maxstrobel!
I've finally found time to run the performance tests - sorry for the delay.

I found that there is about a 2-3% slowdown on low-dimensional data (0.5M vectors, d=4). Not much, but I would like to avoid it.

I can help with the code by moving the weights logic inside the space implementations, thus keeping the general part intact. It shouldn't be a problem.
I wonder, can you provide tests for the weighted search?

@maxstrobel
Author

maxstrobel commented Feb 5, 2019

Hello @yurymalkov !

Sounds good so far! The slowdown is only perceptible on low-dimensional data, isn't it? Then I can use this state for my own tests until we have a clean version.

However, I totally agree that we should tackle the slowdown. I am not sure I understand 100% correctly where you would place the weights logic. Could you please explain that in a little more detail?

I have a simple notebook where I tested the influence of the weights. However, these are only toy examples...
https://github.com/maxstrobel/hnswlib/blob/master/MissingFeatures.ipynb

I ran a test on a bigger dataset with 100,000 training samples and 10,000 test queries.
The internal graph structure seems to be really stable against this type of dimensionality reduction:
[plot: recall vs. number of removed features]

Note (probably again): The recall is calculated with respect to exact kNN queries on the reduced feature space, not against an exact query on the whole feature space. If we compare the queries against an exact query on the full feature space, we get the following results. As expected, there is a strong drop in accuracy compared to the original query, but it is still way better than mean imputation...

@yurymalkov
Member

@maxstrobel Thank you for the toy example! It should be enough to make simple tests.
Yes, the slowdown is only for low-dimensional spaces.

The logic for the weights can be encapsulated inside the distance function.
I.e., in a modified L2Sqr you can extract the weights using pWeights = pVect1+qty*sizeof(float).
To set the weights from the Python interface, you can create a copy of the vector and concatenate the weights to the end of the first vector.

To change the spaces between different user queries (which can potentially be done with and without weights in parallel), the space should be passed throughout the query processing (e.g. down to searchBaseLayerST), so that queries with and without weights use different distance functions.

@maxstrobel
Author

Hi @yurymalkov,
sorry for the delay, I am a little busy at the moment...
However, I will update the code as you proposed. I think I can get to it at the beginning or middle of March.
I leave this issue open until then.

Best,
Max

@yurymalkov
Member

Hi @maxstrobel,
Sure, thank you!

@maxstrobel
Author

Hi @yurymalkov ,
I am available again.
I thought about your approach (encoding the weights at the end of the data). If we implement it this way, we will have to dynamically adapt the distance function used (standard vs. weighted), won't we?
The procedure would be (I hope I got your idea right):

1. Call knn_query with/without weights
2. Set distance function:
        without weights: standard distance
        with weights: weighted distance
3. Call standard HNSW algo
4. Inside the distance function:
        without weights: calculate standard distance, as implemented in the current state
        with weights: extract weights from end of the data & calculate weighted distance

So we would have to introduce a setter function (set_dist_func(bool weighted)) for the distance function in the space implementations.

Does this procedure match your thoughts, or am I completely off?

Best,
Max

@yurymalkov
Member

Hi @maxstrobel,

Sorry for a late reply. Yes, it seems correct in general.

I would prefer to concentrate all the weighted-distance logic inside the Python bindings, without touching the C++ module much.
E.g. calling set_dist_func in this case would just alter the space in the hnswlib module from the bindings (e.g. override the reference to a space and change it back after the method call).

Does it sound ok?

@maxstrobel
Author

Yes, that sounds fine. I started to make the changes. When I am done I'll update the pull request.

Best,
Max

@yurymalkov
Member

@maxstrobel
Cool! Many thanks!

@maxstrobel
Author

@yurymalkov
I tried to implement the solution the way we discussed. I refactored the spaces and separated the weighted calculations from the standard ones. However, I found some problems:

  1. In the distance functions, we always process exactly two data points (the query point + a point from the graph). When we concatenate the query point and its weights as you proposed, we have to do so for every single data point, i.e. if we feed an array into the algorithm, we have to add the weights to every single entry. This seems somewhat awkward to me. Am I correct, or am I missing a detail?
  2. A more technical one: do you know how to concatenate numpy arrays with pybind?

I think the second problem can be solved easily. The first one seems more difficult... What do you think about it?
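Concerning the batch case, a NumPy sketch of the reorganization (illustrative; a real binding would do this copy in C++ or via pybind11): the same weight mask is appended to every row of the query array.

```python
import numpy as np

X = np.random.rand(5, 3).astype(np.float32)      # 5 query points, dim 3
w = np.array([1.0, 0.0, 1.0], dtype=np.float32)  # mask: dimension 1 missing

# Append the same weight mask to every row of the query batch.
X_packed = np.hstack([X, np.tile(w, (X.shape[0], 1))])
assert X_packed.shape == (5, 6)
assert np.allclose(X_packed[:, 3:], w)
```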

@yurymalkov
Member

Hi @maxstrobel ,
Concerning #1 - yes, you have to somehow reorganize the memory (e.g. by copying to a new array).
Indeed, it does not seem like a beautiful solution, but in terms of speed it should be very fast.

One can also redirect this problem to the user (e.g. this can be done in Python), but that seems like a dirtier solution.

@maxstrobel
Author

Hi @yurymalkov,

I had a hard time figuring out how to reorganize the memory and decided to stay with my fork, because the few percent of performance do not matter for my application.

However, I want to thank you for your support and patience while working through this issue.
I will close the issue and the PR. When I have a need for the last few percent of performance, I may give it another try.

Thanks,
Max

@yurymalkov
Member

Hi @maxstrobel,
Thanks for trying! I can help in case you decide to do it.
Also, maybe, if I have time, I'll add it myself.

@parashardhapola

Hi @yurymalkov

Has weighted_knn_query function been added to hnswlib already? I would really love to see this function. Unfortunately, I'm not good enough with C to fix the challenges that @maxstrobel was facing. I'm pretty sure this functionality would benefit a lot of people.

Thanks,
Parashar

@yurymalkov
Member

Hi @parashardhapola,
No, it was not added... I do not have much time to add it, but if someone is willing to help, I know how to do it without losing performance.
