
PredictedLabel is always true for Anomaly Detection #3990


Closed
colbylwilliams opened this issue Jul 11, 2019 · 12 comments · Fixed by #4039
Labels
bug Something isn't working

Comments

@colbylwilliams
Member

System information

  • OS version/distro: macOS & Windows
  • .NET version (e.g., dotnet --info): .NET Core

Issue: PredictedLabel is always true for Anomaly Detection

In my experience, and as demonstrated by this sample, predictions from models trained with the RandomizedPcaTrainer always set the value for PredictedLabel to true.

Note: I’m very new to machine learning, I’m not a data scientist, and I’m not very familiar with this code base, but I’ve taken a crack at figuring out why...
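For context, here is a minimal sketch of the kind of pipeline that reproduces what I'm seeing (class names, column values, and data are mine, not from the linked sample):

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Data;

public class DataPoint
{
    [VectorType(3)]
    public float[] Features { get; set; }
}

public class Result
{
    public bool PredictedLabel { get; set; }
    public float Score { get; set; }
}

public static class Program
{
    public static void Main()
    {
        var mlContext = new MLContext(seed: 0);

        // Mostly "normal" points plus one obvious outlier at the end.
        var points = new List<DataPoint>
        {
            new DataPoint { Features = new float[] { 1, 2, 3 } },
            new DataPoint { Features = new float[] { 1, 2, 4 } },
            new DataPoint { Features = new float[] { 2, 2, 3 } },
            new DataPoint { Features = new float[] { 1, 3, 3 } },
            new DataPoint { Features = new float[] { 100, 200, 300 } }
        };

        // Train only on the normal points.
        var trainData = mlContext.Data.LoadFromEnumerable(points.GetRange(0, 4));
        var model = mlContext.AnomalyDetection.Trainers
            .RandomizedPca(featureColumnName: nameof(DataPoint.Features), rank: 1)
            .Fit(trainData);

        var engine = mlContext.Model.CreatePredictionEngine<DataPoint, Result>(model);
        foreach (var point in points)
        {
            var r = engine.Predict(point);
            // Observed behavior: PredictedLabel is true for every nonzero Score.
            Console.WriteLine($"Score: {r.Score:F4}  PredictedLabel: {r.PredictedLabel}");
        }
    }
}
```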

The BinaryClassifierScorer is used for scoring anomaly detection models, specifically those trained using the RandomizedPcaTrainer. I think that makes sense: as in binary classification, the PredictedLabel in anomaly detection will be one of two values, true or false.

However, in binary classification, PredictedLabel is set to true if the prediction's Score is positive and to false if the Score is negative. This is one place it breaks down for anomaly detection, where the Score is a value between zero and one. So the current implementation of BinaryClassifierScorer will return true for any prediction whose Score is not zero or NaN.
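In effect (my paraphrase of the observed behavior, not the actual scorer source), the mapping behaves like:

```csharp
// Paraphrase of the observed behavior, not the BinaryClassifierScorer source.
// Binary-classification scores span negative and positive values, so 0 is a
// sensible split point; anomaly scores live in [0, 1], so every nonzero
// score comes back true.
static bool GetPredictedLabel(float score, float threshold = 0f)
    => score > threshold; // comparisons with NaN are false, so NaN yields false
```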

Additionally, it’s my understanding that in anomaly detection it is up to the user to set the threshold that determines whether a Score indicates an anomaly or a normal value (or at least this is the case for supervised training). From what I can tell, the implementation of BinaryClassifierScorer used by anomaly detection does have a Threshold property, which it compares against the Score to produce the PredictedLabel. It would seem the BinaryClassifierScorer could work for anomaly detection if the user were able to set Threshold manually, or if the scorer could choose a value intelligently based on the distribution of Scores. However, Threshold defaults to zero, and there is no public way to change it.

Thus, based on my understanding, the scorer compares the prediction’s Score to zero, and PredictedLabel will always be true, except in the edge case where the Score is zero or NaN.

During my research, I did find that BinaryClassificationCatalog has a ChangeModelThreshold method to manually override the scorer’s Threshold property. Unfortunately, this functionality is not exposed on the AnomalyDetectionCatalog, so it can’t be used with anomaly detection.
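For reference, the binary-classification version can be used like this (a sketch; assumes an mlContext and a trainData IDataView with Label and Features columns):

```csharp
// Sketch of the existing binary-classification API; nothing equivalent is
// exposed on mlContext.AnomalyDetection at the time of this issue.
var model = mlContext.BinaryClassification.Trainers
    .SdcaLogisticRegression()
    .Fit(trainData);

// Re-wraps the prediction transformer with a custom decision threshold.
var adjusted = mlContext.BinaryClassification.ChangeModelThreshold(model, threshold: 0.3f);
```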


Finally (this may need to move to a separate issue), I've found contradictory information on how to interpret the Score value of an anomaly detection prediction. For example, this sample indicates that outliers (anomalies) will have a smaller Score than normal values. However, this documentation states, "If the error is close to 0, the instance is considered normal (non-anomaly)." The latter matches the results from my sample, where anomalies have a higher Score than normal values.


@eerhardt
Member

@wschin @artidoro @codemzs @ganik - any thoughts on this?

@artidoro
Contributor

Thanks for the detailed description of the issue @colbylwilliams!

This definitely seems like an issue. I would intuitively have expected the threshold to be set, or at least taken into account, during training. But it definitely does not seem correct that the value is 0 and cannot be changed.

I'll double-check with @codemzs, who has worked with these time series algorithms more in depth, but it seems necessary to either set the threshold in the catalog extension or add a ChangeModelThreshold method like the one in binary classification, as you suggested.

Once I've double-checked, I'll make the change so that you are unblocked for the sample.

@CESARDELATORRE
Contributor

@artidoro, @wschin, @eerhardt, @codemzs, @ganik - Any progress on this issue?

We were building a sample for Fraud detection based on 'AnomalyDetection-PCA' but it is blocked until this issue is fixed.

Going further, this issue means that our only anomaly detection algorithm for cases such as fraud detection (comparable to binary classification tasks, but better suited to imbalanced datasets) is currently not viable in ML.NET. Can you confirm this?

@ganik
Member

ganik commented Jul 19, 2019

@colbylwilliams if it were true that PredictedLabel is always set to true, then our own example would definitely show that as well. Please see this ML.NET example, which shows otherwise.
One thing that stands out to me in your case: can you set EnsureZeroMean to false in your code? I have a hunch that in that case the BinaryClassifier will consider the score range [0, 1] rather than [-1, 1].
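For clarity, that flag is a parameter on the trainer (a sketch, assuming the same mlContext and trainData setup as the repro above):

```csharp
// Trying the EnsureZeroMean suggestion: it is a RandomizedPca trainer parameter.
var trainer = mlContext.AnomalyDetection.Trainers.RandomizedPca(
    featureColumnName: "Features",
    rank: 1,
    ensureZeroMean: false); // defaults to true
var model = trainer.Fit(trainData);
```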

@colbylwilliams
Member Author

@ganik I did find and run the example you referenced. As I mention in my description, PredictedLabel returns true for any prediction whose Score is not zero or NaN. The Score for the anomaly in that example evaluates to zero, so getting false for PredictedLabel aligns with this issue. I believe the sample produces a score of zero because the same dataset used to train is also used to test; I wouldn't expect a score of zero in real-world scenarios.

Per your recommendation, I tried setting EnsureZeroMean to false, but it yielded the same results described above (i.e. PredictedLabel is always true).

@artidoro
Contributor

By the way, the formula for the score is:

sqrt( (|x - m|^2 - |Ux - p|^2) / |x - m|^2 ) = sqrt( 1 - |Ux - p|^2 / |x - m|^2 )

where x is the input vector, U is the projection matrix, m is the mean vector in the input space, and p is the mean vector in the projection space.

It is essentially computing the square root of 1 minus the ratio of the squared lengths of the centered projected vector and the centered input vector. We expect higher scores to indicate a higher chance of an anomaly, because a non-anomalous data point should project without its length changing much.
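In code, that computation would look roughly like this (a sketch built from the definitions above, not ML.NET's actual implementation; U is assumed stored as rank x dim):

```csharp
using System;

// Sketch of the score formula above; not ML.NET's implementation.
// x: input vector (length dim), U: projection matrix (rank x dim),
// m: mean in input space (length dim), p: mean in projected space (length rank).
static float AnomalyScore(float[] x, float[,] U, float[] m, float[] p)
{
    int rank = U.GetLength(0), dim = U.GetLength(1);

    // |x - m|^2: squared distance of the input from the input-space mean.
    double inputSq = 0;
    for (int j = 0; j < dim; j++)
        inputSq += (x[j] - m[j]) * (x[j] - m[j]);

    // |Ux - p|^2: squared distance of the projection from the projected mean.
    double projSq = 0;
    for (int i = 0; i < rank; i++)
    {
        double ux = 0;
        for (int j = 0; j < dim; j++)
            ux += U[i, j] * x[j];
        projSq += (ux - p[i]) * (ux - p[i]);
    }

    // sqrt(1 - |Ux - p|^2 / |x - m|^2): near 0 means the projection preserves
    // the point (normal); near 1 means it is poorly explained (anomalous).
    return (float)Math.Sqrt(1 - projSq / inputSq);
}
```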

@artidoro
Contributor

I will be doing a first PR that enables the user to change the threshold, and that changes the default value to something more meaningful, say 0.5.

In a second PR I will add an extra feature that allows the user to specify the percentage of the training data points to be considered anomalies, and that will automatically choose a threshold based on that.
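A sketch of how that second feature might work (my illustration, not the eventual implementation): score the training set, then place the threshold at the quantile matching the requested anomaly percentage.

```csharp
using System;
using System.Linq;

// Sketch of quantile-based threshold selection; illustrative only.
// trainingScores: the Score column computed over the training data.
// anomalyFraction: e.g. 0.05 to flag roughly 5% of training points.
static float ChooseThreshold(float[] trainingScores, double anomalyFraction)
{
    var sorted = trainingScores.OrderBy(s => s).ToArray();
    // Everything above the (1 - anomalyFraction) quantile is called an anomaly.
    int index = (int)Math.Ceiling((1 - anomalyFraction) * (sorted.Length - 1));
    return sorted[index];
}
```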

@ganik
Member

ganik commented Jul 20, 2019

@colbylwilliams that's true, it does return PredictedLabel set to true (meaning anomaly) for all scores that are nonzero. The score is a normalized error (the normalized distance from the data vector to the eigenvector space): the closer it is to 0, the less of an anomaly it is. The score range is always [0, 1]. You are right again that with the current threshold of 0, the BinaryClassifier will classify all nonzero scores as anomalies. The design issue here is to expose the Threshold for the user to set. However, with this understanding of what the score is, you don't need to use PredictedLabel at all; you can set your own threshold to define what counts as an anomaly. You would probably do that by experimenting, for example by plotting a precision-recall curve if you have labeled data.
Another way to tune the PCA anomaly detector is to raise the Rank. The higher the rank, the smaller the number of outliers (test records/vectors that are not in the eigenvector space). I bet that if you set Rank to some higher number, PredictedLabel will mostly be false, with just a few trues. Again, choosing the "right" rank is an art in itself, similar to choosing the "right" threshold; one can do it via a precision-recall curve given labeled data.
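Concretely, that workaround looks like this (a sketch; the 0.5 cutoff is an arbitrary placeholder to tune, for example against a precision-recall curve):

```csharp
// Workaround sketch: ignore PredictedLabel and threshold the Score yourself.
// Assumes a PredictionEngine like the one in the repro above.
const float myThreshold = 0.5f; // placeholder; tune on labeled data
var result = engine.Predict(point);
bool isAnomaly = result.Score > myThreshold;
```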

@CESARDELATORRE
Contributor

@colbylwilliams - Did you have a chance to try the configuration changes suggested by @ganik? Please keep me posted with the results when possible. :)

@colbylwilliams
Member Author

@CESARDELATORRE the sample works as expected aside from the PredictedLabel always being true. The values I'm getting for Score are what I'd expect and sufficient for predicting which values are anomalies. Currently the sample ignores the PredictedLabel and instead compares the Score when printing to the console. Once I'm able to set the Threshold, I'll get correct values for PredictedLabel and will update the sample.

@CESARDELATORRE
Contributor

Cool. Please ping me when finished so we can do a final review of the sample app and make it public, ok? 👍
