PredictedLabel is always true for Anomaly Detection #3990
Thanks for the detailed description of the issue @colbylwilliams! This definitely seems like an issue. I intuitively would have thought that the threshold would be set during training, or that the threshold value would have been taken into account during training. But it definitely does not seem correct that the value is 0 and cannot be changed. I'll double check with @codemzs, who has worked with these time series algorithms more in depth, but it seems necessary to either set the threshold in the catalog extension, or have a [...]. Once I double check, I'll make the change so that you are unblocked for the sample.
@artidoro, @wschin, eerhardt, @codemzs, @ganik - Any progress on this issue? We were building a sample for fraud detection based on 'AnomalyDetection-PCA', but it is blocked until this issue is fixed. Going further, this issue basically means that our only anomaly detection algorithm for cases such as fraud detection (comparable to binary-classification tasks, but a better fit for imbalanced datasets) is currently not viable in ML.NET. Can you confirm this?
@colbylwilliams if this were true, that `PredictedLabel` is always set to `true`, then our own example would definitely show that as well. Please see the ML.NET example that shows otherwise.
@ganik I did find/run the example you referenced. As I mention in my description, [...]. Per your recommendation, I tried setting [...]
By the way, the formula for the score is:

`score(x) = 1 - ||P x|| / ||x||`

where `P` is the projection onto the space spanned by the learned eigenvectors. What it is essentially computing is 1 minus the ratio of the length of the projected vector to that of the input vector. We expect that the higher the score, the higher the chance of an anomaly, because in the case of no anomaly the data point should be able to be projected without changing its length much.
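To make that formula concrete, here is a small sketch in plain NumPy (my own illustration, not the ML.NET implementation; the synthetic data and variable names are invented). It fits a one-dimensional principal subspace and scores points by how much projection shrinks them:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data lying close to the 1-D subspace spanned by [1, 1].
direction = np.array([1.0, 1.0]) / np.sqrt(2.0)
train = np.outer(rng.normal(size=200), direction) + rng.normal(scale=0.01, size=(200, 2))

# Top-k principal directions via SVD of the centered data (k = 1 here).
_, _, vt = np.linalg.svd(train - train.mean(axis=0), full_matrices=False)
components = vt[:1]  # shape (k, n_features)

def anomaly_score(x):
    """1 minus the ratio of the projected length to the original length."""
    projected = components.T @ (components @ x)
    return 1.0 - np.linalg.norm(projected) / np.linalg.norm(x)

normal_point = np.array([2.0, 2.0])      # lies in the learned subspace
anomalous_point = np.array([2.0, -2.0])  # orthogonal to it

print(anomaly_score(normal_point))     # close to 0 -> normal
print(anomaly_score(anomalous_point))  # close to 1 -> anomaly
```

A point inside the subspace projects onto itself (ratio near 1, score near 0), while a point orthogonal to it projects to almost nothing (ratio near 0, score near 1), matching the interpretation described above.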
I will be doing a first PR that enables the user to change the threshold, and that changes the default value to something more meaningful, say 0.5. In a second PR I will add an extra feature that allows the user to specify the percentage of the training data points that will be considered anomalies, and will automatically choose a threshold based on that.
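The second idea above (deriving the threshold from a target anomaly percentage of the training data) could be sketched like this. This is my own illustration of the general technique, not the planned ML.NET implementation; the function name and sample scores are invented:

```python
import numpy as np

def choose_threshold(train_scores, anomaly_fraction):
    """Pick the threshold so that roughly `anomaly_fraction` of the
    training scores fall above it (and are therefore flagged)."""
    return float(np.quantile(train_scores, 1.0 - anomaly_fraction))

# Hypothetical training-set anomaly scores, each in [0, 1].
scores = np.array([0.01, 0.02, 0.03, 0.05, 0.06, 0.08, 0.10, 0.55, 0.70, 0.90])

threshold = choose_threshold(scores, anomaly_fraction=0.3)
flagged = scores > threshold
print(threshold, flagged.sum())  # 3 of the 10 scores exceed the threshold
```

The quantile makes the flagged fraction on the training set approximately equal to the requested percentage, which is exactly the contract described for the second PR.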
@colbylwilliams that's true: it does return `PredictedLabel` set to `true` (meaning anomaly) for all scores that are nonzero. The score is a normalized error (the normalized distance from the data vector to the eigenvector space). This means the closer to 0 it is, the less of an anomaly it is. The score range is always [0, 1]. You are right again that, with the current threshold of 0, the binary classifier will classify all nonzero scores as anomalies. The design issue here is to expose `Threshold` for the user to set. However, with this understanding of what the score is, you don't need to use `PredictedLabel` at all. You are good to set your own threshold to define what anomalies are. You would probably do that by experimenting with it, for example by plotting a precision-recall curve if you have existing labeled data.
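The precision-recall experiment suggested above can be sketched as follows. The labeled scores are invented toy data, and `precision_recall` is my own helper, not an ML.NET API:

```python
# Toy labeled results: (anomaly score, true label). Values invented for illustration.
data = [(0.02, False), (0.11, False), (0.93, True), (0.07, False),
        (0.64, True), (0.55, False), (0.88, True)]

def precision_recall(threshold):
    """Precision and recall when flagging every score above `threshold`."""
    tp = sum(1 for s, y in data if s > threshold and y)
    fp = sum(1 for s, y in data if s > threshold and not y)
    fn = sum(1 for s, y in data if s <= threshold and y)
    return tp / (tp + fp), tp / (tp + fn)

# Sweep a few candidate thresholds to pick one with an acceptable trade-off.
for t in (0.3, 0.5, 0.7):
    p, r = precision_recall(t)
    print(f"threshold={t}: precision={p:.2f}, recall={r:.2f}")
```

Sweeping the threshold over the labeled data traces out the precision-recall curve; you then pick the threshold whose trade-off fits your fraud-detection tolerance for false positives versus missed anomalies.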
@colbylwilliams - Did you have a chance to try the configuration changes suggested by @ganik? Please keep me posted with the results when it is possible for you. :)
@CESARDELATORRE the sample works as expected aside from the [...]
Cool. Please ping me when finished so we'll make a final review of the sample app and make it public, ok? 👍
System information
Issue: PredictedLabel is always true for Anomaly Detection
In my experience, and as demonstrated by this sample, predictions from models trained with the `RandomizedPcaTrainer` always set the value for `PredictedLabel` to `true`.

Note: I'm very new to machine learning, I am not a data scientist, nor am I very familiar with this code base, but I've taken a crack at figuring out why...

The `BinaryClassifierScorer` is used for scoring anomaly detection models, specifically those trained using the `RandomizedPcaTrainer`. Which I think makes sense, as with binary classification the `PredictedLabel` in anomaly detection will be one of two values, `true` or `false`.

However, when using binary classification, `PredictedLabel` is set to `true` if the prediction's `Score` is a positive value and set to `false` if the `Score` is negative. This is one place it seems to break down with anomaly detection, as the `Score` is going to be a value between zero and one. So, the current implementation of `BinaryClassifierScorer` is going to return a value of `true` for any prediction that does not have a `Score` of zero or NaN.

Additionally, it's my understanding that in anomaly detection it is up to the user to set the threshold of the model that indicates whether a `Score` is considered an anomaly or a normal value. (Or at least this is the case for supervised training.) From what I can tell, the implementation of `BinaryClassifierScorer` used by anomaly detection does have a `Threshold` property which it compares the `Score` value to, to get the value for `PredictedLabel`. It would seem the `BinaryClassifierScorer` could be used for anomaly detection if the user was able to manually set a value for `Threshold`, or if the scorer could intelligently set the value based on the distribution of `Score`s. However, the `Threshold` property is by default set to zero, with no public way of changing its value.

Thus, based on my understanding, the scorer compares the prediction's `Score` to zero, and the value for `PredictedLabel` will always be set to `true`, with the exception of the edge case where the score is zero or NaN.

During my research, I did find that `BinaryClassificationCatalog` has a method `ChangeModelThreshold` to manually override the value of the scorer's `Threshold` property. Unfortunately, this functionality is not exposed on the `AnomalyDetectionCatalog`, so it can't be used with anomaly detection.

Finally, and this may need to be moved to a separate issue, but I've found contradictory information on how to interpret the `Score` value of an anomaly detection prediction. For example, this sample indicates that outliers (or anomalies) will have a smaller value for `Score` than will normal values. However, this documentation states "If the error is close to 0, the instance is considered normal (non-anomaly)." This matches the results I'm getting from my sample, where anomalies have a higher value for `Score` than normal values.
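To restate the failure mode described above in code: with the threshold fixed at zero, any score in (0, 1] compares as greater, so every prediction comes out `true`. This is a toy illustration of that comparison, not the actual `BinaryClassifierScorer` source:

```python
def predicted_label(score, threshold=0.0):
    """Mimics the score-vs-threshold comparison described in the issue."""
    return score > threshold

# Anomaly scores always fall in [0, 1].
scores = [0.001, 0.2, 0.5, 0.99]

print([predicted_label(s) for s in scores])                 # [True, True, True, True]
print([predicted_label(s, threshold=0.5) for s in scores])  # [False, False, False, True]
```

With the default threshold of 0, even a near-zero score of 0.001 is labeled an anomaly; once the threshold can be raised to a meaningful value, only genuinely high scores are flagged.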