
Randomised PCA anomaly detection not detecting anomalies #3871


Closed

LDWDev opened this issue Jun 17, 2019 · 2 comments


LDWDev commented Jun 17, 2019

System information

Windows 10, .NET Core 2.2 console app, VS2019

Issue

SETUP

Good morning. I am encountering some issues with the RPCA trainer and was hoping someone could help me out here. I'm not really sure what I'm doing wrong, but I am not getting the results I would expect.

I've made a toy model to test out the ML.NET anomaly detection functionality. I manufacture two random numbers for each data point, and call them gene one and gene two. They are constrained to lie in a particular range: gene one lies between 0.8 and 0.9, and gene two lies between 0.1 and 0.5.

Using a sample from this data (with the same seed each time), I apply an RPCA pipeline, then call Fit, then Transform on the training data.

I then make a gene entry with ludicrous values (10000, 25000) and transform that to see where it lies. ML.NET claims this is not an anomaly.
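Roughly, the toy data is built along these lines (a simplified sketch, not the exact code from the repo linked below; the GeneData class name and the point count are placeholders):

using System;
using System.Linq;

public class GeneData
{
    public float GeneOneScore { get; set; }
    public float GeneTwoScore { get; set; }
}

// Inside Main(): generate points with gene one in [0.8, 0.9] and gene two in [0.1, 0.5].
var rng = new Random(42);
var trainingData = Enumerable.Range(0, 100)
    .Select(_ => new GeneData
    {
        GeneOneScore = 0.8f + 0.1f * (float)rng.NextDouble(),
        GeneTwoScore = 0.1f + 0.4f * (float)rng.NextDouble()
    })
    .ToList();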

PROBLEM

I expect to see an anomaly. I've tried this with less silly values, more silly values, with and without the available kinds of normalisation, and with a reduced rank for the PCA trainer.

LOGS

Here's the output of the program. It shows whether each point is predicted to be an anomaly, its score, and the transformed co-ordinates of the data points.

Here's the pipeline:

var rpcaProjection = ct.Transforms.Concatenate("Features", "GeneOneScore", "GeneTwoScore")
    .Append(ct.Transforms.NormalizeMeanVariance("NormalisedFeatures", "Features"))
    .Append(ct.AnomalyDetection.Trainers.RandomizedPca(featureColumnName: "NormalisedFeatures", rank: 2));
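
The fit/transform/predict steps are roughly as follows (continuing the sketch above; ct is the MLContext, and GenePrediction is a placeholder output class using ML.NET's standard anomaly-detection columns):

public class GenePrediction
{
    // Standard anomaly-detection output columns.
    public bool PredictedLabel { get; set; }
    public float Score { get; set; }
}

// Fit on the toy data, then transform the same data to inspect the scores.
var dataView = ct.Data.LoadFromEnumerable(trainingData);
var model = rpcaProjection.Fit(dataView);
var scored = model.Transform(dataView);

// Score the deliberately extreme point.
var engine = ct.Model.CreatePredictionEngine<GeneData, GenePrediction>(model);
var outlier = engine.Predict(new GeneData { GeneOneScore = 10000f, GeneTwoScore = 25000f });
Console.WriteLine($"{outlier.PredictedLabel}, {outlier.Score}");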

Results from transforming first 20 training data: Predicted, score, PCA co-ordinates

True, 0.005851699, 0.7284454, 0.6617603
True, 0.002783414, 1.028686, 0.7808771
True, 0.004348077, 0.9824907, 1.543225
True, 0.004398529, 1.021341, 0.510879
True, 0.003683135, 1.005801, 1.091905
False, NaN, 0.9997306, 1.01117
True, 0.003618588, 1.004708, 0.8854353
True, 0.004349507, 0.9971809, 1.588225
True, 0.004412429, 1.018427, 0.4791145
True, 0.003997049, 1.008593, 1.246756
True, 0.004217377, 1.00574, 1.322197
True, 0.004192073, 0.9794555, 1.412197
True, 0.004289094, 1.019702, 1.569695
True, 0.005468629, 1.021341, 1.083963
True, 0.003488006, 1.007865, 0.7861712
False, NaN, 1.025348, 0.9171998
True, 0.004403637, 1.020734, 1.508814
False, NaN, 0.9888039, 0.9224939
True, 0.004757768, 1.009807, 0.8113181
True, 0.004348857, 1.011143, 0.4394088

Results from transforming the "anomaly": Predicted, score, PCA co-ordinates
True, 0.006252703, 121407.6, 82720.04

I've put this repo on GitHub here:

https://github.com/LDWDev/MLWoes/blob/master/MLtestapp/Program.cs

Could anyone point out what is going on here? I would not expect the score to be the value it is.

Do I need to give the trainer anomalous data? Why does the scorer think such distant values should be considered 'normal'? I have had trouble finding useful documentation/tutorials on this.

@colbylwilliams (Member)

Likely related to #3990

@antoniovs1029 (Member)

Hi @LDWDev, it seems things have changed since you opened this issue, particularly since #4039 added samples and other methods related to PCA anomaly detection.

If I run your code as it is right now, I get the following output, which is different from your original output (where most of the input points were labeled as anomalies):

Results from transforming first 20 training data: Predicted, score, PCA co-ordinates
False, 0.003840973, 1.0243503, 0.77820677
False, 0.004364683, 0.9783494, 1.5379477
False, 0.0042510876, 1.0170361, 0.5091319
False, 0.003400095, 1.0015614, 1.0881705
False, NaN, 0.9955166, 1.0077118
False, 0.0014221083, 1.0004733, 0.8824073
False, 0.0042578345, 0.99297774, 1.5827934
False, 0.004307051, 1.0141345, 0.47747603
False, 0.004401572, 1.004342, 1.2424928
False, 0.0043062223, 1.001501, 1.3176755
False, 0.0042433045, 0.975327, 1.4073672
False, 0.004442298, 1.015404, 1.5643276
False, 0.0017190819, 1.0170361, 1.0802565
False, 0.0035820855, 1.0036166, 0.78348273
False, 0.0007224202, 1.0210257, 0.9140633
False, 0.0042893523, 1.0164316, 1.5036538
False, NaN, 0.984636, 0.91933924
False, 0.0043272316, 1.0055509, 0.8085436
False, 0.0044419686, 1.0068808, 0.43790618
False, 0.0043820036, 1.0028309, 0.61069447
Results from transforming the "anomaly": Predicted, score, PCA co-ordinates
False, 0.006146703, 120895.81, 82437.16

Notice that now none of them were tagged as anomalies.

Still, if I change the rank you used (rank = 2) to a lower value (rank = 1), as suggested here, I get the following output, which is probably more like what you expected to see (the one-line pipeline change is shown after the output):

Results from transforming first 20 training data: Predicted, score, PCA co-ordinates
False, 0.2112529, 1.0243503, 0.77820677
False, 0.16840076, 0.9783494, 1.5379477
False, 0.24335083, 1.0170361, 0.5091319
False, 0.109844014, 1.0015614, 1.0881705
False, 0.1439669, 0.9955166, 1.0077118
True, 0.50474775, 1.0004733, 0.8824073
False, 0.19581862, 0.99297774, 1.5827934
False, 0.2489134, 1.0141345, 0.47747603
False, 0.18574826, 1.004342, 1.2424928
False, 0.18869121, 1.001501, 1.3176755
False, 0.14238065, 0.975327, 1.4073672
False, 0.23007028, 1.015404, 1.5643276
False, 0.21633859, 1.0170361, 1.0802565
False, 0.33171, 1.0036166, 0.78348273
False, 0.2030785, 1.0210257, 0.9140633
False, 0.23109731, 1.0164316, 1.5036538
True, 0.90728194, 0.984636, 0.91933924
False, 0.3354059, 1.0055509, 0.8085436
False, 0.2617581, 1.0068808, 0.43790618
False, 0.28585696, 1.0028309, 0.61069447
Results from transforming the "anomaly": Predicted, score, PCA co-ordinates
True, 0.9362057, 120895.81, 82437.16
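
For clarity, the only change relative to your pipeline is the rank parameter:

var rpcaProjection = ct.Transforms.Concatenate("Features", "GeneOneScore", "GeneTwoScore")
    .Append(ct.Transforms.NormalizeMeanVariance("NormalisedFeatures", "Features"))
    .Append(ct.AnomalyDetection.Trainers.RandomizedPca(featureColumnName: "NormalisedFeatures", rank: 1));

The intuition is that with only two input features, a rank of 2 lets the model reconstruct every point almost exactly, so nothing produces a large reconstruction-error score; with rank 1, the extreme point cannot be represented in the retained subspace and is flagged as an anomaly.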

So I will close this issue, but please feel free to reopen it if you still have problems with this. Thanks.

antoniovs1029 added the P3 (Doc bugs, questions, minor issues, etc.) and question (Further information is requested) labels Jan 9, 2020
antoniovs1029 self-assigned this May 6, 2020
ghost locked as resolved and limited conversation to collaborators Mar 21, 2022