
Randomised PCA anomaly detection not detecting anomalies #3871


Closed

LDWDev opened this issue Jun 17, 2019 · 2 comments


LDWDev commented Jun 17, 2019

System information

Windows 10, .NET Core 2.2 console app, VS2019

Issue

SETUP

Good morning. I am encountering some issues with the RPCA trainer and was hoping someone could help me out here. I'm not really sure what I'm doing wrong, but I am not getting the results I would expect.

I've made a toy model to test out the ML.NET anomaly detection functionality. I manufacture two random numbers for each data point, and call them gene one and gene two. They are constrained to lie in a particular range: gene one lies between 0.8 and 0.9, and gene two lies between 0.1 and 0.5.

Using a sample from this data (with the same seed each time), I apply an RPCA pipeline, then call Fit, then Transform on the training data.

I then make a gene entry with ludicrous values (10000, 25000) and transform that to see where it lies. ML.NET claims this is not an anomaly.
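Roughly, the toy data is built along these lines (a simplified sketch, not the exact code from the repo linked below; the GeneData class name and the point count are placeholders):

using System;
using System.Linq;

public class GeneData
{
    public float GeneOneScore { get; set; }
    public float GeneTwoScore { get; set; }
}

// Inside Main(): generate points with gene one in [0.8, 0.9] and gene two in [0.1, 0.5].
var rng = new Random(42);
var trainingData = Enumerable.Range(0, 100)
    .Select(_ => new GeneData
    {
        GeneOneScore = 0.8f + 0.1f * (float)rng.NextDouble(),
        GeneTwoScore = 0.1f + 0.4f * (float)rng.NextDouble()
    })
    .ToList();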

PROBLEM

I expect to see an anomaly. I've tried this with less silly values, more silly values, with and without the available kinds of normalisation, and with a reduced rank for the PCA trainer.

LOGS

Here's the output of the program. It shows whether each point is predicted to be an anomaly, its score, and the transformed co-ordinates of the data points.

Here's the pipeline:

var rpcaProjection = ct.Transforms.Concatenate("Features", "GeneOneScore", "GeneTwoScore")
    .Append(ct.Transforms.NormalizeMeanVariance("NormalisedFeatures", "Features"))
    .Append(ct.AnomalyDetection.Trainers.RandomizedPca(featureColumnName: "NormalisedFeatures", rank: 2));
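
The fit/transform/predict steps are roughly as follows (continuing the sketch above; ct is the MLContext, and GenePrediction is a placeholder output class using ML.NET's standard anomaly-detection columns):

public class GenePrediction
{
    // Standard anomaly-detection output columns.
    public bool PredictedLabel { get; set; }
    public float Score { get; set; }
}

// Fit on the toy data, then transform the same data to inspect the scores.
var dataView = ct.Data.LoadFromEnumerable(trainingData);
var model = rpcaProjection.Fit(dataView);
var scored = model.Transform(dataView);

// Score the deliberately extreme point.
var engine = ct.Model.CreatePredictionEngine<GeneData, GenePrediction>(model);
var outlier = engine.Predict(new GeneData { GeneOneScore = 10000f, GeneTwoScore = 25000f });
Console.WriteLine($"{outlier.PredictedLabel}, {outlier.Score}");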

Results from transforming first 20 training data: Predicted, score, PCA co-ordinates

True, 0.005851699, 0.7284454, 0.6617603
True, 0.002783414, 1.028686, 0.7808771
True, 0.004348077, 0.9824907, 1.543225
True, 0.004398529, 1.021341, 0.510879
True, 0.003683135, 1.005801, 1.091905
False, NaN, 0.9997306, 1.01117
True, 0.003618588, 1.004708, 0.8854353
True, 0.004349507, 0.9971809, 1.588225
True, 0.004412429, 1.018427, 0.4791145
True, 0.003997049, 1.008593, 1.246756
True, 0.004217377, 1.00574, 1.322197
True, 0.004192073, 0.9794555, 1.412197
True, 0.004289094, 1.019702, 1.569695
True, 0.005468629, 1.021341, 1.083963
True, 0.003488006, 1.007865, 0.7861712
False, NaN, 1.025348, 0.9171998
True, 0.004403637, 1.020734, 1.508814
False, NaN, 0.9888039, 0.9224939
True, 0.004757768, 1.009807, 0.8113181
True, 0.004348857, 1.011143, 0.4394088

Results from transforming the "anomaly": Predicted, score, PCA co-ordinates
True, 0.006252703, 121407.6, 82720.04

I've put this repo on GitHub here:

https://github.com/LDWDev/MLWoes/blob/master/MLtestapp/Program.cs

Could anyone point out what is going on here? I would not expect the score to be the value it is.

Do I need to give the trainer anomalous data? Why does the scorer think such distant values should be considered 'normal'? I have had trouble finding useful documentation/tutorials on this.

@colbylwilliams (Member)

Likely related to #3990

@antoniovs1029 (Member)

Hi @LDWDev, it seems things have changed since you opened this issue, particularly since #4039 added samples and other methods related to PCA anomaly detection.

If I run your code as it is right now, I get the following output, which is different from your original output (where most of the input points were labeled as anomalies):

Results from transforming first 20 training data: Predicted, score, PCA co-ordinates
False, 0.003840973, 1.0243503, 0.77820677
False, 0.004364683, 0.9783494, 1.5379477
False, 0.0042510876, 1.0170361, 0.5091319
False, 0.003400095, 1.0015614, 1.0881705
False, NaN, 0.9955166, 1.0077118
False, 0.0014221083, 1.0004733, 0.8824073
False, 0.0042578345, 0.99297774, 1.5827934
False, 0.004307051, 1.0141345, 0.47747603
False, 0.004401572, 1.004342, 1.2424928
False, 0.0043062223, 1.001501, 1.3176755
False, 0.0042433045, 0.975327, 1.4073672
False, 0.004442298, 1.015404, 1.5643276
False, 0.0017190819, 1.0170361, 1.0802565
False, 0.0035820855, 1.0036166, 0.78348273
False, 0.0007224202, 1.0210257, 0.9140633
False, 0.0042893523, 1.0164316, 1.5036538
False, NaN, 0.984636, 0.91933924
False, 0.0043272316, 1.0055509, 0.8085436
False, 0.0044419686, 1.0068808, 0.43790618
False, 0.0043820036, 1.0028309, 0.61069447
Results from transforming the "anomaly": Predicted, score, PCA co-ordinates
False, 0.006146703, 120895.81, 82437.16

Notice that now none of them were tagged as anomalies.

Still, if I change the rank you used (rank = 2) to a lower value (rank = 1), as suggested here, I get the following output, which is probably more like what you expected to see (the one-line pipeline change is shown after the output):

Results from transforming first 20 training data: Predicted, score, PCA co-ordinates
False, 0.2112529, 1.0243503, 0.77820677
False, 0.16840076, 0.9783494, 1.5379477
False, 0.24335083, 1.0170361, 0.5091319
False, 0.109844014, 1.0015614, 1.0881705
False, 0.1439669, 0.9955166, 1.0077118
True, 0.50474775, 1.0004733, 0.8824073
False, 0.19581862, 0.99297774, 1.5827934
False, 0.2489134, 1.0141345, 0.47747603
False, 0.18574826, 1.004342, 1.2424928
False, 0.18869121, 1.001501, 1.3176755
False, 0.14238065, 0.975327, 1.4073672
False, 0.23007028, 1.015404, 1.5643276
False, 0.21633859, 1.0170361, 1.0802565
False, 0.33171, 1.0036166, 0.78348273
False, 0.2030785, 1.0210257, 0.9140633
False, 0.23109731, 1.0164316, 1.5036538
True, 0.90728194, 0.984636, 0.91933924
False, 0.3354059, 1.0055509, 0.8085436
False, 0.2617581, 1.0068808, 0.43790618
False, 0.28585696, 1.0028309, 0.61069447
Results from transforming the "anomaly": Predicted, score, PCA co-ordinates
True, 0.9362057, 120895.81, 82437.16
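
For clarity, the only change relative to your pipeline is the rank parameter:

var rpcaProjection = ct.Transforms.Concatenate("Features", "GeneOneScore", "GeneTwoScore")
    .Append(ct.Transforms.NormalizeMeanVariance("NormalisedFeatures", "Features"))
    .Append(ct.AnomalyDetection.Trainers.RandomizedPca(featureColumnName: "NormalisedFeatures", rank: 1));

The intuition is that with only two input features, a rank of 2 lets the model reconstruct every point almost exactly, so nothing produces a large reconstruction-error score; with rank 1, the extreme point cannot be represented in the retained subspace and is flagged as an anomaly.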

So I will close this issue, but please feel free to reopen it if you still have problems with this. Thanks.

antoniovs1029 added the P3 (Doc bugs, questions, minor issues, etc.) and question (Further information is requested) labels Jan 9, 2020
antoniovs1029 self-assigned this May 6, 2020
ghost locked as resolved and limited conversation to collaborators Mar 21, 2022