Skip to content

Scrub projection transforms #2865

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Mar 8, 2019
Merged

Scrub projection transforms #2865

merged 14 commits into from
Mar 8, 2019

Conversation

wschin
Copy link
Member

@wschin wschin commented Mar 6, 2019

Fix #2831.

Transforms touched:
RFF, LpNorm, GcNorm, PCA, Whiten, Normalize.

@wschin wschin changed the title Scrub norm Scrub projection transforms Mar 6, 2019
@wschin wschin requested review from Ivanidzo4ka, artidoro and sfilipi and removed request for artidoro March 6, 2019 01:36
@wschin wschin changed the title Scrub projection transforms [WIP] Scrub projection transforms Mar 6, 2019
@codecov
Copy link

codecov bot commented Mar 6, 2019

Codecov Report

Merging #2865 into master will decrease coverage by <.01%.
The diff coverage is 86.34%.

@@            Coverage Diff             @@
##           master    #2865      +/-   ##
==========================================
- Coverage   71.79%   71.79%   -0.01%     
==========================================
  Files         812      812              
  Lines      142680   142673       -7     
  Branches    16124    16124              
==========================================
- Hits       102436   102430       -6     
  Misses      35831    35831              
+ Partials     4413     4412       -1
Flag Coverage Δ
#Debug 71.79% <86.34%> (-0.01%) ⬇️
#production 67.92% <83.85%> (ø) ⬆️
#test 86.23% <95.45%> (-0.01%) ⬇️
Impacted Files Coverage Δ
.../Microsoft.ML.Data/Transforms/TransformsCatalog.cs 100% <ø> (ø) ⬆️
...rosoft.ML.Transforms/FourierDistributionSampler.cs 84.16% <ø> (ø) ⬆️
test/Microsoft.ML.Benchmarks/RffTransform.cs 0% <0%> (ø) ⬆️
...s/StochasticDualCoordinateAscentClassifierBench.cs 0% <0%> (ø) ⬆️
...icrosoft.ML.Mkl.Components/MklComponentsCatalog.cs 94.28% <0%> (ø) ⬆️
...icrosoft.ML.Functional.Tests/DataTransformation.cs 100% <100%> (ø) ⬆️
src/Microsoft.ML.Data/Transforms/Normalizer.cs 84.81% <100%> (ø) ⬆️
...ents.StaticPipe/VectorWhiteningStaticExtensions.cs 100% <100%> (ø) ⬆️
...ft.ML.Tests/TrainerEstimators/TrainerEstimators.cs 92.5% <100%> (ø) ⬆️
test/Microsoft.ML.Tests/Transformers/RffTests.cs 100% <100%> (ø) ⬆️
... and 24 more

string features,
string weights = null,
string featureColumnName,
string exampleWeightColumnName = null,
Copy link
Member

@sfilipi sfilipi Mar 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exampleWeightColumnName [](start = 19, length = 23)

does 'example' need to be in the name? just too long.. #WontFix

Copy link
Member Author

@wschin wschin Mar 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes but it's used everywhere. I will open an issue (#2873) instead.


In reply to: 262785707 [](ancestors = 262785707)

/// <param name="newDim">Expected size of new vector.</param>
/// <param name="useSin">Create two features for every random Fourier frequency? (one for cos and one for sin) </param>
/// <param name="dimension">The number of random Fourier features to create.</param>
/// <param name="useCosAndSinBases">If <see langword="true"/>, use both of cos and sin basis functions to create two features for every random Fourier frequency. /// Otherwise, only cos bases would be used.</param>
Copy link
Member

@sfilipi sfilipi Mar 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/// Otherwise, only cos bases would be used. [](start = 171, length = 52)

new line #Resolved

Copy link
Member

@sfilipi sfilipi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@@ -18,18 +18,18 @@ public static class PcaCatalog
/// <param name="exampleWeightColumnName">The name of the example weight column (optional).</param>
/// <param name="rank">The number of principal components.</param>
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Mar 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to have consistency among transforms which lower feature vector dimension.
Right now we have dimension and rank and pcaNum in whitening.
#Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


In reply to: 262791198 [](ancestors = 262791198)

@wschin wschin changed the title [WIP] Scrub projection transforms Scrub projection transforms Mar 6, 2019
@wschin wschin requested review from artidoro and Ivanidzo4ka March 6, 2019 23:15
@artidoro
Copy link
Contributor

artidoro commented Mar 7, 2019

I know @TomFinley has some opinions about how MLContext should be organized.
In your PR you remove the Transforms.Projection catalog section and instead create three others (I also added the transforms that go in each section so it's easier to review):

  • Transforms.DimensionReduction
    • VectorWhitening
    • PCA
  • Transforms.KernelExpansion
    • CreateRandomFourierFeatures
  • Transforms.Normalization
    • Normalize
    • LpNormalize
    • GlobalContrastNormalize

Since this introduces quite a few sections I think @TomFinley should take a look before we can check it in. #Resolved

/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.</param>
/// <param name="inputColumnName">Name of column to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.</param>
/// <param name="dimension">The number of random Fourier features to create.</param>
/// <param name="useCosAndSinBases">If <see langword="true"/>, use both of cos and sin basis functions to create two features for every random Fourier frequency. /// Otherwise, only cos bases would be used.</param>
Copy link
Contributor

@artidoro artidoro Mar 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/// [](start = 171, length = 3)

new line #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch.


In reply to: 263198787 [](ancestors = 263198787)

/// ]]>
/// </format>
/// </example>
public static RandomFourierExpansionEstimator RandomFourierExpand(this TransformsCatalog.KernelExpansionTransforms catalog,
Copy link
Contributor

@artidoro artidoro Mar 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RandomFourierExpansionEstimator [](start = 22, length = 31)

Why Random? I like the Expansion part, but why Random? I don't think the process it that random right? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Formal name: randomized Fourier kernel approximation.


In reply to: 263199601 [](ancestors = 263199601)

@TomFinley
Copy link
Contributor

TomFinley commented Mar 7, 2019

Since this introduces quite a few sections I think @TomFinley should take a look before we can check it in.

Well, I feel like this is introducing too much categorization @wschin and @artidoro. So if the categories start to have one or two items, then I wonder what the point of categorization is. So why not just not categorize if we get to that point? Does anything really bad happen if transforms isn't endlessly categorized? This is starting to resemble a Linnaeusian taxonomy.

I see a few end states with categorization. We have in this PR at the time I am writing descriptive categories, but they are too fine grained. We had previously categories that were, while accurate descriptions, utterly useless in terms of communicating anything to users (projection). The alternative is that we don't categorize these at all, and so I wonder if that's the most desirable state if we can't come up with good descriptive categories that don't suffer from having only one or two items in them.

So how bad is it if we just put these things in transforms? #Resolved

@wschin
Copy link
Member Author

wschin commented Mar 7, 2019

Not bad! Let's do this.


In reply to: 470339262 [](ancestors = 470339262)

Copy link
Contributor

@artidoro artidoro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

/// </example>
public static NormalizingEstimator Normalize(this TransformsCatalog catalog,
string outputColumnName, string inputColumnName = null,
NormalizingEstimator.NormalizerMode mode = NormalizingEstimator.NormalizerMode.MinMax)
Copy link
Contributor

@artidoro artidoro Mar 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NormalizerMode [](start = 33, length = 14)

How about NormalizationMode? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.


In reply to: 263234205 [](ancestors = 263234205)

/// </format>
/// </example>
public static NormalizingEstimator Normalize(this TransformsCatalog catalog,
NormalizingEstimator.NormalizerMode mode,
Copy link
Contributor

@artidoro artidoro Mar 7, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NormalizerMode [](start = 33, length = 14)

NormalizationMode? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.


In reply to: 263234251 [](ancestors = 263234251)

@Ivanidzo4ka
Copy link
Contributor

Ivanidzo4ka commented Mar 7, 2019

        public string WeightColumn = PrincipalComponentAnalysisEstimator.Defaults.WeightColumn;

Can we update it to our default exampleWeightsColumnName i think. #Resolved


Refers to: src/Microsoft.ML.PCA/PcaTransformer.cs:44 in d4edc48. [](commit_id = d4edc48, deletion_comment = False)

@Ivanidzo4ka
Copy link
Contributor

Ivanidzo4ka commented Mar 7, 2019

    }

Need better names.
NormKind is not great. StdDev is not great, Linf is not great. #Resolved


Refers to: src/Microsoft.ML.Transforms/GcnTransform.cs:657 in d4edc48. [](commit_id = d4edc48, deletion_comment = False)

@Ivanidzo4ka
Copy link
Contributor

Ivanidzo4ka commented Mar 7, 2019

    Pca,

PrincipalComponentAnalysis
and
ZeroPhaseComponentAnalysis
#Resolved


Refers to: src/Microsoft.ML.Mkl.Components/VectorWhitening.cs:42 in d4edc48. [](commit_id = d4edc48, deletion_comment = False)

@Ivanidzo4ka
Copy link
Contributor

Ivanidzo4ka commented Mar 7, 2019

    public static RandomizedPcaTrainer RandomizedPca(this AnomalyDetectionCatalog.AnomalyDetectionTrainers catalog,

ApproximatePrincipalComponentsAnalysis? #Resolved


Refers to: src/Microsoft.ML.PCA/PCACatalog.cs:56 in d4edc48. [](commit_id = d4edc48, deletion_comment = False)

@Ivanidzo4ka
Copy link
Contributor

Ivanidzo4ka commented Mar 7, 2019

public sealed class RandomizedPcaTrainer : TrainerEstimatorBase<AnomalyPredictionTransformer<PcaModelParameters>, PcaModelParameters>

can we move away from PCA to PrincipalComponentAnalysis ? #Resolved


Refers to: src/Microsoft.ML.PCA/PcaTrainer.cs:42 in d4edc48. [](commit_id = d4edc48, deletion_comment = False)

@wschin
Copy link
Member Author

wschin commented Mar 7, 2019

    public static RandomizedPcaTrainer RandomizedPca(this AnomalyDetectionCatalog.AnomalyDetectionTrainers catalog,

Will do AnalyzeRandomizedPrincipleComponents


In reply to: 470688500 [](ancestors = 470688500)


Refers to: src/Microsoft.ML.PCA/PCACatalog.cs:56 in d4edc48. [](commit_id = d4edc48, deletion_comment = False)

@wschin
Copy link
Member Author

wschin commented Mar 7, 2019

public sealed class RandomizedPcaTrainer : TrainerEstimatorBase<AnomalyPredictionTransformer<PcaModelParameters>, PcaModelParameters>

Sure. I will replace RandomizedPcaTrainer with RandomizedPrincipalComponentAnalyzer.


In reply to: 470690330 [](ancestors = 470690330)


Refers to: src/Microsoft.ML.PCA/PcaTrainer.cs:42 in d4edc48. [](commit_id = d4edc48, deletion_comment = False)

@wschin
Copy link
Member Author

wschin commented Mar 7, 2019

    Pca,

Sure.


In reply to: 470687736 [](ancestors = 470687736)


Refers to: src/Microsoft.ML.Mkl.Components/VectorWhitening.cs:42 in d4edc48. [](commit_id = d4edc48, deletion_comment = False)

@wschin
Copy link
Member Author

wschin commented Mar 7, 2019

    }

How about

        public enum NormFunction : byte
        {
            L2Norm = 0,
            StandardDeviation = 1,
            L1Norm = 2,
            InfinityNorm = 3
        

In reply to: 470682324 [](ancestors = 470682324)


Refers to: src/Microsoft.ML.Transforms/GcnTransform.cs:657 in d4edc48. [](commit_id = d4edc48, deletion_comment = False)

@wschin
Copy link
Member Author

wschin commented Mar 7, 2019

        public string WeightColumn = PrincipalComponentAnalysisEstimator.Defaults.WeightColumn;

Sure. Btw, it should be "E" and without "s".


In reply to: 470681275 [](ancestors = 470681275)


Refers to: src/Microsoft.ML.PCA/PcaTransformer.cs:44 in d4edc48. [](commit_id = d4edc48, deletion_comment = False)

@wschin wschin merged commit f09b25f into dotnet:master Mar 8, 2019
@wschin wschin deleted the scrub-norm branch March 8, 2019 03:47
@wschin wschin mentioned this pull request Mar 8, 2019
@ghost ghost locked as resolved and limited conversation to collaborators Mar 23, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants