
More Normalizer Scrubbing #2888


Merged: 13 commits, Mar 14, 2019
6 changes: 3 additions & 3 deletions docs/code/MlNetCookBook.md
@@ -595,7 +595,7 @@ As a general rule, *if you use a parametric learner, you need to make sure your

ML.NET offers several built-in scaling algorithms, or 'normalizers':
- MinMax normalizer: for each feature, learn its minimum and maximum values, and then linearly rescale it so that the values fit between -1 and 1.
- MeanVar normalizer: for each feature, compute the mean and variance, and then linearly rescale it to zero-mean, unit-variance.
- MeanVariance normalizer: for each feature, compute the mean and variance, and then linearly rescale it to zero-mean, unit-variance.
@rogancarr (Contributor), Mar 12, 2019

> MeanVariance

In the field, we usually use the term Standardize for this normalization technique. This is very "statistics-y", but it does seem to be standard. How would everyone feel about changing "MeanVariance" to "Standardize", or at least offering a "Standardize" alias? #Resolved

@wschin (Member, Author), Mar 12, 2019

I don't feel MVN is super bad, because it's already an operator in neural-network toolkits (e.g., ONNX, Caffe, CoreML).
Just for reference:
https://apple.github.io/coremltools/coremlspecification/sections/NeuralNetwork.html#meanvariancenormalizelayerparams
https://github.com/onnx/onnx/blob/master/docs/Operators.md#MeanVarianceNormalization


Contributor:

That's a good point. Plus it's more precise than Standardize. Let's keep it MVN. We can always add an alias for Standardize if people are up in arms.


- CDF normalizer: for each feature, compute the mean and variance, and then replace each value `x` with `Cdf(x)`, where `Cdf` is the cumulative distribution function of the normal distribution with that mean and variance.
- Binning normalizer: discretize the feature value into `N` 'buckets', and then replace each value with the index of the bucket, divided by `N-1`.
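The four formulas above can be sketched in a few lines. This is an illustrative Python sketch of the math only, not the ML.NET implementation (which streams over an `IDataView` and handles sparsity); the MinMax variant shown is the `fixZero`-style one that divides by the maximum absolute value so zero stays untouched.

```python
import math

def min_max(xs):
    # fixZero-style MinMax: divide by max |x| so zero maps to zero
    # and values land in [-1, 1].
    scale = max(abs(x) for x in xs) or 1.0
    return [x / scale for x in xs]

def mean_variance(xs):
    # Rescale to zero mean and unit variance.
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs)) or 1.0
    return [(x - mean) / std for x in xs]

def cdf(xs):
    # Replace x with Cdf(x) of the normal distribution fitted to the data.
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs)) or 1.0
    return [0.5 * (1 + math.erf((x - mean) / (std * math.sqrt(2)))) for x in xs]

def binning(xs, n_bins):
    # Replace x with its bucket index divided by (n_bins - 1).
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins or 1.0
    return [min(int((x - lo) / width), n_bins - 1) / (n_bins - 1) for x in xs]
```

All four are learned transforms: the statistics (min/max, mean/variance, bucket edges) come from the training data and are then applied unchanged at prediction time.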

@@ -630,8 +630,8 @@ var trainData = mlContext.Data.LoadFromTextFile<IrisInputAllFeatures>(dataPath,
var pipeline =
mlContext.Transforms.Normalize(
new NormalizingEstimator.MinMaxColumnOptions("MinMaxNormalized", "Features", fixZero: true),
new NormalizingEstimator.MeanVarColumnOptions("MeanVarNormalized", "Features", fixZero: true),
new NormalizingEstimator.BinningColumnOptions("BinNormalized", "Features", numBins: 256));
new NormalizingEstimator.MeanVarianceColumnOptions("MeanVarNormalized", "Features", fixZero: true),
new NormalizingEstimator.BinningColumnOptions("BinNormalized", "Features", maximumBinCount: 256));

// Let's train our pipeline of normalizers, and then apply it to the same data.
var normalizedData = pipeline.Fit(trainData).Transform(trainData);
17 changes: 1 addition & 16 deletions docs/samples/Microsoft.ML.Samples/Dynamic/Normalizer.cs
@@ -32,15 +32,7 @@ public static void Example()
// The transformed (normalized according to Normalizer.NormalizerMode.MinMax) data.
var transformer = pipeline.Fit(trainData);

var modelParams = transformer.Columns
.First(x => x.Name == "Induced")
.ModelParameters as NormalizingTransformer.AffineNormalizerModelParameters<float>;

Console.WriteLine($"The normalization parameters are: Scale = {modelParams.Scale} and Offset = {modelParams.Offset}");
//Preview
//
//The normalization parameters are: Scale = 0.5 and Offset = 0"

// Normalize the data.
var transformedData = transformer.Transform(trainData);

// Getting the data of the newly created column, so we can preview it.
@@ -94,13 +86,6 @@ public static void Example()
// 0
// 0
// 0.1586974

// Inspect the weights of normalizing the columns
var multiColModelParams = multiColtransformer.Columns
.First(x=> x.Name == "LogInduced")
.ModelParameters as NormalizingTransformer.CdfNormalizerModelParameters<float>;

Console.WriteLine($"The normalization parameters are: Mean = {multiColModelParams.Mean} and Stddev = {multiColModelParams.Stddev}");
}
}
}
@@ -37,7 +37,7 @@ public static void Example()
};

// A pipeline to project the Features column into random Fourier space.
var rffPipeline = ml.Transforms.RandomFourierKernelMap(nameof(SamplesUtils.DatasetUtils.SampleVectorOfNumbersData.Features), rank: 4);
var rffPipeline = ml.Transforms.ApproximatedKernelMap(nameof(SamplesUtils.DatasetUtils.SampleVectorOfNumbersData.Features), rank: 4);
// The transformed (projected) data.
var transformedData = rffPipeline.Fit(trainData).Transform(trainData);
// Getting the data of the newly created column, so we can preview it.
@@ -55,7 +55,7 @@ public static void Example()
//0.165 0.117 -0.547 0.014

// A pipeline to project the Features column into an L-p normalized vector.
var lpNormalizePipeline = ml.Transforms.LpNormalize(nameof(SamplesUtils.DatasetUtils.SampleVectorOfNumbersData.Features), normKind: Transforms.LpNormalizingEstimatorBase.NormFunction.L1);
var lpNormalizePipeline = ml.Transforms.NormalizeLpNorm(nameof(SamplesUtils.DatasetUtils.SampleVectorOfNumbersData.Features), norm: Transforms.LpNormNormalizingEstimatorBase.NormFunction.L1);
// The transformed (projected) data.
transformedData = lpNormalizePipeline.Fit(trainData).Transform(trainData);
// Getting the data of the newly created column, so we can preview it.
@@ -73,7 +73,7 @@ public static void Example()
// 0.133 0.156 0.178 0.200 0.000 0.022 0.044 0.067 0.089 0.111

// A pipeline to apply global contrast normalization to the Features column.
var gcNormalizePipeline = ml.Transforms.GlobalContrastNormalize(nameof(SamplesUtils.DatasetUtils.SampleVectorOfNumbersData.Features), ensureZeroMean:false);
var gcNormalizePipeline = ml.Transforms.NormalizeGlobalContrast(nameof(SamplesUtils.DatasetUtils.SampleVectorOfNumbersData.Features), ensureZeroMean:false);
// The transformed (projected) data.
transformedData = gcNormalizePipeline.Fit(trainData).Transform(trainData);
// Getting the data of the newly created column, so we can preview it.
80 changes: 41 additions & 39 deletions src/Microsoft.ML.Data/Transforms/NormalizeColumn.cs
@@ -50,8 +50,9 @@ internal sealed partial class NormalizeTransform
{
public abstract class ColumnBase : OneToOneColumn
{
[Argument(ArgumentType.AtMostOnce, HelpText = "Max number of examples used to train the normalizer", ShortName = "maxtrain")]
public long? MaxTrainingExamples;
[Argument(ArgumentType.AtMostOnce, HelpText = "Max number of examples used to train the normalizer",
Name = "MaxTrainingExamples", ShortName = "maxtrain")]
public long? MaximumExampleCount;

private protected ColumnBase()
{
@@ -60,29 +61,29 @@ private protected ColumnBase()
private protected override bool TryUnparseCore(StringBuilder sb)
{
Contracts.AssertValue(sb);
if (MaxTrainingExamples != null)
if (MaximumExampleCount != null)
return false;
return base.TryUnparseCore(sb);
}
}

// REVIEW: Support different aggregators on different columns, eg, MinMax vs Variance/ZScore.
public abstract class FixZeroColumnBase : ColumnBase
public abstract class ControlZeroColumnBase : ColumnBase
{
// REVIEW: This only allows mapping either zero or min to zero. It might make sense to allow also max, midpoint and mean to be mapped to zero.
[Argument(ArgumentType.AtMostOnce, HelpText = "Whether to map zero to zero, preserving sparsity", ShortName = "zero")]
public bool? FixZero;
[Argument(ArgumentType.AtMostOnce, Name="FixZero", HelpText = "Whether to map zero to zero, preserving sparsity", ShortName = "zero")]
public bool? EnsureZeroUntouched;

private protected override bool TryUnparseCore(StringBuilder sb)
{
Contracts.AssertValue(sb);
if (FixZero != null)
if (EnsureZeroUntouched != null)
return false;
return base.TryUnparseCore(sb);
}
}

public sealed class AffineColumn : FixZeroColumnBase
public sealed class AffineColumn : ControlZeroColumnBase
{
internal static AffineColumn Parse(string str)
{
@@ -101,7 +102,7 @@ internal bool TryUnparse(StringBuilder sb)
}
}

public sealed class BinColumn : FixZeroColumnBase
public sealed class BinColumn : ControlZeroColumnBase
{
[Argument(ArgumentType.AtMostOnce, HelpText = "Max number of bins, power of 2 recommended", ShortName = "bins")]
[TGUI(Label = "Max number of bins")]
@@ -147,22 +148,22 @@ internal bool TryUnparse(StringBuilder sb)

private static class Defaults
{
public const bool FixZero = true;
public const bool EnsureZeroUntouched = true;
public const bool MeanVarCdf = false;
public const bool LogMeanVarCdf = true;
public const int NumBins = 1024;
public const int MinBinSize = 10;
}

public abstract class FixZeroArgumentsBase : ArgumentsBase
public abstract class ControlZeroArgumentsBase : ArgumentsBase
{
// REVIEW: This only allows mapping either zero or min to zero. It might make sense to allow also max, midpoint and mean to be mapped to zero.
// REVIEW: Convert this to bool? or even an enum{Auto, No, Yes}, and automatically map zero to zero when it is null/Auto.
[Argument(ArgumentType.AtMostOnce, HelpText = "Whether to map zero to zero, preserving sparsity", ShortName = "zero")]
public bool FixZero = Defaults.FixZero;
[Argument(ArgumentType.AtMostOnce, Name = "FixZero", HelpText = "Whether to map zero to zero, preserving sparsity", ShortName = "zero")]
public bool EnsureZeroUntouched = Defaults.EnsureZeroUntouched;
}

public abstract class AffineArgumentsBase : FixZeroArgumentsBase
public abstract class AffineArgumentsBase : ControlZeroArgumentsBase
{
[Argument(ArgumentType.Multiple | ArgumentType.Required, HelpText = "New column definition(s) (optional form: name:src)", Name = "Column", ShortName = "col", SortOrder = 1)]
public AffineColumn[] Columns;
@@ -182,8 +183,9 @@ public sealed class MeanVarArguments : AffineArgumentsBase

public abstract class ArgumentsBase : TransformInputBase
{
[Argument(ArgumentType.AtMostOnce, HelpText = "Max number of examples used to train the normalizer", ShortName = "maxtrain")]
public long MaxTrainingExamples = 1000000000;
[Argument(ArgumentType.AtMostOnce, HelpText = "Max number of examples used to train the normalizer",
Name = "MaxTrainingExamples", ShortName = "maxtrain")]
public long MaximumExampleCount = 1000000000;

public abstract OneToOneColumn[] GetColumns();

@@ -217,7 +219,7 @@ public sealed class LogMeanVarArguments : ArgumentsBase
public override OneToOneColumn[] GetColumns() => Columns;
}

public abstract class BinArgumentsBase : FixZeroArgumentsBase
public abstract class BinArgumentsBase : ControlZeroArgumentsBase
{
[Argument(ArgumentType.Multiple, HelpText = "New column definition(s) (optional form: name:src)", Name = "Column", ShortName = "col", SortOrder = 1)]
public BinColumn[] Columns;
@@ -291,8 +293,8 @@ internal static IDataTransform Create(IHostEnvironment env, MinMaxArguments args
.Select(col => new NormalizingEstimator.MinMaxColumnOptions(
col.Name,
col.Source ?? col.Name,
col.MaxTrainingExamples ?? args.MaxTrainingExamples,
col.FixZero ?? args.FixZero))
col.MaximumExampleCount ?? args.MaximumExampleCount,
col.EnsureZeroUntouched ?? args.EnsureZeroUntouched))
.ToArray();
var normalizer = new NormalizingEstimator(env, columns);
return normalizer.Fit(input).MakeDataTransform(input);
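The `col.MaximumExampleCount ?? args.MaximumExampleCount` pattern recurring through these `Create` methods resolves per-column options against transform-wide defaults: a column-level setting wins when present, and the global argument fills the gap. A minimal Python sketch of that resolution (illustrative only; the dictionary keys are just stand-ins for the C# option fields):

```python
def resolve(column_value, default_value):
    # Mirrors C#'s null-coalescing `col.X ?? args.X`: keep the per-column
    # value when it is set, even if falsy (e.g. False or 0), otherwise
    # fall back to the transform-wide default.
    return column_value if column_value is not None else default_value

# Hypothetical per-column options: no example-count override,
# but an explicit EnsureZeroUntouched=False.
col = {"MaximumExampleCount": None, "EnsureZeroUntouched": False}
args = {"MaximumExampleCount": 1_000_000_000, "EnsureZeroUntouched": True}

resolved = {key: resolve(col[key], args[key]) for key in args}
```

Note that an `is not None` check (like C#'s `??`) is essential here: a plain `or` would silently discard a column's explicit `False`.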
@@ -306,11 +308,11 @@ internal static IDataTransform Create(IHostEnvironment env, MeanVarArguments arg
env.CheckValue(args.Columns, nameof(args.Columns));

var columns = args.Columns
.Select(col => new NormalizingEstimator.MeanVarColumnOptions(
.Select(col => new NormalizingEstimator.MeanVarianceColumnOptions(
col.Name,
col.Source ?? col.Name,
col.MaxTrainingExamples ?? args.MaxTrainingExamples,
col.FixZero ?? args.FixZero))
col.MaximumExampleCount ?? args.MaximumExampleCount,
col.EnsureZeroUntouched ?? args.EnsureZeroUntouched))
.ToArray();
var normalizer = new NormalizingEstimator(env, columns);
return normalizer.Fit(input).MakeDataTransform(input);
@@ -326,10 +328,10 @@ internal static IDataTransform Create(IHostEnvironment env, LogMeanVarArguments
env.CheckValue(args.Columns, nameof(args.Columns));

var columns = args.Columns
.Select(col => new NormalizingEstimator.LogMeanVarColumnOptions(
.Select(col => new NormalizingEstimator.LogMeanVarianceColumnOptions(
col.Name,
col.Source ?? col.Name,
col.MaxTrainingExamples ?? args.MaxTrainingExamples,
col.MaximumExampleCount ?? args.MaximumExampleCount,
args.UseCdf))
.ToArray();
var normalizer = new NormalizingEstimator(env, columns);
@@ -349,8 +351,8 @@ internal static IDataTransform Create(IHostEnvironment env, BinArguments args, I
.Select(col => new NormalizingEstimator.BinningColumnOptions(
col.Name,
col.Source ?? col.Name,
col.MaxTrainingExamples ?? args.MaxTrainingExamples,
col.FixZero ?? args.FixZero,
col.MaximumExampleCount ?? args.MaximumExampleCount,
col.EnsureZeroUntouched ?? args.EnsureZeroUntouched,
col.NumBins ?? args.NumBins))
.ToArray();
var normalizer = new NormalizingEstimator(env, columns);
@@ -927,8 +929,8 @@ public static IColumnFunctionBuilder CreateBuilder(MinMaxArguments args, IHost h
return CreateBuilder(new NormalizingEstimator.MinMaxColumnOptions(
args.Columns[icol].Name,
args.Columns[icol].Source ?? args.Columns[icol].Name,
args.Columns[icol].MaxTrainingExamples ?? args.MaxTrainingExamples,
args.Columns[icol].FixZero ?? args.FixZero), host, srcIndex, srcType, cursor);
args.Columns[icol].MaximumExampleCount ?? args.MaximumExampleCount,
args.Columns[icol].EnsureZeroUntouched ?? args.EnsureZeroUntouched), host, srcIndex, srcType, cursor);
}

public static IColumnFunctionBuilder CreateBuilder(NormalizingEstimator.MinMaxColumnOptions column, IHost host,
@@ -961,15 +963,15 @@ public static IColumnFunctionBuilder CreateBuilder(MeanVarArguments args, IHost
Contracts.AssertValue(host);
host.AssertValue(args);

return CreateBuilder(new NormalizingEstimator.MeanVarColumnOptions(
return CreateBuilder(new NormalizingEstimator.MeanVarianceColumnOptions(
args.Columns[icol].Name,
args.Columns[icol].Source ?? args.Columns[icol].Name,
args.Columns[icol].MaxTrainingExamples ?? args.MaxTrainingExamples,
args.Columns[icol].FixZero ?? args.FixZero,
args.Columns[icol].MaximumExampleCount ?? args.MaximumExampleCount,
args.Columns[icol].EnsureZeroUntouched ?? args.EnsureZeroUntouched,
args.UseCdf), host, srcIndex, srcType, cursor);
}

public static IColumnFunctionBuilder CreateBuilder(NormalizingEstimator.MeanVarColumnOptions column, IHost host,
public static IColumnFunctionBuilder CreateBuilder(NormalizingEstimator.MeanVarianceColumnOptions column, IHost host,
int srcIndex, DataViewType srcType, DataViewRowCursor cursor)
{
Contracts.AssertValue(host);
@@ -1001,14 +1003,14 @@ public static IColumnFunctionBuilder CreateBuilder(LogMeanVarArguments args, IHo
Contracts.AssertValue(host);
host.AssertValue(args);

return CreateBuilder(new NormalizingEstimator.LogMeanVarColumnOptions(
return CreateBuilder(new NormalizingEstimator.LogMeanVarianceColumnOptions(
args.Columns[icol].Name,
args.Columns[icol].Source ?? args.Columns[icol].Name,
args.Columns[icol].MaxTrainingExamples ?? args.MaxTrainingExamples,
args.Columns[icol].MaximumExampleCount ?? args.MaximumExampleCount,
args.UseCdf), host, srcIndex, srcType, cursor);
}

public static IColumnFunctionBuilder CreateBuilder(NormalizingEstimator.LogMeanVarColumnOptions column, IHost host,
public static IColumnFunctionBuilder CreateBuilder(NormalizingEstimator.LogMeanVarianceColumnOptions column, IHost host,
int srcIndex, DataViewType srcType, DataViewRowCursor cursor)
{
Contracts.AssertValue(host);
@@ -1044,8 +1046,8 @@ public static IColumnFunctionBuilder CreateBuilder(BinArguments args, IHost host
return CreateBuilder(new NormalizingEstimator.BinningColumnOptions(
args.Columns[icol].Name,
args.Columns[icol].Source ?? args.Columns[icol].Name,
args.Columns[icol].MaxTrainingExamples ?? args.MaxTrainingExamples,
args.Columns[icol].FixZero ?? args.FixZero,
args.Columns[icol].MaximumExampleCount ?? args.MaximumExampleCount,
args.Columns[icol].EnsureZeroUntouched ?? args.EnsureZeroUntouched,
args.Columns[icol].NumBins ?? args.NumBins), host, srcIndex, srcType, cursor);
}

@@ -1095,8 +1097,8 @@ public static IColumnFunctionBuilder CreateBuilder(SupervisedBinArguments args,
args.Columns[icol].Name,
args.Columns[icol].Source ?? args.Columns[icol].Name,
args.LabelColumn ?? DefaultColumnNames.Label,
args.Columns[icol].MaxTrainingExamples ?? args.MaxTrainingExamples,
args.Columns[icol].FixZero ?? args.FixZero,
args.Columns[icol].MaximumExampleCount ?? args.MaximumExampleCount,
args.Columns[icol].EnsureZeroUntouched ?? args.EnsureZeroUntouched,
args.Columns[icol].NumBins ?? args.NumBins,
args.MinBinSize),
host, labelColumnId, srcIndex, srcType, cursor);