Direct API: Auto-normalization #433

TomFinley · 2018-06-27T17:47:56Z

One of the details of training that happen after a loader/transform pipeline is created, but before the cache. We've typically automatically done this for users. While usage in the API is very distinct in that people tend to like implicit behavior in tools but dislike implicit behavior in APIs, at least offering a convenience for normalization is appropriate.

Existing Method

Some familiar with this codebase are aware of this existing method, in the TrainUtils utility class, that serves a similar function.

machinelearning/src/Microsoft.ML.Data/Commands/TrainCommand.cs

Line 492 in 2501049

    
           public static bool AddNormalizerIfNeeded(IHostEnvironment env, IChannel ch, ITrainer trainer, ref IDataView view, string featureColumn, NormalizeOption autoNorm)

The goal of that method is not to provide a convenient API, so much as to factor out code common to the various commands that train models (e.g., train, traintest, cross-validation, some transforms like train-and-score). The same is true of many methods in that TrainUtils class. This as we see in the first few lines:

machinelearning/src/Microsoft.ML.Data/Commands/TrainCommand.cs

Lines 499 to 506 in 2501049

    
           ch.CheckUserArg(Enum.IsDefined(typeof(NormalizeOption), autoNorm), nameof(TrainCommand.Arguments.NormalizeFeatures), 
        
               "Normalize option is invalid. Specify one of 'norm=No', 'norm=Warn', 'norm=Auto', or 'norm=Yes'."); 
        
           if (autoNorm == NormalizeOption.No) 
        
           { 
        
               ch.Info("Not adding a normalizer."); 
        
               return false; 
        
           }

While beneficial in providing consistent behavior across all of these things from a command-line perspective, the condition where it just exits would be inappropriate to have in an ML.NET API -- you might imagine someone designing a method with a parameter bool doNothing where the first thing is, if it's true, the method returns without doing anything. Again, appropriate from the point of view of factoring out common code, but not appropriate for an API. Also the method of communicating important information to the user is via the console, which again is not the most helpful option for an API.

Proposed API Helpers

Nonetheless, this function has several things that are helpful to do: it detects if a trainer wants normalization, if data is normalized, and if appropriate and necessary applies normalization.

This would probably take the form of a static method on the NormalizerTransform class, perhaps following this signature:

public static bool CreateIfNeeded(IHostEnvironment env, ref RoleMappedData data, ITrainer trainer)

We could also have two additional methods to provide key information.

public static bool FeatureVectorIsNormalized(RoleMappedData data)
public static bool NeedsNormalization(this ITrainer trainer)

The text was updated successfully, but these errors were encountered:

TomFinley · 2018-06-27T17:48:28Z

Incidentally, followup on #371.

TomFinley · 2018-06-27T18:38:08Z

In the process of doing this, I notice that NormalizeTransform and its support classes is in the Microsoft.ML.Transforms project. This despite the fact that Microsoft.ML.Data (a more fundamental class) calls it here, so it has a dependency, it's just not obvious since it's happening via dependency injection.

machinelearning/src/Microsoft.ML.Data/Commands/TrainCommand.cs

Lines 538 to 546 in 2501049

    
           var component = new SubComponent<IDataTransform, SignatureDataTransform>("MinMax", string.Format("col={{ name={0} source={0} }}", quotedFeatureColumnName)); 
        
           var loader = view as IDataLoader; 
        
           if (loader != null) 
        
           { 
        
               view = CompositeDataLoader.Create(env, loader, 
        
                   new KeyValuePair<string, SubComponent<IDataTransform, SignatureDataTransform>>(null, component)); 
        
           } 
        
           else 
        
               view = component.CreateInstance(env, view);

(This itself is something that needs to change, though perhaps not as part of this PR -- this code is just so ugly and awful.)

Also more broadly normalization is a very practically important part of ML -- it really ought to be in the more fundamental DLL. As part of this work I will have to move it.

zeahmed · 2018-06-27T21:36:16Z

Thanks @TomFinley for looking into this.

While writing a test for new API, I also faced this issue. The argument to linear learners contains Caching and NormalizeFeatures options. However, these options don't work when Learner.Train method is called directly. This creates a bit of confusion when using API.

public abstract class LearnerInputBase
{
    [Argument(ArgumentType.Required, ShortName = "data", HelpText = "The data to be used for training", SortOrder = 1, Visibility = ArgumentAttribute.VisibilityType.EntryPointsOnly)]
    public IDataView TrainingData;

    [Argument(ArgumentType.AtMostOnce, HelpText = "Column to use for features", ShortName = "feat", SortOrder = 2, Visibility = ArgumentAttribute.VisibilityType.EntryPointsOnly)]
    public string FeatureColumn = DefaultColumnNames.Features;

    [Argument(ArgumentType.AtMostOnce, HelpText = "Normalize option for the feature column", ShortName = "norm", SortOrder = 5, Visibility = ArgumentAttribute.VisibilityType.EntryPointsOnly)]
    public NormalizeOption NormalizeFeatures = NormalizeOption.Auto;

    [Argument(ArgumentType.LastOccurenceWins, HelpText = "Whether learner should cache input training data", ShortName = "cache", SortOrder = 6, Visibility = ArgumentAttribute.VisibilityType.EntryPointsOnly)]
    public CachingOptions Caching = CachingOptions.Auto;
}

glebuk · 2018-06-28T00:15:11Z

One complication to consider is that some trainers expect specific kind of normalization. There should be a way to specialize from "NeedNormalization" to a "NeedMeanVarNormalization"

TomFinley · 2018-06-28T12:43:24Z

Hi @zeahmed yup, I was sort of planning on tackling caching next... or possibly making RoleMappedSchema/Data a bit less obnoxious to work with.

Hi @glebuk while probably worth looking into, I'd rather not design fundamentally new features as part of this, for three reasons:

First reason, I'm just trying to more or less recreate the convenience we already have. I'd just as soon not do anything fundamentally new. That is, I consider this out of scope for now.

Second, the problem of how algorithms can "control" their input to undergo certain desirable preprocessing is not a problem limited to this alone. If we insist on solving it, I might prefer something broader than this: something like an ITrainer could provide an optional interface method along the lines of void DoPreprocessing(ref RoleMappedData data) and apply whatever "special" transforms they thought were appropriate to them. I dislike this NeedMeanVarNormalization because it's way, way too specific.

Third, as you might tell I am very wary of doing so, for the reason I mentioned at the start of the issue about "implicit" behavior: every implicit convenience like this we write fundamentally renders the API less comprehensible, versus an explicit call. When it comes to tools and GUIs and whatnot providing conveniences like this, I agree the more the merrier. Something like an API, we should shift our thinking from "why not" to "why."

Indeed our prior (non-entry-points based) API (not written by me) took the position, that since it was an API, a user was responsible for whether they wanted their own normalization or not. An API is not a GUI, and we shouldn't pretend it is.

shauheen · 2018-07-17T17:17:21Z

Closing as fixed #446

TomFinley added the API Issues pertaining the friendly API label Jun 27, 2018

TomFinley self-assigned this Jun 27, 2018

TomFinley mentioned this issue Jun 28, 2018

Normalization API helpers #446

Merged

shauheen added this to the 0718 milestone Jun 28, 2018

shauheen closed this as completed Jul 17, 2018

TomFinley mentioned this issue Dec 20, 2018

Replace ColumnInfo usage with Schema.Column, remove ColumnInfo #1924

Merged

ghost locked as resolved and limited conversation to collaborators Mar 30, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Direct API: Auto-normalization #433

Direct API: Auto-normalization #433

TomFinley commented Jun 27, 2018

TomFinley commented Jun 27, 2018 •

edited

Loading

TomFinley commented Jun 27, 2018

zeahmed commented Jun 27, 2018 •

edited

Loading

glebuk commented Jun 28, 2018

TomFinley commented Jun 28, 2018

shauheen commented Jul 17, 2018

Direct API: Auto-normalization #433

Direct API: Auto-normalization #433

Comments

TomFinley commented Jun 27, 2018

Existing Method

Proposed API Helpers

TomFinley commented Jun 27, 2018 • edited Loading

TomFinley commented Jun 27, 2018

zeahmed commented Jun 27, 2018 • edited Loading

glebuk commented Jun 28, 2018

TomFinley commented Jun 28, 2018

shauheen commented Jul 17, 2018

TomFinley commented Jun 27, 2018 •

edited

Loading

zeahmed commented Jun 27, 2018 •

edited

Loading