Skip to content

[WIP] Transformers for AutoML #4157

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

michaelgsharp
Copy link
Contributor

This is still a work in progress, but will be updated and finished over the next several hours.

In order for AutoML to take a full dependency on ML.NET, several pieces of functionality need to be added. This PR is the first step in that process adding in 3 new transformers. The actual implementation of the transformers is done in C++. This code is currently in another repo, but will be turned in a Nuget package that ML.Net can take a binary dependency on. The code going into ML.Net is mostly wrapper code dealing with data interop and, since the binary is exposed via C api's with no overloading, code to call the correct methods based on the data type.

ToStringTransformer - Takes in a column and converts it into an appropriate string representation. It currently supports all integer types, float, double, bool, and string. This is fully done, and should be reviewed first.

CategoryImputer - Fill in missing values with the most common value in the column. It currently supports all integer types, float, double, bool, and string. 0's are treated as missing for integer types. This is done except for a small piece of the c++ interop.

DateTimeTransformer - Splits a column into its appropriate date time components (20 columns total are added). Takes a Long that represents seconds since 1970. The c++ interop will be added for this tonight and can be reviewed last.

Until we get the binary added as an official dependency, the constructors of the estimator will throw an error if the correct .dll is not found. This allows us to do a private drop for now. The tests will also be disabled currently for that reason.

@michaelgsharp michaelgsharp requested a review from a team as a code owner August 29, 2019 23:14
@ganik ganik changed the title Transformers for AutoML [WIP] Transformers for AutoML Aug 29, 2019
@@ -0,0 +1,1829 @@
using System;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) copyright on all code files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copyright has been added. Thanks for the reminder.

internal override void CreateTransformerFromSavedData(byte[] data)
{
bool result;
GCHandle handle = GCHandle.Alloc(data, GCHandleType.Pinned);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just use a fixed statement in C#. This is slightly faster than allocing and freeing a GCHandle. See the rest of our code (like in CpuMath) where we get a pointer for an array.

Here's an example, from one of our tests:

But you get the idea.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed all occurrences to fixed.

}
}

[DllImport("Featurizers", EntryPoint = "GetErrorInfoString", CallingConvention = CallingConvention.Cdecl), SuppressUnmanagedCodeSecurity]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These new transforms should be split out of the main Microsoft.ML nuget package. Since you are adding them to the main nuget package, EVERY application that uses ML.NET will have a dependency on these native assemblies, even if they don't use these new transforms.

See what we did with the transforms that depend on MKL - we split them into their own nuget package, so only people who needed those components would be forced to bring MKL along with their application.

https://www.nuget.org/packages/Microsoft.ML.Mkl.Components/

@ganik
Copy link
Member

ganik commented Aug 30, 2019

    [DllImport("Featurizers", EntryPoint = "GetErrorInfoString"), SuppressUnmanagedCodeSecurity]

NuGet dependency should be added to use this native assembly.


Refers to: src/Microsoft.ML.Transforms/ToStringTransformer.cs:313 in 260ff25. [](commit_id = 260ff25, deletion_comment = False)

@ganik
Copy link
Member

ganik commented Aug 30, 2019

    [DllImport("Featurizers", EntryPoint = "DestroyErrorInfoString"), SuppressUnmanagedCodeSecurity]

As I am aware the NuGet for this native lib is not ready for various Linux flavors and Mac. I dont think it can go into master branch until the NuGet has them


Refers to: src/Microsoft.ML.Transforms/ToStringTransformer.cs:316 in 260ff25. [](commit_id = 260ff25, deletion_comment = False)

@ganik
Copy link
Member

ganik commented Aug 30, 2019

public sealed class ToStringTransformerEstimator : IEstimator<ToStringTransformer>

This ToStringTransformer could be easily added as an add-on functionality to TypeConverting transform. And TypeConverting transform has SaveAsOnnx(..) functionality already


Refers to: src/Microsoft.ML.Transforms/ToStringTransformer.cs:73 in 260ff25. [](commit_id = 260ff25, deletion_comment = False)

Copy link
Member

@eerhardt eerhardt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please split these new Transformers out into a separate nuget package. We cannot add new dependencies in the core Microsoft.ML NuGet package. Not every ML.NET application will need these new dependencies, so they shouldn't be in the core. New dependencies need to be opted in to.

Copy link

@ebarsoumMS ebarsoumMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace Cpp with Native.
Why are you mixing DLLImport with the rest of the code?
Add comment to the public transformer, describing what it does.

}

public override void Dispose()
{

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why manually? Can we use SafeHandler?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am using a SafeHandler for all handles themselves. This is for the main transformer class itself, and it just calls dispose on any SafeHandles it has.

}
#endregion

// Using this to confirm DLL exists. If does it will just return false since no parameters are being passed.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need that? Isn't the DLL is part of the package?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be a dependency we get through nuget, but until we get the nuget up and running this is in there just as a check. Once the nuget is ready we will remove this check.


internal delegate bool DestroyCppTransformerEstimator(IntPtr estimator, out IntPtr errorHandle);
internal delegate bool DestroyTransformerSaveData(IntPtr buffer, IntPtr bufferSize, out IntPtr errorHandle);
internal delegate bool DestroyTransformedDataCpp(IntPtr output, IntPtr outputSize, out IntPtr errorHandle);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use Native instead of Cpp

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, changed all occurrences from Cpp to Native.

<Project Sdk="Microsoft.NET.Sdk" DefaultTargets="Pack">

<PropertyGroup>
<Authors>Intel</Authors>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to rename this file to be the name of the NuGet package.

Also, remove this Authors property.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed the files. It is creating separate nuget packages now in the bin/packages folder. It is not creating a folder under packages in the root though. Should it be creating them there as well?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little confusing, but $RepoRoot\packages are the NuGet packages that our code depends on - what are downloaded when we do a restore. $RepoRoot\bin\packages are the NuGet packages that our build produces.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, that makes sense. I have the $RepoRoot\bin\packages creating the NuGet package correctly and tested it with a local import into another solution. So we should be good then. Thanks for the instructions!

@@ -8,7 +8,7 @@
// </auto-generated>
//------------------------------------------------------------------------------

namespace Microsoft.ML.Transforms.Properties
namespace Microsoft.ML.AutoMLFeaturizers.Properties
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this was an intended change, was it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, good catch. Reverting that in next push.

using System.Runtime.CompilerServices;
using Microsoft.ML;

[assembly: InternalsVisibleTo(assemblyName: "Microsoft.ML.Tests" + PublicKey.TestValue)]
Copy link
Member

@eerhardt eerhardt Sep 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What internals are the tests using? I think you should be able to test everything from the public API, right? #Resolved

Copy link
Contributor Author

@michaelgsharp michaelgsharp Sep 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was testing the constructor that took in the Options class, but looking at the other Transformers and their tests they are going about it differently. Let me fix it and remove that VisibleTo tag. #Resolved

Copy link
Contributor Author

@michaelgsharp michaelgsharp Sep 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I just double checked, and I was testing the entry point that AutoML uses and it takes in the Options class which is internal. Is it ok to leave everything as is? #Resolved

Copy link
Member

@eerhardt eerhardt Sep 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, all our tests have IVT, so this is fine to leave. I was mostly just curious why it was needed here. IMO it is best to test without Internals Visible To, but if you need it don't worry. #Resolved

<TargetFramework>netstandard2.0</TargetFramework>
<IncludeInPackage>Microsoft.ML.Featurizers</IncludeInPackage>
<AllowUnsafeBlocks>true</AllowUnsafeBlocks>
<DefineConstants>CORECLR</DefineConstants>
Copy link
Member

@eerhardt eerhardt Sep 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this DefineConstant is being used in the new assembly. It can be removed. #Resolved

Copy link
Contributor Author

@michaelgsharp michaelgsharp Sep 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as the prior one. This was removed last night, but is not showing up in the PR for some reason even though it is in my branch. Will rebase and check again. #Resolved


<PropertyGroup>
<TargetFramework>netstandard2.0</TargetFramework>
<IncludeInPackage>Microsoft.ML.Featurizers</IncludeInPackage>
Copy link
Member

@eerhardt eerhardt Sep 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IncludeInPackage needs to match the name of the NuGet package. Or else this assembly won't be included in the right NuGet package. #Resolved

Copy link
Contributor Author

@michaelgsharp michaelgsharp Sep 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure why it is showing up like that. Last night I pushed a fix that had that change to the correct name. It looks like somehow the branch in my repo and this PR aren't syncing up correctly. The number of commits this PR shows is not the same as the number of commits in my branch itself? I will rebase and see if that fixes the issue. #Resolved

</ItemGroup>

<ItemGroup>
<Content Include="..\common\CommonPackage.props" Pack="true" PackagePath="build\netstandard2.0\$(MSBuildProjectName).props" />
Copy link
Member

@eerhardt eerhardt Sep 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line isn't necessary. This CommonPackage.props file probably should have a better name, but it is needed for NuGet packages that contain native assets inside of them. This one doesn't, so it is unnecessary. #Resolved

Copy link
Contributor Author

@michaelgsharp michaelgsharp Sep 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Removing in the next push. #Resolved

@@ -0,0 +1,16 @@
<Project Sdk="Microsoft.NET.Sdk" DefaultTargets="Pack">
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if AutoMLFeaturizers is the best name for this package. Technically, anyone can use these featurizers without using AutoML.

What is going to be the name of the underlying NuGet package that contains the native assemblies? We should probably align with that name. So if that NuGet package is going to be named "AutoMLFeaturizers", then I guess that would be a good name here as well.

Copy link
Contributor Author

@michaelgsharp michaelgsharp Sep 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will sync with my team today and figure that out. Part of the reason I picked this name is that these Transformers are added specifically for the AutoML team. So while anyone can use them, they are custom tailored for their use. #Resolved

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about naming it NativeFeaturizers? Hinting that implementation is in native code


In reply to: 323317152 [](ancestors = 323317152)

[DllImport("Featurizers", EntryPoint = "GetErrorInfoString"), SuppressUnmanagedCodeSecurity]
private static extern bool CheckIfDllExists(IntPtr error, out IntPtr errorHandleString, out IntPtr errorHandleStringSize);

public static ToStringTransformerEstimator Create(IHostEnvironment env, string outputColumnName, string inputColumnName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These methods shouldn't be public. External users should only create the estimators from the Catalog APIs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed them all to internal. Fixed in next push.

}
}

internal class TransformedDataSafeHandler : SafeHandleZeroOrMinusOneIsInvalid
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(minor) The .NET type is a SafeHandle, not SafeHandler. I think we should drop the r here and just be TransformedDataSafeHandle.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the extra r from all the SafeHandle classes. Fixed in next push.

@michaelgsharp michaelgsharp force-pushed the AutoMLTransformers branch 3 times, most recently from 4d6434f to e657947 Compare September 17, 2019 22:05
@michaelgsharp
Copy link
Contributor Author

CodeFlow for GitHub does not download executable files. If you trust this file and would like to run it, please download it yourself through GitHub.

This is temporary and will be removed before this PR is merged into master. This is just so that we can build using the nuget package.


Refers to: temp-nuget-folder/MicrosoftMLFeaturizers.0.1.0.nupkg:1 in efdf22a. [](commit_id = efdf22a, deletion_comment = False)

@@ -7,6 +7,8 @@

<ItemGroup>
<ProjectReference Include="../Microsoft.ML/Microsoft.ML.nupkgproj" />

<PackageReference Include="MicrosoftMLFeaturizers" Version="0.1.0" />
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this be the final package name of the native featurizers? If yes, then I find it confusing that the ML.NET package has the word Auto in it, but the underlying native package doesn't. What about the ML.NET package makes it "AutoML", but the underlying native package isn't "AutoML"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will renaming the ML.NET package to be called NativeFeaturizers per Gani's suggestion. Or are you thinking since the native package is MicrosoftMLFeaturizers that I should name the ML.NET package the same thing?


In reply to: 326226514 [](ancestors = 326226514)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally just my opinion, but I don't really like the word Native in the public API. I don't think it conveys any real information, and is an implementation detail.

It almost feels like the native assemblies, and the managed assemblies belong in the same NuGet package. But if we need to split them, I think their names should align in an understandable way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me. I'll rename the ML.NET package than to just be Microsoft.ML.Featurizers.


In reply to: 326238905 [](ancestors = 326238905)

@codecov
Copy link

codecov bot commented Sep 30, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@0b20a99). Click here to learn what that means.
The diff coverage is 81.97%.

@@            Coverage Diff            @@
##             master    #4157   +/-   ##
=========================================
  Coverage          ?   74.69%           
=========================================
  Files             ?      888           
  Lines             ?   157644           
  Branches          ?    17193           
=========================================
  Hits              ?   117753           
  Misses            ?    35018           
  Partials          ?     4873
Flag Coverage Δ
#Debug 74.69% <81.97%> (?)
#production 70.24% <74.97%> (?)
#test 89.81% <100%> (?)
Impacted Files Coverage Δ
src/Microsoft.ML.Featurizers/CategoryImputer.cs 69.31% <ø> (ø)
...irectInstantiation/IrisPlantClassificationTests.cs 100% <ø> (ø)
...ft.ML.Tests/Transformers/TimeSeriesImputerTests.cs 100% <100%> (ø)
....ML.Tests/Transformers/ToStringTransformerTests.cs 100% <100%> (ø)
...crosoft.ML.Core.Tests/UnitTests/TestEntryPoints.cs 98.71% <100%> (ø)
....ML.Tests/Transformers/DateTimeTransformerTests.cs 100% <100%> (ø)
...soft.ML.Tests/Transformers/CategoryImputerTests.cs 100% <100%> (ø)
src/Microsoft.ML.Featurizers/Common.cs 17.3% <17.3%> (ø)
...rc/Microsoft.ML.Featurizers/ToStringTransformer.cs 70.53% <70.53%> (ø)
...rosoft.ML.Featurizers/TimeSeriesImputerDataView.cs 71.88% <71.88%> (ø)
... and 2 more

@eerhardt
Copy link
Member

CategoryImputer

Why are we creating a whole new transform here, when ML.NET already has ReplaceMissingValues ?

I think this feature should just be an option to the existing ReplaceMissingValues transform. Today ML.NET supports:

/// <summary>
/// The possible ways to replace missing values.
/// </summary>
public enum ReplacementMode : byte
{
/// <summary>
/// Replace with the default value of the column based on its type.
/// </summary>
DefaultValue = 0,
/// <summary>
/// Replace with the mean value of the column.
/// </summary>
Mean = 1,
/// <summary>
/// Replace with the minimum value of the column.
/// </summary>
Minimum = 2,
/// <summary>
/// Replace with the maximum value of the column.
/// </summary>
Maximum = 3,
}

We should just be adding a new value to that enum called MostCommon (or similar).

/cc @codemzs @ebarsoumMS

ToStringTransformer is done.

CatagoryImputer is done.

TimeSeriesImputer is done.

RobustScaler is done.

Adding in samples and documentation. General code cleanup. Made the RowToRowMapperTransform create a new mapper if possible for each cursor.
@michaelgsharp
Copy link
Contributor Author

Closing this as we are merging in them one featurizer at a time instead of all together.

@michaelgsharp michaelgsharp deleted the AutoMLTransformers branch February 24, 2020 21:03
@ghost ghost locked as resolved and limited conversation to collaborators Mar 20, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants