-
Notifications
You must be signed in to change notification settings - Fork 1.9k
[WIP] Transformers for AutoML #4157
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -0,0 +1,1829 @@ | |||
using System; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) copyright on all code files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
copyright has been added. Thanks for the reminder.
internal override void CreateTransformerFromSavedData(byte[] data) | ||
{ | ||
bool result; | ||
GCHandle handle = GCHandle.Alloc(data, GCHandleType.Pinned); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can just use a fixed
statement in C#. This is slightly faster than allocing and freeing a GCHandle. See the rest of our code (like in CpuMath) where we get a pointer for an array.
Here's an example, from one of our tests:
machinelearning/test/Microsoft.ML.CpuMath.PerformanceTests/NativePerformanceTests.cs
Line 27 in b861b5d
fixed (float* pdst = dst) |
But you get the idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed all occurrences to fixed
.
} | ||
} | ||
|
||
[DllImport("Featurizers", EntryPoint = "GetErrorInfoString", CallingConvention = CallingConvention.Cdecl), SuppressUnmanagedCodeSecurity] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These new transforms should be split out of the main Microsoft.ML
nuget package. Since you are adding them to the main nuget package, EVERY application that uses ML.NET will have a dependency on these native assemblies, even if they don't use these new transforms.
See what we did with the transforms that depend on MKL - we split them into their own nuget package, so only people who needed those components would be forced to bring MKL along with their application.
As I am aware the NuGet for this native lib is not ready for various Linux flavors and Mac. I dont think it can go into master branch until the NuGet has them Refers to: src/Microsoft.ML.Transforms/ToStringTransformer.cs:316 in 260ff25. [](commit_id = 260ff25, deletion_comment = False) |
This ToStringTransformer could be easily added as an add-on functionality to TypeConverting transform. And TypeConverting transform has SaveAsOnnx(..) functionality already Refers to: src/Microsoft.ML.Transforms/ToStringTransformer.cs:73 in 260ff25. [](commit_id = 260ff25, deletion_comment = False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please split these new Transformers out into a separate nuget package. We cannot add new dependencies in the core Microsoft.ML
NuGet package. Not every ML.NET application will need these new dependencies, so they shouldn't be in the core. New dependencies need to be opted in to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replace Cpp with Native.
Why are you mixing DLLImport with the rest of the code?
Add comment to the public transformer, describing what it does.
} | ||
|
||
public override void Dispose() | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why manually? Can we use SafeHandler?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am using a SafeHandler for all handles themselves. This is for the main transformer class itself, and it just calls dispose on any SafeHandles it has.
} | ||
#endregion | ||
|
||
// Using this to confirm DLL exists. If does it will just return false since no parameters are being passed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need that? Isn't the DLL is part of the package?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will be a dependency we get through nuget, but until we get the nuget up and running this is in there just as a check. Once the nuget is ready we will remove this check.
|
||
internal delegate bool DestroyCppTransformerEstimator(IntPtr estimator, out IntPtr errorHandle); | ||
internal delegate bool DestroyTransformerSaveData(IntPtr buffer, IntPtr bufferSize, out IntPtr errorHandle); | ||
internal delegate bool DestroyTransformedDataCpp(IntPtr output, IntPtr outputSize, out IntPtr errorHandle); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use Native instead of Cpp
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, changed all occurrences from Cpp
to Native
.
<Project Sdk="Microsoft.NET.Sdk" DefaultTargets="Pack"> | ||
|
||
<PropertyGroup> | ||
<Authors>Intel</Authors> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to rename this file to be the name of the NuGet package.
Also, remove this Authors property.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fixed the files. It is creating separate nuget packages now in the bin/packages folder. It is not creating a folder under packages in the root though. Should it be creating them there as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a little confusing, but $RepoRoot\packages
are the NuGet packages that our code depends on - what are downloaded when we do a restore. $RepoRoot\bin\packages
are the NuGet packages that our build produces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, that makes sense. I have the $RepoRoot\bin\packages
creating the NuGet package correctly and tested it with a local import into another solution. So we should be good then. Thanks for the instructions!
@@ -8,7 +8,7 @@ | |||
// </auto-generated> | |||
//------------------------------------------------------------------------------ | |||
|
|||
namespace Microsoft.ML.Transforms.Properties | |||
namespace Microsoft.ML.AutoMLFeaturizers.Properties |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this was an intended change, was it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, good catch. Reverting that in next push.
using System.Runtime.CompilerServices; | ||
using Microsoft.ML; | ||
|
||
[assembly: InternalsVisibleTo(assemblyName: "Microsoft.ML.Tests" + PublicKey.TestValue)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What internals are the tests using? I think you should be able to test everything from the public API, right? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was testing the constructor that took in the Options
class, but looking at the other Transformers and their tests they are going about it differently. Let me fix it and remove that VisibleTo
tag. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I just double checked, and I was testing the entry point that AutoML uses and it takes in the Options
class which is internal. Is it ok to leave everything as is? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, all our tests have IVT, so this is fine to leave. I was mostly just curious why it was needed here. IMO it is best to test without Internals Visible To, but if you need it don't worry. #Resolved
<TargetFramework>netstandard2.0</TargetFramework> | ||
<IncludeInPackage>Microsoft.ML.Featurizers</IncludeInPackage> | ||
<AllowUnsafeBlocks>true</AllowUnsafeBlocks> | ||
<DefineConstants>CORECLR</DefineConstants> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this DefineConstant
is being used in the new assembly. It can be removed. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as the prior one. This was removed last night, but is not showing up in the PR for some reason even though it is in my branch. Will rebase and check again. #Resolved
|
||
<PropertyGroup> | ||
<TargetFramework>netstandard2.0</TargetFramework> | ||
<IncludeInPackage>Microsoft.ML.Featurizers</IncludeInPackage> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IncludeInPackage
needs to match the name of the NuGet package. Or else this assembly won't be included in the right NuGet package. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure why it is showing up like that. Last night I pushed a fix that had that change to the correct name. It looks like somehow the branch in my repo and this PR aren't syncing up correctly. The number of commits this PR shows is not the same as the number of commits in my branch itself? I will rebase and see if that fixes the issue. #Resolved
</ItemGroup> | ||
|
||
<ItemGroup> | ||
<Content Include="..\common\CommonPackage.props" Pack="true" PackagePath="build\netstandard2.0\$(MSBuildProjectName).props" /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line isn't necessary. This CommonPackage.props file probably should have a better name, but it is needed for NuGet packages that contain native assets inside of them. This one doesn't, so it is unnecessary. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Removing in the next push. #Resolved
@@ -0,0 +1,16 @@ | |||
<Project Sdk="Microsoft.NET.Sdk" DefaultTargets="Pack"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if AutoMLFeaturizers
is the best name for this package. Technically, anyone can use these featurizers without using AutoML.
What is going to be the name of the underlying NuGet package that contains the native assemblies? We should probably align with that name. So if that NuGet package is going to be named "AutoMLFeaturizers", then I guess that would be a good name here as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will sync with my team today and figure that out. Part of the reason I picked this name is that these Transformers are added specifically for the AutoML team. So while anyone can use them, they are custom tailored for their use. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about naming it NativeFeaturizers? Hinting that implementation is in native code
In reply to: 323317152 [](ancestors = 323317152)
[DllImport("Featurizers", EntryPoint = "GetErrorInfoString"), SuppressUnmanagedCodeSecurity] | ||
private static extern bool CheckIfDllExists(IntPtr error, out IntPtr errorHandleString, out IntPtr errorHandleStringSize); | ||
|
||
public static ToStringTransformerEstimator Create(IHostEnvironment env, string outputColumnName, string inputColumnName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These methods shouldn't be public
. External users should only create the estimators from the Catalog APIs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed them all to internal. Fixed in next push.
} | ||
} | ||
|
||
internal class TransformedDataSafeHandler : SafeHandleZeroOrMinusOneIsInvalid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(minor) The .NET type is a SafeHandle
, not SafeHandler
. I think we should drop the r
here and just be TransformedDataSafeHandle
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the extra r
from all the SafeHandle
classes. Fixed in next push.
4d6434f
to
e657947
Compare
This is temporary and will be removed before this PR is merged into master. This is just so that we can build using the nuget package. Refers to: temp-nuget-folder/MicrosoftMLFeaturizers.0.1.0.nupkg:1 in efdf22a. [](commit_id = efdf22a, deletion_comment = False) |
@@ -7,6 +7,8 @@ | |||
|
|||
<ItemGroup> | |||
<ProjectReference Include="../Microsoft.ML/Microsoft.ML.nupkgproj" /> | |||
|
|||
<PackageReference Include="MicrosoftMLFeaturizers" Version="0.1.0" /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this be the final package name of the native featurizers? If yes, then I find it confusing that the ML.NET package has the word Auto
in it, but the underlying native package doesn't. What about the ML.NET package makes it "AutoML", but the underlying native package isn't "AutoML"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will renaming the ML.NET package to be called NativeFeaturizers per Gani's suggestion. Or are you thinking since the native package is MicrosoftMLFeaturizers that I should name the ML.NET package the same thing?
In reply to: 326226514 [](ancestors = 326226514)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally just my opinion, but I don't really like the word Native
in the public API. I don't think it conveys any real information, and is an implementation detail.
It almost feels like the native assemblies, and the managed assemblies belong in the same NuGet package. But if we need to split them, I think their names should align in an understandable way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me. I'll rename the ML.NET package than to just be Microsoft.ML.Featurizers.
In reply to: 326238905 [](ancestors = 326238905)
4b44abf
to
eabf20e
Compare
Codecov Report
@@ Coverage Diff @@
## master #4157 +/- ##
=========================================
Coverage ? 74.69%
=========================================
Files ? 888
Lines ? 157644
Branches ? 17193
=========================================
Hits ? 117753
Misses ? 35018
Partials ? 4873
|
eabf20e
to
3258b6d
Compare
Why are we creating a whole new transform here, when ML.NET already has ReplaceMissingValues ? I think this feature should just be an option to the existing ReplaceMissingValues transform. Today ML.NET supports: machinelearning/src/Microsoft.ML.Transforms/MissingValueReplacing.cs Lines 913 to 934 in 6d192b6
We should just be adding a new value to that enum called /cc @codemzs @ebarsoumMS |
ToStringTransformer is done. CatagoryImputer is done. TimeSeriesImputer is done. RobustScaler is done. Adding in samples and documentation. General code cleanup. Made the RowToRowMapperTransform create a new mapper if possible for each cursor.
f4b4dd9
to
93a994f
Compare
Closing this as we are merging in them one featurizer at a time instead of all together. |
This is still a work in progress, but will be updated and finished over the next several hours.
In order for AutoML to take a full dependency on ML.NET, several pieces of functionality need to be added. This PR is the first step in that process adding in 3 new transformers. The actual implementation of the transformers is done in C++. This code is currently in another repo, but will be turned in a Nuget package that ML.Net can take a binary dependency on. The code going into ML.Net is mostly wrapper code dealing with data interop and, since the binary is exposed via C api's with no overloading, code to call the correct methods based on the data type.
ToStringTransformer - Takes in a column and converts it into an appropriate string representation. It currently supports all integer types, float, double, bool, and string. This is fully done, and should be reviewed first.
CategoryImputer - Fill in missing values with the most common value in the column. It currently supports all integer types, float, double, bool, and string. 0's are treated as missing for integer types. This is done except for a small piece of the c++ interop.
DateTimeTransformer - Splits a column into its appropriate date time components (20 columns total are added). Takes a Long that represents seconds since 1970. The c++ interop will be added for this tonight and can be reviewed last.
Until we get the binary added as an official dependency, the constructors of the estimator will throw an error if the correct .dll is not found. This allows us to do a private drop for now. The tests will also be disabled currently for that reason.