-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Replace DropColumns,KeepColumns and ChooseColumns with SelectColumns #1269
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This adds the SelectColumns Transform and Estimator that is replacing the DropColumns and ChooseColumns Transforms. With this check-in, Drop and Choose are still in the code base but will be removed. In order to support loading older models, SelectColumns supports loading in Drop and Choose transforms. The changes include: - Implementation of the SelectColumnsTransform, SelectColumnsDataTransform and SelectColumnsEstimator - Backward compatibility with Drop and Choose columns by providing functions on SelectColumns that will be called when loading the model. - Entry point apis for calling select from the command line. - Additional tests. These changes are related to dotnet#754.
src/Microsoft.ML.Core/Data/Schema.cs
Outdated
@@ -0,0 +1,217 @@ | |||
// Licensed to the .NET Foundation under one or more agreements. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: This file will be coming in via this PR:
1167 #Resolved
@@ -15,7 +15,8 @@ public CSharpCodeGen(ITestOutputHelper output) : base(output) | |||
{ | |||
} | |||
|
|||
[Fact(Skip = "Execute this test if you want to regenerate CSharpApi file")] | |||
//[Fact(Skip = "Execute this test if you want to regenerate CSharpApi file")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will revert this change. #Resolved
{ | ||
private readonly Mapper _mapper; | ||
private readonly IRow _input; | ||
public Row(IRow input, Mapper mapper, Schema outputSchema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Schema outputSchema [](start = 50, length = 19)
Looks weird #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
=> _input.GetIdGetter(); | ||
|
||
public bool IsColumnActive(int col) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{ [](start = 12, length = 1)
why not => true? #Closed
{ | ||
disposer = null; | ||
using (var ch = _host.Start("GetEntireRow")) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason to create it? I don't see any usage of it. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
public ValueGetter<TValue> GetGetter<TValue>(int col) | ||
|
||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extra line #Closed
} | ||
|
||
public bool IsColumnActive(int col) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{ [](start = 12, length = 1)
feel free to use => notion for one liners. #Closed
for (int colIdx = 0; colIdx < columnCount; ++colIdx) | ||
{ | ||
if (activeOutput(colIdx)) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{ [](start = 20, length = 1)
we tend to omit braces for one liners. #Closed
|
||
public IRowCursor GetRowCursor(Func<int, bool> needCol, IRandom rand = null) | ||
{ | ||
_host.AssertValue(needCol, "needCol"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needCol" [](start = 44, length = 8)
nameof #Closed
} | ||
|
||
public void Save(ModelSaveContext ctx) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{ [](start = 12, length = 1)
=> #Closed
public ISchema InputSchema => Source.Schema; | ||
|
||
public long? GetRowCount(bool lazy = true) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{ [](start = 12, length = 1)
=> #Closed
private sealed class SelectColumnsDataTransform : IDataTransform, IRowToRowMapper | ||
{ | ||
private readonly Mapper _mapper; | ||
private readonly IHostEnvironment _host; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_host [](start = 46, length = 5)
i would call it _env if it's IHostEnvironment. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_transform = transform; | ||
_mapper = mapper; | ||
Source = input; | ||
_host = host; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_host = host [](start = 16, length = 12)
accept env object and make _host is IHost, and do _host = env.Register(nameof(SelectColumnsDataTransform)); #Pending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I probably gave two contradictive comments.
Can you call it IHost _host?
In reply to: 225346656 [](ancestors = 225346656)
=> _input.GetIdGetter(); | ||
|
||
public bool IsColumnActive(int col) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{ [](start = 12, length = 1)
=> #Closed
private readonly Mapper _mapper; | ||
private readonly IHostEnvironment _host; | ||
private readonly SelectColumnsTransform _transform; | ||
public SelectColumnsDataTransform(IHostEnvironment host, SelectColumnsTransform transform, Mapper mapper, IDataView input) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add new line #Closed
/// <summary> | ||
/// Constructs the Select Columns Estimator with an array of column names to keep. | ||
/// </summary> | ||
/// <param name="env">Instance of the host environment</param> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"." in the end of sentence #Closed
/// Constructs the Select Columns Estimator with an array of column names to keep. | ||
/// </summary> | ||
/// <param name="env">Instance of the host environment</param> | ||
/// <param name="columns">The array of column names to keep</param> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
keep [](start = 63, length = 4)
"." in the end of sentence, everywhere please #Closed
@@ -242,7 +242,8 @@ private string GetBuildPrefix() | |||
#endif | |||
} | |||
|
|||
[Fact(Skip = "Execute this test if you want to regenerate ep-list and _manifest.json")] | |||
//[Fact(Skip = "Execute this test if you want to regenerate ep-list and _manifest.json")] | |||
[Fact] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fact [](start = 9, length = 4)
don't forget to undo #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
namespace Microsoft.ML.Tests | ||
{ | ||
public class SelectColumnsTransformsTests : BaseTestClass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BaseTestClass [](start = 48, length = 13)
Please add a test that calls TestEstimatorCore
#Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
namespace Microsoft.ML.Transforms | ||
{ | ||
/// <summary> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
summary [](start = 9, length = 7)
Please re-word this comment. It should not mention that this is an estimator class, and the 2nd sentence looked off. Maybe the clarification about 'it may or may not keep hidden columns' don't really belong to the summary comment #Closed
|
||
public SchemaShape GetOutputSchema(SchemaShape inputSchema) | ||
{ | ||
var columns = new List<SchemaShape.Column>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about
var columns = inputSchema.Columns.Where(c=>_selectPredicate(c.Name))
return new SchemaShape(columns)
``` #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I really want an IEnumerable for Columns, but all I have is ColumnCount and GetColumnName(idx). Maybe this is coming in the near future? Oh wait -- I just saw that your PR went though. I will update...
In reply to: 225387439 [](ancestors = 225387439)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_selectedColumns = columns; | ||
} | ||
|
||
/// Helper function to determine the model version that is being loaded |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// [](start = 8, length = 3)
either //
or make a full summary comment #Closed
var length = _selectedColumns.Length; | ||
ctx.Writer.Write(length); | ||
for (int i = 0; i < length; i++) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{ [](start = 12, length = 1)
no curlies for 1-line bodies #Closed
} | ||
} | ||
|
||
internal Mapper(SelectColumnsTransform transform, ISchema inputSchema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
internal [](start = 12, length = 8)
public
is fine, since it's private class #Closed
|
||
public ISchema Schema => _outputSchema; | ||
|
||
public int this [int colIndex] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this [](start = 23, length = 4)
I'm not sure I approve of the use of indexer in this manner. Let's just make a public method #Closed
Thanks Artidoro - this is a good point. I am following up to see if it even makes sense for SelectColumns to have pigsty extensions. If so, I will create an issue and followup on a different PR. In reply to: 430325275 [](ancestors = 430325275) |
_host.CheckValue(inputSchema, nameof(inputSchema)); | ||
return new SelectColumnsDataTransform(_host, this, | ||
new Mapper(this, inputSchema), | ||
new EmptyDataView(_host, Schema.Create(inputSchema))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Schema.Create(inputSchema) [](start = 75, length = 26)
This looks excessive. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually if you look at Schema.Create
, you'll see that it will just return inputSchema
, so let's not call the method at all
In reply to: 226045960 [](ancestors = 226045960)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// <param name="env">Instance of the host environment.</param> | ||
/// <param name="keepHidden">Specifies if hidden columns should be removed. Default is true to keep hidden columns.</param> | ||
/// <param name="selectPredicate">The predicate that will determines the columns to keep.</param> | ||
public SelectColumnsEstimator(IHostEnvironment env, bool keepHidden, Func<string, bool> selectPredicate) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SelectColumnsEstimator [](start = 15, length = 22)
why not bundle with the above, and default keepHidden
to true
? #Closed
/// Constructs the Select Columns Estimator using a specified predicate to determine the columns to keep. | ||
/// </summary> | ||
/// <param name="env">Instance of the host environment.</param> | ||
/// <param name="keepHidden">Specifies if hidden columns should be removed. Default is true to keep hidden columns.</param> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Default is true to keep hidden columns. [](start = 84, length = 39)
remove #Closed
{ | ||
var dataView = ComponentCreation.CreateDataView(env, data); | ||
var est = new SelectColumnsEstimator(env, new[]{ "D", "E" }); | ||
TestEstimatorCore(est, validFitInput: dataView); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TestEstimatorCore [](start = 16, length = 17)
actually, let's make it more elaborate, and
- test the predicate version here (for instance, keep only columns with >1 characters in name)
- add an invalid dataset #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since the api change, I have TestEstimatorCore for both keep and drop scenarios both with invalid data sets.
In reply to: 226068148 [](ancestors = 226068148)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Constructs the Select Columns Estimator. | ||
/// </summary> | ||
/// <param name="env">Instance of the host environment.</param> | ||
/// <param name="keepColumns">The array of column names to keep, cannot be set with dropColumns.</param> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dropColumns [](start = 92, length = 11)
<paramref>
#Resolved
/// <param name="env">Instance of the host environment.</param> | ||
/// <param name="keepColumns">The array of column names to keep, cannot be set with dropColumns.</param> | ||
/// <param name="dropColumns">The array of column names to drop, cannot be set with keepColumns.</param> | ||
/// <param name="keepHidden">Boolean if true will keep hidden columns and false will remove hidden columns.</param> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Boolean [](start = 37, length = 7)
no need to specify the type #Resolved
/// <param name="keepColumns">The array of column names to keep, cannot be set with dropColumns.</param> | ||
/// <param name="dropColumns">The array of column names to drop, cannot be set with keepColumns.</param> | ||
/// <param name="keepHidden">Boolean if true will keep hidden columns and false will remove hidden columns.</param> | ||
/// <param name="ignoreMismatch">Boolean if false will check for any mismatch between what is specified in keepColumns or dropColumns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ignoreMismatch [](start = 25, length = 14)
'mismatch' is too broad. 'Missing' is probably better #Resolved
} | ||
|
||
/// <summary> | ||
/// Static helper function to create a SelectColumnsEstimator with columns to drop. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tatic helper function to create a SelectColumnsEstimator with columns to drop. [](start = 13, length = 78)
Please write a user-facing summary comment, not developer-facing #Resolved
!IsSchemaValid(inputSchema.GetColumns().Select(x=>x.column.Name), | ||
SelectColumns, | ||
out IEnumerable<string> invalidColumns)) | ||
throw _host.ExceptSchemaMismatch(nameof(inputSchema), "input", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
throw [](start = 18, length = 5)
for such complicated statement above, it's appropriate to use curlies #Resolved
} | ||
|
||
private static int[] BuildOutputToInputMap(IEnumerable<string> selectedColumns, | ||
bool keepColumns, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bool [](start = 57, length = 4)
this is a non-standard padding. Standard is 4 spaces to the right of 'private' #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved this over, but does this apply to function calls as well? I usually try to keep the parameters together for readability.
In reply to: 226763760 [](ancestors = 226763760)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This adds the SelectColumns Transform and Estimator that is replacing
the DropColumns and ChooseColumns Transforms. With this check-in, Drop
and Choose are still in the code base but will be removed. In order to
support loading older models, SelectColumns supports loading in Drop and
Choose transforms. The changes include:
SelectColumnsDataTransform and SelectColumnsEstimator
functions on SelectColumns that will be called when loading the model.
These changes are related to #754.