Skip to content

Replace DropColumns,KeepColumns and ChooseColumns with SelectColumns #1269

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Oct 20, 2018

Conversation

singlis
Copy link
Member

@singlis singlis commented Oct 15, 2018

This adds the SelectColumns Transform and Estimator that is replacing
the DropColumns and ChooseColumns Transforms. With this check-in, Drop
and Choose are still in the code base but will be removed. In order to
support loading older models, SelectColumns supports loading in Drop and
Choose transforms. The changes include:

  • Implementation of the SelectColumnsTransform,
    SelectColumnsDataTransform and SelectColumnsEstimator
  • Backward compatibility with Drop and Choose columns by providing
    functions on SelectColumns that will be called when loading the model.
  • Entry point apis for calling select from the command line.
  • Additional tests.

These changes are related to #754.

This adds the SelectColumns Transform and Estimator that is replacing
the DropColumns and ChooseColumns Transforms. With this check-in, Drop
and Choose are still in the code base but will be removed. In order to
support loading older models, SelectColumns supports loading in Drop and
Choose transforms. The changes include:
- Implementation of the SelectColumnsTransform,
SelectColumnsDataTransform and SelectColumnsEstimator
- Backward compatibility with Drop and Choose columns by providing
functions on SelectColumns that will be called when loading the model.
- Entry point apis for calling select from the command line.
- Additional tests.

These changes are related to dotnet#754.
@@ -0,0 +1,217 @@
// Licensed to the .NET Foundation under one or more agreements.
Copy link
Member Author

@singlis singlis Oct 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: This file will be coming in via this PR:
1167 #Resolved

@@ -15,7 +15,8 @@ public CSharpCodeGen(ITestOutputHelper output) : base(output)
{
}

[Fact(Skip = "Execute this test if you want to regenerate CSharpApi file")]
//[Fact(Skip = "Execute this test if you want to regenerate CSharpApi file")]
Copy link
Member Author

@singlis singlis Oct 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will revert this change. #Resolved

{
private readonly Mapper _mapper;
private readonly IRow _input;
public Row(IRow input, Mapper mapper, Schema outputSchema)
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Oct 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Schema outputSchema [](start = 50, length = 19)

Looks weird #Closed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the outputSchema


In reply to: 225345828 [](ancestors = 225345828)

=> _input.GetIdGetter();

public bool IsColumnActive(int col)
{
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Oct 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{ [](start = 12, length = 1)

why not => true? #Closed

{
disposer = null;
using (var ch = _host.Start("GetEntireRow"))
{
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Oct 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason to create it? I don't see any usage of it. #Closed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no - I have removed it.


In reply to: 225346084 [](ancestors = 225346084)


public ValueGetter<TValue> GetGetter<TValue>(int col)

{
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Oct 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra line #Closed

}

public bool IsColumnActive(int col)
{
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Oct 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{ [](start = 12, length = 1)

feel free to use => notion for one liners. #Closed

for (int colIdx = 0; colIdx < columnCount; ++colIdx)
{
if (activeOutput(colIdx))
{
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Oct 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{ [](start = 20, length = 1)

we tend to omit braces for one liners. #Closed


public IRowCursor GetRowCursor(Func<int, bool> needCol, IRandom rand = null)
{
_host.AssertValue(needCol, "needCol");
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Oct 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needCol" [](start = 44, length = 8)

nameof #Closed

}

public void Save(ModelSaveContext ctx)
{
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Oct 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{ [](start = 12, length = 1)

=> #Closed

public ISchema InputSchema => Source.Schema;

public long? GetRowCount(bool lazy = true)
{
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Oct 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{ [](start = 12, length = 1)

=> #Closed

private sealed class SelectColumnsDataTransform : IDataTransform, IRowToRowMapper
{
private readonly Mapper _mapper;
private readonly IHostEnvironment _host;
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Oct 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_host [](start = 46, length = 5)

i would call it _env if it's IHostEnvironment. #Closed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the file as well.


In reply to: 225346508 [](ancestors = 225346508)

_transform = transform;
_mapper = mapper;
Source = input;
_host = host;
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Oct 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_host = host [](start = 16, length = 12)

accept env object and make _host is IHost, and do _host = env.Register(nameof(SelectColumnsDataTransform)); #Pending

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I probably gave two contradictive comments.
Can you call it IHost _host?


In reply to: 225346656 [](ancestors = 225346656)

=> _input.GetIdGetter();

public bool IsColumnActive(int col)
{
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Oct 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{ [](start = 12, length = 1)

=> #Closed

private readonly Mapper _mapper;
private readonly IHostEnvironment _host;
private readonly SelectColumnsTransform _transform;
public SelectColumnsDataTransform(IHostEnvironment host, SelectColumnsTransform transform, Mapper mapper, IDataView input)
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Oct 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add new line #Closed

/// <summary>
/// Constructs the Select Columns Estimator with an array of column names to keep.
/// </summary>
/// <param name="env">Instance of the host environment</param>
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Oct 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"." in the end of sentence #Closed

/// Constructs the Select Columns Estimator with an array of column names to keep.
/// </summary>
/// <param name="env">Instance of the host environment</param>
/// <param name="columns">The array of column names to keep</param>
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Oct 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep [](start = 63, length = 4)

"." in the end of sentence, everywhere please #Closed

@@ -242,7 +242,8 @@ private string GetBuildPrefix()
#endif
}

[Fact(Skip = "Execute this test if you want to regenerate ep-list and _manifest.json")]
//[Fact(Skip = "Execute this test if you want to regenerate ep-list and _manifest.json")]
[Fact]
Copy link
Contributor

@Zruty0 Zruty0 Oct 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fact [](start = 9, length = 4)

don't forget to undo #Closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please revert it to skip


In reply to: 225386340 [](ancestors = 225386340)

@Zruty0
Copy link
Contributor

Zruty0 commented Oct 16, 2018

using Microsoft.ML.Runtime.Api;

Copyright header #Closed


Refers to: test/Microsoft.ML.Tests/SelectColumnsTransformsTests.cs:1 in 1b0fff9. [](commit_id = 1b0fff9, deletion_comment = False)


namespace Microsoft.ML.Tests
{
public class SelectColumnsTransformsTests : BaseTestClass
Copy link
Contributor

@Zruty0 Zruty0 Oct 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BaseTestClass [](start = 48, length = 13)

Please add a test that calls TestEstimatorCore #Closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or add that to one of the existing tests.


In reply to: 225386549 [](ancestors = 225386549)


namespace Microsoft.ML.Transforms
{
/// <summary>
Copy link
Contributor

@Zruty0 Zruty0 Oct 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

summary [](start = 9, length = 7)

Please re-word this comment. It should not mention that this is an estimator class, and the 2nd sentence looked off. Maybe the clarification about 'it may or may not keep hidden columns' don't really belong to the summary comment #Closed


public SchemaShape GetOutputSchema(SchemaShape inputSchema)
{
var columns = new List<SchemaShape.Column>();
Copy link
Contributor

@Zruty0 Zruty0 Oct 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about

var columns = inputSchema.Columns.Where(c=>_selectPredicate(c.Name))
return new SchemaShape(columns)
``` #Closed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I really want an IEnumerable for Columns, but all I have is ColumnCount and GetColumnName(idx). Maybe this is coming in the near future? Oh wait -- I just saw that your PR went though. I will update...


In reply to: 225387439 [](ancestors = 225387439)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok thank you - I have updated.


In reply to: 225682139 [](ancestors = 225682139,225387439)

_selectedColumns = columns;
}

/// Helper function to determine the model version that is being loaded
Copy link
Contributor

@Zruty0 Zruty0 Oct 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/// [](start = 8, length = 3)

either // or make a full summary comment #Closed

var length = _selectedColumns.Length;
ctx.Writer.Write(length);
for (int i = 0; i < length; i++)
{
Copy link
Contributor

@Zruty0 Zruty0 Oct 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{ [](start = 12, length = 1)

no curlies for 1-line bodies #Closed

}
}

internal Mapper(SelectColumnsTransform transform, ISchema inputSchema)
Copy link
Contributor

@Zruty0 Zruty0 Oct 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

internal [](start = 12, length = 8)

public is fine, since it's private class #Closed


public ISchema Schema => _outputSchema;

public int this [int colIndex]
Copy link
Contributor

@Zruty0 Zruty0 Oct 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this [](start = 23, length = 4)

I'm not sure I approve of the use of indexer in this manner. Let's just make a public method #Closed

@artidoro artidoro self-requested a review October 16, 2018 17:09
@singlis
Copy link
Member Author

singlis commented Oct 16, 2018

Thanks Artidoro - this is a good point. I am following up to see if it even makes sense for SelectColumns to have pigsty extensions. If so, I will create an issue and followup on a different PR.


In reply to: 430325275 [](ancestors = 430325275)

_host.CheckValue(inputSchema, nameof(inputSchema));
return new SelectColumnsDataTransform(_host, this,
new Mapper(this, inputSchema),
new EmptyDataView(_host, Schema.Create(inputSchema)));
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Oct 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Schema.Create(inputSchema) [](start = 75, length = 26)

This looks excessive. #Closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually if you look at Schema.Create, you'll see that it will just return inputSchema, so let's not call the method at all


In reply to: 226045960 [](ancestors = 226045960)

Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

/// <param name="env">Instance of the host environment.</param>
/// <param name="keepHidden">Specifies if hidden columns should be removed. Default is true to keep hidden columns.</param>
/// <param name="selectPredicate">The predicate that will determines the columns to keep.</param>
public SelectColumnsEstimator(IHostEnvironment env, bool keepHidden, Func<string, bool> selectPredicate)
Copy link
Contributor

@Zruty0 Zruty0 Oct 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SelectColumnsEstimator [](start = 15, length = 22)

why not bundle with the above, and default keepHidden to true? #Closed

/// Constructs the Select Columns Estimator using a specified predicate to determine the columns to keep.
/// </summary>
/// <param name="env">Instance of the host environment.</param>
/// <param name="keepHidden">Specifies if hidden columns should be removed. Default is true to keep hidden columns.</param>
Copy link
Contributor

@Zruty0 Zruty0 Oct 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default is true to keep hidden columns. [](start = 84, length = 39)

remove #Closed

{
var dataView = ComponentCreation.CreateDataView(env, data);
var est = new SelectColumnsEstimator(env, new[]{ "D", "E" });
TestEstimatorCore(est, validFitInput: dataView);
Copy link
Contributor

@Zruty0 Zruty0 Oct 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestEstimatorCore [](start = 16, length = 17)

actually, let's make it more elaborate, and

  • test the predicate version here (for instance, keep only columns with >1 characters in name)
  • add an invalid dataset #Closed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since the api change, I have TestEstimatorCore for both keep and drop scenarios both with invalid data sets.


In reply to: 226068148 [](ancestors = 226068148)

Copy link
Contributor

@Zruty0 Zruty0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

/// Constructs the Select Columns Estimator.
/// </summary>
/// <param name="env">Instance of the host environment.</param>
/// <param name="keepColumns">The array of column names to keep, cannot be set with dropColumns.</param>
Copy link
Contributor

@Zruty0 Zruty0 Oct 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dropColumns [](start = 92, length = 11)

<paramref> #Resolved

/// <param name="env">Instance of the host environment.</param>
/// <param name="keepColumns">The array of column names to keep, cannot be set with dropColumns.</param>
/// <param name="dropColumns">The array of column names to drop, cannot be set with keepColumns.</param>
/// <param name="keepHidden">Boolean if true will keep hidden columns and false will remove hidden columns.</param>
Copy link
Contributor

@Zruty0 Zruty0 Oct 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Boolean [](start = 37, length = 7)

no need to specify the type #Resolved

/// <param name="keepColumns">The array of column names to keep, cannot be set with dropColumns.</param>
/// <param name="dropColumns">The array of column names to drop, cannot be set with keepColumns.</param>
/// <param name="keepHidden">Boolean if true will keep hidden columns and false will remove hidden columns.</param>
/// <param name="ignoreMismatch">Boolean if false will check for any mismatch between what is specified in keepColumns or dropColumns
Copy link
Contributor

@Zruty0 Zruty0 Oct 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ignoreMismatch [](start = 25, length = 14)

'mismatch' is too broad. 'Missing' is probably better #Resolved

}

/// <summary>
/// Static helper function to create a SelectColumnsEstimator with columns to drop.
Copy link
Contributor

@Zruty0 Zruty0 Oct 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tatic helper function to create a SelectColumnsEstimator with columns to drop. [](start = 13, length = 78)

Please write a user-facing summary comment, not developer-facing #Resolved

!IsSchemaValid(inputSchema.GetColumns().Select(x=>x.column.Name),
SelectColumns,
out IEnumerable<string> invalidColumns))
throw _host.ExceptSchemaMismatch(nameof(inputSchema), "input",
Copy link
Contributor

@Zruty0 Zruty0 Oct 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

throw [](start = 18, length = 5)

for such complicated statement above, it's appropriate to use curlies #Resolved

}

private static int[] BuildOutputToInputMap(IEnumerable<string> selectedColumns,
bool keepColumns,
Copy link
Contributor

@Zruty0 Zruty0 Oct 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bool [](start = 57, length = 4)

this is a non-standard padding. Standard is 4 spaces to the right of 'private' #Resolved

Copy link
Member Author

@singlis singlis Oct 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this over, but does this apply to function calls as well? I usually try to keep the parameters together for readability.


In reply to: 226763760 [](ancestors = 226763760)

Copy link
Contributor

@Zruty0 Zruty0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Copy link
Contributor

@Zruty0 Zruty0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@singlis singlis merged commit 9a33cd4 into dotnet:master Oct 20, 2018
@singlis singlis deleted the singlis/drop branch March 7, 2019 00:54
@ghost ghost locked as resolved and limited conversation to collaborators Mar 27, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants