Skip to content

Implementing copy column estimator #706

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 29 commits into from
Aug 27, 2018

Conversation

Ivanidzo4ka
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka requested a review from Zruty0 August 21, 2018 23:23
// *** Binary format ***
// int: number of added columns
// for each added column
// int: id of output column name
Copy link
Contributor Author

@Ivanidzo4ka Ivanidzo4ka Aug 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

int: id [](start = 16, length = 8)

string, not int #Closed

@@ -45,6 +45,12 @@ public Column(string name, VectorKind vecKind, DataKind itemKind, bool isKey, st
IsKey = isKey;
MetadataKinds = metadataKinds;
}

public Column CloneWithNewName(string newName)
Copy link
Contributor

@Zruty0 Zruty0 Aug 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CloneWithNewName [](start = 26, length = 16)

maybe this calls for mutable Column object?
If we don't do this, I guess you just have to write line 52 everywhere :) #Closed

throw Contracts.ExceptUserArg(nameof(columns), $"New column {column.Name} specified multiple times");
newNames.Add(column.Name);
}
_columns = columns;
Copy link
Contributor

@Zruty0 Zruty0 Aug 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

columns.Select(x=>x.Name).Distinct().Count #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to tell user where he make mistake.


In reply to: 211790617 [](ancestors = 211790617)

private readonly IHost _host;

public CopyColumnsEstimator(IHostEnvironment env, string input, string output)
{
Copy link
Contributor

@Zruty0 Zruty0 Aug 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{ [](start = 8, length = 1)

Check env, then register host. then host.Check() everywhere #Closed

_host = env.Register("CopyColumnsEstimator");
}

public CopyColumnsEstimator(IHostEnvironment env, (string Source, string Name)[] columns)
Copy link
Contributor

@Zruty0 Zruty0 Aug 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CopyColumnsEstimator [](start = 15, length = 20)

params? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what params?


In reply to: 211790834 [](ancestors = 211790834)

Copy link
Contributor

@Zruty0 Zruty0 Aug 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make these a params (Source, Name)[] argument


In reply to: 212126225 [](ancestors = 212126225,211790834)


public CopyColumnsTransformer Fit(IDataView input)
{
// invoke schema validation.
Copy link
Contributor

@Zruty0 Zruty0 Aug 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

invoke [](start = 15, length = 6)

Invoke #Closed


public SchemaShape GetOutputSchema(SchemaShape inputSchema)
{
Contracts.CheckValue(inputSchema, nameof(inputSchema));
Copy link
Contributor

@Zruty0 Zruty0 Aug 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Contracts [](start = 12, length = 9)

_host #Closed

public SchemaShape GetOutputSchema(SchemaShape inputSchema)
{
Contracts.CheckValue(inputSchema, nameof(inputSchema));
Contracts.CheckValue(inputSchema.Columns, nameof(inputSchema.Columns));
Copy link
Contributor

@Zruty0 Zruty0 Aug 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Columns [](start = 45, length = 7)

No need. A valid schema will always have columns #Closed

throw Contracts.ExceptUserArg(nameof(inputSchema), $"{pair.source} not found in {nameof(inputSchema)}");
}
}
var result = new SchemaShape(originDic.Values.ToArray());
Copy link
Contributor

@Zruty0 Zruty0 Aug 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

originDic [](start = 41, length = 9)

the other one, right? #Closed

{
Contracts.CheckValue(inputSchema, nameof(inputSchema));
Contracts.CheckValue(inputSchema.Columns, nameof(inputSchema.Columns));
var originDic = inputSchema.Columns.ToDictionary(x => x.Name);
Copy link
Contributor

@Zruty0 Zruty0 Aug 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

originDic [](start = 16, length = 9)

this is all supplanted by schema.FindColumn #Closed


}

public Delegate[] CreateGetters(IRow input, Func<int, bool> activeOutput, out Action disposer)
Copy link
Contributor

@Zruty0 Zruty0 Aug 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

input [](start = 49, length = 5)

need to assert that input.Schema matches _schema #Closed


private Delegate MakeGetter<T>(IRow row, int src) => row.GetGetter<T>(src);

public Func<int, bool> GetDependencies(Func<int, bool> activeOutput) => (col) => (activeOutput(col) && _originalColumnSources.Contains(col));
Copy link
Contributor

@Zruty0 Zruty0 Aug 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetDependencies [](start = 35, length = 15)

this is wrong: the argument to activeOutput is between 0 and GreateGetters().Length. The argument to the resulting Func is between 0 and _schema.ColumnCount #Closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect you use _colMap as old-to-new mapping, but in reality you populate the other way round.


In reply to: 211792129 [](ancestors = 211792129)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need both I guess


In reply to: 212480267 [](ancestors = 212480267,211792129)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Old to new is one-to-many


In reply to: 212482646 [](ancestors = 212482646,212480267,211792129)

public RowMapperColumnInfo[] GetOutputColumns()
{
var result = new RowMapperColumnInfo[_columns.Length];
// Do I need return all columns or only output???
Copy link
Contributor

@Zruty0 Zruty0 Aug 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

output [](start = 56, length = 6)

only output #Closed

for (int i = 0; i < _columns.Length; i++)
{
_schema.TryGetColumnIndex(_columns[i].source, out int colIndex);
// how to get metadata?
Copy link
Contributor

@Zruty0 Zruty0 Aug 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's add a REVIEW for now, that we want metadata to become IRow.
For now, pre-populate this output in ctor by pulling metadata from src schema #Closed

@Zruty0
Copy link
Contributor

Zruty0 commented Aug 21, 2018

[assembly: LoadableClass(CopyColumnsTransform.Summary, typeof(CopyColumnsTransform), typeof(CopyColumnsTransform.Arguments), typeof(SignatureDataTransform),

replace this with a Create method that manufactures a Transformer and wraps it's Transform into a RowToRowTransform #Closed


Refers to: src/Microsoft.ML.Data/Transforms/CopyColumnsTransform.cs:18 in 039653c. [](commit_id = 039653c, deletion_comment = False)

@@ -18,6 +21,9 @@
[assembly: LoadableClass(CopyColumnsTransform.Summary, typeof(CopyColumnsTransform), null, typeof(SignatureLoadDataTransform),
Copy link
Contributor

@Zruty0 Zruty0 Aug 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LoadableClass [](start = 11, length = 13)

remove altogether

Add loadable class for the mapper instead #Closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with SignatureLoadRowMapper as a signature.


In reply to: 211794398 [](ancestors = 211794398)

{
failed = true;
}
Assert.True(failed);
Copy link
Contributor

@Zruty0 Zruty0 Aug 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assert [](start = 16, length = 6)

can you just make it Assert.True(false) inside the try block? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I want to validate what it actually failed. Your suggestion applicable if I want validate absence of failure.


In reply to: 211795480 [](ancestors = 211795480)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reject your rejection. Just put Assert.False() inside the try (not catch) block after the code that is supposed to throw


In reply to: 212474277 [](ancestors = 212474277,211795480)

[Fact]
void TestBadTransformSchmea()
{
var data = new List<TestClass>() { new TestClass() { A = 1, B = 2, C = 3, }, new TestClass() { A = 4, B = 5, C = 6 } };
Copy link
Contributor

@Zruty0 Zruty0 Aug 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new [](start = 23, length = 3)

can new[] work here? #Closed


namespace Microsoft.ML.Tests
{
public class CopyColumnEstimatorTests
Copy link
Contributor

@Zruty0 Zruty0 Aug 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CopyColumnEstimatorTests [](start = 17, length = 24)

The transform behavior is checked in TestDataPipeBase.TestCore. So we need to only check the estimator behavior: that we throw at schema errors for example.

I expect we need to create an estimator workout test as well, similar to TestCore. Not in this PR though.
#Closed

throw _host.ExceptParam(nameof(schema), $"{_columns[i].Source} not found in {nameof(schema)}");
}
_colNewToOldMapping.Add(i, colIndex);
_colOldToNewMapping.Add(i, colIndex);
Copy link
Contributor

@Zruty0 Zruty0 Aug 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_colOldToNewMapping.Add(i, colIndex); [](start = 16, length = 37)

I suppose you meant the reverse?

Also note that the same source can map to more than one target. I think it's better to have the new-to-old only (which is one to one), and build a bool array of dependencies inside GetDependencies #Closed

@Zruty0
Copy link
Contributor

Zruty0 commented Aug 23, 2018

using Microsoft.ML.Runtime;

header #Closed


Refers to: test/Microsoft.ML.Tests/CopyColumnEstimatorTests.cs:1 in 3d76715. [](commit_id = 3d76715, deletion_comment = False)

@Ivanidzo4ka Ivanidzo4ka requested a review from TomFinley August 24, 2018 00:22
Copy link
Contributor

@Zruty0 Zruty0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Copy link
Contributor

@Zruty0 Zruty0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

var list = Column == null ? new List<Microsoft.ML.Transforms.CopyColumnsTransformColumn>() : new List<Microsoft.ML.Transforms.CopyColumnsTransformColumn>(Column);
list.Add(OneToOneColumn<Microsoft.ML.Transforms.CopyColumnsTransformColumn>.Create(inputColumn));
var list = Column == null ? new List<Microsoft.ML.Transforms.CopyColumnsTransformerColumn>() : new List<Microsoft.ML.Transforms.CopyColumnsTransformerColumn>(Column);
list.Add(OneToOneColumn<Microsoft.ML.Transforms.CopyColumnsTransformerColumn>.Create(inputColumn));
Copy link
Contributor

@TomFinley TomFinley Aug 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CopyColumnsTransformerColumn [](start = 64, length = 28)

This is a little unfortunate, maybe there is a way we can avoid this rename, since this is still our public facing API at least in the immediate term. #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe you've solved this based on a rename


In reply to: 212749809 [](ancestors = 212749809)

{
if (newNames.Contains(column.Name))
throw Contracts.ExceptUserArg(nameof(columns), $"New column {column.Name} specified multiple times");
newNames.Add(column.Name);
Copy link
Contributor

@TomFinley TomFinley Aug 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that HashSet<>.Add will itself return false if the element was already present in the set. Therefore this check can be written more efficiently as:

if (newNames.Add(column.Name)
    throw ...;
``` #Resolved

{
}

public CopyColumnsEstimator(IHostEnvironment env, params (string Source, string Name)[] columns)
Copy link
Contributor

@TomFinley TomFinley Aug 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Source [](start = 73, length = 6)

From what I've seen in MSDN documentation on ValueTuple, while it's a little strange they seem to eschew normal .NET naming conventions on objects when used in this context and instead cleave to .NET naming conventions on parameters, since despite being a public member name it is being used kinda as a parameter name. So maybe source and name? #Resolved

Copy link
Contributor

@TomFinley TomFinley Aug 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe RoboTom should be updated to be ValueTuple aware... not in this PR to be clear, just sometime.


In reply to: 212752862 [](ancestors = 212752862)

private readonly IHost _host;

public CopyColumnsEstimator(IHostEnvironment env, string input, string output) :
this(env, new (string, string)[1] { (input, output) })
Copy link
Contributor

@TomFinley TomFinley Aug 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new (string, string)[1] { (input, output) } [](start = 22, length = 43)

It's a params, right? So could we do this(env, (input, output)) instead of this long thing? #Resolved

@@ -925,18 +925,10 @@ protected override IDataView GetOverallResultsCore(IDataView overall)

private IDataView ChangeTopKAccColumnName(IDataView input)
{
var cpyArgs = new CopyColumnsTransform.Arguments
input = new CopyColumnsTransformer(Host, new[]
{
Copy link
Contributor

@TomFinley TomFinley Aug 24, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it's params, was it necessary to create this new array? Indeed since it is one item, could we not just have used the alternate non-params constructor? #Resolved

Ivan Matantsev added 2 commits August 27, 2018 11:15
Copy link
Contributor

@TomFinley TomFinley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@Ivanidzo4ka Ivanidzo4ka merged commit 4ba15b8 into dotnet:master Aug 27, 2018
@Zruty0 Zruty0 mentioned this pull request Aug 27, 2018
singlis added a commit to singlis/machinelearning that referenced this pull request Oct 3, 2018
  (dotnet#706)
This builds on the Estimator conversion for the CopyColumnsTransform.
This change is mainly refactoring as common code has moved to base level classes.

This change is the following:
       - CopyColumnTransform now derives from OneToOneTransformerBase
       - CopyColumnEstimator now derives from TrivialEstimator
       - CopyColumnTransform::Mapper now derives from MapperBase
       - Removed code that was no longer needed due to these changes
singlis added a commit that referenced this pull request Oct 4, 2018
) (#1141)

* Updating the CopyColumnsEstimator and Transform to use common code
  (#706)
This builds on the Estimator conversion for the CopyColumnsTransform.
This change is mainly refactoring as common code has moved to base level classes.

This change is the following:
       - CopyColumnTransform now derives from OneToOneTransformerBase
       - CopyColumnEstimator now derives from TrivialEstimator
       - CopyColumnTransform::Mapper now derives from MapperBase
       - Removed code that was no longer needed due to these changes

*  - Moved CopyColumnsTransform into Microsoft.ML.Transforms namespace,
 updated namespace usage and entrypoints due to this change.
 - Save now uses the SaveColumns from the base class
 - Other various changes based upon feedback.
@ghost ghost locked as resolved and limited conversation to collaborators Mar 29, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants