-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Implementing copy column estimator #706
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implementing copy column estimator #706
Conversation
Ivanidzo4ka
commented
Aug 21, 2018
- @Zruty0
// *** Binary format *** | ||
// int: number of added columns | ||
// for each added column | ||
// int: id of output column name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int: id [](start = 16, length = 8)
string, not int #Closed
@@ -45,6 +45,12 @@ public Column(string name, VectorKind vecKind, DataKind itemKind, bool isKey, st | |||
IsKey = isKey; | |||
MetadataKinds = metadataKinds; | |||
} | |||
|
|||
public Column CloneWithNewName(string newName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CloneWithNewName [](start = 26, length = 16)
maybe this calls for mutable Column object?
If we don't do this, I guess you just have to write line 52 everywhere :) #Closed
throw Contracts.ExceptUserArg(nameof(columns), $"New column {column.Name} specified multiple times"); | ||
newNames.Add(column.Name); | ||
} | ||
_columns = columns; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
columns.Select(x=>x.Name).Distinct().Count
#Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private readonly IHost _host; | ||
|
||
public CopyColumnsEstimator(IHostEnvironment env, string input, string output) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{ [](start = 8, length = 1)
Check env, then register host. then host.Check()
everywhere #Closed
_host = env.Register("CopyColumnsEstimator"); | ||
} | ||
|
||
public CopyColumnsEstimator(IHostEnvironment env, (string Source, string Name)[] columns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CopyColumnsEstimator [](start = 15, length = 20)
params? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make these a params (Source, Name)[]
argument
In reply to: 212126225 [](ancestors = 212126225,211790834)
|
||
public CopyColumnsTransformer Fit(IDataView input) | ||
{ | ||
// invoke schema validation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
invoke [](start = 15, length = 6)
Invoke #Closed
|
||
public SchemaShape GetOutputSchema(SchemaShape inputSchema) | ||
{ | ||
Contracts.CheckValue(inputSchema, nameof(inputSchema)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Contracts [](start = 12, length = 9)
_host #Closed
public SchemaShape GetOutputSchema(SchemaShape inputSchema) | ||
{ | ||
Contracts.CheckValue(inputSchema, nameof(inputSchema)); | ||
Contracts.CheckValue(inputSchema.Columns, nameof(inputSchema.Columns)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Columns [](start = 45, length = 7)
No need. A valid schema will always have columns #Closed
throw Contracts.ExceptUserArg(nameof(inputSchema), $"{pair.source} not found in {nameof(inputSchema)}"); | ||
} | ||
} | ||
var result = new SchemaShape(originDic.Values.ToArray()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
originDic [](start = 41, length = 9)
the other one, right? #Closed
{ | ||
Contracts.CheckValue(inputSchema, nameof(inputSchema)); | ||
Contracts.CheckValue(inputSchema.Columns, nameof(inputSchema.Columns)); | ||
var originDic = inputSchema.Columns.ToDictionary(x => x.Name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
originDic [](start = 16, length = 9)
this is all supplanted by schema.FindColumn
#Closed
|
||
} | ||
|
||
public Delegate[] CreateGetters(IRow input, Func<int, bool> activeOutput, out Action disposer) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
input [](start = 49, length = 5)
need to assert that input.Schema matches _schema
#Closed
|
||
private Delegate MakeGetter<T>(IRow row, int src) => row.GetGetter<T>(src); | ||
|
||
public Func<int, bool> GetDependencies(Func<int, bool> activeOutput) => (col) => (activeOutput(col) && _originalColumnSources.Contains(col)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GetDependencies [](start = 35, length = 15)
this is wrong: the argument to activeOutput is between 0 and GreateGetters().Length
. The argument to the resulting Func is between 0 and _schema.ColumnCount
#Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect you use _colMap
as old-to-new mapping, but in reality you populate the other way round.
In reply to: 211792129 [](ancestors = 211792129)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public RowMapperColumnInfo[] GetOutputColumns() | ||
{ | ||
var result = new RowMapperColumnInfo[_columns.Length]; | ||
// Do I need return all columns or only output??? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
output [](start = 56, length = 6)
only output #Closed
for (int i = 0; i < _columns.Length; i++) | ||
{ | ||
_schema.TryGetColumnIndex(_columns[i].source, out int colIndex); | ||
// how to get metadata? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's add a REVIEW for now, that we want metadata to become IRow.
For now, pre-populate this output in ctor by pulling metadata from src schema #Closed
replace this with a Refers to: src/Microsoft.ML.Data/Transforms/CopyColumnsTransform.cs:18 in 039653c. [](commit_id = 039653c, deletion_comment = False) |
@@ -18,6 +21,9 @@ | |||
[assembly: LoadableClass(CopyColumnsTransform.Summary, typeof(CopyColumnsTransform), null, typeof(SignatureLoadDataTransform), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LoadableClass [](start = 11, length = 13)
remove altogether
Add loadable class for the mapper instead #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{ | ||
failed = true; | ||
} | ||
Assert.True(failed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assert [](start = 16, length = 6)
can you just make it Assert.True(false)
inside the try block? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I want to validate what it actually failed. Your suggestion applicable if I want validate absence of failure.
In reply to: 211795480 [](ancestors = 211795480)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reject your rejection. Just put Assert.False() inside the try (not catch) block after the code that is supposed to throw
In reply to: 212474277 [](ancestors = 212474277,211795480)
[Fact] | ||
void TestBadTransformSchmea() | ||
{ | ||
var data = new List<TestClass>() { new TestClass() { A = 1, B = 2, C = 3, }, new TestClass() { A = 4, B = 5, C = 6 } }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new [](start = 23, length = 3)
can new[]
work here? #Closed
|
||
namespace Microsoft.ML.Tests | ||
{ | ||
public class CopyColumnEstimatorTests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CopyColumnEstimatorTests [](start = 17, length = 24)
The transform behavior is checked in TestDataPipeBase.TestCore
. So we need to only check the estimator behavior: that we throw at schema errors for example.
I expect we need to create an estimator workout test as well, similar to TestCore
. Not in this PR though.
#Closed
throw _host.ExceptParam(nameof(schema), $"{_columns[i].Source} not found in {nameof(schema)}"); | ||
} | ||
_colNewToOldMapping.Add(i, colIndex); | ||
_colOldToNewMapping.Add(i, colIndex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_colOldToNewMapping.Add(i, colIndex); [](start = 16, length = 37)
I suppose you meant the reverse?
Also note that the same source can map to more than one target. I think it's better to have the new-to-old only (which is one to one), and build a bool array of dependencies inside GetDependencies #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/Microsoft.ML/CSharpApi.cs
Outdated
var list = Column == null ? new List<Microsoft.ML.Transforms.CopyColumnsTransformColumn>() : new List<Microsoft.ML.Transforms.CopyColumnsTransformColumn>(Column); | ||
list.Add(OneToOneColumn<Microsoft.ML.Transforms.CopyColumnsTransformColumn>.Create(inputColumn)); | ||
var list = Column == null ? new List<Microsoft.ML.Transforms.CopyColumnsTransformerColumn>() : new List<Microsoft.ML.Transforms.CopyColumnsTransformerColumn>(Column); | ||
list.Add(OneToOneColumn<Microsoft.ML.Transforms.CopyColumnsTransformerColumn>.Create(inputColumn)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CopyColumnsTransformerColumn [](start = 64, length = 28)
This is a little unfortunate, maybe there is a way we can avoid this rename, since this is still our public facing API at least in the immediate term. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{ | ||
if (newNames.Contains(column.Name)) | ||
throw Contracts.ExceptUserArg(nameof(columns), $"New column {column.Name} specified multiple times"); | ||
newNames.Add(column.Name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that HashSet<>.Add
will itself return false
if the element was already present in the set. Therefore this check can be written more efficiently as:
if (newNames.Add(column.Name)
throw ...;
``` #Resolved
{ | ||
} | ||
|
||
public CopyColumnsEstimator(IHostEnvironment env, params (string Source, string Name)[] columns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Source [](start = 73, length = 6)
From what I've seen in MSDN documentation on ValueTuple
, while it's a little strange they seem to eschew normal .NET naming conventions on objects when used in this context and instead cleave to .NET naming conventions on parameters, since despite being a public member name it is being used kinda as a parameter name. So maybe source
and name
? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe RoboTom should be updated to be ValueTuple
aware... not in this PR to be clear, just sometime.
In reply to: 212752862 [](ancestors = 212752862)
private readonly IHost _host; | ||
|
||
public CopyColumnsEstimator(IHostEnvironment env, string input, string output) : | ||
this(env, new (string, string)[1] { (input, output) }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new (string, string)[1] { (input, output) } [](start = 22, length = 43)
It's a params
, right? So could we do this(env, (input, output))
instead of this long thing? #Resolved
@@ -925,18 +925,10 @@ protected override IDataView GetOverallResultsCore(IDataView overall) | |||
|
|||
private IDataView ChangeTopKAccColumnName(IDataView input) | |||
{ | |||
var cpyArgs = new CopyColumnsTransform.Arguments | |||
input = new CopyColumnsTransformer(Host, new[] | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since it's params
, was it necessary to create this new array? Indeed since it is one item, could we not just have used the alternate non-params
constructor? #Resolved
+ some pr comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(dotnet#706) This builds on the Estimator conversion for the CopyColumnsTransform. This change is mainly refactoring as common code has moved to base level classes. This change is the following: - CopyColumnTransform now derives from OneToOneTransformerBase - CopyColumnEstimator now derives from TrivialEstimator - CopyColumnTransform::Mapper now derives from MapperBase - Removed code that was no longer needed due to these changes
) (#1141) * Updating the CopyColumnsEstimator and Transform to use common code (#706) This builds on the Estimator conversion for the CopyColumnsTransform. This change is mainly refactoring as common code has moved to base level classes. This change is the following: - CopyColumnTransform now derives from OneToOneTransformerBase - CopyColumnEstimator now derives from TrivialEstimator - CopyColumnTransform::Mapper now derives from MapperBase - Removed code that was no longer needed due to these changes * - Moved CopyColumnsTransform into Microsoft.ML.Transforms namespace, updated namespace usage and entrypoints due to this change. - Save now uses the SaveColumns from the base class - Other various changes based upon feedback.