Skip to content

swapping the order or arguments on the constructors of the ConversionExtensionsCatalog. Internalizing the constructors. #2118

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

sfilipi
Copy link
Member

@sfilipi sfilipi commented Jan 11, 2019

Working towards resolving #2064 by swapping the order or arguments on the constructors of the ConversionExtensions Catalog. Internalizing the constructors.

@sfilipi sfilipi added the API Issues pertaining the friendly API label Jan 11, 2019
@sfilipi sfilipi added this to the 0119 milestone Jan 11, 2019
@sfilipi sfilipi self-assigned this Jan 11, 2019
@@ -39,7 +39,7 @@ public void TensorFlowTransforCifarEndToEndTest()
.Append(new ImagePixelExtractingEstimator(mlContext, "ImageCropped", "Input", interleave: true))
.Append(new TensorFlowEstimator(mlContext, model_location, new[] { "Input" }, new[] { "Output" }))
.Append(new ColumnConcatenatingEstimator(mlContext, "Features", "Output"))
.Append(new ValueToKeyMappingEstimator(mlContext, "Label"))
.Append(new ValueToKeyMappingEstimator(mlContext, "Label", "Label"))
Copy link
Member Author

@sfilipi sfilipi Jan 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

, "Label" [](start = 77, length = 9)

side effect of this change is that we cannot have the output column default to null, and therefore substitute it with the input column when not provided... #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's reverse that. We should retain the behavior and attitude that existed before the fiasco began, of the "name" (what was misidentified as "output") always being specified, and in the case where there is a single "source" (what was misidentified as "input"), it is taken as being the same as the "name."


In reply to: 246970621 [](ancestors = 246970621)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was marked as resolved, but I still see that we've retained this change.


In reply to: 246974415 [](ancestors = 246974415,246970621)

/// <param name="hashBits">Number of bits to hash into. Must be between 1 and 31, inclusive.</param>
/// <param name="invertHash">During hashing we constuct mappings between original values and the produced hash values.
/// Text representation of original values are stored in the slot names of the metadata for the new column.Hashing, as such, can map many initial values to one.
/// <paramref name="invertHash"/> specifies the upper bound of the number of distinct input values mapping to a hash that should be retained.
/// <value>0</value> does not retain any input values. <value>-1</value> retains all input values mapping to each hash.</param>
public HashingEstimator(IHostEnvironment env, string inputColumn, string outputColumn = null,
internal HashingEstimator(IHostEnvironment env, string inputColumn, string outputColumn,
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Jan 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

internal [](start = 8, length = 8)

why estimator constructor becomes internal? #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and shouldn't you swap output and input?


In reply to: 246971081 [](ancestors = 246971081)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Constructors become internal, so that users instantiate them only through the MlContext. See #2100


In reply to: 246971081 [](ancestors = 246971081)

@@ -92,24 +92,24 @@ public static KeyToValueMappingEstimator MapKeyToValue(this TransformsCatalog.Co
/// <param name="outputColumn">The name of the output column.</param>
/// <param name="bag">Whether bagging is used for the conversion. </param>
public static KeyToVectorMappingEstimator MapKeyToVector(this TransformsCatalog.ConversionTransforms catalog,
string inputColumn, string outputColumn = null, bool bag = KeyToVectorMappingEstimator.Defaults.Bag)
=> new KeyToVectorMappingEstimator(CatalogUtils.GetEnvironment(catalog), inputColumn, outputColumn, bag);
string outputColumn, string inputColumn, bool bag = KeyToVectorMappingEstimator.Defaults.Bag)
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Jan 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inputColumn [](start = 40, length = 11)

I'm not a Tom, and I can't speak from his name, but I don't remember him objecting on idea of one of column be optional, and I thinks it's perfectly fine to have second column nullable and replace it with first one. #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in fact that was our prior behavior.


In reply to: 246971850 [](ancestors = 246971850)

@Ivanidzo4ka
Copy link
Contributor

Ivanidzo4ka commented Jan 11, 2019

        => new ValueToKeyMappingEstimator(CatalogUtils.GetEnvironment(catalog), columns, file, termsColumn, loaderFactory);

I don't think this constructor and constructor for estimator which accepts term column and loader should exist. (you can just remove term column and loader)
First because we don't want to expose IDataLoader interface and IComponentFactory to user.
Second because we already have ValueMapping estimator which allow you provide custom mappings and just do mappings from file is subset of that functionality. #Pending


Refers to: src/Microsoft.ML.Data/Transforms/ConversionsExtensionsCatalog.cs:127 in c6961d9. [](commit_id = c6961d9, deletion_comment = False)

: this(env, new KeyToVectorMappingTransformer(env, columns))
{
}

public KeyToVectorMappingEstimator(IHostEnvironment env, string inputColumn, string outputColumn = null, bool bag = Defaults.Bag)
: this(env, new KeyToVectorMappingTransformer(env, new KeyToVectorMappingTransformer.ColumnInfo(inputColumn, outputColumn ?? inputColumn, bag)))
internal KeyToVectorMappingEstimator(IHostEnvironment env, string outputColumn, string inputColumn, bool bag = Defaults.Bag)
Copy link
Contributor

@TomFinley TomFinley Jan 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

outputColumn, string inputColumn [](start = 74, length = 32)

outputColumn, string inputColumn [](start = 74, length = 32)

Thank you @sfilipi! It was in a note somewhat down on the issue, but in a reply to @@justinormont where we were talking about consistency with existing argument names, I also observed that the names "inputs" and "outputs" are incorrect (since the inputs to outputs to estimators/transforms are not columns, but IDataViews), and should retain something closer the existing names we have for this name and source. Indeed, I think the bad naming was part of the reason some people got so badly confused in the first place. #Resolved

Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Jan 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But these ones are InputColumn and OutputColumn and estimator operate over columns in IDataView. What is wrong with them? And why name and source and not destination and source? name is such a broad term to use it.


In reply to: 246974093 [](ancestors = 246974093)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope i am not annoying and further fueling the conversation, but sharing my personal opinion:

1- I agree that the order of arguments should be swapped, and have destination, than source.
2- I find the name "name" for the output column confusing too. Yes it does make sense in the scenario, but you have to blink about 4 times before it does so; just because, as Ivan says, it is such a generic thing.
The names InputColumn, and OutputColumn make a lot more sense to me too, since they are actually column names..

The end!


In reply to: 246974845 [](ancestors = 246974845,246974093)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The term name and source is a well established name in this codebase. The concept of the input and output for a transform is already being established as being the dataviews themselves. Transforms do not take columns as inputs and outputs, they take data views as inputs and outputs -- that's obvious from the interface specification. So: let's jsut agree, the dataviews are the inputs and outputs, full stop. In the case where a transform happens to be producing an additional column, we have parameterized the transform as producing a column with that "name." And when that parameterization also happens to pay attention to one column, we often call that a "source." Always.

It's interesting that you have a different interpretation, and I'd like to emphasize that I don't discount it, but it is distinct from how we've architected the old IDataTransform and I don't like the new interpretation for the reasons already discussed at length.

Going forward, I think this shows how dangerous it is to introduce a new interpretation without discussing that new interpretation with anyone, discussing its merits, etc. I'm sorry that was done, but not sorry that we're fixing it.


In reply to: 247204046 [](ancestors = 247204046,246974845,246974093)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Transforms do not take columns as inputs and outputs, they take data views as inputs and outputs

Estimators during it construction doesn't take any dataview, they take columns which expected to be in dataview. We don't touch dataview till we call Fit. We do schema propagation for GetOutputSchema on SchemaShape.Column,
Only thing which take data views as inputs as far as I know is AppendDataViewRows transform.

The term name and source is a well established name in this codebase.

Internally, yes, mostly as legacy for command line. Legacy doesn't mean it's a great and useful for another task (Public Api).

let's jsut agree, the dataviews are the inputs and outputs, full stop.

I'm as user, don't see input columns and output columns in estimators as input and output dataviews. I think @sfilipi has the same opinion. I can't even come up with estimator/transformer which would produce multiple dataviews (maybe evaluator with metrics as dataviews)

So far we have two people who think name and source is not a great idea (especially name)
We do it not because we want to make you mad, we do it because we also care about project.

without discussing that new interpretation with anyone

Can we actually have discussion regarding name and source vs inputColumn and outputColumn ? I would love to hear other people opinion, don't you?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Estimators during it construction doesn't take any dataview, they take columns which expected to be in dataview. We don't touch dataview till we call Fit.

If you read what I wrote, you'll see I was explicitly talking about transformers, but estimators are similar. They don't take inputs and outputs as columns. They take inputs as dataviews (as you say), and they the output of that is a transformer. This, again, already covered in #2064.

We do schema propagation for GetOutputSchema on SchemaShape.Column,

You're mistaken. Check the IEstimator interface again.

SchemaShape GetOutputSchema(SchemaShape inputSchema);

As you see, again, inputs and outputs to this process are not columnar, they in fact are more global than this.

Internally, yes, mostly as legacy for command line. Legacy doesn't mean it's a great and useful for another task (Public Api).

Already covered in #2064... to summarize, you must justify inconsistencies. We should try to minimize inconsistencies. Just pointing out that sometimes we've judged it healthy that the two should be unaligned because they're in different settings is not a free pass to introduce whatever inconsistencies you want. Not only was this change unjustified, it has the flaws that I have identified.

Can we actually have discussion regarding name and source vs inputColumn and outputColumn ? I would love to hear other people opinion, don't you?

At some point, as again established in #2064, one of my responsibilities (which, just to emphasize, I do not enjoy) is to judge when discussions cease to become productive, so that we can actually move on with the work that must be done. While I'd love to have the luxury of enough free time to just endlessly circle around things like this, we can't. Let's move on.


In reply to: 248024251 [](ancestors = 248024251)

public ValueToKeyMappingEstimator(IHostEnvironment env, string inputColumn, string outputColumn = null, int maxNumTerms = Defaults.MaxNumTerms, ValueToKeyMappingTransformer.SortOrder sort = Defaults.Sort) :
this(env, new [] { new ValueToKeyMappingTransformer.ColumnInfo(inputColumn, outputColumn ?? inputColumn, maxNumTerms, sort) })
internal ValueToKeyMappingEstimator(IHostEnvironment env, string outputColumn, string inputColumn, int maxNumTerms = Defaults.MaxNumTerms, ValueToKeyMappingTransformer.SortOrder sort = Defaults.Sort) :
this(env, new [] { new ValueToKeyMappingTransformer.ColumnInfo(inputColumn, outputColumn, maxNumTerms, sort) })
Copy link
Contributor

@TomFinley TomFinley Jan 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inputColumn, outputColumn [](start = 74, length = 25)

It would be well to make sure we're consistent everywhere, and also swap these. #Resolved

@sfilipi
Copy link
Member Author

sfilipi commented Jan 13, 2019

    public static KeyToValueMappingEstimator MapKeyToValue(this TransformsCatalog.ConversionTransforms catalog, params (string input, string output)[] columns)

I will fix this in the next PR here, because this goes all the way down to the OneToOneTransformerBase, and that change will require touching everything .. #Pending


Refers to: src/Microsoft.ML.Data/Transforms/ConversionsExtensionsCatalog.cs:75 in c6961d9. [](commit_id = c6961d9, deletion_comment = False)

@sfilipi
Copy link
Member Author

sfilipi commented Jan 13, 2019

        => new ValueToKeyMappingEstimator(CatalogUtils.GetEnvironment(catalog), columns, file, termsColumn, loaderFactory);

Left the ctor, because it is used through the codebase; it;s internal. Removed the reference here.


In reply to: 453322670 [](ancestors = 453322670)


Refers to: src/Microsoft.ML.Data/Transforms/ConversionsExtensionsCatalog.cs:127 in c6961d9. [](commit_id = c6961d9, deletion_comment = False)

@sfilipi sfilipi force-pushed the trasformerEstimatorInputOutputSwap branch from c6961d9 to 0064b05 Compare January 13, 2019 08:24
/// <param name="input">Name of input column.</param>
/// <param name="output">Name of the column resulting from the transformation of <paramref name="input"/>. Null means <paramref name="input"/> is replaced.</param>
/// <param name="source">Name of column to work on.</param>
/// <param name="name">Name of the column resulting from the transformation of <paramref name="source"/>. Null means <paramref name="source"/> is replaced.</param>
Copy link
Contributor

@TomFinley TomFinley Jan 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Null means is replaced [](start = 118, length = 48)

Other way around. null on source means this. Also FYI, while I'm pleased the see that the parameters themselves are in the right order, the params tags themselves are not reversed. #Resolved

public readonly string Input;
public readonly string Output;
public readonly string Source;
public readonly string Name;
public readonly int HashBits;
Copy link
Contributor

@TomFinley TomFinley Jan 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we're at it, might want to put these in the right order. #Resolved

public readonly DataKind OutputKind;
public readonly KeyRange OutputKeyRange;

/// <summary>
/// Describes how the transformer handles one column pair.
/// </summary>
/// <param name="input">Name of input column.</param>
/// <param name="output">Name of output column.</param>
/// <param name="source">Name of input column.</param>
Copy link
Contributor

@TomFinley TomFinley Jan 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Name of input column [](start = 37, length = 20)

Similar comment to other file about ordering of tags, description, etc. #Resolved

/// <param name="outputKind">The expected type of the converted column.</param>
/// <param name="outputKeyRange">New key range if we work with key type.</param>
public TypeConvertingTransformer(IHostEnvironment env, string inputColumn, string outputColumn, DataKind outputKind, KeyRange outputKeyRange = null)
: this(env, new ColumnInfo(inputColumn, outputColumn, outputKind, outputKeyRange))
public TypeConvertingTransformer(IHostEnvironment env, string source, string name, DataKind outputKind, KeyRange outputKeyRange = null)
Copy link
Contributor

@TomFinley TomFinley Jan 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

string source, string name [](start = 63, length = 26)

This is not in proper order. #Pending

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will tackle the transformers in the next pr, because i have to change all of them, they all derive from OneToOneTransformer.

Agree, it is confusing because I did partial work, on renaming the args.


In reply to: 248078355 [](ancestors = 248078355)

@@ -141,7 +130,7 @@ public static KeyToValueMappingEstimator MapKeyToValue(this TransformsCatalog.Co
this TransformsCatalog.ConversionTransforms catalog,
IEnumerable<TInputType> keys,
IEnumerable<TOutputType> values,
params (string source, string name)[] columns)
params (string inputColumn, string outputColumn)[] columns)
Copy link
Contributor

@TomFinley TomFinley Jan 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inputColumn, string outputColumn [](start = 27, length = 32)

Remember, name, source, not source, name. And definitely not input, output. :) #Pending

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I think for conistency sake we want the column mapping to be at the head, even when we have to forgo the convenience of the params.


In reply to: 248079402 [](ancestors = 248079402)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will do this on the next PR, because the tuple is passed directly from the estimator -> transfomer-> OneToOnetransformer base class.

I agree that it is confusing to review, because i did partial work with renaming the tuple items.


In reply to: 248079667 [](ancestors = 248079667,248079402)

/// <param name="hashBits">Number of bits to hash into. Must be between 1 and 31, inclusive.</param>
/// <param name="invertHash">During hashing we constuct mappings between original values and the produced hash values.
/// Text representation of original values are stored in the slot names of the metadata for the new column.Hashing, as such, can map many initial values to one.
/// <paramref name="invertHash"/> specifies the upper bound of the number of distinct input values mapping to a hash that should be retained.
/// <value>0</value> does not retain any input values. <value>-1</value> retains all input values mapping to each hash.</param>
public HashingEstimator(IHostEnvironment env, string inputColumn, string outputColumn = null,
internal HashingEstimator(IHostEnvironment env, string outputColumn, string inputColumn = null,
Copy link
Contributor

@TomFinley TomFinley Jan 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

outputColumn [](start = 63, length = 12)

Another remnant of the bad input/output names. Also another case where the XML comments should be made less awkward, as noted elsewhere. #Resolved

: this(env, new KeyToVectorMappingTransformer(env, columns))
{
}

public KeyToVectorMappingEstimator(IHostEnvironment env, string inputColumn, string outputColumn = null, bool bag = Defaults.Bag)
: this(env, new KeyToVectorMappingTransformer(env, new KeyToVectorMappingTransformer.ColumnInfo(inputColumn, outputColumn ?? inputColumn, bag)))
internal KeyToVectorMappingEstimator(IHostEnvironment env, string outputColumn, string inputColumn = null, bool bag = Defaults.Bag)
Copy link
Contributor

@TomFinley TomFinley Jan 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

outputColumn, string inputColumn [](start = 74, length = 32)

Another name, source... #Resolved

public TypeConvertingEstimator(IHostEnvironment env,
string inputColumn, string outputColumn = null,
internal TypeConvertingEstimator(IHostEnvironment env,
string outputColumn, string inputColumn = null,
Copy link
Contributor

@TomFinley TomFinley Jan 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

outputColumn, string inputColumn [](start = 19, length = 32)

Here's another. I'm just going to stop commenting on these, maybe self-review, double check them? #Resolved

/// <param name="inputColumn">Name of the column to be transformed.</param>
/// <param name="outputColumn">Name of the output column. If this is null '<paramref name="inputColumn"/>' will be used.</param>
/// <param name="source">Name of the column to be transformed.</param>
/// <param name="name">Name of the column produced.</param>
Copy link
Contributor

@TomFinley TomFinley Jan 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/// Name of the column produced. [](start = 8, length = 59)

Reverse <param tags. #Resolved

/// <param name="inputColumn">Name of the column to be transformed.</param>
/// <param name="outputColumn">Name of the output column. If this is null '<paramref name="inputColumn"/>' will be used.</param>
/// <param name="source">Name of the column to be transformed.</param>
/// <param name="name">Name of the column produced.</param>
Copy link
Contributor

@TomFinley TomFinley Jan 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Name of the column produced [](start = 31, length = 27)

To my eyes, this phrasing is awkward. What was wrong with the existing descriptors we already used for this in existing tooling?

public abstract class SourceNameColumnBase
{
[Argument(ArgumentType.AtMostOnce, HelpText = "Name of the new column", ShortName = "name")]
public string Name;
[Argument(ArgumentType.AtMostOnce, HelpText = "Name of the source column", ShortName = "src")]
public string Source;

Maybe use that, with appropriate clarifications about what happens when source is null, and so on. #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This BTW is a general comment, I see lots of inconsistency in our documentation. Including descriptions of these columns as somehow being "inputs" and "outputs" which, as we've seen direct examples of, leads people to think about these structures in the wrong way.


In reply to: 248087222 [](ancestors = 248087222)

@@ -43,7 +43,7 @@ public void CV_Multiclass_Digits_RffTransform_OVAAveragedPerceptron()
var pipeline = mlContext.Transforms.Projection.CreateRandomFourierFeatures("Features", "FeaturesRFF")
.AppendCacheCheckpoint(mlContext)
.Append(mlContext.Transforms.Concatenate("Features", "FeaturesRFF"))
.Append(new ValueToKeyMappingEstimator(mlContext, "Label"))
.Append(new ValueToKeyMappingEstimator(mlContext, "Label", "Label"))
Copy link
Contributor

@TomFinley TomFinley Jan 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

, "Label" [](start = 69, length = 9)

Now that the source column has a default value again, from what I see, we can get rid of this, right? #Resolved

@@ -27,7 +27,7 @@ public void Metacomponents()
var sdcaTrainer = ml.BinaryClassification.Trainers.StochasticDualCoordinateAscent("Label", "Features", advancedSettings: (s) => { s.MaxIterations = 100; s.Shuffle = true; s.NumThreads = 1; });

var pipeline = new ColumnConcatenatingEstimator (ml, "Features", "SepalLength", "SepalWidth", "PetalLength", "PetalWidth")
.Append(new ValueToKeyMappingEstimator(ml, "Label"), TransformerScope.TrainTest)
.Append(new ValueToKeyMappingEstimator(ml, "Label", "Label"), TransformerScope.TrainTest)
Copy link
Contributor

@TomFinley TomFinley Jan 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

, "Label" [](start = 66, length = 9)

Similar here. Unneeded, or should be. #Resolved

@@ -39,7 +39,7 @@ void Extensibility()
};
var pipeline = new ColumnConcatenatingEstimator (ml, "Features", "SepalLength", "SepalWidth", "PetalLength", "PetalWidth")
.Append(new CustomMappingEstimator<IrisData, IrisData>(ml, action, null), TransformerScope.TrainTest)
.Append(new ValueToKeyMappingEstimator(ml, "Label"), TransformerScope.TrainTest)
.Append(new ValueToKeyMappingEstimator(ml, "Label", "Label"), TransformerScope.TrainTest)
Copy link
Contributor

@TomFinley TomFinley Jan 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

, "Label" [](start = 66, length = 9)

Same, unneeded. #Resolved

@@ -30,7 +30,7 @@ void DecomposableTrainAndPredict()
var data = ml.Data.ReadFromTextFile<IrisData>(dataPath, separatorChar: ',');

var pipeline = new ColumnConcatenatingEstimator (ml, "Features", "SepalLength", "SepalWidth", "PetalLength", "PetalWidth")
.Append(new ValueToKeyMappingEstimator(ml, "Label"), TransformerScope.TrainTest)
.Append(new ValueToKeyMappingEstimator(ml, "Label", "Label"), TransformerScope.TrainTest)
Copy link
Contributor

@TomFinley TomFinley Jan 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

, "Label" [](start = 66, length = 9)

Same, unneeded. #Resolved

@TomFinley
Copy link
Contributor

TomFinley commented Jan 15, 2019

        public ColumnInfo(string input, string output,

This one was missed. #Pending


Refers to: src/Microsoft.ML.Transforms/OneHotHashEncoding.cs:230 in 520cb79. [](commit_id = 520cb79, deletion_comment = False)

@TomFinley
Copy link
Contributor

TomFinley commented Jan 15, 2019

            var binaryCols = new List<(string input, string output)>();

While less essential since this is a field in an method that is not even privately visible, we've seen the mischief caused by this conflation of "input" and "output" with things that are neither inputs nor otuputs, and the ordering has all sorts of confusing far reaching consequences by confusing some people. Consider making this consistent. #Pending


Refers to: src/Microsoft.ML.Transforms/OneHotHashEncoding.cs:273 in 520cb79. [](commit_id = 520cb79, deletion_comment = False)

@TomFinley
Copy link
Contributor

            var binaryCols = new List<(string input, string output)>();

Obviously far less essential than getting the public surface area right.


In reply to: 454578490 [](ancestors = 454578490)


Refers to: src/Microsoft.ML.Transforms/OneHotHashEncoding.cs:273 in 520cb79. [](commit_id = 520cb79, deletion_comment = False)

@TomFinley
Copy link
Contributor

TomFinley commented Jan 15, 2019

        string outputColumn = null, OneHotEncodingTransformer.OutputKind outputKind = Defaults.OutKind)

Another one missed. #Pending


Refers to: src/Microsoft.ML.Transforms/OneHotEncoding.cs:218 in 520cb79. [](commit_id = 520cb79, deletion_comment = False)

@TomFinley
Copy link
Contributor

TomFinley commented Jan 15, 2019

            IReadOnlyDictionary<PipelineColumn, string> outputNames,

Hmmm. I wonder if I indirectly bear some fault for the confusion, since in my static API, I do consider the operations to be on columns, as opposed to IEstimator/ITransformer which considers operations to be on dataviews/schemas and whatnot. Not something to change, just something that amused me. #Closed


Refers to: src/Microsoft.ML.StaticPipe/TransformsStatic.cs:578 in 520cb79. [](commit_id = 520cb79, deletion_comment = False)

public readonly string Input;
public readonly string Output;
public readonly string Source;
public readonly string Name;
Copy link
Contributor

@TomFinley TomFinley Jan 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Name first. #Resolved

@@ -114,10 +114,10 @@ public ColumnInfo(string input, string output = null, bool bag = KeyToVectorMapp
public IReadOnlyCollection<ColumnInfo> Columns => _columns.AsReadOnly();
private readonly ColumnInfo[] _columns;

private static (string input, string output)[] GetColumnPairs(ColumnInfo[] columns)
private static (string source, string name)[] GetColumnPairs(ColumnInfo[] columns)
Copy link
Contributor

@TomFinley TomFinley Jan 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

source, string name [](start = 31, length = 19)

Probably best to make consistent as we're trying to do elsewhere, even if it is private. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will address it in my next PR, because they get used in the Transformer, which passes it to the base class OneToOnetransformer. Once that changes everything needs to change.


In reply to: 248091617 [](ancestors = 248091617)

@@ -31,16 +31,16 @@ public static void KeyToValue_Term()
// making use of default settings.
string defaultColumnName = "DefaultKeys";
// REVIEW create through the catalog extension
var default_pipeline = new WordTokenizingEstimator(ml, "Review")
.Append(new ValueToKeyMappingEstimator(ml, "Review", defaultColumnName));
var default_pipeline = mlContext.Transforms.Text.TokenizeWords("Review")
Copy link
Contributor

@TomFinley TomFinley Jan 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

default_pipeline [](start = 16, length = 16)

Ugh. Our samples have been written with Python variable naming conventions, not C# ones? I'll have to file an issue about this... #2155. #WontFix

@@ -76,7 +76,7 @@ public void MetacomponentsFeaturesRenamed()

var sdcaTrainer = new SdcaBinaryTrainer(Env, "Label", "Vars", advancedSettings: (s) => { s.MaxIterations = 100; s.Shuffle = true; s.NumThreads = 1; });
var pipeline = new ColumnConcatenatingEstimator(Env, "Vars", "SepalLength", "SepalWidth", "PetalLength", "PetalWidth")
.Append(new ValueToKeyMappingEstimator(Env, "Label"), TransformerScope.TrainTest)
.Append(new ValueToKeyMappingEstimator(Env, "Label", "Label"), TransformerScope.TrainTest)
Copy link
Contributor

@TomFinley TomFinley Jan 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

, "Label" [](start = 67, length = 9)

Another unnecessary. #Resolved

@@ -193,7 +193,7 @@ private TextLoader.Arguments GetIrisLoaderArgs()
}
}).Read(GetDataPath(IrisDataPath));

var pipeline = new ValueToKeyMappingEstimator(Env, "Label");
var pipeline = new ValueToKeyMappingEstimator(Env, "Label", "Label");
Copy link
Contributor

@TomFinley TomFinley Jan 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

, "Label" [](start = 70, length = 9)

Another unnecessary. #Resolved

@@ -249,7 +249,7 @@ public void NgramWorkout()
.Read(sentimentDataPath);

var est = new WordTokenizingEstimator(Env, "text", "text")
.Append(new ValueToKeyMappingEstimator(Env, "text", "terms"))
.Append(new ValueToKeyMappingEstimator(Env, "terms", "text"))
.Append(new NgramExtractingEstimator(Env, "terms", "ngrams"))
Copy link
Contributor

@TomFinley TomFinley Jan 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NgramExtractingEstimator [](start = 28, length = 24)

Hmmm. This didn't get fixed? #Pending

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not converting the NGramExtracting Estimator on this PR. It will come in the next PR.


In reply to: 248095484 [](ancestors = 248095484)

@@ -249,7 +249,7 @@ public void NgramWorkout()
.Read(sentimentDataPath);

var est = new WordTokenizingEstimator(Env, "text", "text")
.Append(new ValueToKeyMappingEstimator(Env, "text", "terms"))
.Append(new ValueToKeyMappingEstimator(Env, "terms", "text"))
.Append(new NgramExtractingEstimator(Env, "terms", "ngrams"))
.Append(new NgramHashingEstimator(Env, "terms", "ngramshash"));
Copy link
Contributor

@TomFinley TomFinley Jan 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NgramHashingEstimator [](start = 28, length = 21)

Nor this? #Pending

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not converting the NGramHashing Estimator on this PR. It will come in the next PR.


In reply to: 248095509 [](ancestors = 248095509)

@TomFinley
Copy link
Contributor

TomFinley commented Jan 15, 2019

        public ColumnInfo(string input, string output = null,

Another. #Pending


Refers to: src/Microsoft.ML.Transforms/OneHotEncoding.cs:192 in 520cb79. [](commit_id = 520cb79, deletion_comment = False)

@sfilipi
Copy link
Member Author

sfilipi commented Jan 16, 2019

        public ColumnInfo(string input, string output = null,

I did write a comment that i am not swapping for the Transformers on this PR. the column info is shared between the estimators and the transformers.

The reason I am not swapping for the transformers, is because they all pass the columns to base - OneToOneTransformer; and if you swap them in there, you have to do all the transforms. This was meant to be a small PR to just test understanding of the changes, reach agreement. I am tackling all of them in the next PR.


In reply to: 454584468 [](ancestors = 454584468)


Refers to: src/Microsoft.ML.Transforms/OneHotEncoding.cs:192 in 520cb79. [](commit_id = 520cb79, deletion_comment = False)

@sfilipi
Copy link
Member Author

sfilipi commented Jan 16, 2019

        string outputColumn = null, OneHotEncodingTransformer.OutputKind outputKind = Defaults.OutKind)

I am not doing all the estimators in this PR; just the one for the ConversionsExtensionsCatalog. The reason this file is here, is because of the changes in KeyToVector.


In reply to: 454578807 [](ancestors = 454578807)


Refers to: src/Microsoft.ML.Transforms/OneHotEncoding.cs:218 in 520cb79. [](commit_id = 520cb79, deletion_comment = False)

@sfilipi
Copy link
Member Author

sfilipi commented Jan 16, 2019

        public ColumnInfo(string input, string output,

I am not doing the OneHotHash in the PR. The changes below are for Hashing.


In reply to: 454577719 [](ancestors = 454577719)


Refers to: src/Microsoft.ML.Transforms/OneHotHashEncoding.cs:230 in 520cb79. [](commit_id = 520cb79, deletion_comment = False)

@sfilipi
Copy link
Member Author

sfilipi commented Jan 16, 2019

            var binaryCols = new List<(string input, string output)>();

I will do it in the next PR, where the order gets swapped on the OneHotEncodingTransformer.

This PR has not addressed that component.


In reply to: 454578613 [](ancestors = 454578613,454578490)


Refers to: src/Microsoft.ML.Transforms/OneHotHashEncoding.cs:273 in 520cb79. [](commit_id = 520cb79, deletion_comment = False)

@sfilipi sfilipi force-pushed the trasformerEstimatorInputOutputSwap branch from 520cb79 to 60ab6cd Compare January 16, 2019 18:28
@sfilipi
Copy link
Member Author

sfilipi commented Jan 18, 2019

@glebuk, @TomFinley, @Ivanidzo4ka would you have time for a second look? Thanks.

@sfilipi
Copy link
Member Author

sfilipi commented Jan 25, 2019

Closing this since the work got also done in: #2239

@sfilipi sfilipi closed this Jan 25, 2019
@sfilipi sfilipi deleted the trasformerEstimatorInputOutputSwap branch January 25, 2019 01:09
@ghost ghost locked as resolved and limited conversation to collaborators Mar 25, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
API Issues pertaining the friendly API
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants