Estimator arguments should take output column name as first parameter, any inputs as subsequent parameters #2064
Semi-related: We may also want to take a pass at ensuring the parameter names are consistent, as I expect this to be difficult to correct after v1.0. I assume the work would be creating a large spreadsheet with rows of components and columns of parameter names (and ensuring we don't create a new column needlessly).
I would like to disagree (with the idea of output first, inputs later).

I tend to believe the reason behind name -> source is the command line, which forces you to express yourself in a kind of whiteboard way, so it's somewhat natural to write y = func(x1, x2, x3), where you specify y first and then you specify the rest. For a programmatic API in C#, I find the idea of writing the output parameter first, and then everything else, bizarre and against nature. I look at transformers as functions on top of data.
Hi @Ivanidzo4ka! Thanks for your feedback. I am not sure exactly what you mean. Are we talking about the same language? Indeed, let's take your precise example. The signature of that is very clear: `public delegate TResult Func<in T1, in T2, in T3, in T4, in T5, out TResult>(T1 arg1, T2 arg2, T3 arg3, T4 arg4, T5 arg5);`. But when such a function is actually used, I see `int a = b + 2;`, with the result named first, not last.

Indeed, most languages I can think of put the name first when assigning variables (F# with its `|>` pipelining being a rare counterexample). This is of course assuming I accept the idea that we have to be analogous to a programming language, which I'm not sure I do; but if I did, it certainly appears at first glance to be utterly disastrous to your case. But in any case, generally I think the most important information should go first. What I want to know out of every transform is "what are you making, and how are you making it," not "how are you making something, and what is it called?" This is in addition to the advantages already enumerated originally, the consistency argument, etc. There is also the fact that, when transforms take multiple inputs, many times those auxiliary helper inputs are optional (so should have default values), which pushes them toward the end of the parameter list in C# anyway; see the illustration below.
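To make that last constraint concrete, a minimal illustration of the C# rule itself (not ML.NET code; the names are invented for the example):

```csharp
// An optional parameter cannot be followed by a required one, so an
// "inputs (some optional), then output" ordering does not even compile:
class Illustration
{
    static void MakeColumn(string input, string auxInput = null, string output) { }
    // error CS1737: Optional parameters must appear after all required parameters
}
```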
@justinormont that's a good point actually, even for this. (I mean, you're right in the broader sense, but I'm just going to stay on topic.) Not only did we screw up the ordering of the column names, we even screwed up their names. For years, even with one-to-one transforms, these have been named `name` and `source`. We have changed this to `input` and `output`. So in addition to changing the order, these parameters should be named, as they are literally everywhere else, `name` and `source`.
But this is my point. It's not always about a "source" and a "destination." There is almost always a destination, but the "source" is sometimes indeterminate, multiple, etc. We did not settle on this convention by accident. You'll note also that those APIs are all about filling in values in an already allocated buffer. This is not that. It's more like, "we're declaring this new slot with such-and-such a name, and here is how it gets filled."
This is why I tried to bring up Func as an example, but I guess I wasn't clear.
Thank you @Ivanidzo4ka for trying to explain to me again. I think I do understand the example; I am trying to explain gently that it points in a different direction than you suppose. I'll go over a separate point, distinct from my original one, in case it helps illustrate why. Let's actually look at a few of these extension methods, in particular the conversion catalog extensions. Since this involves an enumeration and examination, this response will be somewhat long, I'm sorry to say. I titled the examples so they're easy to browse, and put my conclusion at the thing titled "Conclusion" more towards the end.

Simple Example 1

The first one I see is
machinelearning/src/Microsoft.ML.Data/Transforms/ConversionsExtensionsCatalog.cs Lines 30 to 31 in c8755a2
I do not see the input parameters, followed by an output. Instead, I see one input parameter, followed by the output, followed by two more input parameters.

Simple Example 2

machinelearning/src/Microsoft.ML.Data/Transforms/ConversionsExtensionsCatalog.cs Lines 66 to 67 in c8755a2
One input parameter, the output parameter, followed by one more input parameter.

Complex Example 1

Actually, now that I'm looking, there are a few more here that are worthy of inspection in this file; despite not directly fitting my pattern, they indirectly do if we look at them. They also indirectly illustrate the practical reasons why we arrived, years ago, at the pattern that we did. These are, if anything, even more confused, since they have neither input nor output first! This is one. You will see we have the column definition first, which happens to be a `ColumnInfo`:
machinelearning/src/Microsoft.ML.Data/Transforms/ValueToKeyMappingTransformer.cs Lines 184 to 188 in c8755a2
So this again has the pattern of input, followed by output, followed by four more inputs. The same confused mixing. But let's return again to the originally linked code... we have more parameters. What are they? The name of the input file, followed by the name of the input column from that file, followed by how to load this file. So we have still more inputs beyond the nominal input column. Not only are we drawing an artificial distinction between the input column and other inputs to the estimator, we are not even being consistent with the input columns themselves.
machinelearning/src/Microsoft.ML.Data/Transforms/ConversionsExtensionsCatalog.cs Lines 122 to 124 in c8755a2
Complex Example 2

machinelearning/src/Microsoft.ML.Data/Transforms/ConversionsExtensionsCatalog.cs Lines 140 to 144 in c8755a2
Here we've arrived at something else. This one actually fulfills the pattern you claimed you wanted, in that the output is actually specified last!! Let me ask you this. Do you like it? 😄 Is it consistent with anything? Would we want anything to be consistent with it?

Simple Example 3
This one actually fits your pattern, but only by virtue of having no parameterization except the name of the input, which is far from typical. There are a few more in here, but they're somewhat similar and confused (keep and drop, which are kind of special in the sense that they describe only inputs, or only outputs, depending on how you look at it; also concatenate, which is one of the few conversions that actually wound up being correct, and which sparked this conversation).

Simple Examples Batch 4

Next up, normalizers. Only one here that directly fulfills the pattern.
machinelearning/src/Microsoft.ML.Data/Transforms/NormalizerCatalog.cs Lines 30 to 32 in c8755a2
Same confused pattern. Input, output, input. machinelearning/src/Microsoft.ML.Transforms/Text/TextCatalog.cs Lines 45 to 48 in c8755a2
Ooo. Inputs (as an `IEnumerable` this time):
machinelearning/src/Microsoft.ML.Transforms/Text/TextCatalog.cs Lines 59 to 62 in c8755a2
Same confused pattern. Input, output, input.

Conclusion

I hope you don't mind; I had intended to keep going, but it's the kids' bedtime, I think by this point it's getting fairly redundant, and I'm starting to belabor the point. So this is what you said:
That's fine. But if I were to actually put the destination last, past all other things we could reasonably call inputs, then the result would not be what we really want, since many of these "functional inputs" have default values, etc. You seem to favor neither approach. Instead, what you favor, to judge from your vigorous defense, is something chimerical.
As we see in the original text of the issue, this is somewhat reversed. This new pattern, introduced a few months ago, is inconsistent with our longstanding practices. The mail is informative in pointing out the discrepancy and other issues (which we must fix), but we must decide how. And sometimes (in fact, always) specific suggestions made on the spur of the moment have to be carefully considered to determine whether they are in fact wise. That this blunder got repeated so many times that it somehow got established as "the way" by some people is unfortunate, but, again, it must be fixed. I hope it's obvious that I really value your input @Ivanidzo4ka and @stephentoub. You are enormously helpful. I generally find myself in agreement with you on most things, and I hope one disagreement doesn't put you off. But, unpleasant as I find it to do so, I can't agree in this case. I'm sorry.
If you look at all those cases: I've checked the examples you linked, and all of them follow the input-then-output order. I've looked into the links provided by @stephentoub and they either use the same pattern or something equivalent to it.

I was just trying to point out that I don't see a pattern where people specify the output result first.

Again, let me disagree: we don't have longstanding practices of developing a good public API. I agree we should have a consistent API; I just disagree with the "output first, everything else later" approach.

I hope I correctly depicted your main reasoning behind having the output column first and the input columns second: you find it more practical, and this is why you want the output to be first. So basically we are in a holy-war state, little-endians vs. big-endians.
So in your first version of the argument there are "inputs" and "outputs." Then I point out it is not so. Now there are "inputs" and "outputs" and "parameters." So let me ask, what is an "input?" What is a "parameter?" Can you define it for me, precisely and clearly in a way I can apply and explain in all cases? I repeat again the lesson taught in my prior "Complex Example 1"... machinelearning/src/Microsoft.ML.Data/Transforms/ConversionsExtensionsCatalog.cs Lines 122 to 126 in c8755a2
So, in this case, the "inputs" vs. "parameters," I guess, are that the things defining the terms, with their file and column names, are "parameters." How is this decided? What is the correct split, then? Is the file name an "input" or a "parameter"? You see, I hope, why I don't find this idea of "inputs" and "parameters" convincing. It is true certainly that sometimes the API must differ from the tool that's been around for years. I have argued this myself when it makes sense, on many occasions. An API is a different environment after all; some problems have different natural solutions given that we're not writing a GUI or command line tool. But usually, you'll note that when we have a disparity, we actually provide reasons. In this case, it is unclear to me that the reasons why we did things the way we've done for years have changed. Certainly I do not find merely mentioning that we sometimes do it sufficient justification. Also, not to put too fine a point on it, but I did explicitly already talk about how futile it is to try to draw a sharp line between "some inputs go before the result, some inputs go after."
Putting stuff into an already established (and, yes, named) buffer was not an analogy I found compelling. It doesn't reflect what's going on at all. Far closer is creating something new and giving it a name. In such APIs, I almost always see that the name of the created thing comes before anything related to its value or parameterization. So: you're not providing an output. The output does not exist yet. You're creating it, right then. You're naming it. In any such API I can find where you're naming something and then providing how it is defined, it is usually the name of the new item that comes first, then the content. I mean, even in the most basic data structures. Most people call them "key-value" and not "value-key" stores for a reason. Right? Plus, of course, I already pointed out that every C-derived language puts the name before the value as part of its basic syntax; a small illustration follows below. (Please, I know that I write somewhat voluminously, but every last thing you're bringing up I've already covered. I know, I know, "tl;dr", but please do. I'm spending so much time on this because I take it seriously.)
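A minimal, self-contained illustration of those two name-first conventions (plain C#, nothing ML.NET-specific):

```csharp
using System.Collections.Generic;

class NameFirstConventions
{
    static void Demo()
    {
        // C-derived declaration syntax: the new name comes first, its value second.
        int count = 2 + 3;

        // Key-value stores: the key (the "name") comes first, the value second.
        var ages = new Dictionary<string, int>
        {
            ["Alice"] = 34,
        };
    }
}
```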
First you try to avoid the conflict in the first place, by trying to reach consensus. This happens, of course, 99% of the time. In the remaining 1%, where someone is insistent in the face of all argument, ultimately, in this case as in all cases, we must make a single decision, since the code cannot be in superposition. A choice must be made. So, I try to be as gentle and polite as I can, explaining the decision, and continue to try to reach consensus. (Again, I invite you to read my arguments.) If it still doesn't work, well... we obviously can't continue forever, so the hard decision is made. We've done this on some other occasions: with the entry-point based API, the "predicting model" of the v0.1 API, the idea that mutable estimators are a good idea. This one thing isn't the first, and I'm sure won't be the last. I always find it deeply unpleasant, but, what can you do.
Leaving the order of arguments aside, let's discuss column argument names. I've always found the current default of source/name to be quite confusing. Let's look at the current state:

```csharp
ColumnCopyingEstimator(string input, string output)
ColumnCopyingEstimator(params (string source, string name)[] columns)

NormalizingEstimator Normalize(... params (string input, string output)[] columns)
NormalizingEstimator Normalize(... params NormalizingEstimator.ColumnBase[] columns)
/* where the ctor is: */ ColumnBase(string input, string output)
```

We have at least three ways to declare inputs and outputs: via strings, via tuples, and via Column classes. We should come up with a solution that is readable and consistent across all those usages. However, what are the alternatives? Thoughts?
You will note that all of these things are, each last one of them, estimators. The fact that estimators diverged from our already established pattern is the subject of this issue. We already have a solution ready to hand: make this new estimator surface rejoin the pattern everything else already follows.
Again, this discussion concerns machinelearning/src/Microsoft.ML.Core/Data/IEstimator.cs Lines 300 to 313 in 1b6b3c3
You will note two methods: one for schema propagation prior to fitting, and one for actually fitting. One takes the input's schema (or more precisely, the more provisional parallel structure, a "schema shape") to perform preliminary schema validation and propagation; that argument is named `inputSchema`. The other takes the actual data to fit on, and that one is named, simply, `input`. I mean, it's right there in the interface signature of the structure that is the subject of this discussion, right?
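A paraphrased sketch of the interface under discussion; the member shapes follow the description above rather than the exact source text:

```csharp
public interface IEstimator<out TTransformer>
    where TTransformer : ITransformer
{
    // Preliminary schema validation/propagation, prior to fitting.
    SchemaShape GetOutputSchema(SchemaShape inputSchema);

    // Actually fitting: takes the input data itself.
    TTransformer Fit(IDataView input);
}
```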
Am I to understand that 3 is your first choice and 1 is your most disfavored choice? You would favor renaming what we currently call "name" to "destination." In fact, that one change alone would shift your opinion from liking the solution we are pursuing least to liking it best. In fact, that would be absolutely perfectly clear in your head.

Let's be clear. These estimators are, in some sense, creating new columns. Everywhere* that we talk about columns, what is the primary identifier for a column? Always "name." In our schema types and loader column definitions alike, it is "name." Let's not even talk about our code, but all similar code like SQL and data frames and Parquet files. What do they call the string identifier they use to reference a column? The column's name. Not only are we internally consistent, but we are consistent with longstanding usage of this sort of structure. But suddenly, now, after years of calling this thing "name" everywhere, we have suddenly had an epiphany that in estimators specifically (and only there, apparently) calling it "name" will no longer do?

You also asked for thoughts, but I will express them privately.
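To ground the "name is the primary identifier" point, a from-memory sketch of the 0.x-era schema lookups (exact member names may differ across versions):

```csharp
// Wherever a column is referenced by string, that string is called its *name*.
if (data.Schema.TryGetColumnIndex("Features", out int col))
{
    string name = data.Schema.GetColumnName(col);  // "Features"
    var type = data.Schema.GetColumnType(col);
}
```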
We have several places where the concepts of "input column names" and "output column names" are used:

```csharp
// simple-typed estimator ctor args:
EstimatorFactoryMethodFoo(string name, string source)

// tuple-typed ctor args:
EstimatorFactoryMethodFoo(params (string name, string source)[] columns)

// column info-typed ctor args:
EstimatorFactoryMethodFoo(Foo.ColumnBase[] columns)
// where the Column type has fields
class ColumnBase { string Name; string Source; }

// readers:
CreateTextReader(... TextLoader.Column[] columns)
// where the Column ctor is defined as
public Column(string name, DataKind? type, int index)

// and a few others such as command line args and entrypoints
```

If we want the variable/field naming consistent among all of those usages, we will have to stay with the original naming convention of {name, source}, at the expense of fluency in the case of simple-typed ctor usage. @KrzysztofCwalina @eerhardt @terrajobst - can you please weigh in?
In new .NET APIs (Span transformations) we use source/destination. I think I started to use those names (in the transformation APIs) because in many existing APIs where we actually use input/output (or in/out), the direction of data flow is actually not very intuitive. For example, you might be surprised to discover that Console.In is actually something that returns a TextReader that the caller reads (duh).
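For the record, the Console.In behavior being referenced:

```csharp
using System;
using System.IO;

class ConsoleInDemo
{
    static void Main()
    {
        // Despite the name, Console.In is something you *read from*:
        // it returns a TextReader over standard input.
        TextReader stdin = Console.In;
        string line = stdin.ReadLine();
        Console.WriteLine(line ?? "(no input)");
    }
}
```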
Obviously there are multiple topics being discussed in this issue. And, as far as I can tell, we probably won't be able to reach a consensus without some sort of real-time meeting. However, I wanted to weigh in with my opinions:
I don't agree that's where you'd logically start, or even how you'd want to read the code. Looking at Apache Spark's Estimators (which we've publicly said we modeled the principles and naming from), their samples/docs specify the input column, then the output column:

Scala:

```scala
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.001)
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))
```

Python:

```python
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
```

(Note they also refer to these things as "inputCol" and "outputCol".) In this case, this is exact prior art for the naming and ordering. This is the same scenario we are solving. Of course we shouldn't just make the decision based on what others are doing. But it does give more weight to an approach if others have already taken this path.

One other, minor, reason for having input-first, output-second:

```csharp
.Append(new Estimator1("Column1", "Column2"))
.Append(new Estimator2("Column2", "Column3"))
.Append(new Estimator3("Column3", "Column4"))
// etc.
```

See how "Column2" feeds straight into "Column2", and "Column3" into "Column3"? Whereas, if you flip it, it doesn't read as nicely. The "flow" is interrupted. Output-first, input-second:

```csharp
.Append(new Estimator1("Column2", "Column1"))
.Append(new Estimator2("Column3", "Column2"))
.Append(new Estimator3("Column4", "Column3"))
// etc.
```
@eerhardt, you've summarized, I think, the reasons why @Zruty0 thought that the arrangement might be a more intuitive one, which, again, I don't dispute; certainly people find it intuitive. Thank you for that. So, I'm going to explain as best I can how this conversation could become more useful. (Not with respect to you specifically.) I get the sense people are thinking about this radically differently from how I am. Perhaps the best way to start is to summarize how we got here. (This is according to my best understanding at this moment. As near as I can tell, the current state was reached by a single line item of a part of the proposal in #1318; which is to say, this inconsistency was not introduced accidentally, as I originally thought, but by a purposeful act.)
Fast forward a few months. The practical effects of putting that proposal into effect seem to result in a series of problems, which I've attempted to summarize both in terms of general trends and specific examples. I also have a specific proposal, which is, "hey, looks like this new way didn't work out; the old way existed for a reason and results in a more logically consistent API surface; let's go back to that." OK so far?

So, now we come to why I am not finding this discussion as productive as I might like: merely saying "I like the proposal in #1318" or "I like having inputs then outputs" or "it's like Spark" does not really help. That observation, correct or incorrect though it may be, by itself does nothing to help solve the problems with consistency that seem to arise out of that choice. Indeed, I have not seen that anyone is even paying attention to, or attempting to address, the problems with consistency that have been identified.

This is not to say that I think my points are unassailable or perfect. I'd be surprised if they were; I've been wrong too often to suppose that. But literally no one is bothering to dispute them. Or telling me anything like, "oh, yes, I see how it is problematic in this case... how about we solve it via the following scheme," and then laying out a specific proposal that assuages my concerns in any fashion. Instead, this is what is happening: we just rehash the suggestion in #1318 again and again and again in one way or another (by direct appeals to intuitiveness, through analogies to this or that other API), which by itself does not help.

Now then, intuition and analogy are great supporting arguments to a specific proposal, but no one is giving a specific proposal except "the original idea was perfect," and I already know it is not. I'm being invited repeatedly to look at how pretty it is when there's one input and one output. Yes, that's great, but can we move to the next level of the discussion, please? Or, if unwilling or unable to do so, at least please acknowledge that maybe there's a deeper issue being discussed than merely that? Please? To give just one example, it's been about a week since I asked for a clear definition of what is a "parameter" vs. an "input." Still haven't seen anything. But this is not unique. We also see above, again, yet another iteration of the implicit assumption that estimators are things that map columns to columns, which spares, as usual, not one word for those cases where they don't, what specifically to do about them, the problems arising from the current schemes, etc.

So here's what I'd like. Many estimators can be viewed in this way. What do we do about the ones that can't? Can we come up with a logically consistent scheme for them? What about not just estimators specifically, but other ways of "creating" columns? If you can do so in a way that addresses the problems, do so. I literally cannot do anything without that. Further, try to argue against my proposal. If we have a proposal that treats estimators as logically consistent not just with other estimators, but with other ways of creating columns in an IDataView, great.

Again, I'm not saying these questions are not answerable. But no one is bothering to answer them. So I am, again, not finding this discussion productive. So, displeased though you may be, look at it from my perspective: I have to solve a practical problem that seems to arise out of a proposal that was acted upon.
I have a plan to do so, and I have attempted in other posts to explain why it is appealing to me. I am sympathetic to the idea that some people like the original proposal, but that sympathy does not draw me any closer to a solution with logical consistency. By itself it just makes my job that much more unpleasant.
There is literally a wall of text in this issue, and trying to find the succinct reasons why the output column name should come first takes some effort.

Are there others? I can maybe be convinced of argument (1) above as being a valid reason for the order. But I don't see any valid reasons for naming the parameters `name` and `source`.
Because elsewhere, when naming new columns, that thing is called its name, and that argument is always first, as far as I see. It's not clear to me that estimators are exceptional, and I see by direct observation that treating them as if they are exceptional leads to problems. Regarding the wall of text that takes 30 or 40 minutes to discern: when it comes to the consistency of one hundred components, sometimes appreciating the issue in detail is required to hold a conversation that can go anywhere. Also, we again see the problem that I am having with this discussion: dissatisfaction is expressed, but no alternative is being proposed. What am I supposed to do with that?
I'm trying to understand all the issues being raised before proposing an alternative.
I would expect someone to read and understand the feedback, and think about any merits the feedback has.
I really want to understand why you are so insistent that the name of the output/destination column parameter be simply `name`. I totally understand when you have a class/struct that is named `Column`:

machinelearning/src/Microsoft.ML.Data/Transforms/ColumnBindingsBase.cs
Lines 122 to 128 in a570da1

This makes sense because of the context in which the properties are being used - they are in a Column class. The usage of the word `Name` there is unambiguous. However, I am not convinced that it is obvious, or even intuitive, that when I am creating an Estimator and asked to provide a `name`, it refers to the output column. Is the only reason you are rejecting more explicit parameter names that they would be inconsistent with the rest of the codebase?
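To illustrate the worry at a call site, a hypothetical factory (`Foo` is not a real estimator) using the {name, source} convention:

```csharp
// Nothing here tells a newcomer that "name" is the *output* column.
var est = mlContext.Transforms.Foo(name: "EducationEncoded", source: "Education");
```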
OK, I guess I need to weigh in as well. First of all, definitions:
@TomFinley is right: it was my suggestion to change to the current scheme (input first, then output). The majority case is that we have 'one-to-one column mapping estimators'. So, in the ideal world, we would have the only factory method be:

```csharp
mlContext.FooEstimator(string inputColumn, string outputColumn = null, float param1 = 0, string param2 = null);
```

This is very similar to common .NET APIs. Even in this simple case, there is complexity:
So, for regular rank-and-file 'one-to-one mapping', we already have a matrix of possible factory methods:
In hindsight, I think this 4-way matrix was not necessary. I would suggest trimming down to only 2 options:
Arguably, maybe we can take it even further? We could declare that the ONLY special treatment applies to a SINGLE column pair with NO extra params, as sketched below:
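A hedged sketch of that carve-out, with a hypothetical `Foo` estimator (names illustrative, not a shipped API):

```csharp
// 1. The single special case: one column pair, no extra parameterization.
var simple = mlContext.Transforms.Foo("inputColumn", "outputColumn");

// 2. Everything else backs off to the column-object form.
var general = mlContext.Transforms.Foo(
    new FooEstimator.ColumnInfo("inputColumn", "outputColumn", param1: 0.5f));
```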
In any case, let us move from the simple majority case to the exceptions.
```csharp
mlContext.BinaryClassification.Trainers.Foo(string featureColumn = "Features", string labelColumn = "Label", string weightColumn = null, float customParam1 = 0, string customParam2 = null)
```

I would argue that in this case we can leave things be. Trainers tend to be both unique and fairly uniform in their column mapping, and they also all reside in their own catalogs (like `BinaryClassification.Trainers` above).
In this case, it looks hard to define a signature consistent with the above (where we specify inputs, then outputs, then other params). However, we can still do it:

```csharp
mlContext.ManyToOneEstimator(string[] inputColumns, string outputColumn, extraParams);
```

Not exactly identical to the above, but the pattern is the same. More than one input column: good. Non-nullable output: also good (since we no longer assume that the output column name defaults to the input name). Again, in case of any multiple-mapping scenarios, we can back off to the 'standard' column-object approach:

```csharp
mlContext.ManyToOneEstimator(params ColumnInfo[] columns);
```
I believe it's tricky to have a consistent interface for all these cases, but I think we can strive to adopt some form of unification. Again, when applicable, we can back off to the good old column-object approach.
So, the TL;DR version of the above is: let's put all inputs first, then all outputs, then all other params. Now let me go over Tom's examples and try to come up with a unified conversion.

Simple 1

```csharp
public static HashingEstimator Hash(this TransformsCatalog.ConversionTransforms catalog, string inputColumn, string outputColumn = null,
    int hashBits = HashDefaults.HashBits, int invertHash = HashDefaults.InvertHash)
```

Simple 2

```csharp
public static KeyToValueMappingEstimator MapKeyToValue(this TransformsCatalog.ConversionTransforms catalog, string inputColumn)
```

This looks just like a bug to me. It should have the output column:

```csharp
public static KeyToValueMappingEstimator MapKeyToValue(this TransformsCatalog.ConversionTransforms catalog, string inputColumn, string outputColumn = null)
```

Complex 1

```csharp
public ColumnInfo(string input, string output = null,
    int maxNumTerms = ValueToKeyMappingEstimator.Defaults.MaxNumTerms,
    SortOrder sort = ValueToKeyMappingEstimator.Defaults.Sort,
    string[] term = null,
    bool textKeyValues = false)
```

Complex 2

```csharp
public static ValueMappingEstimator<TInputType, TOutputType> ValueMap<TInputType, TOutputType>(
    this TransformsCatalog.ConversionTransforms catalog,
    IEnumerable<TInputType> keys,
    IEnumerable<TOutputType> values,
    params (string source, string name)[] columns)
```

This looks like a real oddball: parameters before everything else. I am not sure what the thinking was behind it.

```csharp
public static ValueMappingEstimator<TInputType, TOutputType> ValueMap<TInputType, TOutputType>(
    this TransformsCatalog.ConversionTransforms catalog,
    string inputColumn,
    string outputColumn,
    IEnumerable<TInputType> keys,
    IEnumerable<TOutputType> values);

public static ValueMappingEstimator<TInputType, TOutputType> ValueMap<TInputType, TOutputType>(
    this TransformsCatalog.ConversionTransforms catalog,
    params MappingColumnInfo<TInputType, TOutputType>[] columns); // for multi-pair mapping
```

Here, we are not 100% consistent (the output column is not nullable, although we'd like it to be). This is because the mapping parameters are required; there is no default behavior. I think this is a small price to pay for the streamlined order of 'inputs, outputs, params'.

Simple 3

```csharp
public static ColumnCopyingEstimator CopyColumns(this TransformsCatalog catalog, string inputColumn, string outputColumn)
```

Text

```csharp
public static TextFeaturizingEstimator FeaturizeText(this TransformsCatalog.TextTransforms catalog,
    IEnumerable<string> inputColumns,
    string outputColumn,
    Action<TextFeaturizingEstimator.Settings> advancedSettings = null)
```

Keep, Drop

```csharp
Keep(params string[] outputColumns);
Keep(IEnumerable<string> outputColumns, bool keepHidden);
Drop(params string[] inputColumns);
Drop(IEnumerable<string> inputColumns, bool keepHidden);
```

Concat

```csharp
Concat(IEnumerable<string> inputColumns, string outputColumn);
Concat(params ColumnInfo[] columns);
```
Reading through the rest of the comments.
Moving my old comment from the PR to the issue, so all discussion is in one place, and adding a note about defaults.

1- I also prefer sourceColumn, destinationColumn (anything but the word "name" for the generated column). "name" as an API parameter is not meaningful. I agree that we are not going to be consistent with our command-line usage and entry-point usage, but I think that is worth the user experience.

2- After having worked through the changes adapting the new proposal, my biggest problem is that we are assigning default values to the source column, rather than making it mandatory.

On the API with the new signature: the users need to know that there exists an input column with the name "Induced", and that if they assign that string to the output/name column param, everything will be fine.
Let's stop right there, because this is the root of where we've gone down different paths, and if we can't agree on this, we can't agree on anything. This is the root of why I consider your scheme dead wrong. You gloss over it, but this is the point of why we decided a while ago to go in the opposite direction, back when @zeahmed was writing the convenience constructors for transforms.

Look at this discussion. People have the idea of "input" and "output" so baked into their heads that when I asked them to consider alternate cases where this wasn't quite so simple, most of them simply could not continue the discussion. How many weeks passed before someone besides me even bothered to mention this other case? That's how basic and fundamental this idea is, that people didn't even think about alternates.

With that in mind, let's now think about what you've just said: you've observed, as I did, that as soon as this "other required arguments" thing pops up, the scheme falls to pieces. We're talking about inconsistency in the most basic part of estimators. And your response to such a fundamental inconsistency is "small price to pay"? Really? No, no, no... that is a huge price to pay. This is literally the most basic thing we have to get right. My premise is that we must be consistent there, and if our scheme does not allow us to be consistent, then it is discarded. Hence this issue, where I tried to undo all this, since I view it as a gigantic blunder.

But, you know what, that's all fine. Since everyone agrees this pattern is great, maybe I'm wrong. I disagree and think you're committing an error that will haunt this API, but it is simply impossible for me to continue this discussion, however strongly I think you're all making a huge blunder. So, do what you like here. Just, some of you, if you're at least willing to consider the possibility that this inconsistency represents a fatal flaw... for your sake at least, I'll retrace the steps that I took with @zeahmed this past summer.

If you continue on this path, what you'll, I hope, realize is that it is useless to talk about "input" then "output" parameters and say one necessarily comes first and the other second; due to the basic syntax of C#, there are only required and unrequired parameters. If we want either one of the "input" or "output" [sic] column names to be non-required, then you will find, based on an examination of actual cases, that it is the "inputs" that tend to be most polymorphic w.r.t. their behavior (in the sense that they are sometimes absent, optional, or whatever), and so most naturally group among the unrequired. Then, among the required, you will arrive at the scheme of the "output" column name being first, followed by any other required parameters (if any). But, you haven't, and that's fine. I'm only one person and I can't fight every thing.
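A hedged sketch of the signature shape that reasoning arrives at; `Foo` and its parameters are illustrative, not a shipped API:

```csharp
// Hypothetical estimator factory; parameter names follow the {name, source} convention.
public static FooEstimator Foo(this TransformsCatalog catalog,
    string name,           // required: the column being created, always well defined
    string source = null,  // optional: defaults to 'name' in the one-to-one case
    int someParam = 42)    // other optional parameterization
```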
Regarding the ordering of inputs and outputs, I believe there is other important related work to consider: SQL. There, the output is specified before the input. @interesaaat could be in a position to (dis)prove this perspective.
In general, the proposal of having the output column be nullable is a bit strange to me. Instead of having a pattern:
What if we split these methods into two overloads:
If we have a 1-1 transform, and we want the ease of just specifying a single column name rather than the same name for both the input and output columns, then use overloads instead of nullable/default parameters for one of the columns. Would that make everything consistent? A sketch of the idea follows below.
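Concretely, the two-overload idea might look like this (hypothetical names, not a shipped API):

```csharp
// 1-1 convenience: read and replace the same column.
public static FooEstimator Foo(this TransformsCatalog catalog, string columnName);

// Explicit pair: nothing nullable or defaulted.
public static FooEstimator Foo(this TransformsCatalog catalog, string inputColumn, string outputColumn);
```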
Replying to Markus: you are right, the output goes first, and I think that this is because it eases composability. SQL queries (as well as Datalog, for that matter) are in fact composable, and it is much easier to get the output schema if it is specified first than if it is instead baked into the query (or last). Nevertheless, when you write queries, you always start from the input schema (it is difficult to derive the output schema without knowing what you get as input). So in practice, the output schema goes first because it is easier to use the query afterwards, but when you write it, you first look at the input schema. This being said, output first and input later looks odd to me in a "functional API", and since apparently this is true for the majority of people, if I can put in my two cents, I would rather go with an inconsistent API (my understanding, and I really tried to read everything but I failed, is that the bigger inconsistency is that the output is sometimes not nullable?) than with something that people have problems understanding/using.
I think that falls apart for reasons of confusion, and maybe consistency, once we consider that some of these parameters are very often strings themselves. That is, your two overloads can collide. So, without checking, let me ask you a question about .NET. We will use a hypothetical simple example to guide our thinking.

```csharp
void Foo(string input, string optionalInput = null);
void Foo(string input, string output, string optionalInput = null);
```

Then imagine I have some other code further down that calls this method:

```csharp
Foo("hello", "friend");
```

I'll tell you frankly, I don't know what will happen here. We might say I'm just ignorant, but I've been using .NET professionally for a decade now. I know a fair amount about its behavior, but this just seems confusing to me. What does the C# compiler do here? Does it complain? Does it favor the second one, because that one is the one with two required arguments and I provided two arguments?

If it prefers the second, let us imagine a user. They write this:

```csharp
Foo("hello");
```

They read the documentation. They see the first overload. Then they say, "hey, this optional parameter does something I want, and it is optional." They naturally change it to be this:

```csharp
Foo("hello", "friend");
```

But lurking in the shadows is this overload you made me add. To be honest, I have absolutely no idea what C# does in the face of a situation like this. (I am about to find out now, but I wanted to write the above to make the point that I, a 10-year C# veteran working with it day in day out, have absolutely no idea what will happen in this hypothetical API.) So let me find out...

```csharp
public static class Program
{
    public interface Bubba
    {
        void Foo(string input, string optionalInput = null);
        void Foo(string input, string output, string optionalInput = null);
    }

    private static void Main(string[] args)
    {
        Bubba bubs = default;
        bubs.Foo("hello", "friend");
    }
}
```

OK. Apparently, against all my guesses, it did not complain, but it also did not prefer the second; it preferred the first. That, I'd argue, is even worse from this specific perspective. Let's imagine I am trying to map an input to an output column; it's not even clear to me at that point how I'd write that code. (I mean, I guess I could be explicit with named arguments, but that hardly helps the casual reader.)

But, I'd argue that your proposal has, in this common case, led to a worse, not a better, world. So I'm afraid I am not enthusiastic about this "overload" idea. (There's also the secondary point about it entailing a lot of work, etc., which we ought not to be blind to, but that's a secondary point.)
I have a different view. Adjustment of initial prejudices to "get" a consistent API, even if this hypothetical person initially has trouble "getting" it, is a cost paid once. Being inconsistent is a cost paid forever. I have a clear preference in this case. But, if all the .NET people are screaming at me that I'm wrong, they know API design better than I do, so... whatevs. Do what you want.
After discussing this with @Ivanidzo4ka, @TomFinley, @eerhardt, and @markusweimer, the decision is to go with what this issue's title proposes: estimator arguments take the output column name as the first parameter, with any input column names as subsequent parameters.
@stephentoub @Zruty0 @glebuk @zeahmed @justinormont @KrzysztofCwalina @interesaaat thank you all for your input. cc @terrajobst in case he wants to discuss it further in the API review meetings.

Fixed with #2239
So, a somewhat embarrassing blunder to report... not quite sure how to say this.
Throughout our codebase, for years we've always specified the name of the output column of a transform first, then the source(s).
That's good. But when estimators were introduced, somehow, nearly all of them were introduced in the reverse order: nearly all of them specify the inputs, then the outputs. This was probably an unconscious mistake, but it's one with fairly wide consequences, since that mistaken pattern was copied again and again as people did the estimator conversion work, to the point where most (not all!) estimators are now inconsistent with the whole rest of the codebase!
This should be corrected: it is at the very least inconsistent, and even if it were not, it would still be rather obnoxious practically, because specifying the name of the output first has practical benefits and makes a lot more sense: if you're specifying a transformation, the most important information someone will want to know is what you're calling your output!
The Story Until a Few Months Ago
So, throughout our codebase, for years, it has been our practice that when specifying a transform, you specify the name of the output, then you specify the name of the input(s) (if any). The reason for this is practical: the outputs are usually the result of a single well-defined calculation (the application of the transform), whereas what is taken as an "input" to a transform can have various meanings, since a transform is more often multivariate in its inputs than in its outputs. (E.g., concatenate can have multiple inputs, text can have multiple inputs, MI feature selection takes multiple heterogeneous inputs, but all produce single output columns.)
This was our practice certainly when this code was a tool, as we see in the command line help string, specifying name then source.
machinelearning/src/Microsoft.ML.Data/Transforms/ColumnCopying.cs
Lines 109 to 110 in bb46fdf
This trend continued during our initial attempts at an API, as seen in this early discussion in PR 405, and as seen here:
https://github.com/zeahmed/machinelearning/blob/16f7883933b56f8fd86077bf0fd262b24374e9d0/src/Microsoft.ML.Data/Transforms/ConvertTransform.cs#L116
and here
https://github.com/zeahmed/machinelearning/blob/16f7883933b56f8fd86077bf0fd262b24374e9d0/src/Microsoft.ML.Data/Transforms/DropSlotsTransform.cs#L226
and numerous other places.
This is done for the practical reason that, when a transform produces an output, what outputs it has are usually finite and well defined, whereas it can take multiple inputs. The most conspicuous and widely used example of this is the concatenation transform. Also included are things like the text featurizing transform, and other such things; a sketch of the pattern follows below.
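The shape here is from memory, not the exact shipped signature: the one well-defined output name comes first, then however many inputs follow.

```csharp
// Single output name first, then a variable number of input column names.
var concat = new ColumnConcatenatingEstimator(env, "Features", "Age", "Parity", "Induced");
```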
So far so good...
But Then...
Now, somehow, through some mechanism that wasn't quite clear to me, as `IEstimator` implementations are being created, someone commits the fateful mistake of reversing inputs and outputs. Suddenly, instead of being typically parameterized as `name` and a sometimes-optional `source`, we instead have the required `input` and a sometimes-optional `output`. What a disaster! And, I did not catch it in review. An early example:
machinelearning/src/Microsoft.ML.Transforms/Text/TextTransform.cs
Lines 278 to 279 in c8de311
Then, as people use this as a template for their own estimator conversion work, nearly all estimators copied this mistake, until practically all estimators had this problem. This includes much of the extensions.
machinelearning/src/Microsoft.ML.Data/Transforms/ExtensionsCatalog.cs
Lines 30 to 31 in d9270c9
Now then, the reason why I know any of this is that @stephentoub wrote in and said, "hey, how come you have your column concatenation operation specify output then inputs? That's different from everywhere else! I know it's less convenient, but it's better to be consistent."
machinelearning/src/Microsoft.ML.Data/Transforms/ExtensionsCatalog.cs
Line 39 in d9270c9
And, after thinking that he must be very confused, since, again, the pattern of `name` then `source` is very, very well defined throughout the codebase, I looked and found he is absolutely correct: a thoroughly massive blunder had somehow made it throughout the entire codebase under our very noses, including, it must be said, mine. :) :)
So, this is bad, but happily this was caught before v1. But, needless to say this must be fixed immediately.