Estimator arguments should take output column name as first parameter, any inputs as subsequent parameters #2064
Semi-related: We may also want to take a pass at ensuring the parameter names are consistent, as I expect this to be difficult to correct after v1.0. I assume the work would be creating a large spreadsheet with rows of components and columns of parameter names (and ensuring we don't create a new column needlessly).
I would like to disagree (with the idea of output first, inputs later).

I tend to believe the reason behind name -> source is the command line, which forces you to express yourself in a kind of whiteboard way, so it's somewhat natural to write y = func(x1, x2, x3), where you specify y first and then you specify the rest. For a programmatic API in C#, I find the idea of writing the output parameter first, and then everything else, bizarre and against nature. I look at transformers as functions on top of data.
Hi @Ivanidzo4ka! Thanks for your feedback. I am not sure exactly what you mean. Are we talking about the same language? Indeed, let's take your precise example. The signature of that is very clear: `public delegate TResult Func<in T1, in T2, in T3, in T4, in T5, out TResult>(T1 arg1, T2 arg2, T3 arg3, T4 arg4, T5 arg5);`. But when such a function is actually used, I see `int a = b + 2;`, with the result named first, not last.

Indeed, most languages I can think of put the name first when assigning variables (F# with its `|>` pipelining being a rare counterexample). This is of course assuming I accept the idea that we have to be analogous to a programming language, which I'm not sure I do; but if I did, it certainly appears at first glance to be utterly disastrous to your case. But in any case, generally I think the most important information should go first. What I want to know out of every transform is "what are you making, and how are you making it," not "how are you making something, and what is it called?" This is in addition to the advantages already enumerated originally, the consistency argument, etc. There is also the fact that, when transforms take multiple inputs, many times those auxiliary helper inputs are optional (so should have default values), which pushes them toward the end of the parameter list in C# anyway; see the illustration below.
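To make that last constraint concrete, a minimal illustration of the C# rule itself (not ML.NET code; the names are invented for the example):

```csharp
// An optional parameter cannot be followed by a required one, so an
// "inputs (some optional), then output" ordering does not even compile:
class Illustration
{
    static void MakeColumn(string input, string auxInput = null, string output) { }
    // error CS1737: Optional parameters must appear after all required parameters
}
```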
@justinormont that's a good point actually, even for this. (I mean, you're right in the broader sense, but I'm just going to stay on topic.) Not only did we screw up the ordering of the column names, we even screwed up their names. For years, even with one-to-one transforms, these have been named `name` and `source`. We have changed this to `input` and `output`. So in addition to changing the order, these parameters should be named, as they are literally everywhere else, `name` and `source`.
But this is my point. It's not always about a "source" and a "destination." There is almost always a destination, but the "source" is sometimes indeterminate, multiple, etc. We did not settle on this convention by accident. You'll note also that those APIs are all about filling in values in an already allocated buffer. This is not that. It's more like, "we're declaring this new slot with such-and-such a name, and here is how it gets filled."
This is why I tried to bring up Func as an example, but I guess I wasn't clear.
Thank you @Ivanidzo4ka for trying to explain to me again. I think I do understand the example; I am trying to explain gently that it points in a different direction than you suppose. I'll go over a separate point, distinct from my original one, in case it helps illustrate why. Let's actually look at a few of these extension methods, in particular the conversion catalog extensions. Since this involves an enumeration and examination, this response will be somewhat long, I'm sorry to say. I titled the examples so they're easy to browse, and put my conclusion at the thing titled "Conclusion" more towards the end.

Simple Example 1

The first one I see is
machinelearning/src/Microsoft.ML.Data/Transforms/ConversionsExtensionsCatalog.cs Lines 30 to 31 in c8755a2
I do not see the input parameters, followed by an output. Instead, I see one input parameter, followed by the output, followed by two more input parameters.

Simple Example 2

machinelearning/src/Microsoft.ML.Data/Transforms/ConversionsExtensionsCatalog.cs Lines 66 to 67 in c8755a2
One input parameter, the output parameter, followed by one more input parameter.

Complex Example 1

Actually, now that I'm looking, there are a few more here that are worthy of inspection in this file; despite not directly fitting my pattern, they indirectly do if we look at them. They also indirectly illustrate the practical reasons why we arrived, years ago, at the pattern that we did. These are, if anything, even more confused, since they have neither input nor output first! This is one. You will see we have the column definition first, which happens to be a `ColumnInfo`:
machinelearning/src/Microsoft.ML.Data/Transforms/ValueToKeyMappingTransformer.cs Lines 184 to 188 in c8755a2
So this again has the pattern of input, followed by output, followed by four more inputs. The same confused mixing. But let's return again to the originally linked code... we have more parameters. What are they? The name of the input file, followed by the name of the input column from that file, followed by how to load this file. So we have still more inputs beyond the nominal input column. Not only are we drawing an artificial distinction between the input column and other inputs to the estimator, we are not even being consistent with the input columns themselves.
machinelearning/src/Microsoft.ML.Data/Transforms/ConversionsExtensionsCatalog.cs Lines 122 to 124 in c8755a2
Complex Example 2

machinelearning/src/Microsoft.ML.Data/Transforms/ConversionsExtensionsCatalog.cs Lines 140 to 144 in c8755a2
Here we've arrived at something else. This one actually fulfills the pattern you claimed you wanted, in that the output is actually specified last!! Let me ask you this. Do you like it? 😄 Is it consistent with anything? Would we want anything to be consistent with it?

Simple Example 3
This one actually fits your pattern, but only by virtue of having no parameterization except the name of the input, which is far from typical. There are a few more in here, but they're somewhat similar and confused (keep and drop, which are kind of special in the sense that they describe only inputs, or only outputs, depending on how you look at it; also concatenate, which is one of the few conversions that actually wound up being correct, and which sparked this conversation).

Simple Examples Batch 4

Next up, normalizers. Only one here that directly fulfills the pattern.
machinelearning/src/Microsoft.ML.Data/Transforms/NormalizerCatalog.cs Lines 30 to 32 in c8755a2
Same confused pattern. Input, output, input. machinelearning/src/Microsoft.ML.Transforms/Text/TextCatalog.cs Lines 45 to 48 in c8755a2
Ooo. Inputs (as an `IEnumerable` this time):
machinelearning/src/Microsoft.ML.Transforms/Text/TextCatalog.cs Lines 59 to 62 in c8755a2
Same confused pattern. Input, output, input.

Conclusion

I hope you don't mind; I had intended to keep going, but it's the kids' bedtime, I think by this point it's getting fairly redundant, and I'm starting to belabor the point. So this is what you said:
That's fine. But if I were to actually put the destination last, past all other things we could reasonably call inputs, then the result would not be what we really want, since many of these "functional inputs" have default values, etc. You seem to favor neither approach. Instead, what you favor, to judge from your vigorous defense, is something chimerical.
As we see in the original text of the issue, this is somewhat reversed. This new pattern, introduced a few months ago, is inconsistent with our longstanding practices. The mail is informative in pointing out the discrepancy and other issues (which we must fix), but we must decide how. And sometimes (in fact, always) specific suggestions made on the spur of the moment have to be carefully considered to determine whether they are in fact wise. That this blunder got repeated so many times that it somehow got established as "the way" by some people is unfortunate, but, again, it must be fixed. I hope it's obvious that I really value your input @Ivanidzo4ka and @stephentoub. You are enormously helpful. I generally find myself in agreement with you on most things, and I hope one disagreement doesn't put you off. But, unpleasant as I find it to do so, I can't agree in this case. I'm sorry.
If you look at all those cases: I've checked the examples you linked, and all of them follow the input-then-output order. I've looked into the links provided by @stephentoub and they either use the same pattern or something equivalent to it.

I was just trying to point out that I don't see a pattern where people specify the output result first.

Again, let me disagree: we don't have longstanding practices of developing a good public API. I agree we should have a consistent API; I just disagree with the "output first, everything else later" approach.

I hope I correctly depicted your main reasoning behind having the output column first and the input columns second: you find it more practical, and this is why you want the output to be first. So basically we are in a holy-war state, little-endians vs. big-endians.
So in your first version of the argument there are "inputs" and "outputs." Then I point out it is not so. Now there are "inputs" and "outputs" and "parameters." So let me ask, what is an "input?" What is a "parameter?" Can you define it for me, precisely and clearly in a way I can apply and explain in all cases? I repeat again the lesson taught in my prior "Complex Example 1"... machinelearning/src/Microsoft.ML.Data/Transforms/ConversionsExtensionsCatalog.cs Lines 122 to 126 in c8755a2
So, in this case, the "inputs" vs. "parameters," I guess, are that the things defining the terms, with their file and column names, are "parameters." How is this decided? What is the correct split, then? Is the file name an "input" or a "parameter"? You see, I hope, why I don't find this idea of "inputs" and "parameters" convincing. It is true certainly that sometimes the API must differ from the tool that's been around for years. I have argued this myself when it makes sense, on many occasions. An API is a different environment after all; some problems have different natural solutions given that we're not writing a GUI or command line tool. But usually, you'll note that when we have a disparity, we actually provide reasons. In this case, it is unclear to me that the reasons why we did things the way we've done for years have changed. Certainly I do not find merely mentioning that we sometimes do it sufficient justification. Also, not to put too fine a point on it, but I did explicitly already talk about how futile it is to try to draw a sharp line between "some inputs go before the result, some inputs go after."
Putting stuff into an already established (and, yes, named) buffer was not an analogy I found compelling. It doesn't reflect what's going on at all. Far closer is creating something new and giving it a name. In such APIs, I almost always see that the name of the created thing comes before anything related to its value or parameterization. So: you're not providing an output. The output does not exist yet. You're creating it, right then. You're naming it. In any such API I can find where you're naming something and then providing how it is defined, it is usually the name of the new item that comes first, then the content. I mean, even in the most basic data structures. Most people call them "key-value" and not "value-key" stores for a reason. Right? Plus, of course, I already pointed out that every C-derived language puts the name before the value as part of its basic syntax; a small illustration follows below. (Please, I know that I write somewhat voluminously, but every last thing you're bringing up I've already covered. I know, I know, "tl;dr", but please do. I'm spending so much time on this because I take it seriously.)
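A minimal, self-contained illustration of those two name-first conventions (plain C#, nothing ML.NET-specific):

```csharp
using System.Collections.Generic;

class NameFirstConventions
{
    static void Demo()
    {
        // C-derived declaration syntax: the new name comes first, its value second.
        int count = 2 + 3;

        // Key-value stores: the key (the "name") comes first, the value second.
        var ages = new Dictionary<string, int>
        {
            ["Alice"] = 34,
        };
    }
}
```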
First you try to avoid the conflict in the first place, by trying to reach consensus. This happens, of course, 99% of the time. In the remaining 1%, where someone is insistent in the face of all argument, ultimately, in this case as in all cases, we must make a single decision, since the code cannot be in superposition. A choice must be made. So, I try to be as gentle and polite as I can, explaining the decision, and continue to try to reach consensus. (Again, I invite you to read my arguments.) If it still doesn't work, well... we obviously can't continue forever, so the hard decision is made. We've done this on some other occasions: with the entry-point based API, the "predicting model" of the v0.1 API, the idea that mutable estimators are a good idea. This one thing isn't the first, and I'm sure won't be the last. I always find it deeply unpleasant, but, what can you do.
Leaving the order of arguments aside, let's discuss column argument names. I've always found the current default of source/name to be quite confusing. Let's look at the current state:

```csharp
ColumnCopyingEstimator(string input, string output)
ColumnCopyingEstimator(params (string source, string name)[] columns)

NormalizingEstimator Normalize(... params (string input, string output)[] columns)
NormalizingEstimator Normalize(... params NormalizingEstimator.ColumnBase[] columns)
/* where the ctor is: */ ColumnBase(string input, string output)
```

We have at least three ways to declare inputs and outputs: via strings, via tuples, and via Column classes. We should come up with a solution that is readable and consistent across all those usages. However, what are the alternatives? Thoughts?
You will note that all of these things are, each last one of them, estimators. The fact that estimators diverged from our already established pattern is the subject of this issue. We already have a solution ready to hand: make this new estimator surface rejoin the pattern everything else already follows.
Again, this discussion concerns machinelearning/src/Microsoft.ML.Core/Data/IEstimator.cs Lines 300 to 313 in 1b6b3c3
You will note two methods: one for schema propagation prior to fitting, and one for actually fitting. One takes the input's schema (or more precisely, the more provisional parallel structure, a "schema shape") to perform preliminary schema validation and propagation; that argument is named `inputSchema`. The other takes the actual data to fit on, and that one is named, simply, `input`. I mean, it's right there in the interface signature of the structure that is the subject of this discussion, right?
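A paraphrased sketch of the interface under discussion; the member shapes follow the description above rather than the exact source text:

```csharp
public interface IEstimator<out TTransformer>
    where TTransformer : ITransformer
{
    // Preliminary schema validation/propagation, prior to fitting.
    SchemaShape GetOutputSchema(SchemaShape inputSchema);

    // Actually fitting: takes the input data itself.
    TTransformer Fit(IDataView input);
}
```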
Am I to understand that 3 is your first choice and 1 is your most disfavored choice? You would favor renaming what we currently call "name" to "destination." In fact, that one change alone would shift your opinion from liking the solution we are pursuing least to liking it best. In fact, that would be absolutely perfectly clear in your head.

Let's be clear. These estimators are, in some sense, creating new columns. Everywhere* that we talk about columns, what is the primary identifier for a column? Always "name." In our schema types and loader column definitions alike, it is "name." Let's not even talk about our code, but all similar code like SQL and data frames and Parquet files. What do they call the string identifier they use to reference a column? The column's name. Not only are we internally consistent, but we are consistent with longstanding usage of this sort of structure. But suddenly, now, after years of calling this thing "name" everywhere, we have suddenly had an epiphany that in estimators specifically (and only there, apparently) calling it "name" will no longer do?

You also asked for thoughts, but I will express them privately.
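To ground the "name is the primary identifier" point, a from-memory sketch of the 0.x-era schema lookups (exact member names may differ across versions):

```csharp
// Wherever a column is referenced by string, that string is called its *name*.
if (data.Schema.TryGetColumnIndex("Features", out int col))
{
    string name = data.Schema.GetColumnName(col);  // "Features"
    var type = data.Schema.GetColumnType(col);
}
```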
We have several places where the concepts of "input column names" and "output column names" are used:

```csharp
// simple-typed estimator ctor args:
EstimatorFactoryMethodFoo(string name, string source)

// tuple-typed ctor args:
EstimatorFactoryMethodFoo(params (string name, string source)[] columns)

// column info-typed ctor args:
EstimatorFactoryMethodFoo(Foo.ColumnBase[] columns)
// where the Column type has fields
class ColumnBase { string Name; string Source; }

// readers:
CreateTextReader(... TextLoader.Column[] columns)
// where the Column ctor is defined as
public Column(string name, DataKind? type, int index)

// and a few others such as command line args and entrypoints
```

If we want the variable/field naming consistent among all of those usages, we will have to stay with the original naming convention of {name, source}, at the expense of fluency in the case of simple-typed ctor usage. @KrzysztofCwalina @eerhardt @terrajobst - can you please weigh in?
In new .NET APIs (Span transformations) we use source/destination. I think I started to use those names (in the transformation APIs) because in many existing APIs where we actually use input/output (or in/out), the direction of data flow is actually not very intuitive. For example, you might be surprised to discover that Console.In is actually something that returns a TextReader that the caller reads (duh).
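For the record, the Console.In behavior being referenced:

```csharp
using System;
using System.IO;

class ConsoleInDemo
{
    static void Main()
    {
        // Despite the name, Console.In is something you *read from*:
        // it returns a TextReader over standard input.
        TextReader stdin = Console.In;
        string line = stdin.ReadLine();
        Console.WriteLine(line ?? "(no input)");
    }
}
```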
Obviously there are multiple topics being discussed in this issue. And, as far as I can tell, we probably won't be able to reach a consensus without some sort of real-time meeting. However, I wanted to weigh in with my opinions:
I don't agree that's where you'd logically start, or even how you'd want to read the code. Looking at Apache Spark's Estimators (which we've publicly said we modeled the principles and naming from), their samples/docs specify the input column, then the output column:

Scala:

```scala
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.001)
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))
```

Python:

```python
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
```

(Note they also refer to these things as "inputCol" and "outputCol".) In this case, this is exact prior art for the naming and ordering. This is the same scenario we are solving. Of course we shouldn't just make the decision based on what others are doing. But it does give more weight to an approach if others have already taken this path.

One other, minor, reason for having input-first, output-second:

```csharp
.Append(new Estimator1("Column1", "Column2"))
.Append(new Estimator2("Column2", "Column3"))
.Append(new Estimator3("Column3", "Column4"))
// etc.
```

See how "Column2" feeds straight into "Column2", and "Column3" into "Column3"? Whereas, if you flip it, it doesn't read as nicely. The "flow" is interrupted. Output-first, input-second:

```csharp
.Append(new Estimator1("Column2", "Column1"))
.Append(new Estimator2("Column3", "Column2"))
.Append(new Estimator3("Column4", "Column3"))
// etc.
```
@eerhardt, you've summarized, I think, the reasons why @Zruty0 thought that the arrangement might be a more intuitive one, which, again, I don't dispute; certainly people find it intuitive. Thank you for that. So, I'm going to explain as best I can how this conversation could become more useful. (Not with respect to you specifically.) I get the sense people are thinking about this radically differently from how I am. Perhaps the best way to start is to summarize how we got here. (This is according to my best understanding at this moment. As near as I can tell, the current state was reached by a single line item of a part of the proposal in #1318; which is to say, this inconsistency was not introduced accidentally, as I originally thought, but by a purposeful act.)
Fast forward a few months. The practical effects of putting that proposal into effect seem to result in a series of problems, which I've attempted to summarize both in terms of general trends and specific examples. I also have a specific proposal, which is, "hey, looks like this new way didn't work out; the old way existed for a reason and results in a more logically consistent API surface; let's go back to that." OK so far?

So, now we come to why I am not finding this discussion as productive as I might like: merely saying "I like the proposal in #1318" or "I like having inputs then outputs" or "it's like Spark" does not really help. That observation, correct or incorrect though it may be, by itself does nothing to help solve the problems with consistency that seem to arise out of that choice. Indeed, I have not seen that anyone is even paying attention to, or attempting to address, the problems with consistency that have been identified.

This is not to say that I think my points are unassailable or perfect. I'd be surprised if they were; I've been wrong too often to suppose that. But literally no one is bothering to dispute them. Or telling me anything like, "oh, yes, I see how it is problematic in this case... how about we solve it via the following scheme," and then laying out a specific proposal that assuages my concerns in any fashion. Instead, this is what is happening: we just rehash the suggestion in #1318 again and again and again in one way or another (by direct appeals to intuitiveness, through analogies to this or that other API), which by itself does not help.

Now then, intuition and analogy are great supporting arguments to a specific proposal, but no one is giving a specific proposal except "the original idea was perfect," and I already know it is not. I'm being invited repeatedly to look at how pretty it is when there's one input and one output. Yes, that's great, but can we move to the next level of the discussion, please? Or, if unwilling or unable to do so, at least please acknowledge that maybe there's a deeper issue being discussed than merely that? Please? To give just one example, it's been about a week since I asked for a clear definition of what is a "parameter" vs. an "input." Still haven't seen anything. But this is not unique. We also see above, again, yet another iteration of the implicit assumption that estimators are things that map columns to columns, which spares, as usual, not one word for those cases where they don't, what specifically to do about them, the problems arising from the current schemes, etc.

So here's what I'd like. Many estimators can be viewed in this way. What do we do about the ones that can't? Can we come up with a logically consistent scheme for them? What about not just estimators specifically, but other ways of "creating" columns? If you can do so in a way that addresses the problems, do so. I literally cannot do anything without that. Further, try to argue against my proposal. If we have a proposal that treats estimators as logically consistent not just with other estimators, but with other ways of creating columns in an IDataView, great.

Again, I'm not saying these questions are not answerable. But no one is bothering to answer them. So I am, again, not finding this discussion productive. So, displeased though you may be, look at it from my perspective: I have to solve a practical problem that seems to arise out of a proposal that was acted upon.
I have a plan to do so, and I have attempted in other posts to explain why it is appealing to me. I am sympathetic to the idea that some people like the original proposal, but that sympathy does not draw me any closer to a solution with logical consistency. By itself it just makes my job that much more unpleasant.
There is literally a wall of text in this issue, and trying to find the succinct reasons why the output column name should come first takes some effort.

Are there others? I can maybe be convinced of argument (1) above as being a valid reason for the order. But I don't see any valid reasons for naming the parameters `name` and `source`.
Because elsewhere, when naming new columns, that thing is called its name, and that argument is always first, as far as I see. It's not clear to me that estimators are exceptional, and I see by direct observation that treating them as if they are exceptional leads to problems. Regarding the wall of text that takes 30 or 40 minutes to discern: when it comes to the consistency of one hundred components, sometimes appreciating the issue in detail is required to hold a conversation that can go anywhere. Also, we again see the problem that I am having with this discussion: dissatisfaction is expressed, but no alternative is being proposed. What am I supposed to do with that?
I'm trying to understand all the issues being raised before proposing an alternative.
I would expect someone to read and understand the feedback, and think about any merits the feedback has.
I really want to understand why you are so insistent that the name of the output/destination column parameter be simply `name`. I totally understand when you have a class/struct that is named `Column`:

machinelearning/src/Microsoft.ML.Data/Transforms/ColumnBindingsBase.cs
Lines 122 to 128 in a570da1

This makes sense because of the context in which the properties are being used - they are in a Column class. The usage of the word `Name` there is unambiguous. However, I am not convinced that it is obvious, or even intuitive, that when I am creating an Estimator and asked to provide a `name`, it refers to the output column. Is the only reason you are rejecting more explicit parameter names that they would be inconsistent with the rest of the codebase?
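To illustrate the worry at a call site, a hypothetical factory (`Foo` is not a real estimator) using the {name, source} convention:

```csharp
// Nothing here tells a newcomer that "name" is the *output* column.
var est = mlContext.Transforms.Foo(name: "EducationEncoded", source: "Education");
```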
OK, I guess I need to weigh in as well. First of all, definitions:
@TomFinley is right: it was my suggestion to change to the current scheme (input first, then output). The majority case is that we have 'one-to-one column mapping estimators'. So, in the ideal world, we would have the only factory method be:

```csharp
mlContext.FooEstimator(string inputColumn, string outputColumn = null, float param1 = 0, string param2 = null);
```

This is very similar to common .NET APIs. Even in this simple case, there is complexity:
So, for regular rank-and-file 'one-to-one mapping', we already have a matrix of possible factory methods:
In hindsight, I think this 4-way matrix was not necessary. I would suggest trimming down to only 2 options:
Arguably, maybe we can take it even further? We could declare that the ONLY special treatment applies to a SINGLE column pair with NO extra params, as sketched below:
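A hedged sketch of that carve-out, with a hypothetical `Foo` estimator (names illustrative, not a shipped API):

```csharp
// 1. The single special case: one column pair, no extra parameterization.
var simple = mlContext.Transforms.Foo("inputColumn", "outputColumn");

// 2. Everything else backs off to the column-object form.
var general = mlContext.Transforms.Foo(
    new FooEstimator.ColumnInfo("inputColumn", "outputColumn", param1: 0.5f));
```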
In any case, let us move from the simple majority case to the exceptions.
```csharp
mlContext.BinaryClassification.Trainers.Foo(string featureColumn = "Features", string labelColumn = "Label", string weightColumn = null, float customParam1 = 0, string customParam2 = null)
```

I would argue that in this case we can leave things be. Trainers tend to be both unique and fairly uniform in their column mapping, and they also all reside in their own catalogs (like `BinaryClassification.Trainers` above).
In this case, it looks hard to define a signature consistent with the above (where we specify inputs, then outputs, then other params). However, we can still do it:

```csharp
mlContext.ManyToOneEstimator(string[] inputColumns, string outputColumn, extraParams);
```

Not exactly identical to the above, but the pattern is the same. More than one input column: good. Non-nullable output: also good (since we no longer assume that the output column name defaults to the input name). Again, in case of any multiple-mapping scenarios, we can back off to the 'standard' column-object approach:

```csharp
mlContext.ManyToOneEstimator(params ColumnInfo[] columns);
```
I believe it's tricky to have a consistent interface for all these cases, but I think we can strive to adopt some form of unification. Again, when applicable, we can back off to the good old column-object approach.
So, the TL;DR version of the above is: let's put all inputs first, then all outputs, then all other params. Now let me go over Tom's examples and try to come up with a unified conversion.

Simple 1

```csharp
public static HashingEstimator Hash(this TransformsCatalog.ConversionTransforms catalog, string inputColumn, string outputColumn = null,
    int hashBits = HashDefaults.HashBits, int invertHash = HashDefaults.InvertHash)
```

Simple 2

```csharp
public static KeyToValueMappingEstimator MapKeyToValue(this TransformsCatalog.ConversionTransforms catalog, string inputColumn)
```

This looks just like a bug to me. It should have the output column:

```csharp
public static KeyToValueMappingEstimator MapKeyToValue(this TransformsCatalog.ConversionTransforms catalog, string inputColumn, string outputColumn = null)
```

Complex 1

```csharp
public ColumnInfo(string input, string output = null,
    int maxNumTerms = ValueToKeyMappingEstimator.Defaults.MaxNumTerms,
    SortOrder sort = ValueToKeyMappingEstimator.Defaults.Sort,
    string[] term = null,
    bool textKeyValues = false)
```

Complex 2

```csharp
public static ValueMappingEstimator<TInputType, TOutputType> ValueMap<TInputType, TOutputType>(
    this TransformsCatalog.ConversionTransforms catalog,
    IEnumerable<TInputType> keys,
    IEnumerable<TOutputType> values,
    params (string source, string name)[] columns)
```

This looks like a real oddball: parameters before everything else. I am not sure what the thinking was behind it.

```csharp
public static ValueMappingEstimator<TInputType, TOutputType> ValueMap<TInputType, TOutputType>(
    this TransformsCatalog.ConversionTransforms catalog,
    string inputColumn,
    string outputColumn,
    IEnumerable<TInputType> keys,
    IEnumerable<TOutputType> values);

public static ValueMappingEstimator<TInputType, TOutputType> ValueMap<TInputType, TOutputType>(
    this TransformsCatalog.ConversionTransforms catalog,
    params MappingColumnInfo<TInputType, TOutputType>[] columns); // for multi-pair mapping
```

Here, we are not 100% consistent (the output column is not nullable, although we'd like it to be). This is because the mapping parameters are required; there is no default behavior. I think this is a small price to pay for the streamlined order of 'inputs, outputs, params'.

Simple 3

```csharp
public static ColumnCopyingEstimator CopyColumns(this TransformsCatalog catalog, string inputColumn, string outputColumn)
```

Text

```csharp
public static TextFeaturizingEstimator FeaturizeText(this TransformsCatalog.TextTransforms catalog,
    IEnumerable<string> inputColumns,
    string outputColumn,
    Action<TextFeaturizingEstimator.Settings> advancedSettings = null)
```

Keep, Drop

```csharp
Keep(params string[] outputColumns);
Keep(IEnumerable<string> outputColumns, bool keepHidden);
Drop(params string[] inputColumns);
Drop(IEnumerable<string> inputColumns, bool keepHidden);
```

Concat

```csharp
Concat(IEnumerable<string> inputColumns, string outputColumn);
Concat(params ColumnInfo[] columns);
```
Reading through the rest of the comments.
Moving my old comment from the PR to the issue, so all discussion is in one place, and adding a note about defaults.

1- I also prefer sourceColumn, destinationColumn (anything but the word "name" for the generated column). "name" as an API parameter is not meaningful. I agree that we are not going to be consistent with our command-line usage and entry-point usage, but I think that is worth the user experience.

2- After having worked through the changes adapting the new proposal, my biggest problem is that we are assigning default values to the source column, rather than making it mandatory.

On the API with the new signature: the users need to know that there exists an input column with the name "Induced", and that if they assign that string to the output/name column param, everything will be fine.
Let's stop right there, because this is the root of where we've gone down different paths, and if we can't agree on this, we can't agree on anything. This is the root of why I consider your scheme dead wrong. You gloss over it, but this is the point of why we decided a while ago to go in the opposite direction, back when @zeahmed was writing the convenience constructors for transforms.

Look at this discussion. People have the idea of "input" and "output" so baked into their heads that when I asked them to consider alternate cases where this wasn't quite so simple, most of them simply could not continue the discussion. How many weeks passed before someone besides me even bothered to mention this other case? That's how basic and fundamental this idea is, that people didn't even think about alternates.

With that in mind, let's now think about what you've just said: you've observed, as I did, that as soon as this "other required arguments" thing pops up, the scheme falls to pieces. We're talking about inconsistency in the most basic part of estimators. And your response to such a fundamental inconsistency is "small price to pay"? Really? No, no, no... that is a huge price to pay. This is literally the most basic thing we have to get right. My premise is that we must be consistent there, and if our scheme does not allow us to be consistent, then it is discarded. Hence this issue, where I tried to undo all this, since I view it as a gigantic blunder.

But, you know what, that's all fine. Since everyone agrees this pattern is great, maybe I'm wrong. I disagree and think you're committing an error that will haunt this API, but it is simply impossible for me to continue this discussion, however strongly I think you're all making a huge blunder. So, do what you like here. Just, some of you, if you're at least willing to consider the possibility that this inconsistency represents a fatal flaw... for your sake at least, I'll retrace the steps that I took with @zeahmed this past summer.

If you continue on this path, what you'll, I hope, realize is that it is useless to talk about "input" then "output" parameters and say one necessarily comes first and the other second; due to the basic syntax of C#, there are only required and unrequired parameters. If we want either one of the "input" or "output" [sic] column names to be non-required, then you will find, based on an examination of actual cases, that it is the "inputs" that tend to be most polymorphic w.r.t. their behavior (in the sense that they are sometimes absent, optional, or whatever), and so most naturally group among the unrequired. Then, among the required, you will arrive at the scheme of the "output" column name being first, followed by any other required parameters (if any). But, you haven't, and that's fine. I'm only one person and I can't fight every thing.
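A hedged sketch of the signature shape that reasoning arrives at; `Foo` and its parameters are illustrative, not a shipped API:

```csharp
// Hypothetical estimator factory; parameter names follow the {name, source} convention.
public static FooEstimator Foo(this TransformsCatalog catalog,
    string name,           // required: the column being created, always well defined
    string source = null,  // optional: defaults to 'name' in the one-to-one case
    int someParam = 42)    // other optional parameterization
```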
Regarding the ordering of inputs and outputs, I believe there is other important related work to consider: SQL. There, the output is specified before the input. @interesaaat could be in a position to (dis)prove this perspective.
In general, the proposal of having the output column be nullable is a bit strange to me. Instead of having a pattern:
What if we split these methods into two overloads:
If we have a 1-1 transform, and we want the ease of just specifying a single column name rather than the same name for both the input and output columns, then use overloads instead of nullable/default parameters for one of the columns. Would that make everything consistent? A sketch of the idea follows below.
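Concretely, the two-overload idea might look like this (hypothetical names, not a shipped API):

```csharp
// 1-1 convenience: read and replace the same column.
public static FooEstimator Foo(this TransformsCatalog catalog, string columnName);

// Explicit pair: nothing nullable or defaulted.
public static FooEstimator Foo(this TransformsCatalog catalog, string inputColumn, string outputColumn);
```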
Replying to Markus: you are right, the output goes first, and I think that this is because it eases composability. SQL queries (as well as Datalog, for that matter) are in fact composable, and it is much easier to get the output schema if it is specified first than if it is instead baked into the query (or last). Nevertheless, when you write queries, you always start from the input schema (it is difficult to derive the output schema without knowing what you get as input). So in practice, the output schema goes first because it is easier to use the query afterwards, but when you write it, you first look at the input schema. This being said, output first and input later looks odd to me in a "functional API", and since apparently this is true for the majority of people, if I can put in my two cents, I would rather go with an inconsistent API (my understanding, and I really tried to read everything but I failed, is that the bigger inconsistency is that the output is sometimes not nullable?) than with something that people have problems understanding/using.
I think that falls apart for reasons of confusion, and maybe consistency, once we consider that some of these parameters are very often strings themselves. That is, your two overloads can collide. So, without checking, let me ask you a question about .NET. We will use a hypothetical simple example to guide our thinking.

```csharp
void Foo(string input, string optionalInput = null);
void Foo(string input, string output, string optionalInput = null);
```

Then imagine I have some other code further down that calls this method:

```csharp
Foo("hello", "friend");
```

I'll tell you frankly, I don't know what will happen here. We might say I'm just ignorant, but I've been using .NET professionally for a decade now. I know a fair amount about its behavior, but this just seems confusing to me. What does the C# compiler do here? Does it complain? Does it favor the second one, because that one is the one with two required arguments and I provided two arguments?

If it prefers the second, let us imagine a user. They write this:

```csharp
Foo("hello");
```

They read the documentation. They see the first overload. Then they say, "hey, this optional parameter does something I want, and it is optional." They naturally change it to be this:

```csharp
Foo("hello", "friend");
```

But lurking in the shadows is this overload you made me add. To be honest, I have absolutely no idea what C# does in the face of a situation like this. (I am about to find out now, but I wanted to write the above to make the point that I, a 10-year C# veteran working with it day in day out, have absolutely no idea what will happen in this hypothetical API.) So let me find out...

```csharp
public static class Program
{
    public interface Bubba
    {
        void Foo(string input, string optionalInput = null);
        void Foo(string input, string output, string optionalInput = null);
    }

    private static void Main(string[] args)
    {
        Bubba bubs = default;
        bubs.Foo("hello", "friend");
    }
}
```

OK. Apparently, against all my guesses, it did not complain, but it also did not prefer the second; it preferred the first. That, I'd argue, is even worse from this specific perspective. Let's imagine I am trying to map an input to an output column; it's not even clear to me at that point how I'd write that code. (I mean, I guess I could be explicit with named arguments, but that hardly helps the casual reader.)

But, I'd argue that your proposal has, in this common case, led to a worse, not a better, world. So I'm afraid I am not enthusiastic about this "overload" idea. (There's also the secondary point about it entailing a lot of work, etc., which we ought not to be blind to, but that's a secondary point.)
I have a different view. Adjustment of initial prejudices to "get" a consistent API, even if this hypothetical person initially has trouble "getting" it, is a cost paid once. Being inconsistent is a cost paid forever. I have a clear preference in this case. But, if all the .NET people are screaming at me that I'm wrong, they know API design better than I do, so... whatevs. Do what you want.
After discussing this with @Ivanidzo4ka, @TomFinley, @eerhardt, and @markusweimer, the decision is to go with what this issue's title proposes: estimator arguments take the output column name as the first parameter, with any input column names as subsequent parameters.
@stephentoub @Zruty0 @glebuk @zeahmed @justinormont @KrzysztofCwalina @interesaaat thank you all for your input. cc @terrajobst in case he wants to discuss it further in the API review meetings.

Fixed with #2239
So, a somewhat embarrassing blunder to report... not quite sure how to say this.
Throughout our codebase, for years we've always specified the name of the output column of a transform first, then the source(s).
That's good. But when estimators were introduced, somehow, nearly all of them were introduced in the reverse order: nearly all of them specify the inputs, then the outputs. This was probably an unconscious mistake, but it's one with fairly wide consequences, since that mistaken pattern was copied again and again as people did the estimator conversion work, to the point where most (not all!) estimators are now inconsistent with the whole rest of the codebase!
This should be corrected: it is at the very least inconsistent, and even if it were not, it would still be rather obnoxious practically, because specifying the name of the output first has practical benefits and makes a lot more sense: if you're specifying a transformation, the most important information someone will want to know is what you're calling your output!
The Story Until a Few Months Ago
So, throughout our codebase, for years, it has been our practice that when specifying a transform, you specify the name of the output, then you specify the name of the input(s) (if any). The reason for this is practical: the outputs are usually the result of a single well-defined calculation (the application of the transform), whereas what is taken as an "input" to a transform can have various meanings, since a transform is more often multivariate in its inputs than in its outputs. (E.g., concatenate can have multiple inputs, text can have multiple inputs, MI feature selection takes multiple heterogeneous inputs, but all produce single output columns.)
This was our practice certainly when this code was a tool, as we see in the command line help string, specifying name then source.
machinelearning/src/Microsoft.ML.Data/Transforms/ColumnCopying.cs
Lines 109 to 110 in bb46fdf
This trend continued during our initial attempts at an API, as seen in this early discussion in PR 405, and as seen here:
https://github.com/zeahmed/machinelearning/blob/16f7883933b56f8fd86077bf0fd262b24374e9d0/src/Microsoft.ML.Data/Transforms/ConvertTransform.cs#L116
and here
https://github.com/zeahmed/machinelearning/blob/16f7883933b56f8fd86077bf0fd262b24374e9d0/src/Microsoft.ML.Data/Transforms/DropSlotsTransform.cs#L226
and numerous other places.
This is done for the practical reason that, when a transform produces an output, what outputs it has are usually finite and well defined, whereas it can take multiple inputs. The most conspicuous and widely used example of this is the concatenation transform. Also included are things like the text featurizing transform, and other such things; a sketch of the pattern follows below.
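The shape here is from memory, not the exact shipped signature: the one well-defined output name comes first, then however many inputs follow.

```csharp
// Single output name first, then a variable number of input column names.
var concat = new ColumnConcatenatingEstimator(env, "Features", "Age", "Parity", "Induced");
```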
So far so good...
But Then...
Now, somehow, through some mechanism that wasn't quite clear to me, as `IEstimator` implementations are being created, someone commits the fateful mistake of reversing inputs and outputs. Suddenly, instead of being typically parameterized as `name` and a sometimes-optional `source`, we instead have the required `input` and a sometimes-optional `output`. What a disaster! And, I did not catch it in review. An early example:
machinelearning/src/Microsoft.ML.Transforms/Text/TextTransform.cs
Lines 278 to 279 in c8de311
Then, as people use this as a template for their own estimator conversion work, nearly all estimators copied this mistake, until practically all estimators had this problem. This includes much of the extensions.
machinelearning/src/Microsoft.ML.Data/Transforms/ExtensionsCatalog.cs
Lines 30 to 31 in d9270c9
Now then, the reason why I know any of this is that @stephentoub wrote in and said, "hey, how come you have your column concatenation operation specify output then inputs? That's different from everywhere else! I know it's less convenient, but it's better to be consistent."
machinelearning/src/Microsoft.ML.Data/Transforms/ExtensionsCatalog.cs
Line 39 in d9270c9
And, after thinking that he must be very confused, since, again, the pattern of `name` then `source` is very, very well defined throughout the codebase, I looked and found he is absolutely correct: a thoroughly massive blunder had somehow made it throughout the entire codebase under our very noses, including, it must be said, mine. :) :)
So, this is bad, but happily this was caught before v1. But, needless to say this must be fixed immediately.