Description
So, I had a conversation with @eerhardt and @yaeldekel about the nature of models, in particular relating to the saving and loading routines. This is very important for us to get right, since the artifacts of what we learn and how we transform data, and their persistability, are probably the most important thing for us to handle correctly.
We take the view initially that the model is the `ITransformer` (note that a chain of `ITransformer`s is itself an `ITransformer`). But by itself this is an insufficient description, as we saw in #2663 and its subsequent "child" #2735: from the point of view of the model being practically "the stuff you need to keep," there's a lot more to a machine learning model than merely the `ITransformer` chain -- you also need to preserve some notion of what the input to it is. So we added these things to take either a loader or the input schema, to be saved as part of the model.
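To make that concrete, here is a hedged sketch of saving and loading a transformer together with its input schema (the `Save`/`Load` overloads on `ModelOperationsCatalog` as I understand them; the pipeline, file name, and the assumption that `trainingData` is an `IDataView` already in hand are all illustrative):

```csharp
var mlContext = new MLContext();

// An illustrative trained pipeline; any ITransformer works here.
ITransformer model = mlContext.Transforms
    .Concatenate("Features", "F1", "F2")
    .Fit(trainingData);

// Save the transformer together with the schema of its input data,
// so the model file records what kind of data it expects.
mlContext.Model.Save(model, trainingData.Schema, "model.zip");

// Loading hands back both the transformer and that input schema.
ITransformer loaded = mlContext.Model.Load("model.zip", out DataViewSchema inputSchema);
```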
Yet, is the loader a model itself? Sometimes that's precisely what we call it:
And in the same file we call it something else:
It is a model in one sense because it is yet another thing that takes input data and produces output data -- the fact that `ITransformer` does it specifically over `IDataView` as input does not give it some magical, special status that allows it to be called a model to the exclusion of other candidates. If I take a loader and append a transform to it, the whole aggregate thing is still a loader. If it isn't a model, it only isn't one by the mere skin of its teeth. Hence the presence of the original thing, and why we have, in the model operations catalog, operations to save and load `IDataLoader` itself specifically.
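As a hedged sketch of that point (using the `CreateTextLoader` and loader `Append`/`Load` helpers as I understand them; column names and file names are made up):

```csharp
var mlContext = new MLContext();

// A text loader is itself "a thing that takes input and produces data."
IDataLoader<IMultiStreamSource> loader = mlContext.Data.CreateTextLoader(new[]
{
    new TextLoader.Column("F1", DataKind.Single, 0),
    new TextLoader.Column("F2", DataKind.Single, 1),
});

// A transformer fitted elsewhere; illustrative only.
ITransformer transformer = mlContext.Transforms
    .Concatenate("Features", "F1", "F2")
    .Fit(loader.Load("train.tsv"));

// Appending the transformer yields a composite that is still a loader:
// it, too, takes files in and produces an IDataView out.
var stillALoader = loader.Append(transformer);
```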
But at the same time this duality of the term "model" is, as I understand @eerhardt, confusing. We have two things we're calling a model. In an ideal world, I feel like if we can get away with just one story of what the model is, we should take it, and if there must be only one it must be `ITransformer`. We even have the situation where if you type `mlContext.Model.Save(`, the first overload that pops up is the `IDataLoader` one, which is kind of strange.
I am not quite sure what I think, since in this case I agree with whoever talks to me last with even a vaguely convincing argument. But I think in this case I will see about getting rid of the `IDataLoader`-only APIs -- people can, if it is important, continue to save and load such things by using empty `ITransformer` chains (again, any chain of `ITransformer`s is itself an `ITransformer`, including the empty chain).
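A hedged sketch of the empty-chain workaround (assuming `TransformerChain<TLastTransformer>` can be constructed with no transformers, which is how I read the type; the loader and file name are illustrative):

```csharp
var mlContext = new MLContext();
var loader = mlContext.Data.CreateTextLoader(new[]
{
    new TextLoader.Column("Text", DataKind.String, 0),
});

// The empty chain is the trivial transformer: output == input.
var emptyChain = new TransformerChain<ITransformer>();

// Saving "no transformer plus a loader" preserves the loader alone.
mlContext.Model.Save(emptyChain, loader, "loader-only.zip");
```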
Since we are approaching v1, I view it as a bit more important to be conservative w.r.t. potentially confusing additions to the API, especially around something as central as the saving and loading of models. We might be able to add it back later if there's some really compelling scenario for them, that we somehow did not anticipate.
We will of course retain the saving and loading of transformers with loaders, since that is really important to be able to capture, but I think being consistent around the story that "models are transformers" as we are most places is kind of important.
Activity
TomFinley commented on Mar 20, 2019
A side effect of this work is that we still want to support a loader-only pipeline -- we just don't want to encourage that as being somehow the primary use case. (Which, against our desires, has arguably happened with how intellisense is presenting things.) So I will support, as a convenience, `null` as the `ITransformer` input, with the documentation that this is intended as shorthand for an empty chain (which is to say, the trivial transformer that returns its input as output).

There's also an annoying inconsistency in the API: we describe the inputs in different positions. Here we see the input coming first.
`machinelearning/src/Microsoft.ML.Data/Model/ModelOperationsCatalog.cs`, line 65 at c38f81b
Here it comes second.
`machinelearning/src/Microsoft.ML.Data/Model/ModelOperationsCatalog.cs`, line 106 at c38f81b
I feel like we ought to be consistent. While the order in which things are done suggests putting the input first (since input feeds into transformer), I wonder if emphasizing the thing they all have in common (they take `ITransformer`) as the more fundamental object is not more correct. I do not have a strong opinion here, but will provisionally perhaps put `ITransformer` first.

yaeldekel commented on Mar 20, 2019
I agree that too many save/load APIs can be confusing (especially if the first one that shows up in intellisense is the one least commonly used). I added the `IDataLoader`-only API for convenience, since sometimes the whole pipeline can be contained in an `IDataLoader` of type `CompositeDataLoader`; so if it makes things inconvenient, it should definitely not be there :-).

I would perhaps suggest that we add a property `TransformerChain.Empty`. It may be a bit more convenient in the case of saving a loader when there are no transformers in the pipeline. It would be this:

instead of this:
Although if there is a way to put the `Empty` property somewhere more discoverable, that would be nice. I can't think of anywhere...

Sorry @TomFinley, I guess I should have read your comment before writing mine... passing null and instantiating the empty transform chain internally is even easier than what I suggested :-).
TomFinley commented on Mar 20, 2019
Cool thanks @yaeldekel.
Separate note: another thing I notice is that when we save a single transformer, deserializing it gives back a chain wrapping that transformer, which is not quite what I expected. So I'm going to make it so that the loader and the saver deal in the same type of thing.
TomFinley commented on Mar 20, 2019
Oh boy... and while doing this work I see right smack dab in the middle of this, this:
`machinelearning/src/Microsoft.ML.Data/DataLoadSave/TransformerChain.cs`, line 64 at e00d19d
A mutable array, right in the middle of a transformer. Given the tight time constraints I'm somewhat disinclined to file a separate issue and PR for that specifically.
Edit: Whoops, explicit interface. It's fine. 😄
TomFinley commented on Mar 20, 2019
This badness actually worked both ways: when you saved an `IDataLoader` that happened to be a composite loader, then loaded it back, the load would actually decompose the loader for you, even though you had explicitly told it to save a single loader! Anyway, that is also now fixed.

glebuk commented on Mar 21, 2019
Note that with the current code, if you load a model without an `IDataLoader` using the wrong `.Load()` method, you get a very "helpful" exception:

```
System.InvalidOperationException: Model does not contain an IDataLoader
FormatException: Corrupt model file
```

This is not expected and completely confusing.
TomFinley commented on Mar 21, 2019
Well, anyway, I'm getting rid of that method as described above; `LoadWithDataLoader` will be the only way to service this scenario going forward, and the load/save methods will treat `ITransformer` as the first-class citizen from now on, with the description of the inputs (whether loaders or schemas) being considered secondary information. So hopefully the situation will be less inherently confusing, which I agree it was. (I generally find that overloads returning different types are often confusing.)

On top of that, I also made the advice in the exception message more descriptive, to indicate why this could happen, and to suggest an alternate way.

Regarding whether it was confusing or not, I'm not sure. Certainly the message was completely accurate, but as we see, I added a lot more description and details. Anyway, I'll post the PR soon...
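A hedged sketch of how the two load paths would look after this change (the `Load` and `LoadWithDataLoader` signatures on `ModelOperationsCatalog` as I understand them; file names are illustrative):

```csharp
var mlContext = new MLContext();

// Plain transformer model: the input is described by a schema.
ITransformer model = mlContext.Model.Load(
    "model.zip", out DataViewSchema inputSchema);

// Model saved together with its loader: the dedicated method
// hands back both the transformer and the loader explicitly.
ITransformer model2 = mlContext.Model.LoadWithDataLoader(
    "model-with-loader.zip", out IDataLoader<IMultiStreamSource> loader);
```

Keeping the loader variant as a separately named method, rather than an overload of `Load` that returns a different type, is what addresses the overload-confusion point above.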