-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Chains of Chains #2820
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
@TomFinley @eerhardt Your input would be greatly appreciated! |
I believe this is intentional and by design. Reading #581:
|
Confirmed with other parties offline that this is intended, even though it's easy to be led astray with an unintended |
Re-opening to discuss a better way to consciously define nested |
As @eerhardt explains, this design is intentional. I think flattening would lead to more confusion. If I form an estimator chain with Nonetheless, the issue with the stray var x = a.Append(b.Append(c).Append(d.Append(e)).Append(f));
var y = a.Append((b.Append(c).Append(d.Append(e).Append(f));
var z = a.Append(b).Append(c).Append(d).Append(e).Append(f); The difference in semantics here are very subtle. (I'd expect the ultimate transformation to behave the same, which is in a sense all we really care about, but this structure is unclear. I'm having trouble teasing apart what is actually happening without careful study. (Though of course we could argue that this is true of parentheses generally.) var x = a.Append(b).Append(c);
IEstimator y = x;
var xx = x.Append(d);
var yy = y.Append(d); What do you think the second one does? This potential confusion is all the natural consequences of having the method var z = a.CreateChain(b).Append(c).Append(d).Append(e).Append(f); This would solve @rogancarr 's problem I think, at the cost of having one more name to describe a very "close" (but nonetheless distinguishable) operation. What do we think? @eerhardt , @rogancarr , @sfilipi ? |
It's fine if the answer is, let's keep it as is, I just want to be explicit that we understand that this is a potential problem, and we're making that choice deliberately as opposed to the current situation (which I suspect is accidental). |
@TomFinley This is a great write-up. Under the current definitions, I have no idea what Personally, I like having the instance and the extension methods having different names because they do different things, and I would like to specifically ask to have something nested. Also, what functionality do we get from having nesting? With the current way we propagate information through |
OK. Does anyone object to having a different name then? Maybe we can make it something.
I wouldn't put it that way. I'd instead say that nesting is merely a natural consequence of having estimator chains that are themselves estimators. I'm not arguing that nesting is super useful, I'm arguing that the behavior of the API if we take positive steps make it impossible becomes incomprehensible. This is why I led off by arguing that if I chain As for the question though of whether nesting is itself valuable, that's more than I know. The question I'd rather ask is, am I willing to make the behavior of our API more polymorphic and complex merely to make nesting impossible? Definitely not. |
Oh, this is why it's valuable. Say that I am building a learning pipeline. I have various actors build different stages. I don't actually know the contents of those steps. However, I know that a handful of them define a trainer, so I can extract them and do something with the models. Here is an example that I wrote to do this: [Fact]
public void InspectNestedPipeline()
{
var mlContext = new MLContext(seed: 1, conc: 1);
var data = mlContext.Data.LoadFromTextFile<Iris>(GetDataPath(TestDatasets.iris.trainFilename),
hasHeader: TestDatasets.iris.fileHasHeader,
separatorChar: TestDatasets.iris.fileSeparator);
// Create a training pipeline.
var pipeline = mlContext.Transforms.Concatenate("Features", Iris.Features)
.Append(StepOne(mlContext))
.Append(StepTwo(mlContext));
// Train the model.
var model = pipeline.Fit(data);
// Extract the trained models.
var modelEnumerator = model.GetEnumerator();
modelEnumerator.MoveNext(); // The Concat Transform
modelEnumerator.MoveNext();
var kMeansModel = (modelEnumerator.Current as TransformerChain<ClusteringPredictionTransformer<KMeansModelParameters>>).LastTransformer;
modelEnumerator.MoveNext();
var mcLrModel = (modelEnumerator.Current as TransformerChain<MulticlassPredictionTransformer<MulticlassLogisticRegressionModelParameters>>).LastTransformer;
// Validate the k-means model.
VBuffer<float>[] centroids = default;
kMeansModel.Model.GetClusterCentroids(ref centroids, out int nCentroids);
Assert.Equal(4, centroids.Length);
// Validate the MulticlassLogisticRegressionModel.
VBuffer<float>[] weights = default;
mcLrModel.Model.GetWeights(ref weights, out int classes);
Assert.Equal(3, weights.Length);
}
private IEstimator<TransformerChain<ClusteringPredictionTransformer<KMeansModelParameters>>> StepOne(MLContext mlContext)
{
return mlContext.Transforms.Concatenate("LabelAndFeatures", "Label", "Features")
.Append(mlContext.Clustering.Trainers.KMeans(
new KMeansPlusPlusTrainer.Options
{
InitAlgorithm = KMeansPlusPlusTrainer.InitAlgorithm.Random,
ClustersCount = 4,
MaxIterations = 10,
NumThreads = 1
}));
}
private IEstimator<TransformerChain<MulticlassPredictionTransformer<MulticlassLogisticRegressionModelParameters>>> StepTwo(MLContext mlContext)
{
return mlContext.Transforms.Conversion.MapValueToKey("Label")
.Append(mlContext.MulticlassClassification.Trainers.StochasticDualCoordinateAscent(
new SdcaMultiClassTrainer.Options {
MaxIterations = 10,
NumThreads = 1 }));
} Now, this does happen to be a bit awkward.
|
Right, but your ability to know that it had the transforms in that order is precisely because we didn't rewrite it at will. We just did what you told us to do, so you were able to know. The only thing you have to do in order to enable your scenario is be explicit about what you want. If we wanted to write an enhancement in the future if flattening actually becomes a thing, that's fine, but the default behavior should be to allow nesting. You can't have an |
Yes, we should keep nesting in. The example I gave above was to show that it's useful. My worry here is that we have a capability, but we don't have a lot of good examples or scenarios on what it is for, so we don't know if it operates the way we want it to. The typing/casting issues are hard; I'm not sure what to do about it. Maybe it's not necessary? It be nice, however, to allow a bit of encapsulation in nested pipelines (No 3. above), so that the nested Example: Imagine that I expect a k-means model out from you. I give you a schema in. I expect the k-means model back, and then I want to send this new IDV to a second pipeline.
My initial thought was that this is what nested pipelines are for. However, it's already possible to do this in C# code manually using separate (non-nested) pipelines. If this is the preferred case, then what use does nesting provide? As an aside, we can ignore No 2. above, as this is actually currently possible. You can use Linq to do |
It is not necessary in the sense that our advice is, if you care about this, you should write your code in a bit of a different way. The pipe as I said earlier is for those situations where you want to form a chain of estimators as a logical unit -- that is, situations where you don't care about what the individual parts are doing per se. As soon as you start to care, I'd argue that you shouldn't have put them in a pipe. So, if you want something as a discrete unit, you accomplish that by just keeping it as a discrete unit. So if I had three estimators A, B, C, out of which I wanted to keep and maintain the resulting transformers X, Y, and Z, I would just fit and transform using the estimators and transformers directly. These chains are little more than helper methods to do that chained This is just kind of "how things are" in statically typed languages. E.g., you have a We did "cheat" a little bit insofar as we maintain in a strongly typed fashion the type of whatever the last estimator or transformer is, but as @Zruty0 has argued on multiple occasions perhaps that is not so useful. If we really wanted to solve this problem of having pipelines and strong typing of the individual items, the only way I can figure how we'd do that is by adopting an approach similar to what is done with tuples and value tuples... then if we had a chain of |
Based on the conversion above, the consensus is to make no additional changes. Therefore I am closing this sisue. |
It is possible to nest
EstimatorChain
s inside one another, fit them, and use them to transform data. The result is an object that is a nestedTransformerChain
.Question: Is this intended behavior? Do we want to allow this sort of nesting in the V1 API?
I think that the proper way to handle nesting is to first flatten the structure before the fit and return a single
EstimatorChain
. I believe that since there is no forking and joining, that nested and non-nested pipelines are identical, except for the returned object. Data transformed by these objects should be the same whether the pipeline is nested or not (and is in my limited testing).Take a look at the following example where we featurize the UCI Adult dataset.
Here,
pipeline
is anEstimatorChain<BinaryPredictionTransformer<...>>
andmodel
is aTransformerChain<BinaryPredictionTransformer<...>>
.It's also possible to nest the pipeline. Perhaps you accidentally put an errant
)
here and there, and then you have this:Now,
pipeline
is anEstimatorChain<EstimatorChain<BinaryPredictionTransformer<...>>>
andmodel
is aTransformerChain<TransformerChain<BinaryPredictionTransformer<...>>>
.Now, if I compare the two (where
var predictor = model.LastTransformer
andvar nestedPredictor = nestedModel.LastTransformer.LastTransformer
), it's clear that the models and the transformed data are identical:The text was updated successfully, but these errors were encountered: