Chains of Chains #2820

Closed
rogancarr opened this issue Mar 1, 2019 · 13 comments
Labels
API Issues pertaining the friendly API question Further information is requested

Comments

@rogancarr
Contributor

It is possible to nest EstimatorChains inside one another, fit them, and use them to transform data. The result is an object that is a nested TransformerChain.

Question: Is this intended behavior? Do we want to allow this sort of nesting in the V1 API?

I think that the proper way to handle nesting is to first flatten the structure before the fit and return a single EstimatorChain. I believe that since there is no forking and joining, that nested and non-nested pipelines are identical, except for the returned object. Data transformed by these objects should be the same whether the pipeline is nested or not (and is in my limited testing).

Take a look at the following example where we featurize the UCI Adult dataset.

var mlContext = new MLContext(seed: 1, conc: 1);

// Load the Adult (tiny) dataset
var data = mlContext.Data.LoadFromTextFile<Adult>(GetDataPath(TestDatasets.adult.trainFilename),
    hasHeader: TestDatasets.adult.fileHasHeader,
    separatorChar: TestDatasets.adult.fileSeparator);

// Create the learning pipeline
var pipeline = mlContext.Transforms.Concatenate("NumericalFeatures", Adult.NumericalFeatures)
    .Append(mlContext.Transforms.Concatenate("CategoricalFeatures", Adult.CategoricalFeatures))
    .Append(mlContext.Transforms.Categorical.OneHotHashEncoding("CategoricalFeatures",
        invertHash: 2, outputKind: OneHotEncodingTransformer.OutputKind.Bag))
    .Append(mlContext.Transforms.Concatenate("Features", "NumericalFeatures", "CategoricalFeatures"))
    .Append(mlContext.BinaryClassification.Trainers.LogisticRegression());

// Train the model.
var model = pipeline.Fit(data);

Here, pipeline is an EstimatorChain<BinaryPredictionTransformer<...>> and model is a TransformerChain<BinaryPredictionTransformer<...>>.

It's also possible to nest the pipeline. Perhaps you accidentally put an errant ) here and there, and then you have this:

// Create the learning pipeline
var pipeline = mlContext.Transforms.Concatenate("NumericalFeatures", Adult.NumericalFeatures)
    .Append(mlContext.Transforms.Concatenate("CategoricalFeatures", Adult.CategoricalFeatures))
    .Append(mlContext.Transforms.Categorical.OneHotHashEncoding("CategoricalFeatures",
        invertHash: 2, outputKind: OneHotEncodingTransformer.OutputKind.Bag) // <-- missing a )
    .Append(mlContext.Transforms.Concatenate("Features", "NumericalFeatures", "CategoricalFeatures"))
    .Append(mlContext.BinaryClassification.Trainers.LogisticRegression())); // <-- extra )

Now, pipeline is an EstimatorChain<EstimatorChain<BinaryPredictionTransformer<...>>> and model is a TransformerChain<TransformerChain<BinaryPredictionTransformer<...>>>.

Now, if I compare the two (where var predictor = model.LastTransformer and var nestedPredictor = nestedModel.LastTransformer.LastTransformer), it's clear that the models and the transformed data are identical:

//True!
Assert.Equal(predictor.Model.SubModel.Bias, nestedPredictor.Model.SubModel.Bias);
int nFeatures = predictor.Model.SubModel.Weights.Count;
for (int i = 0; i < nFeatures; i++ )
    //True!
    Assert.Equal(predictor.Model.SubModel.Weights[i], nestedPredictor.Model.SubModel.Weights[i]); 

var transformedRows = mlContext.Data.CreateEnumerable<BinaryPrediction>(transformedData, false).ToArray();
var nestedTransformedRows = mlContext.Data.CreateEnumerable<BinaryPrediction>(nestedTransformedData, false).ToArray();
for (int i = 0; i < transformedRows.Length; i++)
    //True!
    Assert.Equal(transformedRows[i].Score, nestedTransformedRows[i].Score); 
@rogancarr rogancarr added question Further information is requested API Issues pertaining the friendly API labels Mar 1, 2019
@rogancarr
Contributor Author

@TomFinley @eerhardt Your input would be greatly appreciated!

@eerhardt
Member

eerhardt commented Mar 4, 2019

I believe this is intentional and by design. Reading #581:

Obviously, a chain of transformers can itself behave as a transformer, and a chain of estimators can behave like estimators.

@rogancarr
Contributor Author

Confirmed with other parties offline that this is intended, even though it's easy to be led astray with an unintended ) or two.

@rogancarr
Contributor Author

rogancarr commented Mar 4, 2019

Re-opening to discuss a better way to consciously define nested EstimatorChains on request from @TomFinley .

@rogancarr rogancarr reopened this Mar 4, 2019
@TomFinley
Contributor

TomFinley commented Mar 4, 2019

As @eerhardt explains, this design is intentional. I think flattening would lead to more confusion. If I form an estimator chain with n items, I expect the chain itself to have n estimators in it upon creation, and n transformers in the resulting chain upon fitting -- if it does not, I am going to be confused. The effect of flattening would also be that if the last estimator happened to be an estimator chain itself, the type of what is the LastTransform upon creation is also unclear. (Indeed, even if we could somehow manage to get that to work via magic of overload resolution, the effect would be far more confusing than helpful.)

Nonetheless, the mistake with the stray ) is easy to make. And there are other problems...

var x = a.Append(b.Append(c).Append(d.Append(e)).Append(f));
var y = a.Append(b.Append(c).Append(d.Append(e).Append(f)));
var z = a.Append(b).Append(c).Append(d).Append(e).Append(f);

The differences in semantics here are very subtle. (I'd expect the ultimate transformation to behave the same, which is in a sense all we really care about, but the structure is unclear.) I'm having trouble teasing apart what is actually happening without careful study. (Though of course we could argue that this is true of parentheses generally.)

var x = a.Append(b).Append(c);
IEstimator y = x;
var xx = x.Append(d);
var yy = y.Append(d);

What do you think the second one does? xx is a chain of four estimators, whereas yy is a chain of two. This is because the convenience extension method is defined over IEstimator, but the instance method is preferred when the variable is typed as the chain itself (as it is with var x).
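The overload-resolution behavior described here can be reproduced without ML.NET. The following is a minimal sketch, assuming simplified, non-generic stand-ins: the `IEstimator`, `Leaf`, `Chain`, and `EstimatorExtensions` types below are hypothetical and only mimic the shape of the real API (an instance `Append` on the chain class plus an extension `Append` on the interface).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the estimator interface (not the ML.NET type).
public interface IEstimator { }

// Hypothetical leaf estimator.
public sealed class Leaf : IEstimator
{
    public Leaf(string name) => Name = name;
    public string Name { get; }
}

// Hypothetical chain that is itself an estimator.
public sealed class Chain : IEstimator
{
    private readonly List<IEstimator> _items;

    public Chain(IEnumerable<IEstimator> items) => _items = new List<IEstimator>(items);

    public int Count => _items.Count;

    // Instance method: returns a new chain with one more item at the end.
    public Chain Append(IEstimator next) =>
        new Chain(new List<IEstimator>(_items) { next });
}

public static class EstimatorExtensions
{
    // Extension method: wraps both arguments in a new two-item chain.
    public static Chain Append(this IEstimator start, IEstimator next) =>
        new Chain(new[] { start, next });
}

public static class Program
{
    public static void Main()
    {
        IEstimator a = new Leaf("a"), b = new Leaf("b"), c = new Leaf("c"), d = new Leaf("d");

        var x = a.Append(b).Append(c); // extension, then instance: three items
        IEstimator y = x;              // same object, less specific static type

        var xx = x.Append(d);          // instance method is chosen: four items
        var yy = y.Append(d);          // extension method is chosen: two items

        Console.WriteLine(xx.Count);   // prints 4
        Console.WriteLine(yy.Count);   // prints 2
    }
}
```

The C# compiler only considers extension methods when no applicable instance method exists on the static type of the receiver, which is why the declared type of the variable, not the runtime type of the object, decides which `Append` runs.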

This potential confusion is a natural consequence of the .Append extension method having the same name as the instance method. We have previously expressed, I think, contentment with this state, but I would like to be explicit about it. We could, if we wanted, name the extension method something else. Maybe CreateChained or something. (I don't insist on the name.) This would make it absolutely clear what is going on.

var z = a.CreateChain(b).Append(c).Append(d).Append(e).Append(f);

This would solve @rogancarr 's problem I think, at the cost of having one more name to describe a very "close" (but nonetheless distinguishable) operation. What do we think? @eerhardt , @rogancarr , @sfilipi ?

@TomFinley
Contributor

TomFinley commented Mar 4, 2019

It's fine if the answer is, let's keep it as is, I just want to be explicit that we understand that this is a potential problem, and we're making that choice deliberately as opposed to the current situation (which I suspect is accidental).

@rogancarr
Contributor Author

@TomFinley This is a great write-up. Under the current definitions, I have no idea what xx and yy will be in your example. I'd have to code it up and use IntelliSense at this point.

Personally, I like having the instance and the extension methods having different names because they do different things, and I would like to specifically ask to have something nested.

Also, what functionality do we get from having nesting? With the current way we propagate information through IDataView, I don't think that we get local scoping of transformations. And because of the sequential nature of the Append method, we don't get parallelization either. Does anybody have an example of when I'd want to use nesting?

@TomFinley
Contributor

TomFinley commented Mar 4, 2019

OK. Does anyone object to having a different name then? Maybe we can make it something.

Also, what functionality do we get from having nesting?

I wouldn't put it that way. I'd instead say that nesting is merely a natural consequence of having estimator chains that are themselves estimators. I'm not arguing that nesting is super useful; I'm arguing that the behavior of the API, if we take positive steps to make nesting impossible, becomes incomprehensible. This is why I led off by arguing that if I chain n items together, it is natural to suppose I wind up with a chain of n items. Because I can understand that. But, in this "flatten" world, the answer to "how many estimators are in the chain if you chain n estimators together" is, you have absolutely no idea. It could be anywhere from 0 (assuming empty chains are possible) to infinity, and I don't find that lack of predictability appealing in an API.

As for the question though of whether nesting is itself valuable, that's more than I know. The question I'd rather ask is, am I willing to make the behavior of our API more polymorphic and complex merely to make nesting impossible? Definitely not.
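The counting argument above can be made concrete with a toy model. This is only a sketch under stated assumptions: `IEstimator`, `Leaf`, and `Chain` are hypothetical stand-ins, `AppendNested` mimics the nesting behavior described in this thread, and `AppendFlattened` is an invented illustration of the proposed flattening, not anything that exists in ML.NET.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins, not ML.NET types.
public interface IEstimator { }

public sealed class Leaf : IEstimator { }

public sealed class Chain : IEstimator
{
    private readonly List<IEstimator> _items;

    public Chain(params IEstimator[] items) => _items = new List<IEstimator>(items);

    public int Count => _items.Count;

    // Nesting: the appended item is stored as-is, so the count always grows by one.
    public Chain AppendNested(IEstimator next)
    {
        var items = new List<IEstimator>(_items) { next };
        return new Chain(items.ToArray());
    }

    // Hypothetical flattening: an appended chain is unpacked into its items,
    // so the resulting count depends on what the argument happens to contain.
    public Chain AppendFlattened(IEstimator next)
    {
        var items = new List<IEstimator>(_items);
        if (next is Chain inner)
            items.AddRange(inner._items);
        else
            items.Add(next);
        return new Chain(items.ToArray());
    }
}

public static class Program
{
    public static void Main()
    {
        var start = new Chain(new Leaf());
        var three = new Chain(new Leaf(), new Leaf(), new Leaf());
        var empty = new Chain();

        Console.WriteLine(start.AppendNested(three).Count);    // prints 2
        Console.WriteLine(start.AppendFlattened(three).Count); // prints 4
        Console.WriteLine(start.AppendFlattened(empty).Count); // prints 1
    }
}
```

With nesting, one call to append always adds exactly one item. With flattening, the same call adds zero, one, or many items depending on the argument, which is the unpredictability the comment objects to.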

@rogancarr
Contributor Author

This is why I led off by arguing that if I chain n items together, it is natural to suppose I wind up with a chain of n items.

Oh, this is why it's valuable.

Say that I am building a learning pipeline. I have various actors build different stages. I don't actually know the contents of those steps. However, I know that a handful of them define a trainer, so I can extract them and do something with the models.

Here is an example that I wrote to do this:

[Fact]
public void InspectNestedPipeline()
{
    var mlContext = new MLContext(seed: 1, conc: 1);

    var data = mlContext.Data.LoadFromTextFile<Iris>(GetDataPath(TestDatasets.iris.trainFilename),
        hasHeader: TestDatasets.iris.fileHasHeader,
        separatorChar: TestDatasets.iris.fileSeparator);

    // Create a training pipeline.
    var pipeline = mlContext.Transforms.Concatenate("Features", Iris.Features)
        .Append(StepOne(mlContext))
        .Append(StepTwo(mlContext));

    // Train the model.
    var model = pipeline.Fit(data);

    // Extract the trained models.
    var modelEnumerator = model.GetEnumerator();
    modelEnumerator.MoveNext(); // The Concat Transform
    modelEnumerator.MoveNext();
    var kMeansModel = (modelEnumerator.Current as TransformerChain<ClusteringPredictionTransformer<KMeansModelParameters>>).LastTransformer;
    modelEnumerator.MoveNext();
    var mcLrModel = (modelEnumerator.Current as TransformerChain<MulticlassPredictionTransformer<MulticlassLogisticRegressionModelParameters>>).LastTransformer;

    // Validate the k-means model.
    VBuffer<float>[] centroids = default;
    kMeansModel.Model.GetClusterCentroids(ref centroids, out int nCentroids);
    Assert.Equal(4, centroids.Length);

    // Validate the MulticlassLogisticRegressionModel.
    VBuffer<float>[] weights = default;
    mcLrModel.Model.GetWeights(ref weights, out int classes);
    Assert.Equal(3, weights.Length);
}

private IEstimator<TransformerChain<ClusteringPredictionTransformer<KMeansModelParameters>>> StepOne(MLContext mlContext)
{
    return mlContext.Transforms.Concatenate("LabelAndFeatures", "Label", "Features")
        .Append(mlContext.Clustering.Trainers.KMeans(
            new KMeansPlusPlusTrainer.Options
            {
                InitAlgorithm = KMeansPlusPlusTrainer.InitAlgorithm.Random,
                ClustersCount = 4,
                MaxIterations = 10,
                NumThreads = 1
            }));
}

private IEstimator<TransformerChain<MulticlassPredictionTransformer<MulticlassLogisticRegressionModelParameters>>> StepTwo(MLContext mlContext)
{
    return mlContext.Transforms.Conversion.MapValueToKey("Label")
        .Append(mlContext.MulticlassClassification.Trainers.StochasticDualCoordinateAscent(
        new SdcaMultiClassTrainer.Options {
            MaxIterations = 10,
            NumThreads = 1 }));
}

Now, this does happen to be a bit awkward.

  1. I can't make functions that return an IEstimator<TransformerChain<ITransformer>> for some reason, so I don't have a lot of flexibility in defining helper functions. Step one will always need to define a k-means trainer, and step two will always have to define a multi-class logistic regression model. I can't, say, ask for any clustering predictor or any multiclass model.
  2. I have to hand-enumerate through the EstimatorChain. I would prefer to write var kMeansModel = model[1] as ...; rather than manually iterate.
  3. If I want to build temporary structures in my methods, I can't remove them and still return the same contract. That is, I create a column LabelAndFeatures in StepOne, but if I add a DropColumn at the end, I will no longer return an IEstimator<TransformerChain<ClusteringPredictionTransformer<KMeansModelParameters>>>. That means I either dirty up my schema or the calling code has to introspect the schema and DropColumns or SelectColumns before serialization.

@TomFinley
Contributor

TomFinley commented Mar 6, 2019

Right, but your ability to know that it had the transforms in that order is precisely because we didn't rewrite it at will. We just did what you told us to do, so you were able to know. The only thing you have to do in order to enable your scenario is be explicit about what you want. If we wanted to write an enhancement in the future if flattening actually becomes a thing, that's fine, but the default behavior should be to allow nesting.

You can't have a TransformerChain<ITransformer> be implicitly assignable from a more specific transformer chain, since covariance is not a thing on abstract classes. (It is a thing on interfaces.) If we wanted to add explicit casts to make that possible in the future I wouldn't object. Or we could say that TransformerChain descends from a different non-abstract class. Either is fine. Again, I don't think that has anything whatsoever to do with nesting or not. I'm not saying you don't have a problem, I just think your solution is not necessarily what I would choose.
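The variance restriction mentioned here is a general C# rule: only interface and delegate type parameters can be declared covariant, so generic classes are always invariant. A minimal sketch, assuming hypothetical `ITransformer`, `KMeansTransformer`, `IReadOnlyChain<out T>`, and `Chain<T>` types that stand in for the real ones:

```csharp
using System;

// Hypothetical stand-ins, not ML.NET types.
public interface ITransformer { }

public sealed class KMeansTransformer : ITransformer { }

// Covariance ("out") can be declared on an interface type parameter.
public interface IReadOnlyChain<out T> where T : ITransformer
{
    T Last { get; }
}

// A generic class cannot declare variance, so Chain<T> is invariant in T.
public sealed class Chain<T> : IReadOnlyChain<T> where T : ITransformer
{
    public Chain(T last) => Last = last;

    public T Last { get; }
}

public static class Program
{
    public static void Main()
    {
        var specific = new Chain<KMeansTransformer>(new KMeansTransformer());

        // Compiles: the interface is covariant in T.
        IReadOnlyChain<ITransformer> viaInterface = specific;

        // Does not compile: there is no implicit conversion between
        // Chain<KMeansTransformer> and Chain<ITransformer>.
        // Chain<ITransformer> viaClass = specific;

        Console.WriteLine(viaInterface.Last.GetType().Name); // prints KMeansTransformer
    }
}
```

A helper that returned the covariant interface rather than the concrete chain class could therefore accept any more specific chain, at the cost of losing the static type of the last element.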

@rogancarr
Contributor Author

Yes, we should keep nesting in. The example I gave above was to show that it's useful. My worry here is that we have a capability, but we don't have a lot of good examples or scenarios on what it is for, so we don't know if it operates the way we want it to.

The typing/casting issues are hard; I'm not sure what to do about it. Maybe it's not necessary?

It would be nice, however, to allow a bit of encapsulation in nested pipelines (No 3. above), so that the nested TransformerChains can be independent. I'm not sure how that happens, though, but I know what I want.

Example: Imagine that I expect a k-means model out from you. I give you a schema in. I expect the k-means model back, and then I want to send this new IDV to a second pipeline.
Questions:

  1. Can the k-means pipeline step make temporary columns, train the model, and then remove them, and still fulfill the contract to make an IEstimator<TransformerChain<ClusteringPredictionTransformer<KMeansModelParameters>>>?
  2. Can I guarantee that my schema didn't change? Or do I need to do that manually?
  3. Can I send the output of the trained k-means model to the next pipeline step?

My initial thought was that this is what nested pipelines are for. However, it's already possible to do this in C# code manually using separate (non-nested) pipelines. If this is the preferred case, then what use does nesting provide?

As an aside, we can ignore No 2. above, as this is actually currently possible. You can use LINQ to do var modelComponents = model.ToList() and then you get a list of TransformerChain<ITransformer>. You can't use it for transformations at the top level anymore, but you can access elements more easily.

@TomFinley
Contributor

The typing/casting issues are hard; I'm not sure what to do about it. Maybe it's not necessary?

It is not necessary in the sense that our advice is, if you care about this, you should write your code in a bit of a different way.

The pipe as I said earlier is for those situations where you want to form a chain of estimators as a logical unit -- that is, situations where you don't care about what the individual parts are doing per se. As soon as you start to care, I'd argue that you shouldn't have put them in a pipe.

So, if you want something as a discrete unit, you accomplish that by just keeping it as a discrete unit. So if I had three estimators A, B, C, out of which I wanted to keep and maintain the resulting transformers X, Y, and Z, I would just fit and transform using the estimators and transformers directly. These chains are little more than helper methods to do that chained Fit and Transform for you, in situations where, again, you do not care.

This is just kind of "how things are" in statically typed languages. E.g., you have a List<A>, where A is some abstract base class, you can put specific instances of subclasses in, but if you want to operate on the specific subclasses, you either have to (1) cast out of indexing into the list or (2) not insert into the list in the first place, if you cared about keeping it as a logical discrete unit more specific than being of type A. I prefer the latter.

We did "cheat" a little bit insofar as we maintain in a strongly typed fashion the type of whatever the last estimator or transformer is, but as @Zruty0 has argued on multiple occasions perhaps that is not so useful.

If we really wanted to solve this problem of having pipelines and strong typing of the individual items, the only way I can figure how we'd do that is by adopting an approach similar to what is done with tuples and value tuples... then if we had a chain of Chain<TEst1, TEst2, TEst3>, the append would result in Chain<TEst1, TEst2, TEst3, TEst4>, and so on, and so on. But this would be going a bit nuts in my opinion.
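To make the tuple-like idea tangible, here is a rough sketch of what such a design could look like. Everything in it is hypothetical: the `Chain<T1, T2>` and `Chain<T1, T2, T3>` classes and the `Normalizer`, `Featurizer`, and `Trainer` placeholders are invented for illustration and do not correspond to any ML.NET type.

```csharp
using System;

// Hypothetical two-item chain that remembers the static type of each item.
public sealed class Chain<T1, T2>
{
    public Chain(T1 first, T2 second)
    {
        First = first;
        Second = second;
    }

    public T1 First { get; }
    public T2 Second { get; }

    // Appending produces a chain type of the next arity, as with value tuples.
    public Chain<T1, T2, T3> Append<T3>(T3 third) =>
        new Chain<T1, T2, T3>(First, Second, third);
}

// Hypothetical three-item chain; every further arity would need its own class.
public sealed class Chain<T1, T2, T3>
{
    public Chain(T1 first, T2 second, T3 third)
    {
        First = first;
        Second = second;
        Third = third;
    }

    public T1 First { get; }
    public T2 Second { get; }
    public T3 Third { get; }
}

// Placeholder item types for the demonstration.
public sealed class Normalizer { }
public sealed class Featurizer { }
public sealed class Trainer
{
    public string Name => "trainer";
}

public static class Program
{
    public static void Main()
    {
        var chain = new Chain<Normalizer, Featurizer>(new Normalizer(), new Featurizer())
            .Append(new Trainer());

        // Each position keeps its static type, so no cast is needed.
        Trainer trainer = chain.Third;
        Console.WriteLine(trainer.Name); // prints trainer
    }
}
```

The sketch also shows the cost: one class per arity and type names that grow with every appended item, which is presumably what "going a bit nuts" refers to.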

@mstfbl
Contributor

mstfbl commented Jan 9, 2020

Based on the conversation above, the consensus is to make no additional changes. Therefore I am closing this issue.

@mstfbl mstfbl closed this as completed Jan 9, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Mar 23, 2022
4 participants