Chains of Chains #2820

Closed
@rogancarr

Description

It is possible to nest EstimatorChains inside one another, fit them, and use them to transform data. The result is an object that is a nested TransformerChain.

Question: Is this intended behavior? Do we want to allow this sort of nesting in the V1 API?

I think the proper way to handle nesting is to flatten the structure before the fit and return a single EstimatorChain. Because there is no forking or joining, I believe that nested and non-nested pipelines are identical except for the type of the returned object. Data transformed by these objects should be the same whether or not the pipeline is nested (and it is, in my limited testing).
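To make the flattening idea concrete, here is a minimal sketch. It does not use the ML.NET types: `IStep`, `Step`, `Chain`, and `ChainFlattening` are hypothetical stand-ins invented for this illustration, and it assumes only that a chain is an ordered list of steps in which a step may itself be a chain.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for illustration only; these are NOT the ML.NET types.
public interface IStep
{
    string Name { get; }
}

// A leaf step, e.g. a single featurizer or trainer.
public sealed class Step : IStep
{
    public string Name { get; }
    public Step(string name) => Name = name;
}

// A chain is itself a step, which is what makes nesting possible.
public sealed class Chain : IStep
{
    public IReadOnlyList<IStep> Steps { get; }
    public string Name => "Chain";

    public Chain(params IStep[] steps) => Steps = steps;

    // Appending returns a new chain; if 'next' is a chain, it is stored nested as-is.
    public Chain Append(IStep next) => new Chain(Steps.Concat(new[] { next }).ToArray());
}

public static class ChainFlattening
{
    // Depth-first walk that yields the leaf steps in execution order.
    public static IEnumerable<IStep> Leaves(IStep step)
    {
        if (step is Chain chain)
        {
            foreach (var child in chain.Steps)
                foreach (var leaf in Leaves(child))
                    yield return leaf;
        }
        else
        {
            yield return step;
        }
    }

    // Builds a single-level chain containing the same leaves in the same order.
    public static Chain Flatten(Chain chain) => new Chain(Leaves(chain).ToArray());
}
```

With these stand-ins, appending a two-step chain to another two-step chain gives a chain of three steps (two leaves plus one nested chain), and `ChainFlattening.Flatten` turns it into a chain of four leaf steps in the original order. Because the order of the leaves is preserved and there is no forking or joining, the flattened chain should apply the same sequence of operations as the nested one.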

Take a look at the following example where we featurize the UCI Adult dataset.

var mlContext = new MLContext(seed: 1, conc: 1);

// Load the Adult (tiny) dataset
var data = mlContext.Data.LoadFromTextFile<Adult>(GetDataPath(TestDatasets.adult.trainFilename),
    hasHeader: TestDatasets.adult.fileHasHeader,
    separatorChar: TestDatasets.adult.fileSeparator);

// Create the learning pipeline
var pipeline = mlContext.Transforms.Concatenate("NumericalFeatures", Adult.NumericalFeatures)
    .Append(mlContext.Transforms.Concatenate("CategoricalFeatures", Adult.CategoricalFeatures))
    .Append(mlContext.Transforms.Categorical.OneHotHashEncoding("CategoricalFeatures",
        invertHash: 2, outputKind: OneHotEncodingTransformer.OutputKind.Bag))
    .Append(mlContext.Transforms.Concatenate("Features", "NumericalFeatures", "CategoricalFeatures"))
    .Append(mlContext.BinaryClassification.Trainers.LogisticRegression());

// Train the model.
var model = pipeline.Fit(data);

Here, pipeline is an EstimatorChain<BinaryPredictionTransformer<...>> and model is a TransformerChain<BinaryPredictionTransformer<...>>.

It's also possible to nest the pipeline. Perhaps you accidentally put an errant ) here and there, and then you have this:

// Create the learning pipeline
var pipeline = mlContext.Transforms.Concatenate("NumericalFeatures", Adult.NumericalFeatures)
    .Append(mlContext.Transforms.Concatenate("CategoricalFeatures", Adult.CategoricalFeatures))
    .Append(mlContext.Transforms.Categorical.OneHotHashEncoding("CategoricalFeatures",
        invertHash: 2, outputKind: OneHotEncodingTransformer.OutputKind.Bag) // <-- missing a )
    .Append(mlContext.Transforms.Concatenate("Features", "NumericalFeatures", "CategoricalFeatures"))
    .Append(mlContext.BinaryClassification.Trainers.LogisticRegression())); // <-- extra )

Now, pipeline is an EstimatorChain<EstimatorChain<BinaryPredictionTransformer<...>>> and model is a TransformerChain<TransformerChain<BinaryPredictionTransformer<...>>>.

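Now I can compare the two. Here model and nestedModel are the fitted flat and nested pipelines, var predictor = model.LastTransformer, var nestedPredictor = nestedModel.LastTransformer.LastTransformer, and transformedData and nestedTransformedData are the outputs of transforming the same data with each model. It's clear that the models and the transformed data are identical: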

// True!
Assert.Equal(predictor.Model.SubModel.Bias, nestedPredictor.Model.SubModel.Bias);
int nFeatures = predictor.Model.SubModel.Weights.Count;
for (int i = 0; i < nFeatures; i++)
{
    // True!
    Assert.Equal(predictor.Model.SubModel.Weights[i], nestedPredictor.Model.SubModel.Weights[i]);
}

var transformedRows = mlContext.Data.CreateEnumerable<BinaryPrediction>(transformedData, false).ToArray();
var nestedTransformedRows = mlContext.Data.CreateEnumerable<BinaryPrediction>(nestedTransformedData, false).ToArray();
for (int i = 0; i < transformedRows.Length; i++)
{
    // True!
    Assert.Equal(transformedRows[i].Score, nestedTransformedRows[i].Score);
}

Labels: API, question
