Description
It is possible to nest EstimatorChains inside one another, fit them, and use them to transform data. The result is an object that is a nested TransformerChain.
Question: Is this intended behavior? Do we want to allow this sort of nesting in the V1 API?
I think the proper way to handle nesting is to flatten the structure before the fit and return a single EstimatorChain. Since there is no forking or joining, I believe nested and non-nested pipelines are identical except for the type of the returned object. Data transformed by these objects should be the same whether or not the pipeline is nested (and it is, in my limited testing).
Take a look at the following example where we featurize the UCI Adult dataset.
var mlContext = new MLContext(seed: 1, conc: 1);
// Load the Adult (tiny) dataset
var data = mlContext.Data.LoadFromTextFile<Adult>(GetDataPath(TestDatasets.adult.trainFilename),
    hasHeader: TestDatasets.adult.fileHasHeader,
    separatorChar: TestDatasets.adult.fileSeparator);
// Create the learning pipeline
var pipeline = mlContext.Transforms.Concatenate("NumericalFeatures", Adult.NumericalFeatures)
    .Append(mlContext.Transforms.Concatenate("CategoricalFeatures", Adult.CategoricalFeatures))
    .Append(mlContext.Transforms.Categorical.OneHotHashEncoding("CategoricalFeatures",
        invertHash: 2, outputKind: OneHotEncodingTransformer.OutputKind.Bag))
    .Append(mlContext.Transforms.Concatenate("Features", "NumericalFeatures", "CategoricalFeatures"))
    .Append(mlContext.BinaryClassification.Trainers.LogisticRegression());
// Train the model.
var model = pipeline.Fit(data);
Here, pipeline is an EstimatorChain<BinaryPredictionTransformer<...>> and model is a TransformerChain<BinaryPredictionTransformer<...>>.
It's also possible to nest the pipeline. Perhaps you accidentally put an errant ) here and there, and then you have this:
// Create the learning pipeline
var pipeline = mlContext.Transforms.Concatenate("NumericalFeatures", Adult.NumericalFeatures)
    .Append(mlContext.Transforms.Concatenate("CategoricalFeatures", Adult.CategoricalFeatures))
    .Append(mlContext.Transforms.Categorical.OneHotHashEncoding("CategoricalFeatures",
        invertHash: 2, outputKind: OneHotEncodingTransformer.OutputKind.Bag) // <-- missing a )
        .Append(mlContext.Transforms.Concatenate("Features", "NumericalFeatures", "CategoricalFeatures"))
        .Append(mlContext.BinaryClassification.Trainers.LogisticRegression())); // <-- extra )
Now, pipeline is an EstimatorChain<EstimatorChain<BinaryPredictionTransformer<...>>> and model is a TransformerChain<TransformerChain<BinaryPredictionTransformer<...>>>.
Now, if I compare the two (where var predictor = model.LastTransformer and var nestedPredictor = nestedModel.LastTransformer.LastTransformer), it's clear that the models and the transformed data are identical:
// The bias terms match.
Assert.Equal(predictor.Model.SubModel.Bias, nestedPredictor.Model.SubModel.Bias);

// The weights match feature by feature.
int nFeatures = predictor.Model.SubModel.Weights.Count;
for (int i = 0; i < nFeatures; i++)
{
    Assert.Equal(predictor.Model.SubModel.Weights[i], nestedPredictor.Model.SubModel.Weights[i]);
}

// Assumption: nestedModel is the fitted nested pipeline, and both models score the same input data.
var transformedData = model.Transform(data);
var nestedTransformedData = nestedModel.Transform(data);

// The scores match row by row.
var transformedRows = mlContext.Data.CreateEnumerable<BinaryPrediction>(transformedData, false).ToArray();
var nestedTransformedRows = mlContext.Data.CreateEnumerable<BinaryPrediction>(nestedTransformedData, false).ToArray();
for (int i = 0; i < transformedRows.Length; i++)
{
    Assert.Equal(transformedRows[i].Score, nestedTransformedRows[i].Score);
}
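
To make the flattening proposal concrete, here is a minimal, self-contained sketch of the idea. It does not use the ML.NET types: IStep, FuncStep, Chain, and Flatten are hypothetical stand-ins invented for illustration, and a real implementation would have to deal with EstimatorChain's generic type parameter and schema propagation. The sketch only shows that, for a purely linear chain with no forking or joining, recursively inlining nested chains preserves the order of the steps and therefore the result.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for an estimator/transformer; NOT the ML.NET interface.
public interface IStep
{
    double Apply(double x);
}

// A leaf step that wraps a plain function.
public sealed class FuncStep : IStep
{
    private readonly Func<double, double> _f;
    public string Name { get; }

    public FuncStep(string name, Func<double, double> f)
    {
        Name = name;
        _f = f;
    }

    public double Apply(double x) => _f(x);
}

// A linear chain of steps. A chain is itself a step, so chains can nest.
public sealed class Chain : IStep
{
    public IReadOnlyList<IStep> Steps { get; }

    public Chain(params IStep[] steps)
    {
        Steps = steps;
    }

    // Returns a new chain with one more step at the end.
    public Chain Append(IStep next) => new Chain(Steps.Concat(new[] { next }).ToArray());

    // Runs the steps in order, feeding each output into the next step.
    public double Apply(double x) => Steps.Aggregate(x, (acc, s) => s.Apply(acc));

    // Returns an equivalent chain in which nested chains are inlined, order preserved.
    public Chain Flatten() => new Chain(FlattenSteps(this).ToArray());

    private static IEnumerable<IStep> FlattenSteps(IStep step)
    {
        if (step is Chain chain)
        {
            foreach (var child in chain.Steps)
                foreach (var leaf in FlattenSteps(child))
                    yield return leaf;
        }
        else
        {
            yield return step;
        }
    }
}

public static class Demo
{
    public static void Main()
    {
        var addOne = new FuncStep("addOne", x => x + 1);
        var timesTwo = new FuncStep("timesTwo", x => x * 2);
        var minusThree = new FuncStep("minusThree", x => x - 3);

        // Flat: three steps appended one after another.
        var flat = new Chain(addOne).Append(timesTwo).Append(minusThree);

        // Nested: the last two steps form an inner chain (the "errant parenthesis" case).
        var nested = new Chain(addOne).Append(new Chain(timesTwo).Append(minusThree));

        var flattened = nested.Flatten();

        Console.WriteLine(nested.Steps.Count);     // 2 (a leaf and an inner chain)
        Console.WriteLine(flattened.Steps.Count);  // 3 (all leaves)
        Console.WriteLine(flat.Apply(3.0) == nested.Apply(3.0));     // True
        Console.WriteLine(flat.Apply(3.0) == flattened.Apply(3.0));  // True
    }
}
```

If something like this were done inside Append or at the start of Fit, the user would get back the same flat type regardless of where the parentheses landed. Whether that is desirable for the V1 API is exactly the open question above.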