SdcaMaximumEntropy trainer goes into an infinite loop if it takes already transformed data view as an input #4926
Comments
@najeeb-kazmi: @artemiusgreat I couldn't reproduce this error with simple enumerable data containing only a label and a numeric feature vector. I used a simple […]. That said, one thing I see potentially wrong with your code is that in the last line you have […]. Admittedly, this should throw a schema mismatch error at the first step, saying the column […].
@artemiusgreat: @najeeb-kazmi yes, you're right about the exception being thrown, but the line selecting only the Label and Features columns is irrelevant to the issue and can be commented out for now. I created a demo project that doesn't reproduce the excessive resource consumption, but it demonstrates how drastically execution time can increase simply by separating the data preparation pipeline from the trainer. At least, that is the only difference I can see between the two stopwatches.
@najeeb-kazmi: @artemiusgreat I tried to make the comparison as apples-to-apples as possible. The difference is primarily due to […].
@artemiusgreat: The thing is, in the real project the method creating the pipeline is the same for all trainers, and only the SdcaMaximumEntropy trainer goes wild.

Slow code:

```csharp
public static void CreateSlowModel(IDataView baseView)
{
    // If data pipeline creation is moved to a separate method, training becomes slow;
    // replace this line with the body of GetPipeline and the slow model becomes fast.
    var dataPipeline = GetPipeline();
    var trainer = GetEstimator();
    dataPipeline.Append(trainer).Fit(baseView);
}
```
```csharp
public static IEstimator<ITransformer> GetPipeline()
{
    return mlContext
        .Transforms
        .Conversion
        .MapValueToKey("Label", "Strategy")
        .Append(mlContext.Transforms.Concatenate("Combination", Selection.ToArray()))
        .Append(mlContext.Transforms.NormalizeMinMax(new[] { new InputOutputColumnPair("Features", "Combination") }))
        .AppendCacheCheckpoint(mlContext);
}
```

Results: […]
Now merge the two methods into one:

```csharp
public static void CreateSlowModel(IDataView baseView)
{
    var dataPipeline = mlContext
        .Transforms
        .Conversion
        .MapValueToKey("Label", "Strategy")
        .Append(mlContext.Transforms.Concatenate("Combination", Selection.ToArray()))
        .Append(mlContext.Transforms.NormalizeMinMax(new[] { new InputOutputColumnPair("Features", "Combination") }))
        .AppendCacheCheckpoint(mlContext);
    var trainer = GetEstimator();
    dataPipeline.Append(trainer).Fit(baseView);
}
```

Results: […]
@najeeb-kazmi: @artemiusgreat This is happening because […].

Here, `start` is `dataPipeline`, which is an `EstimatorChain<ITransformer>`, and `estimator` is `trainer`, which is also an `EstimatorChain<ITransformer>`. This method then calls `.Append` twice on an empty `EstimatorChain<ITransformer>`, using this method: machinelearning/src/Microsoft.ML.Data/DataLoadSave/EstimatorChain.cs, lines 87 to 88 in 290da82.
This returns a new `EstimatorChain<ITransformer>` whose non-public field `_estimators` is an `IEstimator<ITransformer>[]` of length 2, with both elements of type `EstimatorChain<ITransformer>`: the first being `dataPipeline` and the second being `trainer`. At the same time, the corresponding non-public field `_needCacheAfter` is a `bool[]` of length 2 with both elements `false`. This is where you are losing the caching.
On the other hand, when you create the […]: machinelearning/src/Microsoft.ML.Data/DataLoadSave/EstimatorChain.cs, lines 87 to 88 in 290da82.
Here, `trainer`, which is an `EstimatorChain<ITransformer>`, gets appended to the `_estimators` field of `dataPipeline`. What you get is an `EstimatorChain<ITransformer>` whose `_estimators` is an `IEstimator<ITransformer>[]` of length 4: the first three are the three estimators in `dataPipeline`, and the last is the trainer, itself an `EstimatorChain<ITransformer>`. The corresponding `_needCacheAfter` is a `bool[]` of length 4 with the third element (corresponding to the normalizer, i.e. the step right before the trainer) being `true` and the rest `false`. This is why you get caching when you do it like this.
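The flag-loss described above can be illustrated with a small model. This is a hedged, simplified sketch in Python, not ML.NET's actual C# implementation: the class, method, and field names (`EstimatorChain`, `append`, `need_cache_after`) merely mirror the internals discussed in this thread, and the append semantics are assumed to work as described above.

```python
class EstimatorChain:
    """Toy model of ML.NET's EstimatorChain<ITransformer> (illustrative only)."""

    def __init__(self, estimators=None, need_cache_after=None):
        self.estimators = list(estimators or [])
        self.need_cache_after = list(need_cache_after or [])

    def append(self, estimator):
        # Appending adds exactly ONE entry, whether `estimator` is a single
        # transform or itself a whole chain; an inner chain's own
        # need_cache_after flags are NOT merged into the outer array.
        return EstimatorChain(self.estimators + [estimator],
                              self.need_cache_after + [False])

    def append_cache_checkpoint(self):
        # Marks the last step so data is cached before the next step runs.
        flags = self.need_cache_after[:]
        if flags:
            flags[-1] = True
        return EstimatorChain(self.estimators, flags)


# Data pipeline: three steps plus a cache checkpoint at the end.
data_pipeline = (EstimatorChain()
                 .append("MapValueToKey")
                 .append("Concatenate")
                 .append("NormalizeMinMax")
                 .append_cache_checkpoint())
trainer = EstimatorChain().append("SdcaMaximumEntropy")

# Slow pattern: both chains get wrapped as single elements of a fresh outer
# chain, so the inner checkpoint flag never reaches the outer array.
slow = EstimatorChain().append(data_pipeline).append(trainer)
print(slow.need_cache_after)   # [False, False] -> no caching before training

# Fast pattern: the trainer is appended directly onto data_pipeline, so the
# checkpoint flag on the normalizer (third step) survives.
fast = data_pipeline.append(trainer)
print(fast.need_cache_after)   # [False, False, True, False] -> cached
```

In the model, the "slow" outer chain ends up with a `need_cache_after` of `[False, False]`, while the "fast" flattened chain keeps `True` on the step right before the trainer, matching the length-2 and length-4 arrays described above.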
So, coming to your problem, you can do one of the following: […]
@artemiusgreat: Thank you for investigating this case.

Final code: added a cache checkpoint in the method that combines the data pipeline with a trainer.

```csharp
public void GetPredictor(IEnumerable<string> columns, IDataView inputs)
{
    var estimator = GetEstimator();
    var pipeline = GetPipeline(columns);
    var estimatorModel = pipeline.AppendCacheCheckpoint(Context).Append(estimator).Fit(inputs);
}
```
System information: […]

Issue

What I did: […]

What happened:
If I execute the pipeline once, e.g. load from enumerables into a data view and then execute the entire transformation chain, including the transformations and the trainer, everything works fine.
If I execute the pipeline twice, the first time separately and then as part of the entire transformation chain, it consumes 3 GB of the 16 GB of RAM available, then training hangs indefinitely and never ends.
I fixed this temporarily by changing the MaximumNumberOfIterations option, but I'm not sure it's a good idea...

What I expected:
I expect training to stop eventually, no matter how many times I execute the pipeline. Check the comment on the last line in the code below.
Source code:
The source code is taken from issue #4903.