Debugging hanging AutoFitImageClassificationTrainTest #4893

Closed
wants to merge 28 commits into from

Conversation

mstfbl
Contributor

@mstfbl mstfbl commented Feb 26, 2020

Will be using this draft PR for general debugging purposes on CI

Notes:
Windows build machines have 7,168 MB of RAM.

@mstfbl
Contributor Author

mstfbl commented Feb 26, 2020

Each run of AutoFitImageClassificationTrainTest takes too long, so testing it with 1000 iterations isn't feasible.

@mstfbl
Contributor Author

mstfbl commented Feb 26, 2020

The tests AutoFitRecommendationTest and AutoFitRegressionTest are passing. AutoFitImageClassificationTrainTest fails intermittently.

@mstfbl
Contributor Author

mstfbl commented Feb 26, 2020

AutoFitImageClassificationTrainTest crashes because, after running var result = context.Auto().CreateMulticlassClassificationExperiment(0).Execute(trainDataset, testDataset, columnInference.ColumnInformation), result.BestRun is not always set. I confirmed this by catching result.BestRun with a null value right when it is accessed. Exact location where error is thrown.

@mstfbl
Contributor Author

mstfbl commented Feb 27, 2020

The original bug with AutoFitImageClassificationTrainTest occurs because of the null value returned here:

if (!results.Any()) { return null; }

For some reason, the validationMetrics of the runs in the IEnumerable<RunDetail> are sometimes null, so no run is left to pick as the best one.
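A minimal sketch of this failure mode, not the actual AutoML code: RunSketch, ValidationScore, GetBestRun and GetBestRunOrThrow are hypothetical stand-ins. It shows how a best-run selection that skips runs without metrics ends up returning null when every run failed, and how a caller could surface the underlying exception instead of hitting a null reference later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a run result; only the fields needed for the sketch.
public sealed class RunSketch
{
    public string TrainerName { get; set; }
    public double? ValidationScore { get; set; }   // null when the run produced no metrics
    public Exception Exception { get; set; }       // set when the run crashed
}

public static class BestRunSketch
{
    // Returns the highest-scoring run, or null when no run produced metrics.
    public static RunSketch GetBestRun(IEnumerable<RunSketch> results)
    {
        var scored = results.Where(r => r.ValidationScore.HasValue).ToList();
        if (!scored.Any()) { return null; }
        return scored.OrderByDescending(r => r.ValidationScore.Value).First();
    }

    // Caller-side guard (an assumption, not existing behavior): fail loudly
    // with the first recorded exception rather than returning null.
    public static RunSketch GetBestRunOrThrow(IReadOnlyList<RunSketch> results)
    {
        var best = GetBestRun(results);
        if (best == null)
        {
            var firstError = results.Select(r => r.Exception).FirstOrDefault(e => e != null);
            throw new InvalidOperationException("No run produced validation metrics.", firstError);
        }
        return best;
    }
}
```

With a guard like GetBestRunOrThrow, a test would report the original pipeline failure rather than a null BestRun.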

@mstfbl
Contributor Author

mstfbl commented Feb 28, 2020

There's an issue with this Evaluation function:

public MulticlassClassificationMetrics Evaluate(IDataView data, string label, string score, string predictedLabel)
{
    Host.CheckValue(data, nameof(data));
    Host.CheckNonEmpty(label, nameof(label));
    Host.CheckNonEmpty(score, nameof(score));
    Host.CheckNonEmpty(predictedLabel, nameof(predictedLabel));

    var roles = new RoleMappedData(data, opt: false,
        RoleMappedSchema.ColumnRole.Label.Bind(label),
        RoleMappedSchema.CreatePair(AnnotationUtils.Const.ScoreValueKind.Score, score),
        RoleMappedSchema.CreatePair(AnnotationUtils.Const.ScoreValueKind.PredictedLabel, predictedLabel));

    var resultDict = ((IEvaluator)this).Evaluate(roles);
    Host.Assert(resultDict.ContainsKey(MetricKinds.OverallMetrics));
    var overall = resultDict[MetricKinds.OverallMetrics];
    var confusionMatrix = resultDict[MetricKinds.ConfusionMatrix];

    MulticlassClassificationMetrics result;
    using (var cursor = overall.GetRowCursorForAllColumns())
    {
        // The overall metrics view is expected to contain exactly one row.
        var moved = cursor.MoveNext();
        Host.Assert(moved);
        result = new MulticlassClassificationMetrics(Host, cursor, _outputTopKAcc ?? 0, confusionMatrix);
        moved = cursor.MoveNext();
        Host.Assert(!moved);
    }
    return result;
}

The returned result value can sometimes be null, which is what causes AutoFitImageClassificationTrainTest to fail intermittently.

@mstfbl
Contributor Author

mstfbl commented Feb 28, 2020

I figured out the cause of the occasional crash of AutoFitImageClassificationTrainTest. When any exception occurs in RunnerUtil.TrainAndScorePipeline, it is caught and logged rather than rethrown, and a null metrics value (line 49) is passed back up the call stack.

try
{
    var estimator = pipeline.ToEstimator(trainData, validData);
    var model = estimator.Fit(trainData);
    var scoredData = model.Transform(validData);
    var metrics = metricsAgent.EvaluateMetrics(scoredData, labelColumn);
    var score = metricsAgent.GetScore(metrics);

    if (preprocessorTransform != null)
    {
        model = preprocessorTransform.Append(model);
    }

    // Build container for model
    var modelContainer = modelFileInfo == null ?
        new ModelContainer(context, model) :
        new ModelContainer(context, modelFileInfo, model, modelInputSchema);

    return (modelContainer, metrics, null, score);
}
catch (Exception ex)
{
    // The exception is logged and returned, not rethrown; metrics is null here.
    logger.Error($"Pipeline crashed: {pipeline.ToString()} . Exception: {ex}");
    return (null, null, ex, double.NaN);
}

This is the exception being caught:

System.ArgumentException : PIPELINE CRASHES - Line 55 - RunnerUtil.cs - Pipeline crash string: xf=ValueToKeyMapping{ col=Label:Label} xf=RawByteImageLoading{ col=ImagePath_featurized:ImagePath imageFolder=} xf=ColumnCopying{ col=Features:ImagePath_featurized} tr=ImageClassification{} xf=KeyToValueMapping{ col=PredictedLabel:PredictedLabel} cache=- - Exception String: System.FormatException: Tensorflow exception triggered while loading model. ---> System.Runtime.InteropServices.SEHException: External component has thrown an exception.

When reproduced locally, the exception string is:

"Could not find a part of the path 'C:\Users\mubal\AppData\Local\Temp\Microsoft.ML.AutoML\experiment_y1gbdyum.xdu\Model1.zip'."
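The swallowed-exception pattern above can be reduced to a small, self-contained sketch. All names here (SwallowedExceptionSketch, TrainAndScore, ScoreOrThrow) are hypothetical, and the training step is replaced by a delegate; the point is only that a caller receiving a (metrics, exception, score) tuple has to inspect the exception field, since null metrics and a NaN score are the only other signs of failure.

```csharp
using System;

public static class SwallowedExceptionSketch
{
    // Same shape as the snippet above: any failure becomes (null, ex, NaN)
    // instead of propagating to the caller.
    public static (string Metrics, Exception Exception, double Score) TrainAndScore(
        Func<double> trainAndEvaluate)
    {
        try
        {
            double score = trainAndEvaluate();
            return ("score=" + score, null, score);
        }
        catch (Exception ex)
        {
            return (null, ex, double.NaN);
        }
    }

    // Hypothetical caller-side check: rethrow with the original exception
    // attached so the root cause is not lost.
    public static double ScoreOrThrow(Func<double> trainAndEvaluate)
    {
        var (_, exception, score) = TrainAndScore(trainAndEvaluate);
        if (exception != null)
        {
            throw new InvalidOperationException(
                "Pipeline crashed during training or scoring.", exception);
        }
        return score;
    }
}
```

Whether to rethrow or to record the failure and continue with the next pipeline is a design choice; the sketch only illustrates that the decision has to be made explicitly by whoever consumes the tuple.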

@mstfbl
Contributor Author

mstfbl commented Mar 2, 2020

Update: AutoFitImageClassificationTrainTest with 100 iterations fails on Windows x64 builds with:

System.Runtime.InteropServices.SEHException (0x80004005): External component has thrown an exception

but also I get:
System.FormatException: Tensorflow exception triggered while loading model. ---> System.OutOfMemoryException
and:
System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.

All of these are raised in Tensorflow.c_api.TF_SessionRun, the call into TensorFlow's native library that runs the session during training. This seems related to issue SciSharp/TensorFlow.NET#485.

As mentioned in PR #4755, we still cannot see details about the crash in Tensorflow.c_api.TF_SessionRun.

@mstfbl mstfbl force-pushed the AutoFitTests-Debugging branch from 4520530 to 65a72ef Compare March 19, 2020 05:17
@mstfbl mstfbl closed this Mar 20, 2020
@mstfbl mstfbl changed the title Auto fit tests debugging Debugging PR Mar 22, 2020
@mstfbl mstfbl reopened this Mar 22, 2020
@mstfbl mstfbl closed this Mar 22, 2020
@mstfbl mstfbl force-pushed the AutoFitTests-Debugging branch from c1f8231 to c1e422d Compare March 22, 2020 08:39
@mstfbl mstfbl reopened this Mar 22, 2020
@mstfbl mstfbl changed the title Debugging PR Debugging hanging AutoFitImageClassificationTrainTest Mar 26, 2020
@mstfbl mstfbl force-pushed the AutoFitTests-Debugging branch from df5f642 to 5a7ad17 Compare March 26, 2020 03:32
@mstfbl
Contributor Author

mstfbl commented Mar 26, 2020

Will be using this PR to debug AutoFitImageClassificationTrainTest hanging occasionally on Windows builds.

@mstfbl
Contributor Author

mstfbl commented Mar 27, 2020

AutoFitImageClassificationTrainTest is still occasionally hanging, mostly due to TensorFlow objects that are left undisposed after the test completes. I found this comment in Microsoft.ML.Vision/ImageClassificationTrainer.TrainModelCore to be of interest:

// Leave the ownership of _session so that it is not disposed/closed when this object goes out of scope
// since it will be used by ImageClassificationModelParameters class (new owner that will take care of
// disposing).
var session = _session;
_session = null;
return new ImageClassificationModelParameters(Host, session, _classCount, _jpegDataTensorName,
    _resizedImageTensorName, _inputTensorName, _softmaxTensorName);
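The ownership hand-off described in that comment can be sketched as follows. FakeSession, ModelParametersSketch and TrainerSketch are hypothetical stand-ins, not the ML.NET types; the sketch assumes the session is a disposable wrapper around a native handle. It shows why the session stays alive after the trainer is disposed, and why it leaks if nobody ever disposes the returned model.

```csharp
using System;

// Hypothetical stand-in for a wrapper around a native session handle.
public sealed class FakeSession : IDisposable
{
    public bool IsDisposed { get; private set; }
    public void Dispose() => IsDisposed = true;
}

// New owner: responsible for disposing the session it receives.
public sealed class ModelParametersSketch : IDisposable
{
    public FakeSession Session { get; }
    public ModelParametersSketch(FakeSession session) => Session = session;
    public void Dispose() => Session.Dispose();
}

public sealed class TrainerSketch : IDisposable
{
    private FakeSession _session = new FakeSession();

    public ModelParametersSketch Train()
    {
        // Hand the session to the model and clear the local reference,
        // so that disposing the trainer does not close it.
        var session = _session;
        _session = null;
        return new ModelParametersSketch(session);
    }

    // After Train() the field is null, so this is a no-op.
    public void Dispose() => _session?.Dispose();
}
```

In this arrangement the only way to release the session is to dispose the returned model, which is consistent with the hang being traced to undisposed objects.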

@mstfbl
Contributor Author

mstfbl commented Mar 27, 2020

Adding the fix (model as IDisposable)?.Dispose(); to free the Tensor objects worked! This fix is necessary because these Tensor objects are allocated by the native TensorFlow C library, so the .NET garbage collector does not clean them up automatically.

Edit: While this fix works, it is not safe to assume that this model can be disposed in RunnerUtil.cs. The user might be accessing this model during disposal, which would result in use-after-free and/or null reference errors.
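A small sketch of the (model as IDisposable)?.Dispose() idiom, using hypothetical types (IModelSketch, NativeBackedModel, ManagedOnlyModel). It assumes the model is held through an interface that does not itself extend IDisposable, which is why the cast is needed: the call disposes models that own native resources and is a no-op for those that do not.

```csharp
using System;

// Hypothetical interface standing in for a model type that does not extend IDisposable.
public interface IModelSketch { }

// A model that owns native resources and therefore implements IDisposable.
public sealed class NativeBackedModel : IModelSketch, IDisposable
{
    public bool IsDisposed { get; private set; }
    public void Dispose() => IsDisposed = true;   // real code would free native handles here
}

// A purely managed model with nothing to release.
public sealed class ManagedOnlyModel : IModelSketch { }

public static class DisposeSketch
{
    public static void Release(IModelSketch model)
    {
        // "as" yields null when the model is not IDisposable (or is null itself),
        // and "?." then skips the call.
        (model as IDisposable)?.Dispose();
    }
}
```

The idiom says nothing about who else still holds the model, which is the safety concern raised in the edit above.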

@mstfbl mstfbl force-pushed the AutoFitTests-Debugging branch from 78a65c2 to 087c0d5 Compare April 16, 2020 05:13
@dotnet dotnet deleted a comment from azure-pipelines bot Apr 16, 2020
@mstfbl
Contributor Author

mstfbl commented Apr 16, 2020

Freeing the model's Tensor objects in a finally block in TrainAndScorePipeline fixes the memory bug, and it is safe to do when the model is never kept in memory and is always written to disk.
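A minimal sketch of that ordering, under the stated assumption that the model is always persisted to disk before anyone else can reach it. DisposableModelSketch and TrainAndSave are hypothetical names, and serialization is reduced to a fixed byte array. Only a FileInfo leaves the method, so disposing in finally cannot invalidate an object the caller still holds.

```csharp
using System;
using System.IO;

// Hypothetical model that owns native resources.
public sealed class DisposableModelSketch : IDisposable
{
    public bool IsDisposed { get; private set; }

    public byte[] Serialize()
    {
        if (IsDisposed) throw new ObjectDisposedException(nameof(DisposableModelSketch));
        return new byte[] { 1, 2, 3 };   // placeholder payload
    }

    public void Dispose() => IsDisposed = true;
}

public static class TrainAndSaveSketch
{
    public static FileInfo TrainAndSave(Func<DisposableModelSketch> train, string path)
    {
        DisposableModelSketch model = null;
        try
        {
            model = train();
            // Persist first; the caller later reloads the model from this file.
            File.WriteAllBytes(path, model.Serialize());
            return new FileInfo(path);
        }
        finally
        {
            // Runs on success and on failure; model is null if train() threw.
            model?.Dispose();
        }
    }
}
```

If the model were instead returned in memory, the same finally block would hand the caller an already-disposed object, which is the use-after-free risk noted earlier in the thread.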

@mstfbl mstfbl closed this Apr 26, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Mar 19, 2022