Debugging hanging AutoFitImageClassificationTrainTest #4893

mstfbl · 2020-02-26T18:59:11Z

Will be using this draft PR for general debugging purposes on CI

Notes:
Windows builds have 7,168 MB of RAM

mstfbl · 2020-02-26T19:22:24Z

Each run of AutoFitImageClassificationTrainTest takes too long, so testing it with 1000 iterations isn't feasible.

mstfbl · 2020-02-26T23:18:05Z

The tests AutoFitRecommendationTest and AutoFitRegressionTest are passing. AutoFitImageClassificationTrainTest fails intermittently.

mstfbl · 2020-02-26T23:38:05Z

AutoFitImageClassificationTrainTest crashes because, after running var result = context.Auto().CreateMulticlassClassificationExperiment(0).Execute(trainDataset, testDataset, columnInference.ColumnInformation), result.BestRun is not always set. I confirmed this by catching result.BestRun with a null value at the point where it is accessed. Exact location where error is thrown.

mstfbl · 2020-02-27T23:13:42Z

The original bug with AutoFitImageClassificationTrainTest occurs because of the null value returned here:

machinelearning/src/Microsoft.ML.AutoML/Utils/BestResultUtil.cs

Line 61 in f0a8a76

if (!results.Any()) { return null; }

For some reason, the validation metrics of the RunDetail entries in the IEnumerable<RunDetail> are sometimes null.
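
To make the failure mode concrete, here is a minimal sketch of how a run with null validation metrics can end up as a null best run. It assumes (this is not shown in the quoted line) that runs without metrics are filtered out before the emptiness check. RunSketch and GetBestRun are hypothetical stand-ins, not the AutoML types.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for AutoML's RunDetail; only the fields needed here.
public sealed class RunSketch
{
    public string TrainerName { get; set; }
    public double? ValidationScore { get; set; } // null models a run whose metrics are null
}

public static class BestRunSketch
{
    // Returns null when no run has usable metrics (assumed filtering step, see lead-in).
    public static RunSketch GetBestRun(IEnumerable<RunSketch> results)
    {
        var usable = results.Where(r => r.ValidationScore.HasValue).ToList();
        if (!usable.Any()) { return null; }
        return usable.OrderByDescending(r => r.ValidationScore.Value).First();
    }

    public static void Main()
    {
        var runs = new List<RunSketch>
        {
            new RunSketch { TrainerName = "ImageClassification", ValidationScore = null }
        };
        var best = GetBestRun(runs);
        // Guard before dereferencing; best.TrainerName would throw NullReferenceException here.
        Console.WriteLine(best == null ? "No run produced metrics" : best.TrainerName);
    }
}
```

A test that reads BestRun without such a guard would crash only on the runs where every pipeline failed to produce metrics, which matches the intermittent behavior.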

mstfbl · 2020-02-28T00:38:44Z

There's an issue with this Evaluation function:

machinelearning/src/Microsoft.ML.Data/Evaluators/MulticlassClassificationEvaluator.cs

Lines 506 to 535 in f0a8a76

    
        public MulticlassClassificationMetrics Evaluate(IDataView data, string label, string score, string predictedLabel)
        {
            Host.CheckValue(data, nameof(data));
            Host.CheckNonEmpty(label, nameof(label));
            Host.CheckNonEmpty(score, nameof(score));
            Host.CheckNonEmpty(predictedLabel, nameof(predictedLabel));
            var roles = new RoleMappedData(data, opt: false,
                RoleMappedSchema.ColumnRole.Label.Bind(label),
                RoleMappedSchema.CreatePair(AnnotationUtils.Const.ScoreValueKind.Score, score),
                RoleMappedSchema.CreatePair(AnnotationUtils.Const.ScoreValueKind.PredictedLabel, predictedLabel));
            var resultDict = ((IEvaluator)this).Evaluate(roles);
            Host.Assert(resultDict.ContainsKey(MetricKinds.OverallMetrics));
            var overall = resultDict[MetricKinds.OverallMetrics];
            var confusionMatrix = resultDict[MetricKinds.ConfusionMatrix];
            MulticlassClassificationMetrics result;
            using (var cursor = overall.GetRowCursorForAllColumns())
            {
                var moved = cursor.MoveNext();
                Host.Assert(moved);
                result = new MulticlassClassificationMetrics(Host, cursor, _outputTopKAcc ?? 0, confusionMatrix);
                moved = cursor.MoveNext();
                Host.Assert(!moved);
            }
            return result;
        }
    }

The returned result value can sometimes be null, which is what is causing AutoFitImageClassificationTrainTest to sometimes fail.

mstfbl · 2020-02-28T06:24:14Z

I figured out the cause of the occasional crash of AutoFitImageClassificationTrainTest. When any exception occurs in RunnerUtil.TrainAndScorePipeline, it is caught and logged rather than rethrown, and a null metrics value (line 49) is returned up the call stack.

machinelearning/src/Microsoft.ML.AutoML/Experiment/Runners/RunnerUtil.cs

Lines 25 to 49 in f0a8a76

    
    try
    {
        var estimator = pipeline.ToEstimator(trainData, validData);
        var model = estimator.Fit(trainData);
        var scoredData = model.Transform(validData);
        var metrics = metricsAgent.EvaluateMetrics(scoredData, labelColumn);
        var score = metricsAgent.GetScore(metrics);
        if (preprocessorTransform != null)
        {
            model = preprocessorTransform.Append(model);
        }
        // Build container for model
        var modelContainer = modelFileInfo == null ?
            new ModelContainer(context, model) :
            new ModelContainer(context, modelFileInfo, model, modelInputSchema);
        return (modelContainer, metrics, null, score);
    }
    catch (Exception ex)
    {
        logger.Error($"Pipeline crashed: {pipeline.ToString()} . Exception: {ex}");
        return (null, null, ex, double.NaN);
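
The catch-and-return shape in the excerpt above means a caller only learns about the failure if it inspects the exception slot of the returned tuple. A minimal sketch of that contract follows; TrainAndScore is a hypothetical stand-in, with a string in place of the real metrics type.

```csharp
using System;

public static class SwallowedExceptionSketch
{
    // Hypothetical stand-in for TrainAndScorePipeline: same "catch and return" shape.
    public static (string Metrics, Exception Error, double Score) TrainAndScore(Func<string> train)
    {
        try
        {
            var metrics = train();
            return (metrics, null, 1.0);
        }
        catch (Exception ex)
        {
            // The failure is recorded in the tuple instead of propagating.
            return (null, ex, double.NaN);
        }
    }

    public static void Main()
    {
        var result = TrainAndScore(() => throw new FormatException("model failed to load"));

        // Check the error slot before touching Metrics, which is null on failure.
        if (result.Error != null)
        {
            Console.WriteLine($"Run failed: {result.Error.Message}");
            return;
        }
        Console.WriteLine(result.Metrics.Length);
    }
}
```

Any consumer that skips the Error check and goes straight to Metrics sees only a null, far from where the real exception happened.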

This is the exception being caught:

System.ArgumentException : PIPELINE CRASHES - Line 55 - RunnerUtil.cs - Pipeline crash string: xf=ValueToKeyMapping{ col=Label:Label} xf=RawByteImageLoading{ col=ImagePath_featurized:ImagePath imageFolder=} xf=ColumnCopying{ col=Features:ImagePath_featurized} tr=ImageClassification{} xf=KeyToValueMapping{ col=PredictedLabel:PredictedLabel} cache=- - Exception String: System.FormatException: Tensorflow exception triggered while loading model. ---> System.Runtime.InteropServices.SEHException: External component has thrown an exception.

When reproduced locally, the exception string is:

"Could not find a part of the path 'C:\Users\mubal\AppData\Local\Temp\Microsoft.ML.AutoML\experiment_y1gbdyum.xdu\Model1.zip'."

mstfbl · 2020-03-02T08:27:32Z

Update: AutoFitImageClassificationTrainTest with 100 iterations fails on Windows x64 builds with:

System.Runtime.InteropServices.SEHException (0x80004005): External component has thrown an exception

but also I get:
System.FormatException: Tensorflow exception triggered while loading model. ---> System.OutOfMemoryException
and:
System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.

in Tensorflow.c_api.TF_SessionRun, the TensorFlow.NET binding to the native TensorFlow C API function that runs a session (here, the training step). This seems related to issue SciSharp/TensorFlow.NET#485.

As mentioned in PR #4755, we still cannot see details about the crash in Tensorflow.c_api.TF_SessionRun.

mstfbl · 2020-03-26T03:32:54Z

Will be using this PR to debug AutoFitImageClassificationTrainTest hanging occasionally on Windows builds.

mstfbl · 2020-03-27T04:04:34Z

AutoFitImageClassificationTrainTest is still occasionally hanging, mostly due to undisposed TensorFlow objects left over after the test completes. I found this comment in Microsoft.ML.Vision/ImageClassificationTrainer.TrainModelCore to be of interest:

machinelearning/src/Microsoft.ML.Vision/ImageClassificationTrainer.cs

Lines 714 to 721 in 5d531d3

    
    // Leave the ownership of _session so that it is not disposed/closed when this object goes out of scope
    // since it will be used by ImageClassificationModelParameters class (new owner that will take care of
    // disposing).
    var session = _session;
    _session = null;
    return new ImageClassificationModelParameters(Host, session, _classCount, _jpegDataTensorName,
        _resizedImageTensorName, _inputTensorName, _softmaxTensorName);

mstfbl · 2020-03-27T18:21:32Z

Adding the fix (model as IDisposable)?.Dispose(); for freeing Tensor objects worked! This fix is necessary, as these Tensor objects made in the C TensorFlow libraries are not automatically cleaned up by C#'s Garbage Collector.

Edit: While this fix works, it is not safe to assume that this model can be disposed in RunnerUtil.cs. The user might be accessing this model during disposal, which would result in use-after-free and/or null reference errors.

mstfbl · 2020-04-16T09:34:35Z

Freeing the model's Tensor objects in a finally block in TrainAndScorePipeline fixes the memory bug, and it is safe to do as long as the model is always written to disk and never kept in memory.
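
As a rough illustration of that pattern (not the actual RunnerUtil.cs change), the sketch below disposes a model in a finally block after it has been persisted. NativeBackedModel and TrainSaveAndRelease are hypothetical names, and the native resource is simulated with a flag.

```csharp
using System;

// Hypothetical model wrapping a native resource that the GC will not release on its own.
public sealed class NativeBackedModel : IDisposable
{
    public bool IsDisposed { get; private set; }

    // A real model would free its native tensors here.
    public void Dispose() { IsDisposed = true; }
}

public static class DisposeInFinallySketch
{
    // Trains, persists, then always releases the model, even if training or saving throws.
    public static bool TrainSaveAndRelease(Func<object> train, Action<object> saveToDisk)
    {
        object model = null;
        try
        {
            model = train();
            saveToDisk(model);
            return true;
        }
        catch (Exception)
        {
            return false;
        }
        finally
        {
            // Same shape as the fix: dispose only if the model is IDisposable.
            (model as IDisposable)?.Dispose();
        }
    }

    public static void Main()
    {
        var model = new NativeBackedModel();
        TrainSaveAndRelease(() => model, _ => { });
        Console.WriteLine(model.IsDisposed);
    }
}
```

This is only safe because nothing holds on to the in-memory model afterwards; callers are expected to reload it from disk when they need it.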

mstfbl force-pushed the AutoFitTests-Debugging branch from 4520530 to 65a72ef Compare March 19, 2020 05:17

mstfbl mentioned this pull request Mar 19, 2020

Added working version of checking whether file is available for access #4938

Merged

mstfbl closed this Mar 20, 2020

mstfbl changed the title ~~Auto fit tests debugging~~ Debugging PR Mar 22, 2020

mstfbl reopened this Mar 22, 2020

mstfbl closed this Mar 22, 2020

mstfbl force-pushed the AutoFitTests-Debugging branch from c1f8231 to c1e422d Compare March 22, 2020 08:39

mstfbl reopened this Mar 22, 2020

mstfbl changed the title ~~Debugging PR~~ Debugging hanging AutoFitImageClassificationTrainTest Mar 26, 2020

mstfbl force-pushed the AutoFitTests-Debugging branch from df5f642 to 5a7ad17 Compare March 26, 2020 03:32

mstfbl and others added 10 commits March 27, 2020 13:28

Free Tensor objects in finally statement

f4a910e

Update RunnerUtil.cs

4f8475c

Re-enable AutoFitImageClassificationTrainTest after fix

f7e1337

Added IDisposable support to ModelContainer & corrected name of model…

83ad312

…Container in used cases

Corrected name of modelContainer in used cases

b366fbe

Clean up Tensor objects through finalizer/destructor of ImageClassifi…

cb22b21

…cationModelParameters

Dispose ExperimentResult objects at the end

eefa76f

Dispose only Tensor objects in models

45681b4

Don't free BestModel models

fbd3fd9

Merge remote-tracking branch 'upstream/master' into AutoFitImageClass…

2816ced

…ificationTrainTest-memoryFix

mstfbl added 6 commits April 13, 2020 17:29

Throw Exception if model is trying to be accessed after disposal

7dad242

Initialize IsModelDisposed inside constructors

1488d0c

Model always written to disk, no longer stored in memory, simplify mo…

78bba9c

…del disposal

Model always written to disk, no longer stored in memory, simplify mo…

bf84823

…del disposal

Merge branch 'AutoFitImageClassificationTrainTest-memoryFix' of https…

15b6135

…://github.com/mstfbl/machinelearning into AutoFitImageClassificationTrainTest-memoryFix

Update ModelContainer.cs

087c0d5

mstfbl force-pushed the AutoFitTests-Debugging branch from 78a65c2 to 087c0d5 Compare April 16, 2020 05:13

mstfbl added 7 commits April 15, 2020 22:23

Run AutoFitImageClassificationTrainTest 100 times with latest update

d633c5c

Restart build

bb4a8b2

Test latest changes

43f5f2a

Merge branch 'AutoFitTests-Debugging' of https://github.com/mstfbl/ma…

0350890

…chinelearning into AutoFitTests-Debugging

Update .vsts-dotnet-ci.yml

f9adbef

Dispose of models using "using", and free model after saving to disk

de8a567

Dispose model in RunnerUtil.cs

eec4c42

dotnet deleted a comment from azure-pipelines bot Apr 16, 2020

mstfbl added 5 commits April 16, 2020 02:53

Test directly disposing models when models can still be in memory

dadc173

Test directly disposing models when models can still be in memory - 2

e0864a3

Add test case for .DisposeRunDetails and get memory info

675ee13

get memory info

146bc40

Get memory info in Windows and UNIX builds

1e3e985

mstfbl closed this Apr 26, 2020

ghost locked as resolved and limited conversation to collaborators Mar 19, 2022
