Debugging hanging AutoFitImageClassificationTrainTest #4893
Testing of
The tests AutoFitRecommendationTest and AutoFitRegressionTest are passing. AutoFitImageClassificationTrainTest fails intermittently.
The reason why AutoFitImageClassificationTrainTest is crashing is after running
The original bug with
For some reason, the validationMetrics of an IEnumerable<(RunDetail) is sometimes null.
There's an issue with this code: machinelearning/src/Microsoft.ML.Data/Evaluators/MulticlassClassificationEvaluator.cs, lines 506 to 535 in f0a8a76
The returned
I figured out the cause of the occasional crash of AutoFitImageClassificationTrainTest. When any exception occurs in RunnerUtil.TrainAndScorePipeline, it is not rethrown. It is caught and ignored, and a null metrics value (line 49) is sent up through the call stack instead.
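To make the failure mode concrete, here is a minimal C# sketch of a runner that catches every exception and returns null metrics, plus a caller that checks for null before using them. All names here (`SketchMetrics`, `RunnerSketch`, `TrainAndScore`, `AccuracyOrThrow`) are hypothetical stand-ins, not the actual ML.NET AutoML types or the code in RunnerUtil.cs, and returning the exception next to the metrics is an assumption made for illustration.

```csharp
#nullable enable
using System;

// Hypothetical stand-in for an evaluation result; not an ML.NET type.
public sealed class SketchMetrics
{
    public double Accuracy { get; set; }
}

public static class RunnerSketch
{
    // Any exception thrown by the training delegate is caught here.
    // The caller receives null metrics plus the captured exception
    // (returning the exception is an assumption of this sketch).
    public static (SketchMetrics? Metrics, Exception? Error) TrainAndScore(
        Func<SketchMetrics> trainAndEvaluate)
    {
        try
        {
            return (trainAndEvaluate(), null);
        }
        catch (Exception ex)
        {
            return (null, ex);
        }
    }

    // A caller that surfaces the failure instead of dereferencing null.
    public static double AccuracyOrThrow(
        (SketchMetrics? Metrics, Exception? Error) result)
    {
        if (result.Metrics == null)
        {
            throw new InvalidOperationException(
                "Training run failed; no validation metrics were produced.",
                result.Error);
        }
        return result.Metrics.Accuracy;
    }
}
```

A caller that skips the null check would hit a NullReferenceException far from the original failure, which is consistent with the intermittent crash described above.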
This is the exception being caught:
When reproduced locally, the exception string is:
Update: AutoFitImageClassificationTrainTest with 100 iterations fails on Windows x64 builds with:
but I also get: in
As mentioned in PR #4755, we still cannot see details about the crash in
Will be using this PR to debug AutoFitImageClassificationTrainTest hanging occasionally on Windows builds.
machinelearning/src/Microsoft.ML.Vision/ImageClassificationTrainer.cs Lines 714 to 721 in 5d531d3
Adding the fix
Edit: While this fix works, it is not safe to assume that this model can be disposed in RunnerUtil.cs. The user might be accessing this model during disposal, which would result in use-after-free and/or null reference errors.
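The ownership concern can be illustrated with a small C# sketch. `NativeModel` and `OwnershipSketch` are hypothetical names, not ML.NET types, and the disposed flag merely stands in for freeing native memory. The sketch assumes the runner hands the model back to the caller, so disposing it inside the runner leaves the caller holding an unusable object.

```csharp
using System;

// Hypothetical wrapper around a native resource; not an ML.NET type.
public sealed class NativeModel : IDisposable
{
    private bool _disposed;

    public int Predict(int input)
    {
        // Guard against use after disposal.
        if (_disposed) throw new ObjectDisposedException(nameof(NativeModel));
        return input * 2; // placeholder for real inference
    }

    // A real wrapper would release native memory here.
    public void Dispose() => _disposed = true;
}

public static class OwnershipSketch
{
    // Unsafe: the runner disposes a model that it also returns to the caller.
    public static NativeModel TrainAndDisposeEarly()
    {
        var model = new NativeModel();
        model.Dispose();
        return model;
    }

    // Safer: the runner returns the model and the caller owns its lifetime.
    public static NativeModel TrainAndReturn() => new NativeModel();
}
```

With the safer variant, the caller can scope the lifetime explicitly, for example `using (var model = OwnershipSketch.TrainAndReturn()) { model.Predict(3); }`, so the resource is released only once the caller is done with it.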
…Container in used cases
…cationModelParameters
…ificationTrainTest-memoryFix
…://github.com/mstfbl/machinelearning into AutoFitImageClassificationTrainTest-memoryFix
…chinelearning into AutoFitTests-Debugging
Freeing Tensor objects in
Will be using this draft PR for general debugging purposes on CI
Notes:
Windows builds have 7,168 MB of RAM