Training time finished without any models trained. #596

Closed
infiniteloopltd opened this issue Mar 20, 2020 · 7 comments

@infiniteloopltd

System Information (please complete the following information):

  • Model Builder Version: 1.4 - Not sure
  • Visual Studio Version: 2017 Professional

Describe the bug
Creating a classification model on 6 million rows, suggested time 1800 seconds.

To Reproduce
Can't share data, sorry.

Expected behavior
Model to be created

Screenshots
Training time finished without any models trained.

at Microsoft.ML.ModelBuilder.AutoMLService.Experiments.AutoMLExperiment`3.d__23.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.ML.ModelBuilder.AutoMLEngine.d__28.MoveNext()

@infiniteloopltd

Same error after 4000 seconds (>1hr)

The Output pane doesn't give much more detail:

| Trainer MicroAccuracy MacroAccuracy Duration #Iteration |
| Trainer MicroAccuracy MacroAccuracy Duration #Iteration |
| Trainer MicroAccuracy MacroAccuracy Duration #Iteration |

@infiniteloopltd

But it worked in 100 seconds on 10,000 rows, so, extrapolating linearly, it would take about 17 hours to train on 6 million rows.

The issue here is simply that the suggested training time in Model Builder is a large underestimate.

@infiniteloopltd

If anyone at Microsoft is reading these issues, I'd love to put forward a suggestion:

If someone has entered a training time that turns out to be insufficient to train a model, it would be better to offer a "Resume" option instead of crashing out.

On large data sets, it could be hours before a model can be trained.
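
Until a resume option exists, one possible workaround (not from this thread; the file path, column names, and the 6-hour budget are assumptions) is to call the Microsoft.ML.AutoML API that Model Builder wraps and give it an explicit, generous time budget:

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;

// Hypothetical input schema; adjust the columns to your dataset.
public class ModelInput
{
    [LoadColumn(0)] public string Text { get; set; }
    [LoadColumn(1)] public string Label { get; set; }
}

public static class Program
{
    public static void Main()
    {
        var mlContext = new MLContext();

        // Load the training data (path and separator are assumptions).
        IDataView trainData = mlContext.Data.LoadFromTextFile<ModelInput>(
            "training.tsv", separatorChar: '\t', hasHeader: true);

        // Give AutoML an explicit, generous budget instead of the suggested time.
        var settings = new MulticlassExperimentSettings
        {
            MaxExperimentTimeInSeconds = 6 * 60 * 60  // e.g. 6 hours
        };

        var experiment = mlContext.Auto()
            .CreateMulticlassClassificationExperiment(settings);

        // Runs until the time budget is exhausted and returns the best model found.
        var result = experiment.Execute(trainData, labelColumnName: "Label");

        Console.WriteLine($"Best trainer: {result.BestRun.TrainerName}, " +
                          $"MicroAccuracy: {result.BestRun.ValidationMetrics.MicroAccuracy:F4}");
    }
}
```

Model Builder itself also accepts a larger value in the "Time to train" box; the sketch above is just the programmatic equivalent.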

@LittleLittleCloud commented Mar 25, 2020

Sorry for the late reply.
How many columns does your dataset have, and does it include any text fields (and perhaps also, what is the hardware configuration of your PC)? It usually shouldn't take that long to train on a 10,000-row file.

@LittleLittleCloud added this to the April 2020 milestone on Mar 25, 2020
@infiniteloopltd

Hi @LittleLittleCloud - Nice cat :)

The training data was text based, and the PC is a few years old.

My only suggestion is that, instead of this error appearing, Model Builder could offer to continue training the model for x more minutes, rather than crashing and making the user restart the training process.

I feel people would prefer the model to complete training, rather than see an error message, even if the training is going to take longer than expected.

I've also seen cases where the training was cancelled without me pressing the "Cancel" button; I'm not sure how that happens, but it is equally annoying.

@justinormont commented Mar 26, 2020

@infiniteloopltd : What's the size of your dataset in MB? What is your task? (regression/classification)

If classification, how many classes do you have? The number of classes has a large effect on runtime, as for most of our trainers, it multiplies the amount of work needed.
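
To make the scaling concrete: several of the trainers AutoML tries are one-versus-all reductions, which fit one binary model per class, so 200 classes mean roughly 200 binary fits per candidate pipeline. A minimal sketch (the column names "Text" and "Label" are assumptions):

```csharp
using Microsoft.ML;

var mlContext = new MLContext();

// One-versus-all trains a separate binary classifier for every class,
// so runtime grows roughly linearly with the number of classes.
var pipeline = mlContext.Transforms.Text.FeaturizeText("Features", "Text")
    .Append(mlContext.Transforms.Conversion.MapValueToKey("Label"))
    .Append(mlContext.MulticlassClassification.Trainers.OneVersusAll(
        mlContext.BinaryClassification.Trainers.AveragedPerceptron()));
```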

Text datasets are expected to take longer than categorical or numeric.
Generally runtimes are: (slow) Images >> Text >> Categorical >> Numeric (fast)

As an example runtime: creating the first model on a 5.5 GB text dataset with 19M rows and 200 classes took me 34 hours. Most of the runtime is due to the high number of classes (200). This was run on an old but large machine (circa 2013, 12-core/24-thread, 256 GB RAM).

On that run, 77GB of RAM was used (as dataset caching was enabled, otherwise ~0GB). If you go beyond physical RAM into virtual memory, it will be quite slow due to thrashing as the featurized dataset streams line-by-line from the cache which is now swapped out of RAM onto disk.
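
For reference, in the underlying AutoML.NET API that caching behaviour is, as far as I can tell, controlled per experiment through the settings object; a hedged sketch, with the time value purely illustrative:

```csharp
using Microsoft.ML.AutoML;

var settings = new MulticlassExperimentSettings
{
    MaxExperimentTimeInSeconds = 4000,
    // Turn off pre-trainer caching of the featurized data when it would not
    // fit in physical RAM; trainers then stream the data instead.
    CacheBeforeTrainer = CacheBeforeTrainer.Off
};
```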

Suggestions for Model Builder:

  • We should increase the recommended runtime by ~10x; it's far too small. Users should be aiming to create ~120-150 models, which is when the Bayesian hyperparameter optimization has had enough time to run.
  • Another route is doing tiny test runs and extrapolating to the full runtime: run on 10 lines of data, then 100, then 1000, and from that extrapolate to the time needed for ~120-150 models (a rough sketch follows this list). In the meantime, the GUI could display a non-blocking spinner saying "analyzing dataset" where the "recommended time" currently resides.
  • Implement the resume training ability in AutoML.NET -- related issues:
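
A rough sketch of that extrapolation idea, assuming timings from a few tiny runs are available (all the numbers below are made up for illustration):

```csharp
using System;
using System.Linq;

// Time AutoML on small samples, fit time ≈ a * rows^b in log-log space,
// then extrapolate to the full dataset and to the ~120-150 models the
// budget should cover.
class RuntimeEstimator
{
    static void Main()
    {
        long[] rows = { 10, 100, 1000 };
        double[] secondsPerModel = { 0.5, 1.2, 6.0 };   // measured in the tiny runs (illustrative)

        // Least-squares fit of log(t) = log(a) + b * log(n).
        double[] x = rows.Select(n => Math.Log(n)).ToArray();
        double[] y = secondsPerModel.Select(t => Math.Log(t)).ToArray();
        double xMean = x.Average(), yMean = y.Average();
        double b = x.Zip(y, (xi, yi) => (xi - xMean) * (yi - yMean)).Sum()
                 / x.Select(xi => (xi - xMean) * (xi - xMean)).Sum();
        double logA = yMean - b * xMean;

        long fullRows = 6_000_000;
        double secondsForOneModel = Math.Exp(logA + b * Math.Log(fullRows));
        double recommendedSeconds = secondsForOneModel * 135;   // aim for ~120-150 models

        Console.WriteLine($"Estimated time for one model: {secondsForOneModel:F0} s");
        Console.WriteLine($"Recommended experiment time:  {recommendedSeconds:F0} s");
    }
}
```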

@LittleLittleCloud

@infiniteloopltd I'm going to close this issue to clean up the portal. Should you have any questions, feel free to re-open it.
