Training time finished without any models trained. #596

Closed
infiniteloopltd opened this issue Mar 20, 2020 · 7 comments

@infiniteloopltd

System Information (please complete the following information):

  • Model Builder Version: 1.4 - Not sure
  • Visual Studio Version: 2017 Professional

Describe the bug
Creating a classification model on 6 million rows, suggested time 1800 seconds.

To Reproduce
Can't share data, sorry.

Expected behavior
Model to be created

Screenshots
Training time finished without any models trained.

at Microsoft.ML.ModelBuilder.AutoMLService.Experiments.AutoMLExperiment`3.d__23.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.ML.ModelBuilder.AutoMLEngine.d__28.MoveNext()

@infiniteloopltd

Same error after 4000 seconds (>1hr)

The Output pane doesn't give much more detail:

| Trainer MicroAccuracy MacroAccuracy Duration #Iteration |
| Trainer MicroAccuracy MacroAccuracy Duration #Iteration |
| Trainer MicroAccuracy MacroAccuracy Duration #Iteration |

@infiniteloopltd

But it worked in 100 seconds on 10,000 rows, so, extrapolating linearly, it would take about 17 hours to train on 6 million rows.

The issue here is simply that the suggested training time in Model Builder is a large underestimate.

@infiniteloopltd

If anyone at Microsoft is reading these issues, I'd love to put forward a suggestion:

If someone has entered a training time that turns out to be insufficient to train a model, it would be better to offer a "Resume" option instead of crashing out.

On large data sets, it could be hours before a model can be trained.
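
Until a resume option exists, one possible workaround (not from this thread; the file path, column names, and the 6-hour budget are assumptions) is to call the Microsoft.ML.AutoML API that Model Builder wraps and give it an explicit, generous time budget:

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;

// Hypothetical input schema; adjust the columns to your dataset.
public class ModelInput
{
    [LoadColumn(0)] public string Text { get; set; }
    [LoadColumn(1)] public string Label { get; set; }
}

public static class Program
{
    public static void Main()
    {
        var mlContext = new MLContext();

        // Load the training data (path and separator are assumptions).
        IDataView trainData = mlContext.Data.LoadFromTextFile<ModelInput>(
            "training.tsv", separatorChar: '\t', hasHeader: true);

        // Give AutoML an explicit, generous budget instead of the suggested time.
        var settings = new MulticlassExperimentSettings
        {
            MaxExperimentTimeInSeconds = 6 * 60 * 60  // e.g. 6 hours
        };

        var experiment = mlContext.Auto()
            .CreateMulticlassClassificationExperiment(settings);

        // Runs until the time budget is exhausted and returns the best model found.
        var result = experiment.Execute(trainData, labelColumnName: "Label");

        Console.WriteLine($"Best trainer: {result.BestRun.TrainerName}, " +
                          $"MicroAccuracy: {result.BestRun.ValidationMetrics.MicroAccuracy:F4}");
    }
}
```

Model Builder itself also accepts a larger value in the "Time to train" box; the sketch above is just the programmatic equivalent.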

@LittleLittleCloud commented Mar 25, 2020

Sorry for the late reply.
How many columns does your dataset have, and does it include any text fields (and perhaps also, what is the hardware configuration of your PC)? It usually shouldn't take that long to train on a 10,000-row file.

@LittleLittleCloud added this to the April 2020 milestone on Mar 25, 2020
@infiniteloopltd

Hi @LittleLittleCloud - Nice cat :)

The training data was text based, and the PC is a few years old.

My only suggestion is that, instead of this error appearing, Model Builder could offer to continue training the model for x more minutes, rather than crashing and making the user restart the training process.

I feel people would prefer the model to complete training, rather than see an error message, even if the training is going to take longer than expected.

I've also seen cases where the training was cancelled without me pressing the "Cancel" button; I'm not sure how that happens, but it is equally annoying.

@justinormont commented Mar 26, 2020

@infiniteloopltd : What's the size of your dataset in MB? What is your task? (regression/classification)

If classification, how many classes do you have? The number of classes has a large effect on runtime, as for most of our trainers, it multiplies the amount of work needed.
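
To make the scaling concrete: several of the trainers AutoML tries are one-versus-all reductions, which fit one binary model per class, so 200 classes mean roughly 200 binary fits per candidate pipeline. A minimal sketch (the column names "Text" and "Label" are assumptions):

```csharp
using Microsoft.ML;

var mlContext = new MLContext();

// One-versus-all trains a separate binary classifier for every class,
// so runtime grows roughly linearly with the number of classes.
var pipeline = mlContext.Transforms.Text.FeaturizeText("Features", "Text")
    .Append(mlContext.Transforms.Conversion.MapValueToKey("Label"))
    .Append(mlContext.MulticlassClassification.Trainers.OneVersusAll(
        mlContext.BinaryClassification.Trainers.AveragedPerceptron()));
```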

Text datasets are expected to take longer than categorical or numeric.
Generally runtimes are: (slow) Images >> Text >> Categorical >> Numeric (fast)

As an example runtime: creating the first model on a 5.5 GB text dataset with 19M rows and 200 classes took me 34 hours. Most of the runtime is due to the high number of classes (200). This was run on an old but large machine (circa 2013, 12-core/24-thread, 256 GB RAM).

On that run, 77GB of RAM was used (as dataset caching was enabled, otherwise ~0GB). If you go beyond physical RAM into virtual memory, it will be quite slow due to thrashing as the featurized dataset streams line-by-line from the cache which is now swapped out of RAM onto disk.
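
For reference, in the underlying AutoML.NET API that caching behaviour is, as far as I can tell, controlled per experiment through the settings object; a hedged sketch, with the time value purely illustrative:

```csharp
using Microsoft.ML.AutoML;

var settings = new MulticlassExperimentSettings
{
    MaxExperimentTimeInSeconds = 4000,
    // Turn off pre-trainer caching of the featurized data when it would not
    // fit in physical RAM; trainers then stream the data instead.
    CacheBeforeTrainer = CacheBeforeTrainer.Off
};
```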

Suggestions for Model Builder:

  • We should increase the recommended runtime by ~10x; it's far too small. Users should be aiming to create ~120-150 models, which is when the Bayesian hyperparameter optimization has had enough time to run.
  • Another route is doing tiny test runs and extrapolating to the full runtime: run on 10 lines of data, then 100, then 1000, and from that extrapolate to the time needed for ~120-150 models (a rough sketch follows this list). In the meantime, the GUI could display a non-blocking spinner saying "analyzing dataset" where the "recommended time" currently resides.
  • Implement the resume training ability in AutoML.NET -- related issues:
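
A rough sketch of that extrapolation idea, assuming timings from a few tiny runs are available (all the numbers below are made up for illustration):

```csharp
using System;
using System.Linq;

// Time AutoML on small samples, fit time ≈ a * rows^b in log-log space,
// then extrapolate to the full dataset and to the ~120-150 models the
// budget should cover.
class RuntimeEstimator
{
    static void Main()
    {
        long[] rows = { 10, 100, 1000 };
        double[] secondsPerModel = { 0.5, 1.2, 6.0 };   // measured in the tiny runs (illustrative)

        // Least-squares fit of log(t) = log(a) + b * log(n).
        double[] x = rows.Select(n => Math.Log(n)).ToArray();
        double[] y = secondsPerModel.Select(t => Math.Log(t)).ToArray();
        double xMean = x.Average(), yMean = y.Average();
        double b = x.Zip(y, (xi, yi) => (xi - xMean) * (yi - yMean)).Sum()
                 / x.Select(xi => (xi - xMean) * (xi - xMean)).Sum();
        double logA = yMean - b * xMean;

        long fullRows = 6_000_000;
        double secondsForOneModel = Math.Exp(logA + b * Math.Log(fullRows));
        double recommendedSeconds = secondsForOneModel * 135;   // aim for ~120-150 models

        Console.WriteLine($"Estimated time for one model: {secondsForOneModel:F0} s");
        Console.WriteLine($"Recommended experiment time:  {recommendedSeconds:F0} s");
    }
}
```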

@LittleLittleCloud

@infiniteloopltd I'm going to close this issue to clean up the portal. Should you have any questions, feel free to re-open it.
