
Use custom BlockingQueue to significantly improve F5 perf with SDCA without caching #2595


Merged: 1 commit merged into dotnet:master on Feb 19, 2019

Conversation

stephentoub (Member) commented:

When the StochasticDualCoordinateAscent trainer is used without AppendCacheCheckpoint, the algorithm makes many, many passes over the input data. Each of these passes currently results in a LineReader getting created, which in turn spins up a variety of constructs that incur overhead. That’s something to be addressed in the TextLoader design in general, but the matter is made significantly worse when a debugger is attached, because several of the constructs that get spun up, namely threads and exceptions, are orders of magnitude more expensive when a debugger is involved.
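For reference, this is roughly the pipeline shape under discussion; a hedged sketch against the ML.NET API of this era, where `mlContext` is an `MLContext` and the column names are illustrative, not code from this repo:

```csharp
var pipeline = mlContext.Transforms.Conversion.MapValueToKey("Label")
    .Append(mlContext.Transforms.Concatenate("Features",
        "SepalLength", "SepalWidth", "PetalLength", "PetalWidth"))
    // Uncommenting the next line caches the loaded data, so SDCA's many
    // passes don't each re-read the text file through a fresh LineReader:
    //.AppendCacheCheckpoint(mlContext)
    .Append(mlContext.MulticlassClassification.Trainers.StochasticDualCoordinateAscent());
```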

I previously put up several PRs to help defray these costs:

These changes, in particular the first two, helped take the Iris multi-class classification demo (without using AppendCacheCheckpoint(mlContext)) from approximately “forever” when the debugger was attached down to ~230 seconds on my machine.

However, even after these changes there are still many exceptions being thrown in this situation. Whereas the PRs mentioned above removed overheads incurred directly by the ML libraries, there’s a significant source of exceptions coming indirectly, via use of BlockingCollection, due to a mismatch between how it’s implemented and how TextLoader in ML.NET uses it.

BlockingCollection was implemented under the assumption that instances aren’t frequently created and destroyed; as such, its CompleteAdding method was implemented on top of cancellation, such that if CompleteAdding is called while one or more threads are blocked waiting for data to arrive, internally an OperationCanceledException is thrown and subsequently eaten by those threads (https://github.com/dotnet/corefx/issues/34602). When there are just a few of these, it’s not ideal but also not problematic. ML.NET’s usage in this situation, however, tips the scales.

In the Iris multi-class classification demo, for example, when caching isn’t used (i.e. no call to .AppendCacheCheckpoint(mlContext)), tens of thousands of TextLoaders get created. Each creates a BlockingCollection, and each uses CompleteAdding to tear down communication between the single producer reading and publishing lines from the file and the one or more consumers taking those lines. The net result is thousands upon thousands of 1st-chance exceptions. The Iris demo app has a training file with only ~120 lines of data, and yet even after all of the aforementioned PRs, this demo app incurs on average, on my machine, on the order of ~60,000 1st-chance exceptions!
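To make the failure mode concrete, here is a minimal sketch (not code from this repo) of the producer/consumer pattern described above. With a debugger attached, every CompleteAdding that races with a blocked consumer surfaces internally as a first-chance OperationCanceledException:

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

class Repro
{
    static void Main()
    {
        for (int i = 0; i < 15_000; i++) // roughly the scale of TextLoader churn described above
        {
            using (var lines = new BlockingCollection<string>())
            {
                Task consumer = Task.Run(() =>
                {
                    foreach (string line in lines.GetConsumingEnumerable())
                    {
                        // process the line
                    }
                });

                lines.Add("some line of input");
                // CompleteAdding is built on cancellation: if the consumer is
                // blocked waiting for data, an OperationCanceledException is
                // thrown and eaten internally, showing up as a 1st-chance
                // exception under a debugger.
                lines.CompleteAdding();
                consumer.Wait();
            }
        }
    }
}
```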

This results in numbers like the following for the Iris demo:

  • With debugger attached: “Training time: 228 seconds”
  • Without debugger attached: “Training time: 8 seconds”

This PR addresses almost all of the remaining exceptions by replacing the usage of BlockingCollection with an alternative implementation that’s not nearly as feature-rich but that doesn’t rely on cancellation to implement CompleteAdding. This implementation can be used until either TextLoader is rearchitected to avoid all of these overheads in the first place or this implementation issue in BlockingCollection itself is addressed.
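As a rough illustration of the approach (a sketch of the idea, not the exact implementation merged here), a minimal blocking queue can coordinate with Monitor.Wait/Monitor.PulseAll so that CompleteAdding wakes blocked consumers without any exception machinery:

```csharp
using System.Collections.Generic;
using System.Threading;

// A minimal blocking queue whose CompleteAdding wakes blocked consumers via
// Monitor.PulseAll rather than cancellation, so no exceptions are thrown or
// eaten during teardown.
internal sealed class SimpleBlockingQueue<T>
{
    private readonly Queue<T> _queue = new Queue<T>();
    private bool _completed;

    public void Add(T item)
    {
        lock (_queue)
        {
            _queue.Enqueue(item);
            Monitor.Pulse(_queue); // wake one blocked consumer, if any
        }
    }

    public void CompleteAdding()
    {
        lock (_queue)
        {
            _completed = true;
            Monitor.PulseAll(_queue); // wake every blocked consumer; no cancellation involved
        }
    }

    // Blocks until an item is available, or returns false once adding is
    // complete and the queue has been drained.
    public bool TryTake(out T item)
    {
        lock (_queue)
        {
            while (_queue.Count == 0)
            {
                if (_completed)
                {
                    item = default(T);
                    return false;
                }
                Monitor.Wait(_queue); // releases the lock and blocks until pulsed
            }
            item = _queue.Dequeue();
            return true;
        }
    }
}
```

The trade-off is deliberate: no bounding, no cancellation tokens, and no multi-collection support, just the Add/TryTake/CompleteAdding surface that this single-producer handoff actually needs.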

With this PR in place, the numbers on my machine drop to the following:

  • With debugger attached: “Training time: 30 seconds”
  • Without debugger attached: “Training time: 5 seconds”

That’s a ~38% improvement when the debugger isn’t attached (8 seconds down to 5), but a ~7.5x improvement when the debugger is attached.

cc: @CESARDELATORRE, @TomFinley, @eerhardt, @asthana86
Contributes to #2099

codecov bot commented Feb 18, 2019:

Codecov Report

❗ No coverage uploaded for pull request base (master@e3b5f76).
The diff coverage is 95.74%.

@@            Coverage Diff            @@
##             master    #2595   +/-   ##
=========================================
  Coverage          ?   71.52%           
=========================================
  Files             ?      802           
  Lines             ?   142125           
  Branches          ?    16159           
=========================================
  Hits              ?   101651           
  Misses            ?    36000           
  Partials          ?     4474
Flag          Coverage Δ
#Debug        71.52% <95.74%> (?)
#production   67.81% <95.74%> (?)
#test         85.57% <ø> (?)

TomFinley (Contributor) commented:

Hmmm. I wasn't aware BlockingCollection was so problematic.

TomFinley (Contributor) left a review comment:

Thanks @stephentoub for looking at this! I appreciate your expertise. We may want to adapt it in other situations; we use BlockingCollection as a mechanism to pass work among threads in multiple places.

TomFinley added the “perf (Performance and Benchmarking related)” label on Feb 19, 2019
stephentoub (Member, Author) commented:

> I wasn't aware BlockingCollection was so problematic.

It shouldn't be; this is the first time I've seen this particular issue, mainly because it's the first time I've seen an app create 15,000 BlockingCollections :-)

eerhardt (Member) left a review comment:

:shipit:

TomFinley merged commit f9d5ec2 into dotnet:master on Feb 19, 2019
stephentoub deleted the blockingcollection branch on February 19, 2019 17:06
ghost locked the conversation as resolved and limited it to collaborators on Mar 24, 2022