Use custom BlockingQueue to significantly improve F5 perf with SDCA without caching #2595
When the StochasticDualCoordinateAscent trainer is used without AppendCacheCheckpoint, the algorithm makes many, many passes over the input data. Each of these passes currently results in a LineReader getting created, which in turn spins up a variety of constructs that incur overhead. That's something to be addressed in the TextLoader design in general, but the matter is made significantly worse when a debugger is attached, because several of the constructs that get spun up, namely threads and exceptions, are orders of magnitude more expensive when a debugger is involved.
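For context, here's a rough sketch of the kind of pipeline that hits this path. The file name, column layout, and exact API names are approximations on my part (the ML.NET surface area has been shifting across versions), so treat this as illustrative rather than as the demo's actual code:

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

public class IrisData
{
    [LoadColumn(0)] public float SepalLength;
    [LoadColumn(1)] public float SepalWidth;
    [LoadColumn(2)] public float PetalLength;
    [LoadColumn(3)] public float PetalWidth;
    [LoadColumn(4)] public string Label;
}

public static class Program
{
    public static void Main()
    {
        var mlContext = new MLContext();

        // Lazily-evaluated text data source: every pass the trainer makes
        // over the data re-reads the file through a fresh LineReader.
        IDataView data = mlContext.Data.LoadFromTextFile<IrisData>(
            "iris-data.txt", separatorChar: ',');

        var pipeline = mlContext.Transforms.Conversion.MapValueToKey("Label")
            .Append(mlContext.Transforms.Concatenate("Features",
                "SepalLength", "SepalWidth", "PetalLength", "PetalWidth"))
            // Uncommenting the next line caches the data after one read and
            // sidesteps the repeated LineReader creation described above.
            //.AppendCacheCheckpoint(mlContext)
            .Append(mlContext.MulticlassClassification.Trainers
                .StochasticDualCoordinateAscent());

        var model = pipeline.Fit(data);
    }
}
```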
I previously put up several PRs to help defray these costs:
These changes, in particular the first two, helped take the Iris multi-class classification demo (without using AppendCacheCheckpoint(mlContext)) from approximately "forever" when the debugger was attached down to ~230 seconds on my machine.
However, even after these changes there are still many exceptions being thrown in this situation. Whereas the aforementioned PRs helped remove overheads incurred directly by the ML libraries, there's a significant source of exceptions coming indirectly, via the use of BlockingCollection, due to a mismatch between how it's implemented and how TextLoader in ML.NET uses it.

BlockingCollection was implemented under the assumption that instances aren't frequently created and destroyed; as such, its CompleteAdding method was implemented on top of cancellation, such that if CompleteAdding is called while one or more threads are blocked waiting for data to arrive, internally an OperationCanceledException is thrown and subsequently eaten by those threads (https://github.com/dotnet/corefx/issues/34602). When there are just a few of these, it's not ideal but also not problematic.

However, ML.NET's usage in this situation tips the scales. In the Iris multi-class classification demo, for example, when caching isn't used (i.e. no call to .AppendCacheCheckpoint(mlContext)), tens of thousands of TextLoaders get created. Each of them creates a BlockingCollection, and each uses CompleteAdding to tear down communication between the single producer reading and publishing lines from the file and the one or more consumers taking those lines. The net result is thousands upon thousands of first-chance exceptions. The Iris demo app has a training file with only ~120 lines of data, and yet even after all of the aforementioned PRs, this demo app is incurring, on average on my machine, on the order of ~60,000 first-chance exceptions!
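The cancellation-based wakeup is easy to see in isolation. Here's a small standalone repro (not from this PR) that counts first-chance exceptions while repeatedly creating a BlockingCollection, blocking a consumer on it, and tearing it down with CompleteAdding:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public static class Program
{
    public static void Main()
    {
        // Count every first-chance exception raised in the process.
        int firstChance = 0;
        AppDomain.CurrentDomain.FirstChanceException +=
            (s, e) => Interlocked.Increment(ref firstChance);

        for (int i = 0; i < 1_000; i++)
        {
            using (var bc = new BlockingCollection<string>())
            {
                // Consumer blocks waiting for an item that never arrives.
                Task consumer = Task.Run(() =>
                {
                    foreach (string line in bc.GetConsumingEnumerable()) { }
                });

                // CompleteAdding wakes the blocked consumer by cancelling an
                // internal token; the resulting OperationCanceledException is
                // thrown and swallowed inside BlockingCollection, but still
                // counts (and costs) as a first-chance exception. Iterations
                // where the consumer hasn't started blocking yet may not
                // incur one, so the final count is approximate.
                bc.CompleteAdding();
                consumer.Wait();
            }
        }

        Console.WriteLine($"First-chance exceptions: {firstChance}");
    }
}
```

Run under a debugger, each of those internally swallowed OperationCanceledExceptions pays the full first-chance exception cost, which is where the wall-clock time goes.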
This results in numbers like the following for the Iris demo:
This PR addresses almost all of the remaining exceptions by replacing the usage of BlockingCollection with an alternative implementation that's not nearly as feature-rich but that doesn't rely on cancellation to implement CompleteAdding. This implementation can be used until either TextLoader is rearchitected to avoid all of these overheads in the first place or the implementation issue in BlockingCollection itself is addressed.
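The actual replacement is in this PR's diff; the following is only a minimal sketch of the core idea, assuming a Monitor-based queue (the real implementation may differ): CompleteAdding becomes a flag plus a pulse, so blocked producers and consumers are woken without any exception ever being thrown.

```csharp
using System.Collections.Generic;
using System.Threading;

// Minimal sketch (not the PR's actual code) of a bounded blocking queue
// whose CompleteAdding never throws or cancels anything to wake waiters.
internal sealed class SimpleBlockingQueue<T>
{
    private readonly Queue<T> _queue = new Queue<T>();
    private readonly int _boundedCapacity;
    private bool _addingCompleted;

    public SimpleBlockingQueue(int boundedCapacity) =>
        _boundedCapacity = boundedCapacity;

    public bool TryAdd(T item)
    {
        lock (_queue)
        {
            // Block while full, unless adding has been completed.
            while (_queue.Count >= _boundedCapacity)
            {
                if (_addingCompleted) return false;
                Monitor.Wait(_queue);
            }
            if (_addingCompleted) return false;
            _queue.Enqueue(item);
            Monitor.PulseAll(_queue); // wake any blocked consumers
            return true;
        }
    }

    public bool TryTake(out T item)
    {
        lock (_queue)
        {
            // Block while empty, until an item arrives or adding completes;
            // items enqueued before CompleteAdding are still drained.
            while (_queue.Count == 0)
            {
                if (_addingCompleted) { item = default(T); return false; }
                Monitor.Wait(_queue);
            }
            item = _queue.Dequeue();
            Monitor.PulseAll(_queue); // wake any blocked producers
            return true;
        }
    }

    public void CompleteAdding()
    {
        lock (_queue)
        {
            // Just a flag and a pulse: no cancellation, no exceptions.
            _addingCompleted = true;
            Monitor.PulseAll(_queue);
        }
    }
}
```

The key design point is that waiters always re-check their loop condition after waking, so a plain boolean flag and Monitor.PulseAll are sufficient to drain any remaining items to consumers and then return false, matching the shutdown semantics TextLoader needs from CompleteAdding.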
With this PR in place, the numbers on my machine drop to the following:
That's an ~33% improvement when the debugger isn't attached, and an ~7.5x improvement when it is.
cc: @CESARDELATORRE, @TomFinley, @eerhardt, @asthana86
Contributes to #2099