Use custom BlockingQueue to significantly improve F5 perf with SDCA without caching #2595
When the StochasticDualCoordinateAscent trainer is used without AppendCacheCheckpoint, the algorithm makes many, many passes over the input data. Each of these passes currently results in a LineReader getting created, which in turn spins up a variety of constructs that incur overhead. That's something to be addressed in the TextLoader design in general, but the matter is made significantly worse when a debugger is attached, because several of the constructs that get spun up, namely threads and exceptions, are orders of magnitude more expensive when a debugger is involved.
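For context, here's a rough sketch of the kind of pipeline that hits this path. The file name, column layout, and exact API names are approximations on my part (the ML.NET surface area has been shifting across versions), so treat this as illustrative rather than as the demo's actual code:

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

public class IrisData
{
    [LoadColumn(0)] public float SepalLength;
    [LoadColumn(1)] public float SepalWidth;
    [LoadColumn(2)] public float PetalLength;
    [LoadColumn(3)] public float PetalWidth;
    [LoadColumn(4)] public string Label;
}

public static class Program
{
    public static void Main()
    {
        var mlContext = new MLContext();

        // Lazily-evaluated text data source: every pass the trainer makes
        // over the data re-reads the file through a fresh LineReader.
        IDataView data = mlContext.Data.LoadFromTextFile<IrisData>(
            "iris-data.txt", separatorChar: ',');

        var pipeline = mlContext.Transforms.Conversion.MapValueToKey("Label")
            .Append(mlContext.Transforms.Concatenate("Features",
                "SepalLength", "SepalWidth", "PetalLength", "PetalWidth"))
            // Uncommenting the next line caches the data after one read and
            // sidesteps the repeated LineReader creation described above.
            //.AppendCacheCheckpoint(mlContext)
            .Append(mlContext.MulticlassClassification.Trainers
                .StochasticDualCoordinateAscent());

        var model = pipeline.Fit(data);
    }
}
```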
I previously put up several PRs to help defray these costs:
These changes, in particular the first two, helped take the Iris multi-class classification demo (without using AppendCacheCheckpoint(mlContext)) from approximately "forever" when the debugger was attached down to ~230 seconds on my machine.
However, even after these changes there are still many exceptions being thrown in this situation. Whereas the aforementioned PRs helped remove overheads incurred directly by the ML libraries, there's a significant source of exceptions coming indirectly, via the use of BlockingCollection, due to a mismatch between how it's implemented and how TextLoader in ML.NET uses it.

BlockingCollection was implemented under the assumption that instances aren't frequently created and destroyed; as such, its CompleteAdding method was implemented on top of cancellation, such that if CompleteAdding is called while one or more threads are blocked waiting for data to arrive, internally an OperationCanceledException is thrown and subsequently eaten by those threads (https://github.com/dotnet/corefx/issues/34602). When there are just a few of these, it's not ideal but also not problematic.

However, ML.NET's usage in this situation tips the scales. In the Iris multi-class classification demo, for example, when caching isn't used (i.e. no call to .AppendCacheCheckpoint(mlContext)), tens of thousands of TextLoaders get created. Each of them creates a BlockingCollection, and each uses CompleteAdding to tear down communication between the single producer reading and publishing lines from the file and the one or more consumers taking those lines. The net result is thousands upon thousands of first-chance exceptions. The Iris demo app has a training file with only ~120 lines of data, and yet even after all of the aforementioned PRs, this demo app is incurring, on average on my machine, on the order of ~60,000 first-chance exceptions!
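The cancellation-based wakeup is easy to see in isolation. Here's a small standalone repro (not from this PR) that counts first-chance exceptions while repeatedly creating a BlockingCollection, blocking a consumer on it, and tearing it down with CompleteAdding:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public static class Program
{
    public static void Main()
    {
        // Count every first-chance exception raised in the process.
        int firstChance = 0;
        AppDomain.CurrentDomain.FirstChanceException +=
            (s, e) => Interlocked.Increment(ref firstChance);

        for (int i = 0; i < 1_000; i++)
        {
            using (var bc = new BlockingCollection<string>())
            {
                // Consumer blocks waiting for an item that never arrives.
                Task consumer = Task.Run(() =>
                {
                    foreach (string line in bc.GetConsumingEnumerable()) { }
                });

                // CompleteAdding wakes the blocked consumer by cancelling an
                // internal token; the resulting OperationCanceledException is
                // thrown and swallowed inside BlockingCollection, but still
                // counts (and costs) as a first-chance exception. Iterations
                // where the consumer hasn't started blocking yet may not
                // incur one, so the final count is approximate.
                bc.CompleteAdding();
                consumer.Wait();
            }
        }

        Console.WriteLine($"First-chance exceptions: {firstChance}");
    }
}
```

Run under a debugger, each of those internally swallowed OperationCanceledExceptions pays the full first-chance exception cost, which is where the wall-clock time goes.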
This results in numbers like the following for the Iris demo:
This PR addresses almost all of the remaining exceptions by replacing the usage of BlockingCollection with an alternative implementation that's not nearly as feature-rich but that doesn't rely on cancellation to implement CompleteAdding. This implementation can be used until either TextLoader is rearchitected to avoid all of these overheads in the first place or the implementation issue in BlockingCollection itself is addressed.
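The actual replacement is in this PR's diff; the following is only a minimal sketch of the core idea, assuming a Monitor-based queue (the real implementation may differ): CompleteAdding becomes a flag plus a pulse, so blocked producers and consumers are woken without any exception ever being thrown.

```csharp
using System.Collections.Generic;
using System.Threading;

// Minimal sketch (not the PR's actual code) of a bounded blocking queue
// whose CompleteAdding never throws or cancels anything to wake waiters.
internal sealed class SimpleBlockingQueue<T>
{
    private readonly Queue<T> _queue = new Queue<T>();
    private readonly int _boundedCapacity;
    private bool _addingCompleted;

    public SimpleBlockingQueue(int boundedCapacity) =>
        _boundedCapacity = boundedCapacity;

    public bool TryAdd(T item)
    {
        lock (_queue)
        {
            // Block while full, unless adding has been completed.
            while (_queue.Count >= _boundedCapacity)
            {
                if (_addingCompleted) return false;
                Monitor.Wait(_queue);
            }
            if (_addingCompleted) return false;
            _queue.Enqueue(item);
            Monitor.PulseAll(_queue); // wake any blocked consumers
            return true;
        }
    }

    public bool TryTake(out T item)
    {
        lock (_queue)
        {
            // Block while empty, until an item arrives or adding completes;
            // items enqueued before CompleteAdding are still drained.
            while (_queue.Count == 0)
            {
                if (_addingCompleted) { item = default(T); return false; }
                Monitor.Wait(_queue);
            }
            item = _queue.Dequeue();
            Monitor.PulseAll(_queue); // wake any blocked producers
            return true;
        }
    }

    public void CompleteAdding()
    {
        lock (_queue)
        {
            // Just a flag and a pulse: no cancellation, no exceptions.
            _addingCompleted = true;
            Monitor.PulseAll(_queue);
        }
    }
}
```

The key design point is that waiters always re-check their loop condition after waking, so a plain boolean flag and Monitor.PulseAll are sufficient to drain any remaining items to consumers and then return false, matching the shutdown semantics TextLoader needs from CompleteAdding.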
With this PR in place, the numbers on my machine drop to the following:
That's an ~33% improvement when the debugger isn't attached, and an ~7.5x improvement when it is.
cc: @CESARDELATORRE, @TomFinley, @eerhardt, @asthana86
Contributes to #2099