[OpenCL] Optimize BatchedReduceAdd implementation #3190

SplitInfinity · 2019-06-27T21:16:54Z

Description
This commit reduces the time taken to execute a batchedreduceadd
instruction in the OpenCL backend by moving the transfer of the input
and output slice size data from execution time to addNetwork time.

The slice sizes of the input and output can be computed using static
shape information at compile time, so they don't have to be computed and
transferred once per operator invocation at runtime. This commit introduces
a post-lowering OCLBackend transformation that replaces BatchedReduceAddNode
with a semantically identical OCLBatchedReduceAddNode that has two additional
inputs for the slice sizes of the input and output nodes. The code to
compute these slice sizes has been moved from OpenCLFunction::execute
to OCLBackend::transformPostLowering and modified to write into the
payload Tensors of Constants that are used as the inputs to the
previously mentioned OCLBatchedReduceAddNode. These slice sizes are then
copied to the device with the rest of the constants needed by the
function.
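
As a rough illustration of why this work can move to compile time, the sketch below computes per-dimension slice sizes from nothing but a static shape. It assumes that "slice size" means the number of elements spanned by one step along a dimension of a row-major tensor; `computeSliceSizes` is a hypothetical helper name and is not taken from the Glow sources.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, not Glow's actual code. Assumption: for a row-major
// tensor, the slice size of dimension i is the product of all dims after i.
std::vector<std::size_t>
computeSliceSizes(const std::vector<std::size_t> &dims) {
  std::vector<std::size_t> sliceSizes(dims.size(), 1);
  // Walk from the second-to-last dimension toward the first, accumulating
  // the product of the trailing dimensions.
  for (std::size_t i = dims.size(); i-- > 1;) {
    sliceSizes[i - 1] = sliceSizes[i] * dims[i];
  }
  return sliceSizes;
}
```

Because the result depends only on `dims`, it can be computed once when the network is added and stored in a Constant, rather than recomputed and transferred on every invocation.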

Test Plan
All unit tests pass.

DLRM trace before this optimization:

DLRM trace after this optimization:

Time taken to process one minibatch has decreased by 20%.

facebook-github-bot

@SplitInfinity has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

ksaurabh-cadence · 2019-06-27T21:41:50Z

@SplitInfinity
I am planning to upstream non-OpenCL reduceAdd support in a day or two. Should I expect a conflict?

SplitInfinity · 2019-06-27T21:53:00Z

@ksaurabh-cadence

No, I am not going to modify non-OpenCL BatchedReduceAdd as part of this or any PRs in the immediate future.

SplitInfinity · 2019-06-27T23:26:52Z

I think this PR has the right idea, but I'm not so sure about the approach. OCLBatchedReduceAdd's two Constant inputs contain what is essentially shape information for its inputs and outputs. If some optimization pass changes the input and output shapes without updating these Constants, the resultant graph will be incorrect. One way to defend against this is to add checks in OCLBatchedReduceAdd::verify, but that seems like a bandaid. The shape information shouldn't be duplicated in the first place.

jfix71 · 2019-06-28T17:53:17Z

> If some optimization pass changes the input and output shapes without updating these Constants, the resultant graph will be incorrect.

At least for now, I wouldn't be too worried about that -- currently backend-specific Nodes are complete black boxes*. They are never touched except for DCE/CSE. So I think only the backend itself would be able to make valid changes here that impact shapes, as otherwise they would also need to update the backend-specific Nodes which they don't understand.

*If we end up adding functionality mentioned in #1830 then we may need to revisit this.

opti-mix

Nice! Overall, it looks very clean. How do you test it? Do we need to add any new tests?

lib/Backends/OpenCL/Transforms.cpp

opti-mix · 2019-07-02T00:49:26Z

> I think this PR has the right idea, but I'm not so sure about the approach. OCLBatchedReduceAdd's two Constant inputs contain what is essentially shape information for its inputs and outputs. If some optimization pass changes the input and output shapes without updating these Constants, the resultant graph will be incorrect. One way to defend against this is to add checks in OCLBatchedReduceAdd::verify, but that seems like a bandaid. The shape information shouldn't be duplicated in the first place.

I think the OCLBatchedReduceAdd::verify approach makes sense for the time being.

A totally different approach would be to run a backend-specific pass (not post-lowering) that performs the rewriting at the very end of the graph processing pipeline, or even as an IR pass after the usual IR generation. In that case no later pass can overwrite its results and make them inconsistent.

SplitInfinity · 2019-07-02T18:47:56Z

Addressed comments
Added OCLBatchedReduceAddNode::verify() implementation
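
For readers following the discussion, a consistency check of this kind might look roughly like the sketch below. It is not the verify() implementation from this PR: `cachedSliceSizesMatch` is a hypothetical name, and the sketch assumes the cached Constant holds row-major slice sizes (the product of the trailing dimensions) for the tensor's current shape.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical check, not Glow's verify(). Returns true only if the cached
// slice sizes equal what the current shape implies, under the assumption
// that slice size i is the product of all dims after i.
bool cachedSliceSizesMatch(const std::vector<std::size_t> &dims,
                           const std::vector<std::size_t> &cached) {
  if (cached.size() != dims.size()) {
    return false;
  }
  std::size_t expected = 1;
  // Recompute the expected value from the innermost dimension outward and
  // compare against the cached entry at each step.
  for (std::size_t i = dims.size(); i-- > 0;) {
    if (cached[i] != expected) {
      return false;
    }
    expected *= dims[i];
  }
  return true;
}
```

With this check, a pass that changed a shape from {2, 3, 4} to {4, 3, 2} without refreshing the cached values {12, 4, 1} would be flagged, which is the failure mode raised earlier in the thread.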

SplitInfinity · 2019-07-02T18:49:31Z

> How do you test it? Do we need to add any new tests?

When I implemented this operator for OpenCL, we already had tests that reduce on axis 0 and axis 1 and that produce a zero-dimensional result. I added another test in #2958 for axis 2 (which in that specific test case is also the last dimension), so I think we have enough tests.

tools/ClassGen/Backends/OpenCL/OpenCLSpecificNodesVerification.h

SplitInfinity · 2019-07-02T21:48:45Z

Updated verify() to use expectCompareTrue.

opti-mix

LGTM!

facebook-github-bot · 2019-07-03T01:06:07Z

@SplitInfinity merged this pull request in f06ef01.

facebook-github-bot added the CLA Signed label Jun 27, 2019

facebook-github-bot reviewed Jun 27, 2019

SplitInfinity marked this pull request as ready for review June 27, 2019 22:01

SplitInfinity requested a review from opti-mix July 2, 2019 00:24

opti-mix reviewed Jul 2, 2019

lib/Backends/OpenCL/Transforms.cpp (outdated)

lib/Backends/OpenCL/Transforms.cpp

SplitInfinity force-pushed the optimize-ocl-batchedreduceadd branch from afe992b to dea8c99 July 2, 2019 18:47

SplitInfinity force-pushed the optimize-ocl-batchedreduceadd branch from dea8c99 to c145cc7 July 2, 2019 18:58

opti-mix reviewed Jul 2, 2019

tools/ClassGen/Backends/OpenCL/OpenCLSpecificNodesVerification.h

SplitInfinity force-pushed the optimize-ocl-batchedreduceadd branch from c145cc7 to 84541cf July 2, 2019 21:47

opti-mix approved these changes Jul 2, 2019

facebook-github-bot reviewed Jul 2, 2019

SplitInfinity force-pushed the optimize-ocl-batchedreduceadd branch from 84541cf to 28e048d July 2, 2019 22:52

facebook-github-bot reviewed Jul 2, 2019

facebook-github-bot closed this in f06ef01 Jul 3, 2019

facebook-github-bot added the Merged label Jul 3, 2019

SplitInfinity deleted the optimize-ocl-batchedreduceadd branch July 11, 2019 18:13
