
[OpenCL] Optimize BatchedReduceAdd implementation #3190


Conversation

Contributor

@SplitInfinity SplitInfinity commented Jun 27, 2019

Description
This commit reduces the time taken to execute a BatchedReduceAdd
instruction in the OpenCL backend by moving the transfer of the input
and output slice size data from execution time to addNetwork time.

The slice sizes of the input and output can be computed using static
shape information at compile time, so they don't have to be computed and
transferred once per operator invocation at runtime. This commit introduces
a post-lowering OCLBackend transformation that replaces BatchedReduceAddNode
with a semantically identical OCLBatchedReduceAddNode that has two additional
inputs for the slice sizes of the input and output nodes. The code to
compute these slice sizes has been moved from OpenCLFunction::execute
to OCLBackend::transformPostLowering and modified to write into the
payload Tensors of Constants that are used as the inputs to the
previously mentioned OCLBatchedReduceAddNode. These slice sizes are then
copied to the device with the rest of the constants needed by the
function.
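
For context, the "slice size" of a dimension here is the number of elements spanned by one step along that dimension, i.e. the product of the trailing dimension sizes, which is fully determined by the static shape. The snippet below is a minimal, standalone sketch of that compile-time arithmetic; it is illustrative only (the function name and shapes are made up for this example) and does not reproduce the actual Glow code, which writes the resulting values into the payload Tensors of Constants inside OCLBackend::transformPostLowering.

```cpp
// Illustrative sketch only, not the code from this PR.
#include <cstddef>
#include <cstdio>
#include <vector>

// Slice size of dimension i = product of all dimension sizes after i.
// For an input of shape {4, 3, 2} this returns {6, 2, 1}.
std::vector<size_t> computeSliceSizes(const std::vector<size_t> &dims) {
  if (dims.empty()) {
    return {};
  }
  std::vector<size_t> sliceSizes(dims.size(), 1);
  for (size_t i = dims.size() - 1; i-- > 0;) {
    sliceSizes[i] = sliceSizes[i + 1] * dims[i + 1];
  }
  return sliceSizes;
}

int main() {
  // Hypothetical shapes for a reduce-add over axis 1: input {4, 3, 2} -> output {4, 2}.
  for (size_t s : computeSliceSizes({4, 3, 2})) {
    std::printf("input slice size: %zu\n", s);
  }
  for (size_t s : computeSliceSizes({4, 2})) {
    std::printf("output slice size: %zu\n", s);
  }
  return 0;
}
```

Because both vectors depend only on static types, they can be materialized as Constants at addNetwork time and uploaded once alongside the function's other constants, instead of being recomputed and transferred on every execution.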

Test Plan
All unit tests pass.

DLRM trace before this optimization: [screenshot]

DLRM trace after this optimization: [screenshot]

The time taken to process one minibatch has decreased by 20%.

@facebook-github-bot facebook-github-bot left a comment

@SplitInfinity has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Contributor

ksaurabh-cadence commented Jun 27, 2019

@SplitInfinity
I am planning to upstream non-OpenCL reduceAdd support in a day or two. Should I expect a conflict?

@SplitInfinity
Contributor Author

@ksaurabh-cadence

No, I am not going to modify the non-OpenCL BatchedReduceAdd as part of this PR or any others in the immediate future.

@SplitInfinity SplitInfinity marked this pull request as ready for review June 27, 2019 22:01
@SplitInfinity
Contributor Author

I think this PR has the right idea, but I'm not so sure about the approach. OCLBatchedReduceAdd's two Constant inputs contain what is essentially shape information for its inputs and outputs. If some optimization pass changes the input and output shapes without updating these Constants, the resultant graph will be incorrect. One way to defend against this is to add checks in OCLBatchedReduceAdd::verify, but that seems like a bandaid. The shape information shouldn't be duplicated in the first place.

Contributor

jfix71 commented Jun 28, 2019

If some optimization pass changes the input and output shapes without updating these Constants, the resultant graph will be incorrect.

At least for now, I wouldn't be too worried about that -- currently, backend-specific Nodes are complete black boxes*. They are never touched except by DCE/CSE. So I think only the backend itself could make valid changes here that impact shapes; any other pass would also need to update the backend-specific Nodes, which it doesn't understand.

*If we end up adding functionality mentioned in #1830 then we may need to revisit this.

@SplitInfinity SplitInfinity requested a review from opti-mix July 2, 2019 00:24
Contributor

@opti-mix opti-mix left a comment

Nice! Overall, it looks very clean. How do you test it? Do we need to add any new tests?

Contributor

opti-mix commented Jul 2, 2019

I think this PR has the right idea, but I'm not so sure about the approach. OCLBatchedReduceAdd's two Constant inputs contain what is essentially shape information for its inputs and outputs. If some optimization pass changes the input and output shapes without updating these Constants, the resultant graph will be incorrect. One way to defend against this is to add checks in OCLBatchedReduceAdd::verify, but that seems like a bandaid. The shape information shouldn't be duplicated in the first place.

I think the OCLBatchedReduceAdd::verify approach makes sense for the time being.

A totally different approach could be to run a backend-specific pass (not post-lowering) that performs the rewriting at the very end of the graph processing pipeline, or even as an IR pass after the usual IR generation. In that case, nothing could overwrite its results and make them inconsistent.
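
To make the verify-based safeguard concrete: such a check would recompute the expected slice sizes from the node's current input and output types and compare them with the values baked into the constant inputs. The sketch below is a rough, standalone illustration of that idea; the function names here are hypothetical, and the PR later adds an actual check in OCLBatchedReduceAddNode::verify() using Glow's expectCompareTrue helper, as noted further down.

```cpp
// Rough sketch of the consistency check discussed above, not the committed code.
#include <cstddef>
#include <vector>

// Same helper as in the earlier sketch: slice size of dimension i is the
// product of all dimension sizes after i.
std::vector<size_t> computeSliceSizes(const std::vector<size_t> &dims) {
  if (dims.empty()) {
    return {};
  }
  std::vector<size_t> sliceSizes(dims.size(), 1);
  for (size_t i = dims.size() - 1; i-- > 0;) {
    sliceSizes[i] = sliceSizes[i + 1] * dims[i + 1];
  }
  return sliceSizes;
}

// Hypothetical stand-in for the verify() check: `inputDims`/`outputDims` are
// the node's current static shapes, and the `baked*` vectors are the payloads
// of the two slice-size Constants created during transformPostLowering. A
// mismatch means some pass changed the shapes without updating the Constants.
bool sliceSizeConstantsMatch(const std::vector<size_t> &inputDims,
                             const std::vector<size_t> &outputDims,
                             const std::vector<size_t> &bakedInputSlices,
                             const std::vector<size_t> &bakedOutputSlices) {
  return computeSliceSizes(inputDims) == bakedInputSlices &&
         computeSliceSizes(outputDims) == bakedOutputSlices;
}
```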

@SplitInfinity SplitInfinity force-pushed the optimize-ocl-batchedreduceadd branch from afe992b to dea8c99 on July 2, 2019 18:47
@SplitInfinity
Contributor Author

  • Addressed comments
  • Added OCLBatchedReduceAddNode::verify() implementation

@SplitInfinity
Contributor Author

SplitInfinity commented Jul 2, 2019

How do you test it? Do we need to add any new tests?

When I implemented this operator for OpenCL, we already had tests that cover reducing on axis 0 and axis 1 as well as producing a zero-dimensional result. I added another test in #2958 for axis 2 (which in that specific test case is also the last dimension), so I think we have enough tests.
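
For readers unfamiliar with the operator, the per-axis tests above exercise the semantics sketched here. This is an illustrative reference only (not Glow's test code): summing a row-major tensor of shape {2, 3, 4} along axis 0, 1, or 2 yields outputs of shape {3, 4}, {2, 4}, and {2, 3} respectively.

```cpp
// Naive reference reduce-add over an arbitrary axis of a row-major tensor.
// Illustrative only; not taken from Glow.
#include <cstddef>
#include <vector>

std::vector<float> reduceAdd(const std::vector<float> &in,
                             const std::vector<size_t> &dims, size_t axis) {
  // outer = product of dims before `axis`, inner = product of dims after it.
  size_t outer = 1, inner = 1;
  for (size_t i = 0; i < axis; ++i) outer *= dims[i];
  for (size_t i = axis + 1; i < dims.size(); ++i) inner *= dims[i];

  std::vector<float> out(outer * inner, 0.0f);
  for (size_t o = 0; o < outer; ++o) {
    for (size_t a = 0; a < dims[axis]; ++a) {
      for (size_t i = 0; i < inner; ++i) {
        out[o * inner + i] += in[(o * dims[axis] + a) * inner + i];
      }
    }
  }
  return out;
}
```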

@SplitInfinity SplitInfinity force-pushed the optimize-ocl-batchedreduceadd branch from dea8c99 to c145cc7 on July 2, 2019 18:58
@SplitInfinity SplitInfinity force-pushed the optimize-ocl-batchedreduceadd branch from c145cc7 to 84541cf on July 2, 2019 21:47
@SplitInfinity
Contributor Author

  • Updated verify() to use expectCompareTrue.

Contributor

@opti-mix opti-mix left a comment

LGTM!


@facebook-github-bot facebook-github-bot left a comment

@SplitInfinity has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@SplitInfinity SplitInfinity force-pushed the optimize-ocl-batchedreduceadd branch from 84541cf to 28e048d on July 2, 2019 22:52

@facebook-github-bot facebook-github-bot left a comment

@SplitInfinity has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot

@SplitInfinity merged this pull request in f06ef01.

@SplitInfinity SplitInfinity deleted the optimize-ocl-batchedreduceadd branch July 11, 2019 18:13