Add dynamic buffer support to OCL Backend #3765
Perf should be neutral, but I'll run some tests with image-classifier and attach the results.
@nickgg Overall, I like this change a lot! It really provides a uniform way of working with OpenCL buffers. BTW, we've discussed this approach with @mortzur a couple of weeks ago.
My two major comments are:
- I'd really like to see whether it has any performance implications or whether creating sub-buffers is essentially free. In particular, it should not slow down the copying of constants/weights at the beginning/end of each run.
- Glow's OpenCL backend currently passes kernel arguments in a very explicit way, by argument index, e.g. setKernelArg(kernel, 1, ...). This is very fragile: if we change the scheme (e.g. we no longer pass mem as the first argument), we need to touch every place where we pass arguments and update the indices. It seems like it would be more robust to introduce something that does not use explicit indices, e.g. something like this:
Kernel kernel("kernel_name");
kernel.pushArg(arg1);
kernel.pushArgs(arg2, arg3, arg4);
enqueueKernel(kernel, ...);
Of course, this second comment is not directly related to the scope of this PR and probably should be handled in a separate PR/issue.
lib/Backends/OpenCL/OpenCL.cpp
Outdated
setKernelArg(kernel, 0, deviceBuffer);
auto numArgs = setKernelArgsForBuffers(kernel, I, 1, runtimeBundle_);
unsigned numArgs = 0;
setKernelArg(kernel, numArgs++, deviceBuffer);
Why do you need to expand/inline setKernelArgsForBuffers here, but not in certain cases below?
In this case it looks like it's because I made the deviceBuffer the first argument again. I actually had a lot of trouble making this kernel work correctly and vaguely remember this being the only thing that worked. Will look into it.
I've added a comment about this, but basically if you remove the first void* arg from this kernel it doesn't compile. Why? No idea; it should be fine, and all the other kernels were. This is a compromise.
Yes, at least file an issue about it, so that we do not forget.
@nickgg Thanks for checking the performance. Looks like the change is neutral, which is very good.
Both test-suite failures here look spurious, but it seems like I can't rerun.
Force-pushed from 6878ff8 to d72e0a8
LGTM
lib/Backends/OpenCL/OpenCL.cpp
Outdated
cl_int err =
    clEnqueueCopyBuffer(commands, srcBuf, destBuf, 0, 0, sizeInBytes, 0,
                        nullptr, kernelProfiling_ ? &event : nullptr);
llvm::outs() << "COPY\n";
Please remove debug prints.
Just temporary; I'm trying to printf-debug the POCL build. Will fix before landing.
Did you notice POCL_DEBUG=1, which might be useful?
tests/unittests/MLTest.cpp
Outdated
@@ -1376,6 +1376,7 @@ TEST_P(MLTest, testFindPixelRegression) {
auto dx = LH.at({i, 0}) - RH.at({i, 0});
auto dy = LH.at({i, 1}) - RH.at({i, 1});
auto distance = std::sqrt(std::pow(dx, 2) + std::pow(dy, 2));
llvm::outs() << distance << "\n";
Was this a debugging print statement?
Yup, will remove as well.
Looks good!
Force-pushed from 22ec1c5 to a0a407f
Think I got the POCL issue: it's due to alignment, which isn't enforced on nvidia/cpu but is in pocl (potentially amd devices as well).
OK! Lint problems are from fc64547, not this diff; the OpenCL build is just the normal POCL issues; PyTorch is broken in master. I'm going to land this if it kills me.
@nickgg has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
This reverts commit bd69664.

Summary: This reverts commit bd69664. I had thought that I had gotten the last POCL issue in #3765, but I had not. Reverting to fix the OCL build. Honestly, this last issue (AMD/POCL requires sub-buffers to be aligned) seems to torpedo the whole idea; I can't think of any way to handle Glow TensorViews on the host, which means passing buffer + offset everywhere we pass a buffer now. Essentially this would mean rewriting the whole thing. Very frustrating, since that alignment restriction on sub-buffers makes no sense, and no other OCL implementation has it.

Pull Request resolved: #3784

Differential Revision: D18480248
Pulled By: nickgg
fbshipit-source-id: 9b05009ea901a0f477805e6c946faac34d9bc303
Just curious: does https://github.com/pocl/pocl/issues know about "the normal pocl issues" you refer to here? For me, Glow now works quite well with pocl, but I have one single-line patch I need to upstream to pocl, due to the way Glow checks for platform existence with a 0-device query, which currently fails.
Summary: The OpenCL Backend uses a static memory allocation strategy of allocating a single large buffer and then using offsets into it, which is good for the general case, but doesn't allow us to get the most benefit out of Device Resident Tensors (when we'd like to leave an output on the device to be used as the input to another network). This PR adds a more dynamic mapping of device buffers to the OCL backend via OpenCL SubBuffers, which are similar to Glow TensorViews in that they provide access to a region without additional allocations.

There is no behavioural change in this PR, but it provides infrastructure to reference buffers outside of the range of the DeviceBuffer in the future, which we need to get DRT perf wins.

The immediate benefit is that I was able to simplify the OCL kernel code, deleting about 25% of kernels.cl.

Documentation: NFC

Pull Request resolved: pytorch#3765

Test Plan: tests in release and ASAN

Differential Revision: D18465407
Pulled By: nickgg
fbshipit-source-id: 1b5416c4f389885bae4d5e1533a65bef8ab60122