Add dynamic buffer support to OCL Backend #3765


Closed
wants to merge 1 commit

Conversation

nickgg
Contributor

@nickgg nickgg commented Nov 9, 2019

Summary: The OpenCL Backend uses a static memory allocation strategy of allocating a single large buffer and then using offsets into it, which is good for the general case, but doesn't allow us to get the most benefit out of Device Resident Tensors (when we'd like to leave an output on the device to be used as the input to another network). This PR adds a more dynamic mapping of device buffers to the OCL backend via OpenCL SubBuffers, which are similar to Glow TensorViews in that they provide access to a region without additional allocations.

There is no behavioural change in this PR, but it provides infrastructure to reference buffers outside of the range of the DeviceBuffer in the future, which we need to get DRT perf wins.

The immediate benefit is that I was able to simplify the OCL kernel code, deleting about 25% of kernels.cl.

Documentation: NFC

Test Plan: tests in release and ASAN

@nickgg
Contributor Author

nickgg commented Nov 9, 2019

Perf should be neutral, but I'll run some tests with image-classifier and attach the results.

@nickgg nickgg requested review from gcatron and opti-mix November 9, 2019 19:25
Contributor

@opti-mix opti-mix left a comment


@nickgg Overall, I like this change a lot! It really provides a uniform way of working with OpenCL buffers. BTW, we discussed this approach with @mortzur a couple of weeks ago.

My two major comments are:

  1. I'd really like to see whether it has any performance implications or whether creating sub-buffers is essentially free. In particular, it should not slow down the copying of constants/weights at the beginning/end of each run.
  2. Glow's OpenCL backend currently uses a very explicit way of passing the arguments to a kernel by using argument indices, e.g. setKernelArg(kernel, 1, ...). This is very fragile and if we change the scheme (e.g. we do not pass mem as the first argument), we need to touch all places where we pass arguments and change their indices. It seems like it would be more robust to introduce something which does not use explicit indices, e.g. something like this:
```cpp
Kernel kernel("kernel_name");
kernel.pushArg(arg1);
kernel.pushArgs(arg2, arg3, arg4);
enqueueKernel(kernel, ...);
```

Of course, this second comment is not directly related to the scope of this PR and probably should be handled in a separate PR/issue.

```cpp
setKernelArg(kernel, 0, deviceBuffer);
auto numArgs = setKernelArgsForBuffers(kernel, I, 1, runtimeBundle_);
unsigned numArgs = 0;
setKernelArg(kernel, numArgs++, deviceBuffer);
```
Contributor


Why do you need to expand/inline setKernelArgsForBuffers here, but not in certain cases below?

Contributor Author


In this case it looks like it's because I made the deviceBuffer the first argument again. I actually had a lot of trouble making this kernel work correctly and vaguely remember this being the only thing that worked. Will look into it.

Contributor Author


I've added a comment about this, but basically if you remove the first void* arg from this kernel it doesn't compile. Why? No idea; it should be fine, and all the other kernels were. This is a compromise.

Contributor Author

@nickgg nickgg left a comment


Thanks @opti-mix. I'm very curious about #1 as well, will verify today.

For #2 I agree; I was thinking the same thing while doing the work. Figured this diff was big enough as it is, so follow up in a separate PR?

@nickgg
Contributor Author

nickgg commented Nov 11, 2019

Perf comparison:

before: [image: image-classifier results]

after: [image: image-classifier results]

Looks neutral to me.

@opti-mix
Contributor

For #2 I agree; I was thinking the same thing while doing the work. Figured this diff was big enough as it is, so follow up in a separate PR?

Yes, at least file an issue about it, so that we do not forget.

@opti-mix
Contributor

@nickgg Thanks for checking the performance. Looks like the change is neutral, which is very good.

@nickgg
Contributor Author

nickgg commented Nov 12, 2019

Both test-suite failures here look spurious, but it seems like I can't rerun them.

@nickgg nickgg force-pushed the oclBuffers branch 3 times, most recently from 6878ff8 to d72e0a8 Compare November 12, 2019 21:53
Contributor

@opti-mix opti-mix left a comment


LGTM

```cpp
cl_int err =
    clEnqueueCopyBuffer(commands, srcBuf, destBuf, 0, 0, sizeInBytes, 0,
                        nullptr, kernelProfiling_ ? &event : nullptr);
llvm::outs() << "COPY\n";
```
Contributor


Please remove debug prints.

Contributor Author


Just temporary; I'm trying to printf-debug the POCL build. Will fix before landing.

Contributor


Did you notice POCL_DEBUG=1 which might be useful?

```diff
@@ -1376,6 +1376,7 @@ TEST_P(MLTest, testFindPixelRegression) {
   auto dx = LH.at({i, 0}) - RH.at({i, 0});
   auto dy = LH.at({i, 1}) - RH.at({i, 1});
   auto distance = std::sqrt(std::pow(dx, 2) + std::pow(dy, 2));
+  llvm::outs() << distance << "\n";
```
Contributor


Was this a debugging print statement?

Contributor Author


Yup, will remove as well.

Contributor

@gcatron gcatron left a comment


Looks good!

@nickgg nickgg force-pushed the oclBuffers branch 2 times, most recently from 22ec1c5 to a0a407f Compare November 12, 2019 23:46
Contributor Author

@nickgg nickgg left a comment


Think I got the POCL issue: it's due to alignment, which isn't enforced on NVIDIA/CPU but is in POCL (and potentially on AMD devices as well).

@nickgg
Contributor Author

nickgg commented Nov 13, 2019

OK! Lint problems are from fc64547, not this diff; OpenCL build is just the normal POCL issues, Pytorch broken in master. I'm going to land this if it kills me.


@facebook-github-bot facebook-github-bot left a comment


@nickgg has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


@facebook-github-bot

@nickgg merged this pull request in bd69664.

nickgg added a commit to nickgg/glow that referenced this pull request Nov 13, 2019
facebook-github-bot pushed a commit that referenced this pull request Nov 13, 2019
Summary:
This reverts commit bd69664.

I had thought that I had gotten the last POCL issue in #3765, but I had not. Reverting to fix the OCL build.

Honestly this last issue (AMD/POCL requires sub-buffers to be aligned) seems to torpedo the whole idea; I can't think of any way to handle Glow TensorViews on the host, which means passing buffer + offset everywhere we pass a buffer below. Essentially this would mean rewriting the whole thing.

Very frustrating, since that alignment restriction on subBuffers makes no sense and no other OCL implementation has it.
Pull Request resolved: #3784

Differential Revision: D18480248

Pulled By: nickgg

fbshipit-source-id: 9b05009ea901a0f477805e6c946faac34d9bc303
@pjaaskel
Contributor

pjaaskel commented Nov 16, 2019

... OpenCL build is just the normal POCL issues, Pytorch broken in master. I'm going to land this if it kills me.

Just curious: does https://github.com/pocl/pocl/issues know about "the normal pocl issues" you refer to here?

For me Glow now works quite well with pocl, but I have a single one-liner patch I need to upstream to pocl, due to the way Glow checks for platform existence with a 0-device query, which currently fails.
Are there other remaining issues?

vdantu pushed a commit to vdantu/glow that referenced this pull request Jul 12, 2020
Pull Request resolved: pytorch#3765

Differential Revision: D18465407

Pulled By: nickgg

fbshipit-source-id: 1b5416c4f389885bae4d5e1533a65bef8ab60122
vdantu pushed a commit to vdantu/glow that referenced this pull request Jul 12, 2020
Pull Request resolved: pytorch#3784

Differential Revision: D18480248

Pulled By: nickgg

fbshipit-source-id: 9b05009ea901a0f477805e6c946faac34d9bc303