Add dynamic buffer support to OCL Backend #3765


Closed
wants to merge 1 commit

Conversation

nickgg
Contributor

@nickgg nickgg commented Nov 9, 2019

Summary: The OpenCL Backend uses a static memory allocation strategy of allocating a single large buffer and then using offsets into it, which is good for the general case, but doesn't allow us to get the most benefit out of Device Resident Tensors (when we'd like to leave an output on the device to be used as the input to another network). This PR adds a more dynamic mapping of device buffers to the OCL backend via OpenCL SubBuffers, which are similar to Glow TensorViews in that they provide access to a region without additional allocations.

There is no behavioural change in this PR, but it provides infrastructure to reference buffers outside of the range of the DeviceBuffer in the future, which we need to get DRT perf wins.

The immediate benefit is that I was able to simplify the OCL kernel code, deleting about 25% of kernels.cl.

Documentation: NFC

Test Plan: tests in release and ASAN

@nickgg
Contributor Author

nickgg commented Nov 9, 2019

Perf should be neutral, but I'll run some tests with image-classifier and attach the results.

@nickgg nickgg requested review from gcatron and opti-mix November 9, 2019 19:25
Contributor

@opti-mix opti-mix left a comment


@nickgg Overall, I like this change a lot! It really provides a uniform way of working with OpenCL buffers. BTW, we discussed this approach with @mortzur a couple of weeks ago.

My two major comments are:

  1. I'd really like to see whether it has any performance implications or whether creating sub-buffers is essentially free. In particular, it should not slow down the copying of constants/weights at the beginning/end of each run.
  2. Glow's OpenCL backend currently uses a very explicit way of passing the arguments to a kernel by using argument indices, e.g. setKernelArg(kernel, 1, ...). This is very fragile and if we change the scheme (e.g. we do not pass mem as the first argument), we need to touch all places where we pass arguments and change their indices. It seems like it would be more robust to introduce something which does not use explicit indices, e.g. something like this:
```cpp
Kernel kernel("kernel_name");
kernel.pushArg(arg1);
kernel.pushArgs(arg2, arg3, arg4);
enqueueKernel(kernel, ...);
```

Of course, this second comment is not directly related to the scope of this PR and probably should be handled in a separate PR/issue.

```cpp
setKernelArg(kernel, 0, deviceBuffer);
auto numArgs = setKernelArgsForBuffers(kernel, I, 1, runtimeBundle_);
unsigned numArgs = 0;
setKernelArg(kernel, numArgs++, deviceBuffer);
```
Contributor


Why do you need to expand/inline setKernelArgsForBuffers here, but not in certain cases below?

Contributor Author


In this case it looks like it's because I made the deviceBuffer the first argument again. I actually had a lot of trouble making this kernel work correctly and vaguely remember this being the only thing that worked. Will look into it.

Contributor Author


I've added a comment about this, but basically if you remove the first void* arg from this kernel it doesn't compile. Why? No idea; it should be fine, and all the other kernels were. This is a compromise.

Contributor Author

@nickgg nickgg left a comment


Thanks @opti-mix. I'm very curious about #1 as well, will verify today.

For #2 I agree; I was thinking the same thing while doing the work. Figured this diff was big enough as it is, so follow up in a separate PR?

@nickgg
Contributor Author

nickgg commented Nov 11, 2019

Perf comparison:

before: [image: image-classifier results]

after: [image: image-classifier results]

Looks neutral to me.

@opti-mix
Contributor

For #2 I agree; I was thinking the same thing while doing the work. Figured this diff was big enough as it is, so follow up in a separate PR?

Yes, at least file an issue about it, so that we do not forget.

@opti-mix
Contributor

@nickgg Thanks for checking the performance. Looks like the change is neutral, which is very good.

@nickgg
Contributor Author

nickgg commented Nov 12, 2019

Both test-suite failures here look spurious, but it seems like I can't rerun them.

@nickgg nickgg force-pushed the oclBuffers branch 3 times, most recently from 6878ff8 to d72e0a8 Compare November 12, 2019 21:53
Contributor

@opti-mix opti-mix left a comment


LGTM

```cpp
cl_int err =
    clEnqueueCopyBuffer(commands, srcBuf, destBuf, 0, 0, sizeInBytes, 0,
                        nullptr, kernelProfiling_ ? &event : nullptr);
llvm::outs() << "COPY\n";
```
Contributor


Please remove debug prints.

Contributor Author


Just temporary; I'm trying to printf-debug the POCL build. Will fix before landing.

Contributor


Did you notice POCL_DEBUG=1 which might be useful?

```diff
@@ -1376,6 +1376,7 @@ TEST_P(MLTest, testFindPixelRegression) {
   auto dx = LH.at({i, 0}) - RH.at({i, 0});
   auto dy = LH.at({i, 1}) - RH.at({i, 1});
   auto distance = std::sqrt(std::pow(dx, 2) + std::pow(dy, 2));
+  llvm::outs() << distance << "\n";
```
Contributor


Was this a debugging print statement?

Contributor Author


Yup, will remove as well.

Contributor

@gcatron gcatron left a comment


Looks good!

@nickgg nickgg force-pushed the oclBuffers branch 2 times, most recently from 22ec1c5 to a0a407f Compare November 12, 2019 23:46
Contributor Author

@nickgg nickgg left a comment


Think I got the POCL issue: it's due to alignment, which isn't enforced on NVIDIA/CPU but is in POCL (and potentially on AMD devices as well).

@nickgg
Contributor Author

nickgg commented Nov 13, 2019

OK! Lint problems are from fc64547, not this diff; OpenCL build is just the normal POCL issues, Pytorch broken in master. I'm going to land this if it kills me.


@facebook-github-bot facebook-github-bot left a comment


@nickgg has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.


@facebook-github-bot

@nickgg merged this pull request in bd69664.

nickgg added a commit to nickgg/glow that referenced this pull request Nov 13, 2019
facebook-github-bot pushed a commit that referenced this pull request Nov 13, 2019
Summary:
This reverts commit bd69664.

I had thought that I had gotten the last POCL issue in #3765, but I had not. Reverting to fix the OCL build.

Honestly this last issue (AMD/POCL requires sub-buffers to be aligned) seems to torpedo the whole idea; I can't think of any way to handle Glow TensorViews on the host, which means passing buffer + offset everywhere we pass a buffer below. Essentially this would mean rewriting the whole thing.

Very frustrating, since that alignment restriction on subBuffers makes no sense and no other OCL implementation has it.
Pull Request resolved: #3784

Differential Revision: D18480248

Pulled By: nickgg

fbshipit-source-id: 9b05009ea901a0f477805e6c946faac34d9bc303
@pjaaskel
Contributor

pjaaskel commented Nov 16, 2019

... OpenCL build is just the normal POCL issues, Pytorch broken in master. I'm going to land this if it kills me.

Just curious: does https://github.com/pocl/pocl/issues know about "the normal pocl issues" you refer to here?

For me Glow now works quite well with pocl, but I have a single one-liner patch I need to upstream to pocl, due to the way Glow checks for platform existence with a 0-device query, which currently fails.
Are there other remaining issues?

vdantu pushed a commit to vdantu/glow that referenced this pull request Jul 12, 2020
Pull Request resolved: pytorch#3765

Differential Revision: D18465407

Pulled By: nickgg

fbshipit-source-id: 1b5416c4f389885bae4d5e1533a65bef8ab60122
vdantu pushed a commit to vdantu/glow that referenced this pull request Jul 12, 2020
Pull Request resolved: pytorch#3784

Differential Revision: D18480248

Pulled By: nickgg

fbshipit-source-id: 9b05009ea901a0f477805e6c946faac34d9bc303