
Teach Glow how to support custom alignment requirements for tensor dimensions #2686


Open
opti-mix opened this issue Apr 9, 2019 · 6 comments


opti-mix commented Apr 9, 2019

Some backends may have specific requirements on the alignment of data inside tensors, e.g. each row along the innermost dimension may need to be aligned to the cache line size. In principle, any of the dimensions may have an alignment requirement.

Glow's Tensor class as of today is not able to represent these requirements, because the underlying Type cannot express them. Type only stores the actual dimensions, not the aligned (padded) dimensions.

So, how could one use Glow Tensors to enforce any requirements on alignments?

Here are some possible alternatives:

  1. Glow Tensors could be extended to explicitly support custom strides. Type would have a dedicated strides_ member in addition to sizes_ (see the sketch after this list).

    • This change could be useful for different backends.
    • Any code that accesses tensor elements only via logical indices and never through raw pointers should continue to work correctly.
    • But it may require a lot of changes all over the place, especially in libjit and some kernels and operations that iterate over the content of a tensor using raw pointers.
      • Everything that uses raw pointers would start iterating over "padding" elements. These elements may contain junk data, so all such places would have to be updated to skip them.
      • libjit kernels should get additional parameters for strides, so that they can correctly compute linear indices into tensors.
  2. It is also possible to try tweaking the representation of tensors only in the specific backends that need it, without changing the Tensor and Type classes, e.g. by introducing custom backend-specific Tensor classes. This is certainly doable, but it is very likely that this logic would end up re-implemented in a number of different backends.

  • The disadvantages of this approach are:
    • it may require more memory to keep both the original and the backend-specific tensors.
    • it may involve a lot of copying between padded and non-padded tensors.
    • none of the existing kernels can be re-used, because they work only with Glow's standard tensors.
  3. It is also possible to use just the usual Tensor and Type classes. Every time a custom-aligned tensor is needed, one would represent it e.g. as a Tensor with bigger dimensions that accommodate the required alignments and contain some junk padding elements. Such a tensor would have the desired memory layout. But it is important to note that some of the existing operations on tensors would not produce correct results when applied to this tensor, because those operations assume that all elements contain real data and need to be processed, which is not the case due to the added junk padding elements. The content of the original tensor would be copied into the "padded" tensor before the operation that makes use of custom alignments and would be copied back into the original tensor after that operation.
    • The disadvantages of this approach are:
      • it may require more memory to keep both the original and the padded tensor.
      • it may involve a lot of copying between padded and non-padded tensors.
      • some of the existing kernels/operations cannot be re-used with the "padded" tensors, because they would try to process the "padding" elements. One would need to develop new kernels for such "padded" tensors.
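
To make option (1) concrete, here is a minimal sketch of what a stride-aware Type could look like, assuming per-dimension alignments expressed in elements. All names (StridedType, withAlignments, offset) are illustrative, not Glow's actual API:

```cpp
#include <cstddef>
#include <vector>

// Sketch of option (1): a type that carries explicit strides in addition
// to sizes. Hypothetical names, not Glow's actual Type class.
struct StridedType {
  std::vector<size_t> sizes_;   // logical dimensions, e.g. {N, H, W, C}
  std::vector<size_t> strides_; // elements to skip per step in each dimension

  // Derive strides from per-dimension alignments (in elements); aligns must
  // have the same length as sizes. With all alignments equal to 1 this
  // degenerates to the usual packed row-major layout.
  static StridedType withAlignments(const std::vector<size_t> &sizes,
                                    const std::vector<size_t> &aligns) {
    StridedType t;
    t.sizes_ = sizes;
    t.strides_.resize(sizes.size());
    size_t stride = 1;
    for (size_t i = sizes.size(); i-- > 0;) {
      t.strides_[i] = stride;
      // Round the extent of this dimension up to a multiple of its alignment.
      size_t padded = (sizes[i] + aligns[i] - 1) / aligns[i] * aligns[i];
      stride *= padded;
    }
    return t;
  }

  // Linear offset of a logical index; padding elements are skipped
  // automatically because the strides already account for them.
  size_t offset(const std::vector<size_t> &idx) const {
    size_t off = 0;
    for (size_t i = 0; i < idx.size(); ++i)
      off += idx[i] * strides_[i];
    return off;
  }
};
```

For example, a 4x10 float tensor whose rows must start on 64-byte (16-float) boundaries would get strides {16, 1} and occupy 64 elements instead of 40; any code that indexes through offset() never sees the padding.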

I'm probably missing some further alternatives here.

It would be interesting to hear from those who have faced a similar issue what their experience was and how they solved this problem. Maybe we can learn from it.

Personally, I tend to think that (1) is the cleanest solution, but I'm afraid it may involve a lot of refactoring in existing kernels.

Note: This issue is related to #1214.


jfix71 commented Apr 10, 2019

For 1, do we want to use libjit for some backends other than the CPU Backend that require/prefer strides? RE: raw(), I think most of its uses are just to avoid needing to worry about shapes, and so they would still work fine if we updated it to skip padding, like at(ArrayRef<size_t>) would. And maybe we could rename raw() to something more representative, e.g. at(size_t).

For 2, we could always create a new StridedTensor class that all backends could share. I agree it might not be great if we need to keep all unstrided Tensors around too, or copy everything over to the new format. Though I'm not sure how we avoid copying in option 1 either -- we would load Constants as non-strided and need to copy over all the data into the new layout once the backend specifies what its requirements are. And I think we could ditch the unstrided Tensors in either case once we create the Strided versions, no?

For 3, this is similar to the way I implemented fused rowwise-quantized Tensors. I added a new ElemKind::UInt8FusedQTy. When accessing elements of Tensors of this type you have to know that the last 8 columns of every row contain scales/offsets and proceed accordingly. Perhaps whatever design we land on would more elegantly handle these sorts of fused Tensors.
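
For reference, a rough sketch of what accessing such a fused row looks like, assuming (as described above) that the trailing 8 bytes of each row hold a per-row float scale followed by a float offset; the helper name is made up:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Dequantize element `col` of one row of a fused rowwise-quantized tensor.
// rowWidth is the full row width in bytes, including the trailing 8 bytes
// that hold the per-row float scale and float offset.
float dequantizeFusedElement(const uint8_t *rowData, size_t rowWidth,
                             size_t col) {
  float scale, offset;
  std::memcpy(&scale, rowData + rowWidth - 8, sizeof(float));
  std::memcpy(&offset, rowData + rowWidth - 4, sizeof(float));
  return float(rowData[col]) * scale + offset;
}
```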

opti-mix commented

@jfix71

For 1, do we want to use libjit for some backends other than the CPU Backend that require/prefer strides?
RE: raw(), I think most of its uses are just to avoid needing to worry about shapes, and so they would still work fine if we updated it to skip padding, like at(ArrayRef<size_t>) would.

Yes, they'd work. But we'd need to update a bunch of Interpreter kernels and virtually all libjit and OpenCL kernels would need to get a strides parameter for each of their argument tensors.
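
To illustrate the scale of that change, here is a sketch of a 2-D element-wise add in the libjit style, before and after adding strides; the signatures are hypothetical, not actual libjit entry points:

```cpp
#include <cstddef>

// Today: a packed kernel can treat the tensor as one flat array.
void add_packed(float *dst, const float *lhs, const float *rhs,
                size_t rows, size_t cols) {
  for (size_t i = 0; i < rows * cols; i++)
    dst[i] = lhs[i] + rhs[i];
}

// With option (1): every tensor argument gains a strides array, and the
// kernel computes linear offsets itself, so padding elements are never
// read or written.
void add_strided(float *dst, const float *lhs, const float *rhs,
                 size_t rows, size_t cols, const size_t dstStrides[2],
                 const size_t lhsStrides[2], const size_t rhsStrides[2]) {
  for (size_t r = 0; r < rows; r++)
    for (size_t c = 0; c < cols; c++)
      dst[r * dstStrides[0] + c * dstStrides[1]] =
          lhs[r * lhsStrides[0] + c * lhsStrides[1]] +
          rhs[r * rhsStrides[0] + c * rhsStrides[1]];
}
```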

And maybe we could rename raw() to something more representative, e.g. at(size_t).

Sounds reasonable.

For 2, we could always create a new StridedTensor class that all backends could share.

+1.

I agree it might not be great if we need to keep all unstrided Tensors around too, or copy everything over to the new format.

If you are talking about types, then StridedTensor would become the new Tensor, as it is strictly a superset of the current Tensor class: the current Tensor class is just a StridedTensor where all alignment requirements are 1, i.e. there are no alignment requirements.

But I was talking more about executing NN models.

Though I'm not sure how we avoid copying in option 1 either -- we would load Constants as non-strided and need to copy over all the data into the new layout once the backend specifies what its requirements are.

Right, we could e.g. transform all constants at compile-time into whatever memory layout is specified by the backend.

And I think we could ditch the unstrided Tensors in either case once we create the Strided versions, no?

This is the question. I guess if a backend with custom alignment requirements supports all the operations we can convert everything into the strided versions and then ditch the original unstrided tensors.

But if we need to use e.g. our current libjit or Interpreter implementations as a fallback for some operations, e.g. during the development of a new backend, we run into the problem that our existing implementations expect unstrided tensors. Thus we'd need to perform copies between strided and unstrided tensors at runtime, so that we can provide the tensors to those operations in the expected format.
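
For the 2-D case, those runtime copies would look roughly like this: pack into the padded layout before a backend-specific operation, unpack back to the plain layout before handing the tensor to a fallback kernel. Names are illustrative:

```cpp
#include <cstddef>
#include <cstring>

// Copy a packed row-major tensor into a layout whose rows are padded to
// paddedRowStride elements (the padding elements are left untouched).
void packRows(float *padded, const float *packed, size_t rows, size_t cols,
              size_t paddedRowStride) {
  for (size_t r = 0; r < rows; r++)
    std::memcpy(padded + r * paddedRowStride, packed + r * cols,
                cols * sizeof(float));
}

// Inverse copy: strip the padding so unstrided kernels see contiguous rows.
void unpackRows(float *packed, const float *padded, size_t rows, size_t cols,
                size_t paddedRowStride) {
  for (size_t r = 0; r < rows; r++)
    std::memcpy(packed + r * cols, padded + r * paddedRowStride,
                cols * sizeof(float));
}
```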

Of course, if we'd update all those operations to support alignments/striding as in (1), we wouldn't have this problem.

For 3, this is similar to the way I implemented fused rowwise-quantized Tensors. I added a new ElemKind::UInt8FusedQTy. When accessing elements of Tensors of this type you have to know that the last 8 columns of every row contain scales/offsets and proceed accordingly. Perhaps whatever design we land on would more elegantly handle these sorts of fused Tensors.

Yeah, (3) is pretty similar to (2), but it is much more dangerous IMHO, because we would use the same unstrided Tensor type for everything, yet have to remember whether any given instance of it is logically strided or not. This sounds like a recipe for disaster :-) And we cannot reuse our existing kernels on the strided versions. Thus, I think (3) is the least attractive alternative.

nadavrot commented

@opti-mix One way to view this is to consider packed tensors as the canonical representation, and aligned tensors as the lowered form. The packed version is close to the pure mathematical form. This is the form that you want to have when you are performing optimizations, such as fusing operations. Later in the compilation pipeline we can complicate things by introducing padding, alignment, or even distribution across multiple banks.

There are two ways to represent the aligned tensors in the type system: 1) add first-class support to everything, like you mentioned; 2) implement the semantics with specialized operators that know what to do with the pads. LLVM, btw, is closer to (1), where the operations (loads/stores) carry the alignment information.

Another example of the type-system question is what @jfix71 mentioned. We could have implemented the rowwise operators as a change to the type system, but instead decided to add custom operators and keep the tensors opaque (using the pointer terminology here).

Let's say that we want to support padded convolutions. One easy way to implement this is to manually pad some tensor by making it larger, and to create a special convolution that knows that only a certain window inside the tensor contains the "real" data. All of the low-level optimization passes that deal with copying memory don't need to know anything about the padding/alignment, because for them this is just memory being copied around. Only the operations that access the data need to know what is padding and what is data. We can transform the pure graph into this form during lowering.
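
A minimal sketch of that idea, assuming a hypothetical Window descriptor that records where the real data lives inside the larger buffer:

```cpp
#include <cstddef>

// Describes where the "real" data sits inside an over-allocated 2-D buffer.
struct Window {
  size_t rowOffset, colOffset; // origin of the real data in the buffer
  size_t rows, cols;           // logical (unpadded) extent
};

// Read element (r, c) of the logical tensor out of the padded buffer.
// Copy/memory passes can move the whole buffer around without knowing about
// the window; only element accessors like this one need to know about it.
float readLogical(const float *buf, size_t bufCols, const Window &w,
                  size_t r, size_t c) {
  return buf[(w.rowOffset + r) * bufCols + (w.colOffset + c)];
}
```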

opti-mix commented

I've spent more time thinking about these issues.

Here is what I came up with:

Different operations (e.g. convolutions, gemms, etc.) may require each of their inputs and outputs to use a specific memory layout, both in terms of alignments and the logical ordering of dimensions (e.g. NHWC or NCHW). Moreover, these requirements may depend on the properties of the operation, e.g. the dimensions of its operands. E.g. a convolution with a small filter may need its input operands in a format different from a convolution with a big filter.

Today, Glow core and custom backends implicitly hard-code this knowledge about the operations into (backend-specific) nodes and the code that works with them, i.e. the code that transforms Glow nodes, which are in the canonical NHWC format, into backend-specific nodes. This is pretty fragile and involves a lot of boilerplate code.

What if Glow could ask the backend, for each node, what the requirements are for each of the node's inputs and results in terms of alignments and ordering of dimensions?

  • The backend would return this information and Glow core could insert all the required layout transformations using a new node called LayoutTransform (maybe we could extend/subsume Transpose for these purposes).

  • This new LayoutTransform node could be applied to Constant weights at compile-time. It could also be optimized in almost the same ways as Glow optimizes Transpose nodes: e.g. multiple LayoutTransforms can be combined into one, opposite LayoutTransforms can eliminate each other, etc.

  • This way all of the functionality to insert the required layout transforms is handled by the Glow core, which removes a lot of code duplication from backends. It also allows for better optimizations as the optimizer would understand the semantics of LayoutTransform. And it also allows for better verification, as Glow can now check that layouts of all inputs/outputs of each node satisfy the requirements of this operation as reported by backends.

IMHO, this approach seems to solve this common issue in a pretty clean and principled way.
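
A sketch of what the backend query could look like; all names here (TensorLayout, getInputLayoutRequirements, and the LayoutTransform insertion pass outlined in the trailing comment) are hypothetical, not existing Glow API:

```cpp
#include <cstddef>
#include <vector>

class Node; // Glow IR node (opaque here)

// A layout is an ordering of dimensions plus per-dimension alignments.
struct TensorLayout {
  std::vector<unsigned> dimOrder; // e.g. {0, 1, 2, 3} for NHWC
  std::vector<size_t> alignments; // per-dimension alignment, in elements
  bool operator==(const TensorLayout &o) const {
    return dimOrder == o.dimOrder && alignments == o.alignments;
  }
};

class Backend {
public:
  virtual ~Backend() = default;
  // For a given node, report the layout each input must use. Requirements
  // may depend on the node's properties (e.g. filter size).
  virtual std::vector<TensorLayout>
  getInputLayoutRequirements(const Node *N) const = 0;
  // Likewise for each result of the node.
  virtual std::vector<TensorLayout>
  getResultLayoutRequirements(const Node *N) const = 0;
};

// Core pass (outline): for every input whose current layout differs from
// the backend's requirement, insert a LayoutTransform node in between.
// Adjacent/opposite LayoutTransforms can then be merged or eliminated,
// much like Transpose nodes are optimized today.
```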

nadavrot commented

Is LayoutTransform similar to Transpose?

opti-mix commented

Is LayoutTransform similar to Transpose?

Yes. The overall idea and purpose are very similar, but LayoutTransform is a bit more general. That's why I wrote in my previous comment that "maybe we could extend/subsume Transpose for these purposes."
