Teach Glow how to support layout requirements and fix uncovered bugs. #3503

Closed
wants to merge 6 commits from the tesnor_layout branch

Conversation

@shajrawi (Contributor):

Summary:

Note: I did not want to break the `enum ConvolutionLayout` introduced in 5074a72. As such, I used it in the verifier and did not change the creation of said nodes.
HOWEVER: We should use the more generic string-based layout, which I introduce for the Transpose node in this commit: it is basically an extendable enum that can be used in the backends without touching the generic code base. As a bonus, it makes differentiation easier: see how it is done for transpose now in `Function *glow::differentiate`.

Getting rid of said enum is a proposed TODO / follow-up.

Also note that some nodes *need* layout requirements, which have been added: namely, we need to know the layout for placeholders and constants (obviously) and for reshapes (in case we optimized a transpose into a reshape).

An additional nice-to-have feature of the string-based layout is the wildcard / any-layout option. Some operations, such as data parallel nodes, might accept any layout.
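
To make this concrete, here is a minimal sketch of how the string-based layout looks at node-creation time (the shapes, variable names `mod`/`F`, and node names are illustrative; the layout parameter on `createTranspose` and the `ANY_LAYOUT` / `"*"` wildcard are the pieces this PR adds):

```cpp
// Sketch: a 4-D NHWC placeholder transposed to NCHW, with the layout
// carried as a plain string instead of a fixed enum.
auto *input = mod.createPlaceholder(ElemKind::FloatTy, {1, 28, 28, 3},
                                    "input", /* isTrainable */ false);
// NHWC -> NCHW shuffle; the string documents the result's layout.
auto *toNCHW = F->createTranspose("to_nchw", input, {0u, 3u, 1u, 2u}, "NCHW");
// Data parallel nodes such as Relu accept any layout, i.e. the "*" wildcard.
auto *relu = F->createRELU("relu", toNCHW);
```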

A potential follow-up is to create a "Solver" that automatically inserts transposes if the layouts do not match. This might greatly simplify the loader: we would no longer need to insert transposes based on whether we are importing NHWC or NCHW (for example). We would just need to annotate the placeholder with the layout information we get at load time, which we currently "forget" afterwards.
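
As a rough illustration of that idea, a solver pass could look something like the sketch below. The helpers `isSatisfiedBy` and `computeShuffle` are assumed names, not part of this PR; the layout-requirement queries are the ones the PR introduces:

```cpp
#include "glow/Graph/Graph.h" // assumed include for Function/Node/NodeValue

// Hypothetical layout solver: whenever a producer's layout does not satisfy
// the consumer's requirement, splice in a transpose that reconciles them.
static void solveLayouts(glow::Function *F) {
  for (auto &node : F->getNodes()) {
    for (unsigned i = 0, e = node.getNumInputs(); i < e; ++i) {
      glow::NodeValue in = node.getNthInput(i);
      std::string have = CanonicalTensorLayout::getInstance()
                             .getNthResultLayoutRequirements(in.getNode(),
                                                             in.getResNo());
      std::string want = CanonicalTensorLayout::getInstance()
                             .getNthInputLayoutRequirements(&node, i);
      if (!isSatisfiedBy(have, want)) {           // assumed compatibility check
        auto shuffle = computeShuffle(have, want); // assumed helper
        node.setNthInput(i, F->createTranspose("solver.transpose", in,
                                               shuffle, want));
      }
    }
  }
}
```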

The verifier is useful even without said solver: it exposed a couple of bugs, which are mentioned in this commit. As such, a solver is not a must-have to demonstrate the usefulness of this commit.

Fixes #3452

Also Fixes #3493 and Fixes #3500, GraphOptimizer bugs which were found after adding the layout verifier.

Provides a workaround for the #3499 issue, which was also found via the verifier.

Test Plan:
`ninja test`

@opti-mix (Contributor) left a comment:

@shajrawi Very nice! I'm glad to see this feature being implemented!

I've done the first quick round of reviewing. Please find my comments below.

@@ -1788,6 +1789,7 @@ TEST(Graph, testDumpStructure) {
std::string mesN = K->toString();
std::string expectMes = R"(Placeholder
name : "input"
layout : *
Contributor:

`*` is used to express a layout with any number of dimensions and any names? Have you considered using e.g. `****` for 4-D layouts? Would it make anything easier or more complex compared to your approach?

Contributor Author:

More complex. Currently we have `*` as the default value when constructing the placeholder. We would need to change that and/or modify the creation method to count the number of dimensions; same for constants. We can usually figure the layout out from `*`, but I am open to making this change if we decide it is best.

Contributor:

I don't have a strong opinion here. My main concern was that if, e.g., two tensors have a different number of dimensions, and one of them has the `*` layout while the other has a more concrete layout like NCHW, we should be able to handle this correctly in our layout-compatibility checking logic. As long as that is the case, I'm fine with this short way of designating the any-layout.
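
A minimal sketch of the compatibility rule being discussed (assumed semantics only, and ignoring any per-dimension attribute syntax; not the exact implementation):

```cpp
#include "llvm/ADT/StringRef.h"

// "*" as a whole layout is compatible with anything, regardless of rank;
// otherwise the layouts must agree dimension by dimension, with a
// per-dimension '*' matching any dimension name.
static bool isCompatible(llvm::StringRef lhs, llvm::StringRef rhs) {
  if (lhs == "*" || rhs == "*") {
    return true;
  }
  if (lhs.size() != rhs.size()) {
    return false;
  }
  for (size_t i = 0, e = lhs.size(); i < e; ++i) {
    if (lhs[i] != rhs[i] && lhs[i] != '*' && rhs[i] != '*') {
      return false;
    }
  }
  return true;
}
```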

@shajrawi force-pushed the tesnor_layout branch 3 times, most recently from 21b1504 to 18ca0e3 (September 12, 2019)
@shajrawi (Contributor Author):

Thanks for the initial review @opti-mix! I addressed your comments and fixed the CMake / shared library issue.

@opti-mix (Contributor) left a comment:

Thanks for the quick iterations. Below are some further comments.

In particular, the GLOW_DEFAULT_4D_LAYOUT change is a bit controversial as it turns out. What do you think about it? Should we undo it completely? Or should we just use GLOW_DEFAULT_4D_LAYOUT in fewer places? Maybe you have a better idea?

@nadavrot (Contributor):

@shajrawi Could you please add a section to /docs that describes the tensor layout proposal?

@shajrawi (Contributor Author):

I've amended the PR to address the review, including adding a small description in docs/Backends, but I can work on creating a separate document with the proposal and/or expanding the section in said MD file.

@@ -299,6 +299,50 @@ llvm::raw_ostream &operator<<(llvm::raw_ostream &os, const Node *node) {
node->dump(os);
return os;
}

bool isDataParallel(const Node *node) {
Contributor:

Long term we probably want to support the isDataParallel predicate in NodeGen, just like we do for InstrGen, because we may need to support it for backend-specific nodes and this approach would not work properly in such cases. If you don't want to submit a small PR for it right now, please file an issue about it.

Contributor Author:

Will do.

@nadavrot (Contributor) commented Sep 13, 2019:

@shajrawi Joe, I don't yet understand the proposal. At the moment we have docs in /docs that describe the IR, the Graph, and how they work. This PR looks like a major change to the type system, but it does not explain why it exists and how it works. Questions: Is this an optional addition to the IR? Is this metadata, or does it have a functional use? Is this required for correctness? Is the graph always in a legal state? None of these things are explained in the PR or in the docs section.

Could you please start with a short document that explains the motivation?

@shajrawi force-pushed the tesnor_layout branch 2 times, most recently from 113400b to 2e802e0 (September 13, 2019)
@shajrawi (Contributor Author):

@nadavrot I amended the PR with a document that explains the motivation, interface, related work, and future directions. See docs/TensorLayout.md.

@chadrosier, @pfierek, @tlepley-cadence, @mciprian13, and @vuzelac-cadence: I would appreciate any feedback you might have on this document and the changes it proposes to make the life of custom backends easier.

@shajrawi force-pushed the tesnor_layout branch 3 times, most recently from efb06fa to 9bd5dca (September 13, 2019)
@shajrawi force-pushed the tesnor_layout branch 4 times, most recently from 05a6ff6 to edd80ad (September 16, 2019)
@shajrawi force-pushed the tesnor_layout branch 2 times, most recently from 12f2bf9 to cd0c215 (November 8, 2019)

@jfix71 (Contributor) left a comment:

@shajrawi Seems like it's pretty flexible, and so should be useful and usable by other backends. Added some comments, nothing major.


1. A mandatory one character representing the current dimension. Either an alphabetic letter or `*` (any layout).
2. An optional token for the start of the current dimension's information: `[`.
3. An optional namespace identifier for non-standard information, such as tiling, followed by `:`. Must have `[` from 2. in place. following said identifier, all subsequent data is considered as a "black box" until `]` is encountered.
Contributor:

nit: capitalize "following"

Contributor Author:

done
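
For reference, a few illustrative strings that conform to the grammar quoted above (the namespaced payload is a made-up example, not any real backend's syntax):

```cpp
// Illustrative layout strings for the grammar above:
const char *plain    = "NHWC";                // four named dimensions
const char *extended = "N[custom:tile=4]HWC"; // "custom:" black box until ']'
const char *anyDim   = "*";                   // wildcard / any layout
```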

@@ -1376,6 +1397,8 @@ llvm::raw_ostream &operator<<(llvm::raw_ostream &os, const Function &F);

llvm::raw_ostream &operator<<(llvm::raw_ostream &os, const Function *F);

#define NHWC2HWNC \
Contributor:

nit: why is this one not colocated with the others?

Contributor Author:

done

}
// Dynamically form the layout description for transposes.
auto input = TN->getInput();
auto inputLayout = getNthInputLayoutRequirements(node, 0);
Contributor:

Without thinking a ton about this, why is it safe to always use the 0th input here?

Contributor Author:

Because there's only one input for transpose?

Contributor:

Ah yeah, I misread it -- in that case I'd use TransposeNode::InputIdx here.

Contributor Author:

No worries - good suggestion: changed that for clarity :)
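
The resulting one-liner, assuming the generated `TransposeNode::InputIdx` index constant:

```cpp
// Named index constant instead of a literal 0:
auto inputLayout = getNthInputLayoutRequirements(node, TransposeNode::InputIdx);
```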

CN->getInputs().front().getNode()->getName(), newCN,
CN->getResult().dims(),
CanonicalTensorLayout::getInstance().getNthResultLayoutRequirements(CN, 0));
Contributor:

nit: ConcatNode::ResultIdx

Contributor Author:

done


auto *input =
mod_.createPlaceholder(ElemKind::FloatTy, {1, 3, 3, 1}, "input", false);
auto IH = bindings_.allocate(input)->getHandle();
Contributor:

nit: I don't think we need the bindings_ or need to allocate or set any values in this whole file, since we're not running anything, right?

@@ -684,12 +684,14 @@ int main(int argc, char **argv) {
BB.newNode("Reshape")
.addInput("Input")
.addMember(MemberType::VectorSizeT, "Dims")
.addMember(MemberType::String, "Layout")
Contributor:

nit: no big deal, but we could special case this to .addLayout() or something in NodeBuilder as it will be the same across all Nodes.

Contributor Author:

Our plan is to move the layout information from nodes to Glow types. This can be done in a follow-up commit, even as a first task for someone motivated: during an operator's construction it would propagate the layout string (a pointer into a global hash of strings, to save on size), so we would not add a member called layout.
Said follow-up is straightforward, just a lot of manual labor, if we are fine with the design - no point in doing something fancy here :)
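
For the record, a sketch of the suggested (but, per the above, not adopted) sugar, assuming it would simply wrap the shared member definition inside NodeBuilder:

```cpp
// Hypothetical NodeBuilder helper: every layout-aware node would share the
// same string member, so the builder call could collapse to one name.
NodeBuilder &addLayout() {
  return addMember(MemberType::String, "Layout");
}

// Usage in NodeGen would then read:
BB.newNode("Reshape")
    .addInput("Input")
    .addMember(MemberType::VectorSizeT, "Dims")
    .addLayout();
```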


TransposeNode *createTranspose(llvm::StringRef name, NodeValue input,
llvm::ArrayRef<unsigned_t> shuffle);
llvm::ArrayRef<unsigned_t> shuffle,
const std::string &layout = ANY_LAYOUT);
Contributor:

nit: seems like it's inconsistent whether you're using llvm::StringRef layout (above) vs. const std::string &layout (here)? No big deal but I'd say pick one and stick with it.

lastWordIdx =
F_->createReshape("decoder.reshape", topK->getIndices(), {batchSize_});
lastWordIdx = F_->createReshape("decoder.reshape", topK->getIndices(),
{batchSize_}, "N");
Contributor:

For my own understanding: "N" is just a descriptor that you chose to use here because of batchSize_, right? We aren't really going to use this anywhere in here. And it doesn't seem like it will make much of a difference for verification here, right? I'm wondering why you are using it here in general.

Contributor Author:

Currently we don't make extensive use of named tensors, so "N" is probably not mandatory for the current backends and current verifications. But let's say we add a different 1-D layout, for example a private backend adds one during lowering, and it is NOT "N": then the verifier should fail.

outputs.push_back(lastWordIdx);
}

Node *concat = F_->createConcat("decoder.output.concat", outputs, 0);
Node *reshape = F_->createReshape("decoder.output.reshape", concat,
{MAX_LENGTH, batchSize_});
{MAX_LENGTH, batchSize_}, "*");
Contributor:

Curious why you've added "*" in many places, since it's already the default? Also wondering about the benefit of ANY_LAYOUT if we're only going to use it sometimes. Unless there's a reason I'm missing, I'd suggest either always or never using it; otherwise it creates confusion about why they're both being used.

Contributor Author:

good catch - fixed


@shajrawi (Contributor Author) commented Nov 9, 2019:

Thanks @jfix71 ! I added a new commit to address the nits


- This function checks if `ty` satisfies `destLayout` layout requirements, if `srcLayout` is provided for `ty`, take that into account.

- `virtual std::array<TensorLayoutDescription, max_tensor_dimensions + 1> &getLayoutsForDims() const`
Contributor:

The return type needs to be updated to return ArrayRef, after you changed the signature.

Contributor Author:

Good catch - done! (I already changed the signature and used makeArrayRef.)
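
The corrected documentation entry would then describe something like the following (assumed final signature, per the review comment above):

```cpp
virtual llvm::ArrayRef<TensorLayoutDescription> getLayoutsForDims() const;
```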


@opti-mix (Contributor):

@shajrawi Looking good! I think it is ready to be merged.

@opti-mix (Contributor):

@jfix71 Do you have any further comments or are you OK with merging it?

@jfix71 (Contributor) left a comment:

Good by me!

@facebook-github-bot:

@shajrawi merged this pull request in ec46f24.

pjaaskel added a commit to parmance/glow that referenced this pull request Dec 7, 2019
vdantu pushed a commit to vdantu/glow that referenced this pull request Jul 12, 2020
Summary:
Cherry-pick of the isDataParallel change from pytorch#3503: it is an independent feature that we can merge, per the offline discussion with opti-mix.
Pull Request resolved: pytorch#3715

Test Plan: ninja test

Differential Revision: D18263215

Pulled By: shajrawi

fbshipit-source-id: 52947ba5419c55eaf76048411d09a40a862fda1f
vdantu pushed a commit to vdantu/glow that referenced this pull request Jul 12, 2020
…pytorch#3503)

Summary: (same as the PR description above)
Pull Request resolved: pytorch#3503

Test Plan: `ninja test`

Differential Revision: D18357369

Pulled By: shajrawi

fbshipit-source-id: 45f91fbe120b234c2a85879cee9ee0de6c100b50