
Need to redesign the Variable interface. #1334


Closed
nadavrot opened this issue Jul 26, 2018 · 17 comments

Comments

@nadavrot
Contributor

nadavrot commented Jul 26, 2018

Motivation

Currently Glow variables are backed by Tensors, and are tied to the graph. This gives us the ability to perform optimizations such as constant propagation that mutate the content of Tensors. However, it also ties the input/output tensors to the module. This design prevents us from running multithreaded programs simply because the input/output variables (which need to be different for each thread) are a single entity.

Proposed design

The proposed solution is to separate the concept of symbolic variables from graph variables. For example, an input variable X will not be backed by a Tensor until the execution-engine executes the compiled program. To execute the network the caller would need to bind Tensors to variables. The execution engine actually already does this, so we won't need to change the APIs that are used by our unit tests.

To preserve the ability to perform constant propagation (and other tensor-content optimizations) we would need to be able to make some variables concrete during the compilation phase. We would need to have two kinds of variables:

Constant - this is a tensor-backed variable. Similar to LLVM's ConstantArray, it is a data structure that lives at the module level and can be manipulated by transformations.

Variable - this is a typed variable that isn't backed by a real tensor until runtime. The runtime allocates the tensor (either in device memory or on the host) and manages the input/output to and from the variable.
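To make the distinction concrete, here is a minimal sketch of the two kinds of storage; the class layouts and member names are illustrative assumptions (and the sketch assumes Glow's Tensor and TypeRef types), not the final API:

  #include <utility>

  // Sketch only: a Constant is backed by a real tensor at compile time, so
  // transformations such as constant propagation may read and rewrite it.
  class Constant {
    Tensor payload_;
  public:
    explicit Constant(Tensor t) : payload_(std::move(t)) {}
    Tensor &getPayload() { return payload_; }
  };

  // Sketch only: a Variable carries just a type; the runtime allocates the
  // backing tensor (in device or host memory) and manages input/output.
  class Variable {
    TypeRef type_;
  public:
    explicit Variable(TypeRef ty) : type_(ty) {}
    TypeRef getType() const { return type_; }
  };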

@stoklund
Contributor

This sounds great! A couple comments:

The name ConstantVariable is a bit of an oxymoron. A constant is something that doesn't change, and a variable is something that does change. I would suggest Constant and Variable instead.

Second, I think we should also look at separating constants from the module. The module represents the compiler's intermediate representation, but right now it isn't intermediate at all — the constants in the module are needed at runtime, so you have to keep it around.

More generally, we should make sure we are economical with memory management in a production environment. I imagine something like:

  • The Module only exists during compilation, then it is deleted. We don't need the IR any more.
  • After compilation, constants that are needed at runtime are moved into a constant pool. Any constants that are not needed at runtime get deleted along with the module.
  • A constant pool can be shared by multiple runtime variable bindings, and also by multiple compiled functions. (e.g., compiled functions representing different batch sizes of the same graph).
  • For constants we didn't need to modify during compilation, this should all happen without any copies of tensor data.
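As a rough sketch of that shape (the names ConstantPool, CompiledArtifact, and RuntimeBindings are hypothetical; nothing here exists in Glow yet):

  #include <memory>
  #include <string>
  #include <unordered_map>

  // Hypothetical sketch: the pool of runtime constants outlives the Module
  // and is shared, not copied.
  struct ConstantPool {
    std::unordered_map<std::string, Tensor> constants; // weights needed at runtime
  };

  // A compiled function keeps the pool alive but owns no IR.
  struct CompiledArtifact {
    std::shared_ptr<ConstantPool> constants;
    // ... generated code / device program ...
  };

  // Each thread gets its own bindings for inputs/outputs, while sharing the
  // same constant pool (and possibly the same CompiledArtifact).
  struct RuntimeBindings {
    std::shared_ptr<ConstantPool> constants;
    std::unordered_map<std::string, Tensor> placeholders;
  };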

@bertmaher
Contributor

This proposal sounds great to me. A closely-related issue for multithreaded programs is where we allocate activations. We should hopefully be able to use the same mechanism for RuntimeVariables and activations.

@nadavrot
Contributor Author

@stoklund Thank you for the feedback.

I like your naming suggestion for Constant and Variable. I will update the proposal.

How do you envision separating the Constant tensors from the module? If we use the LLVM terminology, Constants are like LLVM's ConstantArray: large chunks of data that live inside the module. We need to be able to mutate some tensors during constant propagation. Our AOT compiler can already serialize variables/constants and create standalone bundles. Our JIT compiler references the module tensors to save a copy and reduce memory use.

I think that the economical memory management that you envision (and I agree with this vision, btw) is tied to the concept of a compiled function, which we are currently missing.

@stoklund
Contributor

@nadavrot We need the constants at runtime, and the variables are actually needed too, as names to bind against. I imagine splitting a Context out of the Module:

  • Context owns constants and variables. It outlives both modules and compiled functions.
  • Module owns functions and references the context where variables and constants live.
  • CompiledFunction references the context, not the module.
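In code, that layering might look roughly like this (a sketch of the proposed ownership; the members are illustrative):

  #include <memory>
  #include <vector>

  // Sketch: the Context owns the long-lived objects; Module and
  // CompiledFunction only point back at it.
  class Context {
    std::vector<std::unique_ptr<Constant>> constants_;
    std::vector<std::unique_ptr<Variable>> variables_;
  };

  class Module {
    Context *ctx_; // not owned; constants and variables live in the Context
    std::vector<std::unique_ptr<Function>> functions_;
  };

  class CompiledFunction {
    Context *ctx_; // not owned; the Module can be deleted after compilation
    // ... backend-specific executable state ...
  };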

But actually, this is something we can do after separating variables and constants. I don't think we have to implement it at the same time as this issue.

@bertmaher The memory for variables can be provided from the outside. See for example onnxSetGraphIO. I think we should allocate activation memory when binding variables. Some accelerators will want to statically allocate activation memory, though.

@nadavrot
Contributor Author

nadavrot commented Aug 3, 2018

Update: we had an offline discussion about this proposal. At the moment, Glow Variables are modeled after TensorFlow's variables, which own the underlying storage. Some variables must be bound (for example, constants) while others must be unbound (for example, input images to the network). We are missing TensorFlow's concept of a "placeholder". There is consensus that we should implement something similar to "TF.placeholder". We did not reach a decision about the mutability of bound variables. One option is to keep the TF model that allows mutable bound variables. The other option is to disallow mutable bound variables. Mutable variables do not make sense in the context of multithreading, but they are very useful for testing/debugging/profiling the code. @artemrakhov pointed out that it's important for us to be able to write tests easily without a complicated setup and mentioned the difficulty of listing out the LSTM variables. @stoklund prefers to abolish mutable bound variables. @opti-mix mentioned that bound and unbound variables need to inherit from Storage.

@qcolombet mentioned that XLA does not have a concept of Variables and that they have a high-level compiler that lowers the high-level Variable concept to the low-level dataflow XLA graph (see tensorflow/tensorflow/compiler/tf2xla/). This transformation may fail.

Another topic: @stoklund pointed out that it's a good idea to separate the IR from the context/variables. I added a unit test that ensures that after compilation we can erase all of the functions and the code still runs correctly (on all tested backends). One possible next step would be to move the IRFunction from the execution engine and into the backend. @opti-mix raised the point that we don't want every backend to invent its own low-level IR, so having a single formal low-level IR is a good idea. Moving IRFunction into the backend (and possibly into the CompiledFunction class) will clarify ownership and help us implement better caching of functions.

We also talked about extracting the IR-builder parts of the graph from the Graph into a new namespace/class. I am sure that I missed a few topics.

@opti-mix
Contributor

opti-mix commented Aug 3, 2018

One possible next step would be to move the IRFunction from the execution engine and into the backend. @opti-mix raised the point that we don't want every backend to invent its own low-level IR, so having a single formal low-level IR is a good idea. Moving IRFunction into the backend (and possibly into the CompiledFunction class) will clarify ownership and help us implement better caching of functions.

@nadavrot In fact, after the recent refactorings of the Backend interface, ExecutionEngine does not own the IRFunction. Instead, it provides the following function, which is used to create an IRFunction:
std::unique_ptr<IRFunction> generateIR(CompilationMode mode, Function *F);

But ExecutionEngine owns the CompiledFunction that it needs to execute.

So, it seems like your comment regarding the IRFunction being moved from ExecutionEngine into a Backend/CompiledFunction is mostly implemented.

Or do you mean something besides what we have now?

@nadavrot
Contributor Author

nadavrot commented Aug 3, 2018

@opti-mix Oh, you are right. What do you think about changing the API so that it does not expose IRFunction?

Backend.h:
compile(std::unique_ptr<IRFunction> IR) const = 0; // change this to: compile(Function *F);

@opti-mix
Contributor

opti-mix commented Aug 3, 2018

@nadavrot ExecutionEngine currently uses the IRFunction type in the signature of a private function:
std::unique_ptr<IRFunction> generateIR(CompilationMode mode, Function *F);
thus it is an implementation detail. I'm pretty sure we can remove the reference to IRFunction from ExecutionEngine.h.

We could make generateIR a helper function that could be used by backends if they need to produce the IR.

And backends should have the following compile method:
std::unique_ptr<CompiledFunction> compile(CompilationMode mode, Function *F) const = 0;
instead of
std::unique_ptr<CompiledFunction> compile(std::unique_ptr<IRFunction> IR) const = 0;

If a backend wants to produce the IR, it can use generateIR. If not, it can work directly on the graph representation.
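To make the shape of that change concrete, a backend under the proposed interface might look roughly like the following; only the two signatures quoted above come from this thread, and the rest (including compileIR) is a hypothetical sketch:

  // Sketch: the graph Function crosses the Backend API; the low-level IR is
  // an internal choice of the backend.
  class MyBackend : public Backend {
  public:
    std::unique_ptr<CompiledFunction>
    compile(CompilationMode mode, Function *F) const override {
      // A backend that wants the low-level IR builds it via the helper:
      std::unique_ptr<IRFunction> IR = generateIR(mode, F);
      return compileIR(std::move(IR)); // hypothetical backend-private step
    }
  };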

@nadavrot
Contributor Author

nadavrot commented Aug 8, 2018

Plan update

This is a quick summary of the plan for implementing the Variable refactoring. The posts above explain the motivation for eliminating the graph-owned bound mutable variables. We'll eliminate these variables in steps. Some parts of the plan are clear, but we still have a few open design questions.

Step 1

PR #1365 separated the initialization of tensors from the Variable. This PR extracted the code that performs Xavier and other kinds of initialization and moved it to the caller. This allowed us to start working on the refactoring of variables.

Step 2

The next step would be to implement Placeholders, which are unbound variables. The name Placeholder is borrowed from TensorFlow, and just like in TensorFlow, the content of inputs and outputs is copied into and out of the tensors before and after the execution of the program. The execution engine is responsible for binding the inputs/outputs. The execution engine interface already supports this kind of binding: the run method accepts a list of variables and a list of tensors to bind. PR #1409 is the first step toward implementing placeholders. Future PRs will implement the missing functionality (as explained in the body of the PR). After this change lands we will be able to migrate many of the simple tests (such as learn-xor) and all of the ONNX-loader tests to a pure Placeholder mode.
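For illustration, a test using placeholders might bind tensors at run time roughly like this (a sketch; the exact builder and run signatures are assumptions about the API being built, see PR #1409 for the real one):

  // Sketch: the placeholder is only a typed symbol in the graph; the caller
  // supplies the backing tensor when running the compiled function.
  auto *input = mod.createPlaceholder(ElemKind::FloatTy, {1, 28, 28, 1},
                                      "input", /* isTrainable */ false);
  Tensor inputData(ElemKind::FloatTy, {1, 28, 28, 1});
  // ... fill inputData with an image ...
  EE.compile(CompilationMode::Infer, F);
  EE.run({input}, {&inputData}); // bind the tensor to the placeholder, then execute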

Step 3

Step 2 got rid of the input/output tensors and removed most of the mutable bound variables from the test suite. However, we will still have many mutable variables that are defined by the graph builder. We can't easily get rid of these variables and replace them with placeholders (and backing tensors) because the graph builder also initializes the variables. We will need to design a system that allows the graph builder to create new variables/placeholders, register their initialization kind, and allocate backing tensors for them. The challenge here is to make sure that our unit tests are not too verbose and that it's easy to add new unit tests to the compiler.
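One possible shape for such a system, sketched below with hypothetical names (the helper itself, allocate, and initXavier are assumptions, not existing APIs):

  // Hypothetical helper: create a placeholder, allocate its backing tensor in
  // the context, and record how it should be initialized, so that unit tests
  // stay one line per weight.
  Placeholder *createInitializedPlaceholder(Module &mod, Context &ctx,
                                            TypeRef ty, llvm::StringRef name) {
    Placeholder *PH = mod.createPlaceholder(ty, name, /* isTrainable */ true);
    Tensor *backing = ctx.allocate(PH); // backing storage lives in the context
    initXavier(*backing);               // registered/performed initialization
    return PH;
  }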

@qcolombet
Contributor

Not necessarily a good thing for non-verbosity in the tests, but XLA requires the runtime to supply the storage/tensor for all inputs, outputs and most temporaries. In other words, the address of the storage for the "variables" is always supplied at compiled time.

@qcolombet
Contributor

is always supplied at compiled time.

  • I meant runtime!!

nadavrot added a commit to nadavrot/glow that referenced this issue Sep 5, 2018
…s that we can construct the graph.

This commit is the first few steps in the implementation of the
Placeholder node. The plan is discussed in pytorch#1334. Placeholders are
unbound variables whose tensor content is provided by the
execution engine or the AOT compiler at runtime.

This commit introduces Storage as a superclass for Variable (bound) and
Placeholder (unbound). This PR also adds a simple unit test that checks
that we can create new Placeholder nodes. This PR does not implement any
of the infrastructure that's required to do anything useful with
Placeholder variables, such as loading content into them, deleting them,
the dotty pretty-printer, etc.

Next steps (not necessarily in any specific order):

1. Teach the execution engine to bind tensors to the Placeholder nodes.

2. Verify that dotty printing, dump() and debugging work well.

3. Cleanup the APIs that are related to Variable and Placeholder and
make them consistent.

4. Change (some of) the unit tests to use the new Placeholder API.

5. Make sure that our optimizations are correct when placeholders are
used.
nadavrot added a commit that referenced this issue Sep 6, 2018
@nadavrot
Contributor Author

nadavrot commented Sep 6, 2018

I landed #1409 which introduced the Placeholder class. I also added a comment in the docs [1] that describes the upcoming changes. I intend to submit a patch to update the execution engine soon.

[1] - https://github.com/pytorch/glow/blob/master/docs/IR.md

@nadavrot
Contributor Author

After landing #1612 we can now use Placeholder variables to execute meaningful programs. This is a demo that shows what MNIST looks like without any mutable variables, only placeholders:

#1622

The next steps would be to fix a few local bugs, then update all of the graph builder APIs to include the context. After that we'll need to update all of the unit tests. Finally, we'll need to get rid of Storage and turn Variable into a Constant.

@nadavrot
Contributor Author

Status Update

The compiler now supports Placeholder variables. I wrote a few basic unit tests and ported the MNIST program #1622, as an experiment. I am sure that there are some bugs and missed optimizations, but these should be relatively easy to fix as we migrate and delete the code that handles the mutable variables.

The next step in the migration would be to rewrite all of our unit tests to use Placeholders instead of Variables. At the moment we have around 600 locations that call into createVariable(). I plan to migrate the tests to Placeholders one by one. Porting everything in one big commit is not a good idea because it would discourage code review and increase the chance of merge conflicts. I suggest adding new builder functions that duplicate some of the builder methods that create variables. Once we finish the migration we'll delete the old builder methods. PR #1622 demonstrates the duplicated builder API. I intend to implement something similar, just cleaner and better documented.
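As a rough illustration of what the duplicated builder API could look like in a migrated test (the names and argument lists here are assumptions; PR #1622 shows the real prototype):

  // Old style: the variable owns and initializes its own tensor.
  auto *oldW = mod.createVariable(ElemKind::FloatTy, {10, 10}, "weights");

  // New style: the placeholder is just a typed symbol; the test allocates and
  // initializes the backing tensor through the context.
  auto *newW = mod.createPlaceholder(ElemKind::FloatTy, {10, 10}, "weights",
                                     /* isTrainable */ true);
  ctx.allocate(newW)->getHandle().randomize(-1.0, 1.0, mod.getPRNG());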

Once we migrate all of the tests to using Placeholders we can simply limit SaveNode to saving into Placeholders, and rename Variable to Constant. This will finish the migration.

Other open items:

  • The differentiation method is tied to the grad-check unit test so we'll need to port both together.
  • By default, weights will be stored as unoptimizable Placeholders (in the context). We'll need to write a method that converts the non-IO Placeholders into Constants to allow the optimizer to optimize them (see the sketch after this list).
  • We'll need to clean up some of the APIs. During the migration process I've focused on reducing code changes, but the APIs for creating and managing the context should be modernized.
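A hypothetical sketch of that conversion step (all names here are illustrative, not a committed API):

  // Sketch: every placeholder that is not a network input/output becomes a
  // Constant backed by its bound tensor, so the optimizer may fold it.
  void convertNonIOPlaceholders(Module &mod, Context &ctx,
                                const std::set<Placeholder *> &io) {
    for (Placeholder *PH : mod.getPlaceholders()) {
      if (io.count(PH))
        continue; // inputs/outputs must stay unbound placeholders
      Tensor *backing = ctx.get(PH); // the weight tensor bound in the context
      Constant *C = mod.createConstant(PH->getName(), backing->clone());
      PH->getOutput().replaceAllUsesOfWith(C->getOutput());
    }
  }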

@nadavrot
Contributor Author

nadavrot commented Oct 6, 2018

Status update

The recent PRs #1808 and #1807 removed the old Variable. Variable was renamed to Constant, and the SaveNode can't write into it. The next step in this refactoring is to fix the APIs that became awkward during the rewrite and to update the docs.

@nadavrot
Contributor Author

This work is now done. We've removed all mutable variables and replaced them with Placeholder and Constant.

@jackm321
Contributor

jackm321 commented Dec 5, 2018

It looks like there are still some graph API methods that use the context and probably shouldn't: https://github.com/pytorch/glow/blob/master/include/glow/Graph/Graph.h#L664-L670. Is there a task to clean those up?
