
Need to redesign the Variable interface. #1334


Closed
nadavrot opened this issue Jul 26, 2018 · 17 comments

Comments

@nadavrot
Contributor

nadavrot commented Jul 26, 2018

Motivation

Currently Glow variables are backed by Tensors, and are tied to the graph. This gives us the ability to perform optimizations such as constant propagation that mutate the content of Tensors. However, it also ties the input/output tensors to the module. This design prevents us from running multithreaded programs simply because the input/output variables (which need to be different for each thread) are a single entity.

Proposed design

The proposed solution is to separate the concept of symbolic variables from graph variables. For example, an input variable X will not be backed by a Tensor until the execution-engine executes the compiled program. To execute the network the caller would need to bind Tensors to variables. The execution engine actually already does this, so we won't need to change the APIs that are used by our unit tests.

To preserve the ability to perform constant propagation (and other tensor-content optimizations) we would need to be able to make some variables concrete during the compilation phase. We would need to have two kinds of variables:

Constant - this is a tensor-backed variable. Similar to LLVM's ConstantArray, it is a data structure that lives at the module level and can be manipulated by transformations.

Variable - this is a typed variable that isn't backed by a real tensor until runtime. The runtime allocates the tensor (either in device memory or on the host) and manages the input/output to and from the variable.
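To make the distinction concrete, here is a minimal sketch of the two kinds of storage; the class layouts and member names are illustrative assumptions (and the sketch assumes Glow's Tensor and TypeRef types), not the final API:

  #include <utility>

  // Sketch only: a Constant is backed by a real tensor at compile time, so
  // transformations such as constant propagation may read and rewrite it.
  class Constant {
    Tensor payload_;
  public:
    explicit Constant(Tensor t) : payload_(std::move(t)) {}
    Tensor &getPayload() { return payload_; }
  };

  // Sketch only: a Variable carries just a type; the runtime allocates the
  // backing tensor (in device or host memory) and manages input/output.
  class Variable {
    TypeRef type_;
  public:
    explicit Variable(TypeRef ty) : type_(ty) {}
    TypeRef getType() const { return type_; }
  };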

@stoklund
Contributor

This sounds great! A couple comments:

The name ConstantVariable is a bit of an oxymoron. A constant is something that doesn't change, and a variable is something that does change. I would suggest Constant and Variable instead.

Second, I think we should also look at separating constants from the module. The module represents the compiler's intermediate representation, but right now it isn't intermediate at all — the constants in the module are needed at runtime, so you have to keep it around.

More generally, we should make sure we are economical with memory management in a production environment. I imagine something like:

  • The Module only exists during compilation, then it is deleted. We don't need the IR any more.
  • After compilation, constants that are needed at runtime are moved into a constant pool. Any constants that are not needed at runtime get deleted along with the module.
  • A constant pool can be shared by multiple runtime variable bindings, and also by multiple compiled functions. (e.g., compiled functions representing different batch sizes of the same graph).
  • For constants we didn't need to modify during compilation, this should all happen without any copies of tensor data.
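As a rough sketch of that shape (the names ConstantPool, CompiledArtifact, and RuntimeBindings are hypothetical; nothing here exists in Glow yet):

  #include <memory>
  #include <string>
  #include <unordered_map>

  // Hypothetical sketch: the pool of runtime constants outlives the Module
  // and is shared, not copied.
  struct ConstantPool {
    std::unordered_map<std::string, Tensor> constants; // weights needed at runtime
  };

  // A compiled function keeps the pool alive but owns no IR.
  struct CompiledArtifact {
    std::shared_ptr<ConstantPool> constants;
    // ... generated code / device program ...
  };

  // Each thread gets its own bindings for inputs/outputs, while sharing the
  // same constant pool (and possibly the same CompiledArtifact).
  struct RuntimeBindings {
    std::shared_ptr<ConstantPool> constants;
    std::unordered_map<std::string, Tensor> placeholders;
  };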

@bertmaher
Contributor

This proposal sounds great to me. A closely-related issue for multithreaded programs is where we allocate activations. We should hopefully be able to use the same mechanism for RuntimeVariables and activations.

@nadavrot
Contributor Author

@stoklund Thank you for the feedback.

I like your naming suggestion for Constant and Variable. I will update the proposal.

How do you envision separating the Constant tensors from the module? If we use the LLVM terminology, Constants are like LLVM's ConstantArray: large chunks of data that live inside the module. We need to be able to mutate some tensors during constant propagation. Our AOT compiler can already serialize variables/constants and create standalone bundles. Our JIT compiler references the module tensors to save a copy and reduce memory use.

I think that the economical memory management that you envision (and I agree with this vision, btw) is tied to the concept of a compiled function, which we are currently missing.

@stoklund
Contributor

@nadavrot We need the constants at runtime, and the variables are actually needed too, as names to bind against. I imagine splitting a Context out of the Module:

  • Context owns constants and variables. It outlives both modules and compiled functions.
  • Module owns functions and references the context where variables and constants live.
  • CompiledFunction references the context, not the module.
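In code, that layering might look roughly like this (a sketch of the proposed ownership; the members are illustrative):

  #include <memory>
  #include <vector>

  // Sketch: the Context owns the long-lived objects; Module and
  // CompiledFunction only point back at it.
  class Context {
    std::vector<std::unique_ptr<Constant>> constants_;
    std::vector<std::unique_ptr<Variable>> variables_;
  };

  class Module {
    Context *ctx_; // not owned; constants and variables live in the Context
    std::vector<std::unique_ptr<Function>> functions_;
  };

  class CompiledFunction {
    Context *ctx_; // not owned; the Module can be deleted after compilation
    // ... backend-specific executable state ...
  };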

But actually, this is something we can do after separating variables and constants. I don't think we have to implement it at the same time as this issue.

@bertmaher The memory for variables can be provided from the outside. See for example onnxSetGraphIO. I think we should allocate activation memory when binding variables. Some accelerators will want to statically allocate activation memory, though.

@nadavrot
Contributor Author

nadavrot commented Aug 3, 2018

Update: we had an offline discussion about this proposal. At the moment, Glow Variables are modeled after TensorFlow's variables, which own the underlying storage. Some variables must be bound (for example, constants) while others must be unbound (for example, input images to the network). We are missing TensorFlow's concept of a "placeholder". There is consensus that we should implement something similar to "TF.placeholder". We did not reach a decision about the mutability of bound variables. One option is to keep the TF model that allows mutable bound variables. The other option is to disallow mutable bound variables. Mutable variables do not make sense in the context of multithreading, but they are very useful for testing/debugging/profiling the code. @artemrakhov pointed out that it's important for us to be able to write tests easily without a complicated setup and mentioned the difficulty of listing out the LSTM variables. @stoklund prefers to abolish mutable bound variables. @opti-mix mentioned that bound and unbound variables need to inherit from Storage.

@qcolombet mentioned that XLA does not have a concept of Variables and that they have a high-level compiler that lowers the high-level Variable concept to the low-level dataflow XLA graph (see tensorflow/tensorflow/compiler/tf2xla/). This transformation may fail.

Another topic: @stoklund pointed out that it's a good idea to separate the IR from the context/variables. I added a unit test that ensures that after compilation we can erase all of the functions and the code still runs correctly (on all tested backends). One possible next step would be to move the IRFunction from the execution engine and into the backend. @opti-mix raised the point that we don't want every backend to invent its own low-level IR, so having a single formal low-level IR is a good idea. Moving IRFunction into the backend (and possibly into the CompiledFunction class) will clarify ownership and help us implement better caching of functions.

We also talked about extracting the IR-builder parts of the graph from the Graph into a new namespace/class. I am sure that I missed a few topics.

@opti-mix
Contributor

opti-mix commented Aug 3, 2018

One possible next step would be to move the IRFunction from the execution engine and into the backend. @opti-mix raised the point that we don't want every backend to invent its own low-level IR, so having a single formal low-level IR is a good idea. Moving IRFunction into the backend (and possibly into the CompiledFunction class) will clarify ownership and help us implement better caching of functions.

@nadavrot In fact, after the recent refactorings of the Backend interface, ExecutionEngine does not own the IRFunction. Instead, it provides the following function, which is used to create an IRFunction:
std::unique_ptr<IRFunction> generateIR(CompilationMode mode, Function *F);

But ExecutionEngine owns the CompiledFunction that it needs to execute.

So, it seems like your comment regarding the IRFunction being moved from ExecutionEngine into a Backend/CompiledFunction is mostly implemented.

Or do you mean something besides what we have now?

@nadavrot
Contributor Author

nadavrot commented Aug 3, 2018

@opti-mix Oh, you are right. What do you think about changing the API so that it does not expose IRFunction?

Backend.h:
compile(std::unique_ptr<IRFunction> IR) const = 0; // change this to: compile(Function *F);

@opti-mix
Contributor

opti-mix commented Aug 3, 2018

@nadavrot ExecutionEngine currently uses the IRFunction type in the signature of a private function:
std::unique_ptr<IRFunction> generateIR(CompilationMode mode, Function *F);
thus it is an implementation detail. I'm pretty sure we can remove the reference to IRFunction from ExecutionEngine.h.

We could make generateIR a helper function that could be used by backends if they need to produce the IR.

And backends should have the following compile method:
std::unique_ptr<CompiledFunction> compile(CompilationMode mode, Function *F) const = 0;
instead of
std::unique_ptr<CompiledFunction> compile(std::unique_ptr<IRFunction> IR) const = 0;

If a backend wants to produce the IR, it can use generateIR. If not, it can work directly on the graph representation.
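To make the shape of that change concrete, a backend under the proposed interface might look roughly like the following; only the two signatures quoted above come from this thread, and the rest (including compileIR) is a hypothetical sketch:

  // Sketch: the graph Function crosses the Backend API; the low-level IR is
  // an internal choice of the backend.
  class MyBackend : public Backend {
  public:
    std::unique_ptr<CompiledFunction>
    compile(CompilationMode mode, Function *F) const override {
      // A backend that wants the low-level IR builds it via the helper:
      std::unique_ptr<IRFunction> IR = generateIR(mode, F);
      return compileIR(std::move(IR)); // hypothetical backend-private step
    }
  };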

@nadavrot
Contributor Author

nadavrot commented Aug 8, 2018

Plan update

This is a quick summary of the plan for implementing the Variable refactoring. The posts above explain the motivation for eliminating the graph-owned bound mutable variables. We'll eliminate these variables in steps. Some parts of the plan are clear, but we still have a few open design questions.

Step 1

PR #1365 separated the initialization of tensors from the Variable. This PR extracted the code that performs Xavier and other kinds of initialization and moved it to the caller. This allowed us to start working on the refactoring of variables.

Step 2

The next step would be to implement Placeholders, which are unbound variables. The name Placeholder is borrowed from TensorFlow, and just like in TensorFlow, the content of inputs and outputs is copied into and out of the tensors before and after the execution of the program. The execution engine is responsible for binding the inputs/outputs. The execution engine interface already supports this kind of binding: the run method accepts a list of variables and a list of tensors to bind. PR #1409 is the first step toward implementing placeholders. Future PRs will implement the missing functionality (as explained in the body of the PR). After this change lands we will be able to migrate many of the simple tests (such as learn-xor) and all of the ONNX-loader tests to a pure Placeholder mode.
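For illustration, a test using placeholders might bind tensors at run time roughly like this (a sketch; the exact builder and run signatures are assumptions about the API being built, see PR #1409 for the real one):

  // Sketch: the placeholder is only a typed symbol in the graph; the caller
  // supplies the backing tensor when running the compiled function.
  auto *input = mod.createPlaceholder(ElemKind::FloatTy, {1, 28, 28, 1},
                                      "input", /* isTrainable */ false);
  Tensor inputData(ElemKind::FloatTy, {1, 28, 28, 1});
  // ... fill inputData with an image ...
  EE.compile(CompilationMode::Infer, F);
  EE.run({input}, {&inputData}); // bind the tensor to the placeholder, then execute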

Step 3

Step 2 got rid of the input/output tensors and removed most of the mutable bound variables from the test suite. However, we will still have many mutable variables that are defined by the graph builder. We can't easily get rid of these variables and replace them with placeholders (and backing tensors) because the graph builder also initializes the variables. We will need to design a system that allows the graph builder to create new variables/placeholders, register their initialization kind, and allocate backing tensors for them. The challenge here is to make sure that our unit tests are not too verbose and that it's easy to add new unit tests to the compiler.
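One possible shape for such a system, sketched below with hypothetical names (the helper itself, allocate, and initXavier are assumptions, not existing APIs):

  // Hypothetical helper: create a placeholder, allocate its backing tensor in
  // the context, and record how it should be initialized, so that unit tests
  // stay one line per weight.
  Placeholder *createInitializedPlaceholder(Module &mod, Context &ctx,
                                            TypeRef ty, llvm::StringRef name) {
    Placeholder *PH = mod.createPlaceholder(ty, name, /* isTrainable */ true);
    Tensor *backing = ctx.allocate(PH); // backing storage lives in the context
    initXavier(*backing);               // registered/performed initialization
    return PH;
  }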

@qcolombet
Contributor

Not necessarily a good thing for non-verbosity in the tests, but XLA requires the runtime to supply the storage/tensor for all inputs, outputs and most temporaries. In other words, the address of the storage for the "variables" is always supplied at compiled time.

@qcolombet
Contributor

is always supplied at compiled time.

  • I meant runtime!!

nadavrot added a commit to nadavrot/glow that referenced this issue Sep 5, 2018
…s that we can construct the graph.

This commit is the first few steps in the implementation of the
Placeholder node. The plan is discussed in pytorch#1334. Placeholders are
unbound variables whose tensor content is provided by the
execution engine or the AOT compiler at runtime.

This commit introduces Storage as a superclass for Variable (bound) and
Placeholder (unbound). This PR also adds a simple unit test that checks
that we can create new Placeholder nodes. This PR does not implement any
of the infrastructure that's required to do anything useful with
Placeholder variables, such as loading content into them, deleting them,
the dotty pretty-printer, etc.

Next steps (not necessarily in any specific order):

1. Teach the execution engine to bind tensors to the Placeholder nodes.

2. Verify that dotty printing, dump() and debugging work well.

3. Cleanup the APIs that are related to Variable and Placeholder and
make them consistent.

4. Change (some of) the unit tests to use the new Placeholder API.

5. Make sure that our optimizations are correct when placeholders are
used.
nadavrot added a commit that referenced this issue Sep 6, 2018
@nadavrot
Contributor Author

nadavrot commented Sep 6, 2018

I landed #1409 which introduced the Placeholder class. I also added a comment in the docs [1] that describes the upcoming changes. I intend to submit a patch to update the execution engine soon.

[1] - https://github.com/pytorch/glow/blob/master/docs/IR.md

@nadavrot
Contributor Author

After landing #1612 we can now use Placeholder variables to execute meaningful programs. This is a demo that shows what MNIST looks like without any mutable variables, only placeholders:

#1622

The next steps would be to fix a few local bugs, then update all of the graph builder APIs to include the context. After that we'll need to update all of the unit tests. Finally, we'll need to get rid of Storage and turn Variable into a Constant.

@nadavrot
Contributor Author

Status Update

The compiler now supports Placeholder variables. I wrote a few basic unit tests and ported the MNIST program #1622, as an experiment. I am sure that there are some bugs and missed optimizations, but these should be relatively easy to fix as we migrate and delete the code that handles the mutable variables.

The next step in the migration would be to rewrite all of our unit tests to use Placeholders instead of Variables. At the moment we have around 600 locations that call into createVariable(). I plan to migrate the tests to Placeholders one by one. Porting everything in one big commit is not a good idea because it would discourage code review and increase the chance of merge conflicts. I suggest adding new builder functions that duplicate some of the builder methods that create variables. Once we finish the migration we'll delete the old builder methods. PR #1622 demonstrates the duplicated builder API. I intend to implement something similar, just cleaner and better documented.
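As a rough illustration of what the duplicated builder API could look like in a migrated test (the names and argument lists here are assumptions; PR #1622 shows the real prototype):

  // Old style: the variable owns and initializes its own tensor.
  auto *oldW = mod.createVariable(ElemKind::FloatTy, {10, 10}, "weights");

  // New style: the placeholder is just a typed symbol; the test allocates and
  // initializes the backing tensor through the context.
  auto *newW = mod.createPlaceholder(ElemKind::FloatTy, {10, 10}, "weights",
                                     /* isTrainable */ true);
  ctx.allocate(newW)->getHandle().randomize(-1.0, 1.0, mod.getPRNG());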

Once we migrate all of the tests to using Placeholders we can simply limit SaveNode to saving into Placeholders, and rename Variable to Constant. This will finish the migration.

Other open items:

  • The differentiation method is tied to the grad-check unit test so we'll need to port both together.
  • By default, weights will be stored as unoptimizable Placeholders (in the context). We'll need to write a method that converts the non-IO Placeholders into Constants to allow the optimizer to optimize them (see the sketch after this list).
  • We'll need to clean up some of the APIs. During the migration process I've focused on reducing code changes, but the APIs for creating and managing the context should be modernized.
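A hypothetical sketch of that conversion step (all names here are illustrative, not a committed API):

  // Sketch: every placeholder that is not a network input/output becomes a
  // Constant backed by its bound tensor, so the optimizer may fold it.
  void convertNonIOPlaceholders(Module &mod, Context &ctx,
                                const std::set<Placeholder *> &io) {
    for (Placeholder *PH : mod.getPlaceholders()) {
      if (io.count(PH))
        continue; // inputs/outputs must stay unbound placeholders
      Tensor *backing = ctx.get(PH); // the weight tensor bound in the context
      Constant *C = mod.createConstant(PH->getName(), backing->clone());
      PH->getOutput().replaceAllUsesOfWith(C->getOutput());
    }
  }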

@nadavrot
Contributor Author

nadavrot commented Oct 6, 2018

Status update

The recent PRs #1808 and #1807 removed the old Variable. Variable was renamed to Constant, and the SaveNode can't write into it. The next step in this refactoring is to fix the APIs that became awkward during the rewrite and to update the docs.

@nadavrot
Contributor Author

This work is now done. We've removed all mutable variables and replaced them with Placeholder and Constant.

@jackm321
Contributor

jackm321 commented Dec 5, 2018

It looks like there are still some graph API methods that use the context and probably shouldn't: https://github.com/pytorch/glow/blob/master/include/glow/Graph/Graph.h#L664-L670. Is there a task to clean those up?
