Need to redesign the Variable interface. #1334
Motivation

Currently, Glow variables are backed by Tensors and are tied to the graph. This gives us the ability to perform optimizations, such as constant propagation, that mutate the content of Tensors. However, it also ties the input/output tensors to the module. This design prevents us from running multithreaded programs, simply because the input/output variables (which need to be different for each thread) are a single shared entity.

Proposed design

The proposed solution is to separate the concept of symbolic variables from graph variables. For example, an input variable X will not be backed by a Tensor until the execution engine executes the compiled program. To execute the network, the caller would need to bind Tensors to variables. The execution engine actually already does this, so we won't need to change the APIs that are used by our unit tests.

To preserve the ability to perform constant propagation (and other tensor-content optimizations) we would need to be able to make some variables concrete during the compilation phase. We would need two kinds of variables (sketched after this list):

Constant - a tensor-backed variable. Similar to LLVM's ConstantArray, it's a data structure that lives at the module level and can be manipulated by transformations.

Variable - a typed variable that isn't backed by a real tensor until runtime. The runtime allocates the tensor (either in device memory or on the host) and manages the input/output to and from the variable.
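To make the constant-propagation distinction concrete, here is a minimal sketch of the two kinds, with a compile-time transformation folding an operation into a Constant's payload. All types and names are illustrative stand-ins, not Glow's actual classes:

```cpp
#include <vector>

// Illustrative stand-ins for the two kinds of variables described above.
struct Constant {
  std::vector<float> payload; // concrete data, available at compile time
};

struct Variable {
  // No payload: the runtime binds a tensor to it before execution.
};

// A compile-time transformation can rewrite a Constant's payload in
// place, e.g. folding a multiply-by-scalar into stored weights. This
// is impossible for a Variable, whose tensor does not exist yet.
void foldScale(Constant &C, float scale) {
  for (float &v : C.payload)
    v *= scale;
}
```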
Comments
This sounds great! A couple of comments: The name ConstantVariable is a bit of an oxymoron. A constant is something that doesn't change, and a variable is something that does change. I would suggest Constant and Variable instead. Second, I think we should also look at separating constants from the module. The module represents the compiler's intermediate representation, but right now it isn't intermediate at all: the constants in the module are needed at runtime, so you have to keep the module around. More generally, we should make sure we are economical with memory management in a production environment. I imagine something like:
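A minimal sketch of that kind of separation, where the Module owns only compile-time IR and constant payloads live in a pool that outlives it (ConstantPool and CompiledProgram are illustrative names, not Glow's API):

```cpp
#include <memory>
#include <vector>

struct Tensor {}; // stand-in for a typed data buffer

// Constant payloads live in a separate, refcounted pool rather than
// inside the Module itself.
struct ConstantPool {
  std::vector<std::shared_ptr<Tensor>> constants;
};

// The Module holds only the compiler's intermediate representation
// and a reference to the pool, so it can be discarded after compilation.
struct Module {
  std::shared_ptr<ConstantPool> constants;
  // ... functions / IR ...
};

// The compiled artifact keeps the constants alive at run time without
// keeping the rest of the Module around.
struct CompiledProgram {
  std::shared_ptr<ConstantPool> constants;
};
```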
This proposal sounds great to me. A closely-related issue for multithreaded programs is where we allocate activations. We should hopefully be able to use the same mechanism for allocating activations as well.
@stoklund Thank you for the feedback. I like your naming suggestion of Constant and Variable. I will update the proposal. How do you envision separating the Constant tensors from the module? In LLVM terminology, Constants are like LLVM's ConstantArray: large chunks of data that live inside the module. We need to be able to mutate some tensors during constant propagation. Our AOT compiler can already serialize variables/constants and create standalone bundles. Our JIT compiler references the module tensors to avoid a copy and reduce memory use. I think that the economical memory management you envision (and I agree with this vision, btw) is tied to the concept of a compiled function, which we are currently missing.
@nadavrot We need the constants at runtime, and variables are actually needed too, as names when binding. I imagine splitting out a Context from the module:
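A sketch of one possible shape for that split (the Context name comes from the comment; the layout is an assumption):

```cpp
#include <string>
#include <unordered_map>
#include <vector>

struct Tensor {};   // stand-in for a typed data buffer
struct Function {}; // stand-in for a graph-level function

// The Module keeps the compile-time artifacts: the functions (IR) and
// the symbolic declarations of variables and constants.
struct Module {
  std::vector<Function> functions;
};

// The Context owns the run-time state: the concrete tensors bound to
// each variable name. Each thread can hold its own Context while
// sharing a single compiled Module.
struct Context {
  std::unordered_map<std::string, Tensor> bindings;
};
```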
But actually, this is something we can do after separating variables and constants. I don't think we have to implement it at the same time as this issue. @bertmaher The memory for variables can be provided from the outside. See for example onnxSetGraphIO. I think we should allocate activation memory when binding variables. Some accelerators will want to statically allocate activation memory, though.
Update: we had an offline discussion about this proposal. At the moment Glow Variables are modeled after TensorFlow's variables, which own the underlying storage. Some variables must be bound (for example, constants) while others must be unbound (for example, input images to the network). We are missing TensorFlow's concept of a "placeholder". There is consensus that we should implement something similar to TF.placeholder. We did not reach a decision about the mutability of bound variables. One option is to keep the TF model that allows mutable bound variables. The other option is to disallow mutable bound variables. Mutable variables do not make sense in the context of multithreading, but they are very useful for testing/debugging/profiling the code. @artemrakhov pointed out that it's important for us to be able to write tests easily without a complicated setup, and mentioned the difficulty of listing out the LSTM variables. @stoklund prefers to abolish mutable bound variables. @opti-mix mentioned that bound and unbound variables need to inherit from Storage. @qcolombet mentioned that XLA does not have a concept of Variables and that they have a high-level compiler that lowers the high-level Variable concept to the low-level dataflow XLA graph (see tensorflow/tensorflow/compiler/tf2xla/). This transformation may fail.

Another topic: @stoklund pointed out that it's a good idea to separate the IR from the context/variables. I added a unit test that ensures that after compilation we can erase all of the functions and the code still runs correctly (on all tested backends). One possible next step would be to move the IRFunction out of the execution engine and into the backend. @opti-mix raised the point that we don't want every backend to invent its own low-level IR, so having a single formal low-level IR is a good idea. Moving IRFunction into the backend (and possibly into the CompiledFunction class) will clarify ownership and help us implement better caching of functions. We also talked about extracting the IR-builder parts of the graph from the Graph into a new namespace/class. I am sure that I missed a few topics.
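For concreteness, the Storage hierarchy mentioned above might look roughly like this (a sketch only; the real node layout differs):

```cpp
struct Tensor {}; // stand-in for the payload type

// Common base class for both kinds of graph-level storage nodes.
class Storage {
public:
  virtual ~Storage() = default;
};

// A bound variable owns its payload (the current Variable model).
class Variable : public Storage {
  Tensor payload_;
};

// An unbound, TF-style placeholder: the caller supplies the tensor
// when the program runs.
class Placeholder : public Storage {};
```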
@nadavrot In fact, after the recent refactorings of the Backend interface, ExecutionEngine does not own the IRFunction. Instead, it provides the following function: But ExecutionEngine owns the CompiledFunction that it needs to execute. So it seems like your comment regarding moving the IRFunction from ExecutionEngine into a Backend/CompiledFunction is mostly implemented. Or do you mean something beyond what we have now?
@opti-mix Oh, you are right. What do you think about changing the API so that it does not expose IRFunction at all? Backend.h:
@nadavrot ExecutionEngine currently uses the IRFunction type in the signature of a private function. We could make that an internal detail, and backends should have a compile method that takes the graph-level Function instead. If a backend wants to produce the low-level IR, it can generate it internally from that Function.
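Put together, the direction being discussed might look something like this (a hedged sketch; the exact signatures are assumptions):

```cpp
#include <memory>

class Function;         // graph-level function
class CompiledFunction; // opaque executable produced by a backend

// The backend compiles directly from the graph-level Function and
// returns an executable artifact. Any low-level IR it builds along
// the way stays an internal detail rather than part of the public API.
class Backend {
public:
  virtual ~Backend() = default;
  virtual std::unique_ptr<CompiledFunction> compile(Function *F) const = 0;
};
```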
Plan update

This is a quick summary of the plan for implementing the Variable refactoring. The posts above explain the motivation for eliminating the graph-owned bound mutable variables. We'll eliminate these variables in steps. Some parts of the plan are clear, but we still have a few open design questions.

Step 1

PR #1365 separated the initialization of tensors from the Variable. This PR extracted the code that performs Xavier and other kinds of initialization and moved it to the caller. This allowed us to start working on the refactoring of variables.

Step 2

The next step is to implement Placeholders, which are unbound variables. The name Placeholder is borrowed from TensorFlow, and just like in TensorFlow, the content of inputs and outputs is copied into and out of the tensors before and after the execution of the program. The execution engine is responsible for binding the inputs/outputs. The execution engine interface already supports this kind of binding: the run method accepts a list of variables and a list of tensors to bind. PR #1409 is the first step toward implementing placeholders. Future PRs will implement the missing functionality (as explained in the body of the PR). After this change lands we will be able to migrate many of the simple tests (such as learn-xor) and all of the ONNX-loader tests into a pure Placeholder mode.

Step 3

Step 2 got rid of the input/output tensors and removed most of the mutable bound variables from the test suite. However, we will still have many mutable variables that are defined by the graph builder. We can't easily get rid of these variables and replace them with placeholders (and backing tensors) because the graph builder also initializes the variables. We will need to design a system that allows the graph builder to create new variables/placeholders, register their initialization kind, and allocate backing tensors for them. The challenge here is to make sure that our unit tests are not too verbose and that it's easy to add new unit tests to the compiler.
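As an illustration of the binding described in Step 2, running with placeholders might look roughly like this. This is a fragment, not a compilable unit, and the identifiers are assumptions based on the run signature described above:

```cpp
// Build the graph against symbolic placeholders; bind concrete
// tensors only at execution time.
Placeholder *X = mod.createPlaceholder(ElemKind::FloatTy, {1, 784}, "X");
Placeholder *Y = mod.createPlaceholder(ElemKind::FloatTy, {1, 10}, "Y");
// ... build the function using X, saving the result into Y ...

Tensor in(ElemKind::FloatTy, {1, 784});
Tensor out(ElemKind::FloatTy, {1, 10});

// The run method accepts a list of variables/placeholders and a list
// of tensors to bind to them.
EE.run({X, Y}, {&in, &out});
```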
Not necessarily a good thing for keeping the tests non-verbose, but XLA requires the runtime to supply the storage/tensor for all inputs, outputs, and most temporaries. In other words, the address of the storage for the "variables" is always supplied at run time.
…s that we can construct the graph. This commit takes the first few steps in the implementation of the Placeholder node. The plan is discussed in #1334. Placeholders are unbound variables where the content of the tensor is provided by the execution engine or the AOT compiler at runtime. This commit introduces Storage as a superclass for Variable (bound) and Placeholder (unbound). This PR also adds a simple unit test that checks that we can create new Placeholder nodes. This PR does not implement any of the infrastructure required to do anything useful with Placeholder variables, such as loading content into them, deleting them, dotty pretty-printing, etc. Next steps (not necessarily in any specific order):
1. Teach the execution engine to bind tensors to the Placeholder nodes.
2. Verify that dotty printing, dump(), and debugging work well.
3. Clean up the APIs related to Variable and Placeholder and make them consistent.
4. Change (some of) the unit tests to use the new Placeholder API.
5. Make sure that our optimizations are correct when placeholders are used.
I landed #1409, which introduced the Placeholder class. I also added a comment in the docs [1] that describes the upcoming changes. I intend to submit a patch to update the execution engine soon. [1] - https://github.com/pytorch/glow/blob/master/docs/IR.md
After landing #1612 we can now use Placeholder variables to execute meaningful programs. This is a demo of what MNIST looks like without any mutable variables, only placeholders: The next steps are to fix a few local bugs, then update all of the graph builder APIs to include the context. After that we'll need to update all of the unit tests. Finally, we'll need to get rid of Storage and turn Variable into a Constant.
Status Update

The compiler now supports Placeholder variables. I wrote a few basic unit tests and ported the MNIST program (#1622) as an experiment. I am sure that there are some bugs and missed optimizations, but these should be relatively easy to fix as we migrate and delete the code that handles the mutable variables. The next step in the migration is to rewrite all of our unit tests to use Placeholders instead of Variables. At the moment we have around 600 locations that call into createVariable(). I plan to migrate the tests to Placeholder one by one. Porting everything in one big commit is not a good idea because it would discourage code review and increase the chance of merge conflicts. I suggest adding new builder functions that duplicate some of the builder methods that create variables. Once we finish the migration we'll delete the old builder methods. PR #1622 demonstrates the duplicated builder API. I intend to implement something similar, just cleaner and better documented. Once we migrate all of the tests to using Placeholders we can simply limit SaveNode to saving into Placeholder, and rename Variable to Constant. That will finish the migration. There are a few other open items as well.
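As a sketch of the duplicated-builder idea from the status update above (the signatures are assumptions, not the final API):

```cpp
#include <string>

class Variable;
class Placeholder;
struct Type;

// Transitional API: the legacy variable-based builder stays alongside
// the new placeholder-based one until the migration finishes, after
// which the legacy method is deleted.
class Module {
public:
  // Legacy: creates a graph-owned, tensor-backed Variable.
  Variable *createVariable(const Type &T, const std::string &name);

  // New: creates an unbound Placeholder; the caller binds a Tensor
  // at run time.
  Placeholder *createPlaceholder(const Type &T, const std::string &name);
};
```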
This PR is now done. We've removed all mutable variables and replaced them with Placeholder and Constant.
It looks like there are still some graph API methods that use context and probably shouldn't: https://github.com/pytorch/glow/blob/master/include/glow/Graph/Graph.h#L664-L670. Is there a task to clean those up?