[jit] Remove symbolic autodiff's reliance on shape information #8410
Comments
This sounds like it should work. It's also interesting to note that existing formulas with multiple inputs, e.g., as in convolution double backwards (in …). That being said: you shouldn't assume this if you want double backwards to work :)
I would rather we encode linearity as an intrinsic property of operations, so that we can apply this optimization more widely (not just to backward graphs). It seems dangerous to assume that any computation reachable from …
I updated this with a tweak to the formulation that encodes linearity of a block of operators by putting UndefToPoison nodes before the op and a PoisonToUndef node after it. These nodes capture the fact that the sub-graph in between the two nodes is linear. I prefer this solution to one that uses explicit subgraphs marking the Linear regions because it doesn't require us to teach other passes like …
I reread the new proposal in preparation for reviewing the code, and I am still nervous about the poisoning semantics because I don't understand, semantically, what the poison operations mean. As in, imagine this were an actual programming language, and you had to explain what each op means, in terms of runtime semantics. What does …

EDIT: Oh, this proposal is no longer up to date. Never mind this :)
…ined (#8641) This commit implements the solution proposed in #8410 to work around the need to create zero tensors with the same shape as inputs. It introduces the concept of a LinearBlock, which marks places in the code where we know that if all the inputs to the node are zero, then the outputs of the node are also zero. Autodiff introduces LinearBlocks around backwards functions, which have this property. specializeUndef then propagates Undef nodes using this information.

Notes:
* Since we do not always specialize, we have a pass LowerLinearBlocks that replaces the block with an if statement that dynamically guards the Undef case.
* We introduce AutogradAdd, which is addition that still works when its inputs might be undefined. In cases where we specialize, this will get removed in favor of a normal add, but there are cases where gradient graphs do not specialize (e.g. when they are not differentiable but a derivative is required), so it is important for this op to be executable.
This has been fixed in #8641.
Problem
A value is output from a function, but that value is never used in the program. When computing the gradient of that function, we get an undefined tensor as the gradient of that output. However, that gradient should actually be zero (the output isn't used in the loss, so dL/dV should be all zeros). Currently we have a node 'ReplaceIfUndef' that substitutes the zero tensor in, but we need to know the shape of the output to generate a zero tensor of the right shape. This is not always possible to derive statically. We do not want to rely on the shape being available, nor do we want to add the bookkeeping to remember the shape between forward and backward.
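As a rough illustration at the Python autograd level (not the JIT IR itself; this sketch just assumes PyTorch is available), an output that never reaches the loss comes back with an undefined (`None`) gradient, and turning it into zeros would require knowing that output's shape:

```python
import torch

x = torch.randn(4, requires_grad=True)
used, unused = x * 2, x * 3              # `unused` never feeds into the loss
loss = used.sum()

# allow_unused=True lets autograd return an undefined gradient for `unused`
g_used, g_unused = torch.autograd.grad(loss, [used, unused], allow_unused=True)
print(g_used)     # tensor of ones with the same shape as `used`
print(g_unused)   # None: the undefined gradient that would need to become zeros
```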
Additional Background
We always specialize to a different graph when a tensor is undefined, so we can perform graph optimizations knowing that certain inputs to the graph are undefined, including the optimization of the autograd graph when the gradients of certain outputs are known to be zero.
Proposed Solution
If we have a function `y = f(x)` with Jacobian `J`, the backward of `f` is `dx = J^T dy`. Note that because the backward always implements this matrix multiply, we know that it maps an input vector of zeros to an output vector of zeros, regardless of what operations it chooses to do inside to actually implement the matrix multiply (most use some optimized form and never generate `J^T`). Hence it is OK to replace any backward function with known-zero inputs with something that produces known-zero outputs. The entire backward pass is composed of operators that either (1) compute a backward function, or (2) sum the results of backward functions when a value is used multiple times.
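A minimal check of this property at the eager-autograd level (assumes PyTorch; the choice of function is arbitrary and only illustrative): feeding an all-zero `dy` into the backward produces an all-zero `dx`, no matter how the backward is implemented internally.

```python
import torch

x = torch.randn(3, requires_grad=True)
y = x.sin() * 2 + x.exp()                    # arbitrary differentiable f(x)
dy = torch.zeros_like(y)                     # known-zero upstream gradient
(dx,) = torch.autograd.grad(y, x, grad_outputs=dy)
assert torch.equal(dx, torch.zeros_like(x))  # dx = J^T @ 0 == 0
```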
We can introduce nodes before and after each backward function that gate the inputs and outputs of the op. If we can prove that all of the inputs to the op are Undef nodes, then we can remove the entire graph representing the op, replacing its outputs with Undef nodes. To do this systematically, we introduce two ops:

`y0, y1, y2, ... = UndefToPoison(x0, x1, x2, ...)`

If all of the inputs are Undef, then all of the outputs become poison values. Otherwise this op does nothing. For all ops, if a single input can be proven to be a poison value, then all of its outputs become poison values.

`y = PoisonToUndef(x)`

If `x` is a poison value, then `y` turns into an Undef. Otherwise this op does nothing. This is the one exception to the poison propagation described above.

A backward op graph for a single backward function `y0, y1, ... = f'(x0, x1, ...)`
will then get transformed so that UndefToPoison gates its inputs and PoisonToUndef gates its outputs, as sketched below. In the event that all the inputs are Undef, poison propagation will proceed through the entire body of `f'`, and the outputs will become Undefs. If some inputs are not Undef, then poison will not propagate and the body will be retained as is.
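A self-contained sketch of this gating (plain Python; `POISON`, the helper names, and the body of `f'` are illustrative, not the actual JIT implementation). Undef is modeled as `None` and poison as a sentinel value:

```python
POISON = object()  # stand-in for the poison value

def undef_to_poison(*xs):
    # If all inputs are Undef (None), every output becomes poison;
    # otherwise the op does nothing.
    return tuple(POISON for _ in xs) if all(x is None for x in xs) else xs

def poison_to_undef(x):
    # Poison turns back into Undef at the boundary; otherwise pass through.
    return None if x is POISON else x

def f_prime_body(g0, g1):
    # Stand-in for the ops inside a backward function: any op with a
    # poison input produces poison outputs (the general propagation rule).
    if g0 is POISON or g1 is POISON:
        return POISON, POISON
    return g0 + g1, g0 - g1

def gated_backward(x0, x1):
    # y0, y1 = f'(x0, x1) after the transformation:
    p0, p1 = undef_to_poison(x0, x1)
    t0, t1 = f_prime_body(p0, p1)
    return poison_to_undef(t0), poison_to_undef(t1)

print(gated_backward(None, None))  # (None, None): all-Undef in, Undef out
print(gated_backward(1.0, 2.0))    # (3.0, -1.0): defined inputs run the real body
```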
When generating this graph, we can use the property that `UndefToPoison(PoisonToUndef(x)) == x` to keep the graph simple. With this simplification, ops that are simply composed together will not have these nodes inserted between them.

A pass to propagate Undefs would perform the following actions, given a set of input values known to be Undef:
* `add(Undef, x) -> x`
* `add(x, Undef) -> x`
* `add(Undef, Undef) -> Undef`
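These rewrites amount to an addition that tolerates Undef operands. A small hypothetical helper (Undef again modeled as `None`) showing what they mean at runtime; this mirrors the behavior an op like the AutogradAdd mentioned in the comments above would need:

```python
def add_with_undef(a, b):
    if a is None:
        return b        # add(Undef, x) -> x  (and add(Undef, Undef) -> Undef)
    if b is None:
        return a        # add(x, Undef) -> x
    return a + b        # both defined: ordinary addition
```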
Given the fact that we always specialize inputs to be defined or undefined, functions with a single output value can generate gradient graphs that do not have to handle Undef inputs. Functions with multiple outputs will need to handle the case where some, but not all, of the gradient inputs may be Undef. This is already true of many of the backward ops with multiple outputs.
Previous Proposed Solution
(This is similar to the above, but it does not handle ops with multiple outputs well, and it makes it hard to distinguish autograd addition from addition inside a single operator.)
We can introduce an 'AutogradZero' node, which represents a place where we know the value is zero but we do not know its shape, and define rules to propagate it through the graph. We have to handle propagating AutogradZero through both kinds of statements that might appear:
(1) a group of operators representing a backward function (typically each backward takes more than a single op to compute; only the composition of these operators is guaranteed to map 0 to 0), and
(2) an addition of the outputs of (1)
For (1), assuming that the backward function has a single input and multiple outputs (which is the case for all of the ops we currently handle, and 90%+ of autograd formulas in general), we can treat AutogradZero as a poison value: a single input that is AutogradZero makes an op's output become AutogradZero. This will propagate through the entire formula of a group of operators representing a backward. If we ever have to write a formula with multiple inputs, we can insert ops that change AutogradZero back to Undef and allow the op to handle the behavior on its own.

To handle (2), we need to distinguish between additions that appear inside of a backward function (which should propagate the AutogradZero) and those which sum the derivatives from multiple uses of a value (which should drop any AutogradZeros). We can do this by introducing a special AutogradAdd op to indicate case (2), which is removed when we run the AutogradZero propagation pass.
Finally, it is important that the AutogradZero token is only propagated when the graph actually represents a gradient. The poison-based propagation only works because we know that the backward functions are implementing a matrix multiply, which maps 0 to 0. To ensure we only generate AutogradZero in gradient graphs, the autodiff can introduce autogradZeroIfUndef nodes for any undefined input, which will be replaced with autogradZero during the propagation pass.
Alternative considered: An alternative formulation would guard each output of every backward function with `propagateUndef(v, input)`, which would replace the computed output `v` with Undef if `input` is Undef. This would replace the "poison" value AutogradZero in the above formulation. However, this has the nasty property that it inserts a huge number of `propagateUndef` nodes, which makes reading and debugging the backward output much harder. In contrast, the poison-value formulation only inserts autogradZeroIfUndef nodes on the inputs to the graph, and the body of the graph remains the same.
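For comparison, a minimal sketch of that rejected guard (plain Python, Undef modeled as `None`; the name is taken from the description above and is otherwise hypothetical):

```python
def propagate_undef(v, inp):
    # Replace the computed output `v` with Undef when the corresponding
    # backward input is Undef; otherwise pass `v` through unchanged.
    return None if inp is None else v
```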