[jit] Remove symbolic autodiff's reliance on shape information #8410

Closed
zdevito opened this issue Jun 13, 2018 · 4 comments

Comments

@zdevito
Contributor

zdevito commented Jun 13, 2018

Problem

A value is output from a function but never used in the rest of the program. When computing the gradient of that function, we get an undefined tensor as the gradient of that output. However, that gradient should actually be zero (the value does not contribute to the loss, so dL/dV is all zeros). Currently we have a node 'ReplaceIfUndef' that substitutes a zero tensor. However, we need to know the shape of the output to generate a zero tensor of the right shape, and that shape is not always statically derivable. We do not want to rely on the shape being available, nor do we want to add the bookkeeping to remember the shape between the forward and backward passes.
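
For reference, a minimal eager-mode illustration of the problem (the two-output function here is made up for the example):

import torch

def f(x):
    # a function with two outputs; the second is never used downstream
    return x * 2, x.exp()

x = torch.randn(3, requires_grad=True)
y_used, y_unused = f(x)
loss = y_used.sum()

# dL/dy_unused is conceptually all zeros, but autograd never materializes it:
# asking for it explicitly yields an undefined (None) result.
print(torch.autograd.grad(loss, [y_unused], allow_unused=True))  # (None,)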

Additional Background

We always specialize a different graph when a tensor is undefined, so we can perform graph optimizations knowing that certain inputs to the graph are undefined, including optimizing the autograd graph when certain outputs are zero.

Proposed Solution

If we have a function y = f(x) with Jacobian J, the backward of f is dx = J^T dy. Because the backward always implements this matrix multiply, it maps an input vector of zeros to an output vector of zeros regardless of what operations it chooses to do internally to implement the multiply (most use some optimized form and never materialize J^T). Hence it is OK to replace any backward function with known-zero inputs with something that produces known-zero outputs.
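
As a quick sanity check in eager mode (the op chain here is arbitrary and chosen only for illustration), feeding a zero upstream gradient always yields a zero downstream gradient, no matter how the backward is implemented:

import torch

x = torch.randn(4, requires_grad=True)
y = torch.tanh(x) * 3  # y = f(x); its backward computes dx = J^T dy

# A zero dy must give a zero dx, regardless of how the backward is written.
dy = torch.zeros_like(y)
(dx,) = torch.autograd.grad(y, x, grad_outputs=dy)
assert torch.equal(dx, torch.zeros_like(x))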

The entire backwards pass is composed of operators that either (1) compute a backwards function, or (2) sum the results of backward functions when something is used multiple times.

We can introduce nodes before and after each backward function that gate the inputs and outputs to the op. If we can prove that all of the inputs to the op are Undef nodes, then we can remove the entire graph representing the op, replacing the outputs with Undef nodes. To do this systematically, we introduce two ops:

  • y0, y1, y2, ... = UndefToPoison(x0, x1, x2, ...): if all of the inputs are Undef, then all of the outputs become poison values. Otherwise this op does nothing. For all ops, if a single input can be proven to be a poison value, then all of its outputs become poison values.
  • y = PoisonToUndef(x): if x is a poison value then y becomes an Undef. Otherwise this op does nothing. This is the one exception to the poison propagation described above.

A backwards op graph for a single backward function y0, y1, ... = f'(x0, x1, ...) will then be transformed into:

x0', x1', ... = UndefToPoison(x0, x1, ...)
y0', y1' = f'(x0', x1') # note f' represents a whole graph not necessarily a single op
y0 = PoisonToUndef(y0')
y1 = PoisonToUndef(y1')

In the event that all the inputs are Undef, poison propagation will flow through the entire body of f', and the outputs will become Undefs. If some inputs are not Undef, it will not propagate and the body is retained as is. When generating this graph, we can use the property that UndefToPoison(PoisonToUndef(x)) == x to keep the graph simple. With this simplification, ops that are simply composed together will not have these nodes inserted between them.

A pass to propagate Undefs would perform the following actions, given a set of input values known to be Undefs:

  • If all values into UndefToPoison are proven to be Undefs, it replaces the outputs of the op with Poison values.
  • If a node that is not PoisonToUndef has one poison input value, it replaces the outputs of the op with Poison values
  • If the input to PoisonToUndef is a poison value, it replaces the output with Undef
  • add(Undef, x) -> x, add(x, Undef) -> x, and add(Undef, Undef) -> Undef

Given the fact that we always specialize inputs to be defined or undefined, functions with a single output value can generate gradient graphs that do not have to handle Undef inputs. Functions with multiple outputs will need to handle the case where some but not all gradient inputs may be Undef. This is already true of many of the backwards ops with multiple outputs.
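
To make the pass concrete, here is a minimal Python sketch of the propagation rules over a toy node representation (the Node structure, the string tags, and the topological-order assumption are all invented for illustration; the real pass operates on JIT IR):

from collections import namedtuple

# Toy graph representation; the real pass works on JIT IR, not this structure.
Node = namedtuple("Node", ["kind", "inputs", "outputs"])
UNDEF, POISON = "undef", "poison"

def propagate_undef(nodes, state):
    # state maps value names to UNDEF, POISON, or nothing (unknown);
    # nodes are assumed to be in topological order.
    for node in nodes:
        ins = [state.get(v) for v in node.inputs]
        if node.kind == "UndefToPoison":
            # all-Undef inputs turn every output into a poison value
            if ins and all(s == UNDEF for s in ins):
                for out in node.outputs:
                    state[out] = POISON
        elif node.kind == "PoisonToUndef":
            # the one exception: poison turns back into Undef at the block boundary
            if ins[0] == POISON:
                state[node.outputs[0]] = UNDEF
        elif any(s == POISON for s in ins):
            # any other node: a single poison input poisons all of its outputs
            for out in node.outputs:
                state[out] = POISON
        elif node.kind == "add" and all(s == UNDEF for s in ins):
            # add(Undef, Undef) -> Undef; the add(Undef, x) -> x rewrites change
            # the graph itself and are omitted from this state-only sketch
            state[node.outputs[0]] = UNDEF
    return state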

Previous Proposed Solution

(This is similar to the above, but it does not handle ops with multiple outputs well, and it makes it hard to distinguish autograd addition from addition inside a single operator.)

We can introduce an 'AutogradZero' value, which represents a place where we know the value is zero but we do not know its shape, and define rules to propagate it through the graph. We have to handle propagating AutogradZero through both kinds of statements that might appear:

(1) a group of operators representing a backwards function (typically each backward takes more than a single op to compute; only the composition of these operators is guaranteed to map 0 to 0), and
(2) an addition of the outputs of (1).

For (1), assuming that the backwards function has a single input and multiple outputs (which is the case for all of the ops we currently handle, and 90%+ of autograd formulas in general), we can treat AutogradZero as a poison value -- a single input that is AutogradZero makes an op's output become AutogradZero. This will propagate through the entire formula of a group of operators representing a backward. If we ever have to write a formula with multiple inputs, we can insert ops that change AutogradZero back to Undef and allow the op to handle the behavior on its own. To handle (2), we need to distinguish between additions that appear inside a backward function (which should propagate the AutogradZero) and those that sum the derivatives from multiple uses of a value (which should drop any AutogradZeros). We can do this by introducing a special AutogradAdd op to indicate case (2), which is removed when we do the propagation pass of AutogradZeros.
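
For clarity, a minimal sketch of the intended AutogradAdd behavior, with None standing in for an undefined/AutogradZero gradient (the helper name and the None convention are assumptions for illustration):

import torch

def autograd_add(a, b):
    # Addition of gradients that tolerates undefined summands (modeled as None):
    # an undefined summand contributes nothing, and if both are undefined the
    # result stays undefined.
    if a is None:
        return b
    if b is None:
        return a
    return a + b

g = autograd_add(None, torch.ones(3))  # -> the defined summand, torch.ones(3)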

Finally, it is important that the AutogradZero token is only propagated when the graph actually represents a gradient. The poison-based propagation only works because we know that the backwards functions are implementing a matrix multiply, which maps 0 to 0. To ensure we only generate AutogradZero in gradient graphs, the autodiff can introduce autogradZeroIfUndef nodes for any undefined input, which will be replaced during the autogradZero propagation pass with autogradZero.

Alternative considered: an alternative formulation would guard each output of every backward function with propagateUndef(v, input), which would replace the computed output v with undef if input is undef. This would replace the "poison" value AutogradZero in the above formulation. However, this has the nasty property that it inserts a huge number of propagateUndef nodes, which makes reading and debugging the backward output much harder. In contrast, the poison-value formulation only inserts autogradZeroIfUndef nodes on the inputs to the graph, and the body of the graph remains the same.

@ezyang
Contributor

ezyang commented Jun 13, 2018

This sounds like it should work. It's also interesting to note that existing formulas with multiple inputs, e.g., convolution double backwards (in aten/src/ATen/native/Convolution.cpp), already handle undefined inputs for ggI, ggO, etc. correctly.

That being said:

assuming that the backwards function has a single input and multiple outputs (which is the case for all of the ops we currently handle, and 90%+ of autograd formulas in general), we can treat AutogradZero as a poison value -- a single input that is AutogradZero makes an op's output become AutogradZero.

You shouldn't assume this if you want double backwards to work :)

To ensure we only generate AutogradZero in gradient graphs, the autodiff can introduce autogradZeroIfUndef nodes for any undefined input, which will be replaced during the autogradZero propagation pass with autogradZero.

I would rather we encode linearity as an intrinsic property of operations, so that we can apply this optimization more widely (not just to backward graphs). It seems dangerous to assume that any computation reachable from AutogradZero is eligible for zeroing.

@zdevito
Contributor Author

zdevito commented Jun 15, 2018

I updated this with a tweak to the formulation that encodes linearity of a block of operators by putting UndefToPoison nodes before the op and PoisonToUndef nodes after it. These nodes capture the fact that the sub-graph between them is linear. I prefer this solution to one that uses explicit subgraphs marking the linear regions because it doesn't require us to teach other passes like isDifferentiable about the subgraphs, and it should keep the derivative graphs fairly readable.

@ezyang
Contributor

ezyang commented Jun 20, 2018

I reread the new proposal in preparation for reviewing the code, and I am still nervous about the poisoning semantics because I don't understand, semantically, what the poison operations mean. As in, imagine this were an actual programming language and you had to explain what each op means in terms of runtime semantics. What does UndefToPoison mean at runtime?

EDIT: Oh, this proposal is no longer up to date. Never mind this :)

zdevito added a commit to zdevito/pytorch that referenced this issue Jun 25, 2018
…ined

This commit implements the solution proposed in pytorch#8410
to workaround the need to create zero tensors with the same shape as inputs.
It introduces the concept of a LinearBlock which marks places in the code
where we know if all the inputs to the node are zero, then the outputs
to the node are also zero. Autodiff introduces LinearBlocks around
backwards functions, which have this property. specializeUndef then
propagates Undef nodes using this information.

Notes:
* Since we do not always specialize, we have a pass LowerLinearBlocks
that replaces the block with an if statement that dynamically guards
the Undef case.
* We introduce AutogradAdd which is addition that still works when
its inputs might be undefined. In cases where we specialize this will
get removed in favor of a normal add, but there are cases where
gradient graphs do not specialize (e.g. when they are not differentiable,
but a derivative is required) so it is important for this op to be executable.
zdevito added a commit that referenced this issue Jun 26, 2018
…ined (#8641)
@apaszke
Contributor

apaszke commented Jul 10, 2018

This has been fixed in #8641.

@apaszke apaszke closed this as completed Jul 10, 2018
eellison pushed a commit to eellison/pytorch that referenced this issue Jul 10, 2018
…ined (pytorch#8641)

eellison pushed a commit to eellison/pytorch that referenced this issue Jul 10, 2018
…ined (pytorch#8641)