Debug issue with AOTAutograd for speech_transformer/hf_GPT2/hf_T5 #85
Description
Three models - speech_transformer, hf_GPT2, and hf_T5 - fail with a similar error signature.
TorchDynamo finds static subgraphs and sends them to AOT Autograd, which generates the forward and backward graphs. The output of AOT Autograd is an autograd.Function (code). During the forward pass, AOT Autograd saves some tensors for the gradient computation in the backward pass.
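For reference, the emitted autograd.Function is shaped roughly like this (a simplified sketch, not the actual source; compiled_fw, compiled_bw, and num_outs stand in for the values the real code closes over):

```python
import torch

class CompiledFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, *args):
        fw_outs = compiled_fw(*args)
        # Everything past the real outputs is an activation that the
        # backward graph needs; stash those on ctx.
        ctx.save_for_backward(*fw_outs[num_outs:])
        return tuple(fw_outs[:num_outs])

    @staticmethod
    def backward(ctx, *grad_outs):
        # ctx.saved_tensors is where the failure shows up: every entry
        # should round-trip back as a Tensor.
        return tuple(compiled_bw(*ctx.saved_tensors, *grad_outs))
```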
The issue arises in the backward pass. When we read ctx.saved_tensors, one of the items is no longer of Tensor type, which causes cryptic error messages like the one below. The offending type also changes from run to run: I have seen immutable_dict, tuple, and even weakref and builtin.
```
ERROR:root:unhandled error
Traceback (most recent call last):
File "torchbench.py", line 1006, in run_one_model
new_result = model_iter_fn(model, example_inputs)
File "torchbench.py", line 482, in forward_and_backward_pass
def forward_and_backward_pass(mod, inputs, collect_outputs=True):
File "torchbench.py", line 482, in forward_and_backward_pass
def forward_and_backward_pass(mod, inputs, collect_outputs=True):
File "torchbench.py", line 482, in forward_and_backward_pass
def forward_and_backward_pass(mod, inputs, collect_outputs=True):
[Previous line repeated 2 more times]
File "/fsx/users/anijain/functorch/functorch/_src/monkey_patching.py", line 97, in _backward
return _old_backward(*args, **kwargs)
File "/data/home/anijain/miniconda/envs/pytorch_dev/lib/python3.8/site-packages/torch/_tensor.py", line 395, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/data/home/anijain/miniconda/envs/pytorch_dev/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/data/home/anijain/miniconda/envs/pytorch_dev/lib/python3.8/site-packages/torch/autograd/function.py", line 253, in apply
return user_fn(self, *args)
File "/fsx/users/anijain/torchdynamo/torchdynamo/eval_frame.py", line 58, in _fn
return fn(*args, **kwargs)
File "/fsx/users/anijain/functorch/functorch/_src/aot_autograd.py", line 188, in backward
out = normalize_as_list(compiled_bw(*ctx.saved_tensors, *contiguous_args))
File "/fsx/users/anijain/torchdynamo/torchdynamo/eval_frame.py", line 58, in _fn
return fn(*args, **kwargs)
File "/data/home/anijain/miniconda/envs/pytorch_dev/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
RuntimeError: forward() Expected a value of type 'Tensor (inferred)' for argument 'primals_14' but instead found type 'tuple'.
Inferred 'primals_14' to be of type 'Tensor' because it was not annotated with an explicit type.
Position: 19
Value: ('___check_obj_id', '___check_tensors', '___check_type_id', '___guarded_code')
```
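A quick Python-side check localizes which entry is corrupted before dropping into C++. A hedged sketch of a hypothetical debug patch (not in functorch) that could go just above the compiled_bw call in aot_autograd.py's backward:

```python
# Fail fast with the offending index and type instead of the cryptic
# TorchScript argument error above.
for i, t in enumerate(ctx.saved_tensors):
    if not isinstance(t, torch.Tensor):
        raise RuntimeError(
            f"saved_tensors[{i}] came back as {type(t).__name__!r}: {t!r}"
        )
out = normalize_as_list(compiled_bw(*ctx.saved_tensors, *contiguous_args))
```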
I looked further into the C++ and started printing the type of the objects while saving the tensors at the end of the forward pass and while reading them back in the backward pass. I observed the weird behavior at this line: https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/python_function.cpp#L834. This code runs in the backward pass, when we call ctx.saved_tensors.
When I print the unpacked_var, it is a tensor: it has a dim, and I can print its shape and everything. But Py_TYPE(value)->tp_name equals immutable_dict here.
The unpack_fn is basically THPVariable_Wrap (https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/python_function.cpp#L849).
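A Python-level way to watch the same save/read round trip without recompiling PyTorch might be saved-tensor hooks. This is an assumption on my part: it requires a build with torch.autograd.graph.saved_tensors_hooks, and the corruption has to be visible above the C++ wrap layer for the unpack hook to see it.

```python
import torch

def pack(t):
    print("pack:", type(t).__name__, tuple(t.shape))
    return t

def unpack(x):
    # Runs when ctx.saved_tensors is read in the backward pass; if the
    # stored object was corrupted in between, type(x) is not Tensor.
    print("unpack:", type(x).__name__)
    return x

with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
    model_iter_fn(model, example_inputs)  # names from torchbench.py above
```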
For completeness, images of the failure are attached.
Repro:
```
python torchbench.py --training --devices=cuda --accuracy-aot-nop --only=hf_GPT2
python torchbench.py --training --devices=cuda --accuracy-aot-nop --only=speech_transformer
python torchbench.py --training --devices=cuda --accuracy-aot-nop --only=hf_T5
```