This repository was archived by the owner on Aug 1, 2025. It is now read-only.

Debug issue with AOTAutograd for speech_transformer/hf_GPT2/hf_T5 #85

@anijain2305

Description

The three models - speech_transformer, hf_GPT2 and hf_T5 - fail with a similar error signature.

TorchDynamo finds static subgraphs and sends them to AOT Autograd. AOT Autograd generates the forward and backward graphs; its output is an autograd.Function (code). During the forward pass, AOT Autograd saves some tensors for the gradient computation in the backward pass.
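
For context, the generated autograd.Function has roughly the following shape. This is a minimal sketch, not AOT Autograd's actual codegen: compiled_fw and compiled_bw are toy stand-ins for the compiled forward and backward graphs.

    import torch

    # Minimal sketch of the pattern AOT Autograd emits; compiled_fw/compiled_bw
    # are toy stand-ins for the real compiled forward/backward graphs.
    def compiled_fw(x, w):
        out = x * w
        return out, x, w                    # output, plus tensors saved for backward

    def compiled_bw(x, w, grad_out):
        return grad_out * w, grad_out * x   # grads w.r.t. (x, w)

    class CompiledFunction(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, w):
            out, *to_save = compiled_fw(x, w)
            ctx.save_for_backward(*to_save)  # read back via ctx.saved_tensors
            return out

        @staticmethod
        def backward(ctx, grad_out):
            # The failing step on these models: one entry of ctx.saved_tensors
            # comes back with a non-Tensor type.
            return compiled_bw(*ctx.saved_tensors, grad_out)

    x = torch.randn(3, requires_grad=True)
    w = torch.randn(3, requires_grad=True)
    CompiledFunction.apply(x, w).sum().backward()

The backward here is where the traceback below lands (aot_autograd.py line 188, the compiled_bw(*ctx.saved_tensors, ...) call).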

The issue arises in the backward pass. When we read saved_tensors, one of the items is no longer of Tensor type. This causes cryptic error messages like the one below, and the stray type changes from run to run - I have seen immutable_dict, tuple, and even weakref and builtin.

ERROR:root:unhandled error
Traceback (most recent call last):
  File "torchbench.py", line 1006, in run_one_model
    new_result = model_iter_fn(model, example_inputs)
  File "torchbench.py", line 482, in forward_and_backward_pass
    def forward_and_backward_pass(mod, inputs, collect_outputs=True):
  File "torchbench.py", line 482, in forward_and_backward_pass
    def forward_and_backward_pass(mod, inputs, collect_outputs=True):
  File "torchbench.py", line 482, in forward_and_backward_pass
    def forward_and_backward_pass(mod, inputs, collect_outputs=True):
  [Previous line repeated 2 more times]
  File "/fsx/users/anijain/functorch/functorch/_src/monkey_patching.py", line 97, in _backward
    return _old_backward(*args, **kwargs)
  File "/data/home/anijain/miniconda/envs/pytorch_dev/lib/python3.8/site-packages/torch/_tensor.py", line 395, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/data/home/anijain/miniconda/envs/pytorch_dev/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/data/home/anijain/miniconda/envs/pytorch_dev/lib/python3.8/site-packages/torch/autograd/function.py", line 253, in apply
    return user_fn(self, *args)
  File "/fsx/users/anijain/torchdynamo/torchdynamo/eval_frame.py", line 58, in _fn
    return fn(*args, **kwargs)
  File "/fsx/users/anijain/functorch/functorch/_src/aot_autograd.py", line 188, in backward
    out = normalize_as_list(compiled_bw(*ctx.saved_tensors, *contiguous_args))
  File "/fsx/users/anijain/torchdynamo/torchdynamo/eval_frame.py", line 58, in _fn
    return fn(*args, **kwargs)
  File "/data/home/anijain/miniconda/envs/pytorch_dev/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
RuntimeError: forward() Expected a value of type 'Tensor (inferred)' for argument 'primals_14' but instead found type 'tuple'.
Inferred 'primals_14' to be of type 'Tensor' because it was not annotated with an explicit type.
Position: 19
Value: ('___check_obj_id', '___check_tensors', '___check_type_id', '___guarded_code')
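
One way to localize the failure before it reaches the compiled module's argument check is a validation shim in front of the compiled_bw call in aot_autograd.py's backward. The checked_saved_tensors helper below is hypothetical (it is not in the codebase), just a sketch of the probe:

    import torch

    def checked_saved_tensors(ctx):
        # Hypothetical debugging helper: fail loudly with the slot index and the
        # stray type, instead of the cryptic argument-type error above.
        saved = ctx.saved_tensors
        for i, t in enumerate(saved):
            if not isinstance(t, torch.Tensor):
                raise TypeError(
                    f"saved_tensors[{i}] is {type(t).__name__!r}, not Tensor: {t!r}"
                )
        return saved

With that check, the error would name the corrupted slot directly (primals_14 at position 19 above) instead of failing inside the scripted forward().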

I looked further into the C++ and started printing the types of the objects while saving the tensors at the end of the forward pass and reading them back in the backward pass. I observed the weird behavior at this line: https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/python_function.cpp#L834. It is hit in the backward pass, when we call ctx.saved_tensors.

When I print the unpacked_var, it is a tensor: it has its dim, and I can print its shape and everything.
But Py_TYPE(value)->tp_name is immutable_dict here.
The unpack_fn is basically THPVariable_Wrap (https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/python_function.cpp#L849).
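
In Python terms, the unpack path around that line does roughly the following (a sketch of the C++ unpack_saved_variables loop for illustration, not actual PyTorch source):

    def unpack_saved_variables(saved_variables, unpack_fn):
        # Sketch of the C++ loop in python_function.cpp: each SavedVariable is
        # unpacked to a C++ tensor, then wrapped back into a Python object.
        values = []
        for saved in saved_variables:
            unpacked_var = saved.unpack()    # the tensor looks fine here (dim, shape)
            value = unpack_fn(unpacked_var)  # unpack_fn is THPVariable_Wrap
            # Observed bug: type(value).__name__ (Py_TYPE(value)->tp_name in C++)
            # is 'immutable_dict' here instead of 'Tensor'.
            values.append(value)
        return tuple(values)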

For completeness, the repro commands for each failure:

Repro - python torchbench.py --training --devices=cuda --accuracy-aot-nop --only=hf_GPT2

Repro - python torchbench.py --training --devices=cuda --accuracy-aot-nop --only=speech_transformer

Repro - python torchbench.py --training --devices=cuda --accuracy-aot-nop --only=hf_T5
