Debug issue with AOTAutograd for speech_transformer/hf_GPT2/hf_T5 #85
Repro for the bug
@Chillee has been looking at this and has isolated the problem with an even smaller repro than the one above. Assigning this to him.
Resolved in pytorch/pytorch#75933
The three models speech_transformer, hf_GPT2, and hf_T5 fail with a similar error signature.
TorchDynamo finds static subgraphs and sends them to AOT Autograd, which generates the forward and backward graphs. The output of AOT Autograd is an autograd.Function (code). During the forward pass, AOT Autograd saves some tensors for gradient computation in the backward pass.
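To illustrate the pattern AOT Autograd relies on, here is a minimal sketch of a custom autograd.Function that saves a tensor in forward and reads it back via ctx.saved_tensors in backward. The Square class is a hypothetical example for illustration, not AOT Autograd's actual generated code:

```python
import torch

class Square(torch.autograd.Function):
    """Minimal example of the save_for_backward / saved_tensors pattern."""

    @staticmethod
    def forward(ctx, x):
        # Stash the input so the backward pass can use it.
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_out):
        # This unpacking step is where the bug manifests: one of the
        # unpacked items is sometimes no longer a Tensor.
        (x,) = ctx.saved_tensors
        return grad_out * 2 * x

x = torch.tensor([3.0], requires_grad=True)
y = Square.apply(x)
y.backward(torch.ones_like(y))
print(x.grad)  # d(x^2)/dx at x=3 is 6
```

In the failing models, the objects coming out of `ctx.saved_tensors` should all be tensors, but one of them is not.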
The issue arises in the backward pass. When we read saved_tensors, one of the items is no longer of Tensor type. This causes cryptic error messages like the one below, and the reported type changes from run to run: I have seen immutable_dict, tuple, and even weakref and builtin. I dug further into the C++ side and printed the types of the objects while saving the tensors at the end of the forward pass and while reading them back in the backward pass. I observed the weird behavior at this line: https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/python_function.cpp#L834, which runs in the backward pass when we call ctx.saved_tensors.
When I print the unpacked_var, it is a tensor: it has a dim, and I can print its shape and everything. But Py_TYPE(value)->tp_name equals immutable_dict here.
The unpack_fn is basically THPVariable_Wrap: https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/python_function.cpp#L849.
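The Python-level analogue of the C check `Py_TYPE(value)->tp_name` is `type(obj).__name__`, which is one way to reproduce the kind of inspection described above. A small sketch, using hypothetical stand-in objects (an immutable mapping and a tuple mimic the corrupted slot types reported in the bug; only a real Tensor would be valid):

```python
from types import MappingProxyType  # stand-in for an "immutable_dict"-like object

# Hypothetical saved slots as read back in the backward pass. In the bug,
# slots like these show up where a Tensor is expected.
slots = [MappingProxyType({"w": 1}), (1, 2)]

# type(obj).__name__ mirrors Py_TYPE(value)->tp_name on the C side.
names = [type(obj).__name__ for obj in slots]
print(names)  # ['mappingproxy', 'tuple'] -- neither is a Tensor
```

Printing these type names at the unpack site is essentially what the C++ instrumentation above does.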
For completeness, images of the failures are attached.
Repro (hf_GPT2):
python torchbench.py --training --devices=cuda --accuracy-aot-nop --only=hf_GPT2

Repro (speech_transformer):
python torchbench.py --training --devices=cuda --accuracy-aot-nop --only=speech_transformer

Repro (hf_T5):
python torchbench.py --training --devices=cuda --accuracy-aot-nop --only=hf_T5