test_detection_model[cuda-fasterrcnn_mobilenet_v3_large_320_fpn] failing due to NVFuser #6015


Closed
davidberard98 opened this issue May 13, 2022 · 7 comments

@davidberard98
Contributor

🐛 Describe the bug

The convolution decomposition has a bug in NVFuser; currently looking into it with @jjsjann123.
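
For reference, a minimal repro sketch along the lines of the failing test (the model name and tolerances come from the failing test parameterization; the actual harness is test/test_models.py::test_detection_model, and the internal NVFuser toggle below is an assumption about what the CI build enables):

```python
# Minimal repro sketch, not the exact test code from test/test_models.py.
import torch
import torchvision

# The failure only shows up when NVFuser is the active TorchScript fuser;
# depending on the build, this internal toggle may be needed.
torch._C._jit_set_nvfuser_enabled(True)

model = torchvision.models.detection.fasterrcnn_mobilenet_v3_large_320_fpn()
model.eval().cuda()

x = [torch.rand(3, 320, 320, device="cuda")]

with torch.no_grad():
    eager_out = model(x)
    sm = torch.jit.script(model)
    # Run the scripted model more than once so the profiling executor
    # specializes the graph and hands it off to NVFuser.
    for _ in range(3):
        script_out = sm(x)

# Scripted detection models return (losses, detections); compare the
# detections against eager output. On the failing CI runs this either hits
# the graph_fuser internal assert or reports a large mismatch.
torch.testing.assert_close(eager_out, script_out[1], atol=1e-4, rtol=1e-4)
```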

Versions

CI: the Windows GPU and Linux (Python 3.10) GPU jobs.

@davidberard98 davidberard98 self-assigned this May 13, 2022
@jjsjann123

Just for my own record. Failing log:
https://circleci.com/api/v1.1/project/github/pytorch/vision/1423677/output/109/0?file=true&allocation-id=627e7d34dec82e3918d717ca-0-build%2F1OQ0W9PF

=================================== FAILURES ===================================
_______ test_detection_model[cuda-fasterrcnn_mobilenet_v3_large_320_fpn] _______
Traceback (most recent call last):
  File "/home/circleci/project/test/test_models.py", line 769, in test_detection_model
    _check_jit_scriptable(model, ([x],), unwrapper=script_model_unwrapper.get(model_name, None), eager_out=out)
  File "/home/circleci/project/test/test_models.py", line 140, in _check_jit_scriptable
    script_out = sm(*args)
  File "/home/circleci/project/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1144, in _call_impl
    return forward_call(*input, **kwargs)
RuntimeError: bias_size_opt.has_value() INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1652425840433/work/torch/csrc/jit/codegen/cuda/graph_fuser.cpp":2220, please report a bug to PyTorch. concrete shape for bias input to conv2d are required

https://circleci.com/api/v1.1/project/github/pytorch/vision/1422234/output/110/0?file=true&allocation-id=627e3bd4750ee26ac7a677b5-0-build%2F550F08F9


================================== FAILURES ===================================
______ test_detection_model[cuda-fasterrcnn_mobilenet_v3_large_320_fpn] _______
Traceback (most recent call last):
  File "C:\Users\circleci\project\test\test_models.py", line 769, in test_detection_model
    _check_jit_scriptable(model, ([x],), unwrapper=script_model_unwrapper.get(model_name, None), eager_out=out)
  File "C:\Users\circleci\project\test\test_models.py", line 144, in _check_jit_scriptable
    torch.testing.assert_close(eager_out, script_out, atol=1e-4, rtol=1e-4)
  File "C:\Users\circleci\project\env\lib\site-packages\torch\testing\_comparison.py", line 1304, in assert_close
    assert_equal(
  File "C:\Users\circleci\project\env\lib\site-packages\torch\testing\_comparison.py", line 1074, in assert_equal
    raise error_metas[0].to_error()
AssertionError: Tensor-likes are not close!

Mismatched elements: 358 / 400 (89.5%)
Greatest absolute difference: 255.36756038665771 at index (14, 2) (up to 0.0001 allowed)
Greatest relative difference: 58729.45237970024 at index (52, 1) (up to 0.0001 allowed)

@jjsjann123

Looking at the output values (comparing JIT and eager), I suspect there's some indexing issue...
Kevin pointed out a while ago that something was mysteriously fixed in devel but not in upstream: pytorch/pytorch#76790 (comment)

I'm waiting on my build of devel merged into upstream to try my luck there.

FYI, @eellison suggested dropout. I went through the TS graph and didn't spot any dropout/rand-like ops there.
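
For the record, the check was along these lines (a sketch; it assumes `sm` has already been run, e.g. as in the repro sketch in the issue description):

```python
# Dump the optimized TorchScript graph from the last run of the scripted
# model and look for dropout / rand-like ops. None showed up for this model.
import torch

graph_str = str(torch.jit.last_executed_optimized_graph())
for op in ("aten::dropout", "aten::rand", "aten::rand_like", "aten::bernoulli"):
    print(op, "present:", f"{op}(" in graph_str)
```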

@jjsjann123

Hmmmm. I think it's a combination of indexing and permutation support.

The devel branch does get rid of the mismatched output, so the mysterious indexing fixes are real (though we never really pinned down which PR actually fixed it).
I'm trying to figure out from the profiling information where exactly we are seeing a permutation.
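
To illustrate what "permutation" means here (my assumption: a channels-last input, whose strides are permuted relative to contiguous NCHW and which NVFuser handles through its permutation support):

```python
# Illustration only: a channels-last tensor carries permuted strides.
import torch

nchw = torch.rand(1, 3, 320, 320, device="cuda")
nhwc = nchw.to(memory_format=torch.channels_last)

print(nchw.stride())  # (307200, 102400, 320, 1) -> contiguous NCHW
print(nhwc.stride())  # (307200, 1, 960, 3)      -> permuted (channels-last)
print(nhwc.is_contiguous(memory_format=torch.channels_last))  # True
```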

@jjsjann123

jjsjann123 commented May 13, 2022

errrr.... the indexing issue is not related to permutation... I patched that anyway in pytorch/pytorch#77460.
But the test is still failing with a wrong result on upstream/master.
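
In the meantime, a possible stopgap (just a sketch, not what the CI does) is to take NVFuser out of the picture so TorchScript falls back to the NNC path:

```python
import torch

# Internal toggle: disable NVFuser globally for TorchScript
# (returns the previous setting).
torch._C._jit_set_nvfuser_enabled(False)

# Or scope it: "fuser1" selects the NNC fuser instead of NVFuser ("fuser2").
with torch.jit.fuser("fuser1"):
    script_out = sm(x)  # `sm` and `x` as in the repro sketch above
```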

@datumbox
Contributor

datumbox commented May 14, 2022

I confirm that TorchVision's latest main branch is still failing. See #6017

@davidberard98 @jjsjann123 Could you please revert the offending PR to resolve the failure? We have been facing breakages from upstream commits on core for several weeks now, and this has been very disruptive for the project. Please help us restore the CI to green and run the necessary tests prior to relanding to ensure we won't break TorchVision's tests.

@jjsjann123

> I confirm that TorchVision's latest main branch is still failing. See #6017
>
> @davidberard98 @jjsjann123 Could you please revert the offending PR to resolve the failure? We have been facing breakages from upstream commits on core for several weeks now, and this has been very disruptive for the project. Please help us restore the CI to green and run the necessary tests prior to relanding to ensure we won't break TorchVision's tests.

Sorry for the inconvenience and confusion.
The main branch is still failing because our fix requires pushing our devel branch into upstream: pytorch/pytorch#77471.
That PR is taking a bit longer to wrap up; we are still fighting a build issue on the ROCm system.

I verified the fix on my local machine last Friday.

@datumbox
Contributor

@davidberard98 @jjsjann123 Thanks a lot for helping us fix the breakage. I confirm that the issue is now resolved on TorchVision's latest main branch (d9a6950).

seemethere pushed a commit to pytorch/pytorch that referenced this issue May 18, 2022
Updating nvfuser code base.

This should fix the indexing issue observed in pytorch/vision#6015.

Running tests locally as well. Will update the description here at a later point

@bypass-github-export-checks
Pull Request resolved: #77471
Approved by: https://github.com/seemethere, https://github.com/eellison
facebook-github-bot pushed a commit to pytorch/pytorch that referenced this issue May 18, 2022
Pull Request resolved: #77471

Reviewed By: malfet, seemethere

Differential Revision: D36393120

Pulled By: eellison

fbshipit-source-id: 876f2d066e8e54b5d076de66ad1811f6970be1c8
jjsjann123 added a commit to jjsjann123/nvfuser that referenced this issue Oct 29, 2022
jjsjann123 added a commit to jjsjann123/nvfuser that referenced this issue Nov 10, 2022