test_detection_model[cuda-fasterrcnn_mobilenet_v3_large_320_fpn] failing due to NVFuser #6015


Closed
davidberard98 opened this issue May 13, 2022 · 7 comments

@davidberard98
Contributor

🐛 Describe the bug

The convolution decomposition has a bug in NVFuser; currently looking into it with @jjsjann123.
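
For reference, a minimal repro sketch along the lines of the failing test (the model name and tolerances come from the failing test parameterization; the actual harness is test/test_models.py::test_detection_model, and the internal NVFuser toggle below is an assumption about what the CI build enables):

```python
# Minimal repro sketch, not the exact test code from test/test_models.py.
import torch
import torchvision

# The failure only shows up when NVFuser is the active TorchScript fuser;
# depending on the build, this internal toggle may be needed.
torch._C._jit_set_nvfuser_enabled(True)

model = torchvision.models.detection.fasterrcnn_mobilenet_v3_large_320_fpn()
model.eval().cuda()

x = [torch.rand(3, 320, 320, device="cuda")]

with torch.no_grad():
    eager_out = model(x)
    sm = torch.jit.script(model)
    # Run the scripted model more than once so the profiling executor
    # specializes the graph and hands it off to NVFuser.
    for _ in range(3):
        script_out = sm(x)

# Scripted detection models return (losses, detections); compare the
# detections against eager output. On the failing CI runs this either hits
# the graph_fuser internal assert or reports a large mismatch.
torch.testing.assert_close(eager_out, script_out[1], atol=1e-4, rtol=1e-4)
```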

Versions

CI: the Windows GPU and Linux (Python 3.10) GPU jobs.

@davidberard98 davidberard98 self-assigned this May 13, 2022
@jjsjann123

Just for my own record. Failing log:
https://circleci.com/api/v1.1/project/github/pytorch/vision/1423677/output/109/0?file=true&allocation-id=627e7d34dec82e3918d717ca-0-build%2F1OQ0W9PF

=================================== FAILURES ===================================
_______ test_detection_model[cuda-fasterrcnn_mobilenet_v3_large_320_fpn] _______
Traceback (most recent call last):
  File "/home/circleci/project/test/test_models.py", line 769, in test_detection_model
    _check_jit_scriptable(model, ([x],), unwrapper=script_model_unwrapper.get(model_name, None), eager_out=out)
  File "/home/circleci/project/test/test_models.py", line 140, in _check_jit_scriptable
    script_out = sm(*args)
  File "/home/circleci/project/env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1144, in _call_impl
    return forward_call(*input, **kwargs)
RuntimeError: bias_size_opt.has_value() INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1652425840433/work/torch/csrc/jit/codegen/cuda/graph_fuser.cpp":2220, please report a bug to PyTorch. concrete shape for bias input to conv2d are required

https://circleci.com/api/v1.1/project/github/pytorch/vision/1422234/output/110/0?file=true&allocation-id=627e3bd4750ee26ac7a677b5-0-build%2F550F08F9


================================== FAILURES ===================================
______ test_detection_model[cuda-fasterrcnn_mobilenet_v3_large_320_fpn] _______
Traceback (most recent call last):
  File "C:\Users\circleci\project\test\test_models.py", line 769, in test_detection_model
    _check_jit_scriptable(model, ([x],), unwrapper=script_model_unwrapper.get(model_name, None), eager_out=out)
  File "C:\Users\circleci\project\test\test_models.py", line 144, in _check_jit_scriptable
    torch.testing.assert_close(eager_out, script_out, atol=1e-4, rtol=1e-4)
  File "C:\Users\circleci\project\env\lib\site-packages\torch\testing\_comparison.py", line 1304, in assert_close
    assert_equal(
  File "C:\Users\circleci\project\env\lib\site-packages\torch\testing\_comparison.py", line 1074, in assert_equal
    raise error_metas[0].to_error()
AssertionError: Tensor-likes are not close!

Mismatched elements: 358 / 400 (89.5%)
Greatest absolute difference: 255.36756038665771 at index (14, 2) (up to 0.0001 allowed)
Greatest relative difference: 58729.45237970024 at index (52, 1) (up to 0.0001 allowed)

@jjsjann123

Looking at the output values (comparing JIT and eager), I suspect there's some indexing issue...
Kevin pointed out a while ago that something was mysteriously fixed in devel but not in upstream: pytorch/pytorch#76790 (comment)

I'm waiting on my build of devel merged into upstream to try my luck there.

FYI, @eellison suggested dropout. I went through the TS graph and didn't spot any dropout/rand-like ops there.
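
For the record, the check was along these lines (a sketch; it assumes `sm` has already been run, e.g. as in the repro sketch in the issue description):

```python
# Dump the optimized TorchScript graph from the last run of the scripted
# model and look for dropout / rand-like ops. None showed up for this model.
import torch

graph_str = str(torch.jit.last_executed_optimized_graph())
for op in ("aten::dropout", "aten::rand", "aten::rand_like", "aten::bernoulli"):
    print(op, "present:", f"{op}(" in graph_str)
```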

@jjsjann123

Hmmmm. I think it's a combination of indexing and permutation support.

The devel branch does get rid of the mismatched output, so the mysterious indexing fixes are real (though we never really pinned down which PR actually fixed it).
I'm trying to figure out from the profiling information where exactly we are seeing a permutation.
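
To illustrate what "permutation" means here (my assumption: a channels-last input, whose strides are permuted relative to contiguous NCHW and which NVFuser handles through its permutation support):

```python
# Illustration only: a channels-last tensor carries permuted strides.
import torch

nchw = torch.rand(1, 3, 320, 320, device="cuda")
nhwc = nchw.to(memory_format=torch.channels_last)

print(nchw.stride())  # (307200, 102400, 320, 1) -> contiguous NCHW
print(nhwc.stride())  # (307200, 1, 960, 3)      -> permuted (channels-last)
print(nhwc.is_contiguous(memory_format=torch.channels_last))  # True
```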

@jjsjann123

jjsjann123 commented May 13, 2022

errrr.... the indexing issue is not related to permutation... I patched that anyway in pytorch/pytorch#77460.
But the test is still failing with a wrong result on upstream/master.
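
In the meantime, a possible stopgap (just a sketch, not what the CI does) is to take NVFuser out of the picture so TorchScript falls back to the NNC path:

```python
import torch

# Internal toggle: disable NVFuser globally for TorchScript
# (returns the previous setting).
torch._C._jit_set_nvfuser_enabled(False)

# Or scope it: "fuser1" selects the NNC fuser instead of NVFuser ("fuser2").
with torch.jit.fuser("fuser1"):
    script_out = sm(x)  # `sm` and `x` as in the repro sketch above
```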

@datumbox
Contributor

datumbox commented May 14, 2022

I confirm that TorchVision's latest main branch is still failing. See #6017

@davidberard98 @jjsjann123 Could you please revert the offending PR to resolve the failure? We have been facing breakages from upstream commits on core for several weeks now, and this has been very disruptive for the project. Please help us restore the CI to green and run the necessary tests prior to relanding to ensure we won't break TorchVision's tests.

@jjsjann123

> I confirm that TorchVision's latest main branch is still failing. See #6017
>
> @davidberard98 @jjsjann123 Could you please revert the offending PR to resolve the failure? We have been facing breakages from upstream commits on core for several weeks now, and this has been very disruptive for the project. Please help us restore the CI to green and run the necessary tests prior to relanding to ensure we won't break TorchVision's tests.

Sorry for the inconvenience and confusion.
The main branch is still failing because our fix requires pushing our devel branch into upstream: pytorch/pytorch#77471.
That PR is taking a bit longer to wrap up; we are still fighting a build issue on the ROCm system.

I verified the fix on my local machine last Friday.

@datumbox
Contributor

@davidberard98 @jjsjann123 Thanks a lot for helping us fix the breakage. I confirm that the issue is now resolved on TorchVision's latest main branch (d9a6950).

seemethere pushed a commit to pytorch/pytorch that referenced this issue May 18, 2022
Updating nvfuser code base.

This should fix the indexing issue observed in pytorch/vision#6015.

Running tests locally as well. Will update the description here at a later point

@bypass-github-export-checks
Pull Request resolved: #77471
Approved by: https://github.com/seemethere, https://github.com/eellison
facebook-github-bot pushed a commit to pytorch/pytorch that referenced this issue May 18, 2022
Pull Request resolved: #77471

Reviewed By: malfet, seemethere

Differential Revision: D36393120

Pulled By: eellison

fbshipit-source-id: 876f2d066e8e54b5d076de66ad1811f6970be1c8
jjsjann123 added a commit to jjsjann123/nvfuser that referenced this issue Oct 29, 2022
jjsjann123 added a commit to jjsjann123/nvfuser that referenced this issue Nov 10, 2022