Reduce unnecessary cuda sync in anchor_utils.py #5515


Merged
merged 2 commits into pytorch:main from xz9/improve-stride on Mar 4, 2022

Conversation

@xuzhao9 (Contributor) commented Mar 2, 2022

This PR reduces unnecessary CUDA synchronization in the Mask R-CNN model code to improve its performance.
We evaluate the impact of this PR with the vision_maskrcnn model in TorchBench.

Before the patch:

$ python run.py vision_maskrcnn -d cuda -t eval
Running eval method from vision_maskrcnn on cuda in eager mode.
GPU Time:            163.383 milliseconds
CPU Dispatch Time:   163.380 milliseconds
CPU Total Wall Time: 163.425 milliseconds
$ python run.py vision_maskrcnn -d cuda -t train
Running train method from vision_maskrcnn on cuda in eager mode.
GPU Time:            236.189 milliseconds
CPU Dispatch Time:   165.670 milliseconds
CPU Total Wall Time: 236.255 milliseconds

We observe 3 CUDA sync events in anchor_utils.py:

/data/home/xzhao9/cluster/miniconda3/envs/py38/lib/python3.8/site-packages/torchvision-0.13.0a0+71d2bb0-py3.8-linux-x86_64.egg/torchvision/models/detection/anchor_utils.py:124: UserWarning: called a synchronizing CUDA operation (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:147.)
  torch.tensor(image_size[0] // g[0], dtype=torch.int64, device=device),
/data/home/xzhao9/cluster/miniconda3/envs/py38/lib/python3.8/site-packages/torchvision-0.13.0a0+71d2bb0-py3.8-linux-x86_64.egg/torchvision/models/detection/anchor_utils.py:125: UserWarning: called a synchronizing CUDA operation (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:147.)
  torch.tensor(image_size[1] // g[1], dtype=torch.int64, device=device),
/data/home/xzhao9/cluster/miniconda3/envs/py38/lib/python3.8/site-packages/torchvision-0.13.0a0+71d2bb0-py3.8-linux-x86_64.egg/torchvision/models/detection/anchor_utils.py:79: UserWarning: called a synchronizing CUDA operation (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:147.)
  self.cell_anchors = [cell_anchor.to(dtype=dtype, device=device) for cell_anchor in self.cell_anchors]
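For reference, warnings like these can be surfaced with PyTorch's CUDA sync debug mode (a minimal sketch, assuming PyTorch >= 1.10; the sizes are made up):

import torch

# "warn" (1) emits a UserWarning whenever an op forces a host/device
# synchronization; "error" (2) raises instead, "default" (0) disables it.
torch.cuda.set_sync_debug_mode("warn")

# Building a CUDA tensor from a Python scalar goes through a synchronizing
# host-to-device copy, which is what trips the warnings above.
stride = torch.tensor(800 // 25, dtype=torch.int64, device="cuda")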

After the patch:

$ python run.py vision_maskrcnn -d cuda -t eval
Running eval method from vision_maskrcnn on cuda in eager mode.
GPU Time:            164.010 milliseconds
CPU Dispatch Time:   164.037 milliseconds
CPU Total Wall Time: 164.083 milliseconds
$ python run.py vision_maskrcnn -d cuda -t train
Running train method from vision_maskrcnn on cuda in eager mode.
GPU Time:            235.582 milliseconds
CPU Dispatch Time:   165.316 milliseconds
CPU Total Wall Time: 235.683 milliseconds

Although there is no obvious change in runtime, we now observe only one CUDA sync event in anchor_utils.py:

/data/home/xzhao9/cluster/miniconda3/envs/py38/lib/python3.8/site-packages/torchvision-0.13.0a0+7f0faf0-py3.8-linux-x86_64.egg/torchvision/models/detection/anchor_utils.py:79: UserWarning: called a synchronizing CUDA operation (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:147.)
  self.cell_anchors = [cell_anchor.to(dtype=dtype, device=device) for cell_anchor in self.cell_anchors]

Even so, we believe this is a worthwhile improvement.
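For context, the change replaces the torch.tensor(...) scalar constructions above with the empty().fill_() idiom discussed in the review below. A minimal sketch of the idea (illustrative values, not the exact diff):

import torch

device = torch.device("cuda")
image_size, g = (800, 800), (25, 25)  # made-up sizes for illustration

# Before: torch.tensor() materializes the scalar on CPU and copies it to
# the GPU, a synchronizing host-to-device transfer.
stride_h = torch.tensor(image_size[0] // g[0], dtype=torch.int64, device=device)

# After: allocate a 0-dim tensor directly on the GPU and fill it with an
# asynchronously launched kernel; no synchronization is required.
stride_h = torch.empty((), dtype=torch.int64, device=device).fill_(image_size[0] // g[0])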

@facebook-github-bot commented Mar 2, 2022

💊 CI failures summary and remediations

As of commit 582eba6 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

@xuzhao9 xuzhao9 changed the title from "[WIP] Attempt to reduce unnecessary cuda sync" to "Attempt to reduce unnecessary cuda sync" Mar 2, 2022
@xuzhao9 xuzhao9 changed the title from "Attempt to reduce unnecessary cuda sync" to "Reduce unnecessary cuda sync in anchor_utils.py" Mar 2, 2022
@xuzhao9 xuzhao9 requested a review from pmeier March 2, 2022 21:14
@pmeier (Collaborator) commented Mar 2, 2022

I'm only listed in blame due to #4384.

@pmeier pmeier requested review from datumbox and removed request for pmeier March 2, 2022 21:22
@datumbox (Contributor) left a comment

I'm OK switching to the empty().fill_() idiom given the analysis. It's something we already use in other places in TorchVision anyway:

noise = torch.empty((N, C, H - block_size + 1, W - block_size + 1), dtype=input.dtype, device=input.device)
noise.bernoulli_(gamma)
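(Same pattern as the fix: torch.empty() allocates uninitialized memory directly on the target device, and the in-place bernoulli_() fills it with an asynchronously launched kernel, so no host-to-device copy or synchronization is involved.)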

@datumbox datumbox merged commit a784db4 into pytorch:main Mar 4, 2022
@github-actions (bot) commented Mar 4, 2022

Hey @datumbox!

You merged this PR, but no labels were added. The list of valid labels is available at https://github.com/pytorch/vision/blob/main/.github/process_commit.py

@datumbox datumbox added the enhancement, module: models, and Perf (for performance improvements) labels Mar 4, 2022
@xuzhao9 xuzhao9 deleted the xz9/improve-stride branch March 4, 2022 20:34
facebook-github-bot pushed a commit that referenced this pull request Mar 15, 2022
Summary: Co-authored-by: Vasilis Vryniotis <[email protected]>

Reviewed By: vmoens

Differential Revision: D34879001

fbshipit-source-id: 5830dec79b5f80fa20b55862c84906a04898aa80