Reduce unnecessary cuda sync in anchor_utils.py #5515


Merged
merged 2 commits into pytorch:main from xz9/improve-stride on Mar 4, 2022

Conversation

@xuzhao9 (Contributor) commented Mar 2, 2022

This PR reduces unnecessary CUDA synchronization in the Mask R-CNN model code to improve its performance.
We evaluate the impact of this PR with the vision_maskrcnn model in TorchBench.

Before the patch:

$ python run.py vision_maskrcnn -d cuda -t eval
Running eval method from vision_maskrcnn on cuda in eager mode.
GPU Time:            163.383 milliseconds
CPU Dispatch Time:   163.380 milliseconds
CPU Total Wall Time: 163.425 milliseconds
$ python run.py vision_maskrcnn -d cuda -t train
Running train method from vision_maskrcnn on cuda in eager mode.
GPU Time:            236.189 milliseconds
CPU Dispatch Time:   165.670 milliseconds
CPU Total Wall Time: 236.255 milliseconds

We observe 3 CUDA sync events in anchor_utils.py:

/data/home/xzhao9/cluster/miniconda3/envs/py38/lib/python3.8/site-packages/torchvision-0.13.0a0+71d2bb0-py3.8-linux-x86_64.egg/torchvision/models/detection/anchor_utils.py:124: UserWarning: called a synchronizing CUDA operation (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:147.)
  torch.tensor(image_size[0] // g[0], dtype=torch.int64, device=device),
/data/home/xzhao9/cluster/miniconda3/envs/py38/lib/python3.8/site-packages/torchvision-0.13.0a0+71d2bb0-py3.8-linux-x86_64.egg/torchvision/models/detection/anchor_utils.py:125: UserWarning: called a synchronizing CUDA operation (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:147.)
  torch.tensor(image_size[1] // g[1], dtype=torch.int64, device=device),
/data/home/xzhao9/cluster/miniconda3/envs/py38/lib/python3.8/site-packages/torchvision-0.13.0a0+71d2bb0-py3.8-linux-x86_64.egg/torchvision/models/detection/anchor_utils.py:79: UserWarning: called a synchronizing CUDA operation (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:147.)
  self.cell_anchors = [cell_anchor.to(dtype=dtype, device=device) for cell_anchor in self.cell_anchors]
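For reference, warnings like these can be surfaced with PyTorch's CUDA sync debug mode (a minimal sketch, assuming PyTorch >= 1.10; the sizes are made up):

import torch

# "warn" (1) emits a UserWarning whenever an op forces a host/device
# synchronization; "error" (2) raises instead, "default" (0) disables it.
torch.cuda.set_sync_debug_mode("warn")

# Building a CUDA tensor from a Python scalar goes through a synchronizing
# host-to-device copy, which is what trips the warnings above.
stride = torch.tensor(800 // 25, dtype=torch.int64, device="cuda")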

After the patch:

$ python run.py vision_maskrcnn -d cuda -t eval
Running eval method from vision_maskrcnn on cuda in eager mode.
GPU Time:            164.010 milliseconds
CPU Dispatch Time:   164.037 milliseconds
CPU Total Wall Time: 164.083 milliseconds
$ python run.py vision_maskrcnn -d cuda -t train
Running train method from vision_maskrcnn on cuda in eager mode.
GPU Time:            235.582 milliseconds
CPU Dispatch Time:   165.316 milliseconds
CPU Total Wall Time: 235.683 milliseconds

Although there is no obvious change in runtime, we now observe only one CUDA sync event in anchor_utils.py:

/data/home/xzhao9/cluster/miniconda3/envs/py38/lib/python3.8/site-packages/torchvision-0.13.0a0+7f0faf0-py3.8-linux-x86_64.egg/torchvision/models/detection/anchor_utils.py:79: UserWarning: called a synchronizing CUDA operation (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:147.)
  self.cell_anchors = [cell_anchor.to(dtype=dtype, device=device) for cell_anchor in self.cell_anchors]

Even so, we believe this is a worthwhile improvement.
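For context, the change replaces the torch.tensor(...) scalar constructions above with the empty().fill_() idiom discussed in the review below. A minimal sketch of the idea (illustrative values, not the exact diff):

import torch

device = torch.device("cuda")
image_size, g = (800, 800), (25, 25)  # made-up sizes for illustration

# Before: torch.tensor() materializes the scalar on CPU and copies it to
# the GPU, a synchronizing host-to-device transfer.
stride_h = torch.tensor(image_size[0] // g[0], dtype=torch.int64, device=device)

# After: allocate a 0-dim tensor directly on the GPU and fill it with an
# asynchronously launched kernel; no synchronization is required.
stride_h = torch.empty((), dtype=torch.int64, device=device).fill_(image_size[0] // g[0])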

@facebook-github-bot commented Mar 2, 2022

💊 CI failures summary and remediations

As of commit 582eba6 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

@xuzhao9 xuzhao9 changed the title from "[WIP] Attempt to reduce unnecessary cuda sync" to "Attempt to reduce unnecessary cuda sync" Mar 2, 2022
@xuzhao9 xuzhao9 changed the title from "Attempt to reduce unnecessary cuda sync" to "Reduce unnecessary cuda sync in anchor_utils.py" Mar 2, 2022
@xuzhao9 xuzhao9 requested a review from pmeier March 2, 2022 21:14
@pmeier (Collaborator) commented Mar 2, 2022

I'm only listed in blame due to #4384.

@pmeier pmeier requested review from datumbox and removed request for pmeier March 2, 2022 21:22
@datumbox (Contributor) left a comment

I'm OK switching to the empty().fill_() idiom given the analysis. It's something we already use in other places in TorchVision anyway:

noise = torch.empty((N, C, H - block_size + 1, W - block_size + 1), dtype=input.dtype, device=input.device)
noise.bernoulli_(gamma)
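(Same pattern as the fix: torch.empty() allocates uninitialized memory directly on the target device, and the in-place bernoulli_() fills it with an asynchronously launched kernel, so no host-to-device copy or synchronization is involved.)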

@datumbox datumbox merged commit a784db4 into pytorch:main Mar 4, 2022
@github-actions (bot) commented Mar 4, 2022

Hey @datumbox!

You merged this PR, but no labels were added. The list of valid labels is available at https://github.com/pytorch/vision/blob/main/.github/process_commit.py

@datumbox datumbox added the enhancement, module: models, and Perf (for performance improvements) labels Mar 4, 2022
@xuzhao9 xuzhao9 deleted the xz9/improve-stride branch March 4, 2022 20:34
facebook-github-bot pushed a commit that referenced this pull request Mar 15, 2022
Summary: Co-authored-by: Vasilis Vryniotis <[email protected]>

Reviewed By: vmoens

Differential Revision: D34879001

fbshipit-source-id: 5830dec79b5f80fa20b55862c84906a04898aa80