Skip to content

[prototype] Optimize and clean up all affine methods #6945

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
Nov 14, 2022

Conversation

datumbox
Copy link
Contributor

@datumbox datumbox commented Nov 11, 2022

Related to #6818

Clean ups:

  • _get_inverse_affine_matrix
  • _assert_grid_transform_inputs
  • _perspective_grid
  • _affine_bounding_box_xyxy
  • _affine_grid

Perfs:

  • _compute_affine_output_size
  • _apply_grid_transform

Affected methods (to benchmark):

  • rotate_image_tensor
  • perspective_image_tensor
  • elastic_image_tensor
  • affine_image_tensor

Benchmarks:

The PR improves all the kernels across the board. There is only one reported slowdown on elastic_image_tensor cpu torch.float32 but these 2 runs have high std. I did a few consequent runs and it looks good. I think the changes look like they give us a 10-20% speed boost. We will measure it more carefully on the placeholder ticket.

Before the PR (main branch):

[----- rotate_image_tensor cpu torch.float32 -----]
                         |  rotate_image_tensor old
1 threads: ----------------------------------------
      (16, 3, 400, 400)  |       129 (+- 20) ms    
      (3, 400, 400)      |         5 (+-  0) ms    

Times are in milliseconds (ms).

[----- perspective_image_tensor cpu torch.float32 -----]
                         |  perspective_image_tensor old
1 threads: ---------------------------------------------
      (16, 3, 400, 400)  |         123 (+- 30) ms       
      (3, 400, 400)      |           5 (+-  0) ms        

Times are in milliseconds (ms).

[----- affine_image_tensor cpu torch.float32 -----]
                         |  affine_image_tensor old
1 threads: ----------------------------------------
      (16, 3, 400, 400)  |       130 (+- 20) ms    
      (3, 400, 400)      |         5 (+-  0) ms    

Times are in milliseconds (ms).

[----- elastic_image_tensor cpu torch.float32 -----]
                         |  elastic_image_tensor old
1 threads: -----------------------------------------
      (16, 3, 400, 400)  |        84 (+- 22) ms     
      (3, 400, 400)      |         5 (+-  0) ms     

Times are in milliseconds (ms).

[----- rotate_image_tensor cuda torch.float32 ----]
                         |  rotate_image_tensor old
1 threads: ----------------------------------------
      (16, 3, 400, 400)  |       820 (+- 45) us    
      (3, 400, 400)      |       464 (+-  1) us    

Times are in microseconds (us).

[---- perspective_image_tensor cuda torch.float32 -----]
                         |  perspective_image_tensor old
1 threads: ---------------------------------------------
      (16, 3, 400, 400)  |         711 (+-  1) us       
      (3, 400, 400)      |         387 (+- 80) us       

Times are in microseconds (us).

[----- affine_image_tensor cuda torch.float32 ----]
                         |  affine_image_tensor old
1 threads: ----------------------------------------
      (16, 3, 400, 400)  |       615 (+-  1) us    
      (3, 400, 400)      |       576 (+- 40) us    

Times are in microseconds (us).

[---- elastic_image_tensor cuda torch.float32 -----]
                         |  elastic_image_tensor old
1 threads: -----------------------------------------
      (16, 3, 400, 400)  |       808 (+-  6) us     
      (3, 400, 400)      |       750 (+- 25) us      

Times are in microseconds (us).

[------ rotate_image_tensor cpu torch.uint8 ------]
                         |  rotate_image_tensor old
1 threads: ----------------------------------------
      (16, 3, 400, 400)  |       270 (+- 14) ms    
      (3, 400, 400)      |        13 (+-  1) ms    

Times are in milliseconds (ms).

[------ perspective_image_tensor cpu torch.uint8 ------]
                         |  perspective_image_tensor old
1 threads: ---------------------------------------------
      (16, 3, 400, 400)  |         176 (+-  8) ms       
      (3, 400, 400)      |          11 (+-  0) ms       

Times are in milliseconds (ms).

[------ affine_image_tensor cpu torch.uint8 ------]
                         |  affine_image_tensor old
1 threads: ----------------------------------------
      (16, 3, 400, 400)  |       184 (+-  4) ms    
      (3, 400, 400)      |        11 (+-  0) ms    

Times are in milliseconds (ms).

[------ elastic_image_tensor cpu torch.uint8 ------]
                         |  elastic_image_tensor old
1 threads: -----------------------------------------
      (16, 3, 400, 400)  |       182 (+- 18) ms     
      (3, 400, 400)      |        11 (+-  1) ms     

Times are in milliseconds (ms).

[------ rotate_image_tensor cuda torch.uint8 -----]
                         |  rotate_image_tensor old
1 threads: ----------------------------------------
      (16, 3, 400, 400)  |      1039 (+- 31) us    
      (3, 400, 400)      |       782 (+- 60) us    

Times are in microseconds (us).

[----- perspective_image_tensor cuda torch.uint8 ------]
                         |  perspective_image_tensor old
1 threads: ---------------------------------------------
      (16, 3, 400, 400)  |         988 (+- 60) us       
      (3, 400, 400)      |         655 (+- 50) us       

Times are in microseconds (us).

[------ affine_image_tensor cuda torch.uint8 -----]
                         |  affine_image_tensor old
1 threads: ----------------------------------------
      (16, 3, 400, 400)  |       867 (+- 40) us    
      (3, 400, 400)      |       606 (+- 21) us    

Times are in microseconds (us).

[----- elastic_image_tensor cuda torch.uint8 ------]
                         |  elastic_image_tensor old
1 threads: -----------------------------------------
      (16, 3, 400, 400)  |      1018 (+- 20) us     
      (3, 400, 400)      |       824 (+- 31) us     

Times are in microseconds (us).

This PR:

[----- rotate_image_tensor cpu torch.float32 -----]
                         |  rotate_image_tensor new
1 threads: ----------------------------------------
      (16, 3, 400, 400)  |       128 (+-  4) ms    
      (3, 400, 400)      |         5 (+-  0) ms    

Times are in milliseconds (ms).

[----- perspective_image_tensor cpu torch.float32 -----]
                         |  perspective_image_tensor new
1 threads: ---------------------------------------------
      (16, 3, 400, 400)  |         106 (+-  2) ms       
      (3, 400, 400)      |           5 (+-  0) ms       

Times are in milliseconds (ms).

[----- affine_image_tensor cpu torch.float32 -----]
                         |  affine_image_tensor new
1 threads: ----------------------------------------
      (16, 3, 400, 400)  |       113 (+-  2) ms    
      (3, 400, 400)      |         4 (+-  0) ms    

Times are in milliseconds (ms).

[----- elastic_image_tensor cpu torch.float32 -----]
                         |  elastic_image_tensor new
1 threads: -----------------------------------------
      (16, 3, 400, 400)  |       107 (+-  1) ms     
      (3, 400, 400)      |         4 (+-  0) ms    

Times are in milliseconds (ms).

[----- rotate_image_tensor cuda torch.float32 ----]
                         |  rotate_image_tensor new
1 threads: ----------------------------------------
      (16, 3, 400, 400)  |       804 (+- 45) us    
      (3, 400, 400)      |       432 (+-  1) us     

Times are in microseconds (us).

[---- perspective_image_tensor cuda torch.float32 -----]
                         |  perspective_image_tensor new
1 threads: ---------------------------------------------
      (16, 3, 400, 400)  |         631 (+-  1) us       
      (3, 400, 400)      |         334 (+-  2) us       

Times are in microseconds (us).

[----- affine_image_tensor cuda torch.float32 ----]
                         |  affine_image_tensor new
1 threads: ----------------------------------------
      (16, 3, 400, 400)  |       531 (+-  1) us    
      (3, 400, 400)      |       295 (+-  1) us     

Times are in microseconds (us).

[---- elastic_image_tensor cuda torch.float32 -----]
                         |  elastic_image_tensor new
1 threads: -----------------------------------------
      (16, 3, 400, 400)  |       785 (+-  6) us     
      (3, 400, 400)      |       555 (+-  1) us    

Times are in microseconds (us).

[------ rotate_image_tensor cpu torch.uint8 ------]
                         |  rotate_image_tensor new
1 threads: ----------------------------------------
      (16, 3, 400, 400)  |       140 (+-  1) ms    
      (3, 400, 400)      |         6 (+-  0) ms    

Times are in milliseconds (ms).

[------ perspective_image_tensor cpu torch.uint8 ------]
                         |  perspective_image_tensor new
1 threads: ---------------------------------------------
      (16, 3, 400, 400)  |         118 (+-  1) ms       
      (3, 400, 400)      |           5 (+-  0) ms       

Times are in milliseconds (ms).

[------ affine_image_tensor cpu torch.uint8 ------]
                         |  affine_image_tensor new
1 threads: ----------------------------------------
      (16, 3, 400, 400)  |       124 (+-  1) ms    
      (3, 400, 400)      |         5 (+-  0) ms    

Times are in milliseconds (ms).

[------ elastic_image_tensor cpu torch.uint8 ------]
                         |  elastic_image_tensor new
1 threads: -----------------------------------------
      (16, 3, 400, 400)  |       118 (+-  1) ms     
      (3, 400, 400)      |         5 (+-  0) ms     

Times are in milliseconds (ms).

[------ rotate_image_tensor cuda torch.uint8 -----]
                         |  rotate_image_tensor new
1 threads: ----------------------------------------
      (16, 3, 400, 400)  |       877 (+- 13) us    
      (3, 400, 400)      |       467 (+-  1) us    

Times are in microseconds (us).

[----- perspective_image_tensor cuda torch.uint8 ------]
                         |  perspective_image_tensor new
1 threads: ---------------------------------------------
      (16, 3, 400, 400)  |         751 (+-  5) us       
      (3, 400, 400)      |         365 (+-  2) us       

Times are in microseconds (us).

[------ affine_image_tensor cuda torch.uint8 -----]
                         |  affine_image_tensor new
1 threads: ----------------------------------------
      (16, 3, 400, 400)  |       654 (+-  1) us    
      (3, 400, 400)      |       325 (+-  1) us    

Times are in microseconds (us).

[----- elastic_image_tensor cuda torch.uint8 ------]
                         |  elastic_image_tensor new
1 threads: -----------------------------------------
      (16, 3, 400, 400)  |       821 (+-  4) us     
      (3, 400, 400)      |       584 (+-  1) us     

Times are in microseconds (us).

cc @vfdev-5 @bjuncek @pmeier

@datumbox
Copy link
Contributor Author

There are some related failures. I'll check them on Monday.

@@ -915,7 +915,7 @@ def sample_inputs_rotate_video():
reference_inputs_fn=reference_inputs_rotate_image_tensor,
float32_vs_uint8=True,
# TODO: investigate
closeness_kwargs=pil_reference_pixel_difference(100, agg_method="mean"),
closeness_kwargs=pil_reference_pixel_difference(110, agg_method="mean"),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flaky test unrelated to this PR that popped up previously on another PR:

Unrelated flakiness:
FAILED test/test_prototype_transforms_functional.py::TestKernels::test_against_reference[rotate_image_tensor-38] - AssertionError: The 'mean' of the absolute difference is 104.21571906354515, but only 100.0 is allowed.

Copy link
Contributor Author

@datumbox datumbox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are some high-level notes on the changes. After this PR we would very minimally use _FT in our code-base. Some changes can be reverted or moved on FT stable if we want to minimize copy-pasted code.

I will follow up with benchmarks for the affected methods.

raise ValueError(f"Interpolation mode '{interpolation}' is unsupported with Tensor input")


def _affine_grid(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is modified minimally. Only a rename (to match the naming of other methods such as _perspective_grid) and 1 tiny in-place that won't matter. Can be reverted.

Comment on lines +275 to +277
def _get_inverse_affine_matrix(
center: List[float], angle: float, translate: List[float], scale: float, shear: List[float], inverted: bool = True
) -> List[float]:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we do some caching of intermediate results and minor refactoring (especially with the negative values) to make the code a bit more readable IMO. Can be reverted.

return matrix


def _compute_affine_output_size(matrix: List[float], w: int, h: int) -> Tuple[int, int]:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aminmax call + a bunch of in-place ops to speed things up.

return int(size[0]), int(size[1]) # w, h


def _apply_grid_transform(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do in-place ops where possible, plus a bit of refactoring. Important note is that the input image must be float. This is because the handling/casting can be done more efficiently outside of the method.

return float_img


def _assert_grid_transform_inputs(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor clean ups to make the if statements clearer.

Comment on lines +482 to +489
if ndim > 4:
image = image.reshape((-1,) + shape[-3:])
needs_unsquash = True
elif ndim == 3:
image = image.unsqueeze(0)
needs_unsquash = True
else:
needs_unsquash = False
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adopting the idiom from other places to handle squashing. I do the same in all methods where possible below.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the special casing for ndim == 3 here? Is there actually a significant perf difference between reshape and unsqueeze in this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a perf. This is because _apply_grid_transform assumes batched input.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed offline, this is copied from something so not a new idiom that will be introduced here. I'll look into it in a follow-up.

Comment on lines +504 to +507
dtype = image.dtype if fp else torch.float32
theta = torch.tensor(matrix, dtype=dtype, device=image.device).reshape(1, 2, 3)
grid = _affine_grid(theta, w=width, h=height, ow=width, oh=height)
output = _apply_grid_transform(image if fp else image.to(dtype), grid, interpolation.value, fill=fill)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Porting the code over from _FT.affine to be able to use the optimized private methods defined above. I do the same to all other kernels where possible below.

Comment on lines +773 to +774
output = image
new_width, new_height = _compute_affine_output_size(matrix, width, height) if expand else (width, height)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I don't adopt the squash idiom from other places because it would lead to more complex logic. I have concerns about whether the implementation followed here (maintained from main) actually handles properly all corner-cases (ill formed images with 0 elements), similar to the issue observed with pad.

@vfdev-5 Might be worth talking a look on your side to see if a mitigation is necessary similar to #6949 (aka sending the image through the kernel normally).

@datumbox datumbox marked this pull request as ready for review November 14, 2022 14:48
@datumbox datumbox changed the title [WIP] [prototype] Optimize and clean up all affine methods [prototype] Optimize and clean up all affine methods Nov 14, 2022
@datumbox datumbox requested review from vfdev-5 and pmeier November 14, 2022 14:48
Comment on lines +418 to +419
if coeffs is not None and len(coeffs) != 8:
raise ValueError("Argument coeffs should have 8 float values")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is coeffs in an affine transformation? That seems out of place here.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the perspective coeff. I'm not writing new code, I'm porting the existing methods here. I don't think changes of changing the validation should be in scope on this PR because it will get really fast really quickly.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me send a follow-up PR then.

Comment on lines +482 to +489
if ndim > 4:
image = image.reshape((-1,) + shape[-3:])
needs_unsquash = True
elif ndim == 3:
image = image.unsqueeze(0)
needs_unsquash = True
else:
needs_unsquash = False
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the special casing for ndim == 3 here? Is there actually a significant perf difference between reshape and unsqueeze in this case?

Copy link
Collaborator

@pmeier pmeier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless benchmarks show a perf regression somewhere, LGTM if CI is green!

@datumbox
Copy link
Contributor Author

@pmeier We observe a speed improvement. The flaky test at prototype is a known issue on the bboxes and Victor is looking at it. Merging.

@datumbox datumbox merged commit b1f6c9e into pytorch:main Nov 14, 2022
@datumbox datumbox deleted the perf/affine branch November 14, 2022 19:36
facebook-github-bot pushed a commit that referenced this pull request Nov 19, 2022
Summary:
* Clean up `_get_inverse_affine_matrix` and `_compute_affine_output_size`

* Optimize `_apply_grid_transform`

* Cleanup `_assert_grid_transform_inputs`

* Fix bugs on `_pad_with_scalar_fill` & `crop_mask` and port `crop_image_tensor`

* Call directly `_pad_with_scalar_fill`

* Fix linter

* Clean up `center_crop_image_tensor`

* Fix comments.

* Fixing rounding issues.

* Bumping tolerance for rotate which is unrelated to this PR.

* Fix tolerance threshold for RandomPerspective.

* Clean up `_affine_grid` and `affine_image_tensor`

* Clean up `rotate_image_tensor`

* Fixing linter

* Address code-review comments.

Reviewed By: YosuaMichael

Differential Revision: D41376279

fbshipit-source-id: bc21d68d6374b39d78ae74ba01e49633b2d8d1e2
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants