[prototype] Optimize and clean up all affine methods #6945

datumbox · 2022-11-11T18:22:12Z

Related to #6818

Clean ups:

Perfs:

_compute_affine_output_size
_apply_grid_transform

Affected methods (to benchmark):

rotate_image_tensor
perspective_image_tensor
elastic_image_tensor
affine_image_tensor

Benchmarks:

The PR improves all the kernels across the board. The only reported slowdown is on elastic_image_tensor cpu torch.float32, but those 2 runs have high std. I did a few consecutive runs and it looks good. I estimate the changes give us a 10-20% speed boost. We will measure it more carefully on the placeholder ticket.

Before the PR (main branch):

[----- rotate_image_tensor cpu torch.float32 -----]
                         |  rotate_image_tensor old
1 threads: ----------------------------------------
      (16, 3, 400, 400)  |       129 (+- 20) ms    
      (3, 400, 400)      |         5 (+-  0) ms    

Times are in milliseconds (ms).

[----- perspective_image_tensor cpu torch.float32 -----]
                         |  perspective_image_tensor old
1 threads: ---------------------------------------------
      (16, 3, 400, 400)  |         123 (+- 30) ms       
      (3, 400, 400)      |           5 (+-  0) ms        

Times are in milliseconds (ms).

[----- affine_image_tensor cpu torch.float32 -----]
                         |  affine_image_tensor old
1 threads: ----------------------------------------
      (16, 3, 400, 400)  |       130 (+- 20) ms    
      (3, 400, 400)      |         5 (+-  0) ms    

Times are in milliseconds (ms).

[----- elastic_image_tensor cpu torch.float32 -----]
                         |  elastic_image_tensor old
1 threads: -----------------------------------------
      (16, 3, 400, 400)  |        84 (+- 22) ms     
      (3, 400, 400)      |         5 (+-  0) ms     

Times are in milliseconds (ms).

[----- rotate_image_tensor cuda torch.float32 ----]
                         |  rotate_image_tensor old
1 threads: ----------------------------------------
      (16, 3, 400, 400)  |       820 (+- 45) us    
      (3, 400, 400)      |       464 (+-  1) us    

Times are in microseconds (us).

[---- perspective_image_tensor cuda torch.float32 -----]
                         |  perspective_image_tensor old
1 threads: ---------------------------------------------
      (16, 3, 400, 400)  |         711 (+-  1) us       
      (3, 400, 400)      |         387 (+- 80) us       

Times are in microseconds (us).

[----- affine_image_tensor cuda torch.float32 ----]
                         |  affine_image_tensor old
1 threads: ----------------------------------------
      (16, 3, 400, 400)  |       615 (+-  1) us    
      (3, 400, 400)      |       576 (+- 40) us    

Times are in microseconds (us).

[---- elastic_image_tensor cuda torch.float32 -----]
                         |  elastic_image_tensor old
1 threads: -----------------------------------------
      (16, 3, 400, 400)  |       808 (+-  6) us     
      (3, 400, 400)      |       750 (+- 25) us      

Times are in microseconds (us).

[------ rotate_image_tensor cpu torch.uint8 ------]
                         |  rotate_image_tensor old
1 threads: ----------------------------------------
      (16, 3, 400, 400)  |       270 (+- 14) ms    
      (3, 400, 400)      |        13 (+-  1) ms    

Times are in milliseconds (ms).

[------ perspective_image_tensor cpu torch.uint8 ------]
                         |  perspective_image_tensor old
1 threads: ---------------------------------------------
      (16, 3, 400, 400)  |         176 (+-  8) ms       
      (3, 400, 400)      |          11 (+-  0) ms       

Times are in milliseconds (ms).

[------ affine_image_tensor cpu torch.uint8 ------]
                         |  affine_image_tensor old
1 threads: ----------------------------------------
      (16, 3, 400, 400)  |       184 (+-  4) ms    
      (3, 400, 400)      |        11 (+-  0) ms    

Times are in milliseconds (ms).

[------ elastic_image_tensor cpu torch.uint8 ------]
                         |  elastic_image_tensor old
1 threads: -----------------------------------------
      (16, 3, 400, 400)  |       182 (+- 18) ms     
      (3, 400, 400)      |        11 (+-  1) ms     

Times are in milliseconds (ms).

[------ rotate_image_tensor cuda torch.uint8 -----]
                         |  rotate_image_tensor old
1 threads: ----------------------------------------
      (16, 3, 400, 400)  |      1039 (+- 31) us    
      (3, 400, 400)      |       782 (+- 60) us    

Times are in microseconds (us).

[----- perspective_image_tensor cuda torch.uint8 ------]
                         |  perspective_image_tensor old
1 threads: ---------------------------------------------
      (16, 3, 400, 400)  |         988 (+- 60) us       
      (3, 400, 400)      |         655 (+- 50) us       

Times are in microseconds (us).

[------ affine_image_tensor cuda torch.uint8 -----]
                         |  affine_image_tensor old
1 threads: ----------------------------------------
      (16, 3, 400, 400)  |       867 (+- 40) us    
      (3, 400, 400)      |       606 (+- 21) us    

Times are in microseconds (us).

[----- elastic_image_tensor cuda torch.uint8 ------]
                         |  elastic_image_tensor old
1 threads: -----------------------------------------
      (16, 3, 400, 400)  |      1018 (+- 20) us     
      (3, 400, 400)      |       824 (+- 31) us     

Times are in microseconds (us).

This PR:

[----- rotate_image_tensor cpu torch.float32 -----]
                         |  rotate_image_tensor new
1 threads: ----------------------------------------
      (16, 3, 400, 400)  |       128 (+-  4) ms    
      (3, 400, 400)      |         5 (+-  0) ms    

Times are in milliseconds (ms).

[----- perspective_image_tensor cpu torch.float32 -----]
                         |  perspective_image_tensor new
1 threads: ---------------------------------------------
      (16, 3, 400, 400)  |         106 (+-  2) ms       
      (3, 400, 400)      |           5 (+-  0) ms       

Times are in milliseconds (ms).

[----- affine_image_tensor cpu torch.float32 -----]
                         |  affine_image_tensor new
1 threads: ----------------------------------------
      (16, 3, 400, 400)  |       113 (+-  2) ms    
      (3, 400, 400)      |         4 (+-  0) ms    

Times are in milliseconds (ms).

[----- elastic_image_tensor cpu torch.float32 -----]
                         |  elastic_image_tensor new
1 threads: -----------------------------------------
      (16, 3, 400, 400)  |       107 (+-  1) ms     
      (3, 400, 400)      |         4 (+-  0) ms    

Times are in milliseconds (ms).

[----- rotate_image_tensor cuda torch.float32 ----]
                         |  rotate_image_tensor new
1 threads: ----------------------------------------
      (16, 3, 400, 400)  |       804 (+- 45) us    
      (3, 400, 400)      |       432 (+-  1) us     

Times are in microseconds (us).

[---- perspective_image_tensor cuda torch.float32 -----]
                         |  perspective_image_tensor new
1 threads: ---------------------------------------------
      (16, 3, 400, 400)  |         631 (+-  1) us       
      (3, 400, 400)      |         334 (+-  2) us       

Times are in microseconds (us).

[----- affine_image_tensor cuda torch.float32 ----]
                         |  affine_image_tensor new
1 threads: ----------------------------------------
      (16, 3, 400, 400)  |       531 (+-  1) us    
      (3, 400, 400)      |       295 (+-  1) us     

Times are in microseconds (us).

[---- elastic_image_tensor cuda torch.float32 -----]
                         |  elastic_image_tensor new
1 threads: -----------------------------------------
      (16, 3, 400, 400)  |       785 (+-  6) us     
      (3, 400, 400)      |       555 (+-  1) us    

Times are in microseconds (us).

[------ rotate_image_tensor cpu torch.uint8 ------]
                         |  rotate_image_tensor new
1 threads: ----------------------------------------
      (16, 3, 400, 400)  |       140 (+-  1) ms    
      (3, 400, 400)      |         6 (+-  0) ms    

Times are in milliseconds (ms).

[------ perspective_image_tensor cpu torch.uint8 ------]
                         |  perspective_image_tensor new
1 threads: ---------------------------------------------
      (16, 3, 400, 400)  |         118 (+-  1) ms       
      (3, 400, 400)      |           5 (+-  0) ms       

Times are in milliseconds (ms).

[------ affine_image_tensor cpu torch.uint8 ------]
                         |  affine_image_tensor new
1 threads: ----------------------------------------
      (16, 3, 400, 400)  |       124 (+-  1) ms    
      (3, 400, 400)      |         5 (+-  0) ms    

Times are in milliseconds (ms).

[------ elastic_image_tensor cpu torch.uint8 ------]
                         |  elastic_image_tensor new
1 threads: -----------------------------------------
      (16, 3, 400, 400)  |       118 (+-  1) ms     
      (3, 400, 400)      |         5 (+-  0) ms     

Times are in milliseconds (ms).

[------ rotate_image_tensor cuda torch.uint8 -----]
                         |  rotate_image_tensor new
1 threads: ----------------------------------------
      (16, 3, 400, 400)  |       877 (+- 13) us    
      (3, 400, 400)      |       467 (+-  1) us    

Times are in microseconds (us).

[----- perspective_image_tensor cuda torch.uint8 ------]
                         |  perspective_image_tensor new
1 threads: ---------------------------------------------
      (16, 3, 400, 400)  |         751 (+-  5) us       
      (3, 400, 400)      |         365 (+-  2) us       

Times are in microseconds (us).

[------ affine_image_tensor cuda torch.uint8 -----]
                         |  affine_image_tensor new
1 threads: ----------------------------------------
      (16, 3, 400, 400)  |       654 (+-  1) us    
      (3, 400, 400)      |       325 (+-  1) us    

Times are in microseconds (us).

[----- elastic_image_tensor cuda torch.uint8 ------]
                         |  elastic_image_tensor new
1 threads: -----------------------------------------
      (16, 3, 400, 400)  |       821 (+-  4) us     
      (3, 400, 400)      |       584 (+-  1) us     

Times are in microseconds (us).
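
To put rough numbers on the tables above, the relative change per kernel can be computed from the reported timings. The sketch below hard-codes the batched (16, 3, 400, 400) CPU timings copied from the two runs. The helper name `relative_change` is only for illustration, and the std values are ignored, so treat the result as indicative rather than as a careful measurement.

```python
# Batched (16, 3, 400, 400) CPU timings in ms, copied from the tables above.
# Mapping: (kernel, dtype) -> (old, new)
timings_ms = {
    ("rotate_image_tensor", "float32"): (129, 128),
    ("perspective_image_tensor", "float32"): (123, 106),
    ("affine_image_tensor", "float32"): (130, 113),
    ("elastic_image_tensor", "float32"): (84, 107),
    ("rotate_image_tensor", "uint8"): (270, 140),
    ("perspective_image_tensor", "uint8"): (176, 118),
    ("affine_image_tensor", "uint8"): (184, 124),
    ("elastic_image_tensor", "uint8"): (182, 118),
}


def relative_change(old: float, new: float) -> float:
    """Return the time saved as a fraction of the old time (negative = slowdown)."""
    return (old - new) / old


changes = {key: relative_change(old, new) for key, (old, new) in timings_ms.items()}

for (kernel, dtype), change in changes.items():
    print(f"{kernel:<26} {dtype:<8} {change:+.1%}")
```

With these inputs, elastic_image_tensor on float32 is the only entry that comes out negative, and the uint8 CPU kernels each save roughly a third of the old time or more.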

cc @vfdev-5 @bjuncek @pmeier

torchvision/prototype/transforms/functional/_geometry.py

datumbox · 2022-11-11T19:16:55Z

There are some related failures. I'll check them on Monday.

datumbox · 2022-11-14T12:57:31Z

test/prototype_transforms_kernel_infos.py

@@ -915,7 +915,7 @@ def sample_inputs_rotate_video():
            reference_inputs_fn=reference_inputs_rotate_image_tensor,
            float32_vs_uint8=True,
            # TODO: investigate
-            closeness_kwargs=pil_reference_pixel_difference(100, agg_method="mean"),
+            closeness_kwargs=pil_reference_pixel_difference(110, agg_method="mean"),


Flaky test unrelated to this PR that popped up previously on another PR:

Unrelated flakiness: FAILED test/test_prototype_transforms_functional.py::TestKernels::test_against_reference[rotate_image_tensor-38] - AssertionError: The 'mean' of the absolute difference is 104.21571906354515, but only 100.0 is allowed.
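
For readers unfamiliar with this kind of tolerance, the failure message corresponds to a check on the mean of the absolute pixel difference. Below is a minimal sketch of such a check in plain Python. The function name `assert_mean_abs_diff` and its signature are hypothetical and are not the test-suite helper used in this PR.

```python
from typing import Sequence


def assert_mean_abs_diff(actual: Sequence[float], expected: Sequence[float], atol: float) -> None:
    """Raise AssertionError if the mean absolute difference exceeds atol."""
    if len(actual) != len(expected):
        raise ValueError("inputs must have the same number of elements")
    mean_diff = sum(abs(a - e) for a, e in zip(actual, expected)) / len(actual)
    if mean_diff > atol:
        raise AssertionError(
            f"The 'mean' of the absolute difference is {mean_diff}, but only {atol} is allowed."
        )
```

Under this sketch, a mean difference slightly above 100 fails with a threshold of 100 and passes with a threshold of 110.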

datumbox

Here are some high-level notes on the changes. After this PR we will use _FT only minimally in our codebase. Some changes can be reverted or moved to the stable _FT if we want to minimize copy-pasted code.

I will follow up with benchmarks for the affected methods.

datumbox · 2022-11-14T14:29:57Z

torchvision/prototype/transforms/functional/_geometry.py

+        raise ValueError(f"Interpolation mode '{interpolation}' is unsupported with Tensor input")
+
+
+def _affine_grid(


This one is modified minimally. Only a rename (to match the naming of other methods such as _perspective_grid) and 1 tiny in-place op that won't matter. Can be reverted.

datumbox · 2022-11-14T14:30:52Z

torchvision/prototype/transforms/functional/_geometry.py

+def _get_inverse_affine_matrix(
+    center: List[float], angle: float, translate: List[float], scale: float, shear: List[float], inverted: bool = True
+) -> List[float]:


Here we do some caching of intermediate results and minor refactoring (especially with the negative values) to make the code a bit more readable IMO. Can be reverted.
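
As an illustration of the kind of caching described here, the sketch below computes the inverse affine matrix in plain Python, evaluating each trigonometric term once and reusing it. It follows the rotation-shear-scale formulation of torchvision's affine helpers as I understand it, but it is not the code from this PR. The function name `inverse_affine_matrix_sketch` is hypothetical, and the `inverted` flag from the real signature is left out.

```python
import math
from typing import List


def inverse_affine_matrix_sketch(
    center: List[float], angle: float, translate: List[float], scale: float, shear: List[float]
) -> List[float]:
    """Return the first two rows [a, b, c, d, e, f] of the inverse affine matrix."""
    rot = math.radians(angle)
    sx = math.radians(shear[0])
    sy = math.radians(shear[1])
    cx, cy = center
    tx, ty = translate

    # Cache the intermediate terms instead of re-evaluating them per entry.
    cos_sy = math.cos(sy)
    tan_sx = math.tan(sx)
    cos_rs = math.cos(rot - sy)
    sin_rs = math.sin(rot - sy)

    # Rotation and shear without scaling.
    a = cos_rs / cos_sy
    b = -(a * tan_sx + math.sin(rot))
    c = sin_rs / cos_sy
    d = math.cos(rot) - c * tan_sx

    # Invert the 2x2 block (its determinant is 1), then undo the scaling.
    matrix = [d / scale, -b / scale, 0.0, -c / scale, a / scale, 0.0]

    # Fold in the translation and the shift to and from the center.
    matrix[2] = matrix[0] * (-cx - tx) + matrix[1] * (-cy - ty) + cx
    matrix[5] = matrix[3] * (-cx - tx) + matrix[4] * (-cy - ty) + cy
    return matrix
```

A quick sanity check is that a rotation about a center leaves that center fixed: for `center=[1.0, 1.0]` and `angle=90.0`, applying the returned matrix to the point (1, 1) gives (1, 1) again, up to floating-point noise.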

datumbox · 2022-11-14T14:31:15Z

torchvision/prototype/transforms/functional/_geometry.py

+    return matrix
+
+
+def _compute_affine_output_size(matrix: List[float], w: int, h: int) -> Tuple[int, int]:


aminmax call + a bunch of in-place ops to speed things up.
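
For context, the output size of an expanded affine transform can be derived by mapping the four image corners through the matrix and taking the extent of the result. The plain-Python sketch below shows that idea with a single pass that tracks the minimum and maximum together, which is the role `aminmax` plays on tensors. It is an illustration under my own assumptions: the function name is hypothetical, tensors are replaced by floats, and the rounding to 4 decimals is my stand-in for whatever precision handling the real method uses.

```python
import math
from typing import List, Tuple


def affine_output_size_sketch(matrix: List[float], w: int, h: int) -> Tuple[int, int]:
    """Return (width, height) of the bounding box of the transformed image corners."""
    a, b, c, d, e, f = matrix
    half_w, half_h = 0.5 * w, 0.5 * h
    corners = [(-half_w, -half_h), (-half_w, half_h), (half_w, half_h), (half_w, -half_h)]

    # Track min and max of both coordinates in one pass over the corners.
    min_x = min_y = math.inf
    max_x = max_y = -math.inf
    for x, y in corners:
        new_x = a * x + b * y + c
        new_y = d * x + e * y + f
        min_x, max_x = min(min_x, new_x), max(max_x, new_x)
        min_y, max_y = min(min_y, new_y), max(max_y, new_y)

    # Round first so that floating-point noise (e.g. 150.00000000000001)
    # is not pushed up to the next integer by ceil.
    new_w = math.ceil(round(max_x, 4)) - math.floor(round(min_x, 4))
    new_h = math.ceil(round(max_y, 4)) - math.floor(round(min_y, 4))
    return new_w, new_h  # w, h
```

For example, a 400x300 image keeps its size under the identity matrix and swaps width and height under a 90-degree rotation.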

datumbox · 2022-11-14T14:31:54Z

torchvision/prototype/transforms/functional/_geometry.py

+    return int(size[0]), int(size[1])  # w, h
+
+
+def _apply_grid_transform(


We do in-place ops where possible, plus a bit of refactoring. An important note: the input image must be float, because the handling/casting can be done more efficiently outside of the method.

datumbox · 2022-11-14T14:32:19Z

torchvision/prototype/transforms/functional/_geometry.py

+    return float_img
+
+
+def _assert_grid_transform_inputs(


Minor clean ups to make the if statements clearer.

datumbox · 2022-11-14T14:41:01Z

torchvision/prototype/transforms/functional/_geometry.py

+    if ndim > 4:
+        image = image.reshape((-1,) + shape[-3:])
+        needs_unsquash = True
+    elif ndim == 3:
+        image = image.unsqueeze(0)
+        needs_unsquash = True
+    else:
+        needs_unsquash = False


Adopting the idiom from other places to handle squashing. I do the same in all methods where possible below.

Do we need the special casing for ndim == 3 here? Is there actually a significant perf difference between reshape and unsqueeze in this case?

This is not about perf. It is because _apply_grid_transform assumes batched input.

As discussed offline, this is copied from elsewhere, so it is not a new idiom being introduced here. I'll look into it in a follow-up.
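
The shape bookkeeping behind this idiom can be sketched without tensors. The helper names below are hypothetical. They only mirror how the shape changes under `reshape((-1,) + shape[-3:])` and `unsqueeze(0)`, so that a kernel expecting a 4D batched input can also serve 3D inputs and inputs with extra leading dims.

```python
import math
from typing import Tuple

Shape = Tuple[int, ...]


def squashed_shape(shape: Shape) -> Tuple[Shape, bool]:
    """Return the 4D shape a kernel would see and whether it must be restored."""
    ndim = len(shape)
    if ndim > 4:
        # Collapse all leading dims into one batch dim, like reshape((-1,) + shape[-3:]).
        batch = math.prod(shape[:-3])
        return (batch,) + shape[-3:], True
    elif ndim == 3:
        # Add a batch dim of size 1, like unsqueeze(0).
        return (1,) + shape, True
    else:
        return shape, False


def unsquashed_shape(original: Shape, output: Shape) -> Shape:
    """Restore the leading dims of the original shape around the output's last 3 dims."""
    return original[:-3] + output[-3:]
```

For instance, an input of shape (2, 5, 3, 400, 400) is seen by the kernel as (10, 3, 400, 400), and an output of shape (10, 3, 200, 200) is restored to (2, 5, 3, 200, 200).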

datumbox · 2022-11-14T14:41:25Z

torchvision/prototype/transforms/functional/_geometry.py

+    dtype = image.dtype if fp else torch.float32
+    theta = torch.tensor(matrix, dtype=dtype, device=image.device).reshape(1, 2, 3)
+    grid = _affine_grid(theta, w=width, h=height, ow=width, oh=height)
+    output = _apply_grid_transform(image if fp else image.to(dtype), grid, interpolation.value, fill=fill)


Porting the code over from _FT.affine to be able to use the optimized private methods defined above. I do the same to all other kernels where possible below.

datumbox · 2022-11-14T14:44:08Z

torchvision/prototype/transforms/functional/_geometry.py

+        output = image
+        new_width, new_height = _compute_affine_output_size(matrix, width, height) if expand else (width, height)


Here I don't adopt the squash idiom from other places because it would lead to more complex logic. I have concerns about whether the implementation followed here (maintained from main) properly handles all corner cases (ill-formed images with 0 elements), similar to the issue observed with pad.

@vfdev-5 It might be worth taking a look on your side to see if a mitigation similar to #6949 is necessary (i.e. sending the image through the kernel normally).

test/test_prototype_transforms_consistency.py

torchvision/prototype/transforms/functional/_geometry.py

pmeier · 2022-11-14T14:58:55Z

torchvision/prototype/transforms/functional/_geometry.py

+    if coeffs is not None and len(coeffs) != 8:
+        raise ValueError("Argument coeffs should have 8 float values")


What is coeffs in an affine transformation? That seems out of place here.

See #6945 (comment).

It's the perspective coeffs. I'm not writing new code, I'm porting the existing methods here. I don't think changing the validation should be in scope for this PR, because the scope would grow really quickly.

Let me send a follow-up PR then.

torchvision/prototype/transforms/functional/_geometry.py

pmeier · 2022-11-14T15:07:08Z

torchvision/prototype/transforms/functional/_geometry.py

+    if ndim > 4:
+        image = image.reshape((-1,) + shape[-3:])
+        needs_unsquash = True
+    elif ndim == 3:
+        image = image.unsqueeze(0)
+        needs_unsquash = True
+    else:
+        needs_unsquash = False


Do we need the special casing for ndim == 3 here? Is there actually a significant perf difference between reshape and unsqueeze in this case?

pmeier

Unless benchmarks show a perf regression somewhere, LGTM if CI is green!

datumbox · 2022-11-14T19:30:27Z

@pmeier We observe a speed improvement. The flaky prototype test is a known issue with the bboxes and Victor is looking at it. Merging.

Summary:
* Clean up `_get_inverse_affine_matrix` and `_compute_affine_output_size`
* Optimize `_apply_grid_transform`
* Cleanup `_assert_grid_transform_inputs`
* Fix bugs on `_pad_with_scalar_fill` & `crop_mask` and port `crop_image_tensor`
* Call directly `_pad_with_scalar_fill`
* Fix linter
* Clean up `center_crop_image_tensor`
* Fix comments.
* Fixing rounding issues.
* Bumping tolerance for rotate which is unrelated to this PR.
* Fix tolerance threshold for RandomPerspective.
* Clean up `_affine_grid` and `affine_image_tensor`
* Clean up `rotate_image_tensor`
* Fixing linter
* Address code-review comments.

Reviewed By: YosuaMichael

Differential Revision: D41376279

fbshipit-source-id: bc21d68d6374b39d78ae74ba01e49633b2d8d1e2

datumbox added 4 commits November 11, 2022 13:11

Clean up _get_inverse_affine_matrix and _compute_affine_output_size

3ac06ad

Optimize _apply_grid_transform

62deb43

Cleanup _assert_grid_transform_inputs

2f0d763

Fix bugs on _pad_with_scalar_fill & crop_mask and port `crop_image_tensor`

dca1923

datumbox added the module: transforms, Perf, code quality, and prototype labels Nov 11, 2022

facebook-github-bot added the cla signed label Nov 11, 2022

datumbox marked this pull request as draft November 11, 2022 18:22

datumbox commented Nov 11, 2022

datumbox and others added 4 commits November 11, 2022 18:39

Call directly _pad_with_scalar_fill

b5548ec

Fix linter

709b34a

Merge branch 'main' into perf/affine

3c38b97

Clean up center_crop_image_tensor

b9a6e74

datumbox added 2 commits November 14, 2022 10:36

Fix comments.

62b9d47

Fixing rounding issues.

b3a0bb1

datumbox mentioned this pull request Nov 14, 2022

Fix bug on prototype pad #6949

Merged

datumbox and others added 3 commits November 14, 2022 12:49

Bumping tolerance for rotate which is unrelated to this PR.

8e110f6

Fix tolerance threshold for RandomPerspective.

555df2d

Merge branch 'main' into perf/affine

5c1f433

datumbox commented Nov 14, 2022

datumbox added 2 commits November 14, 2022 13:57

Clean up _affine_grid and affine_image_tensor

a32be72

Clean up rotate_image_tensor

311ff85

datumbox commented Nov 14, 2022

Fixing linter

6644006

datumbox marked this pull request as ready for review November 14, 2022 14:48

datumbox changed the title ~~[WIP] [prototype] Optimize and clean up all affine methods~~ [prototype] Optimize and clean up all affine methods Nov 14, 2022

datumbox requested review from vfdev-5 and pmeier November 14, 2022 14:48

pmeier reviewed Nov 14, 2022

Address code-review comments.

d3639e0

pmeier approved these changes Nov 14, 2022

Merge branch 'main' into perf/affine

548ef68

datumbox merged commit b1f6c9e into pytorch:main Nov 14, 2022

datumbox deleted the perf/affine branch November 14, 2022 19:36

pmeier mentioned this pull request Feb 8, 2023

[NOMERGE] drop in transforms v2 into the v1 tests #7159

Draft
