[prototype] Speed up adjust_hue_image_tensor #6938

Merged: 7 commits into pytorch:main on Nov 10, 2022

Conversation

@datumbox (Contributor) commented Nov 9, 2022

Related to #6818

Small performance improvement by making use of in-place ops where possible:

[-------------- adjust_hue_image_tensor cpu torch.float32 ---------------]
                         |  adjust_hue_image_tensor old  |     fn2 new    
1 threads: ---------------------------------------------------------------
      (16, 3, 400, 400)  |         467 (+- 13) ms        |  450 (+- 12) ms
      (3, 400, 400)      |          14 (+-  0) ms        |   14 (+-  0) ms
6 threads: ---------------------------------------------------------------
      (16, 3, 400, 400)  |         457 (+- 13) ms        |  451 (+-  9) ms
      (3, 400, 400)      |          18 (+-  0) ms        |   17 (+-  0) ms

Times are in milliseconds (ms).

[--------------- adjust_hue_image_tensor cuda torch.float32 --------------]
                         |  adjust_hue_image_tensor old  |      fn2 new    
1 threads: ----------------------------------------------------------------
      (16, 3, 400, 400)  |        2260 (+-  0) us        |  2200 (+-  1) us
      (3, 400, 400)      |         571 (+-  2) us        |   526 (+-  1) us
6 threads: ----------------------------------------------------------------
      (16, 3, 400, 400)  |        2261 (+- 10) us        |  2200 (+- 15) us
      (3, 400, 400)      |         570 (+- 19) us        |   525 (+-  4) us

Times are in microseconds (us).

[--------------- adjust_hue_image_tensor cpu torch.uint8 ----------------]
                         |  adjust_hue_image_tensor old  |     fn2 new    
1 threads: ---------------------------------------------------------------
      (16, 3, 400, 400)  |         483 (+- 20) ms        |  461 (+- 13) ms
      (3, 400, 400)      |          15 (+-  0) ms        |   15 (+-  0) ms
6 threads: ---------------------------------------------------------------
      (16, 3, 400, 400)  |         487 (+- 15) ms        |  479 (+- 17) ms
      (3, 400, 400)      |          19 (+-  0) ms        |   18 (+-  1) ms

Times are in milliseconds (ms).

[---------------- adjust_hue_image_tensor cuda torch.uint8 ---------------]
                         |  adjust_hue_image_tensor old  |      fn2 new    
1 threads: ----------------------------------------------------------------
      (16, 3, 400, 400)  |        2433 (+-  1) us        |  2365 (+-  0) us
      (3, 400, 400)      |         623 (+-  1) us        |   580 (+-  2) us
6 threads: ----------------------------------------------------------------
      (16, 3, 400, 400)  |        2436 (+-  6) us        |  2366 (+-  0) us
      (3, 400, 400)      |         622 (+- 21) us        |   581 (+-  2) us

Times are in microseconds (us).

cc @vfdev-5 @bjuncek @pmeier
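
Numbers like the above are typically gathered with torch.utils.benchmark. Below is a minimal sketch of such a harness, assuming adjust_hue_old and adjust_hue_new refer to the two implementations being compared (the actual benchmark script for this PR is not part of the thread):

import torch
from torch.utils import benchmark

# adjust_hue_old / adjust_hue_new are stand-ins for the two implementations.
results = []
for device in ("cpu", "cuda"):
    for dtype in (torch.float32, torch.uint8):
        for num_threads in (1, 6):
            for shape in ((16, 3, 400, 400), (3, 400, 400)):
                if dtype is torch.uint8:
                    img = torch.randint(0, 256, shape, dtype=dtype, device=device)
                else:
                    img = torch.rand(shape, device=device)
                for description, fn in (("old", adjust_hue_old), ("new", adjust_hue_new)):
                    results.append(
                        benchmark.Timer(
                            stmt="fn(img, 0.1)",
                            globals={"fn": fn, "img": img},
                            label=f"adjust_hue_image_tensor {device} {dtype}",
                            sub_label=str(shape),
                            description=description,
                            num_threads=num_threads,
                        ).blocked_autorange(min_run_time=5)
                    )
benchmark.Compare(results).print()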

@datumbox (Contributor, Author) left a comment:

Some clarification comments. Frankly, I was expecting a more significant speed improvement given the number of in-place ops we could do here. The reason we don't see one is that adjust_hue is very computation-heavy and one of our slowest transforms: these optimizations shave off a few ms, but that's a small fraction of the total time.

Comment on lines +208 to +210
hg = rc.add(2.0).sub_(bc).mul_(mask_maxc_eq_g & mask_maxc_neq_r)
hr = bc.sub_(gc).mul_(~mask_maxc_neq_r)
hb = gc.add_(4.0).sub_(rc).mul_(mask_maxc_neq_r.logical_and_(mask_maxc_eq_g.logical_not_()))

@datumbox (Contributor, Author):

Changing the order of operations allows us to do more in-place ops.

In particular, once hg is estimated, we can do an in-place op on bc during the hr estimation. Then, in the estimation of hb, we can do in-place ops on gc and the logical masks.
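
To spell out the liveness reasoning, here is the same snippet restated with comments (variable roles as in _rgb_to_hsv; the annotations are mine):

hg = rc.add(2.0).sub_(bc).mul_(mask_maxc_eq_g & mask_maxc_neq_r)
# rc.add(2.0) is out-of-place because rc is still read below when computing hb;
# the fresh result can then be mutated freely (sub_, mul_).
hr = bc.sub_(gc).mul_(~mask_maxc_neq_r)
# bc had its last read in the hg line, so it can now be clobbered in-place.
hb = gc.add_(4.0).sub_(rc).mul_(mask_maxc_neq_r.logical_and_(mask_maxc_eq_g.logical_not_()))
# gc and both masks have no readers after this line, so every op is in-place.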

Comment on lines +224 to +228
sxf = s * f
one_minus_s = 1.0 - s
q = (1.0 - sxf).mul_(v).clamp_(0.0, 1.0)
t = sxf.add_(one_minus_s).mul_(v).clamp_(0.0, 1.0)
p = one_minus_s.mul_(v).clamp_(0.0, 1.0)

@datumbox (Contributor, Author):

Again we reorder to be able to do more in-place ops, and we expand the math ops to reuse components.

We precompute s*f, which is used in both the q and t estimations, and do the same for 1-s. We then estimate q first, so that we can later modify sxf in-place in the t estimation. Finally, one_minus_s can be modified in-place in the estimation of p.
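
Spelled out against the underlying formulas, p = v*(1-s), q = v*(1-s*f), and t = v*(1-s*(1-f)) = v*(1-s+s*f); the annotations below are mine:

sxf = s * f                # shared subexpression for q and t
one_minus_s = 1.0 - s      # shared subexpression for t and p
q = (1.0 - sxf).mul_(v).clamp_(0.0, 1.0)
# (1.0 - sxf) allocates a fresh tensor, so sxf survives for the t line.
t = sxf.add_(one_minus_s).mul_(v).clamp_(0.0, 1.0)
# sxf has no readers after this line, so it is mutated in-place.
p = one_minus_s.mul_(v).clamp_(0.0, 1.0)
# one_minus_s likewise has no further readers, so it is mutated in-place.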

@@ -234,7 +235,7 @@ def _hsv_to_rgb(img: torch.Tensor) -> torch.Tensor:
     a3 = torch.stack((p, p, t, v, v, q), dim=-3)
     a4 = torch.stack((a1, a2, a3), dim=-4)

-    return (a4.mul_(mask.to(dtype=img.dtype).unsqueeze(dim=-4))).sum(dim=-3)
+    return (a4.mul_(mask.unsqueeze(dim=-4))).sum(dim=-3)

@datumbox (Contributor, Author):

This removes an unnecessary cast of mask: it's possible to multiply directly with bools, as we already do for mask_maxc_eq_g etc.
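
A quick self-contained check of the promotion rule (shapes here are arbitrary):

import torch

a4 = torch.rand(6, 3, 4, 4)
mask = torch.rand(3, 4, 4) > 0.5  # dtype is torch.bool

# An in-place mul_ with a bool operand type-promotes the bool to the float
# tensor's dtype on the fly, so an explicit mask.to(dtype=...) is redundant.
out = a4.mul_(mask.unsqueeze(0))
assert out.dtype == torch.float32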

Comment on lines +219 to +221
  h6 = h.mul(6)
  i = torch.floor(h6)
- f = h6 - i
+ f = h6.sub_(i)

@datumbox (Contributor, Author):

Minor optimizations based on @pmeier's finding that mul is preferable to * for Python numbers. Also, h6 can be modified in-place as it's not reused.

@pmeier (Collaborator):

As discussed offline, we only get benefits if we can eliminate a tensor division, which is not the case here.
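
An illustration of the kind of rewrite that does pay off, per the comment above (this example is mine, not from the diff): replacing a tensor division by a Python scalar with a multiplication by its reciprocal.

import torch

img = torch.rand(16, 3, 400, 400)

out_div = img / 255.0           # tensor division by a scalar
out_mul = img.mul(1.0 / 255.0)  # one Python-side reciprocal, then a cheaper multiplication

torch.testing.assert_close(out_div, out_mul)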

@@ -20,6 +19,8 @@ def decode_image_with_pil(encoded_image: torch.Tensor) -> features.Image:

 @torch.jit.unused
 def decode_video_with_av(encoded_video: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor, Dict[str, Any]]:
+    import unittest.mock

@datumbox (Contributor, Author):

Just a drive-by change to avoid the hard dependency on unittest. @pmeier said offline that we can clean up many methods that are no longer used; he is going to do this in a separate PR.
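
The pattern in a nutshell (a sketch with the body elided; only the import placement matters):

from typing import Any, Dict, Tuple

import torch

@torch.jit.unused
def decode_video_with_av(encoded_video: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor, Dict[str, Any]]:
    # Importing inside the function turns unittest into a soft dependency:
    # it is only needed when video decoding is actually invoked, not when
    # the module is imported.
    import unittest.mock

    ...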

@pmeier (Collaborator):

Will send a PR soon.

@@ -164,6 +164,7 @@ def convert_format_bounding_box(
     if new_format == old_format:
         return bounding_box

+    # TODO: Add _xywh_to_cxcywh and _cxcywh_to_xywh to improve performance

@datumbox (Contributor, Author):

Not the highest priority, since we don't do such conversions internally, but it might be good to offer these two helpers and stop doing two conversions in the future. A sketch of what they could look like follows below.
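
For illustration, a hypothetical sketch of the two helpers named in the TODO (these functions don't exist yet, and the bodies assume floating-point boxes); each would be a single conversion instead of going through XYXY twice:

import torch

def _xywh_to_cxcywh(xywh: torch.Tensor) -> torch.Tensor:
    # (x, y, w, h) with (x, y) the top-left corner -> (cx, cy, w, h).
    x, y, w, h = xywh.unbind(dim=-1)
    return torch.stack((x + w * 0.5, y + h * 0.5, w, h), dim=-1)

def _cxcywh_to_xywh(cxcywh: torch.Tensor) -> torch.Tensor:
    # (cx, cy, w, h) with (cx, cy) the box center -> (x, y, w, h).
    cx, cy, w, h = cxcywh.unbind(dim=-1)
    return torch.stack((cx - w * 0.5, cy - h * 0.5, w, h), dim=-1)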

@pmeier (Collaborator):

I'm OK with that, as the number of formats is low. If that changes in the future, we may need to walk this back or only partially implement 1-to-1 conversions for all formats.

@pmeier (Collaborator) left a comment:

Thanks!

@datumbox merged commit d72e906 into pytorch:main on Nov 10, 2022
@datumbox deleted the perf/hue branch on November 10, 2022
facebook-github-bot pushed a commit that referenced this pull request Nov 14, 2022
Summary:
* Performance optimization on adjust_hue_image_tensor

* handle ints

* Inplace logical ops

* Remove unnecessary casting.

* Fix linter.

Reviewed By: NicolasHug

Differential Revision: D41265196

fbshipit-source-id: f761c1238f42eb1771de520dcea88b74d016f3d2