[prototype] Speed up adjust_hue_image_tensor
#6938
Conversation
Some clarification comments. Frankly, I was expecting a more significant speed improvement given the number of in-place ops we can do here. The reason we don't see one is that adjust_hue is very computation heavy and one of our slowest transforms. These optimizations shave off a few ms, but that accounts for only a small fraction of the total time.
hg = rc.add(2.0).sub_(bc).mul_(mask_maxc_eq_g & mask_maxc_neq_r)
hr = bc.sub_(gc).mul_(~mask_maxc_neq_r)
hb = gc.add_(4.0).sub_(rc).mul_(mask_maxc_neq_r.logical_and_(mask_maxc_eq_g.logical_not_()))
Changing the order of operations allows us to do more in-place ops. In particular, once hg is estimated we can operate in-place on bc during the hr estimation. Then, in the estimation of hb, we can operate in-place on gc and on the logical masks, as the sketch below spells out.
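Concretely, the dependency order looks like this (a minimal standalone sketch; the wrapper function and its comments are ours, while the three statements match the diff):

```python
import torch

# Hypothetical wrapper around the three statements from the diff; rc, gc, bc
# and the masks mirror the intermediates inside _rgb_to_hsv.
def hue_channels(rc, gc, bc, mask_maxc_neq_r, mask_maxc_eq_g):
    # hg first: the out-of-place `add` leaves rc intact for hb below.
    hg = rc.add(2.0).sub_(bc).mul_(mask_maxc_eq_g & mask_maxc_neq_r)
    # bc has no reader after hg, so hr can consume it in-place.
    hr = bc.sub_(gc).mul_(~mask_maxc_neq_r)
    # gc and both masks are dead after this line, so everything is in-place.
    hb = gc.add_(4.0).sub_(rc).mul_(
        mask_maxc_neq_r.logical_and_(mask_maxc_eq_g.logical_not_())
    )
    return hr, hg, hb
```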
sxf = s * f
one_minus_s = 1.0 - s
q = (1.0 - sxf).mul_(v).clamp_(0.0, 1.0)
t = sxf.add_(one_minus_s).mul_(v).clamp_(0.0, 1.0)
p = one_minus_s.mul_(v).clamp_(0.0, 1.0)
Again we reorder to be able to do more in-place ops, and we expand the math ops to reuse components. We precompute s*f, which is used in both the q and t estimations, and do the same for 1-s. We then estimate q first, so that we can later modify sxf in-place during the t estimation. Finally, one_minus_s can be modified in-place in the estimation of p. The sketch below spells out the ordering constraints.
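A minimal sketch (the wrapper function and comments are ours; the five statements match the diff):

```python
import torch

# s, f, v stand in for the saturation, fractional-part and value tensors
# used inside _hsv_to_rgb.
def compute_q_t_p(s, f, v):
    sxf = s * f                # reused by q and t
    one_minus_s = 1.0 - s      # reused by t and p
    # q first: (1.0 - sxf) allocates a fresh tensor, leaving sxf alive for t.
    q = (1.0 - sxf).mul_(v).clamp_(0.0, 1.0)
    # t is the last reader of sxf, so sxf can be consumed in-place here.
    t = sxf.add_(one_minus_s).mul_(v).clamp_(0.0, 1.0)
    # p is the last reader of one_minus_s, likewise consumed in-place.
    p = one_minus_s.mul_(v).clamp_(0.0, 1.0)
    return q, t, p
```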
@@ -234,7 +235,7 @@ def _hsv_to_rgb(img: torch.Tensor) -> torch.Tensor:
    a3 = torch.stack((p, p, t, v, v, q), dim=-3)
    a4 = torch.stack((a1, a2, a3), dim=-4)

-   return (a4.mul_(mask.to(dtype=img.dtype).unsqueeze(dim=-4))).sum(dim=-3)
+   return (a4.mul_(mask.unsqueeze(dim=-4))).sum(dim=-3)
Unnecessary casting of mask. It's possible to do a multiplication with bools, as we did previously for mask_maxc_eq_g etc.
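A quick self-contained check of that claim (shapes here are illustrative, not the ones from _hsv_to_rgb):

```python
import torch

a4 = torch.rand(3, 6, 4, 5)
mask = torch.rand(6, 4, 5) > 0.5  # bool tensor

# Float-by-bool multiplication promotes the mask to 0.0/1.0 automatically,
# so the explicit cast buys nothing.
out_cast = a4.mul(mask.to(dtype=a4.dtype).unsqueeze(dim=-4)).sum(dim=-3)
out_bool = a4.mul(mask.unsqueeze(dim=-4)).sum(dim=-3)
torch.testing.assert_close(out_cast, out_bool)
```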
h6 = h.mul(6)
i = torch.floor(h6)
-f = h6 - i
+f = h6.sub_(i)
Minor optimizations based on @pmeier's finding that mul is preferable to * for numbers. Also, h6 can be modified in-place as it's not reused.
As discussed offline, we only get benefits if we can eliminate a tensor division, which is not the case here.
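For reference, a sketch of the kind of rewrite that does pay off, i.e. eliminating tensor divisions by precomputing a reciprocal (our illustration, not from this diff):

```python
import torch

maxc = torch.rand(3, 64, 64) + 1.0  # kept away from zero for the demo
a = torch.rand(3, 64, 64)
b = torch.rand(3, 64, 64)

# Two tensor divisions ...
qa, qb = a / maxc, b / maxc
# ... versus one division plus two cheaper multiplications.
rec = 1.0 / maxc
qa2, qb2 = a * rec, b * rec
torch.testing.assert_close(qa, qa2)
torch.testing.assert_close(qb, qb2)
```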
@@ -20,6 +19,8 @@ def decode_image_with_pil(encoded_image: torch.Tensor) -> features.Image:

@torch.jit.unused
def decode_video_with_av(encoded_video: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor, Dict[str, Any]]:
    import unittest.mock
Just a drive-by change to avoid the hard dependency on unittest. @pmeier said offline that we can clean up many methods that are no longer used; he is going to do this in a separate PR.
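For context, a minimal sketch of the deferred-import pattern used here (the function body is elided):

```python
from typing import Any, Dict, Tuple

import torch


def decode_video_with_av(encoded_video: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor, Dict[str, Any]]:
    # Deferred import: unittest is only pulled in when video decoding
    # actually runs, not when the surrounding module is imported.
    import unittest.mock
    ...
```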
Will send a PR soon.
@@ -164,6 +164,7 @@ def convert_format_bounding_box(
    if new_format == old_format:
        return bounding_box

+   # TODO: Add _xywh_to_cxcywh and _cxcywh_to_xywh to improve performance
Not the highest priority, as we don't do such conversions internally, but it might be good to offer these two and stop doing 2 conversions in the future.
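A hypothetical sketch of the two direct converters named in the TODO (today such a conversion would round-trip through xyxy, i.e. two conversions instead of one):

```python
import torch

def _xywh_to_cxcywh(boxes: torch.Tensor) -> torch.Tensor:
    # (x, y, w, h) with top-left corner -> (cx, cy, w, h) with box center.
    x, y, w, h = boxes.unbind(dim=-1)
    return torch.stack((x + w * 0.5, y + h * 0.5, w, h), dim=-1)

def _cxcywh_to_xywh(boxes: torch.Tensor) -> torch.Tensor:
    # (cx, cy, w, h) with box center -> (x, y, w, h) with top-left corner.
    cx, cy, w, h = boxes.unbind(dim=-1)
    return torch.stack((cx - w * 0.5, cy - h * 0.5, w, h), dim=-1)
```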
I'm OK with that, as the number of formats is low. If that changes in the future, we may need to walk this back or only partially implement 1-to-1 conversions for all formats.
Thanks!
Summary:

* Performance optimization on adjust_hue_image_tensor
* Handle ints
* Inplace logical ops
* Remove unnecessary casting
* Fix linter

Reviewed By: NicolasHug

Differential Revision: D41265196

fbshipit-source-id: f761c1238f42eb1771de520dcea88b74d016f3d2
Related to #6818
Small performance improvement by making use of inplace ops where possible:
cc @vfdev-5 @bjuncek @pmeier