
[prototype] Speed up autocontrast_image_tensor #6935


Merged: 3 commits merged into pytorch:main on Nov 9, 2022

Conversation

@datumbox (Contributor) commented Nov 9, 2022

Related to #6818

A performance improvement for uint8 images:

[-------------------------- autocontrast_image_tensor cpu torch.float32 --------------------------]
                         |  autocontrast_image_tensor old  |      fn2 new       |      fn3 new     
1 threads: ----------------------------------------------------------------------------------------
      (16, 3, 400, 400)  |         14050 (+-359) us        |  14020 (+-225) us  |  14085 (+-297) us
      (3, 400, 400)      |          528 (+-  1) us         |   529 (+-  1) us   |   533 (+-  1) us 
6 threads: ----------------------------------------------------------------------------------------
      (16, 3, 400, 400)  |         14855 (+-242) us        |  14814 (+- 60) us  |  14752 (+-447) us
      (3, 400, 400)      |          745 (+-  5) us         |   747 (+- 20) us   |   752 (+- 13) us 

Times are in microseconds (us).

[------------------------ autocontrast_image_tensor cuda torch.float32 -----------------------]
                         |  autocontrast_image_tensor old  |     fn2 new      |     fn3 new    
1 threads: ------------------------------------------------------------------------------------
      (16, 3, 400, 400)  |          224 (+-  0) us         |  223 (+-  0) us  |  223 (+-  0) us
      (3, 400, 400)      |           97 (+-  0) us         |   82 (+-  0) us  |   87 (+-  1) us
6 threads: ------------------------------------------------------------------------------------
      (16, 3, 400, 400)  |          224 (+-  3) us         |  224 (+-  2) us  |  224 (+-  1) us
      (3, 400, 400)      |           97 (+-  2) us         |   82 (+-  1) us  |   87 (+-  2) us

Times are in microseconds (us).

[--------------------------- autocontrast_image_tensor cpu torch.uint8 ---------------------------]
                         |  autocontrast_image_tensor old  |      fn2 new       |      fn3 new     
1 threads: ----------------------------------------------------------------------------------------
      (16, 3, 400, 400)  |         20519 (+-200) us        |  14527 (+- 50) us  |  17883 (+- 85) us
      (3, 400, 400)      |         1029 (+-  7) us         |   828 (+-  6) us   |  1025 (+-  8) us 
6 threads: ----------------------------------------------------------------------------------------
      (16, 3, 400, 400)  |         21208 (+-394) us        |  15201 (+-328) us  |  18550 (+-484) us
      (3, 400, 400)      |         1336 (+- 29) us         |  1131 (+- 27) us   |  1313 (+- 50) us 

Times are in microseconds (us).

[------------------------- autocontrast_image_tensor cuda torch.uint8 ------------------------]
                         |  autocontrast_image_tensor old  |     fn2 new      |     fn3 new    
1 threads: ------------------------------------------------------------------------------------
      (16, 3, 400, 400)  |          236 (+-  0) us         |  275 (+-  0) us  |  231 (+-  0) us
      (3, 400, 400)      |          123 (+-  1) us         |   98 (+-  1) us  |  106 (+-  1) us
6 threads: ------------------------------------------------------------------------------------
      (16, 3, 400, 400)  |          235 (+-  2) us         |  273 (+-  1) us  |  231 (+-  3) us
      (3, 400, 400)      |          123 (+-  2) us         |   98 (+-  1) us  |  107 (+-  2) us

Times are in microseconds (us).

fn2 is the submitted variant. For uint8 inputs it is about 30% faster on CPU but up to 15% slower on GPU.

fn3 is another candidate (not included in this PR). It's 13% faster on CPU and about the same on GPU. Here is the implementation:

def fn3(image: torch.Tensor) -> torch.Tensor:
    c = image.shape[-3]
    if c not in [1, 3]:
        raise TypeError(f"Input image tensor permitted channel values are {[1, 3]}, but found {c}")

    if image.numel() == 0:
        # exit early on empty images
        return image

    bound = _FT._max_value(image.dtype)  # e.g. 255 for uint8, 1.0 for floats
    fp = image.is_floating_point()
    dtype = image.dtype if fp else torch.float32

    # per-channel min/max over the spatial dims
    minimum = image.amin(dim=(-2, -1), keepdim=True)
    maximum = image.amax(dim=(-2, -1), keepdim=True)
    eq_idxs = maximum == minimum
    if not fp:
        maximum = maximum.to(dtype)

    inv_scale = maximum.sub_(minimum).div_(bound)  # (max - min) / bound
    # leave degenerate channels (max == min) unchanged
    minimum[eq_idxs] = 0.0
    inv_scale[eq_idxs] = 1.0

    # out= variant: saves one intermediate allocation, but see below re autograd
    output = torch.empty_like(image, dtype=dtype)
    torch.sub(image, minimum, out=output)
    return output.div_(inv_scale).clamp_(0, bound).to(image.dtype)
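A minimal numeric check of the formula fn3 implements (stretching each channel so its min maps to 0 and its max to the dtype bound). This is a standalone sketch: `_FT._max_value(torch.uint8)` is replaced with the literal 255, and rounding before the uint8 cast is added purely for this illustration.

```python
import torch

# three channels spanning [50, 113], [114, 177], [178, 241]
img = (torch.arange(192, dtype=torch.float32).reshape(3, 8, 8) + 50).to(torch.uint8)
bound = 255.0  # stand-in for _FT._max_value(torch.uint8)

minimum = img.amin(dim=(-2, -1), keepdim=True).to(torch.float32)
maximum = img.amax(dim=(-2, -1), keepdim=True).to(torch.float32)
inv_scale = (maximum - minimum) / bound  # (max - min) / bound, as in fn3

out = ((img - minimum) / inv_scale).clamp_(0, bound).round().to(torch.uint8)
print(out.amin(dim=(-2, -1)).tolist())  # each channel min stretched to 0
print(out.amax(dim=(-2, -1)).tolist())  # each channel max stretched to 255
```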

In offline discussions with @pmeier and @vfdev-5, we decided to go with fn2 because it optimizes the most common use case, which is running the op on CPU. Moreover, the fn3 variant uses the out= idiom, which doesn't play nicely with autograd.
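The fn2 variant itself is not reproduced in this thread. A plausible sketch of what an out=-free version looks like, inferred from fn3 above (the name and details are hypothetical; the actual fn2 is in the PR diff):

```python
import torch

def fn2_sketch(image: torch.Tensor) -> torch.Tensor:
    # Hypothetical reconstruction, not the PR's actual code.
    bound = 255.0 if image.dtype == torch.uint8 else 1.0  # stand-in for _FT._max_value
    dtype = image.dtype if image.is_floating_point() else torch.float32

    minimum = image.amin(dim=(-2, -1), keepdim=True).to(dtype)
    maximum = image.amax(dim=(-2, -1), keepdim=True).to(dtype)
    eq_idxs = maximum == minimum
    inv_scale = maximum.sub_(minimum).div_(bound)  # (max - min) / bound
    # leave degenerate channels (max == min) unchanged
    minimum[eq_idxs] = 0.0
    inv_scale[eq_idxs] = 1.0

    # Plain out-of-place sub instead of torch.sub(..., out=...):
    # allocates one intermediate but stays autograd-friendly.
    return image.sub(minimum).div_(inv_scale).clamp_(0, bound).to(image.dtype)
```
The key difference from fn3 is only the last step: `image.sub(minimum)` lets autograd track the op, where `torch.sub(..., out=output)` would not.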

cc @vfdev-5 @bjuncek @pmeier

@pmeier (Collaborator) left a comment
LGTM if CI is green (with the caveat of #6934). Thanks Vasilis!

@datumbox (Contributor, Author) commented Nov 9, 2022

@pmeier I get failures with:

Mismatched elements: 987 / 2772 (35.6%)
Greatest absolute difference: 5.960464477539063e-08 at index (0, 0, 0, 2)
Greatest relative difference: 1.1880246168204803e-07 at index (3, 1, 6, 11)

I thought we were using higher tolerances. Do you think this threshold is reasonable?

@pmeier (Collaborator) commented Nov 9, 2022

The failure happens in the consistency tests which test for equality by default

self.closeness_kwargs = closeness_kwargs or dict(rtol=0, atol=0)

You can add this

# Use default tolerances of `torch.testing.assert_close`
closeness_kwargs=dict(rtol=None, atol=None),

to

ConsistencyConfig(
    prototype_transforms.RandomAutocontrast,
    legacy_transforms.RandomAutocontrast,
    [
        ArgsKwargs(p=0),
        ArgsKwargs(p=1),
    ],
),

for reasonable default tolerances. We needed to do this for a few ops where we changed algorithms and in turn got slight deviations from v1.
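The difference between the exact comparison and the default tolerances can be seen directly. A small standalone illustration (not from the test suite; for float32, `torch.testing.assert_close` defaults to rtol=1.3e-6, atol=1e-5):

```python
import torch

a = torch.tensor([1.0])
b = a + 5e-7  # representable perturbation, well inside float32 default tolerances

# rtol=None, atol=None selects the per-dtype defaults, so this passes:
torch.testing.assert_close(a, b, rtol=None, atol=None)

# The consistency tests' rtol=0, atol=0 demand bitwise-equal values:
try:
    torch.testing.assert_close(a, b, rtol=0, atol=0)
    exact_ok = True
except AssertionError:
    exact_ok = False
print(exact_ok)  # the exact comparison rejects the tiny deviation
```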

@datumbox (Contributor, Author) commented Nov 9, 2022

@pmeier Thanks for the advice. Worked locally. I'll rerun the tests to be sure.

@datumbox (Contributor, Author) commented Nov 9, 2022

The failing test is the false positive that @vfdev-5 is currently investigating (see #6933 (comment))

@datumbox datumbox merged commit ffd5a56 into pytorch:main Nov 9, 2022
@datumbox datumbox deleted the perf/autocontrast branch November 9, 2022 17:30
facebook-github-bot pushed a commit that referenced this pull request Nov 14, 2022
Summary:
* Performance optimization for autocontrast

* Fixing tests

Reviewed By: NicolasHug

Differential Revision: D41265202

fbshipit-source-id: cd1f9f777ecf56168def256a2ef04335a602684b