
remove unnecessary checks from posterize_image_tensor #6823

Closed
wants to merge 3 commits

Conversation

pmeier
Collaborator

@pmeier pmeier commented Oct 24, 2022

The v1 kernel includes a lot of checks that aren't useful:

_assert_image_tensor(img)
if img.ndim < 3:
    raise TypeError(f"Input image tensor should have at least 3 dimensions, but found {img.ndim}")
if img.dtype != torch.uint8:
    raise TypeError(f"Only torch.uint8 image tensors are supported, but found {img.dtype}")
_assert_channels(img, [1, 3])

  • we are already inside the tensor kernel, so there is no need to assert that the input is a tensor again
  • the kernel works on an arbitrary number of channels, so there is no need to enforce {1, 3}
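For reference, the mask arithmetic the kernel relies on is purely elementwise, which is why the channel restriction buys nothing. A minimal pure-Python sketch of the operation on a single uint8 value (the helper name is hypothetical, not part of the kernel):

```python
def posterize_value(value: int, bits: int) -> int:
    """Keep only the `bits` most significant bits of a uint8 value."""
    # -int(2 ** (8 - bits)) is the same mask expression the kernel uses;
    # in 8-bit two's complement it is `bits` ones followed by (8 - bits) zeros.
    mask = -int(2 ** (8 - bits)) & 0xFF  # e.g. bits=4 -> 0b11110000 == 240
    return value & mask

print(posterize_value(200, 4))  # 0b11001000 -> 0b11000000 == 192
print(posterize_value(255, 1))  # keep only the top bit -> 128
```

Since this is applied independently to every element, the shape and channel count of the tensor are irrelevant.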

Other than that, there is unfortunately no further optimization possible. Similar to #6819 (comment), it seems that using scalars with out-of-place ops is faster than manually putting them into a tensor and using in-place ops:

import itertools

import torch
from torch.utils import benchmark


shapes = [
    (3, 256, 256),  # single image
    (5, 3, 256, 256),  # single video
]
devices = ["cpu", "cuda"]


def scalar_and(image, bits):
    mask = -int(2 ** (8 - bits))
    return mask & image


def full_like(image, bits):
    mask = torch.full_like(image, -int(2 ** (8 - bits)))
    return mask.bitwise_and_(image)


timers = [
    benchmark.Timer(
        stmt="fn(input, bits)",
        globals=dict(
            fn=fn,
            input=torch.testing.make_tensor(shape, dtype=torch.uint8, device=device, low=0, high=256),
            bits=4,
        ),
        label="posterize",
        sub_label=f"{shape!s:16} / {device:4}",
        description=fn.__name__,
    )
    for fn, shape, device in itertools.product([scalar_and, full_like], shapes, devices)
]

measurements = [timer.blocked_autorange(min_run_time=5) for timer in timers]

comparison = benchmark.Compare(measurements)
comparison.trim_significant_figures()
comparison.print()
[---------------------- posterize -----------------------]
                               |  scalar_and  |  full_like
1 threads: -----------------------------------------------
      (3, 256, 256)    / cpu   |      28      |      95   
      (3, 256, 256)    / cuda  |       5      |       7   
      (5, 3, 256, 256) / cpu   |     119      |     466   
      (5, 3, 256, 256) / cuda  |       5      |       7   

Times are in microseconds (us).
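The two formulations are elementwise equivalent; `full_like` merely materializes the scalar mask into a full tensor first, paying for the extra allocation and memory traffic. A pure-Python sketch of that equivalence, using lists as stand-ins for the tensors (`MASK_BITS` is just the benchmark's `bits=4`):

```python
MASK_BITS = 4  # the bits value used in the benchmark above

def scalar_and(values, bits):
    # broadcast a single Python scalar mask over all elements
    mask = -int(2 ** (8 - bits))
    return [(mask & v) & 0xFF for v in values]

def full_like(values, bits):
    # materialize one mask entry per element first, then AND elementwise
    masks = [-int(2 ** (8 - bits))] * len(values)
    return [(m & v) & 0xFF for m, v in zip(masks, values)]

values = list(range(256))
assert scalar_and(values, MASK_BITS) == full_like(values, MASK_BITS)
```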

Nevertheless, I've run the benchmark from #6818 to see if the extra checks have an effect on performance:

[------- posterize @ torchvision==0.15.0a0+62da7d4 --------]
                                            |   v1   |   v2 
1 threads: -------------------------------------------------
      (1, 512, 512)       / uint8   / cpu   |    37  |    35
      (1, 512, 512)       / uint8   / cuda  |     8  |     5
      (3, 512, 512)       / uint8   / cpu   |    97  |    93
      (3, 512, 512)       / uint8   / cuda  |     9  |     5
      (5, 3, 512, 512)    / uint8   / cpu   |   446  |   464
      (5, 3, 512, 512)    / uint8   / cuda  |    34  |    35
      (4, 5, 3, 512, 512) / uint8   / cpu   |  1900  |  1800
      (4, 5, 3, 512, 512) / uint8   / cuda  |   130  |   130
2 threads: -------------------------------------------------
      (1, 512, 512)       / uint8   / cpu   |    27  |    23
      (3, 512, 512)       / uint8   / cpu   |    59  |    54
      (5, 3, 512, 512)    / uint8   / cpu   |   243  |   240
      (4, 5, 3, 512, 512) / uint8   / cpu   |   940  |   938
4 threads: -------------------------------------------------
      (1, 512, 512)       / uint8   / cpu   |    18  |    15
      (3, 512, 512)       / uint8   / cpu   |    36  |    32
      (5, 3, 512, 512)    / uint8   / cpu   |   130  |   127
      (4, 5, 3, 512, 512) / uint8   / cpu   |   480  |   480

Times are in microseconds (us).

As expected, there is no performance difference. Any measured difference is just noise.

datumbox
datumbox previously approved these changes Oct 24, 2022
Contributor

@datumbox datumbox left a comment


LGTM, thanks!

@datumbox
Contributor

@pmeier Given there is no performance boost, we could consider deferring the merge to reduce copy-pasted code; continuing to alias is one potential way to handle kernels that can't be optimized further. No strong opinion; I'll leave it to you whether you want to merge or not.

@datumbox datumbox dismissed their stale review October 24, 2022 18:05

Doesn't improve performance, we should discuss aliasing.

@pmeier
Collaborator Author

pmeier commented Oct 27, 2022

Superseded by #6847.

@pmeier pmeier closed this Oct 27, 2022