
remove unnecessary checks from posterize_image_tensor #6823

Closed
wants to merge 3 commits

Conversation

pmeier
Collaborator

@pmeier pmeier commented Oct 24, 2022

The v1 kernel includes a lot of checks that aren't useful:

_assert_image_tensor(img)
if img.ndim < 3:
    raise TypeError(f"Input image tensor should have at least 3 dimensions, but found {img.ndim}")
if img.dtype != torch.uint8:
    raise TypeError(f"Only torch.uint8 image tensors are supported, but found {img.dtype}")
_assert_channels(img, [1, 3])

  • we are already inside the tensor kernel, so there is no need to assert that the input is a tensor again
  • the kernel works on an arbitrary number of channels, so there is no need to enforce {1, 3}
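For reference, the mask arithmetic the kernel relies on is purely elementwise, which is why the channel restriction buys nothing. A minimal pure-Python sketch of the operation on a single uint8 value (the helper name is hypothetical, not part of the kernel):

```python
def posterize_value(value: int, bits: int) -> int:
    """Keep only the `bits` most significant bits of a uint8 value."""
    # -int(2 ** (8 - bits)) is the same mask expression the kernel uses;
    # in 8-bit two's complement it is `bits` ones followed by (8 - bits) zeros.
    mask = -int(2 ** (8 - bits)) & 0xFF  # e.g. bits=4 -> 0b11110000 == 240
    return value & mask

print(posterize_value(200, 4))  # 0b11001000 -> 0b11000000 == 192
print(posterize_value(255, 1))  # keep only the top bit -> 128
```

Since this is applied independently to every element, the shape and channel count of the tensor are irrelevant.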

Other than that, there is unfortunately no further optimization possible. Similar to #6819 (comment), it seems that using scalars with out-of-place ops is faster than manually putting them into a tensor and using in-place ops:

import itertools

import torch
from torch.utils import benchmark


shapes = [
    (3, 256, 256),  # single image
    (5, 3, 256, 256),  # single video
]
devices = ["cpu", "cuda"]


def scalar_and(image, bits):
    mask = -int(2 ** (8 - bits))
    return mask & image


def full_like(image, bits):
    mask = torch.full_like(image, -int(2 ** (8 - bits)))
    return mask.bitwise_and_(image)


timers = [
    benchmark.Timer(
        stmt="fn(input, bits)",
        globals=dict(
            fn=fn,
            input=torch.testing.make_tensor(shape, dtype=torch.uint8, device=device, low=0, high=256),
            bits=4,
        ),
        label="posterize",
        sub_label=f"{shape!s:16} / {device:4}",
        description=fn.__name__,
    )
    for fn, shape, device in itertools.product([scalar_and, full_like], shapes, devices)
]

measurements = [timer.blocked_autorange(min_run_time=5) for timer in timers]

comparison = benchmark.Compare(measurements)
comparison.trim_significant_figures()
comparison.print()
[---------------------- posterize -----------------------]
                               |  scalar_and  |  full_like
1 threads: -----------------------------------------------
      (3, 256, 256)    / cpu   |      28      |      95   
      (3, 256, 256)    / cuda  |       5      |       7   
      (5, 3, 256, 256) / cpu   |     119      |     466   
      (5, 3, 256, 256) / cuda  |       5      |       7   

Times are in microseconds (us).
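The two formulations are elementwise equivalent; `full_like` merely materializes the scalar mask into a full tensor first, paying for the extra allocation and memory traffic. A pure-Python sketch of that equivalence, using lists as stand-ins for the tensors (`MASK_BITS` is just the benchmark's `bits=4`):

```python
MASK_BITS = 4  # the bits value used in the benchmark above

def scalar_and(values, bits):
    # broadcast a single Python scalar mask over all elements
    mask = -int(2 ** (8 - bits))
    return [(mask & v) & 0xFF for v in values]

def full_like(values, bits):
    # materialize one mask entry per element first, then AND elementwise
    masks = [-int(2 ** (8 - bits))] * len(values)
    return [(m & v) & 0xFF for m, v in zip(masks, values)]

values = list(range(256))
assert scalar_and(values, MASK_BITS) == full_like(values, MASK_BITS)
```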

Nevertheless, I've run the benchmark from #6818 to see if the extra checks have an effect on performance:

[------- posterize @ torchvision==0.15.0a0+62da7d4 --------]
                                            |   v1   |   v2 
1 threads: -------------------------------------------------
      (1, 512, 512)       / uint8   / cpu   |    37  |    35
      (1, 512, 512)       / uint8   / cuda  |     8  |     5
      (3, 512, 512)       / uint8   / cpu   |    97  |    93
      (3, 512, 512)       / uint8   / cuda  |     9  |     5
      (5, 3, 512, 512)    / uint8   / cpu   |   446  |   464
      (5, 3, 512, 512)    / uint8   / cuda  |    34  |    35
      (4, 5, 3, 512, 512) / uint8   / cpu   |  1900  |  1800
      (4, 5, 3, 512, 512) / uint8   / cuda  |   130  |   130
2 threads: -------------------------------------------------
      (1, 512, 512)       / uint8   / cpu   |    27  |    23
      (3, 512, 512)       / uint8   / cpu   |    59  |    54
      (5, 3, 512, 512)    / uint8   / cpu   |   243  |   240
      (4, 5, 3, 512, 512) / uint8   / cpu   |   940  |   938
4 threads: -------------------------------------------------
      (1, 512, 512)       / uint8   / cpu   |    18  |    15
      (3, 512, 512)       / uint8   / cpu   |    36  |    32
      (5, 3, 512, 512)    / uint8   / cpu   |   130  |   127
      (4, 5, 3, 512, 512) / uint8   / cpu   |   480  |   480

Times are in microseconds (us).

As expected, there is no performance difference. Any measured difference is just noise.

datumbox
datumbox previously approved these changes Oct 24, 2022
Contributor

@datumbox datumbox left a comment


LGTM, thanks!

@datumbox
Contributor

@pmeier Given there is no performance boost, we could consider deferring the merge to reduce copy-pasted code; continuing to alias is one potential way to handle kernels that can't be optimized further. No strong opinion; I'll leave it to you whether you want to merge or not.

@datumbox datumbox dismissed their stale review October 24, 2022 18:05

Doesn't improve performance, we should discuss aliasing.

@pmeier
Collaborator Author

pmeier commented Oct 27, 2022

Superseded by #6847.

@pmeier pmeier closed this Oct 27, 2022