Fix hardcoded 255 #6830

Merged (24 commits into pytorch:main, Nov 3, 2022)
Conversation

@pmeier (Collaborator) commented Oct 24, 2022

@@ -226,19 +226,15 @@ def adjust_hue_image_tensor(image: torch.Tensor, hue_factor: float) -> torch.Tensor:
        return image

    orig_dtype = image.dtype
    if image.dtype == torch.uint8:
        image = image / 255.0

pmeier (Collaborator Author):
Instead of doing the conversion manually, I've opted to use our kernel for this. Note that this also implicitly converts to float32 since the divisor is a float.
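
A minimal sketch of the idea, using the stable convert_image_dtype kernel as a stand-in for the prototype kernel actually used in the PR:

    import torch
    from torchvision.transforms.functional import convert_image_dtype

    image = torch.randint(0, 256, (3, 8, 8), dtype=torch.uint8)

    # before: manual scaling with a hardcoded uint8 maximum; the float divisor
    # implicitly promotes the result to float32
    manual = image / 255.0

    # after: the conversion kernel picks the correct maximum for the input dtype
    converted = convert_image_dtype(image, torch.float32)

    assert torch.allclose(manual, converted)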

@@ -15,12 +15,6 @@ def _assert_image_tensor(img: Tensor) -> None:
        raise TypeError("Tensor is not a torch image.")


def _assert_threshold(img: Tensor, threshold: float) -> None:

pmeier (Collaborator Author):

This was only used once so I inlined it.
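
For illustration, a standalone sketch of what the inlined check amounts to (a hypothetical, simplified solarize, not the actual torchvision kernel):

    import torch
    from torch import Tensor

    def solarize(img: Tensor, threshold: float) -> Tensor:
        # the former _assert_threshold helper, now inlined at its single call site
        bound = 1.0 if img.is_floating_point() else 255
        if threshold > bound:
            raise TypeError("Threshold should be less than bound of img.")
        # invert every pixel at or above the threshold
        return torch.where(img >= threshold, bound - img, img)

    img = torch.randint(0, 256, (3, 12, 23), dtype=torch.uint8)
    out = solarize(img, threshold=128)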

    F_t.solarize(img, threshold)


@pytest.mark.parametrize("device", cpu_and_gpu())
@pytest.mark.parametrize("threshold", [260])
def test_solarize_threshold2_upper_bound(threshold, device):
    img = torch.randint(0, 256, (3, 12, 23)).to(device)
    img = torch.randint(0, 256, (3, 12, 23), dtype=torch.uint8, device=device)

pmeier (Collaborator Author):

torch.randint returns int64 by default, which will no longer trigger an error for a threshold of 260.
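
A short reproduction of the dtype issue (this only illustrates the default, it is not the test file itself):

    import torch

    img = torch.randint(0, 256, (3, 12, 23))
    print(img.dtype)  # torch.int64 -- randint defaults to int64

    # with int64 the valid maximum is far above 260, so a threshold of 260 is no
    # longer out of bounds and the expected error is never raised; pinning the
    # dtype restores the intended uint8 upper bound of 255
    img = torch.randint(0, 256, (3, 12, 23), dtype=torch.uint8)
    print(img.dtype)  # torch.uint8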

pmeier (Collaborator Author):
Instead of just fixing this here, I opted to make the tests more robust in e13613a.

@datumbox (Contributor) commented:
@pmeier Thanks for the PR. Have you made any measurements on the V2 changes to check if there is a performance degradation? There are a few more hardcoded 255 values left, but those are in places where we only support uint8. Is there a plan to update them?

Finally, this change, though correct, has the potential to break existing code. Before merging we might need to cherry-pick it into FBcode to see if there is any breakage.

@pmeier (Collaborator Author) commented Oct 25, 2022

Have you made any measurements on the V2 changes to check if there is a performance degradation?

Not yet. For all ops except adjust_hue we only replace one if-else with a function call that includes an if-elif-else. This change should be in the nanosecond range and thus well within our measuring tolerance. Do you still want me to benchmark all of them?

There are a few more 255 hardcoded values left but those are in places where we support only uint8. Is there a plan to update them?

Nope, I don't see a point. Or do you mean taking another look at the kernels to check whether we are doing the wrong thing, i.e. nothing, for other integer dtypes? In that case, yes, that would be a good idea.

@pmeier (Collaborator Author) commented Oct 25, 2022

There are two places with a hardcoded 255 left:

  1. if interpolation == "bicubic" and out_dtype == torch.uint8:
         img = img.clamp(min=0, max=255)

     Two things here:

     1. Although the 255 is fine for torch.uint8, all other integer images will not be clamped and thus might fail the following conversion back to an integer dtype due to overflow. Thus, this should be something like

        if interpolation == "bicubic" and not out_dtype.is_floating_point:
            img = img.clamp(min=0, max=_max_value(out_dtype))

     2. That being said, I'm a little confused why we are clamping only for integer dtypes in the first place. Bicubic interpolation can lead to overflowing values (see the sketch after this list), so shouldn't we clamp regardless of the dtype? Otherwise, the value range [0.0, 1.0] is no longer guaranteed after this operation. Thus, I think this should be

        if interpolation == "bicubic":
            img = img.clamp_(0, _max_value(out_dtype))

     Maybe @vfdev-5 can shed some light here.

  2. if img_chan.is_cuda:
         hist = torch.histc(img_chan.to(torch.float32), bins=256, min=0, max=255)
     else:
         hist = torch.bincount(img_chan.reshape(-1), minlength=256)

     which is guarded by

     if img.dtype != torch.uint8:
         raise TypeError(f"Only torch.uint8 image tensors are supported, but found {img.dtype}")

     Meaning, equalize does not work with any dtype other than torch.uint8, and thus the hardcoded 255 is fine.

     That being said, I think we need to have a discussion about whether or not we want kernels that categorically only work with a subset of dtypes; in the extreme case, like here, with only a single dtype. That means, for example, that the AA transforms that use equalize internally cannot be used with floating point images. Their docstring states as much:

     class AutoAugment(torch.nn.Module):
         r"""AutoAugment data augmentation method based on
         `"AutoAugment: Learning Augmentation Strategies from Data" <https://arxiv.org/pdf/1805.09501.pdf>`_.
         If the image is torch Tensor, it should be of type torch.uint8, and it is expected

     but we should discuss whether or not we are ok with this behavior going forward. This should not happen in this PR; I will open an issue about it soon.
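
Regarding point 1 above, a small standalone sketch (hypothetical shapes, calling torch.nn.functional.interpolate directly rather than the torchvision resize kernel) of why bicubic interpolation can leave the [0.0, 1.0] range:

    import torch
    import torch.nn.functional as F

    img = torch.zeros(1, 1, 4, 4)
    img[..., 1:3, 1:3] = 1.0  # a sharp edge in an otherwise constant image

    out = F.interpolate(img, size=(8, 8), mode="bicubic", align_corners=False)
    # bicubic ringing can undershoot 0.0 and overshoot 1.0 around the edge
    print(out.min().item(), out.max().item())

    # a dtype-agnostic clamp, as proposed above, restores the valid range
    out = out.clamp_(0.0, 1.0)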

@datumbox (Contributor) commented:
@pmeier As discussed offline, checking only the methods that change significantly will do. No need to benchmark those that just fetch the value from the dictionary. Let's wait for Victor's thoughts on this.

@fmassa I was wondering if you could chime in as well. I'm supportive of Philip's change; I just wanted to make sure we don't miss something important. The TL;DR is: we replace the hardcoded 255 values with the maximum value of each dtype. Floats and uint8 remain unaffected, but other integer types would change. This would align the behaviour of the kernels with convert_image_dtype, which already uses dtype-specific max values. I think this can be considered a bug, but I'm not 100% sure whether it was originally intentional.
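
For reference, an illustrative stand-in for the dtype-dependent maximum the kernels now use (not the exact torchvision _max_value implementation):

    import torch

    def max_value(dtype: torch.dtype):
        # floating point images are assumed to live in [0.0, 1.0]
        if dtype.is_floating_point:
            return 1.0
        if dtype == torch.bool:
            return 1
        # integer images use the full positive range of their dtype
        return torch.iinfo(dtype).max

    print(max_value(torch.uint8))    # 255
    print(max_value(torch.int16))    # 32767
    print(max_value(torch.float32))  # 1.0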

@pmeier (Collaborator Author) commented Oct 25, 2022

[--------------------------- adjust_hue ---------------------------]
                                       |  main  |  fix-hardcoded-255
1 threads: ---------------------------------------------------------
      (3, 512, 512), uint8, cpu        |   14   |          15       
      (3, 512, 512), uint8, cuda       |    1   |           1       
      (3, 512, 512), float32, cpu      |   14   |          14       
      (3, 512, 512), float32, cuda     |    1   |           1       
      (5, 3, 512, 512), uint8, cpu     |   94   |          90       
      (5, 3, 512, 512), uint8, cuda    |    7   |           8       
      (5, 3, 512, 512), float32, cpu   |   88   |          84       
      (5, 3, 512, 512), float32, cuda  |    7   |           7       

Times are in milliseconds (ms).

No changes apart from noise.

@datumbox (Contributor) commented:
I've imported this PR into FBcode to check if there are any breakages: D40752944

@datumbox (Contributor) commented:
I ran all the tests internally and it seems the change didn't break anything. There are a lot of pre-existing failures and skipped tests, so we can't be 100% sure. But it looks like it's mostly OK.

@pmeier Do we need to update the PR to cover your recent changes on the 2 kernels?

@fmassa (Member) commented Nov 3, 2022

Hi @datumbox

I'm supportive of this change. We didn't have good support for other dtypes before, so assuming either uint8 or float was okay-ish. Happy to see this being improved!

@datumbox (Contributor) commented Nov 3, 2022

@pmeier Looks like we should be good to go once you finish fixing the remaining hardcoded values. Ping me when you are ready for one final review and merge. I've already ported this internally and it looks like there are no issues.

@datumbox (Contributor) left a review:
LGTM. I highlighted 2 places where we should measure performance.

    return (1 if image.is_floating_point() else 255) - image  # type: ignore[no-any-return]
else:  # signed integer dtypes
    # We can't use `Tensor.bitwise_not` here, since we want to retain the leading zero bit that encodes the sign
    return image.bitwise_xor((1 << _num_value_bits(image.dtype)) - 1)
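
A hedged sketch of why the xor mask is needed for signed dtypes; num_value_bits below is a hypothetical stand-in for the PR's _num_value_bits helper:

    import torch

    def num_value_bits(dtype: torch.dtype) -> int:
        # number of bits carrying magnitude information (the sign bit, if any, is excluded)
        return torch.iinfo(dtype).bits - (0 if dtype == torch.uint8 else 1)

    # unsigned: flipping every bit is the same as 255 - x
    img_u8 = torch.tensor([0, 1, 254, 255], dtype=torch.uint8)
    assert torch.equal(img_u8.bitwise_not(), 255 - img_u8)

    # signed: bitwise_not would also flip the sign bit, so only the value bits are flipped
    img_i8 = torch.tensor([-128, -1, 0, 127], dtype=torch.int8)
    mask = (1 << num_value_bits(torch.int8)) - 1  # 0b0111_1111 == 127
    print(img_i8.bitwise_xor(mask))  # tensor([  -1, -128,  127,    0], dtype=torch.int8)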

datumbox (Contributor):
Can you provide benchmarks for this?

pmeier (Collaborator Author):
[--------------------------- invert_image_tensor ---------------------------]
                                      |       main       |  fix-hardcoded-255
1 threads: ------------------------------------------------------------------
      (3, 512, 512), float32, cpu     |   61 (+-  0) us  |     57 (+-  0) us 
      (3, 512, 512), uint8, cpu       |   17 (+-  0) us  |     17 (+-  0) us 
      (3, 512, 512), int32, cpu       |   78 (+-  0) us  |     63 (+-  0) us 
      (5, 3, 512, 512), float32, cpu  |  461 (+- 33) us  |    445 (+- 31) us 
      (5, 3, 512, 512), uint8, cpu    |   98 (+-  1) us  |     79 (+-  1) us 
      (5, 3, 512, 512), int32, cpu    |  538 (+- 67) us  |    514 (+-  8) us 

Times are in microseconds (us).

datumbox (Contributor):
Nice! I like it when bug/code-quality fixing leads to speed improvements. What more can we ask? 😄

@pmeier (Collaborator Author) commented Nov 3, 2022

There was some offline discussion about whether or not we want to remove the implicit assumption that floating point images have a maximum value of 1.0. Here are some benchmarks:

[-------------- convert_dtype_image_tensor float32 -> float64 ---------------]
                                      |        main       |  fix-hardcoded-255
1 threads: -------------------------------------------------------------------
      (3, 512, 512), float32, cpu     |    81 (+-  0) us  |    151 (+-  1) us 
      (5, 3, 512, 512), float32, cpu  |  1406 (+- 58) us  |   2183 (+-143) us 

Times are in microseconds (us).
  • Float-to-float conversion will be slower because we need to perform an additional (in-place) multiplication, whereas before a dtype conversion was sufficient. However, if we assume that every floating point dtype has the same value range, we don't need to touch this conversion at all.
[--------------- convert_dtype_image_tensor float32 -> uint8 ----------------]
                                      |        main       |  fix-hardcoded-255  
1 threads: -------------------------------------------------------------------
      (3, 512, 512), float32, cpu     |   414 (+- 37) us  |    408 (+-  3) us   
      (5, 3, 512, 512), float32, cpu  |  2326 (+-182) us  |   2208 (+- 34) us   

Times are in microseconds (us).
  • If anything, the new version should be slower since we need an additional (Python scalar) division. Not sure where the measured difference comes from.
[-------------- convert_dtype_image_tensor uint8 -> float32 --------------]
                                    |       main       |  fix-hardcoded-255
1 threads: ----------------------------------------------------------------
      (3, 512, 512), uint8, cpu     |  133 (+-  1) us  |     95 (+-  0) us 
      (5, 3, 512, 512), uint8, cpu  |  640 (+-  3) us  |    438 (+-  6) us 

Times are in microseconds (us).
  • If anything, the new version should be slower since we need an additional (Python scalar) division. The performance improvement here comes from a trick I found while implementing this patch that is independent of this change: instead of doing a tensor division, we can do a Python scalar division followed by a tensor multiplication. Effectively, this turns

    return image.to(dtype).div_(_FT._max_value(image.dtype))

    into

    return image.to(dtype).mul_(1 / _FT._max_value(image.dtype))

    Although it seems unrelated here, I've included this improvement in the benchmark, because the only thing that changes is that the 1 turns into _FT._max_value(dtype).
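
A self-contained check of the division-to-multiplication trick (the tensor shape and dtype here are just for illustration):

    import torch

    image = torch.randint(0, 256, (3, 512, 512), dtype=torch.uint8)

    # one tensor division: every element is divided by the dtype maximum
    a = image.to(torch.float32).div_(255)

    # one Python scalar division up front, then a single tensor multiplication
    b = image.to(torch.float32).mul_(1 / 255)

    # up to floating point rounding, both produce the same result
    assert torch.allclose(a, b)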

@datumbox (Contributor) commented Nov 3, 2022

There was some offline discussion whether or not we want to remove the implicit assumption that floating point images have the maximum value 1.0.

Let's not adopt anything that slows us down.

Instead of doing a tensor division, we can simply do a Python scalar division followed by a tensor multiplication.

What the heck! Well sounds good to me. Shall we try the trick in other places too?

Here are some places where we could apply it:

h = h.div_(6.0).add_(1.0).fmod_(1.0)

return image.mul(levels).floor_().clamp_(0, levels - 1).div_(levels)

return image.to(dtype).div_(_FT._max_value(image.dtype))

Finally, from what I understand, none of the changes you make here are expected to cause speed regressions. Can you confirm?

@pmeier (Collaborator Author) commented Nov 3, 2022

Let's not adopt anything that slows us down.

As discussed offline, there are probably a lot more implicit assumptions on the floating point range than what I detailed above. We agreed to just put a comment on the value inside the _FT._max_value function to indicate that this can't be changed easily.

What the heck! Well sounds good to me. Shall we try the trick in other places too?

Will do so in a follow-up, since it is unrelated to this PR.

Finally from what I understand, none of the changes you make here are expected to cause speed regressions, can you confirm?

Nope, perf should be the same. For some ops I posted benchmarks and they all show either no difference or even an improvement.

@pmeier pmeier merged commit cb4413a into pytorch:main Nov 3, 2022
@pmeier pmeier deleted the fix-hardcoded-255 branch November 3, 2022 17:11
facebook-github-bot pushed a commit that referenced this pull request Nov 4, 2022
Summary:
* fix prototype kernels

* fix stable kernels

* fix tests

* make test more robust

* improve invert for signed integers

* improve invert

* fix posterize

* Revert "assume that integer images are [0, 255] in equalize (#6859)"

This reverts commit 436ff9a.

* fix solarize in AA

* fix resize

* Revert "fix resize"

This reverts commit 5f33f4a.

* add comment to float max value

Reviewed By: datumbox

Differential Revision: D41020539

fbshipit-source-id: 1c618ead36a0ae4d93b4ebf07186fd39bd85d915

Co-authored-by: Vasilis Vryniotis <[email protected]>
Successfully merging this pull request may close these issues.

Don't hardcode 255 unless uint8 is enforced
4 participants