
RFC: transforms #1


Open · wants to merge 26 commits into master

Conversation


@pmeier pmeier commented Jun 25, 2021

This is an RFC on how transforms will work with the datasets after the rework of the datasets API. Rendered version here


pmeier commented Jun 28, 2021

After some more thought, I see two additional problems that we need to solve. Happy to hear your ideas / thoughts.


1. What happens if a transform changes some feature attributes, for example by cropping an image or converting a bounding box to a new format? In the first case the information is at least still encoded in the new shape of the image, but for a bounding box conversion the information is gone. Thus, we need to somehow reflect these changes back to the features.

To achieve that, we could either change the feature dictionary in place, or we would need to somehow return a changed version with each transform.


2. In its current state the FeatureTransform is able to transform multiple features, but has no notion of coherence between the features. Thus, it is impossible to share data between individual transformation steps. For example, to rotate a bounding box we need the shape of the underlying image, but there is no way to share it (see the sketch below). The same goes for joint random transforms that need to act on the same set of parameters drawn for each sample.
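
A minimal sketch of the second problem; the function names are hypothetical and only illustrate the missing channel for shared data:

def rotate_image(image, *, degrees):
    ...  # has everything it needs in image and degrees

def rotate_bounding_box(bounding_box, *, degrees):
    # rotating the box around the image center requires the height and
    # width of the underlying image, but only the box itself is passed
    # in, and there is no channel to share that information
    ...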

@NicolasHug NicolasHug left a comment

Thanks a lot for putting this together @pmeier !

I'm trying to get a better sense of the scope and the challenges here, so I mostly have questions at this point :)

I'll start here with a naive one: instead of introducing the new SampleTransform and FeatureTransform classes, could we "simply" (<-- gotta love this word) extend our current transforms from torchvision.transforms to handle not just tensors, but also X where X is what the datapipe returns? This way users could basically do dataset.map(torchvision.transforms.Rotate(30.0)), which seems optimal in terms of UX.

(I'm not saying we should do this, but the answer to this question will help me better understand where we're going)


Also, a seemingly related discussion sparked in pytorch/vision#4029 (comment). If we start supporting new Tensor-like classes like class Image(Tensor) and class BoundingBox(Tensor), we might be able to avoid the introduction of the new Feature structure.

@pmeier pmeier left a comment

@NicolasHug

Also, a seemingly related discussion sparked in pytorch/vision#4029 (comment). If we start supporting new Tensor-like classes like class Image(Tensor) and class BoundingBox(Tensor), we might be able to avoid the introduction of the new Feature structure.

If we can do that, it would solve a lot of the questions you have and issues I mentioned. PyTorch offers the __torch_function__ hook. Without going into details, it makes it possible to perform any PyTorch operation, for example transforming an image, and return our custom class. So something like

>>> image = torchvision.Image(torch.rand(3, 256, 256))
>>> transformed_image = image + 1
>>> type(transformed_image)
<class 'torchvision.Image'>

is possible without any hacks from our side. Doing something like this of course implies that we convert all numerics to tensors.
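
As a minimal sketch of how this could work (the Image class and its constructor here are illustrative, not a settled API), PyTorch's default __torch_function__ machinery already returns the subclass for most operations:

import torch

class Image(torch.Tensor):
    def __new__(cls, data):
        # _make_subclass wraps an existing tensor in the subclass
        # without copying the underlying data
        return torch.Tensor._make_subclass(cls, torch.as_tensor(data))

image = Image(torch.rand(3, 256, 256))
print(type(image + 1))  # <class '__main__.Image'>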

The problem we had with this approach is that we were unable to come up with a strict enough rule set that would let us determine whether an arbitrary operation should return our custom tensor type. Take the example above: is a floating-point tensor with values outside of the range [0, 1] still an image? To name a second one: if you subtract a random real number from a class label, is the result still a class label? You can come up with a number of these questions for all proposed custom tensor classes.

It should be fairly easy to control this for all builtin transformations, but we can't guarantee that in general.

Thus, we decided not to encode this extra information in the tensor itself. The new datapipes will return a dictionary as a sample. Encoding the information in the keys rather than the values is also possible, but this has two problems:

  • Having non-string keys would make working with a sample dictionary a lot more cumbersome.
  • We can't guarantee that a user transformation keeps the keys either, so this has the same problems as encoding the information in the values of the dictionary.

What is the rationale for separating the Feature type and the value?

For example, what is the benefit of this over

transform(Image(torch.rand(3, 256, 256)))
transform(BoundingBox((0, 0, 256, 256), type="XYHW"))

?

Thus, we came up with the plan to encode the extra information in a custom support structure.

IMO in the end this all boils down to two possible scenarios:

  1. We provide a more rigid framework that is based on the assumption that the transform ensures that it only returns valid tensor types. This makes the UX better for users that only rely on builtin transforms. At the same time it will be harder to implement custom transformations.
  2. We base all of our stuff on very minimal assumptions at the cost of a worse UX.

I vote for option 1., but let's discuss this idea a little more.

This would also make something like

could we "simply" (<-- gotta love this word) extend our currents transforms from torchvsion.transforms to handle not just tensors, but also X where X is what the datapipe returns? This way users could basically do dataset.map(torchvision.transforms.Rotate(30.0)), which seems optimal in terms of UX.

a lot simpler, because now each transform would know which feature transform to dispatch to based on the input type.
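
A rough sketch of such dispatch, assuming Image and BoundingBox tensor subclasses as discussed above (the method names mirror the proposal, the bodies are stubs):

class Rotate:
    def __init__(self, degrees):
        self.degrees = degrees

    @staticmethod
    def image(input, *, degrees):
        ...  # rotate the pixel data

    @staticmethod
    def bounding_box(input, *, degrees):
        ...  # rotate the box coordinates

    def __call__(self, input):
        # dispatch on the feature type; unknown types pass through unchanged
        if isinstance(input, Image):
            return self.image(input, degrees=self.degrees)
        if isinstance(input, BoundingBox):
            return self.bounding_box(input, degrees=self.degrees)
        return input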

@pmeier
Copy link
Owner Author

pmeier commented Jul 5, 2021

After some more thought, I think @NicolasHug's comments hit the nail on the head: my proposal was complicated and still wasn't able to solve the issues that both he and I pointed out. Thus, I rewrote the proposal under the assumption that we can use custom tensor classes. I've added an implications section to discuss this. Overall I like the new proposal much better. Happy to hear your thoughts.

@pmeier pmeier requested a review from NicolasHug July 5, 2021 14:43
@fmassa fmassa left a comment

This is looking pretty good, thanks a lot @pmeier !

As we've discussed on VC, here are a few notes from our chat.

__all__ = ["Feature"]


class Feature(torch.Tensor):
Collaborator

This is great!

One meta-question that the team needs to figure out is whether we will support torchscript for those features or not. IIUC this might not propagate the inherited type to torchscripted functions.

@pmeier pmeier Aug 20, 2021

I've split this into Feature and TensorFeature, since a Feature might not necessarily be a Tensor. For example, Text cannot be represented by a tensor, but could very well be regarded as a Feature from torchtext.

def format(self) -> BoundingBoxFormat:
    return self._format

def convert(self, format: Union[str, BoundingBoxFormat]) -> "BoundingBox":
Collaborator

Nice! One other thing to think about is whether there is a minimum common API across all our data types that we might want to enforce.

Owner Author

Currently, I don't see anything. Happy to hear ideas though. If something comes up later, that should be retrofittable.

Owner Author

I've added a TensorFeature.from_tensor method that should make it easy to create a new feature from just a tensor.
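
For illustration, a minimal sketch of what such a helper could look like (not necessarily the exact implementation in the PR):

import torch

class TensorFeature(torch.Tensor):
    @classmethod
    def from_tensor(cls, tensor):
        # wrap an existing tensor in the feature subclass without copying
        return torch.Tensor._make_subclass(cls, tensor)

feature = TensorFeature.from_tensor(torch.rand(3, 256, 256))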

    return Image(input.flip((-1,)))

@staticmethod
def bounding_box(input: BoundingBox) -> BoundingBox:
@datumbox datumbox Aug 31, 2021

Agreed. This way, when possible, we share the same functional implementation for images, bboxes, and masks, and when not possible we implement a new one. Also, we leave it up to the user to make the call depending on the use case rather than assuming the bbox is in some part of the target.

See also this corner case transform that requires joint transform of all image, label, bbox:
https://github.com/pytorch/vision/blob/96f6e0a117d5c56f7e0237851dbb96144ebb110b/references/detection/transforms.py#L54-L129

Owner Author

Not sure I understand:

we share the same functional implementation for images and bboxes and when not possible we implement new

In most cases the implementation will be different for images, bounding boxes, and possibly other types. In any case, we will always have a separate method for each type. In case the transformation is the same, it is the developer's responsibility to implement it with minimal duplication:

class FooTransform(Transform):
    @staticmethod
    def _foo(input):
        return input

    @staticmethod
    def image(input: Image) -> Image:
        return FooTransform._foo(input)

    @staticmethod
    def bounding_box(input: BoundingBox) -> BoundingBox:
        return FooTransform._foo(input)

Also we leave it up to the user to make the call depending on the use-case rather than assuming the BBox is in some part of the target.

That is actually what we are trying to avoid. If you want to, you can call HorizontalFlip.image(...), but during normal operation you would do

transform = HorizontalFlip()
transformed_image = transform(image)
transformed_bbox = transform(bbox)

This is needed to be able to transform a complete sample drawn from a dataset, which might include more than just an image.
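
For example, transforming a whole sample could then look like this sketch, assuming values the transform does not recognize pass through unchanged (the sample keys are illustrative):

transform = HorizontalFlip()
sample = {"image": image, "bounding_box": bbox, "label": label}
# the transform dispatches on each value's feature type and leaves
# everything it does not recognize (e.g. the label) untouched
transformed_sample = {key: transform(value) for key, value in sample.items()}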

For a better overview of what we are trying to achieve, please read the accompanying document which is updated as this RFC progresses.

Collaborator

@pmeier I didn't disagree with you. I just said I like the approach :)

What I meant is that there are a few transforms that can be implemented for images and bboxes in the same way. That can live in the functional API and, as you said, we can call it from both image() and bounding_box(). But the proposed API also supports cases where this is not possible, by providing separate implementations for the two.

That is actually what we are trying to avoid.

Yes, I had a discussion with @fmassa who explained that you want to avoid this to make things composable. Sounds good. Just make sure you look at the example I sent you, because there are transforms where image, bbox, and labels must be processed together. I think the API can cover that, but it's worth having this use case in mind.

Comment on lines 55 to 57
# transform takes none or more than one positional argument
if len(argspec.args) != 1:
    continue
@fmassa fmassa Aug 31, 2021

As we discussed in VC, I believe this can lead to subtle bugs. If the user writes their transform as

class Rotate(Transform):
    @staticmethod
    def image(img, angle):
        pass

instead of

class Rotate(Transform):
    @staticmethod
    def image(img, *, angle):
        pass

then the transformation will not be automatically registered, and no errors / warnings will show up; the transform will just silently not be applied because it wasn't properly auto-registered.

I think it might be better to be explicit here and instead use decorators to register a function to a transform. Something in the lines of

class Rotate(Transform):
    pass

@register_transform(transform=Rotate, input_type=Image)
def rotate_image(img, angle):
    pass

@pmeier pmeier Sep 1, 2021

I think it might be better to be explicit here and instead use decorators to register a function to a transform.

  1. I dislike this approach, because we would have to manually namespace everything with the feature type. For example, I think it is more concise to have Rotate.image and Rotate.bounding_box rather than rotate_image and rotate_bounding_box. Note that in the latter case, the namespace would have roughly n times more entries where n is the number of features we support.

  2. The auto registering can be disabled and the registering can be performed manually in the constructor:

    class Rotate(Transform, auto_register=False):
        def __init__(self):
            super().__init__()
            self.register_feature_transform(Image, self.foo)
    
        @staticmethod
        def foo(input, angle):
            pass
  3. If we ultimately go with registering standalone functions with a decorator, I suggest we use a classmethod of the transform, i.e.

    class Rotate(Transform):
        pass
    
    
    @Rotate.register(Image)
    def rotate_image(input, angle):
        pass
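
For reference, a minimal sketch of how such a register classmethod could work (illustrative only, not the PR's implementation):

class Transform:
    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # each subclass gets its own registry of feature transforms
        cls._feature_transforms = {}

    @classmethod
    def register(cls, feature_type):
        def decorator(feature_transform):
            cls._feature_transforms[feature_type] = feature_transform
            return feature_transform
        return decorator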

As we discussed in VC, I believe this can lead to subtle bugs.

True, good catch! The rule can be made more robust though. We invoke the feature transforms with feature_transform(input, **params), where params can be an empty dict. That means all of these signatures need to be supported:

def foo(input):
    pass

def foo(input=None):
    pass

def foo(input, bar):  # your example
    pass

def foo(input, bar=None):
    pass

def foo(input=None, bar=None):
    pass

def foo(input, *, bar):
    pass

def foo(input, *, bar=None):
    pass

def foo(input=None, *, bar):
    pass

def foo(input=None, *, bar=None):
    pass

For Python >=3.8 we also need to keep positional-only arguments in mind.
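
A sketch of such a compatibility check using inspect; the exact rule here is an assumption, the real one lives in the commit mentioned in the follow-up:

import inspect

def is_compatible(feature_transform):
    # the transform will be invoked as feature_transform(input, **params)
    params = list(inspect.signature(feature_transform).parameters.values())
    if not params:
        return False
    first, rest = params[0], params[1:]
    # the first parameter must accept `input` positionally
    if first.kind not in (first.POSITIONAL_ONLY, first.POSITIONAL_OR_KEYWORD):
        return False
    # all remaining parameters must be addressable by keyword
    return all(
        p.kind in (p.POSITIONAL_OR_KEYWORD, p.KEYWORD_ONLY, p.VAR_KEYWORD)
        for p in rest
    )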

For the debugging, the user can use transform.is_supported (EDIT: as of writing this comment, this method only existed locally. It is included in the most recent commit) and pass a feature type or instance to see if it would be transformed or not. But I agree that one would probably only do that after discovering that something is wrong. Another idea would be to add a verbose flag that prints extra information during the auto registration.


Since all of this is a trade-off between convenience with possible subtle bugs and explicitness with higher verbosity, I think we should wait for more opinions before we decide.

Owner Author

I've added a commit that makes the matching more robust and adds the option to print information about the process:

class TestTransform(Transform, verbose=True):
    @staticmethod
    def incompatible_signature1():
        pass

    @staticmethod
    def incompatible_signature2(*, input):
        pass

    # Python >=3.8 only
    # @staticmethod
    # def incompatible_signature3(input, foo, /):
    #     pass

    @staticmethod
    def _private(input):
        pass

    @staticmethod
    def unknown(input):
        pass

    @staticmethod
    def imaeg(input):
        pass

    @staticmethod
    def image(input, foo):
        pass

    @staticmethod
    def boundingbox(input):
        pass

    @staticmethod
    def bounding_box(input, *, foo):
        pass
TestTransform._private() was not registered as feature transform, because it is private.
TestTransform.bounding_box() was registered as feature transform for type 'BoundingBox'.
TestTransform.boundingbox() was not registered as feature transform, because its name doesn't match any known feature type. Did you mean to name it 'bounding_box' to be registered for type 'BoundingBox'?
TestTransform.imaeg() was not registered as feature transform, because its name doesn't match any known feature type. Did you mean to name it 'image' to be registered for type 'Image'?
TestTransform.image() was registered as feature transform for type 'Image'.
TestTransform.incompatible_signature1() was not registered as feature transform, because it cannot be invoked with incompatible_signature1(input, **params).
TestTransform.incompatible_signature2() was not registered as feature transform, because it cannot be invoked with incompatible_signature2(input, **params).
TestTransform.unknown() was not registered as feature transform, because its name doesn't match any known feature type.

self.formats.XYWH: self._xyxy_to_xywh,
self.formats.CXCYWH: self._xyxy_to_cxcywh,
}

Collaborator

For when we will be discussing specifics of the data classes, it might be good to add a __torch_function__ that forbids applying operations on two bounding boxes if they have different image_size, while still allowing BoundingBox + Tensor to work.
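
A rough sketch of such a guard, assuming a BoundingBox subclass that carries an image_size attribute (names and details are illustrative):

import torch

class BoundingBox(torch.Tensor):
    def __new__(cls, data, *, image_size):
        bounding_box = torch.Tensor._make_subclass(cls, torch.as_tensor(data))
        bounding_box.image_size = image_size
        return bounding_box

    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        # mixing boxes that live on differently sized images is an error,
        # while BoundingBox + Tensor still works since only one size is found
        image_sizes = {a.image_size for a in args if isinstance(a, BoundingBox)}
        if len(image_sizes) > 1:
            raise ValueError(f"mismatching image sizes: {image_sizes}")
        return super().__torch_function__(func, types, args, kwargs or {})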


ppwwyyxx commented Sep 2, 2021

It seems a big issue of the current prototype is that it cannot implement Compose.get_params and therefore cannot implement nested Compose.

  • Why it cannot implement Compose.get_params (efficiently):
    Compose.get_params cannot be implemented simply by calling each sub-transform's get_params. For example, let's try to obtain Compose.get_params of Compose([A, B, C]). To call B.get_params you need the input sample of B. Therefore you first need to run the transform A on the sample before you can call B.get_params.
    In other words, getting the params already requires running/applying the transform. When the params are returned, they are used again (in Transform.forward()) to apply the transform, which is a 2x waste of computation, as the sketch below illustrates.
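
A hypothetical sketch of that inefficient implementation:

class Compose:
    def __init__(self, transforms):
        self.transforms = transforms

    def get_params(self, sample):
        params = []
        for transform in self.transforms:
            transform_params = transform.get_params(sample)
            params.append(transform_params)
            # the next get_params needs the transformed sample, so the
            # transform already has to be applied here
            sample = transform(sample, params=transform_params)
        return params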

It may seem that nested Compose is useless because it can be flattened, but that's not the case. Beyond Compose as a sequential container, there might be other types of containers, such as Compose([RandomOrder([T1, T2, T3]), T4]) or RandomChoice([Compose([T1, T2]), T3]). Pipelines with nested structures will all have this issue.


pmeier commented Sep 2, 2021

@ppwwyyxx

It seems a big issue of the current prototype is that it cannot implement Compose.get_params, therefore cannot implement nested Compose.

I agree with the argument, but disagree with the conclusion.

During "normal" operation the transform containers (Compose, RandomChoice, ...) do not need access to the get_params method. Every transformation simply receives the sample it is supposed to transform and thus there is no duplicated computation. Example: Assume we have Compose[Compose[T1, T2], T3]. T1 receives the sample, T2 receives T1(sample), and T3 receives Compose[T1, T2](sample).

Still, there might be scenarios where your conclusion is right.

If we have two equal but separate transforms that should operate with the same parameters:

transform1 = MyTransform()
transform2 = MyTransform()
params = transform1.get_params()
transformed_image1 = transform1(image, params=params)
transformed_image2 = transform2(image, params=params)

I currently fail to see a use case for this, but maybe there is one.


pmeier commented Sep 2, 2021

@ppwwyyxx I've added minimal implementations for RandomChoice and RandomApply.


pmeier commented Sep 2, 2021

With the latest commits, the accompanying document is no longer in sync. I'll update it ASAP.

@ppwwyyxx

Thanks @pmeier for following up! I agree that you can implement the containers by giving up the ability to do get_params. However, I do feel that having get_params available is pretty important. I'll list some scenarios where this is useful:

  1. With get_params, users can save the params and later use them to apply an inverse of the transforms (if the transform supports it). This is useful during inference as well as for test-time augmentation, where we transform the images, detect objects on the new image, and then need to invert the transform to map the detected boxes/masks back to the original image space.

  2. Users can obtain a sequence of params and apply the transforms one by one. This allows them to insert logic in between. In code this looks like:

# transforms: a composite transform container (e.g. with Compose, RandomOrder, etc.)
tfms_and_prms: List[Tuple[Transform, dict]] = transforms.get_params(inputs)
for tfm, param in tfms_and_prms:
    inputs = tfm(inputs, params=param)
    # HERE: do some extra work with inputs

Inserting this logic has a few use cases, e.g.:

  • For bounding boxes, it's up to users whether they want to clip the boxes to the image size after every transform (might be needed after rotation/affine). They can make this decision here.
  • For keypoints, users need to modify the keypoint labels (e.g. left eye -> right eye) if the transform is a flip. Users can implement this logic if they can obtain the parameters of the transform to check whether a flip has happened.
  • Keypoints in COCO have a "visibility" field; users may (or may not) want to check after every transform whether some keypoints are out of range and mark them as invisible.

This extra but very simple logic can be added in the above snippet without changing the input types the transforms will see.
Without the ability to manually apply the params, the above custom logic still seems doable, but it will probably need to be added as new data types or subclasses of existing data types, with new transform methods registered for those types. I'm not sure at the moment how easy it is to extend the types, but maybe that's not bad.

However, the above snippet is actually fundamentally difficult to realize because, as I commented earlier, doing get_params already requires running the transforms, so we waste some computation. The way we address this in detectron2 is to separately define "a minimal subset of inputs that are needed to get_params", which we called "AugInput". This also feels a bit awkward, so I don't have a good suggestion on what exactly to do. But I hope the above example use cases are useful.


pmeier commented Sep 13, 2021

@ppwwyyxx Thanks a lot for your input!

  1. With get_params users can save the params and later use them to apply an inverse of the transforms (if the transform supports it). This is useful during inference as well as test-time augmentation, where we transform the images, detect objects on the new image, and then need to invert the transform so we map the detected boxes/masks back to the original image space.

I agree, that is an important use case that should be supported. Would it be sufficient to have a functional inverse

transform = MyComplexTransform()

sample = next(dataset)

transformed_sample = transform(sample)
transformed_prediction = model(transformed_sample["input"])

# magically materialize this here, more about that later
params = ...
prediction = transform.inverse(transformed_prediction, params=params)

or should we actually instantiate an inverse transform?

inverse of the transforms (if the transform supports it)

How do you decide whether a transform has an inverse or not? For example, crop is the inverse of pad, but is pad also considered an inverse of crop? I would say no for images, since pad cannot retrieve the information lost by crop. For bounding boxes or keypoints, on the other hand, I would say yes, since no information is lost there (see the toy example below).
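
As a toy example of this asymmetry for bounding boxes, crop and pad are mere coordinate shifts and thus exact inverses of each other (the helpers are hypothetical):

def crop_box(box_xyxy, *, top, left):
    x1, y1, x2, y2 = box_xyxy
    return (x1 - left, y1 - top, x2 - left, y2 - top)

def pad_box(box_xyxy, *, top, left):
    x1, y1, x2, y2 = box_xyxy
    return (x1 + left, y1 + top, x2 + left, y2 + top)

# shifting the coordinates out and back loses nothing, unlike for pixels
assert pad_box(crop_box((10, 10, 50, 50), top=5, left=5), top=5, left=5) == (10, 10, 50, 50)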

  2. Users can obtain a sequence of params, and apply the transforms one by one. This allows them to insert logic in between.

Although I generally dislike polymorphism, it might make things simpler here. We could have a return_params: bool = False flag in the forward such that

transform = MyComplexTransform()

sample = next(dataset)

transformed_sample1, params = transform(sample, return_params=True)
transformed_sample2 = transform(sample, params=params)

assert transformed_sample1 == transformed_sample2

This would transform your snippet into

# transforms: a composite transform container (e.g. with Compose, RandomOrder, etc.)
for tfm in transforms:
    inputs, params = tfm(inputs, return_params=True)
    # HERE: do some extra work with inputs

and avoid extracting the params upfront, which would otherwise require either doubled computation or dummy inputs.
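
For concreteness, a sketch of a forward that supports both calling conventions (_transform stands in for the actual per-feature dispatch and is an assumption here):

import torch

class Transform(torch.nn.Module):
    def get_params(self, sample):
        # draw random parameters for one sample; empty by default
        return {}

    def _transform(self, sample, params):
        raise NotImplementedError

    def forward(self, sample, *, params=None, return_params=False):
        if params is None:
            params = self.get_params(sample)
        transformed_sample = self._transform(sample, params)
        return (transformed_sample, params) if return_params else transformed_sample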

What do you think?
