RFC: transforms #1
Conversation
After some more thought, I see two additional problems that we need to solve. Happy to hear your ideas / thoughts. What happens if a transform changes some feature attributes, for example by cropping an image or converting a bounding box to a new format? In the first case the information is at least still encoded in the new shape of the image, but for a bounding box conversion, the information is gone. Thus, we need to somehow reflect these changes back to the features. To achieve that, we could either change the feature dictionary in-place, or we need to somehow return a changed version with each transform. In its current state the …
Thanks a lot for putting this together @pmeier !
I'm trying to get a better sense of the scope and the challenges here, so I mostly have questions at this point :)
I'll start here with a naive one: instead of introducing the new SampleTransform and FeatureTransform classes, could we "simply" (<-- gotta love this word) extend our current transforms from torchvision.transforms to handle not just tensors, but also X, where X is what the datapipe returns? This way users could basically do dataset.map(torchvision.transforms.Rotate(30.0)), which seems optimal in terms of UX.
(I'm not saying we should do this, but the answer to this question will help me better understand where we're going)
Also, some seemingly related discussion sparked in pytorch/vision#4029 (comment). If we start supporting new Tensor-like classes like class Image(Tensor) and class BoundingBox(Tensor), we might be able to avoid the introduction of the new Feature structure.
Also, some seemingly related discussion sparked in pytorch/vision#4029 (comment). If we start supporting new Tensor-like classes like class Image(Tensor) and class BoundingBox(Tensor), we might be able to avoid the introduction of the new Feature structure.
If we can do that, it would solve a lot of the questions you have and the issues I mentioned. PyTorch offers the __torch_function__ hook. Without going into details, it makes it possible to perform any PyTorch operation, for example transforming an image, and return our custom class. So something like
>>> image = torchvision.Image(torch.rand(3, 256, 256))
>>> transformed_image = image + 1
>>> type(transformed_image)
<class 'torchvision.Image'>
is possible without any hacks from our side. Doing something like this of course implies that we convert all numerics to tensors.
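As a minimal sketch (not the PR's actual implementation), a subclass only needs to hook into __torch_function__ and defer to the default implementation, which on recent PyTorch versions already takes care of re-wrapping tensor outputs in the subclass:

import torch

class Image(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        # Defer to the default implementation, which performs the actual
        # operation and re-wraps plain tensor outputs in this subclass.
        if kwargs is None:
            kwargs = {}
        return super().__torch_function__(func, types, args, kwargs)

image = torch.rand(3, 256, 256).as_subclass(Image)
transformed_image = image + 1
assert type(transformed_image) is Image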
The problem we had with this approach is that we were unable to come up with a strict enough rule set that would let us determine whether an arbitrary operation should return our custom tensor type. Take the example above: is a floating-point tensor with values outside of the range [0, 1] still an image? Just to name a second: if you subtract a random real number from a class label, is the result still a class label? You can come up with a number of these for all proposed custom tensor classes.
It should be fairly easy to control this for all builtin transformations, but we can't guarantee that in general.
Thus, we decided not to encode this extra information in the tensor itself. The new datapipes will return a dictionary as a sample. Encoding the information in the keys rather than the values is also possible, but this has two problems:
- Having non-string keys would make working with a sample dictionary a lot more cumbersome.
- We can't guarantee that a user transformation keeps the keys either, so this has the same problems as encoding the information in the values of the dictionary.
What is the rationale for separating the Feature type and the value?
For example, what is the benefit of this over
transform(Image(torch.rand(3, 256, 256))) and transform(BoundingBox((0, 0, 256, 256), type="XYHW"))?
Thus, we came up with the plan to encode the extra information in a custom support structure.
IMO in the end this all boils down to two possible scenarios:
- We provide a more rigid framework that is based on the assumption that the transform ensures that it only returns valid tensor types. This makes the UX better for users that only rely on builtin transforms. At the same time it will be harder to implement custom transformations.
- We base all of our stuff on very minimal assumptions at the cost of a worse UX.
I vote for option 1., but let's discuss this idea a little more.
This would also make something like
could we "simply" (<-- gotta love this word) extend our currents transforms from
torchvsion.transforms
to handle not just tensors, but also X where X is what the datapipe returns? This way users could basically dodataset.map(torchvision.transforms.Rotate(30.0))
, which seems optimal in terms of UX.
a lot simpler, because now each transform would know which feature transform to dispatch based on the input type.
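For illustration, that type-based dispatch could look roughly like this (a sketch using this RFC's hypothetical Transform / Image / BoundingBox names, not the actual implementation):

class Transform:
    def __call__(self, input):
        # Dispatch to the feature transform that matches the input type ...
        if isinstance(input, Image):
            return self.image(input)
        if isinstance(input, BoundingBox):
            return self.bounding_box(input)
        # ... and pass unknown feature types through unchanged.
        return input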
After some more thought, I think @NicolasHug's comments hit the nail on the head: my proposal was complicated and still wasn't able to solve the problems that he and I myself pointed out. Thus, I rewrote the proposal with the assumption that we can use custom tensor classes. I've added an implications section to discuss this. Overall I like the new proposal much better. Happy to hear your thoughts.
This is looking pretty good, thanks a lot @pmeier !
As we've discussed on VC, here are a few notes from our chat.
torchvision/features/_core.py (outdated)
__all__ = ["Feature"]

class Feature(torch.Tensor):
This is great!
One meta-question that the team needs to figure out is whether we will be supporting torchscript for these features or not. IIUC this might not propagate the inherited type to torchscripted functions.
I've split this into Feature and TensorFeature, since a Feature might not necessarily be a Tensor. For example, Text cannot be represented by a tensor, but could very well be regarded as a Feature from torchtext.
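A minimal sketch of the resulting hierarchy (the actual class bodies in the PR may differ):

import torch

class Feature:
    # Marker base class: a feature is not necessarily backed by a tensor.
    pass

class TensorFeature(Feature, torch.Tensor):
    # Base class for all features represented as tensors, e.g. Image
    # and BoundingBox.
    pass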
def format(self) -> BoundingBoxFormat:
    return self._format

def convert(self, format: Union[str, BoundingBoxFormat]) -> "BoundingBox":
Nice! One other thing to think about is whether there is a minimum common API across all our data types that we might want to enforce.
Currently, I don't see anything. Happy to hear ideas, though. If something comes up later, that should be retrofittable.
I've added a TensorFeature.from_tensor method that should make it easy to create a new feature from just a tensor.
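Assuming the TensorFeature sketch from above, such a method could be as simple as this (a hypothetical implementation; the actual signature in the PR may differ):

class TensorFeature(Feature, torch.Tensor):
    @classmethod
    def from_tensor(cls, tensor: torch.Tensor) -> "TensorFeature":
        # Reinterpret the tensor as this feature class without copying data.
        return tensor.as_subclass(cls)

image = Image.from_tensor(torch.rand(3, 256, 256))  # Image subclasses TensorFeature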
return Image(input.flip((-1,)))

@staticmethod
def bounding_box(input: BoundingBox) -> BoundingBox:
Agreed. This way, when possible, we share the same functional implementation for images, bboxes, and masks, and when not possible we implement a new one. Also, we leave it up to the user to make the call depending on the use case rather than assuming the BBox is in some part of the target.
See also this corner-case transform that requires a joint transform of image, label, and bbox:
https://github.com/pytorch/vision/blob/96f6e0a117d5c56f7e0237851dbb96144ebb110b/references/detection/transforms.py#L54-L129
Not sure I understand:
we share the same functional implementation for images and bboxes and when not possible we implement a new one

In most cases the implementation will be different for images, bounding boxes, and possibly other types. In any case, we will always have a separate method for each type. In case the transformation is the same, it is the developer's responsibility to implement it with minimal duplication:
class FooTransform(Transform):
    @staticmethod
    def _foo(input):
        return input

    @staticmethod
    def image(input: Image) -> Image:
        return FooTransform._foo(input)

    @staticmethod
    def bounding_box(input: BoundingBox) -> BoundingBox:
        return FooTransform._foo(input)
Also, we leave it up to the user to make the call depending on the use case rather than assuming the BBox is in some part of the target.
That is actually what we are trying to avoid. If you want to, you can call HorizontalFlip.image(...), but during normal operation you would do
transform = HorizontalFlip()
transformed_image = transform(image)
transformed_bbox = transform(bbox)
This is needed to be able to transform a complete sample drawn from a dataset, which might include more than just an image.
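For example, a complete sample could then be transformed by applying the same transform instance to every feature in it (a sketch; the sample keys are made up and the real datapipe integration may differ):

transform = HorizontalFlip()
sample = {"image": image, "bounding_box": bbox, "label": label}
# Each feature is dispatched to the matching feature transform; features
# the transform does not handle are passed through unchanged.
transformed_sample = {name: transform(feature) for name, feature in sample.items()}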
For a better overview of what we are trying to achieve, please read the accompanying document which is updated as this RFC progresses.
@pmeier I didn't disagree with you. I just said I like the approach :)
What I meant is that there are a few transforms that can be implemented for images and bboxes in the same way. That can live in the functional API and, as you said, we can call it for both image() and bounding_box(). But the proposed API also supports cases where this is not true by providing separate implementations for the two.
That is actually what we are trying to avoid.
Yes, I had a discussion with @fmassa who explained that you want to avoid this to make things composable. Sounds good. Just make sure you look at the example I sent you, because there are transforms where image, bbox, and labels must be processed together. I think the API can cover that, but it's worth having this use case in mind.
torchvision/transforms/_core.py (outdated)
# transform takes none or more than one positional argument
if len(argspec.args) != 1:
    continue
As we discussed in VC, I believe this can lead to subtle bugs. If the user writes their transform as
class Rotate(Transform):
    @staticmethod
    def image(img, angle):
        pass
instead of
class Rotate(Transform):
    @staticmethod
    def image(img, *, angle):
        pass
then the transformation will not be automatically registered, and no errors / warnings will show up; the transform will just silently not be applied because it wasn't properly auto-registered.
I think it might be better to be explicit here and instead use decorators to register a function to a transform. Something along the lines of
class Rotate(Transform):
    pass

@register_transform(transform=Rotate, input_type=Image)
def rotate_image(img, angle):
    pass
I think it might be better to be explicit here and instead use decorators to register a function to a transform.
- I dislike this approach, because we would have to manually namespace everything with the feature type. For example, I think it is more concise to have Rotate.image and Rotate.bounding_box rather than rotate_image and rotate_bounding_box. Note that in the latter case, the namespace would have roughly n times more entries, where n is the number of features we support.
- The auto-registering can be disabled and the registering can be performed manually in the constructor:

  class Rotate(Transform, auto_register=False):
      def __init__(self):
          super().__init__()
          self.register_feature_transform(Image, self.foo)

      @staticmethod
      def foo(input, angle):
          pass

- If we ultimately go with registering standalone functions with a decorator, I suggest we use a classmethod of the transform, i.e.

  class Rotate(Transform):
      pass

  @Rotate.register(Image)
  def rotate_image(input, angle):
      pass
As we discussed in VC, I believe this can lead to subtle bugs.
True, good catch! The rule can be made more robust, though. We invoke the feature transforms with feature_transform(input, **params), where params can be an empty dict. That means all of these signatures need to be supported:
def foo(input):
pass
def foo(input=None):
pass
def foo(input, bar): # your example
pass
def foo(input, bar=None):
pass
def foo(input=None, bar=None):
pass
def foo(input, *, bar):
pass
def foo(input, *, bar=None):
pass
def foo(input=None, *, bar):
pass
def foo(input=None, *, bar=None):
pass
For Python >=3.8 we also need to keep positional-only arguments in mind.
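For example, under my reading of the invocation rule above (these signatures are illustrations, not taken from the PR):

def foo(input, /, *, bar=None):  # compatible: input is passed positionally
    pass

def foo(input, bar, /):  # incompatible: bar can never be passed as a keyword
    pass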
For debugging, the user can use transform.is_supported (EDIT: as of writing this comment, this method only existed locally. It is included in the most recent commit) and pass a feature type or instance to see if it would be transformed or not. But I agree that one would probably only do that after discovering that something is wrong. Another idea would be to add a verbose flag that prints extra information during the auto-registration.
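With that, a debugging session could look something like this (hypothetical usage of the is_supported method mentioned above; Rotate and Image stand in for any transform and feature type):

transform = Rotate()
transform.is_supported(Image)  # check by feature type ...
transform.is_supported(image)  # ... or by feature instance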
Since all of this is a trade-off between convenience with possibly subtle bugs and explicitness with higher verbosity, I think we should wait for more opinions before we decide.
I've added a commit that makes the matching more robust and adds the option to print information about the process:
class TestTransform(Transform, verbose=True):
    @staticmethod
    def incompatible_signature1():
        pass

    @staticmethod
    def incompatible_signature2(*, input):
        pass

    # Python >=3.8 only
    # @staticmethod
    # def incompatible_signature3(input, foo, /):
    #     pass

    @staticmethod
    def _private(input):
        pass

    @staticmethod
    def unknown(input):
        pass

    @staticmethod
    def imaeg(input):
        pass

    @staticmethod
    def image(input, foo):
        pass

    @staticmethod
    def boundingbox(input):
        pass

    @staticmethod
    def bounding_box(input, *, foo):
        pass
TestTransform._private() was not registered as feature transform, because it is private.
TestTransform.bounding_box() was registered as feature transform for type 'BoundingBox'.
TestTransform.boundingbox() was not registered as feature transform, because its name doesn't match any known feature type. Did you mean to name it 'bounding_box' to be registered for type 'BoundingBox'?
TestTransform.imaeg() was not registered as feature transform, because its name doesn't match any known feature type. Did you mean to name it 'image' to be registered for type 'Image'?
TestTransform.image() was registered as feature transform for type 'Image'.
TestTransform.incompatible_signature1() was not registered as feature transform, because it cannot be invoked with incompatible_signature1(input, **params).
TestTransform.incompatible_signature2() was not registered as feature transform, because it cannot be invoked with incompatible_signature2(input, **params).
TestTransform.unknown() was not registered as feature transform, because its name doesn't match any known feature type.
    self.formats.XYWH: self._xyxy_to_xywh,
    self.formats.CXCYWH: self._xyxy_to_cxcywh,
}
For when we will be discussing specifics of the data classes, it might be good to add a __torch_function__ that forbids applying operations to two bounding boxes if they have different image_size, while allowing BoundingBox + Tensor to work.
It seems a big issue of the current prototype is that it cannot implement …
It may seem that nested compose is useless because it can be flattened, but that's not the case. Beyond …
I agree with the argument, but disagree with the conclusion. During "normal" operation the transform containers … Still, there might be scenarios where your conclusion is right. If we have two equal but separate transforms that should operate with the same parameters:

transform1 = MyTransform()
transform2 = MyTransform()
params = transform1.get_params()
transformed_image1 = transform1(image, params=params)
transformed_image2 = transform2(image, params=params)

I currently fail to see a use case for this, but maybe there is one.
@ppwwyyxx I've added minimal implementations for …
With the latest commits, the accompanying document is no longer in sync. I'll update it ASAP.
Thanks @pmeier for following up! I agree that you can implement containers by giving up the ability to do …
Insertion of this logic has a few use cases, e.g.:
This extra but very simple logic can be added in the above snippet without changing the input types the transforms will see. However, the above snippet is actually fundamentally difficult to realize because, as I commented earlier, doing …
@ppwwyyxx Thanks a lot for your input!
I agree, that is an important use case that should be supported. Would it be sufficient to have a functional inverse,

transform = MyComplexTransform()
sample = next(dataset)
transformed_sample = transform(sample)
transformed_prediction = model(sample["input"])
# magically materialize this here, more about that later
params = ...
prediction = transform.inverse(transformed_prediction, params=params)

or should we actually instantiate an inverse transform?
How do you decide if a transform has an inverse or not? For example, …
Although I generally dislike polymorphism, it can maybe make things simpler here. We could have a return_params flag:

transform = MyComplexTransform()
sample = next(dataset)
transformed_sample1, params = transform(sample, return_params=True)
transformed_sample2 = transform(sample, params=params)
assert transformed_sample1 == transformed_sample2

This would transform your snippet into

# transforms: a composite transform container (e.g. with Composite, RandomOrder, etc.)
for tfm in transforms:
    inputs, params = tfm(inputs, return_params=True)
    # HERE: do some extra work with inputs and avoid extracting the params
    # upfront, either with doubled computation or dummy inputs.

What do you think?
This is an RFC on how transforms will work with the datasets after the rework of the datasets API. Rendered version here