[PoC] separate decoding from datasets #5105

pmeier · 2021-12-16T08:27:28Z

With this patch, samples are undecoded by default but can easily be decoded:

from torchvision.prototype import datasets

encoded_sample = next(iter(datasets.load("caltech101")))
for key, value in sorted(encoded_sample.items()):
    print(f"{key}: {type(value)}")

print("#" * 80)

decoded_sample = datasets.utils.decode_sample(encoded_sample)
for key, value in sorted(decoded_sample.items()):
    print(f"{key}: {type(value)}")

ann: <class 'torchvision.prototype.datasets.utils._decoder.DecodeableStreamWrapper'>
ann_path: <class 'str'>
image: <class 'torchvision.prototype.datasets.utils._decoder.DecodeableImageStreamWrapper'>
image_path: <class 'str'>
label: <class 'torchvision.prototype.features._label.Label'>
################################################################################
ann_path: <class 'str'>
bounding_box: <class 'torchvision.prototype.features._bounding_box.BoundingBox'>
contour: <class 'torchvision.prototype.features._feature.Feature'>
image: <class 'torchvision.prototype.features._image.Image'>
image_path: <class 'str'>
label: <class 'torchvision.prototype.features._label.Label'>
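The listing above suggests that decoding replaces each stream wrapper with its decoded value, and that the `ann` entry expands into several keys (`bounding_box`, `contour`). Below is a minimal, dependency-free sketch of that behavior. The `Decodable` class, the toy decoders, and this version of `decode_sample` are hypothetical stand-ins written for illustration; they are not the code from this patch.

```python
from typing import Any, Callable, Dict


class Decodable:
    """Hypothetical stand-in for the stream wrappers: raw payload plus a decoder."""

    def __init__(self, payload: bytes, decoder: Callable[[bytes], Any]) -> None:
        self.payload = payload
        self.decoder = decoder

    def decode(self) -> Any:
        return self.decoder(self.payload)


def decode_sample(sample: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new sample in which every Decodable value is decoded.

    Assumption: a decoder that returns a dict replaces its key with the
    dict's entries, mirroring how `ann` becomes `bounding_box` and `contour`.
    """
    decoded: Dict[str, Any] = {}
    for key, value in sample.items():
        if not isinstance(value, Decodable):
            decoded[key] = value  # plain values pass through untouched
            continue
        result = value.decode()
        if isinstance(result, dict):
            decoded.update(result)
        else:
            decoded[key] = result
    return decoded


def fake_ann_decoder(payload: bytes) -> Dict[str, Any]:
    # Toy annotation format: four comma-separated box coordinates.
    return {"bounding_box": [int(v) for v in payload.decode().split(",")]}


encoded_sample = {
    "ann": Decodable(b"1,2,3,4", fake_ann_decoder),
    "image": Decodable(b"\x00\x01", list),  # toy "decoding": bytes -> list of ints
    "image_path": "image_0001.jpg",
}
decoded_sample = decode_sample(encoded_sample)
for key, value in sorted(decoded_sample.items()):
    print(f"{key}: {type(value).__name__}")
```

Returning a new dict instead of mutating the input keeps the encoded sample reusable, which matters if the same sample is decoded more than once.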

Of course, decode_sample can also be applied through .map:

from torchvision.prototype import datasets

dataset = datasets.load(...).map(datasets.utils.decode_sample)

For even more convenience, this also adds a SampleDecoder datapipe that is a thin wrapper around Mapper applying decode_sample. Although I generally favor the class interface, I think this is a case where the functional interface comes in handy, because most users will want to use the default decoders:

from torchvision.prototype import datasets

dataset = datasets.load(...).decode_samples()
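To illustrate how the class form and the functional form relate, here is a simplified sketch in plain Python. `Pipe`, `Mapper`, and `SampleDecoder` below are hypothetical miniatures, not the actual datapipe classes, and the functional name is simply a method here rather than being registered through the datapipe machinery. The placeholder `decode_sample` only turns `bytes` into `str`.

```python
from typing import Any, Callable, Dict, Iterable, Iterator

Sample = Dict[str, Any]


def decode_sample(sample: Sample) -> Sample:
    # Placeholder decoder (assumption): turn bytes values into str.
    return {k: v.decode() if isinstance(v, bytes) else v for k, v in sample.items()}


class Pipe:
    """Hypothetical, simplified datapipe: an iterable with chainable methods."""

    def __init__(self, source: Iterable[Sample]) -> None:
        self.source = source

    def __iter__(self) -> Iterator[Sample]:
        return iter(self.source)

    def map(self, fn: Callable[[Sample], Sample]) -> "Pipe":
        return Mapper(self, fn)

    def decode_samples(self) -> "Pipe":
        # Functional form: shorthand for wrapping this pipe in a SampleDecoder.
        return SampleDecoder(self)


class Mapper(Pipe):
    """Applies a function lazily to every sample of the source."""

    def __init__(self, source: Iterable[Sample], fn: Callable[[Sample], Sample]) -> None:
        super().__init__(source)
        self.fn = fn

    def __iter__(self) -> Iterator[Sample]:
        for sample in self.source:
            yield self.fn(sample)


class SampleDecoder(Mapper):
    """Thin wrapper: a Mapper with decode_sample fixed as the function."""

    def __init__(self, source: Iterable[Sample]) -> None:
        super().__init__(source, decode_sample)


raw = [{"label": b"cat", "idx": 0}, {"label": b"dog", "idx": 1}]
via_map = list(Pipe(raw).map(decode_sample))
via_method = list(Pipe(raw).decode_samples())
print(via_map == via_method)
```

Both spellings build the same pipeline; the method form just saves the user from importing and passing the default decoding function.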

@ reviewers: Don't worry about the large diff. I already touched all datasets to see if I missed an edge case in my proposal. That was not the case, so it is sufficient to have a look at torchvision/prototype/datasets/utils/_decoder.py and one implementation, for example torchvision/prototype/datasets/_builtin/caltech.py. I did not fix the tests yet, so they are expected to fail. I'll only do that if you are otherwise happy with the proposal.

cc @pmeier @bjuncek

facebook-github-bot · 2021-12-16T08:27:34Z

💊 CI failures summary and remediations

As of commit 1406bd3 (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚



pmeier · 2021-12-22T15:41:42Z

After some more discussion, we want to move away from supplying custom file handles to the user and will instead return the encoded data as uint8 tensors. See #5075 (comment) for details.

With the current implementation, loading data from a dataset now looks like this:

from torchvision.prototype import datasets

for sample in datasets.load("caltech101").map(datasets.utils.decode_images):
    ...

decode_images is only a thin wrapper around the workhorse decode_sample that sets a default decoder for images. There will be a decode_videos in the future, but it will probably need additional parameters compared to decode_images, so we cannot unify the two.
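The "thin wrapper" relationship can be sketched with `functools.partial`. Everything below is a hypothetical illustration: `EncodedImage` stands in for an encoded uint8 tensor, the `image_decoder` keyword is an assumed parameter name, and `toy_image_decoder` is a placeholder for a real image decoder.

```python
from functools import partial
from typing import Any, Callable, Dict, Optional

Decoder = Callable[[bytes], Any]


class EncodedImage(bytes):
    """Hypothetical marker type for encoded image data (stand-in for a uint8 tensor)."""


def decode_sample(
    sample: Dict[str, Any], *, image_decoder: Optional[Decoder] = None
) -> Dict[str, Any]:
    """Decode the entries for which a decoder is given; leave the rest encoded."""
    decoded: Dict[str, Any] = {}
    for key, value in sample.items():
        if isinstance(value, EncodedImage) and image_decoder is not None:
            decoded[key] = image_decoder(value)
        else:
            decoded[key] = value
    return decoded


def toy_image_decoder(data: bytes) -> list:
    # Placeholder for a real image decoder: treats each byte as one pixel value.
    return list(data)


# The wrapper only pins the default image decoder; all logic stays in decode_sample.
decode_images = partial(decode_sample, image_decoder=toy_image_decoder)

sample = {"image": EncodedImage(b"\x00\x7f\xff"), "label": 3}
print(decode_images(sample))
```

Without a decoder, `decode_sample(sample)` in this sketch returns the sample with the image still encoded, so decoding stays opt-in.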

In the future, we can also provide a transform that handles the decoding, so it can simply be used as the first transform in a Compose:

from torchvision.prototype import datasets, transforms

transform = transforms.Compose(
    transforms.DecodeImages(),
    transforms.Resize(...),
    ...
)

for sample in datasets.load(...).map(transform):
    ...

pmeier · 2022-01-26T17:01:48Z

Superseded by #5287

[PoC] separate decoding from datasets

3a1d886

pmeier added module: datasets prototype labels Dec 16, 2021

pmeier requested review from fmassa and ejguan December 16, 2021 08:27

pytorch-probot bot added the ciflow/default label Dec 16, 2021

facebook-github-bot added the cla signed label Dec 16, 2021

cleanup

8fca3a4

ejguan reviewed Dec 16, 2021


pmeier added 7 commits December 21, 2021 14:31

refactor to use tensors as base for undecoded data

d4507a8

Merge branch 'main' into datasets/decoding-poc

61f377d

cleanup

26a25a5

fix celeba

74f6a09

fix tests

375fefb

add todo

667ea7e

fix api tests

1406bd3

pmeier mentioned this pull request Dec 22, 2021

[RFC] How should datasets handle decoding of files? #5075

Open

pmeier mentioned this pull request Jan 26, 2022

remove decoding from prototype datasets #5287

Merged

pmeier closed this Jan 26, 2022

pmeier deleted the datasets/decoding-poc branch January 26, 2022 17:01
