add prototype image folder dataset #4441

pmeier · 2021-09-17T13:56:28Z

This is the precursor of the new style datasets. Example:

from torchvision.prototype import datasets

dataset, categories = datasets.from_image_folder(...)
dataset = dataset.shuffle().decode_sample(datasets.decoder.pil, "image")

for sample in dataset:
    ...

cc @pmeier

pmeier · 2021-09-17T13:58:36Z

@fmassa Should we have a prototype label for these PRs to easily tell them apart from other stuff for the release notes?

pmeier · 2021-09-17T14:05:38Z

One problem I see is that we cannot decode until after the .shuffle() call: it will read buffer_size samples (defaults to 10000) into memory before a single sample is returned to the user.

Given that in most of the cases the user will want to shuffle the dataset, I'm wondering if we even want to provide an option for decoding in the future API. If we don't we need to figure out a good way to pass on the information which keys of the sample need to be decoded from the dataset to the user.

fmassa

That's a great start, thanks!

I have a few questions for my understanding, let me know what you think

torchvision/prototype/datasets/decoder.py

fmassa · 2021-09-17T14:07:58Z

torchvision/prototype/datasets/_folder.py

+            fn_kwargs=dict(label=label, category=category),
+        )
+        category_dps.append(category_dp)
+    return Concater(*category_dps), categories


Doesn't this make shuffling much harder, because the elements will be sampled from the same category in order? Or does the shuffler have additional knowledge that knows how to handle those use-cases in a special way?

Doesn't this make shuffling much harder, because the elements will be sampled from the same category in order?

Yes, but I have no idea how to improve this. Maybe we can read one sample from each directory before we start at the top again. Or we implement a RandomChoser which picks one element at random from multiple datapipes.

Can't we just have a single FileLister which is recursive? This way it makes the job of a Shuffler much easier, as FileLister could implement a specific property that allows for perfect shuffling for those family of pipes.

@VitalyFedyunin @ejguan thoughts?

Can't we just have a single FileLister which is recursive?

We can, yes. As far as I see, that would make sorting harder, since we now need to extract the label and category from each path instead of just setting them. I'll implement it that way and benchmark.

If we use

class RandomPicker(IterDataPipe): def __init__(self, *datapipes): self.datapipes = datapipes def __iter__(self): non_exhausted = [iter(dp) for dp in self.datapipes] while non_exhausted: dp = random.choice(non_exhausted) try: yield next(dp) except StopIteration: non_exhausted.remove(dp)

instead of Concater we get a pseudo-shuffled dataset with minimal overhead. In fact it is a little faster than the combination of Concater and Shuffler.

I've called it pseudo-shuffled here, since in case the categories are imbalanced, the last samples in an epoch will most likely only come from the largest categories. .shuffle() with a small buffer_size suffers from a similar issue.

Edit: I've added this in bf316de to showcase.

torchvision/prototype/datasets/_folder.py

fmassa · 2021-09-17T14:11:30Z

@fmassa Should we have a prototype label for these PRs to easily tell them apart from other stuff for the release notes?

Yes, good idea, can you create one?

datumbox · 2021-09-17T14:18:24Z

I'm brining up a question I raised previously concerning the name of this space. A few were proposed over time, is prototype the selected term across all domain libs?

fmassa · 2021-09-17T14:19:36Z

@datumbox torchaudio follows prototype as well, and torchtext had traditionally used experimental. IIRC we had an agreement to use prototype in a past discussion

pmeier · 2021-09-17T14:39:51Z

Workflow now looks like:

from torchvision.prototype import datasets

dataset, categories = datasets.from_image_folder(...)
decode = datasets.decoder.decode_sample(datasets.decoder.pil, "image")

dataset = dataset.shuffle().map(decode)

for sample in dataset:
    ...

VitalyFedyunin · 2021-09-20T23:28:15Z

Hi! As we likely going to have shuffle topic again, I captured notes in the doc file. Feel free to comment/feedback. https://docs.google.com/document/d/15RzQtCMl2FDtu9loZH5a-wxHDPnSpT0LRT6wMjR1p-U/edit

torchvision/prototype/datasets/datapipes.py

VitalyFedyunin · 2021-09-20T23:34:30Z

torchvision/prototype/datasets/_folder.py

+    categories = sorted({item.name for item in os.scandir(root) if item.is_dir})
+    category_dps = []
+    category_dp: IterDataPipe
+    for label, category in enumerate(categories):


This is interesting approach to avoid scanning twice to obtain list of categories.

Interesting as in good or bad interesting?

pmeier · 2021-09-21T13:06:16Z

I've refactored the loading process to be able to shuffle directly after the files are enumerated. The workflow now looks like this:

from torchvision.prototype import datasets

dataset = datasets.from_image_folder(root)

By default, the images will now be shuffled with an infinite buffer, i.e. a true random permutation. To turn this off one can pass datasets.from_image_folder(..., shuffler=None).

datasets.from_image_folder(root) is now equivalent to the current datasets.ImageFolder(root, transform=transforms.ToTensor()). Testing this on Caltech256 data, leads to almost identical warm-up times and the datapipe implementation is even a little faster for completely consuming the dataset.

pmeier · 2021-09-21T14:12:56Z

torchvision/prototype/datasets/_folder.py

+# pseudo-infinite buffer size until a true infinite buffer is supported
+INFINITE = 1_000_000_000


@VitalyFedyunin @ejguan We need the ability to specify an infinite buffer, i.e. buffer_size=None for every datapipe that implements a buffer.

fmassa

Let's get this merged.

I have some comments that can be addressed in a follow-up PR

fmassa · 2021-09-23T13:50:45Z

torchvision/prototype/datasets/_folder.py

+def from_data_folder(
+    root: Union[str, pathlib.Path],
+    *,
+    shuffler: Optional[Callable[[IterDataPipe], IterDataPipe]] = lambda dp: Shuffler(dp, buffer_size=INFINITE),


Do we want to shuffle by default? Every basic instantiation would be different, is that what we would like to have?

I think not shuffling by default is ok. Although everyone will probably turn it on to do anything useful with the dataset, we now also require users to do so manually.

torchvision/prototype/datasets/decoder.py

Summary: * add prototype image folder dataset * remove decoder datapipe * [PROPOSAL] add RandomPicker * refactor data loading * fix mypy * remove per-category datapipes * fix mypy Reviewed By: datumbox Differential Revision: D31268037 fbshipit-source-id: 5fa15884668118c3aadd951741cc3345d31fbfd9 Co-authored-by: Francisco Massa <[email protected]>

* add prototype image folder dataset * remove decoder datapipe * [PROPOSAL] add RandomPicker * refactor data loading * fix mypy * remove per-category datapipes * fix mypy Co-authored-by: Francisco Massa <[email protected]> [ghstack-poisoned]

* add prototype image folder dataset * remove decoder datapipe * [PROPOSAL] add RandomPicker * refactor data loading * fix mypy * remove per-category datapipes * fix mypy Co-authored-by: Francisco Massa <[email protected]>

add prototype image folder dataset

88c4383

pmeier added the module: datasets label Sep 17, 2021

pmeier requested a review from fmassa September 17, 2021 13:56

facebook-github-bot added the cla signed label Sep 17, 2021

fmassa reviewed Sep 17, 2021

View reviewed changes

pmeier added the prototype label Sep 17, 2021

remove decoder datapipe

953de6b

[PROPOSAL] add RandomPicker

bf316de

VitalyFedyunin reviewed Sep 20, 2021

View reviewed changes

torchvision/prototype/datasets/datapipes.py Outdated Show resolved Hide resolved

VitalyFedyunin reviewed Sep 20, 2021

View reviewed changes

refactor data loading

ac856c5

fix mypy

fadaf39

pmeier commented Sep 21, 2021

View reviewed changes

pmeier added 4 commits September 21, 2021 16:56

remove per-category datapipes

b7c86c2

Merge branch 'main' into prototype-datasets

1f19835

Merge branch 'main' into prototype-datasets

bb8ad00

fix mypy

7f213dc

fmassa approved these changes Sep 23, 2021

View reviewed changes

Merge branch 'main' into prototype-datasets

69d509e

fmassa merged commit 021df7a into pytorch:main Sep 23, 2021

pmeier deleted the prototype-datasets branch September 23, 2021 15:29

pmeier mentioned this pull request Sep 23, 2021

cleanup prototype datasets #4471

Merged

		# pseudo-infinite buffer size until a true infinite buffer is supported
		INFINITE = 1_000_000_000

add prototype image folder dataset #4441

add prototype image folder dataset #4441

Uh oh!

Conversation

pmeier commented Sep 17, 2021 • edited by pytorch-probot bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pmeier commented Sep 17, 2021

Uh oh!

pmeier commented Sep 17, 2021

Uh oh!

fmassa left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pmeier Sep 17, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

fmassa commented Sep 17, 2021

Uh oh!

datumbox commented Sep 17, 2021

Uh oh!

fmassa commented Sep 17, 2021

Uh oh!

pmeier commented Sep 17, 2021

Uh oh!

VitalyFedyunin commented Sep 20, 2021

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pmeier commented Sep 21, 2021

Uh oh!

Choose a reason for hiding this comment

Uh oh!

fmassa left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pmeier Sep 23, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

pmeier commented Sep 17, 2021 •

edited by pytorch-probot bot

Loading

pmeier Sep 17, 2021 •

edited

Loading

pmeier Sep 23, 2021 •

edited

Loading