Skip to content

add prototype image folder dataset #4441

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Sep 23, 2021
Merged

Conversation

pmeier
Copy link
Collaborator

@pmeier pmeier commented Sep 17, 2021

This is the precursor of the new style datasets. Example:

from torchvision.prototype import datasets

dataset, categories = datasets.from_image_folder(...)
dataset = dataset.shuffle().decode_sample(datasets.decoder.pil, "image")

for sample in dataset:
    ...

cc @pmeier

@pmeier
Copy link
Collaborator Author

pmeier commented Sep 17, 2021

@fmassa Should we have a prototype label for these PRs to easily tell them apart from other stuff for the release notes?

@pmeier
Copy link
Collaborator Author

pmeier commented Sep 17, 2021

One problem I see is that we cannot decode until after the .shuffle() call: it will read buffer_size samples (defaults to 10000) into memory before a single sample is returned to the user.

Given that in most of the cases the user will want to shuffle the dataset, I'm wondering if we even want to provide an option for decoding in the future API. If we don't we need to figure out a good way to pass on the information which keys of the sample need to be decoded from the dataset to the user.

Copy link
Member

@fmassa fmassa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a great start, thanks!

I have a few questions for my understanding, let me know what you think

fn_kwargs=dict(label=label, category=category),
)
category_dps.append(category_dp)
return Concater(*category_dps), categories
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this make shuffling much harder, because the elements will be sampled from the same category in order? Or does the shuffler have additional knowledge that knows how to handle those use-cases in a special way?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this make shuffling much harder, because the elements will be sampled from the same category in order?

Yes, but I have no idea how to improve this. Maybe we can read one sample from each directory before we start at the top again. Or we implement a RandomChoser which picks one element at random from multiple datapipes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we just have a single FileLister which is recursive? This way it makes the job of a Shuffler much easier, as FileLister could implement a specific property that allows for perfect shuffling for those family of pipes.

@VitalyFedyunin @ejguan thoughts?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we just have a single FileLister which is recursive?

We can, yes. As far as I see, that would make sorting harder, since we now need to extract the label and category from each path instead of just setting them. I'll implement it that way and benchmark.

Copy link
Collaborator Author

@pmeier pmeier Sep 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use

class RandomPicker(IterDataPipe):
    def __init__(self, *datapipes):
        self.datapipes = datapipes

    def __iter__(self):
        non_exhausted = [iter(dp) for dp in self.datapipes]
        while non_exhausted:
            dp = random.choice(non_exhausted)
            try:
                yield next(dp)
            except StopIteration:
                non_exhausted.remove(dp)

instead of Concater we get a pseudo-shuffled dataset with minimal overhead. In fact it is a little faster than the combination of Concater and Shuffler.

I've called it pseudo-shuffled here, since in case the categories are imbalanced, the last samples in an epoch will most likely only come from the largest categories. .shuffle() with a small buffer_size suffers from a similar issue.

Edit: I've added this in bf316de to showcase.

@fmassa
Copy link
Member

fmassa commented Sep 17, 2021

@fmassa Should we have a prototype label for these PRs to easily tell them apart from other stuff for the release notes?

Yes, good idea, can you create one?

@datumbox
Copy link
Contributor

I'm brining up a question I raised previously concerning the name of this space. A few were proposed over time, is prototype the selected term across all domain libs?

@fmassa
Copy link
Member

fmassa commented Sep 17, 2021

@datumbox torchaudio follows prototype as well, and torchtext had traditionally used experimental. IIRC we had an agreement to use prototype in a past discussion

@pmeier
Copy link
Collaborator Author

pmeier commented Sep 17, 2021

Workflow now looks like:

from torchvision.prototype import datasets

dataset, categories = datasets.from_image_folder(...)
decode = datasets.decoder.decode_sample(datasets.decoder.pil, "image")

dataset = dataset.shuffle().map(decode)

for sample in dataset:
    ...

@VitalyFedyunin
Copy link

Hi! As we likely going to have shuffle topic again, I captured notes in the doc file. Feel free to comment/feedback. https://docs.google.com/document/d/15RzQtCMl2FDtu9loZH5a-wxHDPnSpT0LRT6wMjR1p-U/edit

categories = sorted({item.name for item in os.scandir(root) if item.is_dir})
category_dps = []
category_dp: IterDataPipe
for label, category in enumerate(categories):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is interesting approach to avoid scanning twice to obtain list of categories.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting as in good or bad interesting?

@pmeier
Copy link
Collaborator Author

pmeier commented Sep 21, 2021

I've refactored the loading process to be able to shuffle directly after the files are enumerated. The workflow now looks like this:

from torchvision.prototype import datasets

dataset = datasets.from_image_folder(root)

By default, the images will now be shuffled with an infinite buffer, i.e. a true random permutation. To turn this off one can pass datasets.from_image_folder(..., shuffler=None).

datasets.from_image_folder(root) is now equivalent to the current datasets.ImageFolder(root, transform=transforms.ToTensor()). Testing this on Caltech256 data, leads to almost identical warm-up times and the datapipe implementation is even a little faster for completely consuming the dataset.

Comment on lines +17 to +18
# pseudo-infinite buffer size until a true infinite buffer is supported
INFINITE = 1_000_000_000
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@VitalyFedyunin @ejguan We need the ability to specify an infinite buffer, i.e. buffer_size=None for every datapipe that implements a buffer.

Copy link
Member

@fmassa fmassa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's get this merged.

I have some comments that can be addressed in a follow-up PR

def from_data_folder(
root: Union[str, pathlib.Path],
*,
shuffler: Optional[Callable[[IterDataPipe], IterDataPipe]] = lambda dp: Shuffler(dp, buffer_size=INFINITE),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to shuffle by default? Every basic instantiation would be different, is that what we would like to have?

Copy link
Collaborator Author

@pmeier pmeier Sep 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think not shuffling by default is ok. Although everyone will probably turn it on to do anything useful with the dataset, we now also require users to do so manually.

@fmassa fmassa merged commit 021df7a into pytorch:main Sep 23, 2021
@pmeier pmeier deleted the prototype-datasets branch September 23, 2021 15:29
facebook-github-bot pushed a commit that referenced this pull request Sep 30, 2021
Summary:
* add prototype image folder dataset

* remove decoder datapipe

* [PROPOSAL] add RandomPicker

* refactor data loading

* fix mypy

* remove per-category datapipes

* fix mypy

Reviewed By: datumbox

Differential Revision: D31268037

fbshipit-source-id: 5fa15884668118c3aadd951741cc3345d31fbfd9

Co-authored-by: Francisco Massa <[email protected]>
husthyc added a commit that referenced this pull request Oct 22, 2021
* add prototype image folder dataset

* remove decoder datapipe

* [PROPOSAL] add RandomPicker

* refactor data loading

* fix mypy

* remove per-category datapipes

* fix mypy

Co-authored-by: Francisco Massa <[email protected]>

[ghstack-poisoned]
cyyever pushed a commit to cyyever/vision that referenced this pull request Nov 16, 2021
* add prototype image folder dataset

* remove decoder datapipe

* [PROPOSAL] add RandomPicker

* refactor data loading

* fix mypy

* remove per-category datapipes

* fix mypy

Co-authored-by: Francisco Massa <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants