add prototype image folder dataset #4441
@fmassa Should we have a …
One problem I see is that we cannot decode until after the shuffling step. Given that in most cases the user will want to shuffle the dataset, I'm wondering if we even want to provide an option for decoding in the future API. If we don't, we need to figure out a good way to pass on the information about which keys of the sample need to be decoded from the dataset to the user.
That's a great start, thanks!
I have a few questions for my understanding, let me know what you think
```python
        fn_kwargs=dict(label=label, category=category),
    )
    category_dps.append(category_dp)
return Concater(*category_dps), categories
```
Doesn't this make shuffling much harder, because the elements will be sampled from the same category in order? Or does the shuffler have additional knowledge that knows how to handle those use-cases in a special way?
> Doesn't this make shuffling much harder, because the elements will be sampled from the same category in order?

Yes, but I have no idea how to improve this. Maybe we can read one sample from each directory before we start at the top again. Or we implement a `RandomPicker` which picks one element at random from multiple datapipes.
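The "read one sample from each directory before we start at the top again" idea is essentially round-robin interleaving over datapipes. A minimal stand-alone sketch with plain generators (not actual datapipes, names are illustrative):

```python
def round_robin(*iterables):
    """Yield one element from each iterable in turn, skipping exhausted ones."""
    iterators = [iter(it) for it in iterables]
    while iterators:
        # iterate over a copy so we can remove exhausted iterators in-place
        for it in list(iterators):
            try:
                yield next(it)
            except StopIteration:
                iterators.remove(it)
```

This interleaves categories deterministically, which still leaks ordering information; the `RandomPicker` variant below trades that determinism for pseudo-random draws.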
Can't we just have a single `FileLister` which is recursive? This way it makes the job of a `Shuffler` much easier, as `FileLister` could implement a specific property that allows for perfect shuffling for this family of pipes.
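To illustrate what a recursive file listing would yield, here is a plain-Python stand-in based on `os.walk` (illustrative only, not the torchdata `FileLister` implementation):

```python
import os

def list_files_recursive(root):
    """Yield all file paths under root, descending into subdirectories."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)
```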
@VitalyFedyunin @ejguan thoughts?
> Can't we just have a single `FileLister` which is recursive?
We can, yes. As far as I see, that would make sorting harder, since we now need to extract the label and category from each path instead of just setting them. I'll implement it that way and benchmark.
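Extracting label and category from each path, as mentioned, could look roughly like this (a hypothetical helper sketched for illustration, not code from this PR):

```python
import pathlib

def categorize(path, categories):
    """Derive (category, label) for a file from its parent directory name."""
    category = pathlib.Path(path).parent.name
    # assumes categories is the sorted list of top-level directory names
    return category, categories.index(category)
```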
If we use

```python
import random

from torch.utils.data import IterDataPipe

class RandomPicker(IterDataPipe):
    def __init__(self, *datapipes):
        super().__init__()
        self.datapipes = datapipes

    def __iter__(self):
        # draw from a random non-exhausted pipe until all are empty
        non_exhausted = [iter(dp) for dp in self.datapipes]
        while non_exhausted:
            dp = random.choice(non_exhausted)
            try:
                yield next(dp)
            except StopIteration:
                non_exhausted.remove(dp)
```
instead of `Concater`, we get a pseudo-shuffled dataset with minimal overhead. In fact it is a little faster than the combination of `Concater` and `Shuffler`.

I've called it pseudo-shuffled here, since in case the categories are imbalanced, the last samples in an epoch will most likely only come from the largest categories. `.shuffle()` with a small `buffer_size` suffers from a similar issue.
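The effect described above can be reproduced with a minimal buffered shuffle, a sketch of the general technique behind a `Shuffler` (not its actual implementation): an element can be delayed arbitrarily, but can move at most `buffer_size` positions earlier, so a small buffer only locally permutes the stream and the tail of an epoch stays biased toward whatever dominates the end of the input.

```python
import random

def buffered_shuffle(iterable, buffer_size):
    """Shuffle a stream using a fixed-size buffer of pending items."""
    buffer = []
    for item in iterable:
        buffer.append(item)
        if len(buffer) > buffer_size:
            # emit a random buffered item once the buffer overflows
            yield buffer.pop(random.randrange(len(buffer)))
    while buffer:
        yield buffer.pop(random.randrange(len(buffer)))
```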
Edit: I've added this in bf316de to showcase it.
Yes, good idea, can you create one?
I'm bringing up a question I raised previously concerning the name of this space. A few names were proposed over time.
@datumbox torchaudio follows …
Workflow now looks like:

```python
from torchvision.prototype import datasets

dataset, categories = datasets.from_image_folder(...)
decode = datasets.decoder.decode_sample(datasets.decoder.pil, "image")
dataset = dataset.shuffle().map(decode)
for sample in dataset:
    ...
```
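Judging from its usage above, `decode_sample` wraps a decoder so that it is applied to a single key of each sample dict. A hypothetical sketch (the actual prototype API may differ):

```python
def decode_sample(decoder, key):
    """Return a map-fn that decodes only sample[key], leaving other keys untouched."""
    def fn(sample):
        return {**sample, key: decoder(sample[key])}
    return fn
```

Applied via `.map()`, this keeps decoding lazy and per-sample, which is what allows it to run after shuffling.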
Hi! As we are likely going to have the shuffle topic come up again, I captured notes in a doc file. Feel free to comment/give feedback. https://docs.google.com/document/d/15RzQtCMl2FDtu9loZH5a-wxHDPnSpT0LRT6wMjR1p-U/edit
```python
categories = sorted({item.name for item in os.scandir(root) if item.is_dir()})
category_dps = []
category_dp: IterDataPipe
for label, category in enumerate(categories):
```
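The category discovery in the first line behaves like the following stand-alone helper. Note that `os.DirEntry.is_dir` must be called as a method; as a bare attribute it is always truthy, so without the call the filter would admit regular files as well:

```python
import os

def find_categories(root):
    """Sorted names of the immediate subdirectories of root."""
    return sorted(entry.name for entry in os.scandir(root) if entry.is_dir())
```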
This is an interesting approach to avoid scanning twice to obtain the list of categories.
Interesting as in good or bad interesting?
I've refactored the loading process to be able to shuffle directly after the files are enumerated. The workflow now looks like this:

```python
from torchvision.prototype import datasets

dataset = datasets.from_image_folder(root)
```

By default, the images will now be shuffled with an infinite buffer, i.e. a true random permutation. To turn this off one can pass `shuffler=None`.
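The `shuffler` parameter is a callable applied to the datapipe. Its behavior can be illustrated with a self-contained stand-in (a hypothetical helper, not torchvision code): passing `None` disables shuffling, while any callable wraps the pipe.

```python
from typing import Callable, Iterable, Optional

def apply_shuffler(dp: Iterable, shuffler: Optional[Callable[[Iterable], Iterable]]):
    """Mirror the factory's behavior: apply the shuffler if given, else pass through."""
    return dp if shuffler is None else shuffler(dp)
```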
```python
# pseudo-infinite buffer size until a true infinite buffer is supported
INFINITE = 1_000_000_000
```
@VitalyFedyunin @ejguan We need the ability to specify an infinite buffer, i.e. `buffer_size=None`, for every datapipe that implements a buffer.
Let's get this merged.
I have some comments that can be addressed in a follow-up PR.
```python
def from_data_folder(
    root: Union[str, pathlib.Path],
    *,
    shuffler: Optional[Callable[[IterDataPipe], IterDataPipe]] = lambda dp: Shuffler(dp, buffer_size=INFINITE),
```
Do we want to shuffle by default? Every basic instantiation would be different, is that what we would like to have?
I think not shuffling by default is ok. Although everyone will probably turn it on to do anything useful with the dataset, we now also require users to do so manually.
Summary:

* add prototype image folder dataset
* remove decoder datapipe
* [PROPOSAL] add RandomPicker
* refactor data loading
* fix mypy
* remove per-category datapipes
* fix mypy

Reviewed By: datumbox
Differential Revision: D31268037
fbshipit-source-id: 5fa15884668118c3aadd951741cc3345d31fbfd9
Co-authored-by: Francisco Massa <[email protected]>
This is the precursor of the new-style datasets.
cc @pmeier