
add prototype dataset for CelebA #4514


Merged: pmeier merged 11 commits into pytorch:main from datasets/celeba on Oct 6, 2021

Conversation

@pmeier (Collaborator) commented Sep 30, 2021

@pmeier (Collaborator, author) commented Oct 5, 2021

mypy will be happy after either #4513 or #4531 is merged.

pmeier requested a review from fmassa on October 5, 2021 at 07:04
pmeier requested a review from ejguan on October 5, 2021 at 14:49
@ejguan (Contributor) left a comment


LGTM. I just want to discuss whether we can eliminate CelebACSVParser and use CSVDictParser.

Comment on lines +44 to +49
# Empty field names are filtered out, because some files have an extra white space after the header
# line, which is recognized as an extra column
fieldnames = [name for name in next(csv.reader([next(file)], **self._fmtparams)) if name]
# Some files do not include a label for the image ID column
if fieldnames[0] != "image_id":
    fieldnames.insert(0, "image_id")
Contributor

It is really annoying that we can't use DictReader.
Since we have three datapipes with headers, could we hard-code the fieldnames for each one and use CSVDictReader(dp, skip_lines=2, fieldnames=[...])?
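
For reference, a minimal sketch of that idea using the stdlib csv.DictReader directly; the attribute names are truncated, the space-delimited format parameters are assumptions, and the actual datapipe wrapper and its arguments would differ.

import csv

# Hypothetical, hard-coded header for one annotation file (truncated for illustration).
ATTR_FIELDNAMES = ["image_id", "5_o_Clock_Shadow", "Arched_Eyebrows", "Attractive"]

def parse_attributes(lines):
    lines = iter(lines)
    # Skip the two header lines (sample count and the original, partially broken header row).
    next(lines)
    next(lines)
    # With explicit fieldnames there is no header row left to sanitize.
    reader = csv.DictReader(lines, fieldnames=ATTR_FIELDNAMES, delimiter=" ", skipinitialspace=True)
    yield from reader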

Collaborator Author

We could, but there is more to it. Note that we also need to map the output so that it is a tuple with the image id first and the remaining row second. At that point, bending everything to fit the default building blocks would probably give us a more elaborate implementation than just writing our own.

I agree it is annoying, but this custom parser is not hard to write or to understand, so I feel it's warranted.
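
For illustration only: the extra remapping on top of a dict-based parser would look roughly like the function below; the name split_image_id is not from the PR.

from typing import Any, Dict, Tuple

def split_image_id(row: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    # Pull the image id out of the parsed row and return it alongside the remaining columns.
    data = dict(row)
    image_id = data.pop("image_id")
    return image_id, data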

Comment on lines 142 to 144
for partial_anns_dp in partial_anns_dps:
    anns_dp = KeyZipper(anns_dp, partial_anns_dp, lambda data: data[0], buffer_size=INFINITE_BUFFER_SIZE)
    anns_dp = Mapper(anns_dp, self._collate_partial_anns)
Contributor

I will open an issue in the data repo about KeyZipper; stacking KeyZippers like this is an anti-pattern.

Collaborator Author

Let me check whether the files are perfectly aligned. If they are, we could use a Zipper instead, which can handle more than two datapipes at once.

Collaborator Author

They are perfectly aligned 🎉 The latest commits only need two KeyZippers to zip the splits, images, and annotations. We can't get around them at the moment.
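
For illustration, the two joining strategies in plain Python terms; both functions are rough sketches, not the datapipe API. The aligned annotation files can be combined positionally, while splits, images, and annotations still need the key-based join.

# Key-based join, conceptually what KeyZipper does: match elements by a key such as the image id.
# This sketch buffers the whole secondary stream for simplicity; KeyZipper uses a buffer of configurable size.
def key_join(primary, secondary, key=lambda data: data[0]):
    lookup = {key(item): item for item in secondary}
    for item in primary:
        yield item, lookup[key(item)]

# Positional zip, possible once the streams are known to be perfectly aligned: no keys or buffering needed.
def positional_zip(*streams):
    yield from zip(*streams)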

pmeier merged commit 8cc6d52 into pytorch:main on Oct 6, 2021
pmeier deleted the datasets/celeba branch on October 6, 2021 at 08:10

def __iter__(self):
    for _, file in self.datapipe:
        file = (line.decode() for line in file)
Contributor

It's fine since the whole file can fit into memory. But to optimize it a little, could we change it to a streaming style by adding a decode method that yields each decoded line?

Collaborator Author

Yeah, that would be a lot better. I'll send a patch.

Collaborator Author

After revisiting this, I think it is already doing what you proposed. Writing

file = (line.decode() for line in file)

is functionally equivalent to

def decode(file):
    for line in file:
        yield line.decode()

file = decode(file)

Contributor

You are right!

facebook-github-bot pushed a commit that referenced this pull request Oct 8, 2021
Summary:
* add prototype dataset for CelebA

* fix code format

* fix mypy

* hardcode fmtparams

* fix mypy

* replace KeyZipper with Zipper for annotations

Reviewed By: NicolasHug

Differential Revision: D31505573

fbshipit-source-id: bc2a66dffd410d51bc5b240bd32b344000248f00
mszhanyi pushed a commit to mszhanyi/vision that referenced this pull request Oct 19, 2021
cyyever pushed a commit to cyyever/vision that referenced this pull request Nov 16, 2021