add prototype dataset for CelebA #4514
Conversation
LGTM. Just want to discuss whether we can eliminate CelebACSVParser and use CSVDictParser instead.
# Empty field names are filtered out, because some files have an extra white space after the header
# line, which is recognized as an extra column
fieldnames = [name for name in next(csv.reader([next(file)], **self._fmtparams)) if name]
# Some files do not include a label for the image ID column
if fieldnames[0] != "image_id":
    fieldnames.insert(0, "image_id")
It is super annoying that we can't use DictReader. Since we have three datapipes with headers, could we hard-code the fieldnames for each one and use CSVDictReader(dp, skip_lines=2, fieldnames=[...])?
We could, but there is more to it. Note that we also need to map the output so that it is a tuple with the image ID first and the remaining row second. At that point we would probably end up with a more elaborate implementation bending everything to the default building blocks than if we just wrote our own. I agree it is annoying, but writing and understanding this custom parser is not hard, so I feel it's warranted.
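For reference, here is a minimal standalone sketch of what the custom parsing boils down to, using only the stdlib csv module. The helper name, the fmtparams, and the assumption that the first line only holds a row count (as the skip_lines=2 suggestion above implies) are illustrative, not the actual implementation:

import csv

def parse_celeba_csv(file, fmtparams=None):
    # Hypothetical, simplified version of the parsing discussed above.
    fmtparams = fmtparams or {"delimiter": " ", "skipinitialspace": True}
    lines = (line.decode() if isinstance(line, bytes) else line for line in file)
    next(lines)  # assumed: the first line only holds the number of rows
    # Drop empty field names caused by trailing white space in the header line.
    fieldnames = [name for name in next(csv.reader([next(lines)], **fmtparams)) if name]
    # Some files omit a header for the image ID column, so add it back.
    if fieldnames[0] != "image_id":
        fieldnames.insert(0, "image_id")
    for row in csv.DictReader(lines, fieldnames=fieldnames, **fmtparams):
        # Yield the image ID separately from the remaining annotation fields.
        yield row.pop("image_id"), row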
for partial_anns_dp in partial_anns_dps:
    anns_dp = KeyZipper(anns_dp, partial_anns_dp, lambda data: data[0], buffer_size=INFINITE_BUFFER_SIZE)
    anns_dp = Mapper(anns_dp, self._collate_partial_anns)
I will add an issue to the data repo about KeyZipper. Stacking KeyZippers like this is a super anti-pattern.
Let me see if the files are perfectly aligned. If so, we could use a Zipper instead, which can handle more than two datapipes at once.
They are perfectly aligned 🎉 The latest commits only need two KeyZippers to zip the splits, images, and annotations. We can't get around them at the moment.
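For illustration, a plain-Python sketch of the difference (the real KeyZipper / Zipper are torchdata datapipes with different signatures; every name and value below is made up):

def key_zip(left, right, left_key, right_key):
    # Join two streams on a key, buffering the entire right-hand stream.
    buffered = {right_key(item): item for item in right}
    for item in left:
        yield item, buffered[left_key(item)]

# Three annotation streams keyed by image ID (toy data).
attrs = [("0001.jpg", {"Smiling": 1}), ("0002.jpg", {"Smiling": -1})]
bboxes = [("0001.jpg", (95, 71, 226, 313)), ("0002.jpg", (72, 94, 221, 306))]
landmarks = [("0001.jpg", (165, 184)), ("0002.jpg", (140, 204))]

# Stacked key-based joins nest the output and need a buffer per join:
# each item looks like ((attr, bbox), landmark).
stacked = key_zip(
    key_zip(attrs, bboxes, lambda d: d[0], lambda d: d[0]),
    landmarks,
    lambda d: d[0][0],
    lambda d: d[0],
)

# With perfectly aligned files, positional zipping handles any number of
# streams at once and keeps the output flat: (attr, bbox, landmark).
flat = zip(attrs, bboxes, landmarks)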
def __iter__(self):
    for _, file in self.datapipe:
        file = (line.decode() for line in file)
It's fine, since the whole file can fit into memory. But to optimize it a little, could we change it to a streaming style by adding a decode method that yields each decoded line?
Yeah, that would be a lot better. I'll send a patch.
After revisiting this, I think it is already doing what you proposed. Writing

file = (line.decode() for line in file)

is functionally equivalent to

def decode(file):
    for line in file:
        yield line.decode()

file = decode(file)
You are right!
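For completeness, a toy check (hypothetical byte stream, not the actual datapipe) that the generator expression really does decode lazily:

def noisy_bytes():
    # Stand-in for a binary file; prints whenever a line is actually read.
    for raw in (b"a\n", b"b\n"):
        print("reading", raw)
        yield raw

decoded = (line.decode() for line in noisy_bytes())  # nothing is read yet
first = next(decoded)  # now prints: reading b'a\n'
assert first == "a\n"  # only the first line was read and decoded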
Summary: * add prototype dataset for CelebA * fix code format * fix mypy * hardcode fmtparams * fix mypy * replace KeyZipper with Zipper for annotations Reviewed By: NicolasHug Differential Revision: D31505573 fbshipit-source-id: bc2a66dffd410d51bc5b240bd32b344000248f00
cc @pmeier @mthrok @bjuncek