-
Notifications
You must be signed in to change notification settings - Fork 7.1k
Add GTSRB dataset to prototypes #5214
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
💊 CI failures summary and remediationsAs of commit 8283332 (more details on the Dr. CI page):
🕵️ 1 new failure recognized by patternsThe following CI failures do not appear to be due to upstream breakages:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot @NicolasHug! I've only reviewed the prototype part, let me know if I also need to look at the other changes.
@@ -877,7 +877,7 @@ def _make_archive(root, name, *files_or_dirs, opener, adder, remove=True): | |||
files, dirs = _split_files_or_dirs(root, *files_or_dirs) | |||
|
|||
with opener(archive) as fh: | |||
for file in files: | |||
for file in sorted(files): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pmeier LMK what you think of this.
Down below I set buffer_size=1
in the Zipper, because in the original .zip files, both the image_dp
and the gt_dp
are fully aligned: they both contain images 00001, 00002, etc. in this order. So I'm assuming that buffer_size=1
is better than buffer_size=UNLIMITED
?
Without this call to sorted()
, the tests would fail: the .zip archive created by make_zip
would contain the files in a shuffled order (because files
is a set), and so image_dp
and the gt_dp
would not be aligned anymore, leading to a failure to match keys in the Zipper. (Note: this is only a problem in the tests; the code works fine otherwise on my custom script iterating over the dataset).
I hope this won't make other tests fails. This might not be a problem that we have right now, but perhaps something to keep in mind for the future: we might need the test archives to exactly match the order of the "original" archives.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I didn't think of that. Do you know if the order of the returned paths of pathlib.Path.glob()
is stable? If yes, we could simply replace the sets in _split_files_or_dirs
with lists instead of sorting here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about Path.glob()
. I know that glob.glob
has no guaranteed order, but I don't think Path.glob()
relies on it. Maybe the safest is to not assume a specific order.
BTW, slightly related, what was the reason to use sets instead of lists?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what was the reason to use sets instead of lists?
It think the reason was to avoid duplicates, but I don't remember if there was a case where I hit something like that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you tried using lists rather than sorting afterwards? If CI is not complaining for other datasets, I feel like that would be the better approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just saw this call to remove()
which might the reason for using sets:
Lines 862 to 863 in afda28a
if root in dirs: | |
dirs.remove(root) |
I can still switch to lists if you'd like, I guess I would have to write something like
dirs = [dir in dirs if dir != root]
LMK which one you prefer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only minor stuff inline. Thanks a lot @NicolasHug!
Besides that one other question: Given that the dataset also provides a bounding box for each image, shouldn't we also return it? For the test split we already load the data. For the training data, each image folder contains a GT-{label:05d}.csv
file in the same format as the testing annotations. Basically instead of Filter
ing the images, you could use a Demultiplexer
to get a images and a annotations datapipe.
@@ -877,7 +877,7 @@ def _make_archive(root, name, *files_or_dirs, opener, adder, remove=True): | |||
files, dirs = _split_files_or_dirs(root, *files_or_dirs) | |||
|
|||
with opener(archive) as fh: | |||
for file in files: | |||
for file in sorted(files): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you tried using lists rather than sorting afterwards? If CI is not complaining for other datasets, I feel like that would be the better approach.
Co-authored-by: Philip Meier <[email protected]>
Co-authored-by: Philip Meier <[email protected]>
Summary: Co-authored-by: Philip Meier <[email protected]> Reviewed By: jdsgomes, prabhat00155 Differential Revision: D33739393 fbshipit-source-id: f65df4355c53a2fed2534b4bbd3ce7c1aa0606e2
This PR adds the prototype version of the GTSRB dataset
cc @pmeier @bjuncek