-
Notifications
You must be signed in to change notification settings - Fork 7.1k
Add support for PCAM dataset #5203
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
💊 CI failures summary and remediationsAs of commit 8a3dd39 (more details on the Dr. CI page): ✅ None of the CI failures appear to be your fault 💚
🚧 1 ongoing upstream failure:These were probably caused by upstream breakages that are not fixed yet.
This comment was automatically generated by Dr. CI (expand for details).Please report bugs/suggestions to the (internal) Dr. CI Users group. |
|
||
def __len__(self) -> int: | ||
images_file = self._FILES[self._split]["images"][0] | ||
with self.h5py.File(self._base_folder / images_file) as images_data: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note for here and below: opening a File
does not load its data into memory, so the operation is very cheap and fast.
Similarly below accessing a single row in the file will not load the entire file, just a specific section of it.
I guess we could open the files and keep the handles in __init__
, but I'm not sure it would be any faster, and we might not be able to ever close the handles properly.
_FILES = { | ||
"train": { | ||
"images": ( | ||
"camelyonpatch_level_2_split_train_x.h5", # Data file name | ||
"1Ka0XfEMiwgCYPdTI-vv6eUElOBnKFKQ2", # Google Drive ID | ||
"1571f514728f59376b705fc836ff4b63", # md5 hash | ||
), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not ecstatic about this big dict, but I needed everything in the same place to support a per-split download logic (i.e. only download the test data if we don't need train nor val).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @NicolasHug! I have a few minor nits inline. Otherwise LGTM when CI is green.
download (bool, optional): If True, downloads the dataset from the internet and puts it into | ||
``root/oxford-iiit-pet``. If dataset is already downloaded, it is not downloaded again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Summary: * Add support for PCAM dataset * mypy * Apply suggestions from code review * Remove classes and class_to_idx attributes * Use _decompress Reviewed By: datumbox, NicolasHug Differential Revision: D33655258 fbshipit-source-id: a38e55340ab3c364969160f3c186d1a130bdc371 Co-authored-by: Philip Meier <[email protected]> Co-authored-by: Philip Meier <[email protected]>
Towards #5108
This PR adds support for the PCAM dataset.
cc @pmeier