Skip to content

Add support for PCAM dataset #5203

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Jan 17, 2022
Merged

Add support for PCAM dataset #5203

merged 6 commits into from
Jan 17, 2022

Conversation

NicolasHug
Copy link
Member

@NicolasHug NicolasHug commented Jan 17, 2022

Towards #5108

This PR adds support for the PCAM dataset.

cc @pmeier

@facebook-github-bot
Copy link

facebook-github-bot commented Jan 17, 2022

💊 CI failures summary and remediations

As of commit 8a3dd39 (more details on the Dr. CI page):


None of the CI failures appear to be your fault 💚



🚧 1 ongoing upstream failure:

These were probably caused by upstream breakages that are not fixed yet.


This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.


def __len__(self) -> int:
images_file = self._FILES[self._split]["images"][0]
with self.h5py.File(self._base_folder / images_file) as images_data:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note for here and below: opening a File does not load its data into memory, so the operation is very cheap and fast.

Similarly below accessing a single row in the file will not load the entire file, just a specific section of it.

I guess we could open the files and keep the handles in __init__, but I'm not sure it would be any faster, and we might not be able to ever close the handles properly.

Comment on lines +32 to +38
_FILES = {
"train": {
"images": (
"camelyonpatch_level_2_split_train_x.h5", # Data file name
"1Ka0XfEMiwgCYPdTI-vv6eUElOBnKFKQ2", # Google Drive ID
"1571f514728f59376b705fc836ff4b63", # md5 hash
),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not ecstatic about this big dict, but I needed everything in the same place to support a per-split download logic (i.e. only download the test data if we don't need train nor val).

Copy link
Collaborator

@pmeier pmeier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @NicolasHug! I have a few minor nits inline. Otherwise LGTM when CI is green.

Comment on lines +30 to +31
download (bool, optional): If True, downloads the dataset from the internet and puts it into
``root/oxford-iiit-pet``. If dataset is already downloaded, it is not downloaded again.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@NicolasHug NicolasHug merged commit 4946827 into pytorch:main Jan 17, 2022
facebook-github-bot pushed a commit that referenced this pull request Jan 19, 2022
Summary:
* Add support for PCAM dataset

* mypy

* Apply suggestions from code review

* Remove classes and class_to_idx attributes

* Use _decompress

Reviewed By: datumbox, NicolasHug

Differential Revision: D33655258

fbshipit-source-id: a38e55340ab3c364969160f3c186d1a130bdc371

Co-authored-by: Philip Meier <[email protected]>
Co-authored-by: Philip Meier <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants