
Add SUN397 Dataset #5132


Merged
merged 8 commits into from
Jan 7, 2022

Conversation

saswatpp
Contributor

@saswatpp saswatpp commented Dec 27, 2021

Adds SUN397 dataset to address: #5108.

The dataset is 39GB, so a download option was added. The 10 official training and testing partitions can be loaded via the partition argument (valid values: an int from 1 to 10, or None for the entire dataset). Is this helpful, @pmeier? Official Partitions
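The official partitions ship as plain text files (Training_01.txt through Training_10.txt, with matching Testing_*.txt files), each listing image paths, so loading a partition amounts to reading the matching file. A minimal stdlib-only sketch of that idea (the helper and file layout below are illustrative, not the PR's actual code):

```python
import os
import tempfile
from pathlib import Path


def load_partition_paths(root, split, partition):
    """Read the image paths listed in one official SUN397 partition file.

    `split` is "train" or "test"; `partition` is an int from 1 to 10.
    Illustrative helper only, not the dataset class from this PR.
    """
    prefix = "Training" if split == "train" else "Testing"
    partition_file = Path(root) / "Partitions" / f"{prefix}_{partition:02d}.txt"
    with open(partition_file) as f:
        return [line.strip() for line in f if line.strip()]


# Demo with a fake partition file:
tmp = tempfile.mkdtemp()
os.makedirs(os.path.join(tmp, "Partitions"))
with open(os.path.join(tmp, "Partitions", "Training_01.txt"), "w") as f:
    f.write("/a/abbey/sun_001.jpg\n/a/abbey/sun_002.jpg\n")
print(load_partition_paths(tmp, "train", 1))
```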

Thank you.

cc @pmeier

@facebook-github-bot

facebook-github-bot commented Dec 27, 2021

💊 CI failures summary and remediations

As of commit 69ee35e (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


Collaborator

@pmeier pmeier left a comment


Hey @saswatpp and thanks a lot for the PR! I have some comments inline that make the implementation (hopefully) a little cleaner. On top of that, you need to run our auto-formatters on your code to pass CI. Have a look at our contributing guide. In short, you can do the following:

pip install pre-commit
pre-commit install  # activates the auto-formatter for each subsequent commit
pre-commit run --all-files  # only needed once, since you already committed
git commit -am "fix code format"

Also, I couldn't understand what exactly the tests for the datasets do just by looking at the code. Do we have some resources for that?

The tests for the datasets are located in test/test_datasets.py. From the contributor's perspective, you need to add a new test case that configures some basics and provides fake data.

class SUN397TestCase(datasets_utils.ImageDatasetTestCase):
    DATASET_CLASS = datasets.SUN397

    ADDITIONAL_CONFIGS = datasets_utils.combinations_grid(
        split=("train", "test"),
        # There is no need to test all individual partitions, since they all behave the same
        partition=(1, 10, "all"),
    )

    def inject_fake_data(self, tmp_dir, config):
        ...

inject_fake_data should prepare tmp_dir (in your case the same as root in the dataset) in a way that the structure of the files is equal to what would be there if I had instantiated the dataset with download=True. It needs to return the number of samples in the dataset. If you want to know more, you can read the documentation of the underlying test case.
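For instance, inject_fake_data could lay out a miniature SUN397 tree under tmp_dir and return the sample count. The sketch below uses only the stdlib and writes placeholder bytes instead of real JPEGs (in the real test case, valid image files would be needed, and the class names and file layout shown here are assumptions for illustration, not the merged test):

```python
import tempfile
from pathlib import Path


def make_fake_sun397(tmp_dir, num_images_per_class=3):
    """Create a minimal fake SUN397 directory tree and return the sample count.

    Illustrative only: placeholder bytes stand in for JPEG data; the real
    test would need decodable images.
    """
    data_dir = Path(tmp_dir) / "SUN397"
    classes = ["/a/abbey", "/b/bakery/shop"]  # class paths as in ClassName.txt
    for cls in classes:
        class_dir = data_dir / cls.lstrip("/")
        class_dir.mkdir(parents=True, exist_ok=True)
        for i in range(num_images_per_class):
            (class_dir / f"sun_{i:06d}.jpg").write_bytes(b"fake jpeg data")
    # ClassName.txt lists one class path per line
    (data_dir / "ClassName.txt").write_text("\n".join(classes) + "\n")
    return len(classes) * num_images_per_class


tmp = tempfile.mkdtemp()
num_samples = make_fake_sun397(tmp)
print(num_samples)  # 2 classes * 3 images = 6
```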

@saswatpp
Contributor Author

@pmeier Can you tell me how to run the tests for a specific dataset? Thank you.

@pmeier
Collaborator

pmeier commented Dec 30, 2021

The easiest would be to run

$ pytest test/test_datasets.py -k sun397 

if the test case you wrote contains sun397 (case-insensitive) in its name.

@saswatpp
Contributor Author

saswatpp commented Jan 1, 2022

@pmeier, I had set FEATURE_TYPES = (PIL.Image.Image, int) in the test class, but the unit test still fails. Do you know what is going on?

Collaborator

@pmeier pmeier left a comment


Hey @saswatpp, I've suggested one change to the label generation. With that, mypy is happy and I can't reproduce the test error locally. Let's see if CI is happy too.

Collaborator

@pmeier pmeier left a comment


Thanks a lot @saswatpp for the PR. LGTM!

@pmeier pmeier requested a review from NicolasHug January 5, 2022 07:37
@saswatpp saswatpp changed the title from "[WIP] Add SUN397 Dataset" to "Add SUN397 Dataset" Jan 5, 2022
Member

@NicolasHug NicolasHug left a comment


Thanks a lot for the PR @saswatpp and @pmeier for the review. I made a few comments below, the download() function needs some minor fixes but other than that it looks great. LMK what you think.

@pmeier
Collaborator

pmeier commented Jan 6, 2022

@saswatpp Could you fix the merge conflicts? Afterwards this should be good to go.

@NicolasHug
Member

Afterwards this should be good to go

I'm wondering whether @saswatpp has pushed their changes yet? I see all comments as resolved, but I don't see any new changes in the diff 😅

@saswatpp
Contributor Author

saswatpp commented Jan 6, 2022

Oh @NicolasHug, I made the changes in my local repo but still have to commit them 😓

@NicolasHug
Member

No worries at all @saswatpp !

@saswatpp
Contributor Author

saswatpp commented Jan 6, 2022

Git was behaving weirdly, so I force-pushed.

Copy link
Member

@NicolasHug NicolasHug left a comment


Thanks a lot @saswatpp !

Running the tests locally, I'm seeing a lot of warnings about unclosed files:

test/test_datasets.py::LSUNTestCase::test_transforms
  /Users/nicolashug/dev/vision/torchvision/datasets/lsun.py:31: ResourceWarning: unclosed file <_io.BufferedWriter name='_cache_varfolderssyvkyzpyhrlqqcgnTtmpclcnsatowervallmdb'>
    pickle.dump(self.keys, open(cache_file, "wb"))

This is similar to what happened in #5116 (review); perhaps @pmeier can advise on how he fixed it over there?

Other than that, LGTM!

@pmeier
Collaborator

pmeier commented Jan 7, 2022

@NicolasHug As the error message implies, this is about files of the LSUN dataset and thus unrelated to this PR. The same warnings are present on the main branch. I'll look into it.
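For reference, warnings like this usually disappear once the file handle is closed deterministically instead of being left to the garbage collector. A minimal sketch of the pattern (generic illustration, not the actual lsun.py fix):

```python
import os
import pickle
import tempfile

keys = ["img_0001", "img_0002"]
cache_file = os.path.join(tempfile.mkdtemp(), "_cache_demo")

# Instead of pickle.dump(keys, open(cache_file, "wb")), which leaks the
# handle and triggers ResourceWarning, use a context manager so the file
# is closed as soon as the block exits:
with open(cache_file, "wb") as f:
    pickle.dump(keys, f)

with open(cache_file, "rb") as f:
    restored = pickle.load(f)
print(restored)  # ['img_0001', 'img_0002']
```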

@pmeier pmeier merged commit 8c546f6 into pytorch:main Jan 7, 2022
@github-actions

github-actions bot commented Jan 7, 2022

Hey @pmeier!

You merged this PR, but no labels were added. The list of valid labels is available at https://github.com/pytorch/vision/blob/main/.github/process_commit.py

facebook-github-bot pushed a commit that referenced this pull request Jan 8, 2022
Summary:
* dataset class added

* fix code format

* fixed requested changes

* fixed issues in sun397

* Update torchvision/datasets/sun397.py

Reviewed By: sallysyw

Differential Revision: D33479277

fbshipit-source-id: 374d098c261adeacd073fae141380130a6c3aa95

Co-authored-by: Nicolas Hug <[email protected]>
Co-authored-by: Philip Meier <[email protected]>