Add SUN397 Dataset #5132
Conversation
💊 CI failures summary and remediations

As of commit 69ee35e (more details on the Dr. CI page):

💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI. Please report bugs/suggestions to the (internal) Dr. CI Users group.
Hey @saswatpp and thanks a lot for the PR! I have some comments inline that (hopefully) make the implementation a little cleaner. On top of that, you need to run our auto-formatters on your code to pass CI. Have a look at our contributing guide. In short, you can do the following:

pip install pre-commit
pre-commit install  # activates the auto-formatter for each subsequent commit
pre-commit run --all-files  # only needed once, since you already committed
git commit -am "fix code format"
Also, I couldn't understand what exactly the tests for datasets do just by looking at the code. Do we have some resources for that?
The tests for the datasets are located in test/test_datasets.py. From the contributor perspective you need to add a new test case that configures some basics and provides fake data:
class SUN397TestCase(datasets_utils.ImageDatasetTestCase):
    DATASET_CLASS = datasets.SUN397
    ADDITIONAL_CONFIGS = datasets_utils.combinations_grid(
        split=("train", "test"),
        # There is no need to test all individual partitions, since they all behave the same
        partition=(1, 10, "all"),
    )

    def inject_fake_data(self, tmp_dir, config):
        ...
inject_fake_data should prepare tmp_dir (in your case the same as root in the dataset) in a way that the structure of the files is equal to what would be there if I had instantiated the dataset with download=True. It needs to return the number of samples in the dataset. If you want to know more, you can read the documentation of the underlying test case.
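As a rough illustration of what such a method does, here is a stdlib-only sketch. The class names, image counts, and the exact SUN397 file layout (first-letter bucket / class directory / sun_*.jpg, plus ClassName.txt and Training_01.txt split files) are assumptions for illustration; the real test harness creates decodable images via the datasets_utils helpers instead of dummy bytes:

```python
import pathlib
import tempfile


def inject_fake_data(tmp_dir, config):
    # Lay out a fake SUN397 tree the way a real download would
    # (layout assumed for illustration), then return the number of
    # samples the dataset should find for this config.
    root = pathlib.Path(tmp_dir) / "SUN397"
    classes = {"abbey": 3, "airport_terminal": 2}  # made-up classes/counts
    image_paths = []
    for name, num_images in classes.items():
        class_dir = root / name[0] / name
        class_dir.mkdir(parents=True)
        for idx in range(num_images):
            file_name = f"sun_{idx:06d}.jpg"
            # Dummy payload; the real harness writes decodable images.
            (class_dir / file_name).write_bytes(b"fake jpeg payload")
            image_paths.append(f"/{name[0]}/{name}/{file_name}")
    # Split files list image paths relative to the SUN397 root.
    (root / "ClassName.txt").write_text("\n".join(f"/{c[0]}/{c}" for c in classes))
    prefix = "Training" if config["split"] == "train" else "Testing"
    (root / f"{prefix}_01.txt").write_text("\n".join(image_paths))
    return len(image_paths)


with tempfile.TemporaryDirectory() as tmp:
    print(inject_fake_data(tmp, {"split": "train"}))  # 5
```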
@pmeier Can you tell me how to run the tests for a specific dataset? Thank you.
The easiest would be to run $ pytest test/test_datasets.py -k sun397 if the name of the test case you wrote contains sun397 (pytest's -k matching is case-insensitive).
@pmeier, I had mentioned FEATURE_TYPES = (PIL.Image.Image, int) in the test class, but the unittest still fails. Do you know what is going on?
Hey @saswatpp, I've suggested one change to the label generation. With that, mypy is happy and I can't reproduce the test error locally. Let's see if CI is happy too.
Thanks a lot @saswatpp for the PR. LGTM!
@saswatpp Could you fix the merge conflicts? Afterwards this should be good to go.
I'm wondering whether @saswatpp has pushed their changes yet? I see all comments as resolved, but I don't see any new changes in the diff 😅
Oh @NicolasHug, I have made the changes in my local repo, I just need to commit them 😓
No worries at all @saswatpp !
Force-pushed from 6d3db78 to e4bc4e0
git was behaving weird, so I force pushed
Thanks a lot @saswatpp !
Running the tests locally, I'm seeing a lot of warnings about unclosed files:

test/test_datasets.py::LSUNTestCase::test_transforms
  /Users/nicolashug/dev/vision/torchvision/datasets/lsun.py:31: ResourceWarning: unclosed file <_io.BufferedWriter name='_cache_varfolderssyvkyzpyhrlqqcgnTtmpclcnsatowervallmdb'>
    pickle.dump(self.keys, open(cache_file, "wb"))
This is similar to what happened in #5116 (review), perhaps @pmeier can advise on how he fixed it over there?
Other than that, LGTM!
@NicolasHug As the error message implies, this is about files of the LSUN dataset and thus unrelated to this PR. The same warnings are present without this PR's changes.
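For context, the ResourceWarning above comes from passing the result of open() directly to pickle.dump without ever closing the handle. A minimal sketch of the usual fix, assuming nothing about lsun.py beyond the quoted line, is a context manager that closes the file deterministically (dump_keys/load_keys are hypothetical helper names for illustration):

```python
import os
import pickle
import tempfile


def dump_keys(keys, cache_file):
    # "with" guarantees the file is closed even if pickle.dump raises,
    # which silences the unclosed-BufferedWriter ResourceWarning.
    with open(cache_file, "wb") as fh:
        pickle.dump(keys, fh)


def load_keys(cache_file):
    with open(cache_file, "rb") as fh:
        return pickle.load(fh)


cache = os.path.join(tempfile.mkdtemp(), "keys.pkl")
dump_keys([b"a", b"b"], cache)
print(load_keys(cache))  # [b'a', b'b']
```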
Hey @pmeier! You merged this PR, but no labels were added. The list of valid labels is available at https://github.com/pytorch/vision/blob/main/.github/process_commit.py
Summary:
* dataset class added
* fix code format
* fixed requested changes
* fixed issues in sun397
* Update torchvision/datasets/sun397.py

Reviewed By: sallysyw
Differential Revision: D33479277
fbshipit-source-id: 374d098c261adeacd073fae141380130a6c3aa95
Co-authored-by: Nicolas Hug <[email protected]>
Co-authored-by: Philip Meier <[email protected]>
Adds SUN397 dataset to address #5108.

The size of the dataset is 39GB, so a download option was added. Also, the 10 official training and testing partitions can be loaded using the partition argument (valid values: an int from 1-10, or None for the entire data). Is it helpful? @pmeier

Official Partitions

Thank you.
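To make the partition argument concrete: the official SUN397 partitions ship as split files named Training_01.txt through Testing_10.txt, so the partition index just selects one of those files. A hedged sketch of that mapping (partition_filename is a hypothetical helper, not the PR's actual code):

```python
def partition_filename(split: str, partition: int) -> str:
    # Official partition files are named Training_01.txt ... Testing_10.txt;
    # the 1-10 partition index is zero-padded to two digits.
    if split not in ("train", "test"):
        raise ValueError(f"unknown split {split!r}")
    if not 1 <= partition <= 10:
        raise ValueError("partition must be an int from 1 to 10")
    prefix = "Training" if split == "train" else "Testing"
    return f"{prefix}_{partition:02d}.txt"


print(partition_filename("train", 1))   # Training_01.txt
print(partition_filename("test", 10))   # Testing_10.txt
```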
cc @pmeier