Remove download for ImageNet #1457

Merged: 15 commits, Oct 21, 2019

pmeier
Collaborator

@pmeier pmeier commented Oct 14, 2019

addresses #1453

I introduced the functions parse_train_archive and parse_val_archive and changed parse_devkit to take a str pointing to an archive. Users who downloaded the dataset externally can use them to prepare the folders.
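As a rough illustration of what "preparing the folders" from a train archive involves, here is a minimal standard-library sketch. It is not the code from this PR: `prepare_train_folder` is a hypothetical name, and the layout it assumes (an outer tar that contains one inner tar per class, named `<class_id>.tar`) is stated as an assumption rather than taken from the thread.

```python
import os
import tarfile


def prepare_train_folder(archive, root, folder="train"):
    """Unpack a train-style archive into one sub-folder per class.

    Hypothetical helper. Assumes the outer tar holds one inner tar per
    class, named '<class_id>.tar'. Returns the sorted class folder names.
    """
    train_root = os.path.join(root, folder)
    os.makedirs(train_root, exist_ok=True)

    # Step 1: extract the outer archive; this yields the per-class tars.
    with tarfile.open(archive) as outer:
        outer.extractall(train_root)

    # Step 2: replace each per-class tar with a folder of the same name.
    for name in sorted(os.listdir(train_root)):
        if not name.endswith(".tar"):
            continue
        inner_path = os.path.join(train_root, name)
        class_dir = os.path.join(train_root, name[: -len(".tar")])
        os.makedirs(class_dir, exist_ok=True)
        with tarfile.open(inner_path) as inner:
            inner.extractall(class_dir)
        os.remove(inner_path)

    return sorted(os.listdir(train_root))
```

Calling `prepare_train_folder("/path/to/train.tar", "/path/to/root")` would leave `/path/to/root/train/<class_id>/` folders behind, which is the folder-per-class layout a classification dataset can read directly.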

@pmeier pmeier marked this pull request as ready for review October 15, 2019 07:00
@pmeier
Collaborator Author

pmeier commented Oct 15, 2019

Tests currently fail because the new code raises an error if the MD5 of the archive does not match. Shall I remove the download flag from the tests and simply parse the archives first?

    @mock.patch('torchvision.datasets.utils.download_url')
    @unittest.skipIf(not HAS_SCIPY, "scipy unavailable")
    def test_imagenet(self, mock_download):
        with imagenet_root() as root:
            dataset = torchvision.datasets.ImageNet(root, split='train', download=True)
            self.generic_classification_dataset_test(dataset)
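The kind of MD5 check described above can be sketched with the standard library alone. The names `file_md5` and `verify_archive` are hypothetical and are not the helpers used in this PR. The sketch also hints at why a test would trip over such a check: a small stand-in archive written by a test fixture will not hash to the checksum of the real download.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def verify_archive(path, expected_md5):
    """Raise RuntimeError if the file's MD5 differs from expected_md5."""
    actual = file_md5(path)
    if actual != expected_md5:
        raise RuntimeError(
            "MD5 mismatch for {}: expected {}, got {}".format(
                path, expected_md5, actual
            )
        )
```

Reading in chunks keeps memory use flat, which matters for archives that are many gigabytes in size.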

Member

@fmassa fmassa left a comment

Thanks for the PR!

I have one comment that I think will improve usability. Let me know what you think.

@fmassa
Member

fmassa commented Oct 16, 2019

@pmeier thanks for the PR! Can you rebase your changes on top of master? I've fixed CI for most of them.

@pmeier pmeier force-pushed the remove_imagenet_download branch from 61d7e23 to 2548910 Compare October 16, 2019 13:34
@pmeier pmeier requested a review from fmassa October 17, 2019 14:56
@fmassa
Member

fmassa commented Oct 18, 2019

LGTM, but test failures seem to be related?

_____________________________ Tester.test_imagenet _____________________________

self = <test_datasets.Tester testMethod=test_imagenet>
mock_check = <MagicMock name='_verify_archive' id='140159361171976'>

    @mock.patch('torchvision.datasets.imagenet.ImageNet._verify_archive')
    @unittest.skipIf(not HAS_SCIPY, "scipy unavailable")
    def test_imagenet(self, mock_check):
        with imagenet_root() as root:
>           dataset = torchvision.datasets.ImageNet(root, split='train')

test/test_datasets.py:115: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeh/lib/python3.5/site-packages/torchvision/datasets/imagenet.py:62: in __init__
    self.extract_archives()
../_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeh/lib/python3.5/site-packages/torchvision/datasets/imagenet.py:81: in extract_archives
    parse_devkit_archive(archive)
../_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeh/lib/python3.5/site-packages/torchvision/datasets/imagenet.py:154: in parse_devkit_archive
    _verify_archive(archive, ARCHIVE_DICT["devkit"]["md5"], force)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

archive = '/tmp/tmpzh1hy800/ILSVRC2012_devkit_t12.tar.gz'
md5 = 'fa75699e90414af021442c21a62c3abf', force = False

    def _verify_archive(archive, md5, force):
        if not check_integrity(archive):
            raise RuntimeError("The file {} doesn't exist.".format(archive))
        if not check_md5(archive, md5) and not force:
            msg = ("The MD5 checksum of the file {} and the original archive do not match. "
                   "Use force=True to force an extraction")
>           raise RuntimeError(msg.format(archive))
E           RuntimeError: The MD5 checksum of the file /tmp/tmpzh1hy800/ILSVRC2012_devkit_t12.tar.gz and the original archive do not match. Use force=True to force an extraction
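In the traceback above, the test patches `ImageNet._verify_archive`, yet the real `_verify_archive` still runs, called as a plain function from `parse_devkit_archive`. One possible reading, which the thread does not confirm, is that the patch target differs from the name that is actually looked up at call time. The toy sketch below reproduces that pattern; `ToyDataset`, `parse_archive`, and `_verify` are all hypothetical stand-ins.

```python
from unittest import mock


def _verify(archive):
    """Module-level check; this is the name parse_archive looks up."""
    raise RuntimeError("checksum mismatch for {}".format(archive))


def parse_archive(archive):
    _verify(archive)  # resolved in the module globals at call time
    return "parsed"


class ToyDataset:
    # The same check is also reachable as a class attribute,
    # but nothing below calls it through the class.
    _verify = staticmethod(_verify)

    def __init__(self, archive):
        self.status = parse_archive(archive)


def build_with_class_patch():
    """Patch the class attribute: the module-level check still runs."""
    with mock.patch.object(ToyDataset, "_verify"):
        try:
            return ToyDataset("x.tar").status
        except RuntimeError:
            return "still raised"


def build_with_module_patch():
    """Patch the name where parse_archive looks it up: the check is skipped."""
    with mock.patch.dict(parse_archive.__globals__, {"_verify": mock.MagicMock()}):
        return ToyDataset("x.tar").status
```

The general rule this illustrates is that `mock.patch` has to replace the name in the namespace where the code under test resolves it, not merely some other reference to the same function.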

* remove force flag for parse_*_archive functions
* cleanup
@codecov-io

codecov-io commented Oct 21, 2019

Codecov Report

Merging #1457 into master will decrease coverage by 0.11%.
The diff coverage is 80.89%.

@@            Coverage Diff             @@
##           master    #1457      +/-   ##
==========================================
- Coverage   64.46%   64.34%   -0.12%     
==========================================
  Files          83       83              
  Lines        6421     6454      +33     
  Branches      982      992      +10     
==========================================
+ Hits         4139     4153      +14     
- Misses       1992     2006      +14     
- Partials      290      295       +5
Impacted Files                                 Coverage Δ
torchvision/datasets/imagenet.py               82.14% <80.89%> (-8.77%) ⬇️
torchvision/datasets/utils.py                  58.38% <0%> (-3.73%) ⬇️
torchvision/transforms/functional_tensor.py    57.14% <0%> (-2.86%) ⬇️
torchvision/io/video.py                        75% <0%> (-0.33%) ⬇️
torchvision/models/detection/transform.py      66.92% <0%> (ø) ⬆️
torchvision/models/detection/roi_heads.py      56.14% <0%> (+0.37%) ⬆️
torchvision/ops/boxes.py                       100% <0%> (+5.55%) ⬆️
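The headline percentages in the coverage diff appear to be hits divided by lines. The short sketch below redoes that arithmetic with the Hits and Lines figures from the report; `coverage_percent` is a hypothetical helper, and how Codecov rounds its displayed values is an assumption, not something the report states.

```python
def coverage_percent(hits, lines):
    """Return line coverage as a percentage of tracked lines."""
    return 100.0 * hits / lines


before = coverage_percent(4139, 6421)  # master: Hits / Lines from the report
after = coverage_percent(4153, 6454)   # this PR
change = after - before                # negative: coverage dropped

print("before={:.3f}% after={:.3f}% change={:+.3f}".format(before, after, change))
```

The unrounded change is a little over 0.11 percentage points, which may be why the summary line says 0.11% while the diff, which subtracts the two displayed figures, shows -0.12%.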

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e279eba...b57b90e.

Member

@fmassa fmassa left a comment

Thanks!

@fmassa fmassa merged commit f46f2c1 into pytorch:master Oct 21, 2019
@pmeier pmeier deleted the remove_imagenet_download branch October 21, 2019 11:49
@z-a-f z-a-f mentioned this pull request Nov 8, 2019
@kanonjz

kanonjz commented Dec 7, 2019

I downloaded ImageNet myself and used parse_val_archive to prepare the folders, but got the error below. What is meta.bin? I didn't find it in the ImageNet download.

The meta file meta.bin is not present in the root directory or is corrupted. This file is automatically created by the ImageNet dataset.

@pmeier
Collaborator Author

pmeier commented Dec 7, 2019

@kanonjz I think your question justifies an independent issue. I opened one for you. Have a look at #1646. I answered your question there.

4 participants