Skip to content

Remove caching from MNIST and variants #3420

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Mar 16, 2021

Conversation

pmeier
Copy link
Collaborator

@pmeier pmeier commented Feb 19, 2021

Closes #3555

MNIST is one of the oldest datasets in torchvision. Back than we opted to cache the data in a custom binary for speed reasons. The caching happens in the download() method:

print('Processing...')
training_set = (
read_image_file(os.path.join(self.raw_folder, 'train-images-idx3-ubyte')),
read_label_file(os.path.join(self.raw_folder, 'train-labels-idx1-ubyte'))
)
test_set = (
read_image_file(os.path.join(self.raw_folder, 't10k-images-idx3-ubyte')),
read_label_file(os.path.join(self.raw_folder, 't10k-labels-idx1-ubyte'))
)
with open(os.path.join(self.processed_folder, self.training_file), 'wb') as f:
torch.save(training_set, f)
with open(os.path.join(self.processed_folder, self.test_file), 'wb') as f:
torch.save(test_set, f)
print('Done!')

This makes them harder to test: we need to run this method in each test if we provide a mock of the original binaries, but the new testcases from #3402 were designed to handle the download flag in a special way.

Since the speed difference is not significant anymore, we can safely remove the caching.


  1. We currently do not have tests for [EQ]MNIST. I've tested everything locally, but please review them extra carefully.
  2. Working on this module was really cumbersome. Although the datasets share so much functionality, it is hard to maintain them. I'll add tests for all MNIST variants next and afterwards refactor this module.

@codecov
Copy link

codecov bot commented Feb 19, 2021

Codecov Report

Merging #3420 (8cb1fb8) into master (22c548b) will increase coverage by 0.02%.
The diff coverage is 45.76%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3420      +/-   ##
==========================================
+ Coverage   75.13%   75.16%   +0.02%     
==========================================
  Files         105      105              
  Lines        9722     9735      +13     
  Branches     1563     1567       +4     
==========================================
+ Hits         7305     7317      +12     
  Misses       1930     1930              
- Partials      487      488       +1     
Impacted Files Coverage Δ
torchvision/datasets/mnist.py 57.32% <45.76%> (+2.07%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 22c548b...b0475de. Read the comment docs.

Copy link
Member

@fmassa fmassa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks!

I've a couple of comments, let me know what you think

@pmeier
Copy link
Collaborator Author

pmeier commented Feb 23, 2021

Blocked by #3443

@pmeier pmeier requested a review from fmassa March 16, 2021 09:52
Copy link
Member

@fmassa fmassa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks a ton!

@fmassa fmassa merged commit bb2805a into pytorch:master Mar 16, 2021
@pmeier pmeier deleted the mnist-remove-caching branch March 16, 2021 12:32
facebook-github-bot pushed a commit that referenced this pull request Mar 19, 2021
Summary:
* remove caching from (Fashion|K)?MNIST

* remove unnecessary lazy import

* remove false check of binaries against the md5 of archives

* remove caching from EMNIST

* remove caching from QMNIST

* lint

* fix EMNIST

* streamline QMNIST download

Reviewed By: fmassa

Differential Revision: D27127995

fbshipit-source-id: 3f53be72b5e7c8abe191edb1e4467e3ef33741dd
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Skip download of dataset if raw data exists
5 participants