
Add image directory version of Fashion-MNIST dataset #18


Closed
iesahin opened this issue Jun 10, 2021 · 20 comments · Fixed by #19

iesahin (Contributor) commented Jun 10, 2021

The structure of the Fashion-MNIST dataset is identical to that of MNIST.

We can use the same structure in #17.

iesahin (Contributor) commented Jun 15, 2021

I created the image directory; the 70,000 individual PNG images take up 273 MB. A .tgz file that contains the images is 35 MB.

Are we OK with the increase in size? @shcheklein If so, I'll go ahead and submit PRs for this and #17.

I'll check the other formats, but the increase is mostly related to the format overhead.
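
For context, here is roughly how those numbers can be checked; the directory and archive names below are assumptions, not the actual layout:

find fashion-mnist/images -name '*.png' | wc -l     # 70,000 images
du -hs fashion-mnist/images                         # ~273 MB on disk
tar -czf fashion-mnist-images.tgz -C fashion-mnist images
du -h fashion-mnist-images.tgz                      # ~35 MB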

shcheklein (Member) commented:

Sounds a bit too much for the get started repo. Is there a way to use a subset of it? And maybe show the performance on the large dataset at the end?

iesahin (Contributor) commented Jun 16, 2021

Maybe I can include the "zipped directory" version, and create a stage to unzip this to data/. Using a subset seems to defeat the overall purpose of replacing the dataset.
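
In case it helps to make the idea concrete, a minimal sketch of such a stage could look like this; the stage name, archive path, and output path are assumptions, not the actual repo layout:

dvc stage add -n unzip-fashion-mnist \
    -d fashion-mnist/images.tar.gz \
    -o data/fashion-mnist \
    'tar -xzf fashion-mnist/images.tar.gz -C data'
# assumes the archive contains a top-level fashion-mnist/ directory, so extraction produces data/fashion-mnist

Running dvc repro would then unpack the archive into data/ on demand.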

shcheklein (Member) commented:

Good point. So both solutions look suboptimal: gz is not very usual, hides details, and complicates the code, while the images dataset is too large.

Have you tried minifying the PNGs with TinyPNG or something, by chance?

iesahin self-assigned this Jun 16, 2021
iesahin (Contributor) commented Jun 17, 2021

It's not that the individual files are large; they are around 350-400 bytes each. Even a BMP would take less than 2 KB per file. (Each image contains 784 bytes of data plus format overhead.)

But each file takes up at least 4 KB on ext4, even if the file is 1 byte.
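
A quick way to see this effect, assuming GNU du and the directory layout above:

du -sh --apparent-size fashion-mnist/images    # sum of file sizes: roughly 70,000 x ~400 bytes, about 28 MB
du -sh fashion-mnist/images                    # ~273 MB, since each tiny file occupies a 4 KiB block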

By tar.gz, I mean a zipped version of these PNG files, not the .gz file of the original IDX3 file. A fashion-mnist.tar.gz file can be expanded to an image directory in a single stage, and the user can still see the images.

Also, download overhead will probably make 70,000 individual files take 5-10x longer. I can test a dvc pull, but I doubt it will finish in under 5 minutes for 70,000 files.

[screenshot: Screenshot_20210617-203956_Termux.jpg]

iesahin (Contributor) commented Jun 17, 2021

For example, I converted the images to JPEG instead of PNG, and although the individual files are a bit larger, the resulting directory size reported by du -hs is 276 MB again.

[screenshot: Screenshot_20210617-204709_Termux.jpg]

iesahin (Contributor) commented Jun 17, 2021

And these are the results for BMP: the individual files are 1,862 bytes each, but the resulting directory size is the same.

[screenshot: Screenshot_20210617-205454_Termux.jpg]

iesahin (Contributor) commented Jun 17, 2021

And these are the sizes for plain tar (without compression), which uses a 512-byte block size:

179,220,480 fashion-mnist-bmp.tar
106,158,080 fashion-mnist-jpg.tar
 90,040,320 fashion-mnist-png.tar
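
For reference, archives like these can be produced with plain (uncompressed) tar; the directory names below are assumptions:

tar -cf fashion-mnist-bmp.tar fashion-mnist/images-bmp
tar -cf fashion-mnist-jpg.tar fashion-mnist/images-jpg
tar -cf fashion-mnist-png.tar fashion-mnist/images-png
ls -l fashion-mnist-*.tar    # archive sizes reflect 512-byte tar blocks, not 4 KiB filesystem blocks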

shcheklein (Member) commented:

I would still start even with 200 MB+ rather than making it an archive artificially. My thought process is: how often would DS/ML teams tar their datasets? It's not convenient. I know there are some specific formats for TF, etc. Would it be even better to start with them? But then we would still have to complicate the example.

iesahin (Contributor) commented Jun 19, 2021

OK, I'll test with individual images and see how it fares.

iesahin (Contributor) commented Jun 22, 2021

I'm dvc pushing the dataset to s3://dvc-public/remote/dataset-registry-test, and it seems it will take 20-25 minutes. My upload speed to the next hop is around 38 Mbps (while uploading the set), so I assume the download would take 15-20 minutes as well.

I'll update this with the download speeds. 😄


Update:

It seems it's much worse than I expected. The download takes longer than the upload, around 30 minutes.

[screenshot: Screen Shot 2021-06-22 at 13 25 03]

My local speeds (to the closest server) are something like:

[screenshot: Screen Shot 2021-06-22 at 13 22 57]

I can do this test on AWS, Google Cloud, or Katacoda, but I doubt it will matter much. I'll update with the exact time after the download.

WDYT @shcheklein


Update:

dvc pull takes around 42 minutes. Even the checkout process took around 40 seconds.

iesahin (Contributor) commented Jun 22, 2021

I've pushed it to my clone.

You can test the download with:

git clone git@github.com:iesahin/dataset-registry
dvc remote add -d --local storage s3://dvc-public/remote/dataset-registry-test
dvc pull fashion-mnist/images

@shcheklein

iesahin (Contributor) commented Jun 22, 2021

I also tested this on my VPS, and getting the dataset over HTTPS seems to take about 1 hour.

shcheklein (Member) commented:

Could you run it with -j 100 or -j 10?

It feels like it's related to some known performance issues; we need to confirm and address that. As far as I understand, the download should take around a few minutes.
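
For reference, the jobs flag goes directly on the pull command; the target path below assumes the layout from the earlier test:

dvc pull -j 100 fashion-mnist/images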

iesahin (Contributor) commented Jun 22, 2021

With -j 100 it seems to be about 20 minutes. @shcheklein

iesahin (Contributor) commented Jun 22, 2021

I did not want to pollute the dataset-registry remote, so I'm using another one for testing. I can push the files to the main remote and merge these changes for easier testing. I can also add a zipped version, and we can use either of these. The core team can use the registry for the test.

shcheklein (Member) commented:

It's fine to "pollute" the data registry; there should be no harm in that. It's even good to have it: we can use it as a test and optimize it to work for our needs.

It's annoying that we need to use -j to speed it up, though, and 20 minutes for 200 MB is also quite suboptimal. Let's do this and create a ticket on the DVC repo to look into this:

  • increase -j defaults (detect if we are breaking something and decrease automatically?)
  • investigate the performance

For now, are there any other/simpler datasets that we could use for Get Started purposes?

iesahin (Contributor) commented Jun 23, 2021

MNIST and Fashion-MNIST are very small datasets, to the point of being toy datasets. As a comparison, the VGG Face dataset (2015) contained around 2,600,000 images, and even that is small by today's standards.

DVC may be improved, but there is an inherent latency when you make many requests. Each download is about 500 bytes, but the required TCP handshake, key exchange, etc. take time, and we make that connection 70,000 times.
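
As a rough, back-of-the-envelope illustration (the ~30 ms per-request overhead here is an assumption, not a measurement): with fully serial transfers, 70,000 objects × 0.03 s ≈ 2,100 s, i.e. roughly 35 minutes of connection setup alone, before a single payload byte is transferred.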

iesahin (Contributor) commented Jun 23, 2021

I think I first need to browse the code a bit before asking for improvements. There may be easier ways to improve networking performance by reusing the same Session in requests; this may be orthogonal to multiple jobs. (Currently, I don't even know if DVC uses requests for HTTPS 😃)

iesahin (Contributor) commented Jun 23, 2021

Yes, it looks like DVC creates a new Session object for each download. But it also looks like requests has no way to use HTTP pipelining. So let's keep this brief and continue in a core ticket.
