Add image directory version of Fashion-MNIST dataset #18
Comments
I created the image directory; the 70,000 individual PNG images take up 273 MB. A .tgz file that contains the images is 35 MB. Are we OK with the increase in size? @shcheklein If so, I'll go ahead and submit PRs for this and #17. I'll check the other formats, but the increase is mostly related to the format overhead. |
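As a rough illustration of where those numbers come from (the directory layout here is an assumption, not necessarily what the PR will use), the sizes can be checked and the archive produced with something like:

```
# Assumed layout: one PNG per sample under fashion-mnist/images/
du -sh fashion-mnist/images                 # on-disk size of the 70,000 PNGs (~273 MB)
tar czf images.tgz -C fashion-mnist images  # pack the directory into a single archive
ls -lh images.tgz                           # compressed archive size (~35 MB)
```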
Sounds a bit too much for the get started repo. Is there a way to use a subset of it? And maybe, at the end, show the performance on the large dataset? |
Maybe I can include the "zipped directory" version, and create a stage to unzip this to |
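For illustration, such a stage could be defined with `dvc run` (the stage name, paths, and archive location below are assumptions; newer DVC versions would use `dvc stage add` instead):

```
# Hypothetical stage: depends on the archive, produces the unpacked image directory.
dvc run -n extract_fashion_mnist \
        -d fashion-mnist/images.tgz \
        -o fashion-mnist/images \
        tar xzf fashion-mnist/images.tgz -C fashion-mnist
```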
Good point. So, both solutions look suboptimal: gz is not very usual, hides details, complicates the code ... the images dataset is too large. Have you tried to minify the PNGs with tinypng or something, by chance? |
It's not that the individual files are large. They are around 350-400 bytes. Even a BMP would take less than 2 KB for each file. (Each image contains 784 bytes of data + format overhead.) But each file takes at least 4 KB in ext4, even if the file is 1 byte. Also, download overhead will probably take up 5-10x more time for 70,000 individual files. I can test a |
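The block-size overhead is easy to see: roughly 70,000 files × 4 KB minimum allocation ≈ 280 MB, which is in the same ballpark as the 273 MB reported above, even though the actual image data is only ~25-30 MB. For example (paths illustrative):

```
# Apparent size: the sum of the file contents (~400 bytes per PNG).
du -sh --apparent-size fashion-mnist/images
# Allocated size: what the files occupy on an ext4 filesystem with 4 KB blocks.
du -sh fashion-mnist/images
```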
And these are sizes for
|
I would still start even with 200 MB+ rather than making it an archive artificially. My thought process is: how often would DS/ML teams tar their datasets? It's not convenient. I know there are some specific formats for TF, etc. Would it be even better to start with them? But then we would still have to complicate the example. |
OK, I'll test with individual images and see how it fares. |
I'll update this with the download speeds. 😄 Update: It seems it's much worse than I expected. The download takes more than the upload, around ~30 minutes. My local speeds (to the closest server) are something like: I can do this test in AWS, Google Cloud or Katacoda, but I doubt it will matter much. I'll update the exact time after the download. WDYT @shcheklein Update:
|
I've pushed it to my clone. You can test the download with:

```
git clone git@github.com:iesahin/dataset-registry
dvc remote add -d --local storage s3://dvc-public/remote/dataset-registry-test
dvc pull fashion-mnist/images
```
|
I also tested this on my VPS, and getting the dataset over HTTPS seems to take about 1 hour. |
Could you run it with It feels like it's related to some known performance issues; we need to confirm and address that. As far as I understand, it should take on the order of minutes to download it. |
With |
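(The flag being asked about is cut off above; purely as an illustration of one knob that affects this, `dvc pull` accepts a jobs option to parallelize downloads. The job count here is arbitrary.)

```
# Illustrative only: pull with more parallel download jobs.
dvc pull -j 16 fashion-mnist/images
```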
I did not want to pollute the |
It's fine to "pollute" the data registry; there should be no harm in that. And it's even good to have it: we can use it as a test and optimize it to work for our needs. It's annoying that we need to use
For now, are there any other/simpler datasets that we could use for the get started purposes? |
MNIST and Fashion-MNIST are very small datasets, to the level of being toy datasets. As a comparison, the VGG Face dataset (2015) contained around 2,600,000 images, and even that is small by today's standards. DVC may be improved, but there is an inherent latency when you make many requests. Each download is about 500 bytes, but the required TCP handshake, key exchange, etc. take time, and we make that connection 70,000 times. |
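A back-of-the-envelope check of that argument (the per-request overhead figure is an assumption, not a measurement): with ~25 ms of connection setup per file, 70,000 sequential requests already cost about half an hour, regardless of the ~500-byte payloads.

```
# Rough estimate: 70,000 requests x 25 ms assumed per-request overhead
python3 -c "print(70000 * 0.025 / 60)"   # ~29 minutes
```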
I think I first need to browse the code a bit before asking for improvement. There may be easier ways to improve the networking performance by using the same |
Yes, it looks like DVC creates a new |
The structure of the Fashion-MNIST dataset is identical to MNIST.
We can use the same structure in #17.