
Add image directory version of Fashion-MNIST dataset #18


Closed
iesahin opened this issue Jun 10, 2021 · 20 comments · Fixed by #19

iesahin (Contributor) commented Jun 10, 2021

The structure of the Fashion-MNIST dataset is identical to that of MNIST.

We can use the same structure in #17.

iesahin (Contributor) commented Jun 15, 2021

I created the image directory; the 70,000 individual PNG images take up 273 MB. A .tgz file that contains the images is 35 MB.

Are we OK with the increase in size? @shcheklein If so, I'll go ahead and submit PRs for this and #17.

I'll check the other formats, but the increase is mostly related to the format overhead.
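
For context, here is roughly how those numbers can be checked; the directory and archive names below are assumptions, not the actual layout:

find fashion-mnist/images -name '*.png' | wc -l     # 70,000 images
du -hs fashion-mnist/images                         # ~273 MB on disk
tar -czf fashion-mnist-images.tgz -C fashion-mnist images
du -h fashion-mnist-images.tgz                      # ~35 MB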

shcheklein (Member) commented:

Sounds a bit too much for the get started repo. Is there a way to use a subset of it? And maybe show the performance on the large dataset at the end?

iesahin (Contributor) commented Jun 16, 2021

Maybe I can include the "zipped directory" version, and create a stage to unzip this to data/. Using a subset seems to defeat the overall purpose of replacing the dataset.
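
In case it helps to make the idea concrete, a minimal sketch of such a stage could look like this; the stage name, archive path, and output path are assumptions, not the actual repo layout:

dvc stage add -n unzip-fashion-mnist \
    -d fashion-mnist/images.tar.gz \
    -o data/fashion-mnist \
    'tar -xzf fashion-mnist/images.tar.gz -C data'
# assumes the archive contains a top-level fashion-mnist/ directory, so extraction produces data/fashion-mnist

Running dvc repro would then unpack the archive into data/ on demand.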

shcheklein (Member) commented:

Good point. So both solutions look suboptimal: gz is not very usual, hides details, and complicates the code, while the images dataset is too large.

Have you tried minifying the PNGs with TinyPNG or something, by chance?

iesahin self-assigned this Jun 16, 2021
iesahin (Contributor) commented Jun 17, 2021

It's not that the individual files are large; they are around 350-400 bytes each. Even a BMP would take less than 2 KB per file. (Each image contains 784 bytes of data plus format overhead.)

But each file takes up at least 4 KB on ext4, even if the file is 1 byte.
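
A quick way to see this effect, assuming GNU du and the directory layout above:

du -sh --apparent-size fashion-mnist/images    # sum of file sizes: roughly 70,000 x ~400 bytes, about 28 MB
du -sh fashion-mnist/images                    # ~273 MB, since each tiny file occupies a 4 KiB block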

By tar.gz, I mean a zipped version of these PNG files, not the .gz file of the original IDX3 file. A fashion-mnist.tar.gz file can be expanded to an image directory in a single stage, and the user can still see the images.

Also, download overhead will probably make 70,000 individual files take 5-10x longer. I can test a dvc pull, but I doubt it will finish in under 5 minutes for 70,000 files.

[screenshot: Screenshot_20210617-203956_Termux.jpg]

iesahin (Contributor) commented Jun 17, 2021

For example, I converted the images to JPEG instead of PNG, and although the individual files are a bit larger, the resulting directory size reported by du -hs is 276 MB again.

[screenshot: Screenshot_20210617-204709_Termux.jpg]

iesahin (Contributor) commented Jun 17, 2021

And these are the results for BMP: the individual files are 1,862 bytes each, but the resulting directory size is the same.

[screenshot: Screenshot_20210617-205454_Termux.jpg]

iesahin (Contributor) commented Jun 17, 2021

And these are the sizes for plain tar (without compression), which uses a 512-byte block size:

179,220,480 fashion-mnist-bmp.tar
106,158,080 fashion-mnist-jpg.tar
 90,040,320 fashion-mnist-png.tar
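
For reference, archives like these can be produced with plain (uncompressed) tar; the directory names below are assumptions:

tar -cf fashion-mnist-bmp.tar fashion-mnist/images-bmp
tar -cf fashion-mnist-jpg.tar fashion-mnist/images-jpg
tar -cf fashion-mnist-png.tar fashion-mnist/images-png
ls -l fashion-mnist-*.tar    # archive sizes reflect 512-byte tar blocks, not 4 KiB filesystem blocks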

shcheklein (Member) commented:

I would still start even with 200 MB+ rather than making it an archive artificially. My thought process is: how often would DS/ML teams tar their datasets? It's not convenient. I know there are some specific formats for TF, etc. Would it be even better to start with them? But then we would still have to complicate the example.

iesahin (Contributor) commented Jun 19, 2021

OK, I'll test with individual images and see how it fares.

iesahin (Contributor) commented Jun 22, 2021

I'm dvc pushing the dataset to s3://dvc-public/remote/dataset-registry-test, and it seems it will take 20-25 minutes. My upload speed to the next hop is around 38 Mbps (while uploading the set), so I assume the download would take 15-20 minutes as well.

I'll update this with the download speeds. 😄


Update:

It seems it's much worse than I expected. The download takes longer than the upload, around 30 minutes.

[screenshot: Screen Shot 2021-06-22 at 13 25 03]

My local speeds (to the closest server) are something like:

[screenshot: Screen Shot 2021-06-22 at 13 22 57]

I can do this test on AWS, Google Cloud, or Katacoda, but I doubt it will matter much. I'll update with the exact time after the download.

WDYT @shcheklein


Update:

dvc pull takes around 42 minutes. Even the checkout process took around 40 seconds.

iesahin (Contributor) commented Jun 22, 2021

I've pushed it to my clone.

You can test the download with:

git clone git@github.com:iesahin/dataset-registry
dvc remote add -d --local storage s3://dvc-public/remote/dataset-registry-test
dvc pull fashion-mnist/images

@shcheklein

iesahin (Contributor) commented Jun 22, 2021

I also tested this on my VPS, and getting the dataset over HTTPS seems to take about 1 hour.

shcheklein (Member) commented:

Could you run it with -j 100 or -j 10?

It feels like it's related to some known performance issues; we need to confirm and address that. As far as I understand, the download should take around a few minutes.
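
For reference, the jobs flag goes directly on the pull command; the target path below assumes the layout from the earlier test:

dvc pull -j 100 fashion-mnist/images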

iesahin (Contributor) commented Jun 22, 2021

With -j 100 it seems to be about 20 minutes. @shcheklein

iesahin (Contributor) commented Jun 22, 2021

I did not want to pollute the dataset-registry remote, so I'm using another one for testing. I can push the files to the main remote and merge these changes for easier testing. I can also add a zipped version, and we can use either of these. The core team can use the registry for the test.

shcheklein (Member) commented:

It's fine to "pollute" the data registry; there should be no harm in that. It's even good to have it: we can use it as a test and optimize it to work for our needs.

It's annoying that we need to use -j to speed it up, though, and 20 minutes for 200 MB is also quite suboptimal. Let's do this and create a ticket on the DVC repo to look into this:

  • increase -j defaults (detect if we are breaking something and decrease automatically?)
  • investigate the performance

For now, are there any other/simpler datasets that we could use for Get Started purposes?

iesahin (Contributor) commented Jun 23, 2021

MNIST and Fashion-MNIST are very small datasets, to the point of being toy datasets. As a comparison, the VGG Face dataset (2015) contained around 2,600,000 images, and even that is small by today's standards.

DVC may be improved, but there is an inherent latency when you make many requests. Each download is about 500 bytes, but the required TCP handshake, key exchange, etc. take time, and we make that connection 70,000 times.
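
As a rough, back-of-the-envelope illustration (the ~30 ms per-request overhead here is an assumption, not a measurement): with fully serial transfers, 70,000 objects × 0.03 s ≈ 2,100 s, i.e. roughly 35 minutes of connection setup alone, before a single payload byte is transferred.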

iesahin (Contributor) commented Jun 23, 2021

I think I first need to browse the code a bit before asking for improvements. There may be easier ways to improve networking performance by reusing the same Session in requests; this may be orthogonal to multiple jobs. (Currently, I don't even know if DVC uses requests for HTTPS 😃)

iesahin (Contributor) commented Jun 23, 2021

Yes, it looks like DVC creates a new Session object for each download. But it also looks like requests has no way to use HTTP pipelining. So let's keep this brief and continue in a core ticket.
