Pushing only diffs #7546


Closed
emilijapur opened this issue Apr 5, 2022 · 6 comments

Comments

@emilijapur
DVC pushes the whole file to the remote server, while pulling only diffs. Since I am working with big datasets (~50 GB of data before the model) and many iterations, it would be nice if only diffs were pushed to the remote server. Because the whole file is pushed to the remote server multiple times, it fills up space pretty quickly. Is it possible to solve this problem with big data?

@pmrowla
Contributor

pmrowla commented Apr 5, 2022

Unfortunately, DVC only supports deduplication at the file level and does not do any diffing within files. DVC pull works the same way as push: it pulls entire files.

There is an existing feature request for this that you can follow for future updates: #829

Closing as duplicate of #829

@pmrowla pmrowla closed this as completed Apr 5, 2022
@emilijapur
Author

emilijapur commented Apr 5, 2022

Excuse me, can you explain further what you mean by "DVC only supports deduplication at the file level"?
When I push the same file to the DVC remote after minor changes, multiple versions of the same file are stored. Is this part of deduplication?

For example, I have a 10 GB CSV and pushed it to the remote. Then only one observation changes and I push the same 10 GB file again. Now the remote server holds 2 versions of the same file, instead of 1 file plus a record of what changed. So this one file already takes up 20 GB of space instead of 10 GB + a few kilobytes of changes. If my original data changes 10 times, it would take up 100 GB of space.

So then what does "DVC supports deduplication at the file level" exactly mean?

@pmrowla
Contributor

pmrowla commented Apr 5, 2022

It means that DVC stores files according to their hashes, as content addressable storage. So if you have two different files that have the same data content, DVC will only store one copy of that file. Likewise, if you have two tracked revisions of a file where the file is the same between those two revisions, DVC would only store one copy of that file. So DVC does de-duplication at the whole file level.

As you have noted, if you have a small change in a large file, this means that DVC still has to store the entire file twice (since they have different binary content).
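The file-level, content-addressable behavior described above can be sketched as follows. This is an illustrative toy, not DVC's actual code: files are stored in a cache directory keyed by their content hash, so identical content is stored once, while any change, however small, produces a new full copy.

```python
# Toy content-addressable store illustrating file-level deduplication
# (a sketch of the idea, not DVC's real implementation).
import hashlib
import os
import shutil


def store(cache_dir: str, path: str) -> str:
    """Copy `path` into the cache under its content hash; skip if already present."""
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    dest = os.path.join(cache_dir, digest)
    if not os.path.exists(dest):
        # Identical content hashes to the same name, so it is stored only once.
        shutil.copy(path, dest)
    return digest
```

Two files with identical bytes map to one cache entry; flip a single byte and the hash changes, so the cache gains a second full-size copy, which is exactly the storage growth described in this thread.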

@shcheklein
Member

@emilijapur here is also some explanation that might help - https://stackoverflow.com/questions/60365473/by-how-much-can-i-approx-reduce-disk-volume-by-using-dvc/60366262#60366262

The only possible workaround at this moment (if it makes sense at all in your use case) is to use directories and try to split large files into smaller ones. E.g. if it's a CSV that you keep appending data to, could you split it by date, or by some other field?
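A hedged sketch of that workaround: split one large CSV into per-month shards so that appending new rows only modifies the newest shard, and DVC's file-level deduplication keeps the unchanged shards stored once. The column name `date` (with ISO `YYYY-MM-DD` values) is an assumption for illustration.

```python
# Hypothetical sketch: shard a CSV by month so only the most recent
# shard changes when new rows are appended. Assumes a "date" column
# with ISO-formatted values (e.g. "2022-04-05").
import csv
import os


def split_csv_by_month(src: str, out_dir: str, date_col: str = "date") -> list:
    """Write one CSV per month into out_dir; return the sorted month keys."""
    os.makedirs(out_dir, exist_ok=True)
    writers, handles = {}, {}
    with open(src, newline="") as f:
        reader = csv.DictReader(f)
        for row in reader:
            month = row[date_col][:7]  # "2022-04" from "2022-04-05"
            if month not in writers:
                fh = open(os.path.join(out_dir, f"{month}.csv"), "w", newline="")
                w = csv.DictWriter(fh, fieldnames=reader.fieldnames)
                w.writeheader()
                writers[month], handles[month] = w, fh
            writers[month].writerow(row)
    for fh in handles.values():
        fh.close()
    return sorted(handles)
```

After sharding, `dvc add` on the directory tracks each shard separately, so a new month's data only adds one small file to the remote instead of re-uploading the whole dataset.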

@emilijapur
Author

Sadly, I cannot split my data into blocks or parts due to project specifics. However, I thought that since git tracks only diffs (not whole files), DVC could do the same. From your answers I understand that this feature is not going to be implemented in the future?

@pmrowla
Copy link
Contributor

pmrowla commented Apr 6, 2022

It is planned to be implemented at some point, but there is no timetable for when that will be (see the issue I linked earlier: #829).
