Pushing only diffs #7546


Closed
emilijapur opened this issue Apr 5, 2022 · 6 comments

Comments

@emilijapur
DVC pushes the whole file to the remote server, while pulling only diffs. Since I am working with big datasets (~50 GB of data before the model) and many iterations, it would be nice if only diffs were pushed to the remote server. Because the whole file is pushed to the remote server multiple times, it fills up space pretty quickly. Is it possible to solve this problem with big data?

@pmrowla
Contributor

pmrowla commented Apr 5, 2022

Unfortunately, DVC only supports deduplication at the file level and does not do any diffing within files. DVC pull works the same way as push: it pulls entire files.

There is an existing feature request for this that you can follow for future updates: #829

Closing as duplicate of #829

@pmrowla pmrowla closed this as completed Apr 5, 2022
@emilijapur
Author

emilijapur commented Apr 5, 2022

Excuse me, can you explain further what you mean by "DVC only supports deduplication at the file level"?
When I push the same file to the DVC remote after minor changes, multiple versions of the same file are stored. Is this part of deduplication?

For example, I have a 10 GB CSV and pushed it to the remote. Then only one observation changes and I push the same 10 GB file again. Now the remote server holds 2 versions of the same file, instead of 1 file plus a record of what changed. So this one file already takes up 20 GB of space instead of 10 GB + a few kilobytes of changes. If my original data changes 10 times, it would take up 100 GB of space.

So then what does "DVC supports deduplication at the file level" exactly mean?

@pmrowla
Contributor

pmrowla commented Apr 5, 2022

It means that DVC stores files according to their hashes, as content addressable storage. So if you have two different files that have the same data content, DVC will only store one copy of that file. Likewise, if you have two tracked revisions of a file where the file is the same between those two revisions, DVC would only store one copy of that file. So DVC does de-duplication at the whole file level.

As you have noted, if you have a small change in a large file, this means that DVC still has to store the entire file twice (since they have different binary content).
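The file-level, content-addressable behavior described above can be sketched as follows. This is an illustrative toy, not DVC's actual code: files are stored in a cache directory keyed by their content hash, so identical content is stored once, while any change, however small, produces a new full copy.

```python
# Toy content-addressable store illustrating file-level deduplication
# (a sketch of the idea, not DVC's real implementation).
import hashlib
import os
import shutil


def store(cache_dir: str, path: str) -> str:
    """Copy `path` into the cache under its content hash; skip if already present."""
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    dest = os.path.join(cache_dir, digest)
    if not os.path.exists(dest):
        # Identical content hashes to the same name, so it is stored only once.
        shutil.copy(path, dest)
    return digest
```

Two files with identical bytes map to one cache entry; flip a single byte and the hash changes, so the cache gains a second full-size copy, which is exactly the storage growth described in this thread.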

@shcheklein
Member

@emilijapur here is also some explanation that might help - https://stackoverflow.com/questions/60365473/by-how-much-can-i-approx-reduce-disk-volume-by-using-dvc/60366262#60366262

The only possible workaround at this moment (if it makes sense at all in your use case) is to use directories and try to split large files into smaller ones. E.g. if it's a CSV that you keep appending data to, could you split it by date, or by some other field?
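A hedged sketch of that workaround: split one large CSV into per-month shards so that appending new rows only modifies the newest shard, and DVC's file-level deduplication keeps the unchanged shards stored once. The column name `date` (with ISO `YYYY-MM-DD` values) is an assumption for illustration.

```python
# Hypothetical sketch: shard a CSV by month so only the most recent
# shard changes when new rows are appended. Assumes a "date" column
# with ISO-formatted values (e.g. "2022-04-05").
import csv
import os


def split_csv_by_month(src: str, out_dir: str, date_col: str = "date") -> list:
    """Write one CSV per month into out_dir; return the sorted month keys."""
    os.makedirs(out_dir, exist_ok=True)
    writers, handles = {}, {}
    with open(src, newline="") as f:
        reader = csv.DictReader(f)
        for row in reader:
            month = row[date_col][:7]  # "2022-04" from "2022-04-05"
            if month not in writers:
                fh = open(os.path.join(out_dir, f"{month}.csv"), "w", newline="")
                w = csv.DictWriter(fh, fieldnames=reader.fieldnames)
                w.writeheader()
                writers[month], handles[month] = w, fh
            writers[month].writerow(row)
    for fh in handles.values():
        fh.close()
    return sorted(handles)
```

After sharding, `dvc add` on the directory tracks each shard separately, so a new month's data only adds one small file to the remote instead of re-uploading the whole dataset.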

@emilijapur
Author

Sadly, I cannot split my data into blocks or parts due to project specifics. However, I thought that since git tracks only diffs (not whole files), DVC could do the same. From your answers I understand that this feature is not going to be implemented in the future?

@pmrowla
Copy link
Contributor

pmrowla commented Apr 6, 2022

It is planned to be implemented at some point, but there is no timetable for when that will be (see the issue I linked earlier: #829).
