Pushing only diffs #7546
Comments
Excuse me, can you explain further what you mean by "DVC only supports deduplication at the file level"? For example, say I have a 10 GB CSV and push it to a remote. Then only one observation changes and I push the same 10 GB file again. Now the remote server holds two versions of the same file, so this one file already takes up 20 GB of space instead of 10 GB plus a few kilobytes of changes. If my original data changes 10 times, it would take up 100 GB of space. So what exactly does "DVC supports deduplication at the file level" mean?
It means that DVC stores files according to their hashes, as content-addressable storage. So if you have two different files with the same data content, DVC will only store one copy of that file. Likewise, if you have two tracked revisions of a file and the file is identical between those revisions, DVC will only store one copy of it. So DVC does deduplication at the whole-file level. As you have noted, if you make a small change to a large file, DVC still has to store the entire file twice, since the two versions have different binary content.
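The behaviour described above can be sketched in a few lines. This is a minimal illustration of content-addressable storage, not DVC's actual implementation; the `push` function and the `dvc-cache` directory name are hypothetical, though DVC does key its cache on MD5 digests of file content:

```python
import hashlib
import shutil
from pathlib import Path

def push(path, cache_dir="dvc-cache"):
    """Copy a file into a content-addressable cache keyed by the MD5
    of its bytes. Identical content always maps to the same entry, so
    it is stored only once; any change produces a new digest, and the
    whole file is stored again."""
    digest = hashlib.md5(Path(path).read_bytes()).hexdigest()
    dest = Path(cache_dir) / digest[:2] / digest[2:]  # cache layout: prefix dir + rest of hash
    if not dest.exists():  # deduplication: skip content we already have
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy(path, dest)
    return digest
```

For example, pushing two files with byte-identical content yields one cache entry, while changing a single value in a large file yields a second full copy, which is exactly the 10 GB + 10 GB situation described above.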
@emilijapur here is also some explanation that might help: https://stackoverflow.com/questions/60365473/by-how-much-can-i-approx-reduce-disk-volume-by-using-dvc/60366262#60366262 The only possible workaround for this problem at the moment (if it makes sense at all in your use case) is to use directories and try to split large files into smaller ones. E.g. if it's a CSV that you keep appending data to, could you split it by date, or by some other field?
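The split-by-field workaround might look something like the sketch below. The function name and layout are made up for illustration; the idea is just that once the directory is tracked, appending new rows only changes (or adds) the partitions for the affected dates, so the unchanged partitions keep their hashes and are deduplicated at the file level:

```python
import csv
from pathlib import Path

def split_by_column(src, out_dir, column):
    """Split one large CSV into one file per distinct value of
    `column` (e.g. one file per date). Each output file keeps the
    original header row."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    writers = {}
    with open(src, newline="") as f:
        reader = csv.DictReader(f)
        for row in reader:
            key = row[column]
            if key not in writers:  # first row for this value: open a new partition
                fh = open(out / f"{key}.csv", "w", newline="")
                w = csv.DictWriter(fh, fieldnames=reader.fieldnames)
                w.writeheader()
                writers[key] = (fh, w)
            writers[key][1].writerow(row)
    for fh, _ in writers.values():
        fh.close()
```

Tracking the output directory with `dvc add`, subsequent pushes would then upload only the partitions whose content actually changed.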
Sadly, I cannot split my data into blocks or parts due to project specifics. However, since Git tracks only diffs (not whole files), I thought DVC could do the same. From your answers I understand that this feature is not going to be implemented in the future?
It is planned to be implemented at some point, but there is no timetable on when that will be (see the issue I linked earlier: #829) |
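To illustrate what chunk-level deduplication (the direction of the issue referenced above) could buy here, consider hashing fixed-size chunks instead of the whole file. This is a toy sketch under stated assumptions, not DVC's design: real systems typically use content-defined chunking (rolling hashes) so that insertions don't shift every subsequent chunk boundary:

```python
import hashlib

def chunk_digests(data: bytes, chunk_size: int = 4):
    """Hash fixed-size chunks instead of the whole blob. A remote
    that stores chunks by digest would only need to upload the
    chunks whose digests are new, rather than the entire file."""
    return [hashlib.md5(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]
```

With an in-place edit (one "observation" overwritten), only the chunk containing the change gets a new digest, so only that chunk would need to be pushed, instead of the whole 10 GB file.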
DVC pushes the whole file to the remote server, while pulling only diffs. Since I am working with big datasets (~50 GB of data before modeling) and many iterations, it would be nice if only diffs were pushed to the remote server. Because the whole file is pushed to the remote multiple times, it fills up space pretty quickly. Is it possible to solve this problem for big data?