git-diff feature: Improve efficiency of IPFS for sharing updated files and decrease file/block duplication (#392)

* Motivation: 

When a file is updated and re-synced, reduce the duplication of its blocks on nodes all over the world, decrease communication cost (only the updated blocks are downloaded), and save storage (only the updated sections of the file are stored as new blocks).

-----------------

* Problem:

For example, suppose there is a file, file.tar.gz (~100 GB), that contains a data.txt file. It is stored in my IPFS repo and was pulled from another node, node-a.

I open the data.txt file and add a single character at random locations in the file (the beginning, the middle, and the end), then compress it again as file.tar.gz and store it in my IPFS repo. Here the update is only a few kilobytes.

[[*]](https://discuss.ipfs.io/t/does-ipfs-provide-block-level-file-copying-feature/6388/4?u=avatar-lavventura) When I delete a single character at the beginning of a file, the hashes of all the 124 kB blocks are altered, which leads to the complete file being downloaded again.

As a result, when node-a wants to re-get the updated tar.gz file, a re-sync takes place and the whole file is downloaded all over again. This duplicates blocks (~100 GB in this example) even though the change is only a few kilobytes. **This duplication then propagates iteratively to all peers, which is very inefficient: it consumes a large amount of storage and adds communication cost over time.**
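
The effect described above can be reproduced on a small scale. The following is only a minimal sketch, not how IPFS is implemented: it assumes fixed-size blocks hashed with plain SHA-256, and the names `BLOCK_SIZE`, `block_hashes`, and `reusable_blocks` are made up for this illustration.

```python
import hashlib
import random

BLOCK_SIZE = 1024  # hypothetical block size for the demo, much smaller than real blocks


def block_hashes(data, block_size=BLOCK_SIZE):
    """Split data into fixed-size blocks and return the SHA-256 digest of each block."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def reusable_blocks(old, new):
    """Count blocks of `new` whose hash already exists among the blocks of `old`."""
    known = set(block_hashes(old))
    return sum(1 for h in block_hashes(new) if h in known)


rng = random.Random(0)  # fixed seed so the demo is repeatable
original = bytes(rng.getrandbits(8) for _ in range(100 * BLOCK_SIZE))

prepended = b"X" + original  # one byte added at the beginning
appended = original + b"X"   # one byte added at the end

total = len(block_hashes(original))
reused_prepend = reusable_blocks(original, prepended)
reused_append = reusable_blocks(original, appended)

print(f"blocks in original:            {total}")
print(f"reusable after 1-byte prepend: {reused_prepend}")
print(f"reusable after 1-byte append:  {reused_append}")
```

With fixed-size blocks, a byte added at the front shifts every block boundary, so no old block can be reused, while a byte added at the end leaves all of the old blocks intact.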

-----------

* Solution: 

Other cloud services try to solve this problem using [Block-Level File Copying](https://superuser.com/a/1368955). In their case, as with IPFS, block-level file copying works on a block list. When a file is updated (a character is added at the beginning of the file), Dropbox and OneDrive will re-upload the whole file, because the first block's hash changes and the inserted character shifts the content, and therefore the hash, of all subsequent blocks. This doesn't solve the problem.

**=>** I believe a better solution is to look at the changes between commits of a file; the approach that [git-diff](https://git-scm.com/docs/git-diff) uses could be considered. This would upload only the changed (diff) parts of the file, which would be a few kilobytes in the example I gave, and the diffed blocks would be merged when other nodes pull that file. So only a few kilobytes would be transferred, and only a few kilobytes would be added to storage.

I know that it would be difficult to redesign IPFS, but this could be done as a wrapper solution that combines `IPFS` and `git`, and users could use it for very large files based on their needs.
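
To make the idea concrete, here is a rough sketch of a diff-and-merge round trip. It is an assumption-laden illustration: Python's standard `difflib.SequenceMatcher` stands in for git-diff, the file is treated as a list of text lines, and `make_delta`, `apply_delta`, and `delta_payload_size` are hypothetical helper names, not part of IPFS or git.

```python
from difflib import SequenceMatcher


def make_delta(old_lines, new_lines):
    """Return a list of operations that rebuilds new_lines from old_lines."""
    delta = []
    matcher = SequenceMatcher(None, old_lines, new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged range: only a reference into the old version is sent.
            delta.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            # Changed or new range: the actual lines have to be sent.
            delta.append(("data", new_lines[j1:j2]))
        # "delete" needs no operation: the old range is simply not copied.
    return delta


def apply_delta(old_lines, delta):
    """Merge a delta into the old version to obtain the new version."""
    out = []
    for op in delta:
        if op[0] == "copy":
            out.extend(old_lines[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


def delta_payload_size(delta):
    """Number of characters of literal data carried by the delta."""
    return sum(len(line) for op in delta if op[0] == "data" for line in op[1])


old_lines = [f"record {i}\n" for i in range(1000)]
new_lines = list(old_lines)
for index in (0, 500, 999):  # beginning, middle, and end of the file
    new_lines[index] = "X" + new_lines[index]

delta = make_delta(old_lines, new_lines)
rebuilt = apply_delta(old_lines, delta)

print("rebuilt matches new version:", rebuilt == new_lines)
print("characters in old version:  ", sum(len(line) for line in old_lines))
print("characters sent in delta:   ", delta_payload_size(delta))
```

A node that already holds the old version only needs the delta to obtain the new one, so the transfer grows with the size of the change rather than the size of the file.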
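
One possible shape for such a wrapper is sketched below. Everything in it is hypothetical: `DeltaStore` is an in-memory stand-in for a content-addressed store, it does not call IPFS or git, and it again uses `difflib` in place of git-diff. It only shows the design choice of storing an update as a small object that points at its parent version.

```python
import hashlib
import json
from difflib import SequenceMatcher


class DeltaStore:
    """Hypothetical in-memory stand-in for a content-addressed store."""

    def __init__(self):
        self.objects = {}  # hex digest -> serialized object bytes

    def _put(self, obj):
        raw = json.dumps(obj, sort_keys=True).encode("utf-8")
        key = hashlib.sha256(raw).hexdigest()
        self.objects[key] = raw
        return key

    def add_full(self, lines):
        """Store a complete version and return its key."""
        return self._put({"type": "full", "lines": lines})

    def add_update(self, parent_key, new_lines):
        """Store new_lines as a delta against the version at parent_key."""
        old_lines = self.get(parent_key)
        ops = []
        matcher = SequenceMatcher(None, old_lines, new_lines, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                ops.append(["copy", i1, i2])
            elif tag in ("replace", "insert"):
                ops.append(["data", new_lines[j1:j2]])
        return self._put({"type": "delta", "parent": parent_key, "ops": ops})

    def get(self, key):
        """Rebuild the lines of the version stored under key."""
        obj = json.loads(self.objects[key])
        if obj["type"] == "full":
            return obj["lines"]
        base = self.get(obj["parent"])  # walk back to the parent version
        out = []
        for op in obj["ops"]:
            if op[0] == "copy":
                out.extend(base[op[1]:op[2]])
            else:
                out.extend(op[1])
        return out

    def size_of(self, key):
        """Size in bytes of the object stored under key."""
        return len(self.objects[key])


store = DeltaStore()
v1 = [f"record {i}\n" for i in range(1000)]
v2 = list(v1)
v2[0] = "X" + v2[0]  # single-character change at the beginning

k1 = store.add_full(v1)
k2 = store.add_update(k1, v2)

print("v2 rebuilt correctly:", store.get(k2) == v2)
print("bytes stored for v1: ", store.size_of(k1))
print("bytes stored for v2: ", store.size_of(k2))
```

The trade-off in this design is that reading a version requires walking the chain of parents, so a real wrapper would probably need to store a full snapshot from time to time.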

----------

The IPFS team does not consider this problem a priority, but it should at least be on the priority list.

> IPFS team is considering adding that eventually, but it’s not a priority.


------

Please see the discussions I have already opened, and feel free to add your ideas to them.

=> [Does IPFS provide block-level file copying feature?](https://discuss.ipfs.io/t/does-ipfs-provide-block-level-file-copying-feature/6388)

=> [Efficiency of IPFS for sharing updated file](https://stackoverflow.com/a/52246029/2402577) 

