Support Storage and Retrieval of Large & Arbitrary IPLD DAGs in Filecoin #22
# Support Large IPLD/IPFS DAGs

Authors: Stebalien

Initial PR: TBD <!-- Reference the PR first proposing this document. Oooh, self-reference! -->

## Purpose & impact
#### Background & intent
_Describe the desired state of the world after this project? Why does that matter?_

First, it should be possible to store arbitrary and arbitrarily large IPLD DAGs on Filecoin using
the built-in protocols. At the moment, Filecoin can only store "whole DAGs". If a DAG doesn't fit
into a sector when serialized as a CAR, it must be converted to raw blocks, chunked, and then stored
as those chunks.

Unfortunately:

1. This workaround erases the underlying DAG structure, which makes it difficult to transfer the
   data for both storage and retrieval, especially when interacting with IPFS.
2. This workaround requires storing an "overlay" DAG in Filecoin (and paying for that storage).

Second, it should be possible to retrieve subsets of DAGs. While the underlying protocols support
retrieving subsets of DAGs, the CLI does not. This makes it impossible to, e.g., retrieve a single
file from a directory without modifying Lotus.

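To make "retrieving a subset of a DAG" concrete, here is a minimal sketch over a toy in-memory DAG. It is not Lotus or IPLD code: the `blocks` store, the fake CIDs, and the `resolve_path` and `collect` helpers are all hypothetical, and real selectors are more expressive than a plain path walk. The point is only that fetching one file from a directory touches a small part of the DAG.

```python
# Toy content-addressed store: fake CID -> node. All names here are illustrative.
# Every node carries a "links" map from entry name (or chunk index) to child CID.
blocks = {
    "cid-root": {"links": {"docs": "cid-docs", "big.bin": "cid-big"}},
    "cid-docs": {"links": {"readme.txt": "cid-readme"}},
    "cid-readme": {"links": {"0": "cid-r0"}},
    "cid-r0": {"data": b"hello", "links": {}},
    "cid-big": {"links": {"0": "cid-b0", "1": "cid-b1"}},
    "cid-b0": {"data": b"x" * 8, "links": {}},
    "cid-b1": {"data": b"y" * 8, "links": {}},
}


def resolve_path(root, path):
    """Follow named links from `root` and return the CID the path ends at."""
    cid = root
    for segment in path:
        cid = blocks[cid]["links"][segment]
    return cid


def collect(cid, seen=None):
    """Return every CID reachable from `cid` (the sub-DAG rooted there)."""
    seen = set() if seen is None else seen
    if cid not in seen:
        seen.add(cid)
        for child in blocks[cid]["links"].values():
            collect(child, seen)
    return seen


# Retrieving one file needs only the blocks under it, not the whole DAG.
target = resolve_path("cid-root", ["docs", "readme.txt"])
subset = collect(target)
```

Here `subset` holds two of the seven blocks. A verifiable retrieval would also need the nodes along the path from the root; the sketch leaves that out to stay short.
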
#### Assumptions & hypotheses
_What must be true for this project to matter?_

There is no easy way (e.g., no out-of-band deals) to store large (> sector size) IPLD DAGs while
preserving the DAG structure.

#### User workflow example
_How would a developer or user use this new capability?_

* `lotus client deal` should accept an IPLD selector.
* `lotus client deal` should automatically split large DAGs between multiple sectors.
* `lotus client retrieve` should support retrieving IPLD selectors (DAG subsets).

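For a sense of what an IPLD selector looks like when written out, the sketch below builds two of them as JSON-compatible dictionaries. It assumes the short key names from the IPLD selector spec (`R`, `l`, `:>`, `a`, `>`, `@`, `f`, `f>`, `.`); treat those keys as an assumption to check against the spec. The helper names and the `docs`/`readme` field names are hypothetical, and how `lotus client deal` or `lotus client retrieve` would accept such a selector is not defined here.

```python
import json


def match():
    """Matcher: select the node reached at this point."""
    return {".": {}}


def explore_all_recursively():
    """Select an entire DAG: recurse over all links with no depth limit."""
    return {"R": {"l": {"none": {}}, ":>": {"a": {">": {"@": {}}}}}}


def explore_path(segments, then):
    """Walk the named fields in order, then apply the `then` selector."""
    selector = then
    for segment in reversed(segments):
        selector = {"f": {"f>": {segment: selector}}}
    return selector


# Hypothetical field names, for a DAG whose nodes are plain maps.
whole_dag = explore_all_recursively()
one_entry = explore_path(["docs", "readme"], match())
print(json.dumps(one_entry))
```

Even a two-segment path nests several levels deep, which hints at why typing raw selectors on a command line is awkward.
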
#### Impact
_How directly important is the outcome to web3 dev stack product-market fit?_

🔥

At the moment, any tool wishing to support storing IPFS files/directories larger than 32GiB will need to store these IPFS files/directories as "raw blocks", throwing away all the DAG structural information. This will make future retrieval deals for subsets of this data infeasible and will make IPFS interop extremely difficult.

This is only one 🔥 because there are plenty of useful sub-32GiB datasets and non-IPFS datasets.

Review comment: This is true, although there is additional impact here: enabling people to store compositions of datasets. If deals already exist on Filecoin for a dataset and someone wants to reference that dataset (or some part of it) within theirs, the data has to be duplicated and stored in two separate deals. With this feature, as long as there is a way to discover mappings of CID -> miner with CID (currently out of band, but a required part of retrieval market work), users don't need to store the same data twice (or worry about compositions exceeding 32GiB).

#### Leverage
_How much would nailing this project improve our knowledge and ability to execute future projects?_

🎯🎯🎯

If we don't solve this now, users will likely store large DAGs any way they can (e.g., as raw
blocks). We could end up with a lot of unfortunately structured data in Filecoin that's difficult to
retrieve and work with, especially from IPFS.

#### Confidence
_How sure are we that this impact would be realized? Label from [this scale](https://medium.com/@nimay/inside-product-introduction-to-feature-priority-using-ice-impact-confidence-ease-and-gist-5180434e5b15)_.

??

## Project definition
#### Brief plan of attack

1. Implement selector support in `lotus client deal`.
2. Implement selector support in `lotus client retrieve`.
3. Support automatically splitting large DAGs across deals in `lotus client retrieve`.

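One way to think about splitting a large DAG across deals is as a packing problem: walk the DAG and group blocks so that no group exceeds what one deal can hold. The following is a minimal sketch, assuming a greedy depth-first packing. It is not how Lotus does or will do it; `split_into_deals`, the toy `sizes` and `links` tables, and the capacity value are hypothetical. Real deals would also need a root or selector per deal, and padding and CAR overhead are ignored.

```python
def split_into_deals(root, sizes, links, capacity):
    """Greedily pack blocks, in depth-first order, into groups of at most `capacity` bytes."""
    deals, current, used = [], [], 0
    seen, stack = set(), [root]
    while stack:
        cid = stack.pop()
        if cid in seen:
            continue  # a block shared by several parents is packed only once
        seen.add(cid)
        size = sizes[cid]
        if size > capacity:
            raise ValueError(f"block {cid} is larger than one deal")
        if used + size > capacity:
            deals.append(current)  # close the full group and start a new one
            current, used = [], 0
        current.append(cid)
        used += size
        # Push children reversed so they are visited in their listed order.
        stack.extend(reversed(links.get(cid, [])))
    if current:
        deals.append(current)
    return deals


# Toy DAG: block sizes in bytes and child links, both made up for the example.
sizes = {"root": 1, "a": 4, "b": 4, "a1": 3, "b1": 3}
links = {"root": ["a", "b"], "a": ["a1"], "b": ["b1"]}
deals = split_into_deals("root", sizes, links, capacity=8)
```

Depth-first order keeps each subtree mostly within one group, which should make later retrieval of a subtree touch fewer deals than an arbitrary ordering would.
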
#### What does done look like?
_What specific deliverables should be completed to consider this project done?_

All three of the above steps have been implemented.

NOTE: stopping anywhere along the way will yield a useful result. As long as the first step is finished (selector support for `lotus client deal`), we'll be able to store large structured IPFS data on-chain.

#### What does success look like?
_Success means impact. How will we know we did the right thing?_

1. Developers can easily store large directory trees on Filecoin.
2. Developers can easily retrieve individual files from large datasets on Filecoin.
3. Snapshots of English Wikipedia can be stored on Filecoin.

#### Counterpoints & pre-mortem
_Why might this project be lower impact than expected? How could this project fail to complete, or fail to be successful?_

The primary risk is that there may be a lack of demand to store large IPFS-formatted datasets in Filecoin. That is, users storing large datasets (> 32GiB) may all be using custom formats and may not care about IPFS files/directories, partial retrieval, etc.

Review comment: Is this the required path for partial retrievability, or is that a somewhat orthogonal (if related) problem?

Reply: I'm not sure how the comment relates to the paragraph, so I may be misinterpreting it. Step 2 of the "plan of attack" is required for partial retrieval.

Another risk is that the IPLD selector language may be insufficient to describe useful selectors over IPFS data. It should be at least possible to _store_ DAG subsets using IPLD selectors, but we may need new selectors to, e.g., download individual IPFS files.

#### Alternatives
_How might this project’s intent be realized in other ways (other than this project proposal)? What other potential solutions can address the same need?_

1. Don't support datasets > 32GiB.
2. Store large datasets as raw objects instead of IPFS files and accept the fact that these datasets
   will be difficult to query/retrieve from IPFS.

Review comment: I guess a 3rd one like this could include allowing users to send a parallel DAG structure that only contains links if they want to be queryable, and accepting that our selector options will be limited and that dealing with this manifest may be a pain.

#### Dependencies/prerequisites
<!--List any other projects that are dependencies/prerequisites for this project that is being pitched.-->

None.

Review comment: We should think about whether this can be deferred or done in parallel with having the Lotus client / market work using ipld-prime.

Reply: Just using ipld-prime doesn't get us much. I need to be able to (a) make a deal over a selector and (b) retrieve a selector.

Reply: #27 is probably a dependency.

#### Future opportunities
<!--What future projects/opportunities could this project enable?-->

* Large IPFS datasets.
* IPFS interop.

## Required resources

#### Effort estimate

Small to medium.

#### Roles / skills needed

* Markets (ideally Hannah or Dirk).
* IPLD/Selectors (Riba or Eric).

Review comment: The protocols support this (I think this is referring to graphsync and the other IPLD pieces down to the data storage) but the CLI doesn't. What about the miner side of this? The wording suggests that it's just the client CLI that's blocked on this; is that true? Can an alternative retrieval client use the protocols today to retrieve an arbitrary sub-DAG from a miner, or is there more to be done on that side too?

Reply: I tested selector-based retrievals way back in August (using a hardcoded selector in the client directly). They worked, in the context of everything else being flaky.
It's not a CLI issue; rather, we do not have a decent selector interchange format in general (a gob of CBOR is not something to use over API/CLI).
In other words:
{ "/":"baf..." }
thing

Reply: ahhhhh, back to the "selector syntax" problem. We should just solve that properly, eh? So close: ipld/specs#239