Support Storage and Retrieval of Large & Arbitrary IPLD DAGs in Filecoin #22

117 changes: 117 additions & 0 deletions proposals/large-ipld-dags.md
@@ -0,0 +1,117 @@
# Support Large IPLD/IPFS DAGs

Authors: Stebalien

Initial PR: TBD <!-- Reference the PR first proposing this document. Oooh, self-reference! -->

## Purpose & impact
#### Background & intent
_Describe the desired state of the world after this project? Why does that matter?_

First, it should be possible to store arbitrary and arbitrarily large IPLD DAGs on Filecoin using
the built-in protocols. At the moment, Filecoin can only store "whole DAGs". If a DAG doesn't fit
into a sector when serialized as a CAR, it must be converted to raw blocks, chunked, and then stored
as those chunks.

Unfortunately:

1. This workaround erases the underlying DAG structure. This makes it difficult to transfer this
data for both storage and retrieval. This is especially true when interacting with IPFS.
2. This workaround requires storing an "overlay" DAG in Filecoin (paying for that storage).

Second, it should be possible to retrieve subsets of DAGs. While the underlying protocols support

Review comment (Contributor):

The protocols support this - I think this is referring to graphsync and the other IPLD pieces down to the data storage - but the CLI doesn't. What about the miner side of this? The wording of this suggests that it's just the client CLI that's blocked on this, is that true? Can an alternative retrieval client use the protocols today to retrieve an arbitrary sub-DAG from a miner or is there more to be done on that side too?

Review comment:

I tested selector-based retrievals way back in August (using a hardcoded selector in the client directly). They worked, in the context of everything else being flaky.

It's not a CLI issue; rather, we do not have a decent selector interchange format in general (a gob of CBOR is not something to use over an API/CLI).

In other words:

  • if today I want to specify a cid - I usually get to do the funny { "/":"baf..." } thing
  • if today I want to express a selector - I do... ❓

Review comment (Contributor):

ahhhhh back to the "selector syntax" problem, we should just solve that properly eh? so close ipld/specs#239

retrieving subsets of DAGs, the CLI does not. This makes it impossible to, e.g., retrieve a single
file from a directory without modifying Lotus.
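
To make "retrieving a subset of a DAG" concrete, here is a minimal sketch of what a selector for "follow this path, then take everything beneath it" might look like when written out as JSON. The single-character keys (`R`, `l`, `:>`, `a`, `>`, `@`, `f`, `f>`) follow my reading of the IPLD selector specification and should be checked against it. The helper names (`explore_all_recursively`, `path_selector`, `selector_json`) are hypothetical and are not part of Lotus or any IPLD library. The sketch also assumes each path segment is a plain map key, which is not how IPFS (UnixFS) directories are laid out, so naming a file inside a real directory may need more than this.

```python
import json


def explore_all_recursively():
    """Selector that walks a node and everything reachable below it."""
    return {
        "R": {                              # ExploreRecursive
            "l": {"none": {}},              # no recursion depth limit
            ":>": {"a": {">": {"@": {}}}},  # explore all children, recursing at each one
        }
    }


def path_selector(segments, inner=None):
    """Wrap `inner` in one ExploreFields step per path segment, outermost first."""
    selector = inner if inner is not None else explore_all_recursively()
    for name in reversed(segments):
        selector = {"f": {"f>": {name: selector}}}
    return selector


def selector_json(segments):
    """Serialize the selector as compact JSON with a stable key order."""
    return json.dumps(path_selector(segments), sort_keys=True, separators=(",", ":"))


if __name__ == "__main__":
    # Hypothetical path: the "readme.md" entry under "docs".
    print(selector_json(["docs", "readme.md"]))
```

A string like this is the kind of thing a command-line flag or API field could carry, which is the gap the review thread above describes.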

#### Assumptions & hypotheses
_What must be true for this project to matter?_

There is no easy way (e.g., no out-of-band deals) to store large (> sector size) IPLD DAGs while
preserving the DAG structure.

#### User workflow example
_How would a developer or user use this new capability?_

* `lotus client deal` should accept an IPLD selector.
* `lotus client deal` should automatically split large DAGs between multiple sectors.
* `lotus client retrieve` should support retrieving IPLD selectors (dag subsets).
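
As an illustration of the automatic splitting in the second bullet, the sketch below packs the blocks of a DAG, in depth-first order, into pieces that each fit within a fixed capacity. It is only one possible strategy (greedy, sequential) and not a description of how Lotus would do it. It ignores CAR framing, padding, and deal piece alignment, and it does not show how each piece would be described by a selector. The `Block` type, the function names, and the tiny example DAG are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    cid: str    # placeholder identifier, not a real CID
    size: int   # serialized size in bytes
    links: list = field(default_factory=list)  # identifiers of child blocks


def dfs_order(root, blocks):
    """Return identifiers reachable from root in depth-first order, each exactly once."""
    order, seen, stack = [], set(), [root]
    while stack:
        cid = stack.pop()
        if cid in seen:
            continue  # shared sub-DAGs are only stored once
        seen.add(cid)
        order.append(cid)
        # Push children reversed so the first link is visited first.
        stack.extend(reversed(blocks[cid].links))
    return order


def split_into_pieces(root, blocks, capacity):
    """Greedily pack blocks, in traversal order, into pieces of at most `capacity` bytes."""
    pieces, current, used = [], [], 0
    for cid in dfs_order(root, blocks):
        size = blocks[cid].size
        if size > capacity:
            raise ValueError(f"block {cid} ({size} bytes) exceeds capacity {capacity}")
        if current and used + size > capacity:
            pieces.append(current)  # current piece is full, start a new one
            current, used = [], 0
        current.append(cid)
        used += size
    if current:
        pieces.append(current)
    return pieces


if __name__ == "__main__":
    dag = {
        "root": Block("root", 10, ["a", "b"]),
        "a": Block("a", 30, ["c"]),
        "b": Block("b", 40, ["c"]),
        "c": Block("c", 50),
    }
    print(split_into_pieces("root", dag, capacity=60))
```

Packing in traversal order keeps neighbouring blocks together, but a parent and its children can still land in different pieces, so a real implementation would also need to record which piece holds which part of the DAG.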

#### Impact
_How directly important is the outcome to web3 dev stack product-market fit?_

🔥

At the moment, any tool wishing to support storing IPFS files/directories larger than 32GiB will need to store these IPFS files/directories as "raw blocks", throwing away all the DAG structural information. This will make future retrieval deals for subsets of this data infeasible and will make IPFS interop extremely difficult.

This is only one 🔥 because there are plenty of useful sub-32GiB datasets and non-IPFS datasets.

Review comment (Contributor):

This is true, although there is additional impact here: enabling people to store compositions of datasets.

If deals already exist on Filecoin for a dataset and someone wants to reference that dataset (or some part of it) within theirs, then the data has to be duplicated and stored in two separate deals. With this feature, as long as there is a way to discover mappings of CID -> miner with CID (currently out of band, but a required part of the retrieval market work), users don't need to store the same data twice (or worry about compositions exceeding 32GiB).


#### Leverage
_How much would nailing this project improve our knowledge and ability to execute future projects?_

🎯🎯🎯

If we don't solve this now, users will likely store large DAGs any way they can (e.g., as raw
blocks). We could end up with a lot of unfortunately structured data in Filecoin that's difficult to
retrieve and work with, especially from IPFS.

#### Confidence
_How sure are we that this impact would be realized? Label from [this scale](https://medium.com/@nimay/inside-product-introduction-to-feature-priority-using-ice-impact-confidence-ease-and-gist-5180434e5b15)_.

??

## Project definition
#### Brief plan of attack

1. Implement selector support in `lotus client deal`.
2. Implement selector support in `lotus client retrieve`.
3. Support automatically splitting large DAGs across deals in `lotus client deal`.

#### What does done look like?
_What specific deliverables should be completed to consider this project done?_

All three of the above steps have been implemented.

NOTE: stopping anywhere along the way will yield a useful result. As long as the first step is finished (selector support for `lotus client deal`), we'll be able to store large structured IPFS data on-chain.

#### What does success look like?
_Success means impact. How will we know we did the right thing?_

1. Developers can easily store large directory trees on Filecoin.
2. Developers can easily retrieve individual files from large datasets on Filecoin.
3. Snapshots of English Wikipedia can be stored on Filecoin.

#### Counterpoints & pre-mortem
_Why might this project be lower impact than expected? How could this project fail to complete, or fail to be successful?_

The primary risk is that there may be a lack of demand to store large IPFS-formatted datasets in Filecoin. That is, users storing large datasets (> 32GiB) may all be using custom formats and may not care about IPFS files/directories, partial retrieval, etc.

Review comment (Contributor):

is this the required path for partial retrievability - or is that a somewhat orthogonal (if related) problem?

Reply (Member Author):

I'm not sure how the comment relates to the paragraph so I may be misinterpreting it.

Step 2 of the "plan of attack" is required for partial retrieval.


Another risk is that the IPLD selector language may be insufficient to describe useful selectors over IPFS data. It should at least be possible to _store_ DAG subsets using IPLD selectors, but we may need new selectors to, e.g., download individual IPFS files.

#### Alternatives
_How might this project’s intent be realized in other ways (other than this project proposal)? What other potential solutions can address the same need?_

1. Don't support datasets > 32GiB.
2. Store large datasets as raw objects instead of IPFS files and accept the fact that these datasets

Review comment (Contributor):

I guess a third alternative could be allowing users to send a parallel DAG structure that contains only links if they want the data to be queryable, accepting that our selector options will be limited and that dealing with this manifest may be a pain.

will be difficult to query/retrieve from IPFS.

#### Dependencies/prerequisites
<!--List any other projects that are dependencies/prerequisites for this project that is being pitched.-->

None.

Review comment (Contributor):

we should think about if this can be deferred or done in parallel with having the lotus client / market work using ipld-prime

Reply (Member Author):

Just using ipld-prime doesn't get us much. I need to be able to (a) make a deal over a selector and (b) retrieve a selector.

Review comment:

#27 is probably a dependency.


#### Future opportunities
<!--What future projects/opportunities could this project enable?-->

* Large IPFS datasets.
* IPFS interop.

## Required resources

#### Effort estimate

Small to medium.

#### Roles / skills needed

* Markets (ideally Hannah or Dirk).
* IPLD/Selectors (Riba or Eric).