This repository was archived by the owner on Dec 6, 2022. It is now read-only.

Guideline for design: unixfsv2 as a "view" #10

Closed
warpfork opened this issue Jul 12, 2018 · 3 comments

@warpfork

Let a SerializableView be defined as a function which yields an ordered series of bytes.

Given any SerializableView, we can apply a hashing function. Given these together, we can define both an equality predicate and a content addressable storage system.
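To make that concrete, here is a minimal sketch of the idea. Everything named in it (`SerializableView` as a Python type alias, `view_hash`, `views_equal`, `ContentStore`) is hypothetical and chosen for illustration, and SHA-256 is just an assumed hash function; none of this is an existing IPLD API.

```python
import hashlib
from typing import Callable, Dict, Iterable

# Assumption: a SerializableView is modeled as a zero-argument callable
# that yields byte strings in order. The name is illustrative only.
SerializableView = Callable[[], Iterable[bytes]]


def view_hash(view: SerializableView) -> str:
    """Hash the byte stream a view yields (SHA-256 is an assumed choice)."""
    h = hashlib.sha256()
    for piece in view():
        h.update(piece)
    return h.hexdigest()


def views_equal(a: SerializableView, b: SerializableView) -> bool:
    """Equality predicate: two views are equal if their streams hash alike."""
    return view_hash(a) == view_hash(b)


class ContentStore:
    """Toy in-memory content-addressable store keyed by the view hash."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, view: SerializableView) -> str:
        data = b"".join(view())
        key = hashlib.sha256(data).hexdigest()
        self._blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]
```

Note that how the view splits its output into pieces does not matter here: only the concatenated byte stream is hashed.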

IPLD already has one clearly defined SerializableView (though we don't call it such), and we hash this and use this hash... everywhere. It's pretty useful.

Let's lift and generalize this concept. The SerializableView of IPLD that we use for content-addressing hashes is certainly useful. However, since we use the hash of this view as the index for our storage and lookup systems, it quickly becomes the bearer of a bunch of burdens. For example, the choice of layout when we import large pieces of data (e.g., balanced tree vs trickle tree; variations in parameters for Rabin chunking; etc) causes the hash of a tree of IPLD objects representing a large piece of data to vary. This is fine for the storage and lookup systems, but it also makes this hash unusable for a lot of other purposes, such as a useful equality check when we don't care about the chunking and layout.
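A toy illustration of the problem, under loud assumptions: `layout_hash` below is a made-up stand-in for "the hash of a tree of objects" (it hashes the concatenation of per-chunk hashes), and fixed-size chunking stands in for real chunkers. It is not the real IPLD or unixfs encoding, and it does not produce CIDs.

```python
import hashlib
from typing import Iterable, List


def chunk_fixed(data: bytes, size: int) -> List[bytes]:
    """Split data into fixed-size chunks (a stand-in for any chunker)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def layout_hash(data: bytes, chunk_size: int) -> str:
    """Hypothetical layout-dependent hash: hash of the leaf hashes."""
    leaves = [hashlib.sha256(c).digest() for c in chunk_fixed(data, chunk_size)]
    return hashlib.sha256(b"".join(leaves)).hexdigest()


def content_hash(chunks: Iterable[bytes]) -> str:
    """Layout-independent hash: hash of the reassembled byte stream."""
    h = hashlib.sha256()
    for c in chunks:
        h.update(c)
    return h.hexdigest()


data = bytes(range(256)) * 64  # 16 KiB of sample content

# Same content, two chunkings: the layout hashes differ...
different_layouts = layout_hash(data, 1024) != layout_hash(data, 4096)
# ...while the hash over the serialized view agrees.
same_content = (content_hash(chunk_fixed(data, 1024))
                == content_hash(chunk_fixed(data, 4096)))
```

The second kind of hash is what would serve as the cheap equality check that ignores chunking and layout.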

We should define a SerializableView (and thus a hash that's usable as a cheap equality predicate) for unixfsv2 which is not bound to the IPLD hash.

We can define other SerializableViews as well. One awesome applied example of this is a system by @mib-kd743naq which imports tars into IPFS and produces both a unixfsv1 tree... and a parallel tree of objects which can be cat'ed out to reproduce the original tar precisely. This is spectacularly useful because this view of bytes can be piped into a hashing function to match original.tar.sha256 or be verified against original.tar.asc. This tar-reemitter tree also shares the vast majority of its blob objects with the unixfsv1 tree it was produced alongside, which is nice (and generally, when implementing more new views, we can probably do this quite often as well -- and let's keep an eye out for making this easy as we design).

In summary:

  • A SerializableView of unixfsv2 which can be used for content-equality checks (and ignores chunking details) is desirable.
  • Making sure that a SerializableView which (e.g.) reproduces a tar precisely can coexist nicely with unixfsv2, and share objects with it, is a good heuristic for a good design.
  • This seems to shake out as parallel trees for the metadata, which both point into trees of content blobs. (Some of these parallel trees have directories (unixfsv2); some are very different (e.g. tar, which... doesn't necessarily exactly have directories per se, and does have a bunch of other stuff).) This should also probably inform our API design.
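
A rough sketch of the "parallel trees" shape, using Python's standard `tarfile` module. This is not how @mib-kd743naq's importer works; the names `import_tar`, `emit_tar`, `dir_view`, and `tar_view` are invented for illustration, and the flat dict and list are stand-ins for real trees of objects.

```python
import hashlib
import io
import tarfile
from typing import Dict, Iterator, List, Tuple


def put(blobs: Dict[str, bytes], data: bytes) -> str:
    """Store a blob under its SHA-256 (assumed hash) and return the key."""
    key = hashlib.sha256(data).hexdigest()
    blobs[key] = data
    return key


def import_tar(raw: bytes, blobs: Dict[str, bytes]):
    """Build two views over one shared blob store.

    dir_view: path -> blob key (a filesystem-like view).
    tar_view: ordered segments, either ("inline", bytes) for headers and
              padding, or ("blob", key) pointing at shared file content.
    """
    dir_view: Dict[str, str] = {}
    tar_view: List[Tuple[str, object]] = []
    pos = 0
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r:") as tf:
        for member in tf:
            if not member.isreg() or member.size == 0:
                continue  # non-file entries stay inside inline segments
            start = member.offset_data
            end = start + member.size
            key = put(blobs, raw[start:end])
            dir_view[member.name] = key
            tar_view.append(("inline", raw[pos:start]))
            tar_view.append(("blob", key))
            pos = end
    tar_view.append(("inline", raw[pos:]))  # trailing padding and end blocks
    return dir_view, tar_view


def emit_tar(tar_view, blobs: Dict[str, bytes]) -> Iterator[bytes]:
    """The tar SerializableView: replay segments to reproduce the archive."""
    for kind, value in tar_view:
        yield value if kind == "inline" else blobs[value]
```

Piping `emit_tar(...)` into a hash should match a hash of the original archive, while `dir_view` reuses exactly the same content blobs and never needs to know about tar headers.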
@mikeal
Contributor

mikeal commented Jul 16, 2018

Would we want to use this hash during peer discovery?

If we did, we'd need to get more information into the unixfs-v2 definition for file parts: it currently does not contain size information for each chunk. Without it, you wouldn't be able to pull parts of a file from different peers that have the same file but with different chunking.
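
To show why the per-chunk sizes matter, here is a small sketch. The function names are hypothetical and the size lists are made up; this is not part of any unixfs-v2 definition. Given only a list of chunk sizes, you can work out which chunks of a given layout cover a byte range, which is the lookup you would need before asking a peer with a different chunking for that range.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List


def chunk_offsets(sizes: List[int]) -> List[int]:
    """Byte offset where each chunk starts, plus the total length at the end."""
    return [0] + list(accumulate(sizes))


def chunks_covering(sizes: List[int], start: int, end: int) -> List[int]:
    """Indices of the chunks that overlap the byte range [start, end)."""
    offsets = chunk_offsets(sizes)
    if not (0 <= start < end <= offsets[-1]):
        raise ValueError("byte range is outside the file")
    first = bisect_right(offsets, start) - 1
    last = bisect_right(offsets, end - 1) - 1
    return list(range(first, last + 1))


# Two hypothetical chunkings of the same 12-byte file.
layout_a = [4, 4, 4]
layout_b = [6, 6]

# Bytes 4..8 are exactly one chunk in layout A but span both chunks in B.
from_a = chunks_covering(layout_a, 4, 8)
from_b = chunks_covering(layout_b, 4, 8)
```

Without the sizes, neither offset table can be built, so there is no way to tell which of a peer's chunks hold the bytes you are missing.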

@warpfork
Author

warpfork commented Sep 7, 2018

That's... an interesting question. Maybe, yeah.

My hottest/first hot take was "no", because ecosystemically, our tooling shouldn't really be encouraging the same file to be uploaded in tons of wildly varying chunkings in the first place -- that's bad for pretty much everyone (less dedup, more metadata blocks, etc etc). And choosing among all the different sets of chunks we could select to get fully overlapping ranges seems like a lot of algorithm to throw at a situation that we mostly intend to be rare in the first place; and given the presumption that all blocks of any two chunking parameters are equally available, it would always be less efficient to choose some blocks from one chunking parameter space and some from the other, so one simply wouldn't want to do this unless there are availability problems in one family of chunks. So there are just all sorts of reasons I don't think we'd want to rely on such a fancy feature, and thus I dunno if I'd advocate for implementing it at all.

But it's definitely a thing to consider. As long as we're doing so at a relatively high layer, so we maintain the "parallel trees" design property, it could be workable.

I'd +1 the idea that we should get the size info in such a place that we can do this, regardless.

@rvagg
Member

rvagg commented Dec 6, 2022

closing for archival

@rvagg closed this as completed on Dec 6, 2022