
unixfs support for large directories #1720


Description

@willglynn

This might be an extreme case, but consider someone wishing to make Wikipedia available on IPFS:

```console
$ zcat dumps.wikimedia.org/enwiki/20150901/enwiki-20150901-all-titles-in-ns0.gz | wc -l
 11944438
```

It seems like a bad idea to have a merkledag node with 12M links, but that is what the unixfs Data_Directory representation would produce. It seems like a similarly bad idea to have a merkledag node with even 1k links, and directories containing a thousand files occur commonly in practice.
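For context, a flat unixfs directory is a single merkledag node whose links enumerate every child, so the node grows linearly with the entry count. A simplified sketch of the structures involved (field names follow the merkledag protobuf, but this is illustrative, not the generated code):

```go
package sketch

// Simplified sketch of the merkledag protobuf messages backing a
// flat unixfs directory (illustrative; not the generated Go code).
type PBLink struct {
	Hash  []byte // multihash of the child object (~34 bytes for sha2-256)
	Name  string // entry name within the directory
	Tsize uint64 // cumulative size of the linked subtree
}

type PBNode struct {
	Data  []byte   // unixfs Data message with Type = Directory
	Links []PBLink // one link per directory entry: 12M for enwiki ns0
}
```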

Alternate directory representations (Data_PrefixTreeDirectory and/or Data_HashTableDirectory) might be a solution, using intermediate merkledag nodes to ultimately reference all of a large directory's children in a way that permits efficient traversal and selective retrieval. The distinction between directory representations could be transparent to users, with each implementation free to choose whichever data structure it deems suitable.
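To illustrate the hash-table flavor: a lookup could hash the entry name and consume a few bits per level to select a child shard, so resolving one name fetches only the nodes along a single path. A rough sketch, where `Shard`, `fetch`, and `bitsPerLevel` are hypothetical names invented for this example (Data_HashTableDirectory itself is only a proposed DataType):

```go
package shardsketch

import (
	"crypto/sha256"
	"encoding/binary"
)

const bitsPerLevel = 8 // 256-way fan-out per shard node

// Shard is a hypothetical intermediate merkledag node in a
// Data_HashTableDirectory.
type Shard struct {
	Children map[uint32]string // bucket index -> child shard hash
	Entries  map[string]string // leaf entries: name -> object hash
}

// lookup consumes bitsPerLevel bits of the name's hash per level to
// pick a bucket, fetching only the shard nodes along one path instead
// of the whole directory listing. fetch stands in for block retrieval.
func lookup(root Shard, name string, fetch func(hash string) Shard) (string, bool) {
	sum := sha256.Sum256([]byte(name))
	h := binary.BigEndian.Uint64(sum[:8]) // first 64 bits of the name's hash
	node := root
	for depth := 0; depth < 64/bitsPerLevel; depth++ {
		if target, ok := node.Entries[name]; ok {
			return target, true // found a leaf entry for this name
		}
		bucket := uint32(h>>(64-bitsPerLevel*(depth+1))) & (1<<bitsPerLevel - 1)
		child, ok := node.Children[bucket]
		if !ok {
			return "", false // no shard covers this bucket
		}
		node = fetch(child)
	}
	return "", false // a real design would extend the key past 8 levels
}
```

With 256-way fan-out, 12M entries fit in three levels of shards, so a single lookup touches a handful of small nodes rather than one 12M-link node.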

Going back to the example: the list of Wikipedia pages is 67 MB gzipped, before adding hashes. A user shouldn't have to download ~400 MB of protobufs just to find the hash of a single article page.
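For a rough sense of where that figure comes from (back-of-envelope arithmetic, not a measurement): the child multihashes alone put a flat directory node in that size range, before entry names and protobuf framing are counted.

```go
package sizesketch

// Back-of-envelope floor on the flat Data_Directory node size
// (illustrative arithmetic, not a measurement).
const (
	numEntries = 12_000_000 // enwiki ns0 titles, per the dump above
	hashBytes  = 34         // sha2-256 multihash: 2-byte prefix + 32-byte digest
)

// ≈ 408 MB of multihashes; names and protobuf framing come on top.
const hashTotal = numEntries * hashBytes
```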

What's a sensible upper limit for a merkledag node's link count or encoded size in bytes? Is there precedent for reducing merkledag fan-out? What other components would need to know about a new unixfs DataType?
