Skip to content
This repository was archived by the owner on Apr 29, 2020. It is now read-only.

Default raw leaves to true #50

Closed
wants to merge 2 commits into from
Closed

Conversation

achingbrain
Copy link
Collaborator

@achingbrain achingbrain commented Jan 16, 2020

Builds on #49. A separate PR as it's slightly contentious.

Sets default DAG construction to be a dag-pb root node, dag-pb intermediate nodes and ipld-raw nodes for leaves. This will make parsing ever so slightly faster and DAG sizes ever so slightly smaller as there is no protobuf wrapper for the actual file data.

Currently you may end up with ipld-raw leaves or dag-pb leaves that contain UnixFS entries with type 'file' or 'raw' depending on where the importer is invoked from.

E.g. to generate the same CIDs as go-IPFS, ipfs.add will result in a balanced DAG with UnixFS leaf nodes with a type 'file', ipfs.files.write will result in a trickle DAG with UnixFS leaf nodes of a type raw, and specifying CID version 1 will get you ipld-raw leaf nodes and whatever tree strategy you specifed, default balanced. I believe the reason for this is largely historical, go-IPFS has different importers for different subsystems written by different people at different times when different bits of IPLD were available.

I think this is chaos and we should go a step further than this PR - we should use ipld-raw leaf types everywhere and only offer options to change the DAG structure, not leaf types.

BREAKING CHANGE:

  • options.rawLeaves now defaults to true

This is a `unixfs-importer` but it's possible to output CIDs that
resolve to `dag-raw` nodes if your data is small and you set
`rawLeaves` and `reduceSingleLeafToSelf` to true. In that case
you'll get a `dag-raw` node as the output.

This makes it impossible to set metadata on small files with
`rawLeaves` set to true.

BREAKING CHANGE:

If your data is below the chunk size, and you have `rawLeaves` and
`reduceSingleLeafToSelf` set to true, you'll get a CID that resolves
to a bona fide UnixFS file back with metadata and all that good
stuff instead of a `dag-raw` node.
Builds on #49.  A separate PR as it's slightly contentious.

Sets default DAG construction to be a `dag-pb` root node, `dag-pb`
intermediate nodes and `ipld-raw` nodes for leaves.  This will make
parsing ever so slightly faster and DAG sizes ever so slightly
smaller as there is no protobuf wrapper for the actual file data.

Currently you may end up with `ipld-raw` leaves or `dag-pb` leaves
that contain UnixFS entries with type 'file' or 'raw' depending on
where the importer is invoked from.

E.g. to generate the same CIDs as go-IPFS, `ipfs.add` will result in
a balanced DAG with UnixFS leaf nodes with a type 'file',
`ipfs.files.write` will result in a trickle DAG with UnixFS leaf
nodes of a type `raw`, and specifying CID version 1 will get you
`ipld-raw` leaf nodes and whatever tree strategy you specifed,
default balanced.

I think this is chaos, we should use `ipld-raw` leaf types everywhere
and only offer options to change the DAG structure, not leaf types.
@ribasushi
Copy link

... with type 'file' or 'raw' depending on ...

Archeological question: what is the rationale for using the raw PB type? The encoding size is the same, but you can always seamlessly compose "files" into "larger files", while a raw node is not usable standalone.

@achingbrain
Copy link
Collaborator Author

I'm not sure, it predates my time on the project. My gut feeling is that it's from a time before ipld-raw and CIDv1. Maybe @Stebalien knows more?

@Stebalien
Copy link

I believe the raw dag-pb unixfs type was supposed to be for file leaves. And yes, that came before IPLD.

@Stebalien
Copy link

We chose not to make this the default in go-ipfs to avoid changing CIDs. That is, users expect that ipfs add -r ... will always produce the same CID when run on the same directory.

go-ipfs does default to raw leaves when --cid-version=1 is passed as that changes CIDs anyways.

The plan is to switch all the defaults (cidv1, raw leaves, etc.) all at once, ideally when we get UnixFSv2. Until then, doing this piece-wise is likely to cause quite a bit of noise and annoying issues.

One solution would be to introduce an "importer version" flag. Each time we introduce a fancy new importer feature, we'd define a new set of import options and assign an "importer version" to it. Users would be able to pass --importer=latest to get the latest and greatest features (cidv1, raw leaves, better chunker, inline CIDs, etc.).

@mikeal
Copy link

mikeal commented Jan 16, 2020

Just for clarity, does this default actually persist into js-ipfs or is this being set every time by js-ipfs‘s use of the importer?

If the defaults here are always being overwritten by js-ipfs I think it makes the discussion about defaults for this library a bit clearer and easier.

@achingbrain
Copy link
Collaborator Author

In theory js-IPFS sets the option every time, but some may lack defaults so get passed as undefined which merge-options will ignore if a default has been set.

Anyway, @Stebalien, @alanshaw, @lidel and I had a catch up about this and the way forward is this:

  • Do not make this the default because it'll change hashes.
  • Create import presets or versions that set defaults for various options - cid v1, raw nodes, graph construction etc to let users opt-in to shiny new features in a way that is repeatable and doesn't encourage massive fragmentation of CIDs for a given bit of content

@achingbrain achingbrain deleted the default-raw-leaves-to-true branch February 20, 2020 23:34
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants