
swarm: Directory manifest type #14349

Closed
lmars opened this issue Apr 18, 2017 · 9 comments

@lmars
Contributor

lmars commented Apr 18, 2017

Collections of files are currently modelled using JSON manifests containing a list of entries, with each entry having a path, hash and some file attributes like size, content type, mode etc.

The manifests are stored as a trie such that multiple files which share a common prefix are represented by a sub-manifest.

Consider storing a collection of three files named foo.txt, files/01.txt and files/02.txt. These would be represented by the following manifests:

# <root-hash>
[
  {"path": "f", "type": "application/bzz-manifest+json", "hash": "<hash1>"}
]

# <hash1>
[
  {"path": "iles/0", "type": "application/bzz-manifest+json", "hash": "<hash2>"},
  {"path": "oo.txt", "type": "text/plain", "hash": "<hash>"}
]

# <hash2>
[
  {"path": "1.txt", "type": "text/plain", "hash": "<hash>"},
  {"path": "2.txt", "type": "text/plain", "hash": "<hash>"}
]

This representation leads to added complexity when treating this collection like a Unix filesystem (e.g. mounting it with FUSE or listing files in a directory), and is actually less performant than if the manifests just represented directory indexes (compare looking up f -> iles/0 -> {1,2}.txt to an equivalent directory based lookup files -> 0{1,2}.txt).
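
To make the lookup cost concrete, the trie traversal described above can be sketched in Python. This is only an illustration, not Swarm's implementation: the in-memory `STORE`, the short placeholder hashes and the function name `trie_lookup` are all assumptions made for the example.

```python
MANIFEST_TYPE = "application/bzz-manifest+json"

# Hypothetical in-memory store: manifest hash -> list of entries.
# The layout mirrors the three trie manifests shown above.
STORE = {
    "root": [{"path": "f", "type": MANIFEST_TYPE, "hash": "h1"}],
    "h1": [
        {"path": "iles/0", "type": MANIFEST_TYPE, "hash": "h2"},
        {"path": "oo.txt", "type": "text/plain", "hash": "foo"},
    ],
    "h2": [
        {"path": "1.txt", "type": "text/plain", "hash": "f01"},
        {"path": "2.txt", "type": "text/plain", "hash": "f02"},
    ],
}


def trie_lookup(store, root_hash, path):
    """Return (content_hash, manifests_fetched); content_hash is None if not found."""
    current, remaining, fetched = root_hash, path, 0
    while True:
        entries = store[current]
        fetched += 1
        for entry in entries:
            prefix = entry["path"]
            if not remaining.startswith(prefix):
                continue
            if entry["type"] == MANIFEST_TYPE:
                # Descend into the sub-manifest with the matched prefix stripped.
                current, remaining = entry["hash"], remaining[len(prefix):]
                break
            if remaining == prefix:
                return entry["hash"], fetched
        else:
            # No entry matched the remaining path in this manifest.
            return None, fetched
```

With this store, `trie_lookup(STORE, "root", "files/01.txt")` fetches three manifests (root, `<hash1>`, `<hash2>`) before reaching the file, while `foo.txt` needs two.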

File paths can also end with a trailing slash, which is problematic as well (there is no way to access a file named files/ in a Unix filesystem, as it would be treated like a directory lookup).

To solve these issues, we propose adding an explicit "directory" manifest type which is more like a traditional Unix filesystem:

  • entries of type application/bzz-manifest+json are equivalent to directory inodes (i.e. a list of files or subdirectories)
  • all other entries are equivalent to file inodes
  • no paths are permitted to contain a / character

This makes it easier to traverse the collection like a Unix filesystem, and also solves the issue of files which have trailing slashes being inaccessible in a FUSE mount (since they are no longer permitted).
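
The rules listed above could be checked by tooling along these lines. This is a rough Python sketch; the function names `validate_directory_manifest` and `is_directory` are hypothetical and only the three rules from the proposal are encoded.

```python
MANIFEST_TYPE = "application/bzz-manifest+json"


def is_directory(entry):
    """Entries of the manifest type act as directory inodes; all others are files."""
    return entry["type"] == MANIFEST_TYPE


def validate_directory_manifest(entries):
    """Return a list of violations of the 'no / in paths' rule for one manifest."""
    errors = []
    for entry in entries:
        name = entry["path"]
        if "/" in name:
            errors.append(f"path contains '/': {name!r}")
    return errors
```

A trie-style entry such as `iles/0`, or a file named `files/`, would be rejected by this check, while `files` and `foo.txt` pass.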

The above collection of files would then be represented by:

# <root-hash>
[
  {"path": "files", "type": "application/bzz-manifest+json", "hash": "<hash1>"},
  {"path": "foo.txt", "type": "text/plain", "hash": "<hash>"}
]

# <hash1>
[
  {"path": "01.txt", "type": "text/plain", "hash": "<hash>"},
  {"path": "02.txt", "type": "text/plain", "hash": "<hash>"}
]
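
As an illustration of how such directory manifests might be produced and traversed, here is a Python sketch. Everything in it is an assumption for the example: the `store` dict, the helper names, the hard-coded `text/plain` type, and the use of SHA-256 over the JSON as a stand-in hash (Swarm's real chunk hashing is different). It also does not handle a name being used as both a file and a directory.

```python
import hashlib
import json

MANIFEST_TYPE = "application/bzz-manifest+json"


def build_directory_manifests(files, store):
    """Store one manifest per directory for `files` (path -> content hash).

    `store` maps manifest hash -> list of entries. Returns the root hash.
    """
    tree = {}
    for path, content_hash in files.items():
        *dirs, name = path.split("/")
        node = tree
        for d in dirs:
            node = node.setdefault(d, {})
        node[name] = content_hash
    return _store_dir(tree, store)


def _store_dir(node, store):
    entries = []
    for name in sorted(node):
        child = node[name]
        if isinstance(child, dict):
            child_hash = _store_dir(child, store)
            entries.append({"path": name, "type": MANIFEST_TYPE, "hash": child_hash})
        else:
            entries.append({"path": name, "type": "text/plain", "hash": child})
    # Placeholder hash: SHA-256 of the JSON encoding, not Swarm's chunk hash.
    digest = hashlib.sha256(json.dumps(entries, sort_keys=True).encode()).hexdigest()
    store[digest] = entries
    return digest


def directory_lookup(store, root_hash, path):
    """Resolve `path` one component at a time; return the entry or None."""
    entry = {"path": "", "type": MANIFEST_TYPE, "hash": root_hash}
    for name in path.split("/"):
        if entry["type"] != MANIFEST_TYPE:
            return None  # tried to descend into a file
        children = store[entry["hash"]]
        entry = next((e for e in children if e["path"] == name), None)
        if entry is None:
            return None
    return entry
```

Each path component maps to exactly one manifest fetch, so listing a directory is just reading the entries of the manifest that `directory_lookup` returns for it.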

/cc @zelig @homotopycolimit

@zelig
Contributor

zelig commented Apr 20, 2017

ok we should start writing this up in a wiki or gist to see the whole picture.

In general the direction is to hide manifests. They should not be the format used for directory indexes.
My concern is that we still want to allow directory trees organised as a trie and break on slash.
The primary motivation is the guaranteed limited size of the log(n) intermediate chunks when a file is accessed.
In such cases we need to determine what we do for upload/download and trailing slash.
For upload from a directory, no trailing slash will be produced, and a manifest-wide option can be provided on each manifest to indicate whether a trailing slash is ignored for routing or not.
The behaviour when downloading such a manifest is not problematic (current swarm is correct).

Downloading the compacted trie-style manifest is problematic, however. Even in such cases the problem only occurs if there are paths both with and without the slash, e.g., path and path/. But if this is the case there will be a trie fork on path, i.e., path will point to a manifest that has entries for '\path' and an empty path. We could stipulate that the empty path (default entry) is always explicitly represented in the manifest as a reference to the path of another manifest entry.

A related issue is the ability to set the default entry for each directory as a path, e.g., an index.html.
In the special case that you want a different entry for path and path/ when you upload from a directory, you then want 2 default entries: one for the empty path and a separate one for the trailing slash.

# <root-hash>
{
  "default": "files/foo.txt",
  "entries": [
    {"path": "f", "type": "application/bzz-manifest+json", "hash": "<hash0>"}
  ]
}

# <hash0>
{
  "default": "oo.txt",
  "prefix": "/",
  "entries": [
    {"path": "iles", "type": "application/bzz-manifest+json", "hash": "<hash1>"},
    {"path": "oo.txt", "type": "application/bzz-manifest+json", "hash": "<hash2>"}
  ]
}

# <hash1>
{
  "prefix": "/",
  "entries": [
    {"path": "01.txt", "type": "text/plain", "hash": "<hash>"},
    {"path": "02.txt", "type": "text/plain", "hash": "<hash>"}
  ]
}

# <hash2>
{
  "entries": [
    {"path": "", "hash": "<hash>"},
    {"path": "/", "hash": "<hash>"}
  ]
}

this is equivalent (is downloaded as):

root/
    files/
        01.txt
        02.txt
    foo.txt
    foo.txt._trailing_slash_
    ._manifest_

# cat ._manifest_:
{
  "default": "foo.txt"
}

would something like this make more sense?

@cobordism
Contributor

There was some talk at the Swarm Orange Summit 2018 about different types of manifest beyond the current simple type.
There seems to be a feeling that manifests should be more configurable, to allow things like path-based routing and regular expression matching, for example.
Perhaps we should have a general SIP about how to extend the Swarm Manifest spec for 0.4.

@acud
Member

acud commented Oct 15, 2018

This will generally obsolete the current manifest type and make it a special case needed only for traversals. It will also obsolete most fields on the current manifest type, as they will no longer be needed.

I think this change will make the current traversal code more complex in order to reach the exact same functionality it possesses today. We would have to traverse both types anyway.

A more favourable solution would be to just add a bool flag on the current manifest type.
As far as I understand, the only case where the information that a directory is marked as a directory is relevant is when an empty directory has to be created on the filesystem.

@cobordism
Contributor

You mean a bool flag "ismanifest=true"?

Sure, that's how you'd show that this manifest is of directory type, but the rules are about more than just having empty dirs: it must also contain no filenames ending in /, for example. No?

@acud
Member

acud commented Oct 17, 2018

in this regard, a trie split and a directory manifest are both cases of manifests. i mean a flag to denote isDirectory=true.
rules are to be enforced by tooling, sure.

@lmars
Contributor Author

lmars commented Oct 18, 2018

@justelad I think it is worth zooming out and perhaps coming up with a more robust design.

Having the trie represented at the level of the manifest is, in my opinion, mixing two different abstractions into one.

I think the system should be layered:

  1. an Index layer that is just a mapping of UTF8 strings to blobs of data with an efficient, chunk-level representation

  2. a File layer which associates blobs of data with metadata (much like the current ManifestEntry, a specific type of Index with keys like mode, contentType, modTime etc.)

  3. a Filesystem layer that is a high level organisation of Files and traversal mechanisms (much like the current Manifest, a specific type of Index that maps paths to Files).
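
One way to picture the three layers is the following Python sketch. It is purely illustrative and rests on several assumptions not found in the thread: the class and method names, the JSON/hex encoding used to turn an Index into a blob, and the `list_dir` helper. The chunk-level representation mentioned for the Index layer is not modelled at all.

```python
import json


class Index:
    """Layer 1: a mapping of UTF-8 string keys to blobs of data."""

    def __init__(self, items=None):
        self._items = dict(items or {})

    def put(self, key, blob):
        self._items[key] = blob

    def get(self, key):
        return self._items.get(key)

    def keys(self):
        return sorted(self._items)

    def encode(self):
        # Placeholder serialisation (JSON with hex-encoded blobs).
        return json.dumps({k: v.hex() for k, v in sorted(self._items.items())}).encode("utf-8")

    @classmethod
    def decode(cls, raw):
        return cls({k: bytes.fromhex(v) for k, v in json.loads(raw.decode("utf-8")).items()})


class File(Index):
    """Layer 2: an Index whose keys hold the data and its metadata."""

    @classmethod
    def new(cls, data, content_type, mode):
        return cls({
            "data": data,
            "contentType": content_type.encode("utf-8"),
            "mode": oct(mode).encode("utf-8"),
        })


class Filesystem(Index):
    """Layer 3: an Index that maps paths to encoded Files."""

    def add_file(self, path, file):
        self.put(path, file.encode())

    def open(self, path):
        raw = self.get(path)
        return None if raw is None else File.decode(raw)

    def list_dir(self, directory=""):
        # Names directly under `directory`, derived from the stored paths.
        prefix = directory.rstrip("/") + "/" if directory else ""
        names = {k[len(prefix):].split("/", 1)[0] for k in self.keys() if k.startswith(prefix)}
        return sorted(names)
```

In this sketch, whether the Index is stored as a trie or as directory-style indexes becomes an internal detail of layer 1, and the File and Filesystem layers only see string keys.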

@acud
Member

acud commented Oct 18, 2018

@lmars I agree with you, and I think this is evident from my comment in the swarm manifest SIP. The idea to ditch the trie design has not been accepted so far by the team.

@cobordism
Contributor

who is 'the team' here? who makes that decision? whose answer / input do you need?

@nonsense
Member

nonsense commented Jun 3, 2019

@lmars @homotopycolimit @zelig let's continue this conversation at https://github.com/ethersphere/swarm as the Swarm codebase lives there from now on.

@nonsense closed this as completed Jun 3, 2019