This repository was archived by the owner on Aug 2, 2021. It is now read-only.

Let's update the Swarm Manifest #878

Open
3 of 10 tasks
cobordism opened this issue Aug 10, 2018 · 4 comments

@cobordism

cobordism commented Aug 10, 2018

We've long talked about updating the Swarm Manifest data structure to allow for more advanced use-cases, from path-based routing to query string management.
This issue collects all the changes we discussed over time.


Default entry is a path not a hash

  • Default manifest entry should not be a hash associated to the empty path, but a relative path associated with an explicit 'default' marker.

Thus a manifest in which the default is index.html should no longer have the empty path '' associated with a hash that happens to be the same as the one associated with index.html; rather, it should have a default entry which maps to index.html. While we might sometimes need more bzz manifest lookups, we gain simplicity in housekeeping.
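
As a rough illustration of the difference, the sketch below models both layouts as plain Python dictionaries. The `default` key and the `resolve` helper are assumptions made for this example; they are not the actual Swarm manifest schema or API.

```python
# Hypothetical manifest shapes. Only "path" and "hash" come from the
# discussion; the "default" key is an assumed name for the explicit marker.
OLD_MANIFEST = {
    "entries": [
        {"path": "", "hash": "abc123"},            # default: duplicates index.html's hash
        {"path": "index.html", "hash": "abc123"},
    ]
}

NEW_MANIFEST = {
    "default": "index.html",                       # explicit marker holding a relative path
    "entries": [
        {"path": "index.html", "hash": "abc123"},
    ],
}


def resolve(manifest, path):
    """Return the hash for `path`, following the explicit default marker if present."""
    if path == "" and "default" in manifest:
        path = manifest["default"]
    for entry in manifest["entries"]:
        if entry["path"] == path:
            return entry["hash"]
    return None
```

In the new layout, replacing index.html means updating a single entry; the default marker keeps pointing at the right content without a second hash to keep in sync.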

Related issues and discussions
#162 https://gist.github.com/homotopycolimit/7b70436326535ef19028f836680a1f3c
#178 (comment)
https://gist.github.com/homotopycolimit/a83ad855a03190463738e93fcf6aa339
#824

  • prepare for easy updating of current manifests to this new type
  • deprecate the current empty-path-as-default manifests.

Manifest can define its own routing / path matching

  • Prepare for different "types" of manifest.
  • Define the "directory" (default) type

We discussed having a 'directory' type manifest which adheres to specific rules (e.g. any entry ending in / has to be of type bzz-manifest+json) and which can always be FUSE mountable. We discussed having other manifests which explicitly state what to do with over / undermatched paths. For example, if a manifest contains only 'index.html', what should be done with index.h or with index.html?a=b? If we want to enable apps that use path-based routing, then manifests need to be more flexible in how they handle bzz queries - perhaps even allowing regular expressions?
We can also imagine allowing manifests to specify HTTP 301 redirect entries which would map to symlinks in FUSE. We need discussion on this.
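
To make the index.h / index.html?a=b question concrete, here is a minimal sketch of how a manifest-declared matching policy might behave. The policy names and their exact semantics are assumptions for illustration; the discussion has not settled on definitions.

```python
def match(paths, request, policy="exact"):
    """Return the manifest path that `request` resolves to under `policy`, or None.

    Policies (hypothetical names and semantics):
      exact        - only an identical path matches
      ignore_query - strip "?..." from the request, then match exactly
      overmatch    - request is longer than an entry: pick the longest entry prefixing it
      undermatch   - request is shorter than an entry: match only if exactly one entry starts with it
    """
    if policy == "ignore_query":
        request = request.split("?", 1)[0]
        policy = "exact"
    if policy == "exact":
        return request if request in paths else None
    if policy == "overmatch":
        candidates = [p for p in paths if request.startswith(p)]
        return max(candidates, key=len) if candidates else None
    if policy == "undermatch":
        candidates = [p for p in paths if p.startswith(request)]
        return candidates[0] if len(candidates) == 1 else None
    raise ValueError(f"unknown policy: {policy}")
```

With a manifest containing only `index.html`, the request `index.h` resolves under `undermatch` but not under `exact`, and `index.html?a=b` resolves under `ignore_query` or `overmatch` but not under `exact`.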

See also ethereum/go-ethereum#14349 (as well as the discussions linked above)

  • Create github issue discussions for the specific manifest types we might want to implement beyond the 'standard' directory type we are using now. RegExp, Overmatch, undermatch, ignore?querystring... (I'd like input from @lmars and @nagydani and others here)

Prepare CLI commands to work with new manifest

  • swarm --recursive up should generate directory manifest types
  • swarm fs should warn when attempting to mount non-directory manifest types (or even fail).
  • swarm up file should generate a manifest with a named file and a default entry referring to the file as opposed to an unnamed default file only.

Related to this question is that of cleaning up the bzz-list output
#726
#156


Docs

  • Add documentation for the above changes to swarm-guide as needed.
@janos
Member

janos commented Aug 13, 2018

Having the default entry as a path in the manifest is more intuitive, as it is closer to web server configuration. Even though this is something that I would like to see in manifests, there are more reasons against it. The current manifest implementation stores only the hash for the empty path. As manifests are mostly read to get file content, not looked up or manipulated directly by users, I am not sure that this change adds much value, except that it makes manifests more flexible for possible future features. Currently, the default entry serves more like an index for the default content than like a reference, for performance reasons I assume, but as the manifest structure is not intended for users, this is fine. If we change to a path-based default entry, I am for not changing the manifest entry type, but for adding a field to the manifest type to specify the default entry, which has different fields than the manifest entry type. But in general, I would still keep the current manifest structure, just for default page performance.

For more complex web sites on swarm, and even for single page javascript applications, I think that path matching is a better approach than having an implicit feature to serve the same content for subpaths (overmatch, undermatch). It may be better for users to have the freedom to construct routing, including redirections, for their own web application. Default entry is a special case, but I still think that it needs to perform as well as possible, with a hash reference.

@cobordism
Author

What is missing of course right now is any tooling to manipulate manifests.
And you already saw that having the same hash twice (once for index.html and once for '') creates issues for the tools; the encryption caused a problem too.
So while I agree that 'users' are not meant to manipulate manifests by hand, Swarm webmasters are expected to manipulate manifests through not-yet-written tools.

Having the path reference instead of the hash is only slower if index.html (or whatever the default is) isn't listed in the same manifest but within some submanifest... and it will never be slower than calling /index.html directly.
The question is, is the little performance increase worth the extra complexity in the tooling?

@cobordism
Author

  • better concrete manifest entry type that will allow us to describe metadata entries better, for example for ACTs

@acud
Member

acud commented Sep 14, 2018

In my endeavor to try and crunch the information a bit more, I can give further input regarding the issues above, and add more on top:


no empty directory representation:

As of now, short of manually editing a manifest, there is no way in swarm to represent an empty directory in a manifest. This is because directory manifests and trie split entries are not distinguished from one another, which in turn is due to the fact that directories have no formal representation in swarm.
Solution: see trailing slash below


default entry behaviour

Default entry path has to be manually specified in the CLI. This should be mended and we should at least try to identify a default index.html, or an entry point specified within a package.json, when uploading with swarm up (with a possibility to turn this off). So the default behaviour would auto-detect, a --defaultpath would override, and --nodefaultpath would leave it empty.
Open question - should we allow subpaths as default entry? e.g. that the path is ‘./src/index.html’?
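
The proposed precedence (explicit override, then opt-out, then auto-detection) could look roughly like the sketch below. The function and parameter names are hypothetical, and reading the package.json entry point from its "main" field is an assumption; the discussion does not say which field is meant.

```python
import json
import os


def detect_default_path(root, default_path=None, no_default_path=False):
    """Pick the default entry for an upload rooted at `root`.

    Returns a path relative to `root`, or None when no default should be set.
    """
    if default_path is not None:          # an explicit --defaultpath wins
        return default_path
    if no_default_path:                   # --nodefaultpath leaves the default empty
        return None
    if os.path.isfile(os.path.join(root, "index.html")):
        return "index.html"
    pkg = os.path.join(root, "package.json")
    if os.path.isfile(pkg):
        with open(pkg, encoding="utf-8") as fh:
            # Assumption: "entry point" refers to the "main" field.
            main = json.load(fh).get("main")
        if isinstance(main, str) and main:
            # Normalise "./src/index.html" to "src/index.html".
            return os.path.normpath(main).replace(os.sep, "/")
    return None
```

Note that this sketch happily returns a subpath such as src/index.html from package.json, which is exactly the case the open question above asks about.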


default entry is a manifest entry

This should not be the case: the default entry should be specified at the manifest metadata level, since keeping it among the entries introduces redundant data into the manifest entries data structure.


inconsistent naming conventions in the manifest definition

camelCase contentType as opposed to snake_case mod_time


allowing redirections and symlink alignment

Can be resolved by adding a symlink file type as @lmars suggested in ethereum/go-ethereum#14345 (comment)


Tasks:

  1. Manifest tooling:
    • Auto detect default entry - index.html/package.json entry point (783)
    • Allow users to disable default entry detection
    • Create flag to follow symlinks (solves 3601)
    • Go through manifest error messages (solves 726)
  2. Manifest data structure:
    • Add default entry as a path, not a hash, next to (as opposed to inside of) entries
    • Add manifest type annotation - trie split or a directory
  3. Traversal behaviour - path match/undermatch/overmatch:
    • Manifest without default entry should return bzz-list on HTTP GET - it should be possible to disable this on the manifest level - similar to the autoindex directive in nginx or mod_autoindex/Indexes in httpd

NOTE: Unix filesystems allow any byte except NUL or / in a filename

  • This is problematic when choosing to undermatch on ? or other characters
  • This can be cumbersome when allowing manifests to define their own matching rules
    Recommendation: return an element only if it is directly found - do not allow overmatch or undermatch
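
A minimal sketch of the recommended strict behaviour, combined with the autoindex-style listing from the task list, might look as follows. The manifest is modelled as a plain dictionary, and the `default` and `autoindex` keys are hypothetical names chosen for this example.

```python
def serve(manifest, path):
    """Resolve `path` strictly: exact entry, default entry, listing, or not found.

    Returns a (kind, value) tuple where kind is "entry", "list" or "not_found".
    """
    entries = manifest["entries"]            # mapping: path -> hash
    if path == "":
        default = manifest.get("default")
        if default in entries:
            return ("entry", entries[default])
        if manifest.get("autoindex", True):  # hypothetical switch, on by default
            return ("list", sorted(entries))
        return ("not_found", None)
    if path in entries:                      # exact match only: no over/undermatch
        return ("entry", entries[path])
    return ("not_found", None)
```

Because a filename may legitimately contain `?`, the request a.txt?x=1 is treated here as a literal path and is simply not found, rather than being undermatched to a.txt.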

Trailing slash problem:

  1. The current structure functions both as an index and as a filesystem representation:
    • The two are designed to solve different problems
    • For most cases, the first is auxiliary to the latter
  2. In the current design we break (break = create a new trie fork)
    • on a common prefix (has more than one child by definition)
    • ..and that’s it
  3. In the current design a slash is irrelevant - only the number of children matters
    • Assumptions:
      • This is a search trie design consideration, not a file system representation consideration
      • A file system representation cares about slashes (in fact, *nix filesystems care for slashes only)
      • A file system representation does not care about the number of children inodes as a precondition for a directory inode’s creation
      • The network complexity on lookups benefits from this design in the case that a directory contains a large number of flat subdirectories that contain only one element => no trie forks due to slashes
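
The forking rule can be illustrated with a toy compacted trie. This is a sketch of the general technique under simplifying assumptions (nested dictionaries, string leaves), not the actual Swarm manifest code.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of strings `a` and `b`."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def insert(node, path, value):
    """Insert `path` into a compacted trie stored as {edge_label: child}.

    A child is either a leaf value (str) or a nested dict (a fork).
    """
    for label in list(node):
        k = common_prefix_len(label, path)
        if k == 0:
            continue
        child = node[label]
        if k == len(label) and isinstance(child, dict):
            insert(child, path[k:], value)   # descend into the existing fork
            return
        # Split the edge at the common prefix, wherever "/" happens to fall.
        del node[label]
        node[label[:k]] = {label[k:]: child, path[k:]: value}
        return
    node[path] = value                       # no shared prefix: new edge, no fork


trie = {}
for p, h in [("img/a.png", "h1"), ("img/b.png", "h2"),
             ("index.html", "h3"), ("docs/only/readme.md", "h4")]:
    insert(trie, p, h)
```

Here img/ and index.html fork at the shared prefix "i" rather than at a slash, while docs/only/readme.md stays a single edge despite its two slashes, so nothing in the structure marks docs/ or docs/only/ as directories.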

Possible solutions:

  1. Adopt Ext4 (or similar) structure - where:
    1.1. Each directory would be a manifest
    1.2. For each directory we keep an index (the "indice" field) stored inside the manifest, side by side with the entries. This can also allow different implementations of the search tries to be included in different types of manifests (or according to runtime optimisations).
    1.3. The index could be stored in the JSON as binary in base64 encoding
    1.4. The index node will contain an inode or path which will be mapped to the relevant manifest entry through a hashmap/dictionary
    1.5. Optional: store dot + dotdot entries to allow bidirectional manifest traversal?
    In this case, we end up with something like:
{
   "entries":[
      {
         "inode":1,
         "path":"folder1", //might not be necessary 
         "type":"application/bzz-manifest+json",
         "hash":"<hash>"
      },
      {
         "inode":2,
         "path":"file1.txt", //might not be necessary
         "type":"text/plain",
         "hash":"<hash>"
      }
   ],
   "indice":"c29tZXJhbmRvbWluZGljZXRleHRnb2VzaGVyZQ==",
   "type": "directory",
   "default":2 //inode; faster lookup?
}
  2. Find a hybrid implementation between the current implementation and the Ext4 model where:
    2.1. We insert artificial directory entries into manifests when traversing filesystem trees upon manifest creation; with the drawback that:
    2.2. this structure cannot be enforced by definition - it’s an implementation detail and it is up to the client/uploader whether to conform to this structure, which could in turn lead to inconsistencies that make it harder for the network to keep its promise
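
For the Ext4-style layout in option 1, a consumer might build the inode and path dictionaries mentioned in 1.4 along these lines. This is only a sketch: the helper names are hypothetical, the hashes are placeholders, and the base64 "indice" blob is left out because its format is not defined above.

```python
import json

# Same shape as the example manifest above, minus the comments and the
# "indice" blob; "<hash-1>" and "<hash-2>" are placeholders.
MANIFEST_JSON = """
{
  "entries": [
    {"inode": 1, "path": "folder1", "type": "application/bzz-manifest+json", "hash": "<hash-1>"},
    {"inode": 2, "path": "file1.txt", "type": "text/plain", "hash": "<hash-2>"}
  ],
  "type": "directory",
  "default": 2
}
"""


def load_directory_manifest(text):
    """Parse a directory-type manifest and index its entries by inode and by path."""
    manifest = json.loads(text)
    if manifest.get("type") != "directory":
        raise ValueError("not a directory manifest")
    by_inode = {e["inode"]: e for e in manifest["entries"]}
    by_path = {e["path"]: e for e in manifest["entries"]}
    return manifest, by_inode, by_path


def default_entry(manifest, by_inode):
    """Return the entry the "default" inode points at, or None if unset."""
    return by_inode.get(manifest.get("default"))
```

Referencing the default by inode keeps it outside the entries list, which matches the suggestion that the default should live at the manifest metadata level.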

references:

#156 - manifests without a default path should default to list request
#878 - this sip - default entry is a path not a hash, manifest can define its own routing / path matching, prepare cli commands to work with new manifest
#162 - default entry
#178 - trailing slash problem [1]
#726 - error msg refining
#783 - default entry - auto detect on index.html
ethereum/go-ethereum#14349 - directory manifest type [1]
ethereum/go-ethereum#14345 - support http redirect
ethereum/go-ethereum#3601 - swarm recursive up fails on symlinks - resolved?

@acud acud added this to the 0.3.5 milestone Sep 20, 2018
@acud acud self-assigned this Sep 20, 2018