This repository was archived by the owner on Dec 6, 2022. It is now read-only.

Handling of non-UTF-8 POSIX filenames #3

@kevina

From what I gather from #1, one of the requirements is to be able to back up a Unix filesystem onto IPFS. To do this, it is important to be able to handle any valid Unix/POSIX filename. Unix filenames may contain any byte except null (0x00) and /, and there is no guarantee that they are valid UTF-8 strings, even if UTF-8 is now the best practice.

So the question is how to handle them in the CBOR encoding. The simplest option would be just to (1) make them CBOR byte strings and be done with it. This could present a problem when converting them to JSON, which has no byte string type.

A slightly more complicated option is to (2) make use of the text/byte distinction in CBOR. If a filename is a valid UTF-8 string then it MUST be encoded as a text string; if not, it is encoded as a byte string. I use the word MUST (in the RFC sense) so that there is only one way to encode a given filename. With this option a UTF-8 string can be encoded in JSON as is, but a byte string still needs special treatment.
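To make option (2) concrete, here is a minimal Python sketch of the rule. It hand-encodes the CBOR head (major type 3 for text strings, major type 2 for byte strings) rather than relying on a particular CBOR library. The function names `cbor_head` and `encode_filename` are made up for illustration and are not part of any proposed spec.

```python
def cbor_head(major: int, length: int) -> bytes:
    """Encode a CBOR head: major type plus a definite length."""
    if length < 24:
        # Short lengths are stored directly in the low 5 bits.
        return bytes([(major << 5) | length])
    # Otherwise the low 5 bits say how many length bytes follow.
    for info, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if length < (1 << (8 * size)):
            return bytes([(major << 5) | info]) + length.to_bytes(size, "big")
    raise ValueError("length too large for CBOR")


def encode_filename(name: bytes) -> bytes:
    """Option (2): text string if the name is valid UTF-8, else byte string."""
    try:
        name.decode("utf-8")  # strict validation only; the bytes are not changed
        major = 3  # CBOR major type 3: text string
    except UnicodeDecodeError:
        major = 2  # CBOR major type 2: byte string
    return cbor_head(major, len(name)) + name


if __name__ == "__main__":
    print(encode_filename(b"abc").hex())      # valid UTF-8, text string
    print(encode_filename(b"ab\xff").hex())   # invalid UTF-8, byte string
```

Because the choice of major type is fully determined by whether the bytes validate as UTF-8, each filename has exactly one encoding, which is the point of the MUST.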

So the question is how to handle bytes that are not valid UTF-8 in JSON. As I see it there are several options. (j1) Encode the bytes using the non-standard JSON escape sequence \0x##, where ## is the hex value of the byte, for example \0x77 for the byte 0x77. (j2) Somehow encode the bytes in Unicode itself; one option would be for the / to act as a marker that the next character is a literal byte and not a UTF-8 character, for example the byte 0x77 could be encoded as /\u0077. (j3) If we elect to make use of the CBOR byte/text distinction, then any filenames that are bytes can just be encoded as a Base64 string. (j4) A slightly more compact version of (j3) is to assume the non-UTF-8 string is ISO-8859-1 and convert it to Unicode; when decoding, it will be converted back from UTF-8 to ISO-8859-1 with no loss of information.
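As a rough illustration of (j3) and (j4), the following Python sketch round-trips a non-UTF-8 filename through standard JSON using only the standard library. The helper names are hypothetical. It does not yet mark the string as "not UTF-8"; it only shows that both mappings are lossless.

```python
import base64
import json


def j3_encode(raw: bytes) -> str:
    """(j3) Represent the raw bytes as a Base64 string."""
    return base64.b64encode(raw).decode("ascii")


def j3_decode(text: str) -> bytes:
    return base64.b64decode(text)


def j4_encode(raw: bytes) -> str:
    """(j4) Treat each byte as ISO-8859-1, i.e. map byte N to code point U+00NN."""
    return raw.decode("latin-1")


def j4_decode(text: str) -> bytes:
    # Fails with UnicodeEncodeError if the text contains code points above U+00FF.
    return text.encode("latin-1")


if __name__ == "__main__":
    raw = b"caf\xe9\xff.txt"  # not valid UTF-8
    for enc, dec in ((j3_encode, j3_decode), (j4_encode, j4_decode)):
        as_json = json.dumps(enc(raw))          # ordinary, standards-conforming JSON
        assert dec(json.loads(as_json)) == raw  # lossless round trip
        print(as_json)
```

The ISO-8859-1 mapping works for arbitrary bytes because every value from 0x00 to 0xFF corresponds to exactly one code point in that range, so decoding can never fail and encoding reverses it exactly.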

For (j3) or (j4) there will need to be a way to signal that the string should not be interpreted as UTF-8 in JSON. One idea I have is to start the string with a '/', as that cannot appear in any filename.
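A minimal sketch of the '/' marker idea, here combined with (j4): valid UTF-8 names are passed through unchanged, and anything else is prefixed with '/' and mapped through ISO-8859-1. The function names are invented for this example, and it assumes the input is a single path component, so it never contains '/' itself.

```python
import json

MARKER = "/"  # never appears inside a single filename component


def filename_to_json_string(raw: bytes) -> str:
    """Return the string that would be placed in the JSON document."""
    try:
        return raw.decode("utf-8")             # valid UTF-8: used as is
    except UnicodeDecodeError:
        return MARKER + raw.decode("latin-1")  # marked: bytes mapped via ISO-8859-1


def json_string_to_filename(text: str) -> bytes:
    """Invert filename_to_json_string."""
    if text.startswith(MARKER):
        return text[len(MARKER):].encode("latin-1")
    return text.encode("utf-8")


if __name__ == "__main__":
    for raw in (b"report.txt", "r\u00e9sum\u00e9.txt".encode("utf-8"), b"r\xe9sum\xe9.txt"):
        text = json.loads(json.dumps(filename_to_json_string(raw)))
        assert json_string_to_filename(text) == raw
        print(repr(text))
```

The marker is unambiguous only because '/' is one of the two bytes a Unix filename can never contain; the same check would work with a Base64 payload for (j3) in place of the ISO-8859-1 mapping.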

Finally, there is always the option to just force UTF-8 (and maybe even a more restrictive set, as I think @mib-kd743naq wants), but then there will be filenames that cannot be represented in an IPLD Unixfs, and some other format will likely be needed for backing up the filesystem.

@whyrusleeping @Stebalien others, thoughts?

[Edited to include option (j4) for JSON encoding.]
