Handling of non-utf-8 posix filenames #3

From what I gather, requirement #1 is to be able to back up a Unix filesystem onto IPFS. To do this, it is important to be able to handle any valid Unix/POSIX filename. Unix filenames may contain any byte except null (`0x00`) and `/`; there is no guarantee that they are valid UTF-8 strings, even if that is now the best practice.
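To make the problem concrete, here is a minimal Python sketch of the two checks involved. The helper names `is_valid_posix_name` and `is_valid_utf8` are hypothetical and not taken from any IPLD code; the sketch only treats a filename as a single path component of raw bytes.

```python
def is_valid_posix_name(name: bytes) -> bool:
    """A filename component: non-empty, with no NUL byte and no '/'."""
    return len(name) > 0 and b"\x00" not in name and b"/" not in name


def is_valid_utf8(name: bytes) -> bool:
    """Return True if the raw filename bytes decode as strict UTF-8."""
    try:
        name.decode("utf-8")  # strict mode raises on malformed sequences
        return True
    except UnicodeDecodeError:
        return False


# A name written by a program using ISO-8859-1: legal on Unix, not UTF-8.
latin1_name = "café".encode("iso-8859-1")  # b'caf\xe9'
utf8_name = "café".encode("utf-8")         # b'caf\xc3\xa9'
```

Both names pass `is_valid_posix_name`, but only `utf8_name` passes `is_valid_utf8`, which is exactly the case a backup tool has to survive.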

So the question is how to handle them in the CBOR encoding. The simplest option would be to (1) make them CBOR byte strings and be done with it. This could present a problem when converting to JSON, which has no byte string type.

A slightly more complicated option is to (2) make use of the text/byte distinction in CBOR. If a filename is a valid UTF-8 string then it MUST be encoded as a text string; if not, it is encoded as a byte string. I use the word MUST (in the RFC sense) so that there is only one way to encode a given filename. Given this option, a UTF-8 string can be encoded in JSON as is, but a byte string still needs special treatment.
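A rough sketch of option (2), assuming the standard CBOR major types (2 for byte strings, 3 for text strings) and definite-length encoding. The function names `cbor_head` and `encode_filename` are made up for illustration; a real implementation would presumably use an existing CBOR library instead of hand-rolling the header.

```python
import struct


def cbor_head(major: int, length: int) -> bytes:
    """Build the CBOR initial byte plus the definite-length argument."""
    if length < 24:
        return bytes([(major << 5) | length])
    if length < 1 << 8:
        return bytes([(major << 5) | 24, length])
    if length < 1 << 16:
        return bytes([(major << 5) | 25]) + struct.pack(">H", length)
    if length < 1 << 32:
        return bytes([(major << 5) | 26]) + struct.pack(">I", length)
    return bytes([(major << 5) | 27]) + struct.pack(">Q", length)


def encode_filename(name: bytes) -> bytes:
    """Valid UTF-8 MUST become a text string; anything else a byte string."""
    try:
        name.decode("utf-8")
        major = 3  # text string
    except UnicodeDecodeError:
        major = 2  # byte string
    return cbor_head(major, len(name)) + name
```

Because the choice of major type depends only on the bytes of the name, every filename has exactly one encoding, which is the point of the MUST.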

So the question is how to handle bytes that are not valid UTF-8 in JSON. As I see it there are several options. (j1) Encode the bytes using the non-standard JSON escape sequence `\0x##`, where `##` is the hex value of the byte, for example `\0x77` for the byte `0x77`. (j2) Somehow encode the bytes in Unicode itself; one option would be for the `/` to act as a marker that the next character is a literal byte and not a UTF-8 character, so for example the byte `0x77` could be encoded as `/\u0077`. (j3) If we elect to make use of the CBOR byte/text distinction, then any filenames that are bytes can just be encoded as a Base64 string. (j4) A slightly more compact version of (j3) is to assume the non-UTF-8 string is ISO-8859-1 and convert it to Unicode; when decoding, it is converted back from UTF-8 to ISO-8859-1 with no loss of information.
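Options (j3) and (j4) are both easy to sketch. The function names below are hypothetical; the only assumption is that ISO-8859-1 maps each of the 256 byte values to the Unicode code point with the same number, which is what makes (j4) lossless.

```python
import base64


def j3_encode(raw: bytes) -> str:
    """(j3): represent the raw bytes as a Base64 string."""
    return base64.b64encode(raw).decode("ascii")


def j3_decode(text: str) -> bytes:
    return base64.b64decode(text)


def j4_encode(raw: bytes) -> str:
    """(j4): reinterpret each byte as the ISO-8859-1 character U+00XX."""
    return raw.decode("iso-8859-1")


def j4_decode(text: str) -> bytes:
    """Inverse of j4_encode: map each U+00XX character back to one byte."""
    return text.encode("iso-8859-1")
```

Both round-trip any byte sequence. With (j4) the printable ASCII part of a name stays readable in the JSON, though control bytes (`0x01` to `0x1F`) would still need the usual `\u00XX` JSON escapes.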

For (j3) or (j4) there will need to be a way to signal that the string should not be interpreted as UTF-8 in JSON. One idea I have is to start the string with a `/`, as that cannot appear in any filename.
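One possible shape for the marker idea, combined with (j4), is sketched below. The names `MARKER`, `filename_to_json` and `filename_from_json` are invented for this example, and the exact convention is an assumption rather than anything agreed on.

```python
import json

MARKER = "/"  # never occurs inside a real filename, so it is unambiguous


def filename_to_json(name: bytes) -> str:
    """Emit a JSON string; non-UTF-8 names get the marker plus (j4)."""
    try:
        text = name.decode("utf-8")
    except UnicodeDecodeError:
        text = MARKER + name.decode("iso-8859-1")
    return json.dumps(text)


def filename_from_json(doc: str) -> bytes:
    """Recover the original filename bytes from the JSON string."""
    text = json.loads(doc)
    if text.startswith(MARKER):
        return text[len(MARKER):].encode("iso-8859-1")
    return text.encode("utf-8")
```

For example, `filename_from_json(filename_to_json(b"caf\xe9"))` gives back `b"caf\xe9"`, while a plain UTF-8 name passes through as an ordinary JSON string with no marker.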

Finally, there is always the option to just force UTF-8 (and maybe even a more restrictive set, as I think @mib-kd743naq wants), but then there will be filenames that cannot be represented in IPLD Unixfs, and some other format will likely be needed for backing up the filesystem.

@whyrusleeping @Stebalien others, thoughts?

[Edited to include option (4) for JSON encoding.]
