
Support UTF-8 strings out of the box #43


Closed
pannous opened this issue Mar 19, 2018 · 31 comments

@pannous

pannous commented Mar 19, 2018

Currently, strings seem to be encoded in something like UTF-16.
(data (i32.const 1000) "\03\00\00\00H\00I\00H\00")
Would it be possible to use UTF-8 instead (as a compile flag or default)?

@dcodeIO
Member

dcodeIO commented Mar 19, 2018

In general, the standard library is designed in a way that it can be replaced with a custom one, including replacing the string implementation, but UTF-8 support would also be a great addition as a compile-time flag (or maybe a special string class), as you propose.

The idea behind picking UTF-16 in the first place was that the most prominent host environment will be a browser with JavaScript, where string methods usually deal with UTF-16 code units. Though little is known about the exact requirements future bindings will have. Whatever default implementation(s) we finally pick should ideally not require translating strings from one encoding to another when calling between the host and WASM.

@pannous
Author

pannous commented Mar 19, 2018

Is it possible to directly instantiate a wasm UTF-16 array into a JS string?

What I've seen so far were transformations like:

	// read code units until a NUL terminator or the end of the buffer
	for (var i = offset; i < buffer.length && buffer[i]; i++)
	  	str += String.fromCharCode(buffer[i]);

Or put another way: is let string = buffer.toString('utf16le') so much more efficient than let string = buffer.toString('utf8')?

If no direct instantiation is possible, that would be another reason to use UTF-8.
By the time wasm gets direct access to JS objects, the marshaling will probably be solved by the runtime anyway, or browsers might even change their internal representation. (?)
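
For reference, modern hosts can decode either encoding straight out of linear memory with the standard TextDecoder, so no hand-rolled loop is needed. A minimal sketch (the pointer/length layout here is an assumption for illustration, not the compiler's actual ABI):

    // Sketch: decode a string directly from a WebAssembly.Memory.
    // Assumes `ptr` points at `byteLength` bytes of string data.
    function readString(memory: WebAssembly.Memory, ptr: number,
                        byteLength: number, encoding: 'utf-16le' | 'utf-8'): string {
      const bytes = new Uint8Array(memory.buffer, ptr, byteLength);
      return new TextDecoder(encoding).decode(bytes);
    }

Whether 'utf-16le' decodes measurably faster than 'utf-8' is VM-dependent; the real difference is that UTF-16 data can also be viewed as code units without decoding at all.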

@dcodeIO
Member

dcodeIO commented Mar 19, 2018

> If no direct instantiation is possible, that would be another reason to use UTF-8.

The expectation I have is that JS VMs use UTF-16 internally, so it isn't necessary to convert between encodings when loading / storing strings from WASM. That's somewhat shadowed by the fact that one cannot directly instantiate a string from a buffer, of course (except when using node, but even node has to convert bytes to a string some way or another). I must also admit that I haven't investigated whether UTF-16 is actually what a browser uses.

Ideally, we should pick as the default the format that implies the fewest conversion steps (even if the runtime does them for us), but with just reading memory and calling imports / exports, it's hard to tell at this point in time.

@MaxGraey
Member

UTF-8 would also require additional methods to work with UTF-8 and UTF-16 separately. For example, Rust has String::from_utf16 and String::from_utf8, String::len() and String::chars().count(), etc. None of that is backwards compatible with TypeScript and would require additional steps for interop with the JavaScript side.

@toonsevrin

Is there a std function already to convert to UTF-8? If not, I'll implement it if you provide a preferred location for it.

@MaxGraey
Member

Can this be closed for now as well?

@pannous
Author

pannous commented Jun 21, 2018

Is UTF-8 the default now, or is it optional?

@MaxGraey
Member

MaxGraey commented Jun 21, 2018

There are special methods toUTF8() and get lengthUTF8(). But there is no method for fromUTF8. Is it really needed?

Internally, all strings use UTF-16LE, same as JS/TS.

dcodeIO changed the title from "UTF-8" to "Support UTF-8 strings out of the box" on Jul 10, 2018
@fcrick
Contributor

fcrick commented Jul 15, 2018

I was also surprised to find a double-byte internal encoding. It's double the memory...

@dcodeIO
Member

dcodeIO commented Jul 15, 2018

There are two major reasons for picking UTF-16:

  • Compatibility: JavaScript, and thus TypeScript, commonly uses Web APIs that specifically expect UTF-16 character codes. While JS nowadays also supports some code point-aware APIs like String.fromCodePoint and String#codePointAt, the latter, for example, still expects the underlying data to be UTF-16, just like all the other String APIs.
  • Efficiency: When transferring string data between WebAssembly and JavaScript, there's no additional encoding or decoding step required, as there would be with UTF-8.

Nonetheless, better support for UTF-8 encoded data is of course something we should investigate further. Possible improvements range from efficient conversion functions, as partly implemented above, to a full-fledged UTF8String class with a UTF-16 compatibility layer that can be used alternatively, if someone is willing to implement it.
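
To illustrate the efficiency point: with UTF-16 in linear memory, the host can build a JS string from the raw code units without any transcoding step. A sketch, assuming a hypothetical pointer/length convention:

    // Sketch: UTF-16 code units in wasm memory map 1:1 to JS string
    // code units, so no encoder/decoder is involved.
    function readUtf16(memory: WebAssembly.Memory, ptr: number, units: number): string {
      const codeUnits = new Uint16Array(memory.buffer, ptr, units); // ptr must be 2-byte aligned
      return String.fromCharCode(...codeUnits); // chunk this call for very long strings
    }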

@MaxGraey
Member

Closing due to inactivity. Feel free to reopen.

@vgrichina

That's actually a very important issue for our use case (https://nearprotocol.com/). We don't have to interoperate with JS whatsoever.

Coding our own drop-in UTF8String for this purpose shouldn't be a big deal; the bigger problem, I feel, is making AssemblyScript output string literals for it. How do I approach that?

@MaxGraey
Member

MaxGraey commented Jan 17, 2019

@vgrichina the String class contains special methods for that: String.fromUTF8, String#lengthUTF8 and String#toUTF8, which provide conversions between the builtin UTF-16LE string representation and UTF-8.

@pannous
Author

pannous commented Jan 17, 2019

For use cases outside of the browser context it would still be very nice to have an option/flag to use UTF-8 internally, but if it's too much work for you, the ticket can remain closed.

@vgrichina

@MaxGraey I know about these methods. The whole point is to avoid these conversions, as they only waste CPU and memory. UTF-16LE string literals also waste space in the .wasm file (though that's probably negligible if compressed).

Also, these methods are super inconvenient to use. Why not return a Uint8Array instead of a raw pointer? I basically have to make yet another full copy to get a usable object.

@MaxGraey
Member

MaxGraey commented Jan 18, 2019

UTF-16LE is a legacy of the JS runtime, but it also greatly simplifies the string methods, which would be quite complicated with UTF-8. It also simplifies and speeds up interop between JS and wasm.

> Why not return a Uint8Array instead of a raw pointer?

The extra UTF-8 methods were created to be as simple as possible for interop with C++ embedder VMs, which represent strings as a simple flat null-terminated byte array without a header. Returning a Uint8Array in this case is unnecessary and produces extra allocations.
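
For illustration, a host can consume such a flat null-terminated buffer with standard APIs only (a sketch of the convention described above, not a documented ABI):

    // Sketch: read a NUL-terminated UTF-8 string from wasm linear memory.
    function readCString(memory: WebAssembly.Memory, ptr: number): string {
      const bytes = new Uint8Array(memory.buffer);
      let end = ptr;
      while (bytes[end] !== 0) end++; // scan for the NUL terminator
      return new TextDecoder('utf-8').decode(bytes.subarray(ptr, end));
    }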

@dcodeIO
Member

dcodeIO commented Jan 18, 2019

I think the only sensible solution to this (when looking at it as a feature request) is to provide a compile time switch to select between UTF16 and UTF8, and implement everything String-y twice depending on what's selected. The challenge here is that other parts of the standard library a) make assumptions about how a string is stored and b) rely on internal string helpers for performance reasons that are naturally hard to abstract into two different versions.

While there are certainly clever ways to make this happen (backing strings with something easier to abstract, basing stuff on encodings) when throwing enough manpower at it, we should also consider that future specs like host bindings might change the game a bit here if they'd explicitly dictate a specific encoding for interoperable string arguments. See for example: WebAssembly/interface-types#13

@MaxGraey
Member

MaxGraey commented Jan 18, 2019

I think host bindings will use DOMString because WebIDL uses it. DOMString is represented as UTF-16.

@vgrichina

Actually it looks like e.g. V8 doesn't even always use UTF16 internally:
https://stackoverflow.com/a/40612946/341267

@dcodeIO
Member

dcodeIO commented Jan 18, 2019

Yeah, it's hard to tell from just looking at V8, but host bindings will, eventually, have to specify what an interoperable string argument is.

@MaxGraey
Member

MaxGraey commented Jan 18, 2019

JavaScript runtimes don't even always use UTF-16. In many cases they interpret the charset as UCS-2, for example in String#length. In UTF-16 the character '𝌆' is one symbol, but in JS '𝌆'.length == 2.
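
A quick illustration of the code unit vs. code point distinction, in plain JS:

    const s = '𝌆';    // U+1D306, outside the BMP, stored as a surrogate pair
    s.length;         // 2       -> counts UTF-16 code units (UCS-2 style)
    [...s].length;    // 1       -> iterating by code points
    s.charCodeAt(0);  // 0xD834  -> the high surrogate, not a character
    s.codePointAt(0); // 0x1D306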

@MaxGraey
Member

MaxGraey commented Jan 18, 2019

I recommend reading this article:
http://lucumr.pocoo.org/2014/1/9/ucs-vs-utf8

UTF-8 is not always more space-efficient than UTF-16, especially for Chinese/Japanese text. But the main advantage of UTF-16 as the internal representation is more efficient string operations: length returns the number of code units (and, for BMP-only text, the number of characters) in constant O(1) time.
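
For contrast, counting code points in UTF-8 requires scanning the bytes, which is why length would become O(n) there. A sketch:

    // Sketch: count code points in UTF-8 by counting only lead bytes,
    // i.e. skipping continuation bytes of the form 0b10xxxxxx. O(n).
    function utf8CodePointCount(bytes: Uint8Array): number {
      let count = 0;
      for (const b of bytes) {
        if ((b & 0xC0) !== 0x80) count++;
      }
      return count;
    }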

@vgrichina

@MaxGraey it's possible to implement UTF-8 strings with O(log n) cost for the operations you are describing, plus improved speed for some other operations (because ropes). Basically, use a slightly altered version of ropes that prefers to cut the UTF-8 string into chunks of consistent byte length (see the sketch below):
https://kukuruku.co/post/ropes-fast-strings/
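
A minimal sketch of that idea (purely illustrative, not a proposed stdlib design): leaves hold bounded UTF-8 chunks and each internal node caches the code point count of its left subtree, so indexing stays O(log n) in a balanced rope.

    type Rope =
      | { kind: 'leaf'; bytes: Uint8Array; points: number }
      | { kind: 'node'; left: Rope; right: Rope; leftPoints: number };

    // Locate the leaf chunk containing code point `i`.
    function leafAt(r: Rope, i: number): { leaf: Uint8Array; offset: number } {
      while (r.kind === 'node') {
        if (i < r.leftPoints) {
          r = r.left;          // target is in the left subtree
        } else {
          i -= r.leftPoints;   // skip past the left subtree's code points
          r = r.right;
        }
      }
      return { leaf: r.bytes, offset: i }; // offset is a code point index within the chunk
    }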

However, I agree that it would seriously increase complexity, and there are far more important missing features (like dynamic dispatch).

I think a reasonable middle ground might be something like:

  1. Don't rely on the internal string representation in the stdlib. If that's a crucial optimization, then both an implementation using the public String API and an optimized one can be available, switched with a compiler flag, etc.
  2. Make it possible to compile string constants as UTF-8. This can probably even be made to work with the current string impl by inserting a fromUTF8 call automatically (however, this is somewhat undesirable as it also creates a copy).
  3. Long-term, I think an approach similar to V8's is needed: i.e. have multiple implementations switched behind common interfaces, based on how a string is constructed, what chars it contains, etc.

As for host bindings, it looks like UTF-8 has been there for more than a year. Not sure why they would change to UTF-16.

@MaxGraey
Member

MaxGraey commented Jan 18, 2019

I know the data structure called a rope. It's actually used mostly for mutable string operations like insert and delete, which then take O(log n) complexity instead of O(n). But the String class has no such operations; it includes only immutable operations, which a rope structure actually slows down (I mean random indexing and split). Ropes are definitely very useful for applications like editors with heavy split/join string workloads, but I don't know of any language that uses ropes by default in its runtime (except the internal JavaScript runtime in some VMs).

@MaxGraey
Member

MaxGraey commented Sep 26, 2019

We could introduce "compact strings". A similar approach is used in Java 9:
https://www.baeldung.com/java-9-compact-string
Due to the fact that all string operations are immutable, this is also possible in AS (sketch below).
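
A minimal sketch of the idea (illustrative only, not AssemblyScript's actual string layout): since strings are immutable, the backing store can be chosen once at construction.

    class CompactString {
      private readonly latin1: Uint8Array | null;  // 1 byte per code unit
      private readonly utf16: Uint16Array | null;  // 2 bytes per code unit

      constructor(codeUnits: number[]) {
        const compact = codeUnits.every(u => u <= 0xFF);
        this.latin1 = compact ? Uint8Array.from(codeUnits) : null;
        this.utf16 = compact ? null : Uint16Array.from(codeUnits);
      }

      get length(): number {
        return (this.latin1 ?? this.utf16!).length;
      }

      charCodeAt(i: number): number {
        return (this.latin1 ?? this.utf16!)[i]; // same contract either way
      }
    }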

wdyt?

cc @dcodeIO @vgrichina

@fcrick
Contributor

fcrick commented Sep 26, 2019

My preference:

  • decouple how strings are represented so that modules can slot in or declare their own string mechanics.
    • hard to get right and a lot of work, but would be very nice
  • make UTF-8 the default, and support other encodings as non-defaults ('wide' strings).
  • add compact string support.

@willemneal
Contributor

Also, we need to start considering how we can use interface types: https://github.com/WebAssembly/interface-types/blob/master/proposals/interface-types/Explainer.md#walkthrough

@vgrichina

> We could introduce "compact strings". A similar approach is used in Java 9:
> https://www.baeldung.com/java-9-compact-string
> Due to the fact that all string operations are immutable, this is also possible in AS.

@MaxGraey this seems like the approach used by V8 as well. Maybe we can even go further and introduce a separate concept of ByteString, which basically breaks the .charCodeAt etc. contract but provides better performance (no need to convert data from UTF-8 and back). Then it can also be used as a backend for the contract-conforming string when it is Latin-only.

@willemneal
Contributor

(module
  (memory (export "mem") 1)
  (func (import "" "log_") (param i32 i32)) ;; declared but not actually imported
  ...
  ;; import that has a string parameter
  (@interface func $log (import "" "log") (param $arg string))
  ;; local implementation that converts UTF-8 with the new type instruction
  ;; and calls the actual import
  (@interface implement (import "" "log_") (param $ptr i32) (param $len i32)
    arg.get $ptr
    arg.get $len
    memory-to-string "mem"
    call-import $log
  )
)

Once we have this in Binaryen, it becomes much easier to mix together two binaries, since you'll know how to hook together all the imports. Thus we can compile each file individually and then merge the imports at runtime or compile time.

For example, it will make incremental compilation possible: when compiling, unchanged wasm files are cached, otherwise they are rebuilt. This could also be the heart of a REPL.

@willemneal
Contributor

It would also help to bind together different definitions of the same abstract type, meaning your function can use the other type. If a function expects UTF-8, that's what it'll get from any function that returns a string, since there is an adapter for it. Heck, we can transform between several different types:

  • DOMString: since the WebAssembly string type is defined as a sequence of Unicode code points and DOMString is defined as a sequence of 16-bit code units, conversion would be UTF-16 encoding/decoding, where lone surrogates in a DOMString decode to a surrogate code point.
  • USVString: a WebAssembly string is a superset of USVString. Conversion to a USVString would follow the same strategy as DOMString-to-USVString conversion and map lone surrogates to the replacement character.
  • ByteString: as a raw sequence of uninterpreted bytes, this type is probably best converted to and from a WebAssembly sequence interface type.

@fcrick
Contributor

fcrick commented Sep 27, 2019

Focusing on making the language support only what's really needed and letting libraries do many of these suggestions would be ideal for me. Less is more.
