Support UTF-8 strings out of the box #43
Comments
In general, the standard library is designed so that it can be replaced with a custom one, including the string implementation, but UTF-8 support would also be a great addition as a compile-time flag (or maybe a special string class), as you propose. The idea behind picking UTF-16 in the first place was that the most prominent host environment will be a browser with JavaScript, where string methods usually deal with UTF-16 code units. Though little is known yet about the exact requirements future bindings will have. Whatever default implementation(s) we finally pick should ideally not require translating strings from one encoding to another when calling between the host and WASM.
Is it possible to directly instantiate a wasm UTF-16 array into a JS string? What I've seen so far were transformations like:
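For illustration, the kind of transformation in question looks roughly like this (a sketch, not AssemblyScript's actual loader code; the memory/ptr/len convention is an assumption):

```typescript
// Turn a UTF-16LE array in wasm linear memory into a JS string.
// There is no zero-copy path: the host has to copy or convert either way.
function utf16ToString(memory: WebAssembly.Memory, ptr: number, len: number): string {
  // len is the number of UTF-16 code units, not bytes.
  const units = new Uint16Array(memory.buffer, ptr, len);
  return new TextDecoder("utf-16le").decode(units);
}
```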
Or put another way: if no direct instantiation is possible, that would be another reason to use UTF-8.
My expectation is that JS VMs use UTF-16 internally, so it isn't necessary to convert between encodings when loading / storing strings from WASM. That's somewhat shadowed by the fact that one cannot directly instantiate a string from a buffer, of course (except when using node, but even node has to convert bytes to a string one way or another). I also must admit that I haven't investigated whether UTF-16 is actually what a browser uses. Ideally, we should pick as the default the format that implies the fewest conversion steps (even if the runtime does them for us), but with just reading memory and calling imports / exports, it's hard to tell at this point in time.
UTF-8 also requires additional methods for working with UTF-8 and UTF-16 separately. For example, Rust has
Is there a std function already to convert to UTF-8? If not, I'll implement it if you provide a preferred location for it.
Can this be closed for now as well?
Is UTF-8 the default now, or is it optional?
There are special methods toUTF8() and get lengthUTF8(), but no method for the reverse. Internally, all strings use UTF-16LE, same as JS/TS.
I was also surprised to find a double byte internal encoding. It's double the memory... |
There are two major reasons for picking UTF-16:
Nonetheless, better support for UTF-8 encoded data is of course something we should investigate further. Possible improvements range from efficient conversion functions, as partly implemented above, to a full-fledged
Closing due to inactivity. Feel free to reopen.
That's actually a very important issue for our use case (https://nearprotocol.com/). We don't have to interoperate with JS whatsoever. Coding our drop-in
@vgrichina The String class contains special methods for that:
For use cases outside of the browser context it would still be very nice to have an option/flag to use utf-8 internally, but if it's too much work for you the ticket can remain closed. |
@MaxGraey I know about these methods. The whole point is to avoid these conversions, as they only waste CPU and memory. UTF-16LE string literals also waste space in the .wasm file (though it's probably negligible when compressed). Also, these methods are super inconvenient to use. Why not return
UTF-16LE is a legacy of the JS runtime, but it also greatly simplifies string methods, which would be quite complicated with UTF-8. It also simplifies and speeds up interop between JS and wasm.
The extra UTF8 methods were created to be as simple as possible for interop with C++ embedder VMs, which represent strings as simple, flat null-terminated byte arrays without a header. Returning
I think the only sensible solution to this (when looking at it as a feature request) is to provide a compile time switch to select between UTF16 and UTF8, and implement everything String-y twice depending on what's selected. The challenge here is that other parts of the standard library a) make assumptions about how a string is stored and b) rely on internal string helpers for performance reasons that are naturally hard to abstract into two different versions. While there are certainly clever ways to make this happen (backing strings with something easier to abstract, basing stuff on encodings) when throwing enough manpower at it, we should also consider that future specs like host bindings might change the game a bit here if they'd explicitly dictate a specific encoding for interoperable string arguments. See for example: WebAssembly/interface-types#13
I think host bindings will use
Actually it looks like e.g. V8 doesn't even always use UTF16 internally:
Yeah, it's hard to tell from just looking at V8, but host bindings will, eventually, have to specify what an interoperable string argument is.
The JavaScript runtime doesn't even always use UTF-16. In many cases it interprets the charset as
I recommend reading this article: UTF-8 is not always more space-efficient than UTF-16, especially for Chinese/Japanese text. But the main advantage of UTF-16 as an internal representation is more efficient string operations: length returns the number of code units (and, for strings without surrogate pairs, the number of characters) in constant time.
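The space trade-off mentioned above is easy to check with the standard TextEncoder API (a small sketch, not tied to AssemblyScript itself):

```typescript
// Compare UTF-8 vs UTF-16 byte sizes for ASCII and CJK text.
function utf8Bytes(s: string): number {
  return new TextEncoder().encode(s).length;
}
function utf16Bytes(s: string): number {
  // A JS string's length counts UTF-16 code units; each unit is 2 bytes.
  return s.length * 2;
}

console.log(utf8Bytes("hello"), utf16Bytes("hello"));     // 5 vs 10: UTF-8 wins
console.log(utf8Bytes("你好世界"), utf16Bytes("你好世界")); // 12 vs 8: UTF-16 wins
```

So for predominantly CJK content, UTF-16 actually uses less memory, while ASCII-heavy content favors UTF-8.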
@MaxGraey it's possible to implement UTF-8 strings with However, I agree that it would seriously increase complexity, and there are far more important missing features (like dynamic dispatch). I think a reasonable middle ground might be something like:
As for host bindings – it looks like UTF-8 has been there for more than a year. Not sure why they would change to UTF-16.
I know the data structure called a rope. It's actually used mostly for mutable string operations like
We could introduce "compact strings". A similar approach is used in Java 9: wdyt?
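For context, the Java 9 "compact strings" idea is to store a string as Latin-1 (1 byte per code unit) when every code unit fits in 8 bits, and fall back to UTF-16 otherwise. A minimal TypeScript sketch of the concept (the type and function names are hypothetical, not a proposed API):

```typescript
// Hypothetical compact-string representation: Latin-1 when possible,
// UTF-16 code units otherwise, with a tag telling which one is stored.
type CompactString =
  | { latin1: true; data: Uint8Array }
  | { latin1: false; data: Uint16Array };

function compact(s: string): CompactString {
  const units: number[] = [];
  for (let i = 0; i < s.length; i++) units.push(s.charCodeAt(i)); // code units, not code points
  if (units.every(u => u < 256)) {
    return { latin1: true, data: Uint8Array.from(units) };
  }
  return { latin1: false, data: Uint16Array.from(units) };
}

function decompact(c: CompactString): string {
  return String.fromCharCode(...Array.from(c.data));
}
```

ASCII-only strings then take half the memory, at the cost of a branch on the tag in every string operation.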
My preference:
Also we need to start to consider how we can use interface types: https://github.com/WebAssembly/interface-types/blob/master/proposals/interface-types/Explainer.md#walkthrough |
@MaxGraey this seems like the approach used by V8 as well. Maybe we can even go further and introduce a separate concept of
(module
  (memory (export "mem") 1)
  ;; declared but not actually imported
  (func (import "" "log_") (param i32 i32))
  ...
  ;; import that has a string parameter
  (@interface func $log (import "" "log") (param $arg string))
  ;; local implementation that converts UTF-8 with the new
  ;; memory-to-string instruction and calls the actual import
  (@interface implement (import "" "log_") (param $ptr i32) (param $len i32)
    arg.get $ptr
    arg.get $len
    memory-to-string "mem"
    call-import $log
  )
)
Once we have this in Binaryen, it becomes much easier to mix two binaries together, since you'll know how to hook up all the imports. This way we can compile each file individually and then merge the imports at runtime or compile time. For example, it will make incremental compilation possible: when compiling, unchanged wasm files are taken from a cache, otherwise they are rebuilt. This could also be the heart of a REPL.
It would also help to bind together different definitions of the same abstract type, meaning your function can use another type. If a function expects UTF-8, that's what it'll get from any function that returns a string, since there is an adapter for it. Heck, we can transform between several different types:
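To make the adapter idea concrete, here is a hedged host-side sketch of what such a lifting step does today, before interface types land (the memory/ptr/len layout is an assumption for illustration, not AssemblyScript's documented ABI):

```typescript
// Lift a UTF-8 string out of wasm linear memory into a real JS string,
// i.e. what a memory-to-string adapter would do automatically.
function liftUtf8(memory: WebAssembly.Memory, ptr: number, len: number): string {
  const bytes = new Uint8Array(memory.buffer, ptr, len);
  return new TextDecoder("utf-8").decode(bytes);
}
```

With interface types, this glue disappears from user code: the engine performs the conversion where the adapter instruction says so.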
Focusing on making the language support only what's really needed and letting libraries do many of these suggestions would be ideal for me. Less is more. |
Currently strings seem to be encoded in something like UTF-16:
(data (i32.const 1000) "\03\00\00\00H\00I\00H\00")
Would it be possible to use UTF-8 instead (as a compile flag or the default)?
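For reference, the data segment above appears to be a 4-byte little-endian length header (3 code units) followed by "HIH" in UTF-16LE. A sketch of reading that layout from JS (the header width and meaning are inferred from the segment shown, not from a documented ABI):

```typescript
// Decode a string stored as [u32 length][UTF-16LE code units] in memory.
function readString(memory: ArrayBuffer, ptr: number): string {
  const len = new DataView(memory).getUint32(ptr, true); // little-endian code-unit count
  const units = new Uint16Array(memory, ptr + 4, len);   // assumes ptr + 4 is 2-byte aligned
  return String.fromCharCode(...Array.from(units));
}
```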