Fast conversion to/from ASCII/one-byte strings? #44

Open
dead-claudia opened this issue Sep 12, 2024 · 5 comments

@dead-claudia

It's a near-universal optimization within JS implementations to split strings into two representations: one-byte and two-byte. Even the size-optimized runtime XS does it. It brings a massive speed boost and almost halves string memory usage in practice. So, I'd like to see fromCharByteArray and intoCharByteArray builtins that work on byte arrays, with character codes taken modulo 256.
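To make the request concrete, here is a rough TypeScript sketch of the semantics the two functions might have. It is an assumption, not spec text: the parameter order mirrors the existing fromCharCodeArray and intoCharCodeArray builtins, and `Uint8Array` stands in for a WasmGC (array (mut i8)).

```typescript
// Hypothetical semantics only; neither function exists in the js-string builtins spec.
// Uint8Array stands in for a WasmGC (array (mut i8)).

// Build a string whose code units are the bytes in [start, end), each in 0..255.
function fromCharByteArray(bytes: Uint8Array, start: number, end: number): string {
  if (start > end || end > bytes.length) {
    throw new RangeError("invalid range");
  }
  let result = "";
  for (let i = start; i < end; i++) {
    result += String.fromCharCode(bytes[i]);
  }
  return result;
}

// Write each code unit of `str`, modulo 256, into `bytes` beginning at `start`.
// Returns the number of code units written.
function intoCharByteArray(str: string, bytes: Uint8Array, start: number): number {
  if (start + str.length > bytes.length) {
    throw new RangeError("destination array too small");
  }
  for (let i = 0; i < str.length; i++) {
    bytes[start + i] = str.charCodeAt(i) & 0xff; // modulo 256
  }
  return str.length;
}
```

An engine would presumably implement both as a plain memory copy when the string is already stored as one-byte; the loops here only pin down the observable result.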

@mkustermann
Contributor

From a dart2wasm perspective I tend to agree that there should be a fast mechanism to create a JS string from a WasmGC array of bytes that contains ASCII, as this is an extremely common thing.

Imagine an app that operates on a binary encoding of a message that has some structure (e.g. protobuf, flatbuffer, ...). Such messages can have strings in them. The message is going to be a WasmGC array of bytes, and sub-sections of it are encoded strings. The common case is plain ASCII-encoded strings.

The UTF-8 decoding implementation written in WasmGC will scan the UTF-8 encoded string to find out the string length, and during this pass it will also find out whether the string is ASCII or not.
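A minimal sketch of such a single-pass scan, written in TypeScript rather than WasmGC and assuming the input is already well-formed UTF-8 (no validation is done). `scanUtf8` and `ScanResult` are hypothetical names, and `Uint8Array` stands in for an (array (mut i8)).

```typescript
// Hypothetical helper; assumes bytes[start..end) is well-formed UTF-8.
interface ScanResult {
  utf16Length: number; // UTF-16 code units needed for the decoded string
  isAscii: boolean;    // true when every byte is below 0x80
}

function scanUtf8(bytes: Uint8Array, start: number, end: number): ScanResult {
  let utf16Length = 0;
  let orOfAllBytes = 0;
  for (let i = start; i < end; i++) {
    const b = bytes[i];
    orOfAllBytes |= b;
    // Every byte except a continuation byte (10xxxxxx) starts a code point.
    if ((b & 0xc0) !== 0x80) utf16Length++;
    // A 4-byte sequence (lead byte 11110xxx) decodes to a surrogate pair.
    if (b >= 0xf0) utf16Length++;
  }
  return { utf16Length, isAscii: (orOfAllBytes & 0x80) === 0 };
}
```

OR-ing all bytes together keeps the ASCII check out of the branchy part of the loop; the high bit of the accumulated value is set exactly when some byte was non-ASCII.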

=> If it's ASCII, we'd like to directly call a js-string builtin that takes the char code array as an (array (mut i8)).

This would avoid

  • allocating a temporary (array (mut i16)) twice its size
  • copying the ASCII bytes from (array (mut i8)) to (array (mut i16)) to use the existing fromCharCodeArray API
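For illustration, the two steps listed above look roughly like this when modelled in TypeScript. This is a sketch under assumptions: `Uint8Array` and `Uint16Array` stand in for the WasmGC arrays, and `fromCharCodeArrayStandIn` is a hypothetical stand-in for the real fromCharCodeArray builtin, not its implementation.

```typescript
// Hypothetical stand-in for the fromCharCodeArray builtin:
// builds a string from the code units in [start, end).
function fromCharCodeArrayStandIn(units: Uint16Array, start: number, end: number): string {
  let result = "";
  for (let i = start; i < end; i++) {
    result += String.fromCharCode(units[i]);
  }
  return result;
}

// Today's workaround for an ASCII slice held in a byte array.
function asciiSliceToString(bytes: Uint8Array, start: number, end: number): string {
  // Step 1: temporary array with the same element count but twice the byte size.
  const wide = new Uint16Array(end - start);
  // Step 2: widen every byte to 16 bits.
  for (let i = start; i < end; i++) {
    wide[i - start] = bytes[i];
  }
  // Step 3: hand the widened copy to the existing API.
  return fromCharCodeArrayStandIn(wide, 0, wide.length);
}
```

A builtin that accepts the byte array directly would make steps 1 and 2 unnecessary.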

I think this would still align with:

Goals of Builtins
Builtins should not provide any new abilities to WebAssembly that JS doesn't already have.

As this isn't about adding UTF-8 support to the js-builtin proposal, but rather about which array types are allowed in the imports.

If wasm allows duplicate imports (see e.g. WebAssembly/design#1402) one could even specify that one can import the same name fromCharCodeArray with two different signatures.

/cc @eqrion Is this something that could be included before js-string builtins ship?

/cc @osa1

@mkustermann
Contributor

mkustermann commented Jan 22, 2025

As a tangential note:

The stated goal of the js-string builtin proposal is to expose existing JS mechanisms to Wasm via recognized imports and not introduce new functionality like UTF-8.

UTF8/WTF8 support

As stated above in 'goals for builtins', builtins are intended to just wrap existing primitives and not invent new functionality.

There is the Encoding API for TextEncoder/TextDecoder which can be used for UTF-8 support. However, this is technically a separate spec from JS and may not be available on all JS engines (in practice it's available widely). This proposal exposes UTF-8 data conversions using this API under separate wasm:text-encoder wasm:text-decoder interfaces which are available when the host implements these interfaces.

Though the lines are very blurry here, because the js-builtin proposal:

  • Adds support for JS String from constant UTF-8 encoded strings (via WebAssemblyCompileOptions.importedStringConstants mechanism)
  • Doesn't add support for JS String from non-constant UTF-8 encoded strings (i.e. from arrays)
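For the non-constant case in the second bullet, the route through the Encoding API mentioned in the quoted text looks roughly like this on the JS side. This is only a sketch: `decodeUtf8Slice` is a hypothetical helper, `Uint8Array` stands in for data copied out of a WasmGC array, and it assumes a host where the `TextDecoder` global is available.

```typescript
// Hypothetical helper; relies on the Encoding API's TextDecoder being present.
// `fatal: true` makes malformed input throw instead of yielding U+FFFD.
const utf8Decoder = new TextDecoder("utf-8", { fatal: true });

function decodeUtf8Slice(bytes: Uint8Array, start: number, end: number): string {
  // subarray creates a view over the same buffer; no bytes are copied here.
  return utf8Decoder.decode(bytes.subarray(start, end));
}
```

This handles arbitrary UTF-8, so it does more work than an ASCII-only fast path would need.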

@dead-claudia
Author

I just realized I forgot one other function that should've been in that proposal: isOneByteString to query whether it's one-byte or two-byte. This in practice is also a cheap comparison.
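An engine can answer this from the string's internal representation, which plain JS cannot observe. The closest userland approximation is a linear scan that checks whether the string could be stored as one-byte at all, sketched below under the hypothetical name `fitsInOneByte`. It is not equivalent to the proposed builtin, since an engine may keep a string in two-byte form even when every code unit fits in a byte.

```typescript
// Hypothetical userland approximation, O(n) in the string length.
// Returns true when every code unit is at most 0xFF, i.e. when a
// modulo-256 conversion to bytes would lose no information.
function fitsInOneByte(str: string): boolean {
  for (let i = 0; i < str.length; i++) {
    if (str.charCodeAt(i) > 0xff) return false;
  }
  return true;
}
```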

I should also note that while this proposal is titled js-string-builtins, it's very much not JS-specific. Any language that uses Latin-1+UCS-2 as its internal string representation (JS, Java, Dart, Kotlin, Python (mostly), etc.) could use this. There are use cases for it outside JS, so it'd only make sense for non-JS VMs to likewise implement this.

@mkustermann
Contributor

It was pointed out to me that the js-builtin proposal has been effectively finalized and shipped. In proposal stages it's at phase 4 and from 4 to 5 it's apparently just a formality.

So I guess this is too late to adjust the js-builtin spec.

@eqrion
Collaborator

eqrion commented Jan 23, 2025

Yes, now that this is phase 4 (and shipping in some browsers) we can't really change it. But I do intend to do a follow-up proposal in the future with more extensions. So this could be worthwhile then.
