AssemblyScript & Interface Types - UTF-16 String Support / Consider UTF-8 Strings #1263
Dart also inherits many aspects from JavaScript and likewise uses UTF-16. It is currently not heading in the WebAssembly direction (though it does have AOT compilation), but that could change in the future. |
Oh wow, this is probably silly of me, but I didn't realize that Java and C# also use UTF-16. Yes, with Blazor and Kotlin targeting wasm, I can definitely see this being a common need :) I'll go comment about this back in WebAssembly/interface-types#13. I did just want to clarify one thing: I was only suggesting that AS change the representation of strings in linear memory, not the programmer-visible semantics. I think this is still a good idea, even if Interface Types has UTF-16 lifting/lowering. More specifically, even assuming interface types has both UTF-8 and UTF-16 lifting/lowering, I think a nice impl for AS strings could be:
With such a scheme, I think only a small % of strings will get inflated, and, when that happens, the one-time inflation cost will be amortized. I think the code-size impact from the extra code to handle the two cases should be small, since it will be factored out into runtime functions. And, in exchange, string memory use drops by ~50%. WDYT? |
Java has had a similar approach since version 9 (its second attempt, by the way), but it only switches between ASCII/Latin1 and UTF-16. It's turned off by default and is called Compact Strings. And even with a JIT runtime like Java's, turning on Compact Strings can sometimes slow down string manipulation. For us it also means significantly increasing the code size of our runtime, which we try to keep as small as possible. I know JavaScript engines internally represent strings in different formats, and even use rope structures when strings are concatenated frequently, but for us that means significant bloating of the runtime, unfortunately. |
I am not involved in the discussion, and maybe my question is off-topic: does this solve the performance issue of string conversion between Wasm and JavaScript? |
If I read correctly, Compact Strings are enabled by default (with a flag to disable) which demonstrates a widespread win in practice. Perhaps you're thinking of the previous Compressed Strings feature which it sounds like had more of a perf impact and was never enabled by default.
Do you have any data to show the code-size increase is significant? I'm assuming that, in general, the AS compiler would want to balance code-size and runtime-perf and thus the concrete magnitudes matter. |
Yes, it seems it's on by default now. But there is a flag to disable it "in the unexpected event where a performance regression is observed in migrating from Java SE 8 to Java SE 9 and an analysis shows that Compact Strings introduces the regression."
I expect this change introduces more cache misses, because most string methods split into 2 or 4 branches depending on the method's arity (String#concat, String#replace, comparisons, etc.).
It also increases the code size of String and related functions roughly 2-3x. Maybe I'm missing something, though? |
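To make that branching concrete, here is a hedged sketch (not Java's actual implementation) of why even a simple two-argument operation like concatenation multiplies into per-representation cases under compact strings:

```typescript
// With compact strings, each argument can be Latin1 or UTF-16, so a
// two-argument method needs specialized bodies for the combinations;
// duplicating these bodies across the String methods is where the
// code-size cost comes from.
type Repr = Uint8Array | Uint16Array; // Latin1 vs UTF-16 backing store

function concat(a: Repr, b: Repr): Repr {
  if (a instanceof Uint8Array && b instanceof Uint8Array) {
    // Latin1 x Latin1: the only case where the result stays compact.
    const out = new Uint8Array(a.length + b.length);
    out.set(a);
    out.set(b, a.length);
    return out;
  }
  // Any mixed or UTF-16 combination inflates the result to UTF-16.
  const out = new Uint16Array(a.length + b.length);
  out.set(a);
  out.set(b, a.length);
  return out;
}
```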
We can definitely get the advantage (twice the speed and half the memory footprint) if most strings contain only Latin1 code points. But how often do we use only English (without emojis, ligatures, and special symbols like á / Á or ©) nowadays? For example, the first and second most widespread languages in the world, Chinese and Spanish, already require UTF-8/UTF-16. English is only third by number of native speakers, though first on the web. |
Some of us care about compile times too. People testing their software locally will care about untouched builds, and anyone using strings will see a very large increase in test-module size just by upgrading to a new compiler. Even if this adds only a single second to compile time (per run), it adds up over long periods of time and gets very frustrating. |
I'm just a casual commentator here, so I won't press the matter any further, but I'd encourage measurement before ruling the option out. Cheers! |
@lukewagner Luke, thank you so much for the reasonable suggestion about "compacting strings". I really appreciate it, and it totally makes sense for runtimes that are not distributed via the web, like the JVM or DartVM. By the way, I had already thought about this here. |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
I have changed my thoughts on this issue. We need some easy way to generate static UTF-8 strings. For instance, when traversing WASI file descriptors, we need to loop over preopened folders to build the fs.readFileSync functions. These use UTF-8 strings. |
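For context on this boundary cost: WASI hands back preopened directory names as raw UTF-8 bytes in linear memory, so a UTF-16-string language has to decode them (and re-encode when passing paths back). A small host-side sketch, where decodePreopenName and encodePath are hypothetical helpers, not part of any WASI binding:

```typescript
// WASI returns path names as raw UTF-8 bytes; a UTF-16 guest must
// decode them before comparing or storing them as native strings.
function decodePreopenName(bytes: Uint8Array): string {
  return new TextDecoder("utf-8").decode(bytes);
}

// Going the other way (e.g. to open a file by path) requires
// encoding the UTF-16 string back into UTF-8 bytes.
function encodePath(path: string): Uint8Array {
  return new TextEncoder().encode(path);
}
```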
Note that ASCII is still extremely common in many areas: HTML tags/attributes, CSS, DOM APIs (e.g. …). My educated guess is that most strings are not created by humans, but instead created by other programs. I think it's a mistake to assume that human-created strings are dominant. All JS engines have a fast path for Latin1, because ASCII strings are so common. A similar situation happened with UTF-8 vs UTF-16: for certain Asian languages UTF-16 uses 2 bytes where UTF-8 uses 3, so people assumed UTF-16 would be better for those languages, but in practice UTF-8 is actually more efficient. This is why it's important to measure based on real data, and not based on assumptions. |
Yes, but for UTF-8 we still need special fast paths when all symbols are Latin1, because in that case we can skip UTF-8 decoding and optimize (vectorize) Latin1-only sequences much better. In that scenario it doesn't matter much whether it's UTF-16 (with a compact mode) or UTF-8 (with a specialized mode); the only remaining benefits are transfer size and memory consumption. |
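A scalar sketch of the fast-path check under discussion; real engines test many bytes at a time with SIMD, but the idea is simply "no byte has the high bit set, so the UTF-8 buffer is plain ASCII and decoding can be skipped":

```typescript
// OR all bytes together; if the high bit never appears, the buffer
// is pure ASCII and every byte is already a valid code unit.
function isAscii(bytes: Uint8Array): boolean {
  let acc = 0;
  for (let i = 0; i < bytes.length; i++) acc |= bytes[i];
  return acc < 0x80;
}
```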
@MaxGraey That depends on whether AssemblyScript uses pure UTF-8 or a hybrid. If it uses pure UTF-8 I don't think it would need a fast path, since UTF-8 is a superset of ASCII, and you only need to encode/decode at the boundaries of Wasm. So the choice is really between pure WTF-16 (a la JS), pure UTF-8 (a la Rust), or hybrid ASCII/WTF-16 (a la Python/Java/JS engines). |
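The "superset of ASCII" point can be checked directly: ASCII text encodes to identical byte values under UTF-8, so a pure-UTF-8 representation pays nothing for ASCII strings and only converts at the boundary. A small demonstration:

```typescript
// ASCII code points (< 0x80) map 1:1 to UTF-8 bytes; only higher
// code points expand into multi-byte sequences.
const ascii = "interface-types";
const bytes = new TextEncoder().encode(ascii);
const identical = [...ascii].every((ch, i) => ch.charCodeAt(0) === bytes[i]);

// A non-ASCII Latin1 character already needs two bytes in UTF-8.
const eAcute = new TextEncoder().encode("\u00e9"); // é → 0xC3 0xA9
```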
Hello! So this is a follow up from: WebAssembly/interface-types#13
That is a long thread exploring string encodings in the upcoming Interface Types proposal, where currently the only supported string encoding (in the MVP) is UTF-8.
However, AssemblyScript uses UTF-16 to stay parallel with the Web APIs, so interface types could require some double encoding for UTF-16 languages (if I understand correctly).
Switching to UTF-8 was suggested, but there are a few issues (not the full list) that we see on the AS side (I'll let @dcodeIO give implementation details where necessary):
.substring and charCodeAt are implemented in a way that would be difficult to re-implement in UTF-8, and changing them could also break libraries that depend on specific JS behavior (if they were to be ported to AS).
Would be interested to hear everyone's thoughts. Looking forward to a respectful, thoughtful discussion here, and finding a good solution 😄 Thanks everyone! 👍
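To illustrate the JS-compatible behavior at stake: both substring and charCodeAt operate on 16-bit code units, so a supplementary character occupies two indices and can even be split mid-surrogate. Reproducing these exact semantics on a UTF-8 backing store would mean mapping code-unit indices to byte offsets on every call:

```typescript
// U+1F600 (😀) is stored as the surrogate pair 0xD83D 0xDE00 in
// UTF-16, so string indices count code units, not characters.
const s = "a\u{1F600}b";

const units = s.length;       // 4, not 3
const high = s.charCodeAt(1); // 0xD83D (high surrogate)
const low = s.charCodeAt(2);  // 0xDE00 (low surrogate)

// substring indexes by code unit and can split the pair,
// producing a string that ends in a lone surrogate.
const half = s.substring(0, 2);
```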
cc @lukewagner @dcodeIO @MaxGraey