Commit 2452d25

Add warnings about UTF-16 vs UTF-8 strings
This commit aims to address #1348 via a number of strategies:

* Documentation is updated to warn about UTF-16 vs UTF-8 problems between JS and Rust. Notably documenting that `as_string` and handling of arguments is lossy when there are lone surrogates.
* A `JsString::is_valid_utf16` method was added to test whether `as_string` is lossless or not.

The intention is that most default behavior of `wasm-bindgen` will remain, but where necessary bindings will use `JsString` instead of `str`/`String` and will manually check for `is_valid_utf16` as necessary. It's also hypothesized that this is relatively rare and not too performance critical, so an optimized intrinsic for `is_valid_utf16` is not yet provided.

Closes #1348
1 parent c5f18b6 commit 2452d25

6 files changed, with 89 additions and 1 deletion

crates/js-sys/src/lib.rs

Lines changed: 31 additions & 0 deletions
@@ -3522,6 +3522,37 @@ impl JsString {
             None
         }
     }
+
+    /// Returns whether this string is a valid UTF-16 string.
+    ///
+    /// This is useful for learning whether `String::from(..)` will return a
+    /// lossless representation of the JS string. If this string contains
+    /// unpaired surrogates then `String::from` will succeed but it will be a
+    /// lossy representation of the JS string because unpaired surrogates will
+    /// become replacement characters.
+    ///
+    /// If this function returns `false` then to get a lossless representation
+    /// of the string you'll need to manually use the `iter` method (or the
+    /// `char_code_at` accessor) to view the raw code units.
+    ///
+    /// For more information, see the documentation on [JS strings vs Rust
+    /// strings][docs]
+    ///
+    /// [docs]: https://rustwasm.github.io/docs/wasm-bindgen/reference/types/str.html
+    pub fn is_valid_utf16(&self) -> bool {
+        std::char::decode_utf16(self.iter()).all(|i| i.is_ok())
+    }
+
+    /// Returns an iterator over the `u16` character codes that make up this JS
+    /// string.
+    ///
+    /// This method will call `char_code_at` for each code in this JS string,
+    /// returning an iterator of the codes in sequence.
+    pub fn iter<'a>(
+        &'a self,
+    ) -> impl ExactSizeIterator<Item = u16> + DoubleEndedIterator<Item = u16> + 'a {
+        (0..self.length()).map(move |i| self.char_code_at(i) as u16)
+    }
 }
 
 impl PartialEq<str> for JsString {
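
The validity check in this hunk can be tried outside the browser with the standard library alone. The following is a minimal sketch, assuming the JS string's UTF-16 code units have already been copied into a `&[u16]` slice. The free function `is_valid_utf16_units` is a hypothetical stand-in used for illustration; it is not part of `js-sys`.

```rust
/// Hypothetical stand-in for `JsString::is_valid_utf16`.
///
/// Assumption: the caller supplies the UTF-16 code units directly, instead
/// of reading them from a JS string through `char_code_at`.
pub fn is_valid_utf16_units(units: &[u16]) -> bool {
    // `decode_utf16` yields an `Err` for every unpaired surrogate it meets,
    // so the input is valid only if every decoded item is `Ok`.
    std::char::decode_utf16(units.iter().copied()).all(|r| r.is_ok())
}
```

A lone high surrogate such as `0xd800` or a lone low surrogate such as `0xdc00` makes the function return `false`, which matches the cases exercised by the test added in this commit.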

crates/js-sys/tests/wasm/JsString.rs

Lines changed: 12 additions & 0 deletions
@@ -541,3 +541,15 @@ fn raw() {
     );
     assert!(JsString::raw_0(&JsValue::null().unchecked_into()).is_err());
 }
+
+#[wasm_bindgen_test]
+fn is_valid_utf16() {
+    assert!(JsString::from("a").is_valid_utf16());
+    assert!(JsString::from("").is_valid_utf16());
+    assert!(JsString::from("🥑").is_valid_utf16());
+    assert!(JsString::from("Why hello there this, 🥑, is 🥑 and is 🥑").is_valid_utf16());
+
+    assert!(JsString::from_char_code1(0x00).is_valid_utf16());
+    assert!(!JsString::from_char_code1(0xd800).is_valid_utf16());
+    assert!(!JsString::from_char_code1(0xdc00).is_valid_utf16());
+}

examples/without-a-bundler/index.html

Lines changed: 6 additions & 1 deletion
@@ -25,7 +25,12 @@
         // Also note that the promise, when resolved, yields the wasm module's
         // exports which is the same as importing the `*_bg` module in other
         // modes
-        await init('./pkg/without_a_bundler_bg.wasm');
+        // await init('./pkg/without_a_bundler_bg.wasm');
+
+        const url = await fetch('http://localhost:8001/pkg/without_a_bundler_bg.wasm');
+        const body = await url.arrayBuffer();
+        const module = await WebAssembly.compile(body);
+        await init(module);
 
         // And afterwards we can use all the functionality defined in wasm.
         const result = add(1, 2);

guide/src/reference/types/str.md

Lines changed: 27 additions & 0 deletions
@@ -20,3 +20,30 @@ with handles to JavaScript string values, use the `js_sys::JsString` type.
 ```js
 {{#include ../../../../examples/guide-supported-types-examples/str.js}}
 ```
+
+## UTF-16 vs UTF-8
+
+Strings in JavaScript are encoded as UTF-16, but with one major exception: they
+can contain unpaired surrogates. For some Unicode characters UTF-16 uses two
+16-bit values. These are called "surrogate pairs" because they always come in
+pairs. In JavaScript, it is possible for these surrogate pairs to be missing the
+other half, creating an "unpaired surrogate".
+
+When passing a string from JavaScript to Rust, `wasm-bindgen` uses the
+`TextEncoder` API to convert from UTF-16 to UTF-8. This is normally perfectly
+fine... unless there are unpaired surrogates. In that case it will replace the
+unpaired surrogates with U+FFFD (�, the replacement character). That means the
+string in Rust is now different from the string in JavaScript!
+
+If you want to guarantee that the Rust string is the same as the JavaScript
+string, you should instead use `js_sys::JsString` (which keeps the string in
+JavaScript and doesn't copy it into Rust).
+
+If you want to access the raw value of a JS string, you can use `JsString::iter`,
+which returns an `Iterator<Item = u16>`. This perfectly preserves everything
+(including unpaired surrogates), but it does not do any encoding (so you
+have to do that yourself!).
+
+If you simply want to ignore strings which contain unpaired surrogates, you can
+use `JsString::is_valid_utf16` to test whether the string contains unpaired
+surrogates or not.
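
The difference between the lossy and the lossless view described in this guide section can be illustrated with the standard library. This is a sketch under stated assumptions, not the conversion `wasm-bindgen` itself performs (that happens on the JS side): the JS string is modeled as a `&[u16]` slice of code units, and the helper names `to_string_lossy` and `decode_preserving` are hypothetical.

```rust
/// Lossy view: every unpaired surrogate becomes U+FFFD, the same
/// replacement character the guide describes for strings passed to Rust.
pub fn to_string_lossy(units: &[u16]) -> String {
    String::from_utf16_lossy(units)
}

/// Lossless view: well-formed sequences decode to `Ok(char)`, while each
/// unpaired surrogate is kept as `Err(raw_code_unit)` instead of being
/// replaced.
pub fn decode_preserving(units: &[u16]) -> Vec<Result<char, u16>> {
    std::char::decode_utf16(units.iter().copied())
        .map(|r| r.map_err(|e| e.unpaired_surrogate()))
        .collect()
}
```

With real bindings, the code units would presumably come from `JsString::iter` rather than from a slice. Keeping the `Err` values lets the caller decide how to encode or report unpaired surrogates, which is the manual step the guide text refers to.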

guide/src/reference/types/string.md

Lines changed: 3 additions & 0 deletions
@@ -8,6 +8,9 @@ Copies the string's contents back and forth between the JavaScript
 garbage-collected heap and the Wasm linear memory with `TextDecoder` and
 `TextEncoder`
 
+> **Note**: Be sure to check out the [documentation for `str`](str.html) to
+> learn about some caveats when working with strings between JS and Rust.
+
 ## Example Rust Usage
 
 ```rust

src/lib.rs

Lines changed: 10 additions & 0 deletions
@@ -260,6 +260,16 @@ impl JsValue {
     ///
     /// If this JS value is not an instance of a string or if it's not valid
     /// utf-8 then this returns `None`.
+    ///
+    /// # UTF-16 vs UTF-8
+    ///
+    /// JavaScript strings in general are encoded as UTF-16, but Rust strings
+    /// are encoded as UTF-8. This can cause the Rust string to look a bit
+    /// different than the JS string sometimes. For more details see the
+    /// [documentation about the `str` type][caveats] which contains a few
+    /// caveats about the encodings.
+    ///
+    /// [caveats]: https://rustwasm.github.io/docs/wasm-bindgen/reference/types/str.html
     #[cfg(feature = "std")]
     pub fn as_string(&self) -> Option<String> {
         unsafe {
