Strawman counterproposal #3

Closed
wants to merge 14 commits into from

qwertie

@qwertie qwertie commented May 24, 2016

See #697 in WebAssembly/design for an introduction. Note: base branch is wrong; I don't think I can fix it here, so see #8 for the diff.

This PR is meant largely as an illustration; I know it's not likely to be merged. It includes three categories of changes to the initial strawman; the first two are prerequisites for the third, which is the real point of the PR.

1. Changes to eliminate ambiguity

  • The original strawman uses a paren-free design for opcodes, so that presumably i32.rotl $0, 8 is how you rotate by 8. This is not ambiguous for the proposed parser since all opcodes are keywords, but it is visually ambiguous to readers, who could easily see call $foo(i32.rotl $0, 8, 9) as a function that takes three arguments rather than two. So I propose
    1. Parens for most opcodes: i32.rotl($0, 8)
    2. Instead of f32.store [5,+0], -0x0p0, use f32.store [5,+0] = -0x0p0
    3. Instead of br_table [$a, $b, $c], $default, $index, use br_table [a, b, c, default] : $index (this particular syntax is LES-compatible, see below.)
    4. Instead of br $exit, $i, use br exit => $i, where => $i is optional
    5. Similarly for br_if, use br exit (if condition) => $i
  • Use a numeric suffix to distinguish i64 from i32 and f64 from f32. Based on C/C++/C# I propose no required suffix for f64 or i32, but to use L for i64 and f for f32 (123L, 12.0p+0f). Another possibility would have been suffixes i64 and f32 as in Rust. Note: type inference is possible, e.g. $x + 5 could infer that 5 is f32 when $x is f32, but what about 3p+0 + 3p+0? Since wasm is low-level, the text should be allowed to specify the types, even if they can be inferred (and I'd prefer not to require the entire opcode name).
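To make the suffix rule concrete, here is a minimal sketch of how a lexer might assign a type to a literal. The grammar is an assumption made up for illustration (optional sign, decimal or 0x-prefixed digits, optional fraction, optional p exponent, optional L or f suffix); neither the strawman nor LES specifies this exact grammar, and the name literal_type is hypothetical.

```python
import re

# Hypothetical literal grammar, for illustration only.
_HEX = r"0x[0-9a-fA-F]+(?P<hfrac>\.[0-9a-fA-F]*)?"
_DEC = r"[0-9]+(?P<dfrac>\.[0-9]*)?"
_LITERAL = re.compile(
    rf"[+-]?(?:{_HEX}|{_DEC})(?P<exp>p[+-]?[0-9]+)?(?P<suffix>[Lf])?"
)


def literal_type(text):
    """Return 'i32', 'i64', 'f32' or 'f64' for a numeric literal."""
    m = _LITERAL.fullmatch(text)
    if m is None:
        raise ValueError(f"not a numeric literal: {text!r}")
    # A fraction or a p exponent marks the literal as floating point.
    is_float = any(m.group(g) is not None for g in ("hfrac", "dfrac", "exp"))
    suffix = m.group("suffix")
    if suffix == "L":
        if is_float:
            raise ValueError("L suffix is only valid on integers")
        return "i64"
    if suffix == "f":
        if not is_float:
            raise ValueError("f suffix is only valid on floats")
        return "f32"
    # No suffix: default to f64 for floats and i32 for integers.
    return "f64" if is_float else "i32"
```

Under these assumptions 123L is i64, 12.0p+0f is f32, and 3p+0 is f64 without any inference. One wrinkle of this particular grammar: f is also a hex digit, so 0x1f stays an i32, and an f suffix on a hex float is only unambiguous after the exponent.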

2. Simplifying changes

  • Instead of call $foo(...), use simply $foo(...).
  • Instead of call_import $foo(...), use $$foo(...)
  • Instead of call_indirect $sig [1] $min(0, 2) - wait, what is the [1] for? I don't see it in the s-exprs I'm looking at. So, I propose $min::$sig(0, 2), where :: is a high-precedence operator. Having removed the call_indirect keyword, I think reversing the order will be more readable in case $min is a complex expression.

3. Changes to allow LES

I propose that the text format be compatible with LES - as the PR text explains, not LES as it exists today, but as it will be when the MVP is launched. This gives the CG some freedom to make some changes to LES and not others. Specifically, any elements that make sense only in WebAssembly (e.g. keywords for wasm opcodes) would not be permitted, but changes such as tweaks to operator precedence, handling of semicolons, the grammar of LES "superexpressions", or the name used for "infinity", are fine.

The proposed goals of the text format are "match existing conventions on the Web (for example, curly braces, as in JavaScript and CSS)", and since LES is a fairly JS-like language, I assert that any reasonable text format can be straightforwardly modified to be LES-compatible. Here are the required changes:

  • In LES, $foo-loop subtracts loop from $foo, so you need to write $@foo-bar to tell the parser that the hyphen is part of the identifier. However, this could be changed if people are strongly in favor of allowing dashes in identifiers. LES also allows any UTF-8 string as an identifier, even @`\n\0` (a newline character and a null character) or @`` (the empty string). Similarly, i32.reinterpret/f32 is changed to i32.reinterpret'f32 to make clear it is not a division (' is legal in identifiers).
  • In the original design, all the "rare" single-arg opcodes that don't use a special operator, such as i32.popcnt $x, do not need parentheses, but LES currently requires additional syntax: either i32.popcnt($x) or `i32.popcnt` $x (I prefer the first option.)
  • function $@fac-opt($a:i64) : i64: the parens around i64 are (and must be) optional in LES.
  • Understandably, <s and >s are not valid operator names, because in any language other than wasm, r>s should be parsed as r > s. LES operators must therefore consist of punctuation, and I selected > for signed and |> for unsigned. Unfortunately this turns out to be a little clumsy since we need >|= instead of |>= (this is explained in the PR), but LES does offer backquotes for making non-punctuation operators, so $0 `>s` $1 or $0 `>=u` $1 could be used instead.
  • var $x:i64 is changed to $x:i64 because although the former syntax is legal in LES, it is redundant. Since LES is highly regular, using essentially the same syntax everywhere, if $x:i64 is legal syntax in the formal argument list, it must also be legal in the function body.
  • As a consequence, labels of the form foo: are not very practical since : is also a binary operator. Consider foo: $x = 0, which would be parsed as (foo : $x) = 0. So instead I've moved the colon to the start; foo: $x = 0 would either have to be written on two lines, or on a single line as :foo; $x = 0.
  • Currently, LES gives a syntax error for function $foo () : i32; the space before ( must be removed; this is related to how LES manages to be a keyword-free language. However, if there's a lot of hate for this, I can eliminate the whitespace sensitivity (though something somewhere will have to be sacrificed).
  • The , in i32.store8 [$base, +4]:align=2, $value is bad as mentioned above, while : and = are not ideal punctuation. There are a variety of alternatives but I picked i32.store8 [$base, +4, align 2] = $value.
  • Sigils on labels are not really needed, since LES has no keywords that the labels could conflict with (future nullary opcodes could be written opcode() to ensure they do not clash with labels, as the latter cannot have arguments). Note there is no need for $ on local variables either, but I didn't remove them since we obviously can't replace variables like $0 with just 0, so there was no immediate benefit. FYI, @0 is how you write an identifier that begins with a digit in LES, in contrast to $0 which is the prefix operator $ applied to the literal 0.

Note: I left the text as "The $ sigil on function and variable names cleanly ensures that they never collide with wasm keywords, present or future." This is correct except for the word "keywords", since LES doesn't have keywords. Still, collisions may be possible at least when it comes to function calls, since for example loop({...}) means the same thing as loop {...} in LES, so if loop were a function and the $ were not required, there would be a collision.

There are other, niggly issues that deserve mention, but this is getting long so I'll stop here and let you read the rationales if you haven't yet.

TODOs

I have opinions about all the TODOs but I've left them out of the PR and also added my own TODO regarding whether or not semicolons should be required. I will just say that regarding the precedence of &|^, LES is spec'd to punt on that issue, by printing an error if you write an expression like x & y == z.

@sunfishcode
Owner

call $foo(i32.rotl $0, 8, 9)

Another way to fix this is with precedence mechanisms. If we set up the grammar such that this parses with i32.rotl $0, 8, 9 as the subexpression, that'll require this code to be written with parentheses, without making parentheses a builtin part of the i32.rotl syntax.

f32.store [5,+0] = -0x0p0

That's a clever idea. I'll have to think about this more.

Instead of br_table [$a, $b, $c], $default, $index, use br_table [a, b, c, default] : $index

Conceptually, the default is not part of the table. It's the fallback for when the index is out of bounds on the table. It's a subtle distinction, but it matches the abstract performance model (jump table guarded by a branch).
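The "jump table guarded by a branch" model can be sketched in a few lines. This is only an illustration of the semantics described above; the function name and the use of label strings are assumptions for the example.

```python
def br_table(targets, default, index):
    """Select the branch target the way br_table does.

    `targets` is the jump table, `default` is the out-of-bounds fallback,
    and `index` is the i32 operand, interpreted as unsigned.
    """
    index &= 0xFFFFFFFF  # reinterpret a negative i32 as unsigned
    if index < len(targets):  # the guarding bounds check
        return targets[index]  # the jump-table lookup
    return default
```

With targets ["a", "b", "c"], an index of 1 selects "b", while 3 or -1 both fall through to the default, which is why the default sits outside the table proper.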

br exit => $i
br exit (if condition) => $i

I'm confused about what this means. Is $i the value operand of the branch? If so, it needs to go to the left of the condition, because it is evaluated first in the wasm semantics.

numeric suffix

This is being discussed in #7

$foo(...)

This creates ambiguities with other parts of the grammar. My subjective experience reading LLVM IR is that an explicit call does not significantly impede readability.

Also, in optimized code where trivial calls have been inlined, highlighting the calls that remain is interesting.

$$foo(...)

The $ sigil in the plain call case is part of the identifier, not the call, so $$ is awkward as a way to represent call_indirect.

call_indirect $sig [1] $min(0, 2)

This was an error in the grammar example. I've fixed it now.

any UTF-8 string as an identifier

WebAssembly has arbitrary-byte-sequence identifiers, so the text format will need to support this. My experiment here doesn't currently cover this, so we'll need something.

I'm reluctant to use @ for this. As one of the few common keyboard keys not claimed by a C/JS/etc. operator, @ is extraordinarily valuable. I am inclined to save it for some other purpose, since there are many other viable ways to escape characters in identifier strings.

In the original design, all the "rare" single-arg opcodes that don't use a special operator, such as i32.popcnt $x, do not need parentheses, but LES currently requires additional syntax: either i32.popcnt($x) or `i32.popcnt` $x (I prefer the first option.)

I don't understand what you're proposing here.

function $@fac-opt($a:i64) : i64: the parens around i64 are (and must be) optional in LES

The return type syntax has parens because it's anticipating a future with multiple return values from functions, which generalizes function return types from a single type to a list of types.

<s and >s are not valid operator names

The trailing s and u nicely parallel WebAssembly operator names, and also anticipate the possible future addition of features like signed addition, subtraction, and multiplication, which would become +s, -s, and *s.

I also like how s and u aren't normally operator characters, so with my syntactic sensibilities, it visually indicates them as attributes of operators, whereas |> to me visually looks like it might be its own kind of operator unrelated to less-than.

Backquoting >s adds clutter.

if $x:i64 is legal syntax in the formal argument list, it must also be legal in the function body [... and several others ...]

Conforming to LES' complex constraints is not a priority for this text format. I readily admit that this is a subjective choice.

printing an error if you write an expression like x & y == z

I like this idea. Pretty printers will likely want to emit parens anyway because the ambiguity is confusing for humans as well, so we might as well just require them.
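As a rough sketch of what "print an error" could look like, the toy parser below accepts bitwise and comparison operators but refuses to mix the two families without parentheses. It is not the LES parser; the token set, the function names, and the left-to-right grouping within a family are all assumptions chosen to keep the example short.

```python
import re

BITWISE = {"&", "|", "^"}
COMPARE = {"==", "!=", "<", ">", "<=", ">="}
_TOKEN = re.compile(r"\s*(==|!=|<=|>=|[&|^<>()]|\w+)")


def tokenize(src):
    src = src.rstrip()
    tokens, pos = [], 0
    while pos < len(src):
        m = _TOKEN.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def _atom(tokens):
    if not tokens:
        raise SyntaxError("unexpected end of input")
    head, rest = tokens[0], tokens[1:]
    if head == "(":
        tree, rest = _expr(rest)
        if not rest or rest[0] != ")":
            raise SyntaxError("expected ')'")
        return tree, rest[1:]
    if head == ")" or head in BITWISE or head in COMPARE:
        raise SyntaxError(f"unexpected token {head!r}")
    return head, rest


def _expr(tokens):
    left, rest = _atom(tokens)
    family = None  # operator family seen so far at this nesting level
    while rest and rest[0] != ")":
        op = rest[0]
        if op in BITWISE:
            op_family = "bitwise"
        elif op in COMPARE:
            op_family = "comparison"
        else:
            raise SyntaxError(f"expected an operator, got {op!r}")
        if family is not None and op_family != family:
            # Punt on precedence: the writer must add parentheses.
            raise SyntaxError(
                f"mixing {family} and {op_family} operators requires parentheses"
            )
        family = op_family
        right, rest = _atom(rest[1:])
        left = (op, left, right)
    return left, rest


def parse(src):
    tree, rest = _expr(tokenize(src))
    if rest:
        raise SyntaxError(f"unexpected token {rest[0]!r}")
    return tree
```

Here parse("x & (y == z)") returns a tree, while parse("x & y == z") raises SyntaxError, so neither the C-style nor the "intuitive" precedence has to be chosen.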

@kripken

kripken commented May 24, 2016

This has 124 commits; I assume it needs to be rebased? It's hard to tell which part is relevant.

@qwertie
Author

qwertie commented May 24, 2016

@kripken: WTF! I swear I only ever committed TextFormat.md, I can't imagine how it decided 13 files changed! I guess I'm not good at git. Sorry.

@sunfishcode

Another way to fix this is with precedence mechanisms.

I'm not sure what you mean precisely, but if the bottom line is that call $foo(i32.rotl $0, 8, 9) is an error, I guess that would solve the problem (what about call $foo(i32.rotl $0, 8), is that also an error?). However, I don't see a sensible way to fit paren-free multi-arg operators into LES. But call $foo((i32.rotl $0, 8), 9) is not better than call $foo(i32.rotl($0, 8), 9) and it's hard for me to think of an expression - with this operator anyway - where omitting parens is a good idea.

Conceptually, the default is not part of the table.

Okay, I suppose it should be visually distinguished from the other cases in some way, e.g. br_table [a, b, c] | default : $index (this is possible in LES, but currently br_table default | [a, b, c] : $index would give a more rational syntax tree - is it important that the default come last?)

br exit (if condition) => $i

I'm confused about what this means. Is $i the value operand of the branch? If so, it needs to go to the left of the condition, because it is evaluated first in the wasm semantics.

Sorry about that! Hmm, okay, then I'll suggest br exit => $i ? condition. Or drop the "cute" stuff in favor of br_if(exit, $i, condition).

$foo(...)

This creates ambiguities with other parts of the grammar.

I don't think it does, as locals and labels cannot be directly called. Anything in particular you're thinking of?

The $ sigil in the plain call case is part of the identifier, not the call, so $$ is awkward as a way to represent call_indirect.

I proposed $$ for call_import, not call_indirect. I wouldn't really say $ is part of the identifier - isn't it a "this is not an opcode or keyword" marker, which isn't strictly needed and doesn't exist in the binary format? Doubling-up on $$ doesn't seem awkward to me.

Also, in optimized code where trivial calls have been inlined, highlighting the calls that remain is interesting.

A narrow use case, but okay. Would you be willing to accept a different punctuation mark for calls? Since LES has prefix operators but not keywords, call ends up looking clumsy (`call` $fnname(...)).

any UTF-8 string as an identifier

WebAssembly has arbitrary-byte-sequence identifiers, so the text format will need to support this.

Hmm, that's a challenge. Certainly, any reasonable text format should support UTF-8, so that identifiers like "الدين" won't come out as \xCA\xB8\xCB\xB2\xCA\xBF or whatever it is. So I'm thinking identifiers need to be treated as UTF-8 whenever possible, but the identifier syntax could have a special escape intended only for producing invalid UTF-8, something like... \?81, say. The biggest challenge here is probably in the implementation side - JS, Java, C#, etc. all use UTF-16, and it's not immediately obvious how to reliably round trip from "UTF-8-with-some-invalid-bytes" to UTF-16 and back - and without a clear documented way to do that, I'm sure some consumers would get it wrong. Maybe we can do something with invalid surrogate pairs?

I'm reluctant to use @ for this. As one of the few common keyboard keys not claimed by a C/JS/etc. operator, @ is extraordinarily valuable.

I chose @ because this is used for a similar purpose in C# where e.g. @if is an identifier. One way or another, doesn't a punctuation mark have to be sacrificed for escaping identifiers? AFAIK, the only precedents for this purpose are @ (C#) and \ (C/JS strings). Whichever one is used for escaping identifiers, the other is freed up for another purpose. Side note: @ being used for escapes doesn't necessarily mean it can't be used for something else also, e.g. LES currently uses it to denote attributes too (which are written @[foo(...)] since @foo(...) would merely be an ordinary function call.)

The tradeoff between @ and \ is that @ would go on the front, as in @`that's so neat!` while \ (by analogy with C strings) should escape individual characters as in that's\ so\ neat!

In the original design, all the "rare" single-arg opcodes that don't use a special operator, such as i32.popcnt $x, do not need parentheses, but LES currently requires additional syntax: either i32.popcnt($x) or `i32.popcnt` $x...

I don't understand what you're proposing here.

Again, it's the no-keywords property of LES - due to which i32.popcnt $x doesn't, in general, parse successfully, but i32.popcnt($x) is OK. I might be able to massage the grammar of LES so that i32.popcnt $x does parse, if it's important.

The return type syntax has parens because it's anticipating a future with multiple return values.

Uhhh.... I know. 😕

Conforming to LES' complex constraints is not a priority for this text format.

You've read #697 I hope? It sounds like the "out of scope!" argument that I anticipated. Do you really feel that some minor syntactic changes aren't worth it for potentially large benefits outside wasm itself?

@qwertie
Author

qwertie commented May 24, 2016

@kripken Oh I see what's wrong, somehow sunfishcode:master ended up as the base for the merge. Very strange because I remember picking the correct branch and seeing a perfectly reasonable-looking diff when I made the PR.

@sunfishcode
Owner

Conforming to LES is not currently a priority for me in this experiment. To your remaining concerns:

  • The br/br_if situation is definitely up in the air. I'll think about your suggestions.
  • It's actually possible that call_import may get merged into call in wasm, so I'm deferring thinking too much about it for now :-).
  • You're right that removing call from calls wouldn't necessarily make it unambiguous. There have been a few variations of this experiment in discussion, and it is in some of them. I don't think many of WebAssembly's banner use cases are reasonably considered "narrow" ;-), but I will consider your suggestion.
  • \ is a reasonable escape character. I don't have a full plan here though.
  • Yes, I read your LES proposal. If it is accepted by the group, I'll adapt accordingly.

@sunfishcode sunfishcode force-pushed the master branch 8 times, most recently from 8701525 to b566a2f Compare June 23, 2016 22:03
@qwertie
Author

qwertie commented Aug 2, 2016

The biggest challenge here is probably in the implementation side - JS, Java, C#, etc. all use UTF-16, and it's not immediately obvious how to reliably round trip from "UTF-8-with-some-invalid-bytes" to UTF-16 and back - and without a clear documented way to do that, I'm sure some consumers would get it wrong. Maybe we can do something with invalid surrogate pairs?

I checked whether someone had solved this already, and found this:

D) Emit a malformed UTF-16 sequence for every byte in a malformed UTF-8 sequence

All the previous options for converting malformed UTF-8 sequences to UTF-16 destroy information. This can be highly undesirable in applications such as text file editors, where guaranteed binary transparency is a desirable feature. (E.g., I frequently edit executable code or graphic files with the Emacs text editor and I hate the idea that my editor might automatically make U+FFFD substitutions at locations that I haven't even edited when I save the file again.)

I therefore suggested 1999-11-02 on the unicode@unicode.org mailing list the following approach. Instead of using U+FFFD, simply encode malformed UTF-8 sequences as malformed UTF-16 sequences. Malformed UTF-8 sequences consist exclusively of the bytes 0x80 - 0xff, and each of these bytes can be represented using a 16-bit value from the UTF-16 low-half surrogate zone U+DC80 to U+DCFF. Thus, the overlong "K" (U+004B) 0xc1 0x8b from the above example would be represented in UTF-16 as U+DCC1 U+DC8B. If we simply make sure that every UTF-8 encoded surrogate character is also treated like a malformed sequence, then there is no way that a single high-half surrogate could precede the encoded malformed sequence and cause a valid UTF-16 sequence to emerge.

This way 100% binary transparent UTF-8 -> UTF-16 -> UTF-8 round-trip compatibility can be achieved quite easily.
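For anyone who wants to try this mapping, Python's built-in surrogateescape error handler (PEP 383) uses the same U+DC80..U+DCFF range, so the round trip can be sketched with the standard codecs. The two helper names are made up for this example; the surrogatepass handler is only there to let the lone surrogates through the UTF-16 codec.

```python
def utf8_to_utf16_units(data):
    """Decode possibly malformed UTF-8 into a list of UTF-16 code units.

    Every byte that is not part of a valid UTF-8 sequence becomes one
    code unit in the low-surrogate range U+DC80..U+DCFF.
    """
    text = data.decode("utf-8", errors="surrogateescape")
    utf16 = text.encode("utf-16-le", errors="surrogatepass")
    return [int.from_bytes(utf16[i:i + 2], "little")
            for i in range(0, len(utf16), 2)]


def utf16_units_to_utf8(units):
    """Inverse of utf8_to_utf16_units: restore the original bytes."""
    utf16 = b"".join(u.to_bytes(2, "little") for u in units)
    text = utf16.decode("utf-16-le", errors="surrogatepass")
    return text.encode("utf-8", errors="surrogateescape")


# The overlong "K" from the quoted example, after a normal "K".
units = utf8_to_utf16_units(b"K\xc1\x8b")
print([hex(u) for u in units])
```

The overlong bytes 0xc1 0x8b come out as U+DCC1 U+DC8B and convert back to the same two bytes. Python's decoder also rejects UTF-8-encoded surrogates, so those bytes are escaped one by one as well.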

In the text format, this might involve the following additional rules:

  1. The text format should consist entirely of valid UTF-8 characters.
  2. Invalid bytes in strings and identifiers could be represented as \?yz or \iyz (? for "unknown encoding" or i for "invalid UTF-8") where yz is a byte in hex.
  3. Surrogate characters (\uD800...\uDFFF) would be disallowed.

The third rule seems optional; if surrogate characters are permitted in the text format, it should probably produce those characters in UTF-8 form, which has the following implications:

  • if the text-format lexer produces UTF-16 output, it should encode the 16-bit code unit in UTF-8 form and then transform it to UTF-16 as described above, such that the output is three 16-bit code units.
  • the surrogate characters wouldn't round-trip from text-to-binary-to-text, but they would still round-trip from binary-to-text-to-binary.

Overlong UTF-8 characters would be treated the same as invalid surrogates.
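A minimal sketch of rule 2, using the \?yz spelling and assuming that a literal backslash is written as \\ (that second escape is my assumption, not something settled in this thread). The function names are hypothetical. Rule 3 falls out for free here because Python's strict UTF-8 encoder refuses surrogate characters.

```python
import re


def escape_name(raw):
    """Render arbitrary name bytes as text, escaping invalid UTF-8 bytes."""
    out = []
    for ch in raw.decode("utf-8", errors="surrogateescape"):
        code = ord(ch)
        if 0xDC80 <= code <= 0xDCFF:
            # An invalid byte, smuggled through as a low surrogate.
            out.append(f"\\?{code - 0xDC00:02x}")
        elif ch == "\\":
            out.append("\\\\")
        else:
            out.append(ch)
    return "".join(out)


_ESCAPE = re.compile(r"\\(?:\?([0-9a-fA-F]{2})|(\\))")


def _plain(segment):
    if "\\" in segment:
        raise ValueError("unknown escape sequence in name")
    return segment.encode("utf-8")  # strict: rejects surrogate characters


def unescape_name(text):
    """Inverse of escape_name: recover the original name bytes."""
    out = bytearray()
    pos = 0
    for m in _ESCAPE.finditer(text):
        out += _plain(text[pos:m.start()])
        out += bytes([int(m.group(1), 16)]) if m.group(1) else b"\\"
        pos = m.end()
    out += _plain(text[pos:])
    return bytes(out)
```

Valid UTF-8 passes through untouched, so a name stays readable in the text format, and only the stray bytes pick up the \?yz form.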

qwertie added a commit to qwertie/ecsharp that referenced this pull request Jul 5, 2021
I was actually planning to encode bytes as 0xDB80..0xDBFF because I
realized this would be better than the range originally proposed
(0xDC80..0xDCFF) because 0xDB80..0xDBFF is used extremely rarely.
I was in the process of making this change and noticed a couple of bugs
and less-than-ideal comments, so I fixed those.

But then I noticed that the encoding idea wasn't originally mine:
  sunfishcode/design#3 (comment)

So, if anyone else were using the same idea, they'd probably choose to
use the original range. Therefore, I should use the same range despite
it not being optimal.

Bug fix: UString.TryDecodeAt() should return an unpaired surrogate
  unchanged but sometimes returned -1.
Bug fix: EscapeCStyle should escape everything in the
  surrogate range in EscapeC.UnicodeNonCharacters mode.

Bug fixes weren't specifically tested
@sunfishcode
Owner

I am no longer working on a Wasm text format proposal.

@sunfishcode sunfishcode closed this Dec 6, 2024
3 participants