A JS-style Text Format #704
experimental, debugging, optimization, and testing of the spec itself.
* Working with WebAssembly code directly for reasons including pedagogical,
experimental, debugging, profiling, optimization, and testing of the spec
itself.

The text format is equivalent and isomorphic to the [binary format](BinaryEncoding.md).
Is this statement still true given that function names may have to be converted from unrestricted byte sequences?
Yes. Arbitrary bytes are escaped so that they can be represented as valid unicode characters. The details are in the full grammar here.
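To illustrate why escaping keeps the text and binary forms isomorphic, here is a minimal sketch of a reversible byte-to-text escaping. The scheme (backslash followed by two hex digits) and the helper names `escape_name` and `unescape_name` are assumptions made for this example only; the actual rules are whatever the linked grammar specifies.

```python
def escape_name(raw: bytes) -> str:
    """Encode arbitrary bytes as printable ASCII text (hypothetical scheme)."""
    out = []
    for b in raw:
        ch = chr(b)
        # Printable ASCII passes through, except the escape character itself
        # and the double quote; everything else becomes \hh.
        if 0x20 <= b < 0x7F and ch not in '\\"':
            out.append(ch)
        else:
            out.append(f"\\{b:02x}")
    return "".join(out)


def unescape_name(text: str) -> bytes:
    """Invert escape_name, recovering the original byte sequence."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == "\\":
            # A backslash is always followed by exactly two hex digits.
            out.append(int(text[i + 1:i + 3], 16))
            i += 3
        else:
            out.append(ord(text[i]))
            i += 1
    return bytes(out)


if __name__ == "__main__":
    name = b"caf\xc3\xa9\x00\xff"
    escaped = escape_name(name)
    print(escaped)
    print(unescape_name(escaped) == name)
```

Because every byte value maps to exactly one escaped form and back, no information is lost in either direction, which is the property the question above is asking about.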
$a = $a + -1;
br_if ($a >s 1) $loop;
}
$end:
Just to make sure I understand, this `$end` label creates a block that starts just before the `$x = 1`?
If this block started after `$x = 1` then would it be presented with explicit curly-braces?
This turned out to be a mistake. In the fac.wast file this file is translated from, the block starts after the set_local. I've now updated the example to accurately reflect where the block starts.

| Name | Syntax | Examples |
| ---- | ---- | ---- |
| `block` | `{` … *label*: `}` | `{ br $a; a: }` |
Having the label at the end is more sane, I agree. Any concern that it is differing from the equivalent JavaScript?
a: {
break a;
}
Yes, there is a concern. Accompanying it is the concern that large and deeply nested block structures, with extensive use of labels, are much more common in wasm than they are in typical JS. Under such conditions, having the label "where the branch goes" improves readability because one doesn't have to jump far away to find the top of the block and then jump back down to find the bottom.
Yeah, I think you're right, it's better to have labels where they go.
Though I just thought of another concern; shadowing a label name becomes a bit harder to detect because you have to look for the closest nested in either direction, e.g.
{
loop $a {
...
br $a;
}
$a:
...
}
But one could argue that this is like going to the doctor and saying "it hurts when I do this"... :)
Indeed; "don't do that" ;-). Alternatively, one might imagine prohibiting shadowing loop labels with block labels or vice versa. My guess is that we can address this at a higher level.
This is a little gentler.
Arithmetic operators use C/JS-like infix and prefix notation.

Add, sub, mul, div, rem, and, or, xor, shl, and shr operators use
`+`, `-`, `*`, `/`, `%`, `&`, `|`, `^`, `<<`, and `>>`, respectively.
Could `^` as `xor` cause confusion?
Is there a reason why it would cause confusion? (Assuming that the majority of people are familiar with C or JavaScript.)
There is always the option to offer an alternative to the infix operators in the form of a function call, similar to e.g. `i32.min`, so the xor operator would have `i32.xor` in addition to `^`.
Js users will be familiar with it as the exponentiation operator
> Js users will be familiar with it as the exponentiation operator

`^` is defined in JS as bitwise xor (see http://www.ecma-international.org/ecma-262/6.0/#sec-binary-bitwise-operators)
Also, explain the interaction with the unary negate operator.
@drom I guess it means a display like assembler code, one operator per line, etc. I strongly expect that this community group expects an expression-based language, and that they might accept display as a linear list of opcodes for the MVP, but only so long as there is a clear plan and a well-articulated expression-based text format; a proof-of-concept implementation would make this clear. People might be prepared to defer largely bike-shedding text format issues until after the MVP, but they know the difference between a linear stack machine code and an expression-based language. I don't think the core issues here can be deferred until after the MVP; rather, there are many issues that constrain the language design and are core issues.
If there are pop's in the language then it is not expression based. Typical expression blocks discard excess values of statements, so a special operator is needed to build values.
I don't think these examples demonstrate the real problems, but based on these I assume that this stack code allows operators that return no values to appear anywhere, so it removes that constraint. This would at least allow values to be stored in local variables if they need to be consumed in a pattern that does not fit the stack order. I think
The problem occurs when values are consumed out of stack order:
For this out-of-order consumption we need Can people think of any other good examples that demonstrate stack machine code that will not present well in an expression based language? |
@drom Thank you, there are some ideas from stack machines such as Fourth that I have pondered. For example the Fourth There might be a number of ways to shuffle stack elements around to consume them in the order required with Fourth-style operators, compared to simply being able to encode access to the desired element. Being able to I think there was a point in wasm development, with the AST, in which it was not necessary to compare the stack elements at control flow merge points, rather just the local variables. If we allow We seem to have a good alternative in the prior approach, of just building up values within blocks on the values stack, and not allowing them to be changed. They could be referenced by index, rather than shuffling them around. Downside is a deeper decode-time type stack, but perhaps it would be worth it. In any case we don't yet have
@JSStats I'm surprised you moved your comments here, as it was my impression that you were not talking mainly about the text format, but about the more fundamental idea (applicable to the binary format) of whether Wasm should stick with a more traditional AST. It certainly affects the text format, of course, but as I was saying in #697, it doesn't seem to require fundamental changes in the text syntax.
I'm sure the backers of the new stack machine (whoever they are) think it's "tenable", and so do I. I imagine they will add a new operator or two, if data shows that such operators would provide a large benefit. But I'll bet avoiding locals won't be found important enough for MVP, even if multi-value support were added, which itself seems unlikely. That said, I personally prefer an expression-based language and, if I had the opportunity, would have loved to look for alternate ways of achieving similar efficiency gains in the traditional AST regime. |
@JSStats The Forth (not Fourth) language has a lot to offer to stack machine designers. In fact it influenced most virtual stack machines, like Java bytecode and .NET CIL. The syntax is a bit unusual, but the concept of stack manipulation words: Also, as @flagxor pointed out here: #261 (comment), Forth has a very powerful concept of structural control flow operations.
Here is basic comparison table https://github.com/drom/wast-forth/blob/master/stack-manipulation.csv |
@drom I think wasm is still ok wrt 'structural control flow operations'. Perhaps it could do more to support loop analysis, to make the stride explicit etc. But what about structured local variables, these are important in languages too? Your list ignores the It's interesting that CIL is restricted to Can someone answer if wasm decoders will be required to record and compare the values stack for changes at every control flow path merge point, and what is the cost? In the expressionless encoding experiment this was a show stopper. There still remains the possibility of canonicalizing on a different code, to consider the stack-machine code as just a compressed encoding of the canonical code, and to present the canonical code in a structured manner. But then it is not possible in general to specify the encoding in the canonical code, just as it is not expected that the choices of encoding that a compressor uses be expressed in the source input to the compressor. E.g. use a functional form of SSA as the wasm language that developers actually work with.
@JSStats I have added But: you have to limit |
@drom Thank you, and good to hear. Here's an example that I am not sure how to handle?
In structured code, with the prior block semantics, the unused definitions are just discarded, so this could have been presented in a structured format e.g.
Adding |
@JSStats let me annotate your first example: [ stack notation ]
i32.const 1 [ c1 ]
i32.const 2 [ c1 c2 ]
pick 0 [ c1 c2 c2 ]
pick 1 [ c1 c2 c2 c2 ]
i32.add [ c1 c2 (c2+c2) ]
pick 0 [ c1 c2 (c2+c2) (c2+c2) ]
pick 1 [ c1 c2 (c2+c2) (c2+c2) (c2+c2) ]
i32.mul [ c1 c2 (c2+c2) ((c2+c2)*(c2+c2)) ]
i32.add [ c1 c2 ((c2+c2)+((c2+c2)*(c2+c2))) ]
// Gets messy here, need to keep the result but drop slots 0 and 1??
???? why you were caring |
@JSStats your second example:
{
    const $c0 = 1;
    const $c1 = 2;
    $c0 + $c1 + $c0 * $c1
}
assuming correct operation precedence: in stack machine:
            [ stack notation ]
i32.const 1 [ c0 ]
i32.const 2 [ c0 c1 ]
pick 1 [ c0 c1 c0 ]
pick 1 [ c0 c1 c0 c1 ]
i32.mul [ c0 c1 (c0*c1) ]
i32.add [ c0 (c1+(c0*c1)) ]
i32.add [ (c0+(c1+(c0*c1))) ]
In this particular case you don't really need to drop data, just use it.
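The annotated trace above can be reproduced mechanically. The following is a minimal sketch, assuming that `pick n` copies the n-th entry counted from the top of the stack (0 being the top), which is how the annotations in this thread read. The function name `run_symbolic` and the tuple encoding of instructions are made up for this illustration; operands are kept as strings so the output matches the bracketed stack notation.

```python
def run_symbolic(program):
    """Run a tiny stack program on symbolic operands.

    Returns a list holding a snapshot of the stack after each instruction.
    """
    stack = []
    trace = []
    for op, arg in program:
        if op == "const":
            stack.append(arg)
        elif op == "pick":
            # Assumed semantics: copy the arg-th entry from the top (0 = top).
            stack.append(stack[-1 - arg])
        elif op in ("add", "mul"):
            rhs = stack.pop()
            lhs = stack.pop()
            sign = "+" if op == "add" else "*"
            stack.append(f"({lhs}{sign}{rhs})")
        else:
            raise ValueError(f"unknown instruction: {op}")
        trace.append(list(stack))
    return trace


# The program from the comment above, with the constants named c0 and c1.
PROGRAM = [
    ("const", "c0"),
    ("const", "c1"),
    ("pick", 1),
    ("pick", 1),
    ("mul", None),
    ("add", None),
    ("add", None),
]

if __name__ == "__main__":
    for (op, arg), snapshot in zip(PROGRAM, run_symbolic(PROGRAM)):
        label = op if arg is None else f"{op} {arg}"
        print(f"{label:<10} [ {' '.join(snapshot)} ]")
```

The last snapshot holds a single entry, confirming that this particular ordering consumes every stack slot without needing an explicit drop.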
Tried to give $c0 and $c1 absolute indexes for the example. With relative offsets:
So the challenge is how to keep it structured, while informing the decoder that $c0 and $c1 are no longer used (or even better at their last use), and can both of these constraints be met? I guess
@drom Interesting example, but it did require re-ordering the operations, which broke the intent of the example. Let's assume that the i32.add and i32.mul are actually function calls and that they need to be called in the order given in the example.
@JSStats I understand the challenge. Here are 3 popular Forth solutions:
|
@JSStats compiling following ASM.JS into WASM using binaryen:
test1asm.js
function test1 (c1, c2) {
    c1 = c1 |0;
    c2 = c2 |0;
    return ((c1 + c2) + (c1 * c2));
}
wasm (11 byte):
stack machine (9 byte)
(func $test1
    pick 1 [ c1 c2 c1 ]
    pick 1 [ c1 c2 c1 c2 ]
    i32.add [ c1 c2 (c1+c2) ]
    pick 2 [ c1 c2 (c1+c2) c1 ]
    pick 2 [ c1 c2 (c1+c2) c1 c2 ]
    i32.mul [ c1 c2 (c1+c2) (c1 * c2) ]
    i32.add [ c1 c2 ((c1+c2) + (c1 * c2)) ]
    nip [ c1 ((c1+c2) + (c1 * c2)) ]
    nip [ ((c1+c2) + (c1 * c2)) ]
)
test2asm.js
function test2 (c1, c2) {
    c1 = c1 |0;
    c2 = c2 |0;
    return (c1 + (c2 + (c1 * c2)));
}
with binaryen today I am getting the following code (11 byte):
(func $test2 (param $0 i32) (param $1 i32) (result i32)
    (i32.add
        (get_local $0)
        (i32.add
            (get_local $1)
            (i32.mul
                (get_local $0)
                (get_local $1)
            )
        )
    )
)
good stack machine code would be (5 byte):
(func $test2
    pick 1 [ c0 c1 c0 ]
    pick 1 [ c0 c1 c0 c1 ]
    i32.mul [ c0 c1 (c0*c1) ]
    i32.add [ c0 (c1+(c0*c1)) ]
    i32.add [ (c0+(c1+(c0*c1))) ]
)
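As a sanity check on the two stack programs in this comment, here is a small sketch of an interpreter for them. It is not part of any wasm proposal: `pick` and `nip` are given the semantics implied by the annotations above (`pick n` copies the n-th entry from the top, `nip` removes the entry just below the top), function arguments are assumed to start on the stack, and arithmetic is assumed to wrap to signed 32 bits. The names `run`, `to_i32`, `TEST1` and `TEST2` are hypothetical.

```python
def to_i32(x):
    """Wrap a Python int to a signed 32-bit value."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def run(program, args):
    """Execute a stack program with the arguments already on the stack."""
    stack = [to_i32(a) for a in args]
    for op, *operand in program:
        if op == "pick":
            # Copy the n-th entry from the top (0 = top).
            stack.append(stack[-1 - operand[0]])
        elif op == "nip":
            # Drop the entry just below the top, keeping the top.
            del stack[-2]
        elif op == "i32.add":
            rhs = stack.pop()
            lhs = stack.pop()
            stack.append(to_i32(lhs + rhs))
        elif op == "i32.mul":
            rhs = stack.pop()
            lhs = stack.pop()
            stack.append(to_i32(lhs * rhs))
        else:
            raise ValueError(f"unknown instruction: {op}")
    return stack


# (c1 + c2) + (c1 * c2), cleaning up the arguments with two nips.
TEST1 = [("pick", 1), ("pick", 1), ("i32.add",),
         ("pick", 2), ("pick", 2), ("i32.mul",),
         ("i32.add",), ("nip",), ("nip",)]

# c1 + (c2 + (c1 * c2)), which consumes the arguments in place.
TEST2 = [("pick", 1), ("pick", 1), ("i32.mul",),
         ("i32.add",), ("i32.add",)]

if __name__ == "__main__":
    print(run(TEST1, [3, 4]))
    print(run(TEST2, [3, 4]))
```

Both programs leave exactly one value on the stack. The first needs the two trailing `nip` instructions only because its evaluation order leaves the original arguments underneath the result.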
@drom Ok, thank you for the solutions. I also see potential in changing the function arguments to be on the stack and accessed by But it's not clear how your 'end game' could be represented in an expression. Also I think the following would inform the decoder earlier that c1 and c2 are consumed, which might make a difference to register pressure in a baseline compiler:
Another 'move' in this 'chess game' is that we can invalidate a stack slot and define it to be a validation error to consume it. Let
|
@JSStats yes, I like the |
Attempting to sketch a formatting rule with |
@JSStats is the "sketch" you are doing, for validation phase or stack scheduling? |
@JSStats in Forth |
@drom Just trying to understand if the structure can be preserved to keep a 'familiar' text format. If both Problem with Here are some more examples that illustrate some more challenging corner cases for the text formatter. I have not found any show stoppers yet, but some to consider would be welcomed.
|
@JSStats I was thinking about the name for the |
@JSStats All this |
@drom Could move discussion specific to the @flagxor did open this general issue for discussion here, and the use case being discussed is retaining the structure of the text format (a JS-style text format) rather than giving up as some implementers have proposed. No one has objected yet! |
@sunfishcode any idea how your text format proposal would be affected by new "stack machine" trend? |
@drom With the current stack-machine changes, our text format will require only a few additional features; possibly the addition of syntax for a "first" expression (similar to block, but returns the first value rather than the last) and possibly also syntax for a scoped or restricted "let" expression. I don't know what the actual syntax for these will be yet, but it's mostly just a matter of aesthetics :-). As @flagxor mentioned above, this proposal has been removed from consideration for standardization in wasm, at least for now. The text format proposed here will continue to be developed at https://github.com/mbebenita/was and in Firefox. |
This proposes an official text format for WebAssembly, aimed at browsers to use in "View Source", debugging, and related tooling. It uses a JavaScript-like syntax for readability and familiarity on the Web, though it differs from JS in several respects, as it aims to reflect the underlying WebAssembly language.
This proposal is meant to serve as a beginning. We'd like to establish this as a concrete place to start, so that we can then iterate, as we did with BinaryFormat.md.
You can try out a prototype of this proposal yourself in Firefox Nightly, for example by playing the AngryBots demo with the debugger open and examining the wasm file in the debugger.
Here's a screenshot of it in action.
A more complete description of the grammar and a parser implementation are available here. If this proposal is accepted, we'd like to move this repository under the WebAssembly GitHub organization to serve as the interim spec for the text format during the initial discussion.