Move to post-order encoding of syntax trees #611

naturaltransformation · 2016-03-16T00:42:56Z

I would like to work through the rationale and explanation of a post-order encoding of syntax trees that would simplify robust decoding. The goal is to avoid having to lean on recursive descent or shift-reduce parsers.

jfbastien · 2016-03-16T00:47:48Z

BinaryEncoding.md

+
+####Examples:
+
+* Given a simple AST node: I32Add(left: AstNode, right: AstNode)


Can you use code quotes?

ghost · 2016-03-16T02:30:40Z

Thank you for opening a discussion. Here's a show-stopper I see for post-order encoding. Humble apologies in advance if I lack technical understanding of the proposal.

Let an operator or function be defined that accepts one value and returns none, let it be named $pop. Consider its use in the following AST:

(func
  (call $pop (i32.const 1))
  (call $pop (i32.const 2))
  (call $pop (i32.const 3)))

Let the following be its post-order encoding:

i32.const 1
call pop
i32.const 2
call pop
i32.const 3
call pop

Now how are you going to validate post-order that the following is invalid?

i32.const 3
i32.const 2
i32.const 1
call pop
call pop
call pop

Pre-order has some nice properties for the wasm AST design. Each sub-expression can be validated on it's own. It might cost a shift-reduce algorithm, but can you propose a post-order validation algorithm that is simpler?

binji · 2016-03-16T03:09:14Z

Seems straightforward if you keep a type stack and push a void type in that case:

valid:
op            type stack
i32.const 1   i32
call pop      void
i32.const 2   void i32
call pop      void void
i32.const 3   void void i32
call pop      void void void

invalid:
i32.const 3   i32
i32.const 2   i32 i32
i32.const 1   i32 i32 i32
call pop      i32 i32 void
call pop      error, pop needs an i32 param, not void
call pop

Maybe there's a nicer way, but this seems reasonable to me.

sunfishcode · 2016-03-16T05:17:39Z

A possible variant of that rule: In the MVP, where return value counts are limited to 0 or 1 and there're no projection/destructure operators: after seeing an expression with 0 return types, discard all values from the top of the stack that are defined in the current block (or block-like thing, etc.).

Discussion: If those i32.const nodes were instead nodes with side effects, evaluating the sequence as a stack machine would observe the side effects in a different order than evaluating it as an AST. For example, consider code like this:

  foo(red())
  bar(green())

The post-order encoding might look like this:

  push red()
  foo(pop)
  push green()
  foo(pop)

However, what if someone encoded this?

  push red()
  push green()
  foo(pop)
  foo(pop)

A stack machine would evaluate red() first, then green(). And AST approach would build a tree, connecting green() to the first foo call and red() to the second foo call, so it would evaluate green() first, then red().

The rule above eliminates those cases where the two evaluation strategies would differ, because it requires that if a node has no result value, it doesn't appear between a node that does have a result value and that node's user.

Looking ahead, there are more questions, but with a rule like what's described here, we wouldn't need to answer all of them right away.

ghost · 2016-03-16T06:46:51Z

@binji Yes, thank you. So the entries on the parse stack are an object representing the AST, even one with no result values, and it seems natural that there would also be only one entry for multiple values - that's one matter cleared up, thanks.

@sunfishcode I don't think that block-zero-value optimization would work as some operators consume zero value ast nodes, or might? For example

(block (i32.add (i32.const 1) (br 5 (nop)))

block-start
  i32.const 1
  nop   <<< can't discard the stack above??
  br 5
  i32.add
block-end

I have another query about the top level expressions of a block or block-like operator. The AST defines the results of each expression to be discarded, except the last which is returned from the block. With a post-order encoding it would not know the end of each of each of these expressions while it is parsing, rather it seems it would only know the end of each of these top-level expressions at the end of the block when it would check the AST refs remaining on the stack between the start of the block and the end and assign each to one block top level expression?

That seems to explain why a block needs and end marker and can not use a pre-order count of the number of top level block expressions - because it can not count them while parsing them?

I guess the block start marker is needed to support the one pass SSA conversion, otherwise the block could be post-order encoded with an immediate count of the number of expressions to consume from the top of the stack?

With all the block top-level expression being consumed only at the end of the block, it might be a property of this design that the expression stack could grow as large as the block. Might the stack be larger than for a pre-order encoding which could discard values earlier. Might it lead to producers using dead operators to consume values, or to limiting block sizes and breaking them up.

ghost · 2016-03-16T09:20:25Z

BinaryEncoding.md

+  - Recursively write the first operand of the outer Call: 0 
+  - Recursively write the second operand: CALL_START 1 2 I32CONST 3 I32CONST CALL_END
+  - Recursively the write third operand: 4 I32CONST
+  - Write the opcode CALL_END 


Could some rationale be given for the need to have a call_start marker? I can see the need to know the start of control flow forks, for the single-pass SSA conversion, and for blocks to know the break depths and perhaps other reasons, but no need comes to mind for calls and it seems adequate to just post-order encode them?

rossberg · 2016-03-16T10:59:16Z

On 16 March 2016 at 03:30, JSStats [email protected] wrote:

Thank you for opening a discussion. Here's a show-stopper I see for
post-order encoding. Humble apologies in advance if I lack technical
understanding of the proposal.

Let an operator or function be defined that accepts one value and returns
none, let it be named $pop. Consider its use in the following AST:

(func
(call $pop (i32.const 1))
(call $pop (i32.const 2))
(call $pop (i32.const 2)))

Let the following be its post-order encoding:

i32.const 1
call pop
i32.const 2
call pop
i32.const 3
call pop

Now how are you going to validate post-order that the following is invalid?

i32.const 3
i32.const 2
i32.const 1
call pop
call pop
call pop

Because it does not parse. There is no problem. It's just a variant of RPN.
Any old parser can handle your examples with a grammar like

expr ::= i32.const num | expr call id
func ::= expr*

There is no ambiguity, and there is no problem syntactically detecting the
latter example as malformed.

Moreover, there already is a prototype decoder for V8 (
https://codereview.chromium.org/1694303002/) and a WIP one for the spec (
https://github.com/WebAssembly/spec/blob/binary/ml-proto/spec/binary.ml).
Both are currently out of sync, but after GDC we will finalise them.

AndrewScheidecker · 2016-03-16T11:31:25Z

Is that example invalid? If the deserializer is just a stack machine then it should be equivalent to the other ordering.

There are a few things that are harder to validate with post-order, though. For example, a branch needs to know the argument type of a target, which is only known from the use of the node introducing the target:

(iXX.add A... (block $break
    (br $break B...)))

In post-order:

A...
begin_block $break
    B...
    br $break
end_block
iXX.add

Bottom up validation here would have to derive a type for the block from its tail expression, or branches that target it. You could make it work, but maybe it would be easier to just explicitly declare the type in the begin_block.

rossberg · 2016-03-16T11:50:23Z

On 16 March 2016 at 12:31, Andrew Scheidecker [email protected]
wrote:

Is that example invalid? If the deserializer is just a stack machine then
it should be equivalent to the other ordering.

We are specifying a binary format, not a deserializer. The stack machine is
merely a detail of one possible implementation. So yes, that example is
malformed.

There are a few things that are harder to validate with post-order, though.

For example, a branch needs to know the argument type of a target, which is
only known from the use of the node introducing the target:

(iXX.add A... (block $break
(br $break B...)))

In post-order:

A...
begin_block $break
B...
br $break
end_block
iXX.add

Bottom up validation here would have to derive a type for the block from
its tail expression, or branches that target it. You could make it work,
but maybe it would be easier to just explicitly declare the type in the
begin_block.

Yes, if you do decoding and validation in a single pass, then post-order
forces you to derive types bottom-up, whereas pre-order allows top-down.
It's a slight complication for the implementation, but doable. The cost of
requiring type annotations on every block might be pretty substantial.

rossberg · 2016-03-16T11:54:44Z

@AndrewScheidecker, ah, pardon me, now I see what you mean. Yes you are right, the latter example is not actually invalid. Instead, it is the encoding of a different program, namely

(func
  (i32.const 3)
  (i32.const 2)
  (call $pop (call $pop (call $pop (i32.const 1))))
)

AndrewScheidecker · 2016-03-16T12:03:48Z

You were right the first time: I meant that you could define the deserializer to make both orderings equivalent. @sunfishcode's comments on evaluation order apply if the i32.const had side-effects, but I was actually thinking that the serialization order would be the same as evaluation order. It implies you have an additional degree of control over evaluation order that you don't have in an AST, but as you observed, it would require changing evaluation semantics rather than just the binary format.

ghost · 2016-03-16T12:08:19Z

@rossberg-chromium Thank you. I think your answer is basically the same as @binji and I hope I understand it now. The key for me was that it is not a stack of values (not like the JVM) rather a stack of AST node references, and there is always a single result AST node even if it has zero result values. I also understand the v8 implementation a little better: it records the stack_depth at the start of a block and at the end PopUpTo pops all values back to the initial stack_depth returning the top node and discarding the rest.

titzer · 2016-03-16T12:15:10Z

The postorder prototype that I developed for V8 works exactly as Andreas
described; it maintains a stack of AST nodes and does bottom-up
verification. The stack contains types for each AST node. Even nodes that
have type "void" are pushed onto the stack as well. Blocks require a
bracketed representation, where a pre-order "Block" opcode introduces a
block and an "End" opcode ends a block. That ensures that "Br" opcodes can
always be associated with their blocks in a single pass. An in-order
encoded of if-then and if-then-else also works out quite well, being
efficient to decode and can still support single-pass SSA construction.

Postorder is also better for interpretation, since the serialization order
exactly matches the evaluation order. An efficient in-place interpreter is
possible with the help of an auxilliary map of br/if/else opcodes to
bytecode offsets.

I plan to officially propose a switch to postorder along the lines of my V8
prototype soon.

-B

On Wed, Mar 16, 2016 at 1:08 PM, JSStats [email protected] wrote:

@rossberg-chromium https://github.com/rossberg-chromium Thank you. I
think your answer is basically the same as @binji
https://github.com/binji and I hope I understand it now. The key for me
was that it is not a stack of values (not like the JVM) rather a stack of
AST node references, and there is always a single result AST node even if
it has zero result values. I also understand the v8 implementation a little
better: it records the stack_depth at the start of a block and at the end
PopUpTo pops all values back to the initial stack_depth returning the top
node and discarding the rest.

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub
#611 (comment)

ghost · 2016-03-16T12:28:58Z

@titzer Thank you. One point that is not yet clear to me in the v8 implementation is if it catches an empty stack wrt the stack_depth at the start of the block. E.g. does it detect the following invalid code:

i32.cont 1
block_start
  i32.eqz
block_end

titzer · 2016-03-16T12:36:17Z

On Wed, Mar 16, 2016 at 1:30 PM, JSStats [email protected] wrote:

@titzer https://github.com/titzer Thank you. One point that is not yet
clear to me in the v8 implementation is if it catches an empty stack wrt
the stack_depth at the start of the block. E.g. does it detect the
following invalid code:

i32.cont 1
block_start
i32.eqz
block_end

Ah ha! Thanks for your example. No we don't actually detect that case.
Have to think about that.

—
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#611 (comment)

titzer · 2016-03-24T09:21:44Z

The solution to the "expressions crossing block boundaries" problem is to disallow popping expressions across a block boundary. I.e. every block begins with an "empty" stack.

ghost · 2016-03-24T10:48:07Z

The latest v8 patch https://codereview.chromium.org/1830663002/

titzer · 2016-03-29T13:17:22Z

BinaryEncoding.md

+  * Recursively write the first subnode: `BLOCK_START 2 I32CONST 3 I32CONST BLOCK_END`
+  * Recursively the write second subnode: `4 I32CONST`
+  * Write the opcode `BLOCK_END` 
+



Can you add an example of how high-level if is encoded:

I propose it be encoded as follows:

(expr)
if
(true-exprs)
else
(false-exprs)
end

The else bytecode and clause is optional.

Certainly, I'll get right on it.

titzer · 2016-03-29T13:18:45Z

I was going to start a PR that postorder in its full glory, but I think we can mold this one into the the right shape, or we can land it as is and iterate.

I have a separate PR that proposes bytecodes for the encoding.

naturaltransformation · 2016-03-29T16:00:16Z

@titzer I'm happy and ready to mold this to the consensus design.

titzer · 2016-03-29T16:02:35Z

BinaryEncoding.md

+  * Then write the opcode for `i32.add`, which is sufficient since `i32.add` is a binary opcode.   
+
+* Given an if-expression: `if_else(expr: AstNode, true-exprs: AstNode, false_exprs: AstNode)`
+  * Recursively encode expr


I like it!

Can you also give an example for an if with no else?

sunfishcode · 2016-03-30T17:40:39Z

@titzer's solution works for @titzer's proposal; the solution I mentioned above works in the proposal I am preparing.

sunfishcode · 2016-04-29T22:17:56Z

With #675 now merged, this is done.

sunfishcode · 2016-04-29T22:20:12Z

Or actually, it looks like BinaryEncoding.md still talks about a "Pre-order encoding". Reopening this.

naturaltransformation · 2016-05-04T17:08:51Z

Is there further feedback on this one? Is there perhaps value to adding some additional examples to summarize the above discussion or perhaps just keep the current set of examples?

I would like to work through the rationale and explanation of a post-order encoding of syntax trees that would simplify robust decoding. The goal is to avoid having to lean on recursive descent or shift-reduce parsers.

naturaltransformation · 2016-06-02T15:07:40Z

The BinaryEncoding doc's current explanation of post-order encoding is a good start. This PR does round it out with a coverage of all the exceptions to post-order along with more examples.

naturaltransformation · 2016-09-16T15:54:42Z

It looks like this is a moot point now with all the implementations and other documentation out there, plus the convergence on stack machine.

jfbastien reviewed Mar 16, 2016
View reviewed changes

BinaryEncoding.md

####Examples:

* Given a simple AST node: I32Add(left: AstNode, right: AstNode)

Copy link

Member

jfbastien Mar 16, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use code quotes?

ghost reviewed Mar 16, 2016
View reviewed changes

naturaltransformation force-pushed the patch-11 branch from 4776116 to de4b365 Compare March 16, 2016 15:10

titzer reviewed Mar 29, 2016
View reviewed changes

sunfishcode closed this Apr 29, 2016

sunfishcode reopened this Apr 29, 2016

Move to variant of post-order encoding

d56783c

I would like to work through the rationale and explanation of a post-order encoding of syntax trees that would simplify robust decoding. The goal is to avoid having to lean on recursive descent or shift-reduce parsers.

naturaltransformation added 2 commits June 2, 2016 08:05

update based on @titzer proposal

b96ac2b

ifThen example

0d1b9ae

naturaltransformation force-pushed the patch-11 branch from ae42bf1 to 0d1b9ae Compare June 2, 2016 15:06

sunfishcode modified the milestone: MVP Jul 8, 2016

sunfishcode added the clarification label Jul 12, 2016

naturaltransformation closed this Sep 16, 2016


		####Examples:

		* Given a simple AST node: I32Add(left: AstNode, right: AstNode)

Move to post-order encoding of syntax trees #611

Move to post-order encoding of syntax trees #611

Uh oh!

Conversation

naturaltransformation commented Mar 16, 2016

Uh oh!

jfbastien Mar 16, 2016

Choose a reason for hiding this comment

Uh oh!

ghost commented Mar 16, 2016

Uh oh!

binji commented Mar 16, 2016

Uh oh!

sunfishcode commented Mar 16, 2016

Uh oh!

ghost commented Mar 16, 2016

Uh oh!

ghost Mar 16, 2016

Choose a reason for hiding this comment

Uh oh!

rossberg commented Mar 16, 2016

Uh oh!

AndrewScheidecker commented Mar 16, 2016

Uh oh!

rossberg commented Mar 16, 2016

Uh oh!

rossberg commented Mar 16, 2016

Uh oh!

AndrewScheidecker commented Mar 16, 2016

Uh oh!

ghost commented Mar 16, 2016

Uh oh!

titzer commented Mar 16, 2016

Uh oh!

ghost commented Mar 16, 2016

Uh oh!

titzer commented Mar 16, 2016

Uh oh!

titzer commented Mar 24, 2016

Uh oh!

ghost commented Mar 24, 2016

Uh oh!

titzer Mar 29, 2016

Choose a reason for hiding this comment

Uh oh!

naturaltransformation Mar 29, 2016

Choose a reason for hiding this comment

Uh oh!

titzer commented Mar 29, 2016

Uh oh!

naturaltransformation commented Mar 29, 2016

Uh oh!

titzer Mar 29, 2016

Choose a reason for hiding this comment

Uh oh!

sunfishcode commented Mar 30, 2016

Uh oh!

sunfishcode commented Apr 29, 2016

Uh oh!

sunfishcode commented Apr 29, 2016

Uh oh!

naturaltransformation commented May 4, 2016

Uh oh!

naturaltransformation commented Jun 2, 2016

Uh oh!

naturaltransformation commented Sep 16, 2016

Uh oh!

Uh oh!