Skip to content

Invalid unreachable loop in text format #1541

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
kripken opened this issue May 10, 2018 · 14 comments
Open

Invalid unreachable loop in text format #1541

kripken opened this issue May 10, 2018 · 14 comments

Comments

@kripken
Copy link
Member

kripken commented May 10, 2018

E.g.

(module
 (func $0
  (drop
   (block $label$1 (result i32)
    (loop $label$2
     (unreachable)
    )
   )
  )
 )
)

The loop is unreachable, but we need to emit an unreachable after the loop too, for the block, wabt says

w.wast:6:7: error: type mismatch in block, expected [i32] but got []

Not sure why this is just a bug in the text format - the binary for it is fine.

@MaxGraey
Copy link
Contributor

MaxGraey commented May 10, 2018

Hmm, what if just remove return result for such kind of code? Like:

(module
 (func $0
   (block $label$1
    (loop $label$2
     (unreachable)
    )
   )
 )
)

Or this sintetic example?

Because this doesn't depend on unreachable. Current code behave (failed) the same:

(module
 (func $0
  (drop
   (block $label$1 (result i32)
    (loop $label$2
     (nop)
    )
   )
  )
 )
)

@kripken
Copy link
Member Author

kripken commented May 10, 2018

It's true that removing the result would make this example work, but what would the new rules be for when to do so - how would that fit in the current typing rules? I'm not sure offhand how that could be done, and this is tricky stuff with many corner cases I'm afraid...

In general, if there isn't a simple fix for something like this, it may just be something we document as a text format difference between binaryen and wat - there are already a few in the list. While we need binary format support to be perfect - that's the point of binaryen :) - the text format is meant to represent our internal IR, which has been diverging from wasm. We may want to just stop claiming to support wat, and instead have our own byn text format as has been discussed before.

@dcodeIO
Copy link
Contributor

dcodeIO commented May 11, 2018

For reference, the following is the saved binary from

var mod = new binaryen.Module();
var funcType = mod.addFunctionType("v", binaryen.none, []);
var func = mod.addFunction("0", funcType, [],
  mod.drop(
    mod.block("label$1", [
      mod.loop("label$2",
        mod.unreachable()
      )
    ], binaryen.i32)
  )
);
mod.addExport("0", "0");

unreachable-loop

When reading it back, it becomes the original (apparently invalid) text format:

(module
 (type $0 (func))
 (export "0" (func $0))
 (func $0 (; 0 ;) (type $0)
  (drop
   (block $label$1 (result i32)
    (loop $label$2
     (unreachable)
    )
   )
  )
 )
)

So...

Not sure why this is just a bug in the text format - the binary for it is fine.

is this just the text printer missing something the binary writer adds implicitly? Asking because when looking at the comments above, it appears to be more complicated than that?

@kripken
Copy link
Member Author

kripken commented May 11, 2018

I believe the binary writer adds the extra unreachable after the loop. The reason is that Binaryen IR has an unreachable type for loops, but wasm binaries do not. To fix that, after it emits a loop that is unreachable it also emits an unreachable opcode, which ensures we are in an unreachable zone in the binary. And when reading wasm we do the opposite, in effect.

It's not clear how to fix this for the text format, though:

  • A "proper" fix would be to remove the unreachable type from the IR. See previous discussions on wasm diverging from Binaryen after wasm became a stack machine and removed unreachable loops, blocks and ifs. The core issue, iirc, is that unreachability makes optimization more easy - you can replace an unreachable loop with a br, and know it is valid because it has the same type, while if the loop has the empty type then it's only clear from the outside context whether the replacement would be ok, which is messy.
  • Rename Binaryen's current text format ".byt" ("binaryen text"), and add another input and output format, the .wat format. Then Binaryen would have three input/output formats, .wasm, .wat, .byt.

@dcodeIO
Copy link
Contributor

dcodeIO commented May 12, 2018

What if we'd just do

    if (curr->type != none && curr->body->type == unreachable) {
      o << maybeNewLine;
      !minify && doIndent(o, indent);
      printOpening(o, "unreachable");
      o << ')';
    }

right below this line, similar to what the binary writer does?

$> wasm-opt the-file.wat --print
(module
 (type $0 (func))
 (func $0 (; 0 ;) (type $0)
  (drop
   (block $label$1 (result i32)
    (loop $label$2
     (unreachable)
    )
    (unreachable)
   )
  )
 )
)

@kripken
Copy link
Member Author

kripken commented May 12, 2018

We could do that, but we'd also need to make changes in the text input code to handle that, and the bigger issue is that by making such changes the text format would no longer represent Binaryen IR as it currently does. And having a good way to print the internal IR is necessary for debugging and development. So we'd need to support two text formats, .byt and .wat as mentioned before.

I'm not saying that's a bad idea, but it would take some work to do, and I'm not sure it's worth it. How important is it that binaryen can emit the wasm text format correctly?

@dcodeIO
Copy link
Contributor

dcodeIO commented May 12, 2018

How important is it that binaryen can emit the wasm text format correctly?

For AssemblyScript it's quite convenient to have a text format printer provided by Binaryen (while there isn't really a need to read text format), so it can be its sole dependency. I've tried to avoid adding WABT for example because it adds another megabyte or so of JS just for printing text format.

@kripken
Copy link
Member Author

kripken commented May 12, 2018

I see, good point @dcodeIO

So, if emitting wasm text is important but reading it is not, then that's not too hard to do, we can make a variant of the --print pass, maybe --print-wat, and internally Print would have a parameter whether to conform to proper wat syntax.

Before going down that route, is there anyone else that does care about reading proper wat as well?

@binji
Copy link
Member

binji commented May 12, 2018

I've tried to avoid adding WABT for example because it adds another megabyte or so of JS just for printing text format.

Currently the JS build of wabt includes a lot of extra cruft -- it wouldn't be too hard to separate some of it out to slim down the library.

@MaxGraey
Copy link
Contributor

MaxGraey commented May 12, 2018

@binji it will be great if wasm-objdump provide an opportunity to optionally emit machine code representation (x86-64 for example) of wasm file. This allowed by capstone and one of standalone VM for example and Web Aseembly Studio used this. WDYT?

@binji
Copy link
Member

binji commented May 12, 2018

@MaxGraey that's a bit outside the scope of wasm-objdump, in my opinion. What you'd probably want is something more like compiler explorer where you can provide wasm input and choose the VM to produce assembly output.

@MaxGraey
Copy link
Contributor

MaxGraey commented May 12, 2018

Yah, may be better to create another separate tool for this routine. The main idea to determine which is codegen quality better for specific wasm operations or optimizations and how this changes during evolution some VM to get informative feedback and estimation about some binaryen performance optimizations.

PS Sorry for offtop

@kripken
Copy link
Member Author

kripken commented May 14, 2018

@binji "it wouldn't be too hard to separate some of it out to slim down the library."

Yeah, we could make a wabt build with just the binary loading and text printing code included, perhaps? Might be as easy as only specifying those functions in the list of exported functions.

@binji
Copy link
Member

binji commented May 14, 2018

@kripken Hm, not so sure now that I take a quick look. It seems at least 350k or so is the re2c generated lexing function! Haven't really dug into the rest, but it looks like the callback-based parsing might be bloating the resulting library too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants