Support embedded newline characters in names? #142

Closed
sunfishcode opened this issue Oct 15, 2015 · 7 comments

Comments

@sunfishcode
Member

In #141 I created a test which attempted to test all the ASCII control characters in exported symbol names. All of them worked except 0x0a, the ASCII newline character. The spec interpreter gave this error when I tried it:

test/names.wast:50.11-50.14: unclosed text literal

What is the intended behavior here? I don't presently have an opinion; I could see arguments for restricting the character set in some ways, but I could also see arguments that it should be entirely unrestricted.

@rossberg
Member

There's no particular reason for the current behaviour, other than the same regexp character in the lexer being used to define comment syntax. :)
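A toy sketch in Python of how sharing a "no newline" character class between comments and string literals produces this symptom. This is not the spec interpreter's lexer (which is not written in Python); the names `NON_NEWLINE`, `TEXT_LITERAL`, and `lex_text_literal` are hypothetical and the grammar is heavily simplified.

```python
import re

# Hypothetical character class: "any character except newline", the kind of
# class one would naturally write for line comments.
NON_NEWLINE = r'[^\n]'

# Reusing that class for string literals (additionally excluding the closing
# quote and the backslash) means a raw newline can never appear in a literal.
TEXT_LITERAL = re.compile(r'"(?:(?!["\\])' + NON_NEWLINE + r'|\\.)*"')


def lex_text_literal(source: str) -> str:
    """Return the string literal at the start of `source`, or raise."""
    match = TEXT_LITERAL.match(source)
    if match is None:
        raise SyntaxError("unclosed text literal")
    return match.group(0)


# Try every ASCII control character as the sole content of a literal.
rejected = []
for code in list(range(0x20)) + [0x7F]:
    try:
        lex_text_literal('"' + chr(code) + '"')
    except SyntaxError:
        rejected.append(code)
```

Under these assumptions, `rejected` ends up containing only `0x0a`, matching the behaviour reported above.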

I don't have an overly strong opinion either, but for hygiene and for the sake of following standard practice, I'd lean towards disallowing any ASCII control characters in literals (this still allows UTF-8).
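A minimal sketch of what such a restriction could look like at the byte level; the function name `has_ascii_control` is hypothetical. It relies on the fact that UTF-8 encodes every non-ASCII code point using only bytes at or above 0x80, so a check on the ASCII control range never rejects multi-byte sequences.

```python
def has_ascii_control(name: bytes) -> bool:
    """Return True if `name` contains a byte in 0x00-0x1F or the byte 0x7F."""
    return any(b < 0x20 or b == 0x7F for b in name)


# Non-ASCII text encoded as UTF-8 passes; an embedded newline does not.
utf8_ok = not has_ascii_control("naïve".encode("utf-8"))
newline_rejected = has_ascii_control(b"a\nb")
```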

@jfbastien
Member

Unicode has more control characters than ASCII :-)
Fun times can be had if we allow bidi in names!
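As a rough illustration, Python's standard `unicodedata` module can list code points whose general category is `Cc` (control) or `Cf` (format, which includes the bidi overrides). The helper name `control_like` is hypothetical, and picking exactly these two categories is an assumption made for the sketch, not a proposed rule.

```python
import unicodedata


def control_like(name: str) -> list:
    """List code points in `name` whose category is Cc (control) or Cf (format)."""
    return [
        f"U+{ord(ch):04X}"
        for ch in name
        if unicodedata.category(ch) in ("Cc", "Cf")
    ]


# U+0085 (NEXT LINE) is a control character outside ASCII.
c1_hit = control_like("a\u0085b")
# U+202E (RIGHT-TO-LEFT OVERRIDE) is a bidi formatting character.
bidi_hit = control_like("x\u202ey")
```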

IIRC @zygoloid was telling me that Unicode has a very well defined set of character literals that all languages should be using. As the editor of C++ he was appalled that C++ chose to ignore this and go its own special and nonsensical route.

@sunfishcode
Member Author

I guess you refer to this?

One issue is that WebAssembly does actually want a wider set than the set any high-level language will be using for identifiers, because WebAssembly aims to support compilers that need to be able to mangle names into something unrepresentable in source languages.

The document above also says: "Generally if the programming language has case-sensitive identifiers, then Normalization Form C is appropriate". However, Unicode Normalization Form C is non-trivial. It would be unfortunate if every WebAssembly tool has to know how to validate and normalize identifiers just to correctly do symbol lookups.
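A small sketch of why normalization matters for symbol lookup, using Python's standard `unicodedata.normalize`. The helper `nfc_equal` is hypothetical: two spellings of the same visible identifier differ as byte strings, and only compare equal once a tool applies NFC.

```python
import unicodedata

composed = "\u00e9"      # "é" as a single code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT


def nfc_equal(a: str, b: str) -> bool:
    """Compare two identifiers after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


raw_bytes_equal = composed.encode("utf-8") == decomposed.encode("utf-8")
normalized_equal = nfc_equal(composed, decomposed)
```

Here `raw_bytes_equal` is false while `normalized_equal` is true, so a tool doing plain byte comparison and a tool doing NFC comparison would disagree on whether these are the same symbol.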

Another concern is homograph confusion. Since WebAssembly identifiers aren't user-facing anyway, would it make sense to restrict the character set and have frontends mangle as needed? They'll probably always have to do some mangling in any case.
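To make the homograph concern concrete, here is a short sketch comparing Latin "a" with Cyrillic "а". The variable names are illustrative only. Normalization does not help here, because the two letters are distinct code points with no decomposition relating them.

```python
import unicodedata

latin_a = "\u0061"      # LATIN SMALL LETTER A
cyrillic_a = "\u0430"   # CYRILLIC SMALL LETTER A, visually near-identical

names = (unicodedata.name(latin_a), unicodedata.name(cyrillic_a))

# Even compatibility normalization (NFKC) keeps the two letters distinct.
still_distinct = (
    unicodedata.normalize("NFKC", latin_a)
    != unicodedata.normalize("NFKC", cyrillic_a)
)
```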

Another is whether ES modules impose any constraints on this domain.

Thoughts?

@jcbeyler

Since we have to do mangling anyway, I don't think it matters really what we allow since the front-end can just mangle it entirely and put whatever it likes as the internal representation. That is what I've already done in my parser to remove certain characters that LLVM did not like for example.
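A minimal sketch of front-end mangling into a restricted character set. The scheme is hypothetical (it is not the one used in the parser mentioned above, nor any standard ABI): ASCII letters and digits pass through, and every other byte, including `_` itself, becomes `_` followed by two hex digits, which keeps the mapping reversible.

```python
def mangle(name: bytes) -> str:
    """Encode arbitrary bytes using only ASCII letters, digits, and '_'."""
    out = []
    for b in name:
        if b < 0x80 and chr(b).isalnum():
            out.append(chr(b))
        else:
            out.append(f"_{b:02x}")  # escape everything else, '_' included
    return "".join(out)


def demangle(mangled: str) -> bytes:
    """Invert `mangle`; assumes the input was produced by `mangle`."""
    out = bytearray()
    i = 0
    while i < len(mangled):
        if mangled[i] == "_":
            out.append(int(mangled[i + 1:i + 3], 16))
            i += 3
        else:
            out.append(ord(mangled[i]))
            i += 1
    return bytes(out)


example = mangle(b"foo::bar\n")
```

With this scheme the restriction on the underlying name format matters little to the front-end, at the cost of less readable symbol names.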

@rossberg
Member

With #143 merged, are people okay with closing this issue?

@jcbeyler

I would vote yes

@sunfishcode
Member Author

Having filed this, I think we can close this. WebAssembly doesn't want to require engines to be in the business of interpreting character sets, so the simplest thing is for it to just support arbitrary uninterpreted byte strings. The mappings to JS and other languages can define the correspondence to Unicode as appropriate.
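A sketch of what "arbitrary uninterpreted byte strings" implies for symbol lookup, assuming a hypothetical export table keyed by raw bytes. Lookup is exact byte equality with no normalization or validation; `to_host_string` stands in for whatever mapping a host language chooses and is an assumption, not a defined behaviour.

```python
# Hypothetical export table: name bytes -> function index.
exports = {
    b"caf\xc3\xa9": 0,    # "café" with a precomposed é, UTF-8 encoded
    b"cafe\xcc\x81": 1,   # "café" with a combining accent: a different name
    b"line\nbreak": 2,    # an embedded newline is just another byte
    b"\xff\xfe": 3,       # not valid UTF-8, still a usable key
}


def lookup(name: bytes) -> int:
    """Find an export by exact byte equality; no character-set logic."""
    return exports[name]


def to_host_string(name: bytes) -> str:
    """Hypothetical host-side mapping: decode as UTF-8, replacing bad bytes."""
    return name.decode("utf-8", errors="replace")
```

The engine side stays trivial, and any Unicode policy lives entirely in functions like `to_host_string` at the language boundary.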
