Skip to content

parsing '*' fails due to a confusion between postfix and infix operator #50

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
neumannt opened this issue Oct 7, 2022 · 5 comments
Closed
Assignees
Labels
bug Something isn't working

Comments

@neumannt
Copy link

neumannt commented Oct 7, 2022

The parser eagerly turns * into postfix operators, which is not always correct. Consider the example below:

foo : (a:int) -> auto = { return a*2; }

It fails to compile in cppfront with an error message (missing ';'). The error goes away when adding a space in front of the * because the parser then does not recognize it as postfix operator. But insisting on a whitespace in front on a binary operator seems questionable, it is a significant deviation from regular C++.

In general the postfix * and & operators are problematic, as they create an ambiguity with binary * and &. Resolving that requires either an LR(2) parser or a lot of lexer magic to recognize these as postfix operators. Having them as prefix operators would be easier.

There are also other ambiguities in the grammar, some of them probably fixable (e.g., template-argument -> expression | id-expression, but if the expression is an identifier the parser cannot distinguish these cases), some of them probably unfixable without major changes to the language (e.g., template syntax, there the parser cannot parse a<b,c>d correctly without knowing if a is a template or not). I am not sure if it makes sense to open bugs for all of them. If you want I can report them, of course, but I do not want to spam the issue tracker with problems if you only consider the syntax experimental anyway. (On the other hand it is probably useful to know if the suggested syntax works or not).

@neumannt neumannt added the bug Something isn't working label Oct 7, 2022
@gregmarr
Copy link
Contributor

gregmarr commented Oct 7, 2022

Looks like return a*b; is handled because the parser when handling binary operators vs unary operators checks for */&/~ followed by an identifier, and return a*(2); is handled because it checks for it followed by left paren. It probably just needs to check for literals there too.
https://github.com/hsutter/cppfront/blob/main/source/parse.h#L1394

@hsutter hsutter self-assigned this Oct 8, 2022
@hsutter
Copy link
Owner

hsutter commented Oct 8, 2022

As @gregmarr says, done -- see latest commit (sorry, forgot to reference this issue). Thanks!

Yes, this is intentional and I'll write up a design note about it... in short, there is a tension between two things:

  • most unary operators including * and & naturally want to be postfix (*), and
  • we want to support all Cpp1 operators in order to ensure great compatibility with all existing C++ libraries including those that overload all the possible operators (e.g., expression templates) but that includes binary * and &

So if we want to have both, we have a few choices, including:

  • Change the syntax of pointers/dereference to use something that is not a binary operator. For example, a lot of people would suggest one of the very few unused symbols, notably @. But note that we only ever get to take each one of those once, ever, in all the history of time, so we should have a truly break-the-glass no-other-option reason to take it. This doesn't rise to that bar.
  • Change the syntax of pointers/dereference to use another operator that has a binary form we can ban. Here the notable and obvious example is ^, and in Cpp2 for a while I actually used ^ for pointers/deference and banned the bitwise binary operators and required programmers to write bitxor et al. instead, which is arguably more readable (and the slightly longer spelling highlights an important operation, and they're already alternative tokens, and there are already std:: versions, and other rationale/rationalizations). But the moment I tried converting spdlog to Cpp2 syntax by hand a year ago, I found myself having to convert a lot of those, and after the first dozen I realized that banning the bitwise binary operators was a nontrivial impediment to the more important goal of making converting code to Cpp2 as friction-free as possible.
  • Disambiguate unary and binary * and &. This is what I'm currently trying... the current rule is that the suffix unary operators must have no whitespace (which I think is natural, and nobody seems to have noticed this yet and reported it so it seems nobody has tried to write iter ++ which seems to be early validation this choice is acceptable) and * and & that appear as a suffix are the unary operators unless they are the last operator and are followed by ( or an identifier (or, now, a literal, thanks for pointing this out) which wouldn't be legal anyway for a unary operator. This is an experiment but so far it's working well for my uses.

(*) for now please grant this for the sake of discussion... briefly, this follows from declaration syntax mirroring use syntax and as I promised in the talk, I'll write it up soon

@neumannt neumannt closed this as completed Oct 8, 2022
hsutter added a commit that referenced this issue Oct 9, 2022
Examples like `f<a+a>` should be parsed as an _expression_ first (if it can be an expression, it is), and then if that fails parsed as an _id-expression_
The original comment was correct about the production order, so put that back to how it was
Added a regression test case
@neumannt
Copy link
Author

neumannt commented Oct 9, 2022

The commit 1943783 changes the parse order, assuming that if a template argument can be parsed as an expression it is an expression. But note that this does not work in all cases. This example here

foo(a<b,c>(3))

cannot be parsed without type information on a. It could either be call to foo with two call arguments (and two expressions), or a call with just a single argument, if a is a suitable template.

Having the parser depend on type information is a nightmare, of course. It requires a complex parser, and it prevents order-independent parsing (i.e., it requires forward declarations to work properly). IMHO we should change the template syntax to prevent that anomaly. If we would use, e.g., ![...] instead of <...> for templates the ambiguity would go away.

@hsutter
Copy link
Owner

hsutter commented Oct 9, 2022

Hi! You closed the PR but I hadn't replied to all your good questions yet. Here are answers to the other two questions.

Q: Parse ambiguity of expression | id-expression

A: In Cpp2, first match wins

(e.g., template-argument -> expression | id-expression, but if the expression is an identifier the parser cannot distinguish these cases),

Yes, my intent there was that the productions are tried in order. I do this in a few cases (statement is another). The intent is that by deterministically taking the first match, we can eliminate any ambiguity when input could match more than one production. In this case, "if it can be an expression, it is."

Note: This disambiguation is completely unlike today's vexing parse "if it can be a declaration, it is" disambiguation, which is truly awful because it requires directing the parse production selection based on whether a name is a type or not... which means today it requires name lookup (semantic analysis!) to decide the parse, which is totally backward from the proper "lex -> parse -> sema" order with all steps independent. Cpp2 has nothing like that problem, I never require sema to guide parsing.

As a result of this question I noticed the code wasn't testing these productions in the above order (oops!). Now it does... thanks for the bug case! Here's the corrected code from parse.h:

if (auto e = expression(false)) {   // disallow unparenthesized relational comparisons in template args
    term.arg = std::move(e);
}
else if (auto i = id_expression()) {
    term.arg = std::move(i);
}

Note that (I think?) this snippet's comment also answers your other related question...

Question: Parse ambiguity of a<b,c>d

A: In Cpp2, there are no comma-expressions, and relational expressions in a template-argument must be parenthesized

(e.g., template syntax, there the parser cannot parse a<b,c>d correctly without knowing if a is a template or not)

I'll explain why I think this works, but first a disclaimer: I am not a parser expert, I could well be overlooking problem examples, and C++ is known to be gnarly in this area... I believe I fixed these problems, but if you spot an example that doesn't work I wouldn't be too surprised, please let me know by opening a bug issue with a repro! Thanks.

First, in today's C++ these are a delightful parse (sarcasm) because of the comma operator and the overloading of the < and > tokens for template parameter/argument lists and relational comparisons.

In Cpp2 my intent is for these to be an easy parse (famous last words!) by addressing this in two chunks:

  • Cpp2 has no comma operator, so that removes the possibility of b,c being a comma-expression.
  • Cpp2 requires a relational comparison expression in a template-argument-list to be parenthesized. (*)

The reason I deliberately disallow unparenthesized relational comparisons in template arguments is as a simple way to avoid the many subtle issues that have complicated today's C++ parsers in this area.(**) For example, things like

a < b > c , d >

which is a slight variation of your example. In Cpp2 that won't compile as is, but is okay with the required parens:

a < ( b > c ) , d >   // ok

For your actual example,

a < b , c > d

my intent in Cpp2 is that a<b,c> can't be an expression, so it should parse as a template-id, and then following it immediately with an identifier d isn't grammatically valid.


(*)

I also considered other options.

For a while I tried making template parameter/argument lists use [ and ] instead of < and >. This had some theoretical benefits:

  • It avoided the overloaded uses of < > for relational comparison.
  • [ ] are balanced in the language (outside comments and literals) whereas < > are not because of the tokens' additional use as relational operators, and using balanced tokens enables lazy parsing to skip sections more easily (as D does, and as cppfront itself does to rely on balanced { } to skip Cpp1 code sections without having to fully parse them).

However, eventually I decided < > was better for two reasons:

  • It makes Cpp2 much easier for cppfront and for programmers, by letting both easily distinguish the meaning of a<b> vs a[b] locally without having to go off and do name lookup (in the compiler or by manual scrolling around) to figure out whether a is the name of a template or of an object that overloaded operator[].
  • It reduced the delta with today's Cpp1 syntax, which I want to minimize unless there's a compelling reason to be different (and here there isn't).

Also, I know that some languages put a prefix on template/generic argument lists (e.g., !); by itself this doesn't solve the problems here if we still use < and >, and (admittedly a matter of my personal taste) I considered it a bit ugly, so I didn't want to use it except as a last resort if nothing else worked. I think the current solution is better for Cpp2... other solutions may well be better for other languages, so this is not at all a diss of other languages' choices!

(**)

Even though in today's C++ grammar many of challenging cases aren't actually ambiguous, they still surely are:

  • visually ambiguous to programmers, so I want to improve them under the goal of improving "simplicity"; and
  • a known pain in the posterior to C++ parser writers, so I want to improve them under the goal of improving "toolability".

This comes up even if you're not writing a full C++ parser, just an approximate "tag" parser that tries to be much faster than an accurate C++ parser. In the past decade I've seen several efforts where people write new C++ parsers of both kinds, accurate and approximate, and each parser author I've talked to raised this class of example. The problem examples are just common enough in real code that even an approximate parser has to do something sensible with them.

Note these complications don't arise in C#, even though C# also uses < and > for generics. That's because C# doesn't have non-type template arguments (or the capability to do type computations that would involve relational comparison) and so a relational expression naturally cannot occur in a generic argument in C#. C# also doesn't have comma-expressions. So there seems to be some prior art experience that simply banning (unparenthesized) relational comparisons inside template argument lists, and banning comma-expressions everywhere, would similarly sidestep the issue in Cpp2.

@hsutter
Copy link
Owner

hsutter commented Oct 9, 2022

I just noticed your reply while I was writing my longer one:

foo(a<b,c>(3))

cannot be parsed without type information on a.

No, Cpp2 never does name lookup to guide parsing, I agree with you that that's a bad thing (see above).

In Cpp2, that code greedily parses a<b,c> as a template-id and calls foo with the single argument a<b,c>(3). Perhaps I should emit extra parens in that case in the Cpp1 code around the whole argument to foo to make sure the second compiler treats it the same way... I'll look into that.

In Cpp2, to call foo with two arguments, use parens: foo( (a<b) , c>(3) ).

If we would use, e.g., ![...] instead of <...> for templates the ambiguity would go away.

Yes, but see (*) above... I'm open to that as a last resort if nothing else works, but I think(?) what's there now is a good resolution (see the final "Note these complications don't arise in C#, even though C# also uses < and >" paragraph). That said, if there are examples that it doesn't cover well, I'm open to something like the ! wart.

Azmah-Bad pushed a commit to Azmah-Bad/cppfront that referenced this issue Feb 24, 2023
Examples like `f<a+a>` should be parsed as an _expression_ first (if it can be an expression, it is), and then if that fails parsed as an _id-expression_
The original comment was correct about the production order, so put that back to how it was
Added a regression test case
Azmah-Bad pushed a commit to Azmah-Bad/cppfront that referenced this issue Feb 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants