Skip to content

Some red flags with the current grammar for reserved-statement #635

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
mihnita opened this issue Feb 12, 2024 · 11 comments
Closed

Some red flags with the current grammar for reserved-statement #635

mihnita opened this issue Feb 12, 2024 · 11 comments
Labels
Preview-Feedback Feedback gathered during the technical preview resolve-candidate This issue appears to have been answered or resolved, and may be closed soon. syntax Issues related with syntax or ABNF

Comments

@mihnita
Copy link
Collaborator

mihnita commented Feb 12, 2024

Reserved statements:

reserved-statement = reserved-keyword [s reserved-body] 1*([s] expression)
reserved-keyword   = "." name
reserved-body      = *([s] 1*(reserved-char / reserved-escape / quoted))
reserved-char     = content-char / "."

If I read this correctly, then these strings are valid:

// reserved-keyword   reserved-body   expression
   .foo
   .foo               .               {$baz}
   .foo               ....            {$baz}
   .foo               .bar            {$baz}

Do we really want to allow '.' in the reserved body?


Escaped text:

text-escape     = backslash ( backslash / "{" / "}" )
quoted-escape   = backslash ( backslash / "|" )
reserved-escape = backslash ( backslash / "{" / "|" / "}" )

To me this looks like a red flag.
Not only for the implementers of a parser, but also for the users of the final syntax.

In most cases escaped strings can be handled pretty low level (tokenizer).

But in this case knowing that we can now accept reserved-escape we need a lot of context.
And that allows us to support strings like this:

.foo .\{\|\} {$baz}

So I find the reserved-statement to be too lax.

Or I am reading the grammar wrong?

@aphillips aphillips added syntax Issues related with syntax or ABNF Agenda+ Requested for upcoming teleconference LDML45 LDML45 Release (Tech Preview) labels Feb 14, 2024
@aphillips
Copy link
Member

I think you're reading the grammar fine, but drawing an perhaps-incorrect conclusion.

So I find the reserved-statement to be too lax.

Too lax for what purpose? We deliberately made the grammar of reserved-body permissive, knowing that almost certainly rational people will not use it for weird things such as your example.

The { must be quoted so that we can recognize expression start. The | is quoted to allow quoted to work in reserved-body.

Basically this says "consume characters until you see an (unescaped) {, which starts the required expression sequence. If anything, reserved-body is too strict, in that it requires | to be paired (for quoted).

How would you change it? We need to freeze the syntax.

@aphillips
Copy link
Member

@mihnita I'm sorry I missed bringing this up in the F2F call. Can you clarify the request or, if appropriate, close this issues if it has been addressed? Thanks.

@aphillips
Copy link
Member

Per discussion in 2024-02-26 call, closing

@mihnita
Copy link
Collaborator Author

mihnita commented Mar 4, 2024

This is how our escaping of { | } looks like in different contexts:

Normal | \{ \}, but {  |quoted \| { }| &reserved \| \{ \} but |quoted inside! \| { } | out, so \| \{ \}} back to normal.
11111111111111111111 22 3333333333333 2222222222 444444444444  5555555555555555555555  4444444444444444 1111111111111111

Legend:
1: normal text (text-char = content-char / s / "." / "@" / "|")
2: inside a placeholder
3: quoted literal (quoted-char = content-char / s / "." / "@" / "{" / "}")
4: reserved (reserved-char = content-char / ".")
5: quoted literal inside a reserved


When writing a parser it is common to write a tokenizer that groups characters into tokens, and then the parser works at the higher level, on tokens.

But to write a tokenizer it is best to not need much context, just consume characters.

On occasions the tokenizer might enter a different mode, for example in strings ("...\"..."), block comment (/*...*/) or line comment (//...).

But these are one single level mode changes.
When I enter string mode, we stay there (with \") all the way to the last ".
We don't look for // or /* in a string.
Same, when we are in block comment mode we don't care about \" or //. And same for line comments.

For our grammar the tokenizer must know when we are inside a reserved section, but to get there it needs a { (to ender placeholder mode), then a &foo to enter reserved mode.
So it has to look for relatively high level information (concepts like "placeholder"), that belong at the parser level.

Worse, inside a reserve sequence we are still looking (and recognizing) a literal (block 5 above).
We need to know we are in a literal, in a reserved, in a placeholder.


This is unnecessary complex.
Most programming languages don't need such a tokenizer.

Our grammar forces the tokenizer to:

  1. go deep
  2. require context from parser

"I wrote a parser for it" is not an argument.
People have written parsers for bad grammars before.
Being able to write a parser does not mean the grammar is good.

And think what this does for our users.
They must understand all of these rules, when to escape what.

@mihnita mihnita reopened this Mar 4, 2024
@mihnita
Copy link
Collaborator Author

mihnita commented Mar 4, 2024

Per discussion in 2024-02-26 call, closing

I think this was a misunderstanding.
I though that we drop it as in "not a blocker for LDML 45"
Not close it completely.

@mihnita
Copy link
Collaborator Author

mihnita commented Mar 4, 2024

Kind of related, I think that there is no way to use @ inside a reserved body:

reserved-char   = content-char / "."
reserved-escape = backslash ( backslash / "{" / "|" / "}" )

@aphillips aphillips added Preview-Feedback Feedback gathered during the technical preview and removed Agenda+ Requested for upcoming teleconference LDML45 LDML45 Release (Tech Preview) labels Mar 4, 2024
@bhaible
Copy link
Contributor

bhaible commented Mar 20, 2024

When writing a parser

You mean writing a left-to-right (from the beginning to the end) parser, right?

I will concede that for our grammar, writing a parser that starts somewhere in the middle and needs to understand the context by looking left, is not so immediate as writing a left-to-right parser. But that is probably true for many (most) language grammars.

it is common to write a tokenizer that groups characters into tokens, and then the parser works at the higher level, on tokens.

There are two ways to write a parser:

  • It can take lexer/tokenizers at different levels. For instance, the parser may invoke a 1-character-at-a-time function in some places (such as inside a quoted) and a 1-token-at-a-time function in other places (such as in normal text). In an ISO C parser, at least 4 levels are useful (due to backslash-newline, trigraphs, \u escapes, and such). Or
  • It can work with a single lexer/tokenizer function across the entire grammar, but this function needs some context. In our grammar, it is possible to use a lexer/tokenizer function (as the basis of an LALR(1) parser) that
    • produces individual characters and tokens for "{{", "input", "local", "match",
    • receives 1 bit of context, that enables tokenizing everywhere except inside a text-escape or a reserved-escape.

But to write a tokenizer it is best to not need much context, just consume characters.

Yes. As said above, this can be achieved with 1 bit of context.

On occasions the tokenizer might enter a different mode, for example in strings ("..."..."), block comment (/.../) or line comment (//...).

But these are one single level mode changes.
When I enter string mode, we stay there (with ") all the way to the last ".
We don't look for // or /* in a string.
Same, when we are in block comment mode we don't care about " or //. And same for line comments.

For our grammar the tokenizer must know when we are inside a reserved section, but to get there it needs a { (to ender placeholder mode), then a &foo to enter reserved mode.
So it has to look for relatively high level information (concepts like "placeholder"), that belong at the parser level.

I disagree. As stated above, all the tokenizer needs to know is whether the parser currently is inside a text-escape or a reserved-escape. It does not need to know about placeholder, it does not need to know about quoted, etc.

Worse, inside a reserve sequence we are still looking (and recognizing) a literal (block 5 above).
We need to know we are in a literal, in a reserved, in a placeholder.

There is no problem with literals in the grammar. Maybe you designed a tokenizer that attempts to returns literals as a single token? If so, that is a problem with the design of that tokenizer, but not a problem with the grammar.

I mean, you would have the same problem in an ISO C parser, if you wrote a tokenizer that returns -5 as a single token. Then you would have problems with the unary-minus operator: return - 5; and return -5; would parse differently. The solution here would be to modify the tokenizer so that it returns - and 5 separately. In summary: To fit a certain grammar, you need to adapt the tokenizer appropriately.

@bhaible
Copy link
Contributor

bhaible commented Mar 20, 2024

I think that there is no way to use @ inside a reserved body:

I think this is intentional. Because otherwise there would be a parsing ambiguity inside an annotation-expression:

{ %foo @x }

could be parsed as

'foo @x' as reserved-body
no attributes

or as

'foo' as reserved-body
'@x' as attribute

@aphillips
Copy link
Member

I think that there is no way to use @ inside a reserved body:
I think this is intentional.

This is 100% intentional. A reserved annotation is part of annotation and needs to behave like any other annotation. This includes not subsuming other constructs, such as attribute.

@aphillips aphillips added the resolve-candidate This issue appears to have been answered or resolved, and may be closed soon. label Sep 10, 2024
@aphillips
Copy link
Member

Since we removed reserved-statement this might now be stale

@aphillips
Copy link
Member

See above comment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Preview-Feedback Feedback gathered during the technical preview resolve-candidate This issue appears to have been answered or resolved, and may be closed soon. syntax Issues related with syntax or ABNF
Projects
None yet
Development

No branches or pull requests

3 participants