Some red flags with the current grammar for reserved-statement #635

mihnita · 2024-02-12T21:49:11Z

Reserved statements:

reserved-statement = reserved-keyword [s reserved-body] 1*([s] expression)
reserved-keyword   = "." name
reserved-body      = *([s] 1*(reserved-char / reserved-escape / quoted))
reserved-char     = content-char / "."

If I read this correctly, then these strings are valid:

// reserved-keyword   reserved-body   expression
   .foo
   .foo               .               {$baz}
   .foo               ....            {$baz}
   .foo               .bar            {$baz}

Do we really want to allow '.' in the reserved body?

Escaped text:

text-escape     = backslash ( backslash / "{" / "}" )
quoted-escape   = backslash ( backslash / "|" )
reserved-escape = backslash ( backslash / "{" / "|" / "}" )

To me this looks like a red flag.
Not only for the implementers of a parser, but also for the users of the final syntax.

In most cases escaped strings can be handled pretty low level (tokenizer).

But in this case knowing that we can now accept reserved-escape we need a lot of context.
And that allows us to support strings like this:

.foo .\{\|\} {$baz}

So I find the reserved-statement to be too lax.

Or I am reading the grammar wrong?

The text was updated successfully, but these errors were encountered:

aphillips · 2024-02-14T00:13:47Z

I think you're reading the grammar fine, but drawing an perhaps-incorrect conclusion.

So I find the reserved-statement to be too lax.

Too lax for what purpose? We deliberately made the grammar of reserved-body permissive, knowing that almost certainly rational people will not use it for weird things such as your example.

The { must be quoted so that we can recognize expression start. The | is quoted to allow quoted to work in reserved-body.

Basically this says "consume characters until you see an (unescaped) {, which starts the required expression sequence. If anything, reserved-body is too strict, in that it requires | to be paired (for quoted).

How would you change it? We need to freeze the syntax.

aphillips · 2024-02-18T18:23:57Z

@mihnita I'm sorry I missed bringing this up in the F2F call. Can you clarify the request or, if appropriate, close this issues if it has been addressed? Thanks.

aphillips · 2024-02-28T15:26:29Z

Per discussion in 2024-02-26 call, closing

mihnita · 2024-03-04T17:09:12Z

This is how our escaping of { | } looks like in different contexts:

Normal | \{ \}, but {  |quoted \| { }| &reserved \| \{ \} but |quoted inside! \| { } | out, so \| \{ \}} back to normal.
11111111111111111111 22 3333333333333 2222222222 444444444444  5555555555555555555555  4444444444444444 1111111111111111

Legend:
1: normal text (text-char = content-char / s / "." / "@" / "|")
2: inside a placeholder
3: quoted literal (quoted-char = content-char / s / "." / "@" / "{" / "}")
4: reserved (reserved-char = content-char / ".")
5: quoted literal inside a reserved

When writing a parser it is common to write a tokenizer that groups characters into tokens, and then the parser works at the higher level, on tokens.

But to write a tokenizer it is best to not need much context, just consume characters.

On occasions the tokenizer might enter a different mode, for example in strings ("...\"..."), block comment (/*...*/) or line comment (//...).

But these are one single level mode changes.
When I enter string mode, we stay there (with \") all the way to the last ".
We don't look for // or /* in a string.
Same, when we are in block comment mode we don't care about \" or //. And same for line comments.

For our grammar the tokenizer must know when we are inside a reserved section, but to get there it needs a { (to ender placeholder mode), then a &foo to enter reserved mode.
So it has to look for relatively high level information (concepts like "placeholder"), that belong at the parser level.

Worse, inside a reserve sequence we are still looking (and recognizing) a literal (block 5 above).
We need to know we are in a literal, in a reserved, in a placeholder.

This is unnecessary complex.
Most programming languages don't need such a tokenizer.

Our grammar forces the tokenizer to:

go deep
require context from parser

"I wrote a parser for it" is not an argument.
People have written parsers for bad grammars before.
Being able to write a parser does not mean the grammar is good.

And think what this does for our users.
They must understand all of these rules, when to escape what.

mihnita · 2024-03-04T17:11:59Z

Per discussion in 2024-02-26 call, closing

I think this was a misunderstanding.
I though that we drop it as in "not a blocker for LDML 45"
Not close it completely.

mihnita · 2024-03-04T17:21:01Z

Kind of related, I think that there is no way to use @ inside a reserved body:

reserved-char   = content-char / "."
reserved-escape = backslash ( backslash / "{" / "|" / "}" )

bhaible · 2024-03-20T15:36:31Z

When writing a parser

You mean writing a left-to-right (from the beginning to the end) parser, right?

I will concede that for our grammar, writing a parser that starts somewhere in the middle and needs to understand the context by looking left, is not so immediate as writing a left-to-right parser. But that is probably true for many (most) language grammars.

it is common to write a tokenizer that groups characters into tokens, and then the parser works at the higher level, on tokens.

There are two ways to write a parser:

It can take lexer/tokenizers at different levels. For instance, the parser may invoke a 1-character-at-a-time function in some places (such as inside a quoted) and a 1-token-at-a-time function in other places (such as in normal text). In an ISO C parser, at least 4 levels are useful (due to backslash-newline, trigraphs, \u escapes, and such). Or
It can work with a single lexer/tokenizer function across the entire grammar, but this function needs some context. In our grammar, it is possible to use a lexer/tokenizer function (as the basis of an LALR(1) parser) that
- produces individual characters and tokens for "{{", "input", "local", "match",
- receives 1 bit of context, that enables tokenizing everywhere except inside a text-escape or a reserved-escape.

But to write a tokenizer it is best to not need much context, just consume characters.

Yes. As said above, this can be achieved with 1 bit of context.

On occasions the tokenizer might enter a different mode, for example in strings ("..."..."), block comment (/.../) or line comment (//...).

But these are one single level mode changes.
When I enter string mode, we stay there (with ") all the way to the last ".
We don't look for // or /* in a string.
Same, when we are in block comment mode we don't care about " or //. And same for line comments.

For our grammar the tokenizer must know when we are inside a reserved section, but to get there it needs a { (to ender placeholder mode), then a &foo to enter reserved mode.
So it has to look for relatively high level information (concepts like "placeholder"), that belong at the parser level.

I disagree. As stated above, all the tokenizer needs to know is whether the parser currently is inside a text-escape or a reserved-escape. It does not need to know about placeholder, it does not need to know about quoted, etc.

Worse, inside a reserve sequence we are still looking (and recognizing) a literal (block 5 above).
We need to know we are in a literal, in a reserved, in a placeholder.

There is no problem with literals in the grammar. Maybe you designed a tokenizer that attempts to returns literals as a single token? If so, that is a problem with the design of that tokenizer, but not a problem with the grammar.

I mean, you would have the same problem in an ISO C parser, if you wrote a tokenizer that returns -5 as a single token. Then you would have problems with the unary-minus operator: return - 5; and return -5; would parse differently. The solution here would be to modify the tokenizer so that it returns - and 5 separately. In summary: To fit a certain grammar, you need to adapt the tokenizer appropriately.

bhaible · 2024-03-20T15:42:36Z

I think that there is no way to use @ inside a reserved body:

I think this is intentional. Because otherwise there would be a parsing ambiguity inside an annotation-expression:

{ %foo @x }

could be parsed as

'foo @x' as reserved-body
no attributes

or as

'foo' as reserved-body
'@x' as attribute

aphillips · 2024-03-20T15:56:16Z

I think that there is no way to use @ inside a reserved body:
I think this is intentional.

This is 100% intentional. A reserved annotation is part of annotation and needs to behave like any other annotation. This includes not subsuming other constructs, such as attribute.

aphillips · 2024-09-10T23:37:58Z

Since we removed reserved-statement this might now be stale

aphillips · 2024-10-07T14:46:25Z

See above comment.

aphillips added syntax Issues related with syntax or ABNF Agenda+ Requested for upcoming teleconference LDML45 LDML45 Release (Tech Preview) labels Feb 14, 2024

aphillips closed this as completed Feb 28, 2024

mihnita reopened this Mar 4, 2024

aphillips added Preview-Feedback Feedback gathered during the technical preview and removed Agenda+ Requested for upcoming teleconference LDML45 LDML45 Release (Tech Preview) labels Mar 4, 2024

eemeli mentioned this issue Mar 23, 2024

Collapse all escape sequence rules into one #743

Merged

aphillips added the resolve-candidate This issue appears to have been answered or resolved, and may be closed soon. label Sep 10, 2024

aphillips closed this as completed Oct 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Some red flags with the current grammar for reserved-statement #635

Some red flags with the current grammar for reserved-statement #635

mihnita commented Feb 12, 2024

aphillips commented Feb 14, 2024

Uh oh!

aphillips commented Feb 18, 2024

Uh oh!

aphillips commented Feb 28, 2024

Uh oh!

mihnita commented Mar 4, 2024 •

edited

Loading

Uh oh!

mihnita commented Mar 4, 2024

Uh oh!

mihnita commented Mar 4, 2024 •

edited

Loading

Uh oh!

bhaible commented Mar 20, 2024

Uh oh!

bhaible commented Mar 20, 2024

Uh oh!

aphillips commented Mar 20, 2024

Uh oh!

aphillips commented Sep 10, 2024

Uh oh!

aphillips commented Oct 7, 2024

Uh oh!

Uh oh!

Some red flags with the current grammar for reserved-statement #635

Some red flags with the current grammar for reserved-statement #635

Comments

mihnita commented Feb 12, 2024

aphillips commented Feb 14, 2024

Uh oh!

aphillips commented Feb 18, 2024

Uh oh!

aphillips commented Feb 28, 2024

Uh oh!

mihnita commented Mar 4, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mihnita commented Mar 4, 2024

Uh oh!

mihnita commented Mar 4, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bhaible commented Mar 20, 2024

Uh oh!

bhaible commented Mar 20, 2024

Uh oh!

aphillips commented Mar 20, 2024

Uh oh!

aphillips commented Sep 10, 2024

Uh oh!

aphillips commented Oct 7, 2024

Uh oh!

mihnita commented Mar 4, 2024 •

edited

Loading

mihnita commented Mar 4, 2024 •

edited

Loading