Consider adding a token grammar

Programming languages typically have both a lexical grammar, which describes the structure of tokens, and a token grammar, which describes the syntax of the language and in which terminals are tokens.

[The MessageFormat grammar](https://github.com/unicode-org/message-format-wg/blob/main/spec/message.abnf) is a lexical grammar; there is no token grammar. Or, perhaps, a single grammar serves as both (with tokens as single characters), depending on how you look at it.

In the future, I think it would be worth refactoring the grammar into two layers: a lexical grammar that describes tokens (for most languages this can be done with regexps), and a token grammar built on top of it whose terminals are those tokens. This separates the description of the syntax from details such as where required and optional whitespace goes.
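To make the lexical layer concrete, here is a minimal sketch of a regexp-based tokenizer. It assumes a toy placeholder syntax such as `{ $name :fn }`; the token kinds (`LBRACE`, `VARIABLE`, `FUNCTION`, `WS`, `RBRACE`) and the `tokenize` helper are hypothetical and are not taken from `message.abnf`.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str
    text: str


# Hypothetical lexical grammar for a toy syntax; NOT the MessageFormat grammar.
# Each entry pairs a token kind with the regexp that recognizes it.
TOKEN_SPEC = [
    ("LBRACE", r"\{"),
    ("RBRACE", r"\}"),
    ("VARIABLE", r"\$[A-Za-z_][A-Za-z0-9_]*"),
    ("FUNCTION", r":[A-Za-z_][A-Za-z0-9_]*"),
    ("WS", r"[ \t\r\n]+"),
]

# Combine the rules into one alternation of named groups.
TOKEN_RE = re.compile("|".join(f"(?P<{kind}>{pattern})" for kind, pattern in TOKEN_SPEC))


def tokenize(source: str) -> Iterator[Token]:
    """Yield tokens left to right; whitespace is kept as an explicit WS token."""
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        yield Token(match.lastgroup, match.group())
        pos = match.end()
```

Whitespace is emitted as its own token here rather than discarded, so that a token grammar can still say where it is required and where it is optional.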

For example (not the simplest one, unfortunately), see the JavaScript [lexical grammar](https://tc39.es/ecma262/multipage/ecmascript-language-lexical-grammar.html#sec-ecmascript-language-lexical-grammar) and the [token grammar](https://tc39.es/ecma262/multipage/ecmascript-language-expressions.html#sec-ecmascript-language-expressions) for expressions (the entire token grammar is split across a few different chapters of the JS spec).

Some of the other implementations probably already use a separate lexer and parser, but I chose to write a combined lexer and parser for MessageFormat so that I could tell whether I was following the spec exactly (and because I already had to hand-write the parser, parser generators not being a good option in ICU4C). Without a token grammar as part of the spec, that is hard to do with a separate lexer and parser, because splitting them effectively introduces an _ad hoc_ token grammar.

The trouble with the approach I chose is that there are many apparent syntactic ambiguities involving whitespace, which would probably be much easier to handle in an implementation that tokenizes the input before parsing it. Having a separate token grammar would both make it easier for implementors to verify that their front-ends conform to the spec, and make it easier for everyone to understand the syntax.
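As an illustration of why pre-tokenized input is easier to handle, here is a small sketch of a parser whose terminals are tokens rather than characters. The production it implements, `placeholder = LBRACE [WS] VARIABLE [WS FUNCTION [WS]] RBRACE`, is a made-up toy rule, and the token kinds and the `parse_placeholder` name are assumptions for the example, not part of the MessageFormat spec.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (kind, text), as a hypothetical lexer might produce


def parse_placeholder(tokens: List[Token]) -> dict:
    """Parse the toy rule: LBRACE [WS] VARIABLE [WS FUNCTION [WS]] RBRACE."""
    pos = 0

    def peek() -> Optional[str]:
        return tokens[pos][0] if pos < len(tokens) else None

    def expect(kind: str) -> str:
        nonlocal pos
        if peek() != kind:
            raise SyntaxError(f"expected {kind}, found {peek()} at token {pos}")
        text = tokens[pos][1]
        pos += 1
        return text

    def skip_optional_ws() -> bool:
        # Whitespace is a single token, so one check replaces a character loop.
        nonlocal pos
        if peek() == "WS":
            pos += 1
            return True
        return False

    expect("LBRACE")
    skip_optional_ws()
    variable = expect("VARIABLE")
    function = None
    had_ws = skip_optional_ws()
    if peek() == "FUNCTION":
        # Toy rule (an assumption): whitespace is required before a function.
        if not had_ws:
            raise SyntaxError("whitespace is required before a function")
        function = expect("FUNCTION")
        skip_optional_ws()
    expect("RBRACE")
    if pos != len(tokens):
        raise SyntaxError("trailing tokens after placeholder")
    return {"variable": variable, "function": function}
```

Every whitespace decision is a one-token lookahead in this form, and each line of the parser can be checked against one symbol of the production it implements.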

Consider adding a token grammar #729
