Consider adding a token grammar #729

Closed
catamorphism opened this issue Mar 14, 2024 · 3 comments
Labels
Future (Deferred for future standardization), specification (Issue affects the specification)

Comments

@catamorphism
Collaborator

catamorphism commented Mar 14, 2024

Programming languages typically have both a lexical grammar, which describes the structure of tokens, and a token grammar, which describes the syntax of the language and in which terminals are tokens.

The MessageFormat grammar is a lexical grammar; there is no token grammar. Or, perhaps, a single grammar serves as both (with tokens as single characters), depending on how you look at it.

In the future, I think it would be worth refactoring the grammar into two layers: first a lexical grammar describing the tokens (for most languages this can be done with regexps), then a token grammar built on top of it. This separates the description of the syntax from the details of where required and optional whitespace goes, for example.
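To make the lexical layer concrete, here is a minimal sketch of a regexp-based tokenizer. It covers only a simplified, hypothetical subset of placeholder-like syntax and is not derived from the actual MessageFormat ABNF; the token kinds (`LBRACE`, `VARIABLE`, `FUNCTION`, and so on) and the `tokenize` function are names invented for illustration. Whitespace is recognized as a token and then dropped, which is one possible design and does not attempt to model the spec's required-whitespace rules.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str
    text: str


# Hypothetical lexical grammar: one regexp per token kind.
# Order matters only where patterns could overlap.
TOKEN_SPEC = [
    ("WS",       r"[ \t\r\n]+"),
    ("LBRACE",   r"\{"),
    ("RBRACE",   r"\}"),
    ("VARIABLE", r"\$[A-Za-z_][A-Za-z0-9_]*"),
    ("FUNCTION", r":[A-Za-z_][A-Za-z0-9_]*"),
    ("EQUALS",   r"="),
    ("NAME",     r"[A-Za-z_][A-Za-z0-9_]*"),
    ("NUMBER",   r"[0-9]+"),
]
MASTER = re.compile("|".join(f"(?P<{kind}>{pat})" for kind, pat in TOKEN_SPEC))


def tokenize(src: str) -> Iterator[Token]:
    """Yield the non-whitespace tokens of src, or raise SyntaxError."""
    pos = 0
    while pos < len(src):
        m = MASTER.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {src[pos]!r} at offset {pos}")
        pos = m.end()
        if m.lastgroup != "WS":  # whitespace only separates tokens in this sketch
            yield Token(m.lastgroup, m.group())
```

With this in place, `{$count :number digits=2}` and `{ $count :number digits = 2 }` produce the same token sequence, so a grammar written over these tokens would not need to mention optional whitespace at all.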

For example (not the simplest one, unfortunately), see the JavaScript lexical grammar and the token grammar for expressions (the entire token grammar is split across a few different chapters of the JS spec).

Probably some of the other implementations already use a separate lexer and parser, but I chose to write a combined lexer and parser for MessageFormat so that I could tell if I was following the spec exactly (and because I already had to hand-write the parser, parser generators not being a good option in ICU4C). Without a token grammar as part of the spec, it's hard to do that (writing a separate lexer and parser effectively introduces an ad hoc token grammar).

The trouble with the approach I chose is that there are many apparent syntactic ambiguities involving whitespace, which would probably be much easier to handle in an implementation that tokenizes the input before parsing it. Having a separate token grammar would both make it easier for implementors to verify that their front-ends conform to the spec, and make it easier for everyone to understand the syntax.
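As a rough illustration of why tokenizing first helps with whitespace, the sketch below parses a simplified, hypothetical placeholder form (`{ $var :func name=value ... }`) from a token list. The grammar, the token kinds, and the names `lex` and `parse_placeholder` are all assumptions made for this example rather than anything taken from the MessageFormat spec, and the sketch does not try to reproduce the spec's required-whitespace rules. The point is only that the parser never has to ask "is there whitespace here?".

```python
import re

# Hypothetical lexical grammar (whitespace is matched, then discarded).
TOKEN_RE = re.compile(r"""
    (?P<WS>\s+)
  | (?P<LBRACE>\{)
  | (?P<RBRACE>\})
  | (?P<VARIABLE>\$[A-Za-z_]\w*)
  | (?P<FUNCTION>:[A-Za-z_]\w*)
  | (?P<EQUALS>=)
  | (?P<NAME>[A-Za-z_]\w*)
  | (?P<NUMBER>[0-9]+)
""", re.VERBOSE)


def lex(src):
    """Return a list of (kind, text) pairs with whitespace removed."""
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {src[pos]!r} at offset {pos}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


def parse_placeholder(src):
    """Hypothetical token grammar:
    placeholder = LBRACE VARIABLE [FUNCTION option*] RBRACE
    option      = NAME EQUALS (NAME | NUMBER)
    """
    tokens = lex(src)
    i = 0

    def peek():
        return tokens[i][0] if i < len(tokens) else None

    def expect(kind):
        nonlocal i
        if peek() != kind:
            found = tokens[i] if i < len(tokens) else "end of input"
            raise SyntaxError(f"expected {kind}, found {found}")
        text = tokens[i][1]
        i += 1
        return text

    expect("LBRACE")
    result = {"variable": expect("VARIABLE")[1:], "function": None, "options": {}}
    if peek() == "FUNCTION":
        result["function"] = expect("FUNCTION")[1:]
        while peek() == "NAME":
            key = expect("NAME")
            expect("EQUALS")
            kind = peek()
            if kind not in ("NAME", "NUMBER"):
                raise SyntaxError(f"expected an option value after {key}=")
            result["options"][key] = expect(kind)
    expect("RBRACE")
    if peek() is not None:
        raise SyntaxError("unexpected input after placeholder")
    return result
```

Under these assumptions, `parse_placeholder("{$count :number digits=2}")` and `parse_placeholder("{  $count   :number digits = 2 }")` return the same structure, because every whitespace decision was made once, in `lex`.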

@catamorphism added the specification (Issue affects the specification) and LDML46 (LDML46 Release, Tech Preview - October 2024) labels Mar 14, 2024
aphillips added a commit that referenced this issue Mar 17, 2024
Removes a questionable test
@eemeli
Collaborator

eemeli commented Mar 18, 2024

This was closed by mistake.

@eemeli eemeli reopened this Mar 18, 2024
@aphillips added the Future (Deferred for future standardization) label and removed the LDML46 (LDML46 Release, Tech Preview - October 2024) label Sep 9, 2024
@aphillips
Member

I don't think we'll get to this item (assuming we decide to work on it at all) in v46. Marking Future.

@aphillips
Member

Discussed in the 2025-03-10 teleconference. Decided to let this go for now. Can reopen in the future if requirements present themselves.


3 participants