Forbid unquoted non-numeric literals as expression operands #518

Closed
stasm opened this issue Nov 8, 2023 · 18 comments · Fixed by #553
Labels: Agenda+ (Requested for upcoming teleconference), syntax (Issues related with syntax or ABNF)

Comments

@stasm
Collaborator

stasm commented Nov 8, 2023

I propose that we restrict the literal syntax and disallow unquoted non-numeric literals as expression operands.

Today, literals (quoted and unquoted) can appear in 3 positions:

  • as variant keys,
  • as option values,
  • as expression operands.

The first two positions are common and well justified. OTOH, I posit that the third position is very rare, except for numeric values. In fact, I'm not aware of strong use-cases for unquoted non-numeric literals as expression operands. Furthermore, I'm concerned about the risk of confusing such unquoted literals for keywords.


Forbidding unquoted non-numeric literals as expression operands will provide two benefits:

  • Since our syntax uses keywords positioned at the front of a declaration, unquoted literals as expression operands may be confused for novel keywords. Consider:

    {now :datetime} vs. {|now| :datetime}
    {error :msgref} vs. {|error| :msgref}
    

    |now| and |error| are clearly recognizable as literal values.

  • It would open new opportunities for introducing new placeholder or expression syntax: {foo}. See Open/close design: A familiar HTML-like syntax as an alternative #516 for an example use-case.

As a reminder, {5 :number} would still be valid under this proposal, as well as :func opt=unquoted and when unquoted.


ABNF-wise, I think we'd need the following two changes:

  • Split unquoted into unquoted-string and unquoted-numeric.
  • Change literal-expression to "{" [s] (quoted / unquoted-numeric) [s annotation] [s] "}"
@stasm
Collaborator Author

stasm commented Nov 8, 2023

Copying relevant comments from #516 (review):


@aphillips

I think this is relatively difficult to contemplate. We will want unquoted literals for keys. Is unquoted literals for operands different enough a use case?


@gibson042

Hmm, I think I agree. Expressions basically break down into

  • literal with optional function
  • variable with optional function
  • operand-free function
  • spannable open/close

If both variables and functions use a sigil, then I think spannables should as well, leaving literals as the sole sigil-free expression contents (and with consistent quoting requirements regardless of where they appear).

@stasm
Collaborator Author

stasm commented Nov 8, 2023

I'm trying to approach this issue pragmatically, from the perspective of the user:

  • I don't suppose anyone would actually expect {|now| :datetime} to be allowed to be also written as {now :datetime}.
  • I do suppose that a lot of people will actually expect our markup syntax to look something like HTML, making {foo}, {/foo} and {foo/} particularly appealing.

@gibson042
Collaborator

I don't suppose anyone would actually expect {|now| :datetime} to be allowed to be also written as {now :datetime}.

If that is true, then

  • Why should they expect {|42| :number} to also be writable as {42 :number}?
  • Why should they expect {$plurality :number style=|percent|} to also be writable as {$plurality :number style=percent}?

As mentioned in the linked comment, I think it is important for literal quoting requirements to be consistent throughout the grammar. This is going to be a relatively unfamiliar format for pretty much everyone who uses it, and gratuitous inconsistency seems more detrimental than beneficial even if it does save a few keystrokes.

I do suppose that a lot of people will actually expect our markup syntax to look something like HTML, making {foo}, {/foo} and {foo/} particularly appealing.

Counterpoint: given support for {42} and {$var} and {:fn} as placeholders, why should one expect {foo} to act like an HTML tag?

@aphillips added the syntax label (Issues related with syntax or ABNF) on Nov 10, 2023
@catamorphism
Collaborator

I think it's desirable to minimize the number of special cases users have to think about. I vote for forbidding all unquoted literals as expression operands.

How common is it for an operand to be a numeric literal? Is the benefit of being able to write something like {42 :number} concisely worth complicating the grammar?

@aphillips
Member

Actually, I expect that number literals like {42 :number} or date or time literals will be the high-runner case. There are many places where today we ask translators to format numbers and datetime values by hand in translations because the source has the value statically and where the locale will do a better job. Moreover, the locale might be able to respond to user preferences like 12/24 hour time. Stuff like:

On {2023-11-11 :datetime style=shortDate} our opening hours will be...
Capacity for this event is {1123 :number} people...
Subscribe for only {99.0 :number type=currency currency=EUR} per year

String literals will be rare (I hope) because they represent string concatenation in a party dress.

I don't suppose anyone would actually expect {|now| :datetime} to be allowed to be also written as {now :datetime}.

now isn't a time value. It won't format as a datetime value.

Why should they expect {$plurality :number style=|percent|} to also be writable as {$plurality :number style=percent}?

This question is backwards. Why would we force users to write key-value pairs for options with quoted values, given that most values will be predefined word tokens?? If we required {$someValue :number style=|percent|}, the very first feature request (or local deviation by implementers) would be to remove the quotes. Mindless consistency isn't that helpful.

@gibson042
Collaborator

This question is backwards. Why would we force users to write key-value pairs for options with quoted values, given that most values will be predefined word tokens?? If we required {$someValue :number style=|percent|}, the very first feature request (or local deviation by implementers) would be to remove the quotes. Mindless consistency isn't that helpful.

That same reasoning applies to operands, so I think we're in violent agreement here.

@stasm
Collaborator Author

stasm commented Nov 13, 2023

@aphillips

now isn't a time value. It won't format as a datetime value.

I meant it as an example of a potential use-case in which passing a string literal to a function could be useful. I took it from Liquid's date filter.

Just to be clear: I'm not proposing that datetime accept now as a possible string. In fact, what I meant to convey is that I didn't see many use-cases for string literals as operands, so much so that I couldn't produce a convincing example.

I agree that numeric operands are useful and should be encouraged.

@aphillips
Member

I agree that numeric operands are useful and should be encouraged.

Cool. But do you still think that we should create a dichotomy in practice between options and operands regarding unquoted literals? For example:

{{I must {|quote| :function option=quote} differently}}

@stasm
Collaborator Author

stasm commented Dec 1, 2023

My priorities are twofold:

  • Make sure that the spannable syntax is as ergonomic as possible. It's much more important to me to get the spannable syntax right than it is to have consistency in how unquoted literals can be used in different positions in the syntax.

    I'm very willing to sacrifice the {quote :function option=quote} syntax to make room for {tag}...{/tag}. If we pick a different syntax in [Discussion] {{Spannables}} #537 that avoids the syntax conflict then great, but I absolutely wouldn't want us to not pick {tag}...{/tag} only because of the syntax conflict. I really don't see any good use-cases for {quote :function option=quote}.

  • Reduce surprises and make syntax discoverable. It's subjective, but I find {quote :function option=quote} confusing and I think it sets users up for mistakes; it doesn't look like a literal. Even if it was valid, I'd encourage everyone to spell it as {|quote| :function option=quote}.

    Furthermore, I don't see much problem with the dichotomy other than from a strictly grammar-puristic point of view. I regularly type obj.prop and obj["prop with spaces"], and I don't find that problematic at all.

I'm proposing that we separate the concepts of numeric and non-numeric literals, and that we have different rules for them. Unquoted numeric literals can go anywhere. Unquoted non-numeric literals can go in variant keys and option values.
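
To make the proposed position rules concrete, here is a minimal Python sketch. It is not part of the proposal: the function name, the position labels, and the working definition of "numeric" (starts with an ASCII digit, optionally preceded by "-") are all assumptions made for illustration. It also ignores the separate question of which characters an unquoted literal may contain.

```python
import re

# Assumption: a "numeric" literal is anything that starts with an ASCII digit,
# optionally preceded by "-". The thread has not fixed an exact definition.
_STARTS_WITH_DIGIT = re.compile(r"-?[0-9]")

# The three positions in which a literal can appear, per the issue description.
POSITIONS = ("variant-key", "option-value", "operand")


def may_be_unquoted(literal: str, position: str) -> bool:
    """Return True if `literal` could be written without |...| at `position`."""
    if position not in POSITIONS:
        raise ValueError(f"unknown position: {position!r}")
    if _STARTS_WITH_DIGIT.match(literal):
        return True  # numeric literals would be allowed in every position
    # Non-numeric literals: variant keys and option values only.
    return position != "operand"
```

Under these assumptions, `may_be_unquoted("5", "operand")` is true, while `may_be_unquoted("now", "operand")` is false and `may_be_unquoted("percent", "option-value")` is true.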

@aphillips
Member

Thanks @stasm.

My priorities skew towards developer and translator effort. I think if we talk about literals abstractly (as I've been doing) this can obscure how the literals will be used, particularly in operands.

Furthermore, I don't see much problem with the dichotomy other than from a strictly grammar-puristic point of view. I regularly type obj.prop and obj["prop with spaces"], and I don't find that problematic at all.

Isn't your argument backwards? Aren't you glad you can type obj.prop instead of being forced to type obj["prop"]? Shorthands reward laziness and usually exist as a convenience to human users.

One of the interesting things about our syntax is that we are untyped and in many cases there will be strings used to convey typed data. You've identified numeric literals (i.e. numbers) as a case. Date/time values are also going to be common:

{2023-12-01 :date}
{2023-12-01T09:50:00-08:00 :date}
{2023-12-01T09:50:00[America/Los_Angeles] :date}
{12:34:56 :time}

Another common case will be enumerated values, either built-in or application specific. These are string tokens (often in English) but not natural language.

I agree that this depends on the design chosen for spannables (such as in #537 etc.). I just think, for the reasons in my previous comment, that, unless we choose a syntax option that requires it, having only one type of unquoted literal and allowing it anywhere literals are used will make our syntax easier to use than anything that varies.

@stasm
Collaborator Author

stasm commented Dec 2, 2023

Furthermore, I don't see much problem with the dichotomy other than from a strictly grammar-puristic point of view. I regularly type obj.prop and obj["prop with spaces"], and I don't find that problematic at all.

Isn't your argument backwards? Aren't you glad you can type obj.prop instead of being forced to type obj["prop"]? Shorthands reward laziness and usually exist as a convenience to human users.

It was late for me and I didn't explain this clearly, sorry. I meant that obj.prop and obj["prop"] are different syntax contexts, similar to how MF2's operands and option values are different syntax contexts. It doesn't strike me as problematic that they require different spelling.

I am glad, however, that obj["prop"] cannot be also spelled as obj[prop] for convenience or for the sake of consistency with obj.prop.

One of the interesting things about our syntax is that we are untyped and in many cases there will be strings used to convey typed data. You've identified numeric literals (i.e. numbers) as a case. Date/time values are also going to be common:

{2023-12-01 :date}
{2023-12-01T09:50:00-08:00 :date}
{2023-12-01T09:50:00[America/Los_Angeles] :date}
{12:34:56 :time}

All of these make perfect sense to me. Note that they all start with a digit, and I would qualify them as "numeric literals". I don't want us to be too strict about what is a numeric literal: 10e2, 1.2, 1,2, 0xFF, 07, 2+, 3..4, 10MiB -- we need to be flexible to allow many different forms.

The common feature of all of them is that they all start with a digit. We can then extend the definition of "numeric literals" to datetime values. I realize that perhaps calling them "numeric" is misleading. Can you perhaps suggest a better name? A "digit literal"?

Another common case will be enumerated values, either built-in or application specific. These are string tokens (often in English) but not natural language.

Do you have a concrete example of when this would be useful? What sort of enumerated values would benefit from this syntax? I struggle to even come up with something to use as an example.

I'm concerned that not quoting such values makes them not look like values. I hypothesize that I'm not the only developer to whom they look like keywords, commands, function calls, references, etc.

OTOH, I don't have any issue with anything starting with a digit to be an unquoted argument. Because of the digit, the "valueness" is made clear.

@eemeli
Collaborator

eemeli commented Dec 2, 2023

My main interest here would be for our syntax to not be weird. By that, I mean that I would prefer for e.g. the rules determining a user's question "Does this literal value need to be quoted?" to be as simple as possible.

With our current syntax, those rules are not simple. For example, I'm continuously catching myself when considering that even though time=12:34 and domain=example.com are valid, href=http://example.com isn't, because of the /. Adding further divergences between literals in operand and option positions would increase the weirdness.

I also note that it looks like no-one caught the mistake in @aphillips's datetime examples above, where the [, / and ] characters in his third example are not currently valid in unquoted literals.

My preferred solution here would be to only allow the following as unquoted operands or options:

  • Numbers matching the regexp -?(0|[1-9]\d*)(\.\d+)?([eE][+-]?\d+)?.
  • The strings true, false and null.

In case you didn't spot it, those rules rather intentionally match JSON.
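
As a rough illustration of this rule, the following Python sketch checks whether a value would count as a JSON-style primitive. The function name is made up for this example; the number pattern is the regexp quoted above, compiled with `re.ASCII` so that `\d` only matches 0-9.

```python
import re

# The number pattern quoted above; re.ASCII keeps \d limited to 0-9.
JSON_NUMBER = re.compile(r"-?(0|[1-9]\d*)(\.\d+)?([eE][+-]?\d+)?", re.ASCII)
JSON_KEYWORDS = {"true", "false", "null"}


def is_json_primitive(value: str) -> bool:
    """Return True if `value` could stay unquoted in an expression under this rule."""
    return value in JSON_KEYWORDS or JSON_NUMBER.fullmatch(value) is not None
```

With this check, `42`, `-1.5e10` and `true` would stay unquoted, while `percent`, `12:34`, `07` and `0xFF` would need quotes.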

I would separate out the rule for unquoted keys to be a wholly separate one. For that, I would accept any non-empty sequence of Unicode word characters as unquoted.

In the syntax, we're treating variant keys as rather different from operators and operands, and we're also using them rather differently when formatting a message. Keys are not in expressions but as their own thing, and they're effectively custom syntax that's enabled by custom functions, rather than values. Each selector (like :plural) will need to provide its user with guidance on what keys it'll match, and then they'll feel like syntax for the user, quite unlike operand and option values.

Put together, the answer I'd like to give to the user of the first paragraph is "In expressions, JSON primitives don't need quotes. In variant keys, words don't need quotes." Simple, clear, not weird.

@stasm
Collaborator Author

stasm commented Dec 2, 2023

With our current syntax, those rules are not simple. For example, I'm continuously catching myself when considering that even though time=12:34 and domain=example.com are valid, href=http://example.com isn't, because of the /. Adding further divergences between literals in operand and option positions would increase the weirdness.

I'd be very sad if I wasn't able to type case=accusative because of consistency with href=|http://example.com|. As a user, what do I gain from this consistency?

I would separate out the rule for unquoted keys to be a wholly separate one. For that, I would accept any non-empty sequence of Unicode word characters as unquoted.

Can you summarize how word is different from nmtoken? It looks like it allows many more mark characters. Anything else?

@stasm
Collaborator Author

stasm commented Dec 2, 2023

A more general observation about our unquoted syntax. It occurs to me that we can choose one of three approaches:

  1. Be very strict on purpose, because the primary use cases are words like many, neuter, and accusative. They are defined in CLDR using LDML, and technically they can be as wide as an Nmtoken, but in practice, they're mostly [a-z]+.

    This is the approach taken by HTML 4:

    In certain cases, authors may specify the value of an attribute without any quotation marks. The attribute value may only contain letters (a-z and A-Z), digits (0-9), hyphens (ASCII decimal 45), periods (ASCII decimal 46), underscores (ASCII decimal 95), and colons (ASCII decimal 58). We recommend using quotation marks even when it is possible to eliminate them.

  2. Be somewhat liberal, and allow many more characters beyond ASCII. We can base on XML's Nmtoken or Unicode's word. They are reasonable but slightly different, and remembering which characters are allowed and which aren't is going to be tricky.

    This is MF2's approach right now.

  3. Be maximally liberal, and allow all characters except for those that would create ambiguity with other productions (e.g. a space or a vertical line).

    This is the approach taken by HTML 5:

    The attribute name, followed by zero or more space characters, followed by a single U+003D EQUALS SIGN character, followed by zero or more space characters, followed by the attribute value, which, in addition to the requirements given above for attribute values, must not contain any literal space characters, any U+0022 QUOTATION MARK characters ("), U+0027 APOSTROPHE characters ('), U+003D EQUALS SIGN characters (=), U+003C LESS-THAN SIGN characters (<), U+003E GREATER-THAN SIGN characters (>), or U+0060 GRAVE ACCENT characters (`), and must not be the empty string.

Perhaps 1 or 3 would be better than the current 2?
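
For comparison, approaches 1 and 3 can be sketched as two character-class checks in Python. This is only an illustration built from the HTML 4 and HTML 5 wording quoted above; the function names are invented here, and an MF2 version of approach 3 would presumably exclude a different set of characters (for example the vertical line), which this sketch does not attempt to model.

```python
import re

# Approach 1: the HTML 4 set quoted above (letters, digits, "-", ".", "_", ":").
STRICT = re.compile(r"[A-Za-z0-9\-._:]+")

# Approach 3: the HTML 5 rule quoted above: a non-empty value with no space
# characters, quotation marks, apostrophes, "=", "<", ">", or "`".
LIBERAL = re.compile(r"[^ \t\n\f\r\"'=<>`]+")


def allowed_strict(value: str) -> bool:
    return STRICT.fullmatch(value) is not None


def allowed_liberal(value: str) -> bool:
    return LIBERAL.fullmatch(value) is not None
```

Both checks accept `accusative` and `12:34`. Only the liberal one accepts `http://example.com`, and both reject a value containing a space.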

@aphillips
Member

@eemeli noted:

In the syntax, we're treating variant keys as rather different from operators and operands, and we're also using them rather differently when formatting a message. Keys are not in expressions but as their own thing, and they're effectively custom syntax that's enabled by custom functions, rather than values. Each selector (like :plural) will need to provide its user with guidance on what keys it'll match, and then they'll feel like syntax for the user, quite unlike operand and option values.

I agree, although I would go further. The SelectFormat MF1 interface uses user-defined keys and for compatibility we will probably provide a :select function. Technically the keys are just any old string, but rational users pick enumerated "code-like" keywords, which they will prefer (like @stasm's accusative etc.) to unquote "just like :plural"

I think the ISO8601-like syntaxes put huge pressure on unquoted if we want to include them. I would favor narrowing unquoted to be somewhat restrictive just to make it easier for folks to understand when |quotes are needed|.

@stasm the "maximally liberal" approach has the problem that our syntax depends on sigils.

So my tendency is to make things only slightly wider than @eemeli's proposal. The number regex or [a-zA-Z0-9]* (e.g. super, UGLY, sUpErUgLy2000 etc. all work). If we want 8601-like syntax, then we're slipping down the type support slope.

@eemeli
Collaborator

eemeli commented Dec 3, 2023

I'd be very sad if I wasn't able to type case=accusative because of consistency with href=|http://example.com|. As a user, what do I gain from this consistency?

Fewer errors in the messages you write. The simpler the rules, the easier it is to not break them. Us breaking our own current rules in the examples we write is a rather strong indicator that they're suboptimal.

Can you summarize how word is different from nmtoken? It looks like it allows many more mark characters. Anything else?

It doesn't allow for at least period ., hyphen-minus -, colon : or middot ·, all of which are valid in nmtoken. It does allow for sequences like accusative that make up a single "word".
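
The difference can be checked with a small Python sketch. Two assumptions apply: Python's `\w` is used as a stand-in for "Unicode word characters", which may not be exactly the definition intended here, and the Nmtoken character class is transcribed from the NameChar production of XML 1.0 (Fifth Edition). The function names are made up for this example.

```python
import re

# Assumption: Python's \w (Unicode letters, digits, underscore) stands in for
# "Unicode word characters".
WORD = re.compile(r"\w+")

# XML 1.0 (Fifth Edition) NameStartChar ranges.
_NAME_START = (
    ":A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D"
    "\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    "\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
# NameChar adds "-", ".", digits, middle dot, and two more ranges;
# an Nmtoken is one or more NameChar.
NMTOKEN = re.compile(
    "[" + _NAME_START + r"\-.0-9\u00B7\u0300-\u036F\u203F-\u2040" + "]+"
)


def is_word(value: str) -> bool:
    return WORD.fullmatch(value) is not None


def is_nmtoken(value: str) -> bool:
    return NMTOKEN.fullmatch(value) is not None
```

Here `accusative` and `tämä` pass both checks, while `example.com`, `12:34` and `a-b` pass only `is_nmtoken`. Neither accepts `http://example.com`.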

So my tendency is to make things only slightly wider than @eemeli's proposal. The number regex or [a-zA-Z0-9]* (e.g. super, UGLY, sUpErUgLy2000 etc. all work). If we want 8601-like syntax, then we're slipping down the type support slope.

@aphillips Is the rule you're proposing a universal one covering operands, options and keys, or does it differ for some of those?

Also, as not all code and source messages are written in English, I feel strongly that if this is valid as unquoted, then e.g. tämä needs to be valid as well, along with non-Latin characters. Hence my use of "word" characters above.

@aphillips
Member

The rule I suggested above is for unquoted and the discussion we're having is about the namespace of unquoted. So it would be for all uses of unquoted literals (keys, operands, and options)

We haven't clarified what we're doing here yet, so we sometimes make bad examples. The idea we're all poking around at here is that unquoted literals should be "word-like" or "identifier-like" and--whoa!--we have an identifier concept. That concept includes namespaces, which do not apply to unquoted, so I think my proposal would be: make unquoted support name or a number pattern such as the regex @eemeli suggested above.

unquoted = name
         / number
number = ["-"] ("0" / (%x31-39 *DIGIT)) ["." 1*DIGIT] [%i"e" ["+" / "-"] 1*DIGIT]

This precludes unquoted time values because it lacks : (dates without times are okay). We could extend the above by adding something like:

unquoted = name
         / number
         / datetime
; profile of RFC3339 and SEDATE's extension thereof 
; see https://www.rfc-editor.org/rfc/rfc3339.html#section-5.6
datetime = 4DIGIT "-" 2DIGIT "-" 2DIGIT [ %s"T" 2DIGIT ":" 2DIGIT [ ":" 2DIGIT [ "." 1*DIGIT ]]
           [%s"Z" / ( ("+"/"-") 2DIGIT ":" 2DIGIT ) ]
           [ ALPHA *(ALPHA / "_") "/" ALPHA *(ALPHA / "_") ]]

This "infects" us with types in some ways, but in a way that users might understand?
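
To see what the datetime production would accept, here is a hedged Python transcription of it as a regular expression. It is a mechanical translation of the ABNF as written, not a validated RFC 3339 parser, and the function name is made up for this example.

```python
import re

DATETIME = re.compile(
    r"[0-9]{4}-[0-9]{2}-[0-9]{2}"      # 4DIGIT "-" 2DIGIT "-" 2DIGIT
    r"(?:T[0-9]{2}:[0-9]{2}"           # [ %s"T" 2DIGIT ":" 2DIGIT
    r"(?::[0-9]{2}(?:\.[0-9]+)?)?"     #   [ ":" 2DIGIT [ "." 1*DIGIT ]]
    r"(?:Z|[+-][0-9]{2}:[0-9]{2})?"    #   [ %s"Z" / (("+"/"-") 2DIGIT ":" 2DIGIT) ]
    r"(?:[A-Za-z][A-Za-z_]*/[A-Za-z][A-Za-z_]*)?"  # [ zone id such as Area/City ]
    r")?"                              # ]
)


def is_datetime(value: str) -> bool:
    return DATETIME.fullmatch(value) is not None
```

Transcribed this way, `2023-12-01` and `2023-12-01T09:50:00-08:00` match, and a bare time such as `12:34:56` does not. The production as written has no literal square brackets, so the bracketed form `2023-12-01T09:50:00[America/Los_Angeles]` from earlier in the thread would not match either.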

@aphillips added the Agenda+ label (Requested for upcoming teleconference) on Dec 3, 2023
@aphillips
Member

In the teleconference of 2023-12-04 it was agreed that unquoted would be a union of name and a new production for number as shown above. datetime was explicitly not adopted.
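
A minimal Python sketch of the agreed shape, unquoted = name / number, might look like the following. Two things are assumptions here: the `NAME` pattern is a simplified, ASCII-only stand-in (the actual name production is defined elsewhere in the spec and is broader), and the number pattern follows the regex suggested earlier in the thread, with an optional exponent sign.

```python
import re

# Number pattern following the regex suggested earlier in the thread.
NUMBER = re.compile(r"-?(?:0|[1-9][0-9]*)(?:\.[0-9]+)?(?:[eE][+-]?[0-9]+)?")

# Hypothetical ASCII-only stand-in for `name`; the real production is broader.
NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")


def is_unquoted(value: str) -> bool:
    """Return True if `value` would not need |...| under this sketch."""
    return bool(NAME.fullmatch(value) or NUMBER.fullmatch(value))
```

Under these assumptions `42`, `-1.5e3`, `99.0` and `accusative` can stay unquoted, while `12:34` and `http://example.com` need quotes.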

5 participants