From 7a6bd2e60d10a30ac5f5da798c90d41290a28b26 Mon Sep 17 00:00:00 2001 From: Mihai Nita Date: Fri, 8 May 2020 15:34:18 -0700 Subject: [PATCH 1/5] First version comitted --- doc/why_mf_next.md | 47 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 47 insertions(+) create mode 100644 doc/why_mf_next.md diff --git a/doc/why_mf_next.md b/doc/why_mf_next.md new file mode 100644 index 0000000000..6d549c34a2 --- /dev/null +++ b/doc/why_mf_next.md @@ -0,0 +1,47 @@ +# Why MessageFormat needs a successor ([issue #49](https://github.com/unicode-org/message-format-wg/issues/49)) + + +The `MessageFormat` has been around for a long time. + +Its “ancestor”, [java/text/MessageFormat](https://docs.oracle.com/javase/7/docs/api/java/text/MessageFormat.html), was introduced with Java 1.4, February 2002. + +The [ICU MessageFormat](https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/MessageFormat.html) is tagged as stable API since ICU 3.0 (June 2004) + +The ICU version evolved compared to the JDK one: +* added support for plurals (ICU 3.8, 2007) +* added support for select (ICU 4.4, 2010) +* named arguments (`...{user}...` vs `...{0}...`) +* better handling of the apostrophe escaping +* date/time/number skeletons (2018) +* more + +Despite being around for such a long time, it is still not well supported by localization tools. + +Other efforts: Fluent, FB + +* No support for advanced features (for example inflections) +* Not standard. There are implementations for JavaScript, Closure, Dart, Go, others, but because there is no standard they are all slightly different (and incompatible). Would be nice to have at least a data-driven test suite. +* Not well supported by localization tools +* No standard way to extend it (would need to fork + change the ICU code) +* Moving too slowly. 
The arguments supported by MessageFormat right now are `number` (`integer`, `currency`, `percent`), `date`, `time`, `spellout`, `ordinal`, `duration`, and the selectors are `choice`, `plural`, `select`, and `selectordinal`. But ICU itself already supports a lot more: intervals, relative dates and times, lists, measurements, compact decimals. And we would like even more, both formatters and selectors (think gender, inflections, formality level) +* Carying with it legacy bagage that we know now better: date/time patterns, `ChoiceFormat`, clunky syntax (especially for nested plural/select), problematic escaping, selectors on part of the message +* It is hard to add new functionality while keeping backward compatibility +* We would like: inflections, protecting message ranges, formatToValue, formatting (think html) +* High "impedance" when converting to / from localization tools + +--- + +Mandatory xkcd: \ +[](https://xkcd.com/927/) + +--- + +_The Message Format Working Group (MFWG) is tasked with developing an industry +standard for the representation of localizable message strings to be a +successor to ICU MessageFormat. MFWG will recommend how to remove +redundancies, make the syntax more usable, and support more complex features, +such as gender, inflections, and speech. MFWG will also consider the +integration of the new standard with programming environments, including, but +not limited to, ICU, DOM, and ECMAScript, and with localization platform +interchange. 
The output of MFWG will be a specification for the new syntax, +which is expected to be on track to become a Unicode Technical Standard._ From 0949b0ff961f52c00960950db90a5cfe0d7a4739 Mon Sep 17 00:00:00 2001 From: Mihai Nita Date: Thu, 14 May 2020 16:49:22 -0700 Subject: [PATCH 2/5] Trying to go to the 'root causes' --- doc/why_mf_next.md | 102 +++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 95 insertions(+), 7 deletions(-) diff --git a/doc/why_mf_next.md b/doc/why_mf_next.md index 6d549c34a2..6b4356deba 100644 --- a/doc/why_mf_next.md +++ b/doc/why_mf_next.md @@ -1,7 +1,8 @@ -# Why MessageFormat needs a successor ([issue #49](https://github.com/unicode-org/message-format-wg/issues/49)) +# Why `MessageFormat` needs a successor ([issue #49](https://github.com/unicode-org/message-format-wg/issues/49)) +## Intro -The `MessageFormat` has been around for a long time. +The `MessageFormat` API and syntax have been around for a long time. Its “ancestor”, [java/text/MessageFormat](https://docs.oracle.com/javase/7/docs/api/java/text/MessageFormat.html), was introduced with Java 1.4, February 2002. @@ -19,20 +20,107 @@ Despite being around for such a long time, it is still not well supported by loc Other efforts: Fluent, FB +## Core problems with the current `MessageFormat` + +I've started with the list of problems in the next section. +But that ends up being a (biased) reshuffling of the issues and feature requests that we collected in [GitHub](https://github.com/unicode-org/message-format-wg/issues) + +So I have tried to distill that to a few root causes +(think [“5 why”](https://en.wikipedia.org/wiki/Five_whys)) + +I think that these are the problems we need to avoid repeating, +otherwise it is just a matter of time until we end up in the same place. + +Here it is: +1. Does not have any “extension points” +2. Not a formal standard with an “acceptance test suite” +3. Can't remove anything, even if we know know better +4. 
Hard to map to the existing localization core structures +5. Designed to work on plain text, UI, “imperative style” + +### 1. Does not have any “extension points” + +No extension points means that it is hard to add new functionality unless you +are doing it in ICU itself. +It also means most tools used to process these messages are built rigidly, +and are unprepared to handle changes +(think localization tools, liners, friendly UIs, etc.). + +### 2. Not a formal standard with an “acceptance test suite” + +This means that the implementations ICU4C and ICU4J are +“de facto reference-implementations”, and the ports to other languages +(JavaScript, Go, Dart, etc.) are at risk for being “slightly incompatible” + +### 3. Can't remove anything, even if we know know better + +ICU is old, but also very popular (right now it is the core i18n library +for all major operating systems, and many products). + +This is how we have both numeric and named parameters, partial strings in +plural / select (technically concatenation, which is bad i18n), date / time +patterns (bad i18n, when skeletons are the better way), nesting selectors, +unfriendly escaping (think doubling the apostrophe `''` ), `#` in plurals. + +Most of it can't be “blamed” directly on a bad decision, it is just time +teaching us what works (for instance skeletons did not exist when the +date/time parameters were added). + +But the stability requirements prevent any major cleanup. + +### 4. Hard to map to the existing localization core structures + +The format is not supported by any major localization system that I know of. \ +I think that the root cause of that is not it is difficult to parse. +Because it is not. And ICU4J has public API for parsing. + +Most translation tools take a string (with placeholders) in a source language +and give back a translated string, usually with the same placeholders +(with some degree of flexibility). 
+ +It makes it very difficult to translate things like plurals, where the input +has (for example) 2 “message variants” (English, 1 / many, singular / plural), +and return 4 message variants for Russian, for example. + +This is not a superficial problem. It affects most steps in the normal +localization flow: +* leveraging (the same string “X files” must be translated +in 2/3 different ways) +* validation (placeholders, length, terminology, etc.) +* word count and payment +* alignment (the process of creating a TM from source + translated documents) + +### 5. Designed to work on plain text, UI, “imperative style” + +The main (only?) use case was: load the string from resources, +replace placeholders, and return the string result. + +It does not play well with binding, use formatting tags (thing `html`), +or “document-like” content, like templating +(think [freemarker](https://freemarker.apache.org/), +[mustache](https://mustache.github.io/), etc.) + +--- + +**If we agree on the above section, we can drop this.** +**Or we “map” these bullets to the root causes above.** + +## Problems with the current `MessageFormat` + * No support for advanced features (for example inflections) * Not standard. There are implementations for JavaScript, Closure, Dart, Go, others, but because there is no standard they are all slightly different (and incompatible). Would be nice to have at least a data-driven test suite. * Not well supported by localization tools * No standard way to extend it (would need to fork + change the ICU code) -* Moving too slowly. The arguments supported by MessageFormat right now are `number` (`integer`, `currency`, `percent`), `date`, `time`, `spellout`, `ordinal`, `duration`, and the selectors are `choice`, `plural`, `select`, and `selectordinal`. But ICU itself already supports a lot more: intervals, relative dates and times, lists, measurements, compact decimals. 
And we would like even more, both formatters and selectors (think gender, inflections, formality level) -* Carying with it legacy bagage that we know now better: date/time patterns, `ChoiceFormat`, clunky syntax (especially for nested plural/select), problematic escaping, selectors on part of the message +* Moving too slowly. The arguments supported by `MessageFormat` right now are `number` (`integer`, `currency`, `percent`), `date`, `time`, `spellout`, `ordinal`, `duration`, and the selectors are `choice`, `plural`, `select`, and `selectordinal`. But ICU itself already supports a lot more: intervals, relative dates and times, lists, measurements, compact decimals. And we would like even more, both formatters and selectors (think gender, inflections, formality level) +* Carrying with it legacy baggage that we know now better: date/time patterns, `ChoiceFormat`, clunky syntax (especially for nested plural/select), problematic escaping, selectors on part of the message * It is hard to add new functionality while keeping backward compatibility -* We would like: inflections, protecting message ranges, formatToValue, formatting (think html) -* High "impedance" when converting to / from localization tools +* We would like: inflections, protecting message ranges, `formatToValue`, formatting (think `html` tags) +* High “impedance” when converting to / from localization tools --- Mandatory xkcd: \ -[](https://xkcd.com/927/) +[](https://xkcd.com/927/) --- From db8313504e460d7739163f418b074a6f007332c2 Mon Sep 17 00:00:00 2001 From: Mihai Nita Date: Thu, 14 May 2020 16:55:35 -0700 Subject: [PATCH 3/5] Another cause, removing the old style (list of issue) --- doc/why_mf_next.md | 48 +++++++++++++--------------------------------- 1 file changed, 13 insertions(+), 35 deletions(-) diff --git a/doc/why_mf_next.md b/doc/why_mf_next.md index 6b4356deba..e1fe6ea568 100644 --- a/doc/why_mf_next.md +++ b/doc/why_mf_next.md @@ -36,7 +36,7 @@ Here it is: 2. 
Not a formal standard with an “acceptance test suite” 3. Can't remove anything, even if we know know better 4. Hard to map to the existing localization core structures -5. Designed to work on plain text, UI, “imperative style” +5. Designed to be API only, plain text, UI, “imperative style” ### 1. Does not have any “extension points” @@ -90,46 +90,24 @@ in 2/3 different ways) * word count and payment * alignment (the process of creating a TM from source + translated documents) -### 5. Designed to work on plain text, UI, “imperative style” +### 5. Designed to be API only, plain text, UI, “imperative style” The main (only?) use case was: load the string from resources, -replace placeholders, and return the string result. +replace placeholders, and return the string result. +An i18n-aware `printf`, basically. -It does not play well with binding, use formatting tags (thing `html`), -or “document-like” content, like templating +It does not play well with binding, formatting tags (thing `html`), +protecting content from translation, or “document-like” content, like templates (think [freemarker](https://freemarker.apache.org/), -[mustache](https://mustache.github.io/), etc.) +[mustache](https://mustache.github.io/), even JSP, PHP, etc.) ---- - -**If we agree on the above section, we can drop this.** -**Or we “map” these bullets to the root causes above.** - -## Problems with the current `MessageFormat` - -* No support for advanced features (for example inflections) -* Not standard. There are implementations for JavaScript, Closure, Dart, Go, others, but because there is no standard they are all slightly different (and incompatible). Would be nice to have at least a data-driven test suite. -* Not well supported by localization tools -* No standard way to extend it (would need to fork + change the ICU code) -* Moving too slowly. 
The arguments supported by `MessageFormat` right now are `number` (`integer`, `currency`, `percent`), `date`, `time`, `spellout`, `ordinal`, `duration`, and the selectors are `choice`, `plural`, `select`, and `selectordinal`. But ICU itself already supports a lot more: intervals, relative dates and times, lists, measurements, compact decimals. And we would like even more, both formatters and selectors (think gender, inflections, formality level) -* Carrying with it legacy baggage that we know now better: date/time patterns, `ChoiceFormat`, clunky syntax (especially for nested plural/select), problematic escaping, selectors on part of the message -* It is hard to add new functionality while keeping backward compatibility -* We would like: inflections, protecting message ranges, `formatToValue`, formatting (think `html` tags) -* High “impedance” when converting to / from localization tools +And it was API only. \ +No standard way to store the stings in a serialized format and to carry +info or directives for translators or localization tools. +No comments, length limits, protecting non-translatable sections of text, etc. --- -Mandatory xkcd: \ -[](https://xkcd.com/927/) - ---- +**Mandatory xkcd:** -_The Message Format Working Group (MFWG) is tasked with developing an industry -standard for the representation of localizable message strings to be a -successor to ICU MessageFormat. MFWG will recommend how to remove -redundancies, make the syntax more usable, and support more complex features, -such as gender, inflections, and speech. MFWG will also consider the -integration of the new standard with programming environments, including, but -not limited to, ICU, DOM, and ECMAScript, and with localization platform -interchange. 
The output of MFWG will be a specification for the new syntax, -which is expected to be on track to become a Unicode Technical Standard._ +[](https://xkcd.com/927/) From 5eff76825ffa18c02a81f4fd0c7d96750c1a2c30 Mon Sep 17 00:00:00 2001 From: Mihai Nita Date: Thu, 11 Jun 2020 13:58:53 -0700 Subject: [PATCH 4/5] Implemented some of the feedback, clarified some areas --- doc/why_mf_next.md | 95 ++++++++++++++++++++++++---------------------- 1 file changed, 49 insertions(+), 46 deletions(-) diff --git a/doc/why_mf_next.md b/doc/why_mf_next.md index e1fe6ea568..b40ffd8404 100644 --- a/doc/why_mf_next.md +++ b/doc/why_mf_next.md @@ -4,39 +4,28 @@ The `MessageFormat` API and syntax have been around for a long time. -Its “ancestor”, [java/text/MessageFormat](https://docs.oracle.com/javase/7/docs/api/java/text/MessageFormat.html), was introduced with Java 1.4, February 2002. +Intro +* `MessageFormat` is the Unicode API for software localization +* It is 20 years old, a well-designed, proven solution +* Its design is optimized for the software development model of 20 years ago and its +shortcomings result in mixed reception and adoption by the industry. -The [ICU MessageFormat](https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/text/MessageFormat.html) is tagged as stable API since ICU 3.0 (June 2004) +The current wave of software development uses dynamic languages, modern UI +frameworks and new forms of user interactions (voice, VR, etc.). -The ICU version evolved compared to the JDK one: -* added support for plurals (ICU 3.8, 2007) -* added support for select (ICU 4.4, 2010) -* named arguments (`...{user}...` vs `...{0}...`) -* better handling of the apostrophe escaping -* date/time/number skeletons (2018) -* more +Considering these new challenges, combined with the lessons learned from using +`MessageFormat`, we aim to design the next iteration of `MessageFormat` +suitable for the current generation of software, and for adoption by Web Standards. 
-Despite being around for such a long time, it is still not well supported by localization tools. - -Other efforts: Fluent, FB +Other efforts: [Fluent](https://projectfluent.org/), +[FBT](https://facebook.github.io/fbt/) ## Core problems with the current `MessageFormat` -I've started with the list of problems in the next section. -But that ends up being a (biased) reshuffling of the issues and feature requests that we collected in [GitHub](https://github.com/unicode-org/message-format-wg/issues) - -So I have tried to distill that to a few root causes -(think [“5 why”](https://en.wikipedia.org/wiki/Five_whys)) - -I think that these are the problems we need to avoid repeating, -otherwise it is just a matter of time until we end up in the same place. - -Here it is: 1. Does not have any “extension points” -2. Not a formal standard with an “acceptance test suite” -3. Can't remove anything, even if we know know better -4. Hard to map to the existing localization core structures -5. Designed to be API only, plain text, UI, “imperative style” +2. Can't remove anything, even if now we know better +3. Hard to map to the existing localization core structures +4. Designed to be API only, plain text, UI, “imperative style” ### 1. Does not have any “extension points” @@ -46,13 +35,7 @@ It also means most tools used to process these messages are built rigidly, and are unprepared to handle changes (think localization tools, liners, friendly UIs, etc.). -### 2. Not a formal standard with an “acceptance test suite” - -This means that the implementations ICU4C and ICU4J are -“de facto reference-implementations”, and the ports to other languages -(JavaScript, Go, Dart, etc.) are at risk for being “slightly incompatible” - -### 3. Can't remove anything, even if we know know better +### 2. Can't remove anything, even if now we know better ICU is old, but also very popular (right now it is the core i18n library for all major operating systems, and many products). 
@@ -68,10 +51,10 @@ date/time parameters were added). But the stability requirements prevent any major cleanup. -### 4. Hard to map to the existing localization core structures +### 3. Hard to map to the existing localization core structures -The format is not supported by any major localization system that I know of. \ -I think that the root cause of that is not it is difficult to parse. +The format is not well supported by any major localization system. \ +The root cause of that is not that it is difficult to parse. Because it is not. And ICU4J has public API for parsing. Most translation tools take a string (with placeholders) in a source language @@ -90,21 +73,41 @@ in 2/3 different ways) * word count and payment * alignment (the process of creating a TM from source + translated documents) -### 5. Designed to be API only, plain text, UI, “imperative style” +### 4. Designed to be API only, plain text, UI, “imperative style” -The main (only?) use case was: load the string from resources, -replace placeholders, and return the string result. +The main (only?) use case for `MessageFormat` is: load the string from resources, +replace placeholders, and return the resulting string. \ An i18n-aware `printf`, basically. -It does not play well with binding, formatting tags (thing `html`), -protecting content from translation, or “document-like” content, like templates -(think [freemarker](https://freemarker.apache.org/), +It does not play well with binding, formatting tags (think `html`), +or “document-like” content (for example, templating systems like +[freemarker](https://freemarker.apache.org/), [mustache](https://mustache.github.io/), even JSP, PHP, etc.) -And it was API only. \ -No standard way to store the stings in a serialized format and to carry -info or directives for translators or localization tools. -No comments, length limits, protecting non-translatable sections of text, etc. 
+Because it is API only, it has no standard way to store the strings in a +serialized format and to carry info or directives for translators or +localization tools. \ +So there is no way for a message to reference another message, or to fall back +to a different locale. That is all left to the "host resource manager" +(whatever that is for the given tech stack). + +There is also no metadata: comments, length limits, examples, links, +protecting non-translatable sections of text, etc. + +But this is also an advantage. + +One can store the strings in the format recommended for the tech stack used +(`.properties`, `.strings`, `.rc`, `.resx`, `strings.xml`, `.po`, databases, etc.). + +Applications don't need to migrate all the strings to a new format and resource +resolution only to support some more advanced features in a few messages. + +And since the string loading is left to the underlying tech stack, it means that +the locale resolution and fallback are consistent with everything else. \ +For example, in Android there is locale-based selection (with fallback) for +styles, images, sounds, any kind of assets. \ +So there is no risk that the string fallback is different from the sound +fallback, for example. --- From 73f3bdbe585806ad7a31b604e34108d3ec8fd372 Mon Sep 17 00:00:00 2001 From: Mihai Nita Date: Mon, 20 Jul 2020 01:40:54 -0700 Subject: [PATCH 5/5] Implemented feedback from the June 15th meeting --- doc/why_mf_next.md | 40 +++++++++++++++++++++++++++++++++++----- 1 file changed, 35 insertions(+), 5 deletions(-) diff --git a/doc/why_mf_next.md b/doc/why_mf_next.md index b40ffd8404..544821be0f 100644 --- a/doc/why_mf_next.md +++ b/doc/why_mf_next.md @@ -22,20 +22,33 @@ Other efforts: [Fluent](https://projectfluent.org/), ## Core problems with the current `MessageFormat` -1. Does not have any “extension points” -2. Can't remove anything, even if now we know better +1. 
The design is not modular enough + * Does not have any “extension points” + * Can't deprecate anything, even if now we know better +2. Some existing problems 3. Hard to map to the existing localization core structures 4. Designed to be API only, plain text, UI, “imperative style” -### 1. Does not have any “extension points” +### 1. The design is not modular enough + +The "data model" is hard-coded in the standard and in the syntax. +This makes it very rigid. + +#### 1.1 Does not have any “extension points” No extension points means that it is hard to add new functionality unless you are doing it in ICU itself. + It also means most tools used to process these messages are built rigidly, and are unprepared to handle changes -(think localization tools, liners, friendly UIs, etc.). +(think localization tools, linters, friendly UIs, etc.). -### 2. Can't remove anything, even if now we know better +The most basic functionality would be adding a new formatter. Meanwhile, ICU +added other formatters: time intervals, measurement, lists. But `MessageFormat` +did not keep up. And adding support for any of these new formats risks breaking +existing tools. + +#### 1.2 Can't deprecate anything, even if now we know better ICU is old, but also very popular (right now it is the core i18n library for all major operating systems, and many products). @@ -51,6 +64,23 @@ date/time parameters were added). But the stability requirements prevent any major cleanup. +### 2. Some existing problems +* ICU added new formatters, but `MessageFormat` does not support them +* Combined selectors (select + plural) result in unreadable and +error-prone nesting +* Select and plurals inside the message are difficult to translate because +grammatical agreement requires words outside select / plural to change. +See https://en.wikipedia.org/wiki/Agreement_(linguistics) +* Patterns in the date / time / number placeholders are bad i18n; skeletons should be used instead +* No official support for gender. 
It can be done with `select`, but it +is not the same thing (much like the difference between an `enum` and free-form integers or strings). Developers are free to use masculine/feminine, masc/fem, male/female, etc. +* Formatting for “parameters” known at compile time +* Escaping with the apostrophe is error-prone. There is no reliable way to tell if +it has to be doubled or not. +* The `#` is used in plural format instead of `{...}`, but it does not work with nesting unless the plural is the innermost selector. And named placeholders don't work +properly for plurals with offset. So there are 2 ways to do the same thing that work in 98% of cases, but in special situations only one of them works. +* Does not support inflections, and it would be hard to add without breaking existing tools. + ### 3. Hard to map to the existing localization core structures The format is not well supported by any major localization system. \