Arabic Numeric Shaping Support #553
Hi @AhmedMustafa, thanks for the detailed description and extensive code in your pull request. Some questions follow...
I understood that numeric shaping is required to fix incorrectly written documents. For example, a correctly written Arabic paragraph would use native or Arabic-Indic numerals, given that's what Arabic users expect. Please correct me if I'm wrong. About the solution...
1: Globalize('ar').formatNumber(3.1415, {maximumFractionDigits: 4})
2: Globalize('ar').formatNumber(100000)
3: Globalize('ar').formatDate(new Date())
In #503 (comment), @tomerm said "we can certainly provide shaping functionality for a wider range of scripts not only for Arabic", but in this implementation I still see support strictly for Arabic script only. How hard is it to extend the implementation to other scripts? By the way, which scripts need analogous support? Thanks
PS: Note to self, keeping examples here:
var Globalize = require('globalize');
Globalize.load(require('cldr-data').entireSupplemental());
Globalize.load(require('cldr-data').entireMainFor('en', 'ar'));
Globalize('ar').formatNumber(Globalize('en').parseNumber('123'));
// > '١٢٣'
Globalize('ar').formatNumber(Globalize('en').parseNumber('100,000'));
// > '١٠٠٬٠٠٠'
Globalize('ar').formatDate(Globalize('en').parseDate('11/17/2015'));
// > '١٧/١١/٢٠١٥'
@rxaviers As was mentioned in #503, my team (specifically @ashensis) will provide a pull request (in a month or so) with a skeleton for a general formatting framework. It will include specific number formatters at least for Arabic script (based on the code published here), Hebrew script and some additional ones.
Numeric shaping is a display-only feature: it does not affect the buffer/memory, it only shapes the numerals based on the user setting. Numerals are still stored as European digits. Even when the user enters new data with the contextual option, all numerals are saved as European but displayed according to the adjoining characters.
Formatting is outside the scope of numeric shaping; it should be handled by different rules/settings. For example, a datetime stamp is displayed differently depending on many settings (the user can set the format directly from user preferences, as in Windows), or it can be affected by environment settings such as UI mirroring. The cases included as examples here need their own dedicated support to be handled correctly. When enabling Arabic support for a product, we care about these cases (handling datetime stamps, pure numeric data and so on). Numeric shaping is not the only feature we enable for a product; it is one of many features that are enabled together to provide Arabic support for a given product.
We are supporting Arabic script because we already have a business need for it; many products will consume this support once it is implemented. However, the support is not limited to Arabic script, as digit substitution is done based on CLDR data. Besides, our main target for Arabic script is the contextual option, which is not implemented in many technologies and frameworks. The other types of support (the National digits option and the Never or None option) can be easily implemented using many straightforward numeric APIs. What we always lack is support for the contextual option, which is the major expectation of Arabic users, as it requires special handling: checking the adjoining character to determine the shape of the digits.
Here is how the support can be extended for scripts other than Arabic:
nationalDigits = numberNumberingSystemDigitsMap( cldr ) || "0123456789";
Here is how the contextual support can be extended for scripts other than Arabic (if the contextual option is required for that script):
case "Contextual":
    if ( locale.indexOf( "ar" ) === 0 ) {
        return numberShapeContextualAr( value, nuDigitsMap, textDir === "rtl" ? 2 : 1 );
    } else if ( locale.indexOf( "he" ) === 0 ) {
        ....
    }
@AhmedMustafa We always use the Western Arabic digits (0, 1, 2, 3, 4, 5, 6, 7, 8 and 9), so we expect to see these digits in all your examples above instead of the Arabic-Indic digits.
/cc @jquery/globalize team for feedback
@Arkni
After reading the detailed introduction and problem statement, I feel I have a basic idea of the challenge of contextual numeric shaping. It seems reasonable to me to address this as part of Globalize. That said, I don't think this should be part of the number module. To prove that more locales (or numbering systems?) can be supported with this approach, it would be great to extend this PR with at least one more locale/numbering system. With two in place, it'll be much more obvious how to add additional ones.
I can't tell how that fits between the existing formatter methods and this PR. Could you start by filing an issue outlining what you're planning to implement? We wouldn't want you to spend time on implementing something that we couldn't eventually accept.
I suggest not mixing the two threads, since that creates huge confusion. This specific PR is dedicated to Arabic numeric shaping AND the ability to address it in the context of a String (general text). I totally agree with @jzaefferer that addressing this use case should not be part of the number module (which is concerned with formatting numeric values ONLY). #503 is dedicated to numeric formatting based on different numbering systems (Arabic, Thai, Telugu, Hebrew, Roman and many others). It is limited to formatting NUMBERS and thus, I believe, does belong to the number module. I suggest changing the title of #503 to reflect that. @ashensis will publish a skeleton of a framework that would allow generic number formatting based on any numbering system defined in CLDR. This is planned to be done via the #503 PR. As part of this work he will later on (after the initial skeleton is approved) integrate the code published in this PR to target Arabic numeric shaping specifically. Since, again, in #503 we are talking about numeric values ONLY, the contextual option of Arabic numeric shaping most probably won't be applicable there.
Did you mean #537? For clarity, there are two types of numbering systems: algorithmic and numeric (more info). Globalize does support number formatting based on different numeric numbering systems. For example:
Globalize('zh').formatNumber(Math.PI)
// > '3.142'
Globalize('zh-u-nu-native').formatNumber(Math.PI)
// > '三.一四二'
More examples can be seen here. It doesn't support the algorithmic ones (RBNF), for example: Hebrew, Roman, Cyrillic.
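To make the numeric vs. algorithmic distinction concrete, here is a minimal sketch. The `numberingSystems` object is an abridged, hand-written excerpt that assumes the shape of CLDR's supplemental numbering-systems data (`_type`, `_digits`, `_rules`), and `digitsFor` is a hypothetical helper, not a Globalize API.

```javascript
// Abridged excerpt; assumed to mirror the shape of CLDR supplemental
// numberingSystems data. Values are illustrative, not loaded from cldr-data.
const numberingSystems = {
  latn: { _digits: "0123456789", _type: "numeric" },
  arab: {
    _digits: "\u0660\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669",
    _type: "numeric"
  },
  hanidec: { _digits: "〇一二三四五六七八九", _type: "numeric" },
  hebr: { _rules: "hebrew", _type: "algorithmic" },
  roman: { _rules: "roman-upper", _type: "algorithmic" }
};

// Hypothetical helper: returns the ten digit glyphs for a numeric system,
// or null for an algorithmic (RBNF) one, where plain digit substitution
// cannot work.
function digitsFor(nu) {
  const entry = numberingSystems[nu];
  if (!entry) {
    throw new Error("Unknown numbering system: " + nu);
  }
  return entry._type === "numeric" ? entry._digits : null;
}
```

A numeric system can be handled by swapping each digit for the glyph at the same index; an algorithmic one such as `hebr` needs rule-based formatting instead, which is why `digitsFor` returns `null` for it.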
I agree about not including this as part of the number module. I liked the idea of extending this PR with at least one more locale and numbering system to prove (and make it more clear) that more locales can be supported with this approach.
Thanks. My intention was to say that we will leverage #503 (unless you believe it is better to open a new PR) to suggest how current support in Globalize (for number formatting based on different numeric numbering systems) can be extended to support algorithmic numbering systems such as Hebrew mentioned in #537.
Sounds good. Thanks.
@rxaviers digit-shaper is a standalone module now & I've provided more test cases for the Farsi locale. Please take a look.
@jzaefferer Please consider reviewing the latest commits.
Hi, @jzaefferer and I have both reviewed the PR. The implementation helped us understand the dependencies the module has on Globalize, and it turns out there are almost none. It only depends on the simple function that returns the numbering system of a locale (code), which means the module is loosely coupled with Globalize. This information could be retrieved by using cldrjs directly, which is the library Globalize itself uses to retrieve it. With that in mind, we suggest you create a standalone number shaping library that uses the cldrjs module as a dependency (along with cldr-data), and I can provide help here if you need it. You should be able to take the code from this PR, which is already complete, and provide the bits of infrastructure around it that you actually need (which is likely less than what Globalize needs), so I believe the investment you have made so far in implementing this feature can be fully utilized in a standalone library. You can then update that module independently of Globalize. It will also be easier to maintain, since you won’t have to support globalize-compiler (which doesn’t provide any benefit for the number shaping module). Having said that, if you want to make this a jQuery-hosted project as well, I believe @kborchers will be happy to assist you further.
Please just let us know if you have additional questions.
### Introduction to Arabic Numeric Shaping
Arabic and many other languages (such as Thai and Bengali) have classical digit shapes, called “National Digits”, that differ from the conventional Western (European) digits.
National digits have the same semantic meaning as the European digits, and the numbers they form are read from left to right (most significant digit on the left). The only difference is in the glyphs.
From the Arabic user's point of view, Arabic-Indic numerals are the basic numerals used in almost all kinds of documents, such as most government documents (IDs, birth certificates, driver's licenses, passports and household bills), bank statements, newspapers, calendars, road signs and menus.
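Because national and European digits differ only in glyphs, plain substitution is an index lookup. The following is a minimal sketch of that idea, not the code from this PR: `DIGITS`, `toNationalDigits` and `toEuropeanDigits` are hypothetical names, and the digit strings are built from the Unicode code points of each script's zero rather than read from CLDR.

```javascript
// Build the ten consecutive digit glyphs that start at a script's zero.
const digitsFrom = zero =>
  Array.from({ length: 10 }, (_, i) => String.fromCharCode(zero + i)).join("");

// Hypothetical lookup table keyed by CLDR-style numbering system names.
const DIGITS = {
  latn: "0123456789",
  arab: digitsFrom(0x0660), // Arabic-Indic digits
  thai: digitsFrom(0x0e50), // Thai digits
  beng: digitsFrom(0x09e6)  // Bengali digits
};

// Replace every European digit with the national glyph at the same index.
// Unknown numbering systems fall back to European digits.
function toNationalDigits(text, numberingSystem) {
  const digits = DIGITS[numberingSystem] || DIGITS.latn;
  return text.replace(/[0-9]/g, d => digits[Number(d)]);
}

// Inverse mapping: national glyphs back to European digits.
function toEuropeanDigits(text, numberingSystem) {
  const digits = DIGITS[numberingSystem] || DIGITS.latn;
  return Array.from(text, ch => {
    const i = digits.indexOf(ch);
    return i === -1 ? ch : String(i);
  }).join("");
}
```

The digit order is untouched in both directions, which matches the point above that the most significant digit stays on the left.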
### Options for Arabic Numeric Shaping
There are three options that should be taken into consideration when implementing national numeric shaping support in any framework/technology. These options are:
With the contextual option, when there are no preceding strong characters, the base text direction attribute determines the digit shaping (Arabic-Indic digits in an RTL context and European digits in an LTR context).
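The contextual rule can be sketched as a single pass over the text. This is an illustrative simplification, not the PR's `numberShapeContextualAr`: `shapeContextual` is a hypothetical name, and "strong character" is approximated here by basic Arabic letters and ASCII Latin letters instead of full Unicode bidi classes.

```javascript
// Arabic-Indic digits U+0660..U+0669, built from code points.
const ARABIC_INDIC = Array.from({ length: 10 }, (_, i) =>
  String.fromCharCode(0x0660 + i)
);

// Simplifying assumption: only these ranges count as strong characters.
const ARABIC_LETTER = /[\u0620-\u064A\u066E-\u06D3\u06FA-\u06FF]/;
const LATIN_LETTER = /[A-Za-z]/;

// Shape European digits according to the nearest preceding strong
// character; before any strong character, use the base direction.
function shapeContextual(text, baseDir) {
  let arabicContext = baseDir === "rtl";
  let out = "";
  for (const ch of text) {
    if (ARABIC_LETTER.test(ch)) {
      arabicContext = true;
    } else if (LATIN_LETTER.test(ch)) {
      arabicContext = false;
    }
    out += arabicContext && /[0-9]/.test(ch) ? ARABIC_INDIC[Number(ch)] : ch;
  }
  return out;
}
```

Under these assumptions, a digit-only string follows the base direction, while digits after a Latin word stay European even in an RTL context. The function returns a new string for display and leaves the stored text, which keeps European digits, untouched.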
### Problem Statement
Most of the available frameworks/technologies lack the contextual shaping option for national digits. Contextual digit shaping is very important because Arabic users do not expect to see only Arabic-Indic numerals or only European numerals when their data mixes English and Arabic.
For example, if a document has many paragraphs, some in Arabic and others in English, Arabic users expect to see national (Arabic-Indic) numerals in the Arabic paragraphs and European numerals in the English paragraphs.
Since mixed English and Arabic data is very common in the Arabic region, mixed numerals are very common too.
Arabic paragraphs that list reference names (which include numerals) are also very common for Arabic users. In that case, Arabic users expect to see the numerals as European, not Arabic-Indic.
So direct conversion of digits from Latin to national will not fulfil Arabic users’ needs.
Contextual behavior is the core numeric shaping option needed from the Arabic users’ point of view.