Add pattern keyword tests for non-BMP Unicode code points #264

awwright · 2019-05-23T00:15:41Z

I got a question about support for a "pattern" expression that matches only printable Unicode characters. It appears many implementations do not have support for multibyte characters in regular expressions, especially in ECMAScript. This means that non-BMP characters (which must be represented with surrogate pairs in UTF-16) don't work as expected:

> new RegExp('^🐲*$').test('')
false
> new RegExp('^🐲*$').test('🐲')
true
> new RegExp('^🐲*$').test('🐲🐲')
false

This regular expression is only matching the second character of the surrogate pair. Even though JSON Schema suggests limiting patterns to ECMAScript-compatible regular expressions, this is not intended to constrain the encoding of the string; JSON only decodes to a string of Unicode code points. ECMAScript implementations that want to match Unicode characters correctly will need to match the two-byte sequence like so:

> new RegExp('^(🐲)*$').test('🐲🐲')
true

Or use the u flag available in newer implementations of ECMAScript:

> new RegExp('^🐲*$', 'u').test('🐲🐲')
true

The "pattern" and "patternProperties" tests should test for this behavior.

The text was updated successfully, but these errors were encountered:

Julian · 2019-05-23T12:55:00Z

Sounds reasonable to me

…

On Wed, May 22, 2019, 20:15 Austin Wright ***@***.***> wrote: I got a question about support for a "pattern" expression that matches only printable Unicode characters. It appears many implementations do not have support for multibyte characters in regular expressions, especially in ECMAScript. This means that non-MBP characters (which must be represented with surrogate pairs in UTF-16) don't work as expected: > new RegExp('^🐲*$').test('') false > new RegExp('^🐲*$').test('🐲') true > new RegExp('^🐲*$').test('🐲🐲') false This regular expression is only matching the second character of the regular expression. Even though JSON Schema suggests ECMAScript-compatible regular expressions, this is not intended to constrain the type of string supplied to the regexp test; JSON only decodes to a string of Unicode code points. ECMAScript implementations that want to match Unicode characters correctly will need to match the two-byte sequence like so: > new RegExp('^(🐲)*$').test('🐲🐲') true Or use the u flag available in newer implementations of ECMAScript: > new RegExp('^🐲*$', 'u').test('🐲🐲') true The "pattern" and "patternProperties" tests should test for this behavior. — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#264?email_source=notifications&email_token=AACQQXRZD2U42WC743NDHG3PWXO23A5CNFSM4HOY5IU2YY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4GVKWWVA>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AACQQXWSTPBCS2N3XM5B2ODPWXO23ANCNFSM4HOY5IUQ> .

Julian added the missing test A request to add a test to the suite that is currently not covered elsewhere. label Sep 27, 2019

ssilverman mentioned this issue Apr 23, 2020

[264] Add pattern tests for non-BMP code points #339

Merged

ssilverman closed this as completed in #339 Apr 25, 2020

awwright mentioned this issue Jul 21, 2020

Clarify which regular expression production rule from the ECMA specification is intended json-schema-org/json-schema-spec#821

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add pattern keyword tests for non-BMP Unicode code points #264

Add pattern keyword tests for non-BMP Unicode code points #264

awwright commented May 23, 2019 •

edited

Loading

Julian commented May 23, 2019 via email

Add pattern keyword tests for non-BMP Unicode code points #264

Add pattern keyword tests for non-BMP Unicode code points #264

Comments

awwright commented May 23, 2019 • edited Loading

Julian commented May 23, 2019 via email

awwright commented May 23, 2019 •

edited

Loading