Proposal: Use I-Regexp instead of ECMA-262 #1327

Closed
gregsdennis opened this issue Oct 18, 2022 · 31 comments
Comments

@gregsdennis
Member

gregsdennis commented Oct 18, 2022

The IETF JSON Path working group is putting together I-Regexp (authored by @cabo) intended as an interoperable subset of the various flavors of regular expressions. Think of it as a "capabilities intersection." The idea is that the spec defines only features which are known to be supported by the majority of existing regular expression libraries, which means that most libraries should be able to claim conformance without actually having to make any changes.

I would like to propose that we use that here instead of ECMA-262, which seems to have varying support across languages. It would likely reduce the set of guaranteed-supported expressions, but at least we would be able to legitimately claim some guaranteed support (which I don't think we can do right now).

This spec is still in draft phase and continues to evolve. I'm not too rushed on getting this in.

@jdesrosiers
Member

I like the idea. I'm assuming our first stable release will come before this is stable, so I'm wondering if we can adopt this later without breaking compatibility? We might need to consider some wording changes to make space for a future change like this, but I think we can make it work.

@handrews
Contributor

This sounded like a good idea until I saw this:

5.3. ECMAScript Regexps
Perform the following steps on an I-Regexp to obtain an ECMAScript regexp [ECMA-262]:

  • For any dots (.) outside character classes (first alternative of charClass production): replace dot by [^\n\r].
  • Envelope the result in ^ and $.

Note that where a regexp literal is required, the actual regexp needs to be enclosed in /.

Unless I'm missing something that means that I-Regexps are anchored at both ends by default, which would be a huge breaking change. That's a completely different mindset for thinking about regexes, and it's not how they work in Perl, Python, Ruby, or JavaScript.

Please correct me if I'm misunderstanding here, but if this is accurate I'm a no.
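
For illustration, the two translation steps quoted above can be sketched in Python. This is only a rough sketch: the function name `iregexp_to_ecma_style` is hypothetical, and it assumes that `[` and `]` only appear unescaped as character-class delimiters. It also wraps the result in `^(?:` and `)$` rather than a bare `^` and `$`, a choice made here so that a top-level alternation such as `foo|bar` stays inside the anchors.

```python
import re


def iregexp_to_ecma_style(pattern: str) -> str:
    """Return an explicitly anchored regexp for an implicitly anchored one."""
    out = []
    in_class = False  # True while scanning inside [...]
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Copy an escape such as \. or \] verbatim; it is never a bare dot.
            out.append(pattern[i:i + 2])
            i += 2
            continue
        if ch == "[":
            in_class = True
        elif ch == "]":
            in_class = False
        if ch == "." and not in_class:
            # A dot outside a character class becomes an explicit class.
            out.append("[^\\n\\r]")
        else:
            out.append(ch)
        i += 1
    # Make the implicit anchoring explicit.
    return "^(?:" + "".join(out) + ")$"


translated = iregexp_to_ecma_style("foo")
# With the anchors in place, "foobar" no longer matches "foo".
foobar_matches = re.search(translated, "foobar") is not None
```

Here `translated` is `^(?:foo)$`, which is exactly the behaviour change under discussion: `foobar_matches` is `False`.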

@gregsdennis
Member Author

I'll bring that up, @handrews, and report back. Good catch.

@handrews
Contributor

handrews commented Oct 19, 2022

[EDIT: Never mind this comment, I was reading things wrong]

@cabo

cabo commented Oct 19, 2022

iregexp is for matching regular expressions, which in practice always require anchors (except for rather unusual cases). Since there is no consensus on anchors, the best approach is to leave out this redundant noise -- note that this is not an innovation as XSD regexps have always done this.
Similarly, "." is a meta character with varying interpretation in different dialects, so translating this to a specific dialect may need interpolation.
None of this should be a big surprise to anyone who actually has tried to obtain fully interoperable regexps.
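
To make the point about "." concrete, here is a small sketch using Python's `re` module as one example dialect. Python's "." (without `re.DOTALL`) excludes only `\n`, whereas ECMAScript's "." (without the `s` flag) excludes `\n`, `\r`, U+2028 and U+2029. The explicit character class below is an assumption standing in for the ECMAScript behaviour, not an ECMAScript engine.

```python
import re

LINE_ENDS = ["\n", "\r", "\u2028", "\u2029"]

# Python's "." without re.DOTALL: everything except "\n".
python_dot = re.compile(r"a.b")

# Stand-in for ECMAScript's "." without the "s" flag:
# everything except the four line terminators.
ecma_like_dot = re.compile("a[^\n\r\u2028\u2029]b")

# Which line terminators does each "dot" accept between "a" and "b"?
python_hits = [c for c in LINE_ENDS if python_dot.search(f"a{c}b")]
ecma_like_hits = [c for c in LINE_ENDS if ecma_like_dot.search(f"a{c}b")]
```

`python_hits` contains `\r`, U+2028 and U+2029, while `ecma_like_hits` is empty, so the same pattern text accepts different inputs depending on the dialect.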

@handrews
Contributor

@cabo the issue at hand is not one of requiring the most interoperable regexes. Schema authors can decide their own tradeoffs for that. The issue is whether to break a feature that has had the same behavior (in terms of anchoring and using ECMA as a reference) since the very beginning of JSON Schema.

@cabo

cabo commented Oct 19, 2022

Sure. I don't have an opinion on that.

Of course, you can standardize on bracketing an iregexp with ^$ for backwards compatibility (I don't think anyone will notice the fact that "." does not include LS/PS in ECMAscript but does in iregexp). You can also use a different JSON member name to introduce iregexps, but that doesn't help you with the old JSON member name.

@handrews
Contributor

(I don't think anyone will notice the fact that "." does not include LS/PS in ECMAscript but does in iregexp)

Oh yeah, I agree, that's why I edited out the comment about "." - I had just read that part wrong. I-Regexp doesn't do what I was worried about with that comment anyway.

But right now {"pattern": "foo"} considers "foobar" to be a valid instance, and if it suddenly considers it invalid because "foo" as a pattern now behaves like "^foo$"... that's going to be a very big surprise to people. And considering that we're trying to stabilize behavior in JSON Schema wherever possible these days, I'm skeptical that it is worth doing.

However, an "iregexp" keyword (or "ipattern" or whatever) might be a good keyword proposal.

While there are always a few people who think regexes should be anchored by default (presumably coming from the XSD world), I'd say the vast majority of people who use pattern understand how it works and would be surprised to migrate to the stable version of JSON Schema and have their pattern-using schemas break.
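
A minimal sketch of the difference, using Python's `re` module as a stand-in for a validator's regex engine. The helper names `unanchored` and `implicitly_anchored` are made up for this example: the first models how `pattern` behaves today, the second models what an implicitly anchored dialect would do.

```python
import re


def unanchored(pattern: str, instance: str) -> bool:
    # Current `pattern` semantics: the regex may match anywhere in the string.
    return re.search(pattern, instance) is not None


def implicitly_anchored(pattern: str, instance: str) -> bool:
    # Implicit anchoring: the regex must cover the whole string.
    return re.fullmatch(pattern, instance) is not None


today = unanchored("foo", "foobar")               # valid today
after_switch = implicitly_anchored("foo", "foobar")  # would become invalid
```

A schema author who wants whole-string matching today writes `^foo$` explicitly, and `unanchored("^foo$", "foobar")` is then `False` as well.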

@tviegut

tviegut commented Oct 19, 2022

Hi @handrews, in the IEC standards that are pending publication I know that REGEX patterns are used extensively for dates (i.e. date, datetime, duration, month, etc.) to ensure that the string representations conform to their respective ISO 8601 equivalents. This is because dates are not native primitives in JSON. I recall extensive testing with the French N.C., and I will need to double-check where that landed. Sorry, I just bumped into this thread now. ~Todd

@gregsdennis
Member Author

@tviegut this isn't about regular expressions in general. We want to use and currently are using regex. That's not in question.

This issue is about which flavor of regex we want to support in the spec. Currently, we have ECMA-262, but support for that is inconsistent across languages/platforms. We need something that has guaranteed interoperability.

@handrews
Contributor

handrews commented Oct 19, 2022

@gregsdennis I was reading @tviegut's comments as indicating what sort of things might break / need to be updated if we made this change.

@tviegut we do often recommend using regexps for date verification since pattern is more reliably implemented than format (as long as your regexps don't do anything too advanced). So that use case makes sense to me.

@gregsdennis I'm curious about what environments can't (as opposed to just currently don't) support ECMA-262. In particular the anchor thing, since we could easily say "if you anchor your regexes and don't use . you will stay within more broadly interoperable functionality." We've never tried to restrict how much regex functionality implementations offer. We just define what subset is reliable. So we could define the I-Regexp-compatible subset of ECMA 262 instead of just saying ECMA 262 in general. But as a "here's what you can use if you're worried about interoperability", not as a "this is the only thing you can use."
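
As a sketch of what "anchor your regexes and don't use ." could look like for the date use case, here is a hypothetical schema fragment checked with Python's `re.search`. The names `DATE_SCHEMA` and `matches_pattern` are made up for this example, and the regex itself is only an illustration, not one taken from any published standard. It uses `[0-9]` instead of `\d` because `\d` matches non-ASCII digits in some flavors (Python's `str` patterns, for instance) but not in ECMAScript.

```python
import re

# Explicit ^ and $, no ".", only literal characters, ranges and counts.
DATE_SCHEMA = {
    "type": "string",
    "pattern": "^[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$",
}


def matches_pattern(schema: dict, instance: str) -> bool:
    # Unanchored search, as `pattern` is defined; the anchors are in the regex.
    return re.search(schema["pattern"], instance) is not None


ok = matches_pattern(DATE_SCHEMA, "2022-10-19")
bad_month = matches_pattern(DATE_SCHEMA, "2022-13-01")
with_prefix = matches_pattern(DATE_SCHEMA, "on 2022-10-19")
```

This checks the shape of the string only: "2022-02-31" still passes. Also note that Python's `$` matches just before a trailing newline, which ECMAScript's does not, so even a pattern this simple is not perfectly identical across engines.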

@gregsdennis
Member Author

My suggestion isn't based on personal experience, but more on recollection of complaints from others regarding ECMA-262 support in their language of choice. It's not always 100%.

.Net, for example, has a 262-compliant mode, but it doesn't support some cases (that I can't specifically recall).

@ucaiug-admin

ucaiug-admin commented Oct 19, 2022

@tviegut we do often recommend using regexps for date verification since pattern is more reliably implemented than format (as long as your regexps don't do anything too advanced). So that use case makes sense to me.

@handrews yes that's why pattern was chosen for the IEC standards due to the unreliability of format. So this would impact what we have out in our reference implementations in E.U. messages, etc. Anyway, sorry to "rock the boat" but maybe this helps us navigate things a bit. Thoughts?

Now, I did some further digging since I posted earlier. In reviewing the background, it turns out we had reps from both the U.S. and French teams thoroughly vet and test the REGEXes that were going to be published as part of the standard. I see that the following was contributed to the draft:

[Two images attached; not reproduced here]

@gregsdennis
Member Author

gregsdennis commented Oct 19, 2022

@tviegut / @admin-cimug I don't want this getting off-topic. The point of this issue isn't "are regexes useful?" The point is determining the best regex specification. To that end, I don't think your comments add to that discussion.

If your concern is follow-on specifications which repeat what JSON Schema states, I can't advise on a process to address that. JSON Schema continues to evolve, and as such, references like this will need to update.

@tviegut

tviegut commented Oct 19, 2022

@gregsdennis: No, the context of my comments wasn't whether they're useful. That's self-evident :). Rather, the context was in response to @handrews' statement to you:

@gregsdennis I was reading @tviegut's comments as indicating what sort of things might break / need to be updated if we made this change.

(also I apologize as @tviegut is my personal account from my GitHub app and @admin-cimug an SDO related account)

@gregsdennis
Member Author

@tviegut thanks for the clarification.

The idea behind this is that i-regexp is supposed to be largely compatible with existing libraries, so things shouldn't break in practice. Basically, we'd be reducing the set of guaranteed expressions that are supported, but we're not restricting libraries from supporting additional expressions. We'd just be saying that those additional expressions wouldn't be guaranteed to be interoperable.

I hope that makes sense.

@handrews
Contributor

The idea behind this is that i-regexp is supposed to be largely compatible with existing libraries

Right. But if it's in truth only directly compatible with XSD due to the implicit anchoring and incompatible with ECMA (and Perl, Python, Ruby, etc.), which is the ecosystem to which JavaScript and therefore JSON and therefore JSON Schema belong, it's not going to be of use to us. Not in the standard pattern or patternProperties keywords, but perhaps as extensions.

@m-mohr

m-mohr commented Oct 25, 2022

Just a quick thought here: When reading iregexp (or especially ipattern) above, my first reaction was that this might be understood as case-insensitive variant of pattern (i.e. like the i flag for a regexp). So I'd be careful with the name.

@handrews
Contributor

@m-mohr good point, naming is hard! If we do add new keywords, they will get their own issues for discussion first, so we don't need to sort that out here. But I'm glad you brought it up!

@gregsdennis
Member Author

Given the resistance I've received when I asked about i-regexp being based on XSD (and thus the implicit anchors), I'm no longer sure this is a good fit for us. I've made the argument that explicit anchors are more common and better known by developers, and they're not listening. I'm happy to close this if others are.

Thanks for looking into it.

@handrews
Contributor

Yeah I've weighed in over there but they seem dead-set on ignoring the ecosystem that they're allegedly targeting so I'll probably give up pretty soon. It's baffling. I can't possibly see a justification to make an unintuitive breaking change to JSON Schema regexes that runs counter to the vast majority of JSON/JavaScript/ECMA technologies and the programming languages that most often parse them, particularly not as we're trying to emphasize stability.

@cabo

cabo commented Oct 26, 2022

Henry,

iregexp was not designed for json-schema.org.

For json-schema.org, the question of what kinds of regexps you want to use is pretty much moot, as that ship has sailed. I don't understand why this needs to be discussed. iregexp won't "just drop in" in json-schema.org's pattern keyword.

The contribution that iregexp can make here is that it provides a well-defined subset (intersection) that actually is widely interoperable, well beyond the JavaScript ecosystem. This subset needs a bit of translation to work with sliding regexps and explicit anchors, but it is still a useful subset. I would be way more interested in whether that subset hits your requirements than in the current discussion.

@gregsdennis
Member Author

gregsdennis commented Oct 26, 2022

@cabo

iregexp was not designed for json-schema.org.

No one is making this claim. What's baffling is that iregexp is being created as an interoperable standard with the intention of being usable by other specifications, yet it's ignoring the ecosystem it claims to target.

The contribution that iregexp can make here is that it provides a well-defined subset (intersection) that actually is widely interoperable, well beyond the JavaScript ecosystem.

But it's being developed specifically for JSON Path, an obvious JSON (thus JavaScript ecosystem) technology.

If it's not a good fit for JSON Schema, I argue that it's not a good fit for JSON Path for the same reasons.

@cabo

cabo commented Oct 26, 2022

it's being developed specifically for JSON Path

Very much not so.

The main argument for iregexp is that we do not need a new regex dialect per application environment.

iregexp is a good fit for JSONPath, which is why we are completing the work in the JSONPath WG. But the intention is for this spec to have wider application.

I understand that the json-schema.org people have adopted the ECMAScript dialect long ago, so going for a more general approach may seem unnatural here. I'm sorry, but that doesn't have a bearing on whether iregexp is a good fit for JSONPath.

@gregsdennis
Member Author

iregexp is a good fit for JSONPath

I'm sorry, but that doesn't have a bearing on whether iregexp is a good fit for JSONPath.

We'll continue this argument in other channels.

@gregsdennis
Member Author

I have been convinced that iregexp is not a good fit for JSON Schema.

@ssbarnea

ssbarnea commented Dec 3, 2022

Very interesting subject, maybe someone can tell me which is the current way to include multiline matching in a string pattern as the page from http://json-schema.org/understanding-json-schema/reference/regular_expressions.html does mention that . does not match newlines and the single example included uses a regex part that does not allow any modifiers because it does not include the slashes.

Does this mean that the specified pattern is impossible to match multiline strings?

@cabo

cabo commented Dec 3, 2022

Very interesting subject, maybe someone can tell me which is the current way to include multiline matching in a string pattern as the page from http://json-schema.org/understanding-json-schema/reference/regular_expressions.html does mention that . does not match newlines

(in ECMAscript, that is the "dotAll" feature, triggered with the "s" flag.
This is not the "multiline" feature, which changes the semantics of anchors and is triggered with the "m" flag.)

and the single example included uses a regex part that does not allow any modifiers because it does not include the slashes.

Does this mean that the specified pattern is impossible to match multiline strings?

What do you mean by "the specified pattern"? The example on the page you reference deliberately does not match newlines.

The ECMAscript flavor has indeed been designed under the assumption that flags can be specified with the regexp.
Other flavors allow setting/unsetting flags with (?flag) and (?-flag), but not ECMAscript, which only supports separate flags (e.g., after the /../ part of the regexp literal).
So in json-schema.org you are stuck with dotAll and multiline off. You can emulate these (disjunctions with negative character classes for dotAll, lookahead/lookbehind for multiline), but that is tedious and error prone.
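
As a rough sketch of the lookahead/lookbehind emulation for multiline anchors, here is one possible form, tested with Python's `re` module. The names `LINE_START`, `LINE_END` and `line_equals` are made up for this example, and it is simplified to treat only `\n` as a line break (ECMAScript's multiline mode also recognizes `\r`, U+2028 and U+2029).

```python
import re

# "Not preceded by a non-newline character": start of input or just after "\n".
LINE_START = r"(?<![^\n])"
# "Not followed by a non-newline character": end of input or just before "\n".
LINE_END = r"(?![^\n])"


def line_equals(word: str) -> re.Pattern:
    # Matches when some whole line of the input is exactly `word`.
    return re.compile(LINE_START + re.escape(word) + LINE_END)


text = "alpha\nbeta\ngamma"

# Without the multiline flag, ^ and $ only see the ends of the whole string.
plain_anchors = re.search(r"^beta$", text)
emulated = line_equals("beta").search(text)
```

`plain_anchors` is `None` while `emulated` finds the middle line, which shows both that the emulation works and how much noisier it is than a flag.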

@ssbarnea

ssbarnea commented Dec 3, 2022

\s should be a workaround, but at least in python-jsonschema it does not work.

@cabo

cabo commented Dec 3, 2022

The classical way to emulate dotAll is a character class that combines positive and negative escapes, e.g., [\s\S].
You say that doesn't work in python-jsonschema?
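
For reference, a minimal check of that idiom with Python's `re` module directly. The assumption here is that a Python-based validator hands the `pattern` string to `re.search` unchanged; the sample text is made up.

```python
import re

text = "first line\nsecond line"

# "." stops at the newline, so this cannot span both lines.
plain_dot = re.search(r"first.*second", text)

# [\s\S] is "whitespace or non-whitespace", i.e. any character at all.
any_char = re.search(r"first[\s\S]*second", text)

# \s on its own only covers whitespace, so " line" in between blocks it.
whitespace_only = re.search(r"first\s*second", text)
```

Only `any_char` matches. A bare `\s` fails whenever non-whitespace characters sit between the two parts, which may be what was observed.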

@ssbarnea

ssbarnea commented Dec 3, 2022
