Pattern match does not seem to work with newlines, possible unescaping issue #1023

Closed
ssbarnea opened this issue Dec 3, 2022 · 6 comments

Comments

@ssbarnea
Contributor

ssbarnea commented Dec 3, 2022

Apparently the regex pattern ^{{[.\s]*}}$ is not applied correctly by the library. Inside JSON this pattern should be written as "pattern": "^{{[.\\s]*}}$" because of JSON escaping.

Still, when using it I get a validation error that prints the pattern using the unescaped value, which makes me believe that the string was not loaded correctly.

"{{ foo\n bar }}" does not match '^{{[.\\s]*}}$'

Python does not add extra escapes for \s, so it should be a loading issue?

$ python -c "print('a\sb')"
a\sb
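
A minimal sketch, using only Python's standard `re` module, of one likely reason the original pattern fails regardless of escaping: inside a character class the dot is a literal period, so `[.\s]*` accepts only periods and whitespace. The variable names here are illustrative and not taken from the thread.

```python
import re

value = "{{ foo\n bar }}"

# Inside [...] the dot is a literal ".", so this class only matches
# periods and whitespace; the letters in "foo" already break the match.
dot_in_class = re.compile(r"^{{[.\s]*}}$")
print(dot_in_class.search(value))  # None

# A string made only of periods and whitespace does match.
print(dot_in_class.search("{{ .\n . }}") is not None)

# [\s\S] means "whitespace or non-whitespace", i.e. any character.
any_char = re.compile(r"^{{[\s\S]*}}$")
print(any_char.search(value) is not None)
```
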
@ssbarnea
Contributor Author

ssbarnea commented Dec 3, 2022

I was able to use the workaround "(?s)^{{.*}}$", which does not need \s, but I think the bug is still present.
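
For reference, a short sketch of how the inline `(?s)` flag behaves in Python's `re` module, which is the engine this library uses according to the maintainer later in the thread. The sample string is the one from the first comment.

```python
import re

value = "{{ foo\n bar }}"

# Without a flag, "." does not match a newline, so the match fails.
plain = re.search(r"^{{.*}}$", value)
print(plain)  # None

# The inline (?s) flag (DOTALL) lets "." match newlines as well.
dotall = re.search(r"(?s)^{{.*}}$", value)
print(dotall is not None)
```

Inline flags are specific to the regex flavor, so this shows only the Python side of the comparison.
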

@ssbarnea
Contributor Author

ssbarnea commented Dec 3, 2022

In fact the workaround is not really applicable, because AJV, another major JSON Schema validator, not only does not support it but also closed the bug report as wontfix at ajv-validator/ajv#101

So as a JSON schema author, I am stuck between two broken libraries and cannot write a pattern validator that would work with both. 🤷🏽‍♂️

@ssbarnea
Contributor Author

ssbarnea commented Dec 3, 2022

After reading ECMA-262 and also testing with https://regex101.com/ I conclude that in the JavaScript world the use of (?s) is not allowed, as modifiers can only be specified outside the pattern.

This means that the only possible way to make a match using the current JSON Schema specification is to correctly implement \s. Some argue that JSON would also allow % escapes like HTML, but this is recommended against, as many tools are known to forget to unescape them.

ssbarnea changed the title from "Pattern match does not seem to work with newlins, possible unescaping issue" to "Pattern match does not seem to work with newlines, possible unescaping issue" on Dec 3, 2022
@Julian
Member

Julian commented Dec 3, 2022

> Python does not add extra escapes for \s, so it should be a loading issue?

The library doesn't do any loading, so whatever string you loaded is the one being used.

But you're confusing str with repr, which is what's shown in error messages:

⊙  python -c "print(repr('a\sb'))"
'a\\sb'
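
A small sketch of the str/repr distinction described above, applied to the pattern from the original report. It assumes nothing beyond the Python builtins.

```python
# The regex exactly as the engine receives it (raw string, one backslash).
pattern = r"^{{[.\s]*}}$"

print(str(pattern))   # ^{{[.\s]*}}$
print(repr(pattern))  # '^{{[.\\s]*}}$'

# The string itself holds a single backslash; repr() merely displays it
# doubled so that the output is a valid Python string literal.
print(pattern.count("\\"))        # 1
print(repr(pattern).count("\\"))  # 2
```
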

@ssbarnea
Contributor Author

ssbarnea commented Dec 3, 2022

@Julian I wonder what I need to write inside the JSON schema file in order to match a multiline string that can start with something and end with something else.

I currently have "pattern": "^\\{\\{.*\\}\\}$", which only matches single-line strings, and I want to fix it to allow multi-line strings.

Since the spec failed to specify flags/modifiers and relies on ECMA-262, which does not support embedded modifiers, we are forced to do without them.

As others noted a common workaround to make .* match multi-line when you cannot use flags is to make it [\s\S]*.

That means that a solution without modifiers/flags should be the regex ^{{[\s\S]*}}$, which, tested with regex101, appears to be valid not only for Python but also for ECMAScript/JavaScript.

Now the challenge is how to correctly encode the above regex for JSON. Online encoders report "^{{[\\s\\S]*}}$" and I agree with them, as the only character that needs escaping is the backslash. Still, doing this produces errors not only from check-jsonschema but also from ajv.

ajv:
    "message": "must match pattern \"^\\{\\{\\[\\\\S\\\\s\\]*\\}\\}$\"",
    "params": {
      "pattern": "^\\{\\{\\[\\\\S\\\\s\\]*\\}\\}$"
    },

check-jsonschema:
          "message": "'{{ should_ignore_errors }}' does not match '^\\\\{\\\\{\\\\[\\\\\\\\S\\\\\\\\s\\\\]*\\\\}\\\\}$'"

Update a few hours later...

After digging a little bit inside content from schemastore, I was able to find one example of regex that was supposed to match a multiline string, one that finally worked: "^\\{\\{(.|[\r\n])*\\}\\}$".

To be honest I do not know why this syntax was used because [.\n\r]* should have also worked and be shorter.
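
A hedged sketch comparing the two forms with Python's `json` and `re` modules. The schema fragments are written as raw strings so that they hold exactly what would sit in a .json file; the variable names are illustrative. The likely explanation for the longer form is that a dot inside a character class is a literal period, so `[.\n\r]*` cannot match letters.

```python
import json
import re

value = "{{ foo\n bar }}"

# Pattern found in schemastore, as it would appear in the JSON file.
working_json = r'{"pattern": "^\\{\\{(.|[\r\n])*\\}\\}$"}'
working = json.loads(working_json)["pattern"]
# After decoding, \\{ becomes \{ and \r, \n become real CR/LF characters.
print(re.search(working, value) is not None)

# The shorter alternative: inside [...] the dot is a literal ".".
shorter_json = r'{"pattern": "^\\{\\{[.\n\r]*\\}\\}$"}'
shorter = json.loads(shorter_json)["pattern"]
print(re.search(shorter, value))  # None
```
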

@Julian
Member

Julian commented Dec 4, 2022

This doesn't sound much like a JSON Schema question. The only JSON Schema relevant piece is that the specification doesn't specify what regex flavor implementations MUST support (it only recommends ECMA 262, and specifically says that you, the schema author, should stick to syntax common across engines). Indeed this library uses Python regexes, since that's the only flavor really accessible to Python.

> So as a JSON schema author, I am stuck between two broken libraries, cannot write pattern validator that would work with both. 🤷🏽‍♂️

As usual, the way you file issues leaves a lot to be desired. There's nothing broken about the library for this particular case.

> Now the challenge is how to correctly encode the above regex for JSON.

This of course has nothing to do with JSON Schema nor this library, but here's how you answer your own question.

First check that re.search does what you want, since we're using Python regexes.

I assume that you're checking against:

>>> re.search(r"^{{[\s\S]*}}$", "{{foo}}")
<re.Match object; span=(0, 7), match='{{foo}}'>

>>> re.search(r"^{{[\s\S]*}}$", "{{foo\nbar\nbaz}}")
<re.Match object; span=(0, 15), match='{{foo\nbar\nbaz}}'>

and that indeed that means you're getting the behavior you want.

The way to find out how a string is represented in JSON is to dump it.

Specifically:

>>> import json
>>> json.dumps(r"^{{[\s\S]*}}$")
'"^{{[\\\\s\\\\S]*}}$"'

That string is the representation of that pattern in JSON. The escaping will of course need adjusting if you, say, paste it as is (since you're looking at a repr again, not a str), so the easiest way, if you're unsure, is to json.dumps your entire schema and then write that to a file.
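
A minimal sketch of that last suggestion: build the schema as a Python dict, let `json.dumps` handle the escaping, and write the result to a file. The schema contents and the file name `schema.json` are hypothetical, chosen only for illustration.

```python
import json
import re
import tempfile
from pathlib import Path

# Write the regex once, as a raw string, exactly as the engine should see it.
schema = {"type": "string", "pattern": r"^{{[\s\S]*}}$"}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "schema.json"  # hypothetical file name
    path.write_text(json.dumps(schema, indent=2), encoding="utf-8")
    text = path.read_text(encoding="utf-8")

# The file shows the pattern with JSON escaping applied by json.dumps.
print(text)

# Loading the file gives back the original pattern unchanged.
loaded = json.loads(text)
print(loaded["pattern"] == schema["pattern"])
print(re.search(loaded["pattern"], "{{foo\nbar\nbaz}}") is not None)
```

Printing `text` rather than its repr shows the characters that actually belong in the file, which avoids the str/repr confusion discussed earlier in the thread.
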

Julian closed this as completed on Dec 4, 2022
2 participants