Skip to content

json_encode: Escape U+2028 and U+2029. #1701

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

kohler
Copy link
Contributor

@kohler kohler commented Dec 29, 2015

These characters are illegal in Javascript, so leaving them unescaped
is risky.

This is a minor feature addition/behavior change.

History: I was using json_encode to send JSON to the browser in a <script> section, using JSON_UNESCAPED_UNICODE to save space (“é” is 2 bytes natively, 6 bytes escaped). An unescaped U+2028 character in the encoded JSON caused browser syntax errors.

(More on this issue: https://web.archive.org/web/20150502034803/http://timelessrepo.com/json-isnt-a-javascript-subset )

Although the JSON spec does allow U+2028 and U+2029 to appear unescaped in strings, it would be more friendly to users to generate the subset of JSON that is also valid browser JS.

Cc: @bukka

@kohler kohler force-pushed the json-escaped-u2028 branch 2 times, most recently from 1be9f51 to 07b5cfe Compare December 29, 2015 01:34
@bukka
Copy link
Member

bukka commented Dec 29, 2015

I'd prefer to introduce a new flag called JSON_HEX_LINE_TERMINATOR in the same way as we have JSON_HEX_APOS and other hex flags.

We try keep the default exactly as it is defined in JSON spec which is the right thing IMO so it shouldn't be enabled by default.

@kohler
Copy link
Contributor Author

kohler commented Dec 29, 2015

But this doesn't change the default. It just changes the behavior of the non-default JSON_UNESCAPED_UNICODE flag. Do I misunderstand your comment? FWIW Rails behaves as I’m suggesting.

@bukka
Copy link
Member

bukka commented Dec 30, 2015

I don't think we should change the default. There is nothing wrong with not escaping U+2028 and U+2029 by the spec and it doesn't cause any problems in many cases. We don't do any escaping be default which is the fastest and correct solution IMO. Probably the only case that this can cause issue is printing json encoded string between <script> and </script> . However you should already use at least JSON_HEX_TAG to prevent XSS if you do this. It means that you need to add flags anyway. But that's fine because it's a special use case.

The JSON_HEX_LINE_TERMINATOR could be a good addition as it would allow escaping just U+2028 and U+2029 instead of all U+0080+ characters (when using JSON_UNESCAPED_UNICODE) which can reduce space and might be useful though.

I think that we could later introduce some grouped constant as requested in https://bugs.php.net/bug.php?id=65257 so you could just use one constant for all flags needed for secure and working printing between script tags.

I'm not really expert at Rails but from the quick look, the only thing I see is http://api.rubyonrails.org/classes/ERB/Util.html#method-c-json_escape which is actually just an escaper function written in ruby using gsub on string. The thing that Ruby Json actually has the same default as we have (not hex escaping anything). I just tried

require 'json'
my_hash = {:hello => "a\xE2\x80\xA8b"}
puts JSON.generate(my_hash)

and it didn't escaped anything so it actually confirms what I think the default should be...

I think that having just a flag for this use case is fine. The user space can always wrap it and use appropriate flags if needed.

@kohler
Copy link
Contributor Author

kohler commented Dec 30, 2015

No, no. PHP absolutely does escape by default. It escapes all chars >U+0080 by default. So “we don't do any escaping be default” is false. The docs say re: JSON_UNESCAPED_UNICODE “Encode multibyte Unicode characters literally (default is to escape as \uXXXX).” Note use of word “default”.

There is no problem with U+2028 by default because, like every other large Unicode character, U+2028 is escaped by default. My commit changes the behavior of JSON_UNESCAPED_UNICODE which is, again, not default.

Here is the “bug” fixed in Rails
rails/rails#10534
https://github.com/rails/rails/blob/4-2-stable/activesupport/lib/active_support/core_ext/string/output_safety.rb
(note the JSON_ESCAPE_REGEXP)

Here is the “bug” fixed in Django Rest framework
encode/django-rest-framework#2195

@bukka
Copy link
Member

bukka commented Dec 30, 2015

Ah you are right. Sorry I got a bit confused, it's actually UNescaped... :)

I would probably go with another flag anyway. So it could be used as JSON_UNESCAPED_UNICODE | JSON_HEX_LINE_TERMINATOR. That wouldn't change the behaviour of JSON_UNESCAPED_UNICODE. Otherwise it would be a small BC break JSON_UNESCAPED_UNICODE and also its name would still do what it says.

I will probably need to think about it a bit more as it's a bit late here...

if (us != 0x2028 && us != 0x2029 && (options & PHP_JSON_UNESCAPED_UNICODE)) {
pos -= 3;
us = (unsigned char)s[pos];
goto unescaped_char;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

think that something like this would be better:

smart_str_appendl(buf, &s[pos - 3], 3);
continue; /* alternatively add the rest to else block (instead of continue...) */

it's untested and it's quite late here but hope you get the idea... ;)

@nikic
Copy link
Member

nikic commented Dec 30, 2015

I'm in favor of changing JSON_UNESCAPE_UNICODE to keep Unicode newlines escaped. json_encode is already suffering from severe flagiritis and I really don't think we should be adding a new flag for every. single. thing. Unicode newlines in JSON are a well known problem and it would be nice if we could handle them reasonably without requiring the user to jump through additional hoops.

@marcioAlmada
Copy link
Contributor

I agree with @nikic. Recently I found myself using a lot of the JSON_* flags. All these options are becoming non straight forward to use on a single line of code to the point we prefer to create new constants to make it manageable.

const
    BLAH_PRETTY_PRINT =
        |   JSON_PRETTY_PRINT
        |   JSON_BIGINT_AS_STRING
        |   JSON_UNESCAPED_UNICODE
        |   JSON_UNESCAPED_SLASHES
;

U. new lines seems to be a natural fit for JSON_UNESCAPE_UNICODE.

@Majkl578
Copy link
Contributor

Strictly speaking, unicode newlines are still valid unicode chars accordidng to JSON specification, so it doesn't make sense to escape them in JSON_UNESCAPED_UNICODE mode, does it? Also it'd be a BC break.
Maybe a constant like JSON_JAVASCRIPT_COMPLIANT would work? (It could be a combination of other flags too.)

@kohler kohler force-pushed the json-escaped-u2028 branch from 07b5cfe to 3304254 Compare December 31, 2015 01:40
@kohler
Copy link
Contributor Author

kohler commented Dec 31, 2015

I agree pretty strongly with @nikic and @marcioAlmada. @Majkl578: BC complaints don't seem real; what code would realistically break?

Thanks @bukka, I changed the code as you suggested; the tests still pass.

@bukka
Copy link
Member

bukka commented Jan 1, 2016

I finally got a bit of time to properly think about it.

Changing JSON_UNESCAPED_UNICODE default would mean that you could never unescape U+2028 and U+2029 which is not wrong by the RFC 7159 and it doesn't have anything to do with this constant. There also can be theoretical concern that some parsers can't handle unicode escaping (yes I mean non-conformant parsers) so I disagree with changing JSON_UNESCAPED_UNICODE as it would allow escaping even if user doesn't want it.

I'd much prefer to have JSON_HEX_LINE_TERMINATOR. I'd also like to define a new constant that would group all needed flags for safe printing of json encoded string between script tag:

Something like

define('JSON_SCRIPT_SAFE', JSON_HEX_LINE_TERMINATOR | JSON_HEX_TAG | JSON_HEX_APOS | JSON_HEX_AMP |  JSON_HEX_QUOT)

but define in ext of course... The name is just an initial idea and can be changed of course. Think that it would simplify things for users a bit IMHO.

I will try to draft an RFC if I get some time to clarify it a bit...

@kohler
Copy link
Contributor Author

kohler commented Jan 2, 2016

Well, then, how about we treat U+2028 and U+2029 like we treat slash?

Like U+2028 and U+2029, the slash character is dangerous to include unescaped. (For slash, the reason is potential XSS.) So PHP escapes it by default but provides a JSON_UNESCAPED_SLASH flag.

Why not a JSON_UNESCAPED_LINE_TERMINATOR option?

@kohler
Copy link
Contributor Author

kohler commented Jan 2, 2016

I also want to note that all of the other JSON_HEX_ flags are NOT important for Javascript compatibility or XSS safety. json_encode(x) is perfectly safe to include in a <script> tag. (XSS is prevented by \/; line terminator damage is prevented because all high-Unicode chars are escaped.) json_encode(x, JSON_UNESCAPED_UNICODE) is not safe, but only because of U+2028/2029.

These characters are illegal in Javascript, so leaving them unescaped
is risky. The default encoder ($flags = 0) is fine, but the encoder
with JSON_UNESCAPED_UNICODE flag is not.

In case anyone wants the ability to leave these characters unescaped,
provide JSON_UNESCAPED_LINE_TERMINATORS.
@kohler kohler force-pushed the json-escaped-u2028 branch from 3304254 to b41c5b0 Compare January 2, 2016 02:45
@kohler
Copy link
Contributor Author

kohler commented Jan 2, 2016

In the latest version of the patch I added JSON_UNESCAPED_LINE_TERMINATORS.

@bukka
Copy link
Member

bukka commented Jan 2, 2016

I think that looks more reasonable than my idea.

It's still slightly strange that JSON_UNESCAPED_UNICODE won't unescape all unicode characters but I guess if we add an option JSON_UNESCAPED_LINE_TERMINATORS and document it properly, then it should be fine for 7.1

@kohler Are you able to email internals? It's a small change in the behaviour and a new flag so it will need to be announced there - the email should state what it is for and what the changes are. If there are no objections, then it can be merged. Otherwise it will need an RFC.

Thanks for the work on this!

@kohler kohler changed the title json_encode: Always escape U+2028 and U+2029. json_encode: Escape U+2028 and U+2029. Jan 5, 2016
@kohler
Copy link
Contributor Author

kohler commented Jan 6, 2016

Hi Jakub, so I did join internals and write a message. It got no pushback, which is good! What's the next step?

@jerrygrey
Copy link

Wouldn't JSON_ESCAPE_LINE_TERMINATORS be better? Since it won't change the meaning of JSON_UNESCAPED_UNICODE, which would be more like JSON_UNESCAPED_UNICODE_BUT_NOT_LINE_TERMINATORS with this change.

@kohler
Copy link
Contributor Author

kohler commented Jan 6, 2016

It is better for json_encode variants to be safe by default. The slash character must often be escaped to avoid XSS problems; JSON_UNESCAPED_SLASHES turns off this escaping. Similarly, U+2028 and U+2029 must often be escaped to avoid Javascript interpretation problems. A separate flag should be required to turn off this escaping.

I don't think there is any practical benefit to preserving the current unsafe behavior of JSON_UNESCAPED_UNICODE.

@jerrygrey
Copy link

@kohler You might not see a benefit, but others might. Plus changing the default of JSON_UNESCAPED_UNICODE would be a BC break and it will go against the official JSON standard. I agree that json_encode should be safe by default, but that is not what we are talking about, you want to change JSON_UNESCAPED_UNICODE and have it escape characters which goes against what the flag says it does.

@kohler
Copy link
Contributor Author

kohler commented Jan 6, 2016

@jerrygrey No, it doesn't “go against the official JSON standard.” The standard says that U+2028 may be escaped or not, and any proper JSON decoder must be able to parse the escaped version. The standard likewise says that slash may be escaped or not, and json_encode escapes slash by default.

Yes, this is a BC break, with advantages and disadvantages.

Advantages:

  1. It defuses a potential time bomb in users' existing code. “A…common usecase is to embed server generated globals [as JSON] in the page to avoid an extra server request…a line-separator character will break parsing, likely leaving you with a busted page. I’m pretty sure this is a time bomb waiting to affect many sites.” https://medium.com/joys-of-javascript/json-js-42a28471221d This exact time bomb went off in my code, inspiring this patch.
  2. It makes the simple flag combinations safe and thus makes json_encode easier to use.

Disadvantages:

  1. There may exist an improper JSON decoder that can parse the line separator character as Unicode but cannot parse \u2028. This seems extremely unlikely.
  2. There may exist tests in test suites somewhere that depend on json_encode(..., JSON_UNESCAPED_UNICODE) not escaping U+2028/U+2029. This also seems quite unlikely—those are not popular characters!—but it is possible. If any such tests exist, they would require updating.
  3. There may exist code somewhere that compares JSON-encoded strings byte-for-byte, for example to detect when an “object” is modified. Such code might detect a modification that's really just a change in the way U+2028/U+2029 are encoded.
  4. Other things I haven't thought of?

The patch was done this way because I believe the advantages outweigh the disadvantages. In particular it is generous to users to defuse time bombs. The disadvantages seem more theoretical than practical, which is why I said (maybe too strongly) there was no “practical benefit” to the current behavior.

@jerrygrey
Copy link

@kohler A BC break like this probably delay this until PHP 8, not ideal. As for the "time bomb", the core features should not change just because the programmer makes a mistake.

Also, I'm going to make this point again, when a programmer uses the flag JSON_UNESCAPED_UNICODE, e.g. json_encode($text, JSON_UNESCAPED_UNICODE), they want all the unicode characters to be unescaped, if they do that then there is a reason for it. There are many other uses for JSON, instead of just being handed to and parsed by JavaScript.

Like pointed out by @bukka, there may be some parsers can't handle unicode escaping, and to overcome that the JSON_UNESCAPED_UNICODE flag is used. But what you are asking here is, so you can save a few bytes with your code, people dealing with poor parsers has to deal with this BC break. Why can't you add another flag to your own code like JSON_ESCAPE_LINE_TERMINATORS and avoid this BC break?

I have no issue with escaping these line terminators in the default, e.g. json_encode($text), but when the JSON_UNESCAPED_UNICODE flag is used it should be respected, imo.

@nikic
Copy link
Member

nikic commented Jan 6, 2016

A BC break like this probably delay this until PHP 8, not ideal.

Not true. "BC break" is not binary, it's a scale. This change is very low on the BC scale. It's above the breakage introduced by fixing a segmentation fault, but it's far, far, far below the threshold that would require postponing it to a major version.

(Not commenting on any other parts of this argument, just want to point out that this sentence is BS. I hate this kind of knee-jerk reaction whenever someone says "BC break").

@kohler
Copy link
Contributor Author

kohler commented Jan 6, 2016

@bukka said he preferred JSON_UNESCAPED_LINE_TERMINATORS to JSON_HEX_LINE_TERMINATORS, which is what you're proposing; though JSON_HEX_LINE_TERMINATORS would clearly be better than doing nothing, or waiting until PHP 8!

That said, can anyone actually find a JSON decoder that can't handle Unicode escapes? That argument seems like a real straw man. Even JSON_UNESCAPED_UNICODE can contain Unicode escapes (for control characters):

json_encode("\x1b", JSON_UNESCAPED_UNICODE)
>>> "\u001B"

And there are far more people encoding JSON for browsers than people encoding JSON for crappy parsers that barf on \u2028. Probably infinitely more people :)

@jerrygrey
Copy link

@nikic Note my use of the word "probably". My point with that sentence is that if someone did have an issue with the change it could cause delays to it getting pulled. I've seem other good PR rejected for BC breaks much more minor than this, especially when there is an alternative way to go about it. Of course, I could be completely wrong and I probably am, after all I'm just playing devil's advocate.

@kohler Exactly. Also, there was a reason why the JSON_UNESCAPED_UNICODE flag was added in. I don't know why it was, but I'm sure it wasn't to save a few bytes. If the reason is gone now, than maybe it should be removed instead of adding yet another flag.

@bukka
Copy link
Member

bukka commented Jan 8, 2016

@jerrygrey I think that @kohler has got a point with the escaping of the control characters so that the naming / non-complaint parsers issue (JSON_UNESCAPED_UNICODE actually escapes unicode in some cases) is not already the case. As it was also said the number of users that could be hit by this is probably really small compare to ones that would profit from it. And if there are any, they will be able to add an extra flag. Also this is not something that would go to a bug fixing release - it would go to 7.1. Thinking about it more, we are not actually breaking BC in terms of JSON spec. It would still produce a valid JSON. If you strongly disagree, I'm happy to call for RFC but personally I think that it's ok without RFC.

About the reason for JSON_UNESCAPED_UNICODE, I think that it's definitely useful. It can significantly reduce size of the json as well as readability for anyone who want to see / debug output especially if that person comes from China, Japan or other country where all letters would be escaped otherwise. So I'd definitely stay away from removing this flag!

@kohler I'd like to wait at least a week to see if anyone has got any objections and then we will see if it can go in without RFC. Thanks again for you work!

@jerrygrey
Copy link

@bukka Good point. I don't have any objections 👍

@kohler
Copy link
Contributor Author

kohler commented Jan 16, 2016

Hello all! Thanks for the feedback. No hurry, but it has been a week with no further objections.

php-pulls pushed a commit that referenced this pull request Jan 22, 2016
@php-pulls
Copy link

Comment on behalf of bukka at php.net:

Merged via 104876d and NEWS + UPGRADING updated in 1091cec

@php-pulls php-pulls closed this Jan 22, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants