json_encode: Escape U+2028 and U+2029. #1701
1be9f51 to 07b5cfe
I'd prefer to introduce a new flag called We try to keep the default exactly as it is defined in the JSON spec, which is the right thing IMO, so it shouldn't be enabled by default. |
But this doesn't change the default. It just changes the behavior of the non-default |
I don't think we should change the default. There is nothing wrong with not escaping U+2028 and U+2029 by the spec and it doesn't cause any problems in many cases. We don't do any escaping by default, which is the fastest and correct solution IMO. Probably the only case where this can cause an issue is printing a JSON-encoded string between The I think that we could later introduce some grouped constant as requested in https://bugs.php.net/bug.php?id=65257 so you could just use one constant for all flags needed for secure and working printing between script tags. I'm not really an expert at Rails, but from a quick look, the only thing I see is http://api.rubyonrails.org/classes/ERB/Util.html#method-c-json_escape which is actually just an escaper function written in Ruby using gsub on a string. Ruby's JSON actually has the same default as we have (not hex-escaping anything). I just tried
require 'json'
my_hash = {:hello => "a\xE2\x80\xA8b"}
puts JSON.generate(my_hash)
and it didn't escape anything, so it actually confirms what I think the default should be... I think that having just a flag for this use case is fine. User space can always wrap it and use appropriate flags if needed. |
No, no. PHP absolutely does escape by default. It escapes all chars >U+0080 by default. So “we don't do any escaping be default” is false. The docs say re: There is no problem with U+2028 by default because, like every other large Unicode character, U+2028 is escaped by default. My commit changes the behavior of Here is the “bug” fixed in Rails Here is the “bug” fixed in Django Rest framework |
Ah you are right. Sorry I got a bit confused, it's actually UNescaped... :) I would probably go with another flag anyway. So it could be used as I will probably need to think about it a bit more as it's a bit late here... |
if (us != 0x2028 && us != 0x2029 && (options & PHP_JSON_UNESCAPED_UNICODE)) {
    pos -= 3;
    us = (unsigned char)s[pos];
    goto unescaped_char;
think that something like this would be better:
smart_str_appendl(buf, &s[pos - 3], 3);
continue; /* alternatively add the rest to else block (instead of continue...) */
it's untested and it's quite late here but hope you get the idea... ;)
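The idea under review here (keep non-ASCII output raw, but still escape the two line terminators) can be sketched in isolation. The following is a minimal standalone sketch, not the php-src code: `escape_line_terminators` and `example_usage` are hypothetical names, the input is assumed to be well-formed UTF-8, and none of the other escaping a JSON encoder needs is performed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of php-src: copy UTF-8 input to out,
 * replacing U+2028 / U+2029 with their JSON \uXXXX escapes and leaving
 * every other byte untouched. Returns the number of bytes written
 * (excluding the terminating NUL), or (size_t)-1 if out is too small. */
size_t escape_line_terminators(const char *in, size_t in_len,
                               char *out, size_t out_cap)
{
    size_t o = 0;

    if (out_cap == 0) {
        return (size_t)-1;
    }
    for (size_t i = 0; i < in_len; ) {
        const unsigned char *p = (const unsigned char *)in + i;

        /* In UTF-8, U+2028 is E2 80 A8 and U+2029 is E2 80 A9. */
        if (in_len - i >= 3 && p[0] == 0xE2 && p[1] == 0x80 &&
            (p[2] == 0xA8 || p[2] == 0xA9)) {
            if (out_cap - o < 7) {          /* 6 escape chars + NUL */
                return (size_t)-1;
            }
            memcpy(out + o, p[2] == 0xA8 ? "\\u2028" : "\\u2029", 6);
            o += 6;
            i += 3;
        } else {
            if (out_cap - o < 2) {          /* 1 byte + NUL */
                return (size_t)-1;
            }
            out[o++] = in[i++];
        }
    }
    out[o] = '\0';
    return o;
}

/* Usage: "a", U+2028, "b". The literal is split so that "b" is not
 * parsed as part of the preceding hex escape. */
void example_usage(void)
{
    char buf[32];
    const char in[] = "a\xE2\x80\xA8" "b";
    size_t n = escape_line_terminators(in, sizeof in - 1, buf, sizeof buf);

    assert(n == 8);
    assert(strcmp(buf, "a\\u2028b") == 0);
}
```

Copying the three raw bytes in the non-matching case is the same effect the `smart_str_appendl(buf, &s[pos - 3], 3)` suggestion aims for, without re-entering the per-byte path.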
I'm in favor of changing |
I agree with @nikic. Recently I found myself using a lot of the
const BLAH_PRETTY_PRINT =
    | JSON_PRETTY_PRINT
    | JSON_BIGINT_AS_STRING
    | JSON_UNESCAPED_UNICODE
    | JSON_UNESCAPED_SLASHES
;
U. new lines seem to be a natural fit for |
Strictly speaking, Unicode newlines are still valid Unicode characters according to the JSON specification, so it doesn't make sense to escape them in |
07b5cfe to 3304254
I agree pretty strongly with @nikic and @marcioAlmada. @Majkl578: BC complaints don't seem real; what code would realistically break? Thanks @bukka, I changed the code as you suggested; the tests still pass. |
I finally got a bit of time to properly think about it. Changing I'd much prefer to have Something like define('JSON_SCRIPT_SAFE', JSON_HEX_LINE_TERMINATOR | JSON_HEX_TAG | JSON_HEX_APOS | JSON_HEX_AMP | JSON_HEX_QUOT) but defined in the extension, of course... The name is just an initial idea and can be changed. I think it would simplify things for users a bit IMHO. I will try to draft an RFC if I get some time to clarify it a bit... |
Well, then, how about we treat U+2028 and U+2029 like we treat slash? Like U+2028 and U+2029, the slash character is dangerous to include unescaped. (For slash, the reason is potential XSS.) So PHP escapes it by default but provides a Why not a |
I also want to note that all of the other |
These characters are illegal in Javascript, so leaving them unescaped is risky. The default encoder ($flags = 0) is fine, but the encoder with JSON_UNESCAPED_UNICODE flag is not. In case anyone wants the ability to leave these characters unescaped, provide JSON_UNESCAPED_LINE_TERMINATORS.
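The flag semantics described in this commit message can be summarized as a small decision function. This is only a sketch under stated assumptions: the `SKETCH_*` names and their values are hypothetical stand-ins for `JSON_UNESCAPED_UNICODE` and the proposed `JSON_UNESCAPED_LINE_TERMINATORS`, the function name is invented, and only the decision to emit a `\uXXXX` escape for a non-ASCII code point is modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values for this sketch only; not the PHP constants. */
enum {
    SKETCH_UNESCAPED_UNICODE          = 1 << 0,
    SKETCH_UNESCAPED_LINE_TERMINATORS = 1 << 1
};

/* Decide whether a code point should be written as a \uXXXX escape.
 * ASCII characters that JSON always escapes (quotes, backslash,
 * control characters) are outside the scope of this sketch. */
bool sketch_needs_unicode_escape(uint32_t cp, int flags)
{
    if (cp < 0x80) {
        return false;                       /* ASCII is handled elsewhere */
    }
    if (!(flags & SKETCH_UNESCAPED_UNICODE)) {
        return true;                        /* default: escape all non-ASCII */
    }
    if (cp == 0x2028 || cp == 0x2029) {
        /* Line terminators stay escaped unless explicitly opted out. */
        return !(flags & SKETCH_UNESCAPED_LINE_TERMINATORS);
    }
    return false;                           /* other non-ASCII is emitted raw */
}
```

Under this model the line-terminator flag only has an effect in combination with the unescaped-Unicode flag, because the default encoder already escapes every non-ASCII character.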
3304254 to b41c5b0
In the latest version of the patch I added |
I think that looks more reasonable than my idea. It's still slightly strange that @kohler Are you able to email internals? It's a small change in the behaviour and a new flag so it will need to be announced there - the email should state what it is for and what the changes are. If there are no objections, then it can be merged. Otherwise it will need an RFC. Thanks for the work on this! |
Hi Jakub, so I did join internals and write a message. It got no pushback, which is good! What's the next step? |
Wouldn't |
It is better for I don't think there is any practical benefit to preserving the current unsafe behavior of |
@kohler You might not see a benefit, but others might. Plus changing the default of |
@jerrygrey No, it doesn't “go against the official JSON standard.” The standard says that U+2028 may be escaped or not, and any proper JSON decoder must be able to parse the escaped version. The standard likewise says that slash may be escaped or not, and Yes, this is a BC break, with advantages and disadvantages. Advantages:
Disadvantages:
The patch was done this way because I believe the advantages outweigh the disadvantages. In particular it is generous to users to defuse time bombs. The disadvantages seem more theoretical than practical, which is why I said (maybe too strongly) there was no “practical benefit” to the current behavior. |
@kohler A BC break like this would probably delay this until PHP 8, which is not ideal. As for the "time bomb", core features should not change just because the programmer makes a mistake. Also, I'm going to make this point again: when a programmer uses the flag Like pointed out by @bukka, there may be some parsers that can't handle Unicode escaping, and to overcome that the I have no issue with escaping these line terminators in the default, e.g. |
Not true. "BC break" is not binary, it's a scale. This change is very low on the BC scale. It's above the breakage introduced by fixing a segmentation fault, but it's far, far, far below the threshold that would require postponing it to a major version. (Not commenting on any other parts of this argument, just want to point out that this sentence is BS. I hate this kind of knee-jerk reaction whenever someone says "BC break"). |
@bukka said he preferred That said, can anyone actually find a JSON decoder that can't handle Unicode escapes? That argument seems like a real straw man. Even
And there are far more people encoding JSON for browsers than people encoding JSON for crappy parsers that barf on |
@nikic Note my use of the word "probably". My point with that sentence is that if someone did have an issue with the change, it could delay it getting pulled. I've seen other good PRs rejected for BC breaks much more minor than this, especially when there is an alternative way to go about it. Of course, I could be completely wrong and I probably am; after all, I'm just playing devil's advocate. @kohler Exactly. Also, there was a reason why the |
@jerrygrey I think that @kohler has got a point with the escaping of the control characters so that the naming / non-compliant parsers issue ( About the reason for @kohler I'd like to wait at least a week to see if anyone has got any objections and then we will see if it can go in without an RFC. Thanks again for your work! |
@bukka Good point. I don't have any objections 👍 |
Hello all! Thanks for the feedback. No hurry, but it has been a week with no further objections. |
These characters are illegal in Javascript, so leaving them unescaped
is risky.
This is a minor feature addition/behavior change.
History: I was using json_encode to send JSON to the browser in a <script> section, using JSON_UNESCAPED_UNICODE to save space (“é” is 2 bytes natively, 6 bytes escaped). An unescaped U+2028 character in the encoded JSON caused browser syntax errors. (More on this issue: https://web.archive.org/web/20150502034803/http://timelessrepo.com/json-isnt-a-javascript-subset )
Although the JSON spec does allow U+2028 and U+2029 to appear unescaped in strings, it would be more friendly to users to generate the subset of JSON that is also valid browser JS.
Cc: @bukka