-
Notifications
You must be signed in to change notification settings - Fork 7.9k
Fix GH-10648: add check function pointer into mbfl_encoding #10828
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
ext/mbstring/tests/gh10648.phpt
Outdated
// Testing for trailing escape sequence | ||
var_dump(mb_check_encoding(hex2bin($jis_bytes), 'JIS')); | ||
var_dump(mb_check_encoding(hex2bin($jis_bytes_without_esc), 'JIS')); // false | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Included of last SI(0x0f
) to var_dump(mb_check_encoding(hex2bin("1b244224220f"), "JIS"));
result is true. Originally, I think require escape sequence (0x1b2842
).
3v4l result is PHP 8.0 is returns false. Hmm...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your review!
Certainly, escape sequences and SO/SI should not be confused.
However, I do not know if this can even be called invalid as ISO-2022-JP.
Because RFC 1468 only requires ending with ASCII.
Well, it is natural since SO/SI is not defined in RFC 1468 😢
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed mb_check_encoding to return false when SO/SI and escape sequences are mismatched.
This fix also causes mb_check_encoding to return false when 0x0e or 0x0f appears by itself, but I think it did not a problem since it is returning false in PHP 8.0.
https://3v4l.org/DhZde
ext/mbstring/tests/gh10648.phpt
Outdated
// Testing for JIS X 0201 kana | ||
var_dump(mb_check_encoding(hex2bin($esc_kana), 'JIS')); | ||
var_dump(mb_check_encoding(hex2bin($so_kana), 'JIS')); | ||
var_dump(mb_check_encoding(hex2bin($gr_kana), 'JIS')); // false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, so mb_check_encoding
will now treat GR kana as invalid even for JIS
?? I thought this should be only for ISO-2022-JP
.
What we call JIS
should include both JIS7
and JIS8
variants of ISO-2022-JP
, not so?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had mistakenly thought that JIS was a simple alias for ISO-2022-JP.
There is indeed a difference between the two in PHP.
https://3v4l.org/eFt7a
Then I think that ISO-2022-JP is strictly RFC1468 compliant and does not support JIS X 0201 kana in any way.
On the other hand, JIS supports JIS X 0201 kana in every way.
Specifically, the following code comments.
https://3v4l.org/Hb0lN
This behavior is different from any previous versions, but I think it makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found another difference between JIS and ISO-2022-JP.
JIS converts characters in the JIS X 0212 range, while ISO-2022-JP does not.
Here is a summary of how the three encodings ISO-2022-JP, JIS, and ISO-2022-JP-2004 convert or validate the three character sets JIS X 0208, JIS X 0212, and JIS X 0213.
https://3v4l.org/tSsN1
https://3v4l.org/nin0e
Since PHP 8.1, mb_check_encoding returns JIS X 0212 as valid as ISO-2022-JP, which needs to be fixed.
Unrelated to this issue, I think JIS should convert JIS X 0213 since there is no encoding called JIS-2004.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much. I still need to look at the comparison in 3v4l, haven't checked it out yet.
ext/mbstring/tests/gh10192.phpt
Outdated
--EXTENSIONS-- | ||
mbstring | ||
--FILE-- | ||
<?php |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test cases should more thoroughly explore the behavior of mb_detect_encoding
and mb_check_encoding
as re: UTF-7.
You should also include test cases for UTF7-IMAP as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test cases should also demonstrate the behavior of mb_convert_encoding
, showing that while conversion does not insert '?' and does not increment mb_get_info('illegal_chars')
, mb_check_encoding
still returns false
for UTF-7 and UTF7-IMAP strings with improperly terminated Base64 sections.
@pakutoma Great job for your first mbstring patch!! I have left a few comments on where the patch can be improved. In general, we want to have lots of test cases which thoroughly explore the behavior of each function. I understand you may not want to spend a lot of time writing hundreds of test cases, but the number here is definitely a bit too small. I would be even more comfortable if we used a fuzzer to check for any cases where there are unexpected changes in behavior, but again, I understand that learning to use a fuzzer right now might be a bigger step than what you want to take with your very first patch. I may try to make some time to fuzz this code when possible, and I don't think we need to hold up merging it for that. |
@pakutoma I think you are making a lot of progress on this. One other thing I remembered: I think this will change the behavior of I think the best way to avoid that may be to guard the |
@alexdowad Thanks for the lightning fast review! Edit: Edit 2:
And looking at master, I see that instead of increasing I think I can add a number like |
@pakutoma About non-strict detection... first, let me say that this a dubious feature, which probably should not have been included. But since it was, and we already have people using it, we need to make it continue working. Here is what non-strict detection does: normally, if all the candidate text encodings are invalid, then An example: Let's say we have a 1000-byte string as input. Maybe the three candidate encodings are UTF-8, ISO-2022-JP, and EUC-JP. Imagine that the 100th byte is invalid in UTF-8, the 200th byte is invalid in ISO-2022-JP, and the 300th byte is invalid in EUC-JP. In that case, with non-strict detection, the return value will be You will probably agree that this is a strange feature, and that it is probably a bad idea for people to use it. Do you see why your new check functions are a problem for that feature? With the new check functions, we have no way to determine that this encoding is invalid on the 100th byte, but that encoding is invalid on the 200th byte. The check functions just process the entire string and return true or false. This is actually the reason why I originally said that the function signature should include a state argument. But I think we can get around the problem just using a guard |
@alexdowad However, considering that adding |
@pakutoma Sorry that my response wasn't "lightning fast" this time! At the moment, my personal preference is that we leave non-strict detection alone and just allow your new check functions to help improve the accuracy of strict detection. I am always open to being convinced otherwise if there are good reasons. @pakutoma One more thing. Could you please add entries in NEWS and UPGRADING for this change? I am going on a trip tomorrow, but when I get to my hotel and have some extra time, I may try to build your code and run the tests. |
@alexdowad Thanks! I'm working slowly so don't mind me. Have a great trip!
Thanks, I will do this.
I will try to write them. |
When I started working on mbstring, I should have included entries in UPGRADING at that time, but wasn't aware of the use of that file yet. Whatever is entered in UPGRADING now will go into the upgrade notes for the next release, so I think it should only include changes which are relevant to people upgrading from the previous release to this one. |
Just built the code; indeed, all the tests pass. That is good to see. Let me request a few more tweaks:
I still need to:
@youkidearitai pointed out in another thread that this adjustment to @cmb69 @Girgias Do you have some comments? Do you think it is appropriate to target 8.2 with this PR? |
Thank you, I'll fix them!
If multiple encodings are specified in from_encoding and mbstring.strict_detection is set to 1 in the INI file, this code will probably be used to choose encoding from from_encoding.
There were a lot of this type of code, so there may be some omissions to correct. If I have omitted anything, please point it out.
I will make another PR as I probably need to add quite a lot of code. |
Oh, I just found it. I will fix it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nits and comments.
I'm confused by the NEWS entry, it mentions fixing behavioural differences between PHP 8.0 and 8.1 while only targeting PHP 8.2.
If this lands in 8.2, the UPGRADING entry needs something like the imap_open()
entry (Only as of PHP 8.2.X)
static bool is_base64_end_valid(unsigned char n, bool gap, bool is_surrogate) | ||
{ | ||
return !(gap || is_surrogate || n == ILLEGAL); | ||
} | ||
|
||
|
||
static bool is_utf16_cp_valid(uint16_t cp, bool is_surrogate) | ||
{ | ||
if (is_surrogate) { | ||
return cp >= 0xDC00 && cp <= 0xDFFF; | ||
} else if (cp >= 0xDC00 && cp <= 0xDFFF) { | ||
/* 2nd part of surrogate pair came unexpectedly */ | ||
return false; | ||
} else if (cp >= 0x20 && cp <= 0x7E && cp != '&') { | ||
return false; | ||
} | ||
return true; | ||
} | ||
|
||
static bool has_surrogate(uint16_t cp, bool is_surrogate) | ||
{ | ||
return !is_surrogate && cp >= 0xD800 && cp <= 0xDBFF; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These functions seem to be common with ext/mbstring/libmbfl/filters/mbfilter_utf7.c
would it make sense to put them in a separate file and header include them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would even suggest combining UTF-7 and UTF7-IMAP in a single .c
file, though maybe that could be done as a follow-up PR. If these functions are moved into the header file, they should be static inline
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Combining them is also a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is_base64_end_valid and has_surrogate match in two files. It might be better to separate them into the header file.
I can't have an opinion on file combining because I don't understand why existing files are merged.
Of course, if you all say so, I know you are right.
Personally, I feel that any of the externally exposed functions must be common in order to combine files.
There are exceptions though, such as mbfilter_sjis_2004.c.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't have an opinion on file combining because I don't understand why existing files are merged.
Originally, almost all the text encodings supported by libmbfl (and thus mbstring) were implemented in separate files. There were a couple exceptions, but basically it was one .c
file per supported text encoding.
I have combined some of them to facilitate sharing code and thus reducing the total code size of the library. I would also like to combine more files in the future.
Don't know if this makes things clearer or not.
Thanks for the comment.
This change will be put into PHP 8.2 and then backported to PHP 8.1. @alexdowad |
Hmm. Normally if changes are intended to target PHP 8.1, they should be merged into 8.1 first, then merged down into 8.2 and master. I am trying to remember... was it me who advised you to target PHP 8.2 first and then fix 8.1 later? If so, I must have been thinking that we would make changes to the conversion filters. Actually... I think I remember now. Originally I thought we would use the That part of the code is completely different between PHP 8.1 and 8.2, so if you had done 8.1 first, it would not have been possible to merge it down into 8.2 in any straightforward way. That was why I suggested fixing 8.2/master first. But now that we have decided to add this new |
Just tried rebasing on PHP-8.1. There are a lot of nuisance merge conflicts, especially in the definitions of the Personally since this is so close to being ready, I am wondering if it may be OK to go ahead and merge into 8.2/master, then go back and fix up 8.1 in a separate PR. I do think we can improve the NEWS and UPGRADING messages a bit. But at this stage, rebasing on PHP-8.1 may be more trouble than it's worth. @Girgias Any concerns about this plan? |
I have fixed the information pointed out in the following comments. Also, ESC ( H was supported in both JIS and ISO-2022-JP, so it has been changed to be supported only in JIS. |
@pakutoma Thanks very much, great job. I have seen that the check function for JIS allows ESC ( I escape, but the check function for ISO-2022-JP does not. I assume that you have good reasons for this, but would just like to ask that you add test cases for this. You also already explained why the JISX 0212 escape (ESC $ ( D) is only supported for JIS and not for ISO-2022-JP. The explanation made sense and I have no issue with it, but just ask that we should have test cases for that. |
The following test case seems to be for them. As a side note, gh10648.phpt has the test case as bin2hex, but I noticed that this makes it hard to find the test case if we remember the escape sequence as a letter. |
No, it's fine. I apologize for not checking thoroughly before asking about the ESC ( I escape. You are right. I just ran my fuzzer for 5 minutes without it finding any more interesting cases (!). Later today I hope to run it for at least 15 minutes straight. If nothing else is found, I expect that we can merge this PR today. |
The fuzzer just found another interesting case:
That encodes 3 codepoints; the first two are non-surrogates, but the 3rd one is a surrogate (it actually should be the 2nd part of a surrogate; it's higher than 0xDB00).
I am just checking if this is a bug in the conversion filter code. I think it should emit MBFL_BAD_INPUT to the output. |
I see, very interesting.
Edit: |
Previously, mbstring used the same logic for encoding validationas for encoding conversion. However, there are cases where we want to use different logic for validation and conversion. For example, if a string ends up with missing input required by the encoding, or if a character is input that is invalid as an encoding but can be converted, the conversion should succeed and the validation should fail. To achieve this, a function pointer mb_check_fn has been added to struct mbfl_encoding to implement the logic used for validation. Also, added implementation of validation logic for UTF-7, UTF7-IMAP, ISO-2022-JP and JIS.
Yes, you are right. I will fix this problem with |
@pakutoma I just thought of another case which needs to be checked... what if a Base64 section in UTF-7 ends on the first half of a surrogate pair, and the 2nd half of the surrogate pair is missing... but instead of the string ending, it goes back to ASCII mode? I will check what both |
@pakutoma I have run the fuzzer now for 30 minutes without finding anything else. |
@pakutoma Do you mind if I edit NEWS and UPGRADING to make the entries more informative when merging this PR? |
Never mind... I think what you have written is fine. |
Landed on PHP-8.2, now merging down to master. |
GitHub's RSA host key has changed! Interesting. They mentioned this on their company blog. |
Merged locally into master, fixed merge conflicts, just building and running tests. |
Landed on master. |
@pakutoma Are you interested in learning how to use libfuzzer to fuzz your own PRs? If you do not wish to do this, that is fine. |
Sigh... looks like this merge has broken the CI build for Windows. I know what the problem is, but don't have time to fix it right this moment. Hoping to fix it tomorrow. |
I interesting to how to use libfuzzer on another PRs review. |
When I built and tested 0779950 locally, the build was successful and all tests passed. However, in CI, some CI jobs are failing due to compile errors. Fix those.
I believe Windows CI build should be fixed now. |
Fair enough. |
I am very interested! I would like to know how. |
OK, that is good! Let me type up a few instructions on how to get started...
You may wish to download it and store it in the root folder of the PHP source tree.
Notice some key features: A)
A very important part is here:
That will make the fuzzer keep trying to shrink the failing test case until it can't find a shorter input string which still crashes. This makes it far, far easier to analyze the crash and figure out the cause. After minimization is done, the fuzzer will print the name of the file where the minimized case is stored.
You may find that your new mbstring C code is faulty, in which case you can fix the bug and add one or more regression test cases to As an example of "short-circuiting" the fuzzer, perhaps you know that your assertions will fail for encoding HTML-ENTITIES, but you don't care about that. Then you can add fuzzer code like:
That will prevent the fuzzer from testing any cases involving HTML-ENTITIES. You can do something similar for any known failures which you don't care about and don't want the fuzzer to report.
This will help you make sure that the crash is actually fixed. Also, when minimizing a test case, sometimes the fuzzer accidentally finds a different crash, so running This is how I do it, but maybe you will find a better way! |
@alexdowad |
@alexdowad could you maybe document those steps in the PHP Internals Book? As I always wondered how the fuzzer worked too! |
Previously, mbstring used the same logic for encoding validation as for encoding conversion.
However, there are cases where we want to use different logic for validation and conversion.
For example, if a string ends up with missing input required by the encoding, or if a character is input that is invalid as an encoding but can be converted, the conversion should succeed and the validation should fail.
To achieve this, a function pointer
mb_check_fn
has been added tostruct mbfl_encoding
to implement the logic used for validation.Also, added implementation of validation logic for UTF-7, UTF7-IMAP, and JIS.