Fix GH-10648: add check function pointer into mbfl_encoding #10828

Closed
wants to merge 1 commit into from

Conversation

pakutoma
Contributor

Previously, mbstring used the same logic for encoding validation as for encoding conversion.
However, there are cases where validation and conversion need different logic.
For example, if a string ends before the input required by the encoding is complete, or if it contains a sequence that is invalid in the encoding but can still be converted, conversion should succeed while validation should fail.
To achieve this, a function pointer mb_check_fn has been added to struct mbfl_encoding to implement the logic used for validation.
This PR also adds validation logic for UTF-7, UTF7-IMAP, and JIS.
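
As a rough illustration of the idea in this description, here is a minimal sketch of an encoding descriptor that carries an optional check function next to its conversion logic. Every name in it (toy_encoding, toy_validate, the two toy rules) is hypothetical and made up for this example; it is not the real struct mbfl_encoding or the real UTF-7 logic.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for an encoding descriptor. */
typedef struct toy_encoding {
    const char *name;
    /* Conversion-style scan: rejects only bytes that cannot be converted. */
    bool (*convertible)(const unsigned char *in, size_t len);
    /* Optional stricter validator; NULL means "reuse the conversion logic". */
    bool (*check)(const unsigned char *in, size_t len);
} toy_encoding;

/* Toy conversion rule: every byte must be 7-bit. */
static bool toy_convertible(const unsigned char *in, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (in[i] > 0x7F) {
            return false;
        }
    }
    return true;
}

/* Toy validation rule: additionally, a section opened by '+' must be
 * closed by '-' before the string ends. */
static bool toy_check(const unsigned char *in, size_t len)
{
    bool open = false;
    for (size_t i = 0; i < len; i++) {
        if (in[i] > 0x7F) {
            return false;
        }
        if (!open && in[i] == '+') {
            open = true;
        } else if (open && in[i] == '-') {
            open = false;
        }
    }
    return !open;
}

/* Validation prefers the dedicated check function when one is set. */
static bool toy_validate(const toy_encoding *enc, const unsigned char *in, size_t len)
{
    if (enc->check != NULL) {
        return enc->check(in, len);
    }
    return enc->convertible(in, len);
}

static const toy_encoding toy_lenient = { "TOY-LENIENT", toy_convertible, NULL };
static const toy_encoding toy_strict  = { "TOY-STRICT",  toy_convertible, toy_check };
```

With these toy rules, the unterminated input "+abc" passes validation for toy_lenient (conversion logic only) but fails for toy_strict, while both would still be convertible.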

@pakutoma pakutoma changed the title Fix phpGH-10648: add check function pointer into mbfl_encoding Fix GH-10648: add check function pointer into mbfl_encoding Mar 10, 2023
// Testing for trailing escape sequence
var_dump(mb_check_encoding(hex2bin($jis_bytes), 'JIS'));
var_dump(mb_check_encoding(hex2bin($jis_bytes_without_esc), 'JIS')); // false

Contributor

When the input ends with SI (0x0f), as in var_dump(mb_check_encoding(hex2bin("1b244224220f"), "JIS"));, the result is true. I think the escape sequence (0x1b2842) should be required here.

According to 3v4l, PHP 8.0 returns false. Hmm...

Contributor Author

@pakutoma pakutoma Mar 10, 2023

Thanks for your review!
Certainly, escape sequences and SO/SI should not be confused.
However, I do not know if this can even be called invalid as ISO-2022-JP, because RFC 1468 only requires the text to end in ASCII.
Well, that is natural, since SO/SI is not defined in RFC 1468 😢

Contributor Author

Fixed mb_check_encoding to return false when SO/SI and escape sequences are mismatched.
This fix also causes mb_check_encoding to return false when 0x0e or 0x0f appears by itself, but I think that is not a problem, since PHP 8.0 also returns false in that case.
https://3v4l.org/DhZde
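
The rule discussed in this thread (mode switches must pair up, and the string must end in ASCII) can be sketched as a small state machine. This is a simplified illustration under my own assumptions, not the actual check function from the PR: the name toy_check_jis is hypothetical, only ESC ( B, ESC $ B, SO and SI are recognized, and an escape sequence arriving while SO is active is treated as a mismatch.

```c
#include <stdbool.h>
#include <stddef.h>

enum toy_mode { MODE_ASCII, MODE_JISX0208, MODE_KANA_SO };

/* Simplified sketch of a JIS-style validator that tracks the current mode. */
static bool toy_check_jis(const unsigned char *p, size_t len)
{
    enum toy_mode mode = MODE_ASCII;
    size_t i = 0;

    while (i < len) {
        unsigned char c = p[i];

        if (c == 0x1B) {
            /* Escape sequence: needs two more bytes. */
            if (len - i < 3 || mode == MODE_KANA_SO) {
                return false; /* truncated, or ESC while SO-shifted (assumed mismatch) */
            }
            if (p[i + 1] == '(' && p[i + 2] == 'B') {
                mode = MODE_ASCII;
            } else if (p[i + 1] == '$' && p[i + 2] == 'B') {
                mode = MODE_JISX0208;
            } else {
                return false; /* other escapes are out of scope for this sketch */
            }
            i += 3;
        } else if (c == 0x0E) {
            /* SO is accepted only from ASCII mode. */
            if (mode != MODE_ASCII) {
                return false;
            }
            mode = MODE_KANA_SO;
            i++;
        } else if (c == 0x0F) {
            /* SI must close a preceding SO; a lone SI is rejected. */
            if (mode != MODE_KANA_SO) {
                return false;
            }
            mode = MODE_ASCII;
            i++;
        } else if (mode == MODE_JISX0208) {
            /* Double-byte character: both bytes in 0x21..0x7E. */
            if (len - i < 2 || c < 0x21 || c > 0x7E || p[i + 1] < 0x21 || p[i + 1] > 0x7E) {
                return false;
            }
            i += 2;
        } else if (mode == MODE_KANA_SO) {
            /* 7-bit JIS X 0201 kana range. */
            if (c < 0x21 || c > 0x5F) {
                return false;
            }
            i++;
        } else {
            if (c > 0x7F) {
                return false;
            }
            i++;
        }
    }
    /* Valid only if the string ends back in ASCII mode. */
    return mode == MODE_ASCII;
}
```

Under these assumptions the bytes 1b 24 42 24 22 1b 28 42 are accepted, while 1b 24 42 24 22 (no closing escape) and 1b 24 42 24 22 0f (SI used where an escape sequence is needed) are rejected.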

// Testing for JIS X 0201 kana
var_dump(mb_check_encoding(hex2bin($esc_kana), 'JIS'));
var_dump(mb_check_encoding(hex2bin($so_kana), 'JIS'));
var_dump(mb_check_encoding(hex2bin($gr_kana), 'JIS')); // false

Contributor

Hmm, so mb_check_encoding will now treat GR kana as invalid even for JIS?? I thought this should be only for ISO-2022-JP.

What we call JIS should include both JIS7 and JIS8 variants of ISO-2022-JP, not so?

Contributor Author

@pakutoma pakutoma Mar 11, 2023

I had mistakenly thought that JIS was a simple alias for ISO-2022-JP.
There is indeed a difference between the two in PHP.
https://3v4l.org/eFt7a
So I think ISO-2022-JP should be strictly RFC 1468 compliant and should not support JIS X 0201 kana in any form.
JIS, on the other hand, supports JIS X 0201 kana in every form.
The details are in the comments of the following code.
https://3v4l.org/Hb0lN
This behavior is different from any previous versions, but I think it makes sense.

Contributor Author

@pakutoma pakutoma Mar 11, 2023

I found another difference between JIS and ISO-2022-JP.
JIS converts characters in the JIS X 0212 range, while ISO-2022-JP does not.

Here is a summary of how the three encodings ISO-2022-JP, JIS, and ISO-2022-JP-2004 convert or validate the three character sets JIS X 0208, JIS X 0212, and JIS X 0213.
https://3v4l.org/tSsN1
https://3v4l.org/nin0e
Since PHP 8.1, mb_check_encoding has accepted JIS X 0212 as valid ISO-2022-JP, which needs to be fixed.

Unrelated to this issue, I think JIS should convert JIS X 0213 since there is no encoding called JIS-2004.

Contributor

Thanks so much. I still need to look at the comparison in 3v4l, haven't checked it out yet.

--EXTENSIONS--
mbstring
--FILE--
<?php

Contributor

The test cases should more thoroughly explore the behavior of mb_detect_encoding and mb_check_encoding with regard to UTF-7.

You should include test cases for UTF7-IMAP as well.

Contributor

The test cases should also demonstrate the behavior of mb_convert_encoding, showing that while conversion does not insert '?' and does not increment mb_get_info('illegal_chars'), mb_check_encoding still returns false for UTF-7 and UTF7-IMAP strings with improperly terminated Base64 sections.

@alexdowad
Contributor

@pakutoma Great job for your first mbstring patch!!

I have left a few comments on where the patch can be improved. In general, we want to have lots of test cases which thoroughly explore the behavior of each function. I understand you may not want to spend a lot of time writing hundreds of test cases, but the number here is definitely a bit too small.

I would be even more comfortable if we used a fuzzer to check for any cases where there are unexpected changes in behavior, but again, I understand that learning to use a fuzzer right now might be a bigger step than what you want to take with your very first patch. I may try to make some time to fuzz this code when possible, and I don't think we need to hold up merging it for that.

@alexdowad
Contributor

@pakutoma I think you are making a lot of progress on this.

One other thing I remembered: I think this will change the behavior of mb_detect_encoding in non-strict detection mode. Or more to the point: I think it will break non-strict detection. (To enable non-strict detection, set the INI setting mbstring.strict_detection to false, or pass false as the 3rd argument to mb_detect_encoding.)

I think the best way to avoid that may be to guard the for loop which you added to mb_detect_encoding with if (identd->strict).

@pakutoma
Contributor Author

pakutoma commented Mar 11, 2023

@alexdowad Thanks for the lightning fast review!

Edit:
I had misread mbfl_encoding_detector_feed.
I see that bad++; would cause this function to fail.
I will figure out how to fix it.

Edit 2:
I guess I was under the mistaken impression that the bug that existed in PHP 8.0 was still there.
I see that the issue with mb_detect_encoding returning false for non-strict detection has been fixed since PHP 8.1.
https://3v4l.org/JtLcJ
And here it is happening again.
This is indeed a serious mistake. I will fix it.

By "break" I assume you mean that the added for loop increases the num_illegalchars for a given encoding, so that the mbfl_encoding_detector_judge function excludes the encoding.
Non-strict detection is expected to return more than one result, so in that sense this change is breaking it.
Hmmm, but in this sense, wasn't it originally broken when the input with MBFL_BAD_INPUT was passed? I think so, and that has been fixed in master.

And looking at master, I see that instead of increasing num_illegalchars, it adds 1000 to demerits.
https://github.com/php/php-src/blob/master/ext/mbstring/mbstring.c#L3053-3060

I think I can add a number like 1000 to data->score instead of guarding the for loop with if, but is that a bit hacky?
I think the advantage is that we can use the output of the check function as a heuristic for non-strict detection.

@alexdowad
Contributor

@pakutoma About non-strict detection... first, let me say that this is a dubious feature, which probably should not have been included. But since it was, and we already have people using it, we need to make it continue working.

Here is what non-strict detection does: normally, if all the candidate text encodings are invalid, then mb_detect_encoding does not return any encoding. However, if non-strict detection is enabled, then as soon as N-1 candidate encodings are eliminated, then mb_detect_encoding immediately returns the last remaining one.

An example: Let's say we have a 1000-byte string as input. Maybe the three candidate encodings are UTF-8, ISO-2022-JP, and EUC-JP. Imagine that the 100th byte is invalid in UTF-8, the 200th byte is invalid in ISO-2022-JP, and the 300th byte is invalid in EUC-JP. In that case, with non-strict detection, the return value will be "EUC-JP". But with strict detection, the return value will be null.

You will probably agree that this is a strange feature, and that it is probably a bad idea for people to use it.

Do you see why your new check functions are a problem for that feature? With the new check functions, we have no way to determine that this encoding is invalid on the 100th byte, but that encoding is invalid on the 200th byte. The check functions just process the entire string and return true or false.

This is actually the reason why I originally said that the function signature should include a state argument. But I think we can get around the problem just using a guard if clause, as I said above. This means that non-strict detection will not benefit from your check functions; but anyways, non-strict detection is... well, non-strict. So I don't think it matters if non-strict detection does not pay attention to ISO-2022-JP strings ending in an incorrect mode.
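
The 1000-byte example above can be modeled in a few lines. This is only a sketch of the behavior as described in this comment, not the mbstring implementation: toy_candidate and both functions are hypothetical, the real detector also scores candidates, and here each candidate is reduced to the offset of its first invalid byte.

```c
#include <stddef.h>

/* Marker meaning "no invalid byte was found in the whole string". */
#define TOY_VALID ((size_t)-1)

typedef struct toy_candidate {
    const char *name;
    size_t first_invalid; /* zero-based offset of first invalid byte, or TOY_VALID */
} toy_candidate;

/* Strict model: only a candidate with no invalid byte at all can be returned. */
static const char *toy_detect_strict(const toy_candidate *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (c[i].first_invalid == TOY_VALID) {
            return c[i].name;
        }
    }
    return NULL;
}

/* Non-strict model: candidates are eliminated in the order in which their
 * first invalid byte is reached, and the last survivor is returned at once.
 * That survivor is the candidate whose first invalid byte comes latest
 * (ties go to the earlier candidate in this sketch). */
static const char *toy_detect_nonstrict(const toy_candidate *c, size_t n)
{
    if (n == 0) {
        return NULL;
    }
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (c[i].first_invalid > c[best].first_invalid) {
            best = i;
        }
    }
    return c[best].name;
}
```

The model also shows why a check function that only answers true or false for the whole string cannot drive non-strict detection: it provides no first_invalid offset to compare.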

@pakutoma
Contributor Author

@alexdowad
Thanks for the detailed explanation, it helped me to understand.
It seems that the check function needs a major change to "fully" support this feature and there is not much to be gained by it.

However, considering that adding data->score += 1000 gives a slightly better result for non-strict detection than doing nothing, I think it's a good idea to add this, but I also understand that you don't want to add more code for non-strict detection. What do you think?

@alexdowad
Contributor

@pakutoma Sorry that my response wasn't "lightning fast" this time!

At the moment, my personal preference is that we leave non-strict detection alone and just allow your new check functions to help improve the accuracy of strict detection. I am always open to being convinced otherwise if there are good reasons.

@pakutoma One more thing. Could you please add entries in NEWS and UPGRADING for this change?

I am going on a trip tomorrow, but when I get to my hotel and have some extra time, I may try to build your code and run the tests.

@pakutoma
Contributor Author

pakutoma commented Mar 14, 2023

@alexdowad Thanks! I'm working slowly so don't mind me. Have a great trip!

my personal preference is that we leave non-strict detection alone and just allow your new check functions to help
improve the accuracy of strict detection.

Thanks, I will do this.
It would surely be better to let users choose the encoding in some other, more appropriate way than to keep them on an unstable method by making ad hoc fixes to non-strict detection.

Could you please add entries in NEWS and UPGRADING for this change?

I will try to write them.
Should UPGRADING cover only the change made by this PR?
As far as I know, mb_(check|detect|convert)_encoding also changed significantly in PHP 8.1. Am I able to include that change as well?

@alexdowad
Contributor

Could you please add entries in NEWS and UPGRADING for this change?

I will try to write them. Should UPGRADING cover only the change made by this PR? As far as I know, mb_(check|detect|convert)_encoding also changed significantly in PHP 8.1. Am I able to include that change as well?

When I started working on mbstring, I should have included entries in UPGRADING at that time, but wasn't aware of the use of that file yet.

Whatever is entered in UPGRADING now will go into the upgrade notes for the next release, so I think it should only include changes which are relevant to people upgrading from the previous release to this one.

@alexdowad
Contributor

Just built the code; indeed, all the tests pass. That is good to see.

Let me request a few more tweaks:

  • Squash into a single commit.
  • Remove trailing spaces in NEWS.
  • UPGRADING mentions mb_convert_encoding(), but I think that might be incorrect; does this patch affect mb_convert_encoding??
  • For UTF-7, don't say if (cp >= ...) return true; else return false;. Just say return (cp >= ...);

I still need to:

  • Carefully examine the new test cases and check if the EXPECT section looks right.
  • Benchmark to make sure there is no significant performance loss.
  • Fuzz.

@youkidearitai pointed out in another thread that this adjustment to mb_check_encoding should probably apply to ISO-2022-JP-2004 as well. You could include that in this PR or make a separate one after this one is merged; either way is fine.

@cmb69 @Girgias Do you have some comments? Do you think it is appropriate to target 8.2 with this PR?

@pakutoma
Contributor Author

Thank you, I'll fix them!

UPGRADING mentions mb_convert_encoding(), but I think that might be incorrect; does this patch affect mb_convert_encoding??

If multiple encodings are specified in from_encoding and mbstring.strict_detection is set to 1 in the INI file, this code will probably be used to choose the encoding from from_encoding.
The current wording may make it hard to see that the change affects the choice of from_encoding.
I will fix this.

For UTF-7, don't say if (cp >= ...) return true; else return false;. Just say return (cp >= ...);

There was a lot of code of this type, so I may have missed some. If I have omitted anything, please point it out.

You could include that in this PR or make a separate one after this one is merged;

I will make another PR as I probably need to add quite a lot of code.

@pakutoma
Contributor Author

If I have omitted anything, please point it out.

Oh, I just found it. I will fix it.

Member

@Girgias Girgias left a comment

Nits and comments.

I'm confused by the NEWS entry, it mentions fixing behavioural differences between PHP 8.0 and 8.1 while only targeting PHP 8.2.

If this lands in 8.2, the UPGRADING entry needs something like the imap_open() entry (Only as of PHP 8.2.X)

Comment on lines 738 to 759
static bool is_base64_end_valid(unsigned char n, bool gap, bool is_surrogate)
{
    return !(gap || is_surrogate || n == ILLEGAL);
}

static bool is_utf16_cp_valid(uint16_t cp, bool is_surrogate)
{
    if (is_surrogate) {
        return cp >= 0xDC00 && cp <= 0xDFFF;
    } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
        /* 2nd part of surrogate pair came unexpectedly */
        return false;
    } else if (cp >= 0x20 && cp <= 0x7E && cp != '&') {
        return false;
    }
    return true;
}

static bool has_surrogate(uint16_t cp, bool is_surrogate)
{
    return !is_surrogate && cp >= 0xD800 && cp <= 0xDBFF;

Member

These functions seem to be shared with ext/mbstring/libmbfl/filters/mbfilter_utf7.c. Would it make sense to move them into a separate header file and include it from both?

Contributor

I would even suggest combining UTF-7 and UTF7-IMAP in a single .c file, though maybe that could be done as a follow-up PR. If these functions are moved into the header file, they should be static inline.
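
A shared header along the lines suggested here might look like the sketch below. The header name and include guard are hypothetical; the body of has_surrogate is taken from the excerpt under review, and is_base64_end_valid is left out because it depends on the ILLEGAL constant, whose definition is not shown in this thread.

```c
/* utf7_helpers.h (hypothetical file name) */
#ifndef UTF7_HELPERS_H
#define UTF7_HELPERS_H

#include <stdbool.h>
#include <stdint.h>

/* static inline lets every .c file that includes this header get its own
 * copy without duplicate-symbol errors at link time, and without
 * "defined but not used" warnings in files that do not call it. */
static inline bool has_surrogate(uint16_t cp, bool is_surrogate)
{
    /* True when cp is the first half of a surrogate pair and we are not
     * already waiting for the second half. */
    return !is_surrogate && cp >= 0xD800 && cp <= 0xDBFF;
}

#endif /* UTF7_HELPERS_H */
```

Both the UTF-7 and the UTF7-IMAP source files could then include the header instead of each defining its own copy.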

Member

Combining them is also a good idea.

Contributor Author

is_base64_end_valid and has_surrogate are identical in the two files. It might be better to move them into a header file.
I can't have an opinion on file combining because I don't understand why existing files are merged.
Of course, if you all say so, I know you are right.
Personally, I feel that files should be combined only if they have externally exposed functions in common.
There are exceptions though, such as mbfilter_sjis_2004.c.

Contributor

I can't have an opinion on file combining because I don't understand why existing files are merged.

Originally, almost all the text encodings supported by libmbfl (and thus mbstring) were implemented in separate files. There were a couple exceptions, but basically it was one .c file per supported text encoding.

I have combined some of them to facilitate sharing code and thus reducing the total code size of the library. I would also like to combine more files in the future.

Don't know if this makes things clearer or not.

@pakutoma
Contributor Author

Thanks for the comment.

I'm confused by the NEWS entry, it mentions fixing behavioural differences between PHP 8.0 and 8.1 while only targeting PHP 8.2.

This change will be put into PHP 8.2 and then backported to PHP 8.1.
So perhaps NEWS and UPGRADING should be written when backporting to PHP8.1.

@alexdowad
What do you think about this?

@alexdowad
Contributor

Thanks for the comment.

I'm confused by the NEWS entry, it mentions fixing behavioural differences between PHP 8.0 and 8.1 while only targeting PHP 8.2.

This change will be put into PHP 8.2 and then backported to PHP 8.1. So perhaps NEWS and UPGRADING should be written when backporting to PHP8.1.

@alexdowad What do you think about this?

Hmm. Normally if changes are intended to target PHP 8.1, they should be merged into 8.1 first, then merged down into 8.2 and master.

I am trying to remember... was it me who advised you to target PHP 8.2 first and then fix 8.1 later? If so, I must have been thinking that we would make changes to the conversion filters. Actually... I think I remember now. Originally I thought we would use the unsigned int *state parameter to detect ISO-2022-JP strings which end in the wrong state.

That part of the code is completely different between PHP 8.1 and 8.2, so if you had done 8.1 first, it would not have been possible to merge it down into 8.2 in any straightforward way. That was why I suggested fixing 8.2/master first.

But now that we have decided to add this new check function pointer instead, I think this code will apply to both 8.1 and 8.2 without any problems. So it is better you target 8.1. I think there should only be very few merge conflicts when merging down to 8.2 and master.

@alexdowad
Contributor

@pakutoma I just fetched the latest version of this PR and read through the diff. I think you have addressed just about all the feedback from myself and @Girgias.

If I am the one to merge this PR, I may just wrap your commit log message to ~80 columns.

@alexdowad
Contributor

Just tried rebasing on PHP-8.1. There are a lot of nuisance merge conflicts, especially in the definitions of the mbfl_encoding structs, because they already have one added member in 8.2 and now you have added another member again.

Personally since this is so close to being ready, I am wondering if it may be OK to go ahead and merge into 8.2/master, then go back and fix up 8.1 in a separate PR.

I do think we can improve the NEWS and UPGRADING messages a bit. But at this stage, rebasing on PHP-8.1 may be more trouble than it's worth.

@Girgias Any concerns about this plan?

@pakutoma
Contributor Author

I have fixed the issues pointed out in the following comments.
#10828 (comment)
#10828 (comment)

Also, ESC ( H was supported in both JIS and ISO-2022-JP, so it has been changed to be supported only in JIS.
Note that mb_check_encoding in PHP 8.0 returns false for ESC ( H, so this is not a BC break.
https://3v4l.org/NHiVH

@alexdowad
Contributor

@pakutoma Thanks very much, great job.

I have seen that the check function for JIS allows ESC ( I escape, but the check function for ISO-2022-JP does not. I assume that you have good reasons for this, but would just like to ask that you add test cases for this.

You also already explained why the JIS X 0212 escape (ESC $ ( D) is only supported for JIS and not for ISO-2022-JP. The explanation made sense and I have no issue with it; I would just ask that we have test cases for that.

@pakutoma
Contributor Author

pakutoma commented Mar 23, 2023

The following test case seems to be for them.
Or do I need to add them to iso2022jp_encoding.phpt?
https://github.com/php/php-src/pull/10828/files#diff-2fc13b86b6293c61785a8e98704d7375bed84c7e05994eae2b0ef1d245b7debeR11
https://github.com/php/php-src/pull/10828/files#diff-2fc13b86b6293c61785a8e98704d7375bed84c7e05994eae2b0ef1d245b7debeR17

As a side note, gh10648.phpt writes its test cases as hex strings, but I noticed that this makes a test case hard to find when we remember the escape sequence by its letters.
It's hard to say which is easier to read, "\x1b\$B\x24\x22\x1b(B" or hex2bin('1b244224221b2842'), but since other test cases are written in the former style, I feel it is better to use the former.

@alexdowad
Contributor

The following test case seems to be for them. Or do I need to add them to iso2022jp_encoding.phpt?

No, it's fine. I apologize for not checking thoroughly before asking about the ESC ( I escape. You are right.

I just ran my fuzzer for 5 minutes without it finding any more interesting cases (!). Later today I hope to run it for at least 15 minutes straight. If nothing else is found, I expect that we can merge this PR today.

@alexdowad
Contributor

The fuzzer just found another interesting case:

var_dump(mb_check_encoding("+999999uJ", 'UTF-7'));

That encodes 3 codepoints; the first two are non-surrogates, but the 3rd one is a surrogate (it actually should be the 2nd part of a surrogate; it's higher than 0xDB00).

mb_check_utf7 very correctly detects this as an error, whereas the existing PHP 8.2 code, surprisingly, does not.

I am just checking if this is a bug in the conversion filter code. I think it should emit MBFL_BAD_INPUT to the output.

@pakutoma
Contributor Author

pakutoma commented Mar 23, 2023

I see, very interesting.
https://3v4l.org/nesn7

Looking at the flow, I see why mb_check_utf7() judges this input as invalid:
it returns false because the end state is wrong.
In other words, mb_check_utf7 was not checking the order of UTF-16 surrogate pairs.
This should be fixed and I will fix it!

Oops, this was my misunderstanding.
mb_check_utf7() was checking the order of surrogate pairs.
I will add a test case for this.

Edit:
@alexdowad
OK, the second half of a surrogate pair starts at 0xDC00, not 0xDB00.
Since mb_convert_encoding() does not care about the end state, it does not generate an error if the string ends in the middle of a sequence.
I wish I had noticed it sooner, but I was confused too.
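
The decoding of "+999999uJ" discussed above can be reproduced by hand. The sketch below uses hypothetical helper names (it is not mb_check_utf7): it decodes the Base64 characters after the leading '+' into 16-bit units and then checks surrogate pairing. The three units come out as 0xF7DF, 0x7DF7 and 0xDB89; the last one lies in 0xD800..0xDBFF, so it is the first half of a surrogate pair with nothing after it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Map one Base64 character to its 6-bit value, or -1 if it is not in the alphabet. */
static int toy_b64_value(unsigned char c)
{
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;
}

/* Decode Base64 text (without the leading '+') into 16-bit units.
 * Returns the number of units, or -1 on a bad character or full buffer.
 * Leftover bits at the end are ignored in this sketch. */
static int toy_decode_units(const char *b64, uint16_t *out, size_t max_out)
{
    uint32_t acc = 0; /* bit accumulator, never holds more than 21 bits */
    int bits = 0;
    size_t n = 0;

    for (; *b64 != '\0'; b64++) {
        int v = toy_b64_value((unsigned char)*b64);
        if (v < 0) {
            return -1;
        }
        acc = (acc << 6) | (uint32_t)v;
        bits += 6;
        if (bits >= 16) {
            if (n == max_out) {
                return -1;
            }
            bits -= 16;
            out[n++] = (uint16_t)(acc >> bits);
            acc &= (1u << bits) - 1; /* keep only the bits not yet consumed */
        }
    }
    return (int)n;
}

static bool toy_is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static bool toy_is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* True only if every first half is directly followed by a second half and
 * no second half appears on its own. */
static bool toy_units_well_formed(const uint16_t *u, size_t n)
{
    bool pending = false;
    for (size_t i = 0; i < n; i++) {
        if (pending) {
            if (!toy_is_low_surrogate(u[i])) {
                return false;
            }
            pending = false;
        } else if (toy_is_low_surrogate(u[i])) {
            return false;
        } else if (toy_is_high_surrogate(u[i])) {
            pending = true;
        }
    }
    return !pending; /* a dangling first half at the end is an error */
}
```

Because the problem only shows up in the state left over at the end of the input, a converter that never inspects its final state will not report it, which matches the behavior described in this comment.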

Previously, mbstring used the same logic for encoding validation as for encoding
conversion.
However, there are cases where we want to use different logic for validation
and conversion. For example, if a string ends up with missing input
required by the encoding, or if a character is input that is invalid
as an encoding but can be converted, the conversion should succeed and
the validation should fail.
To achieve this, a function pointer mb_check_fn has been added to
struct mbfl_encoding to implement the logic used for validation.
Also, added implementation of validation logic for UTF-7, UTF7-IMAP,
ISO-2022-JP and JIS.
@alexdowad
Contributor

Edit: @alexdowad OK, the second half of a surrogate pair starts at 0xDC00, not 0xDB00. Since mb_convert_encoding() does not care about the end state, it does not generate an error if the string ends in the middle of a sequence. I wish I had noticed it sooner, but I was confused too.

Yes, you are right. I will fix this problem with mb_convert_encoding.

@alexdowad
Contributor

@pakutoma I just thought of another case which needs to be checked... what if a Base64 section in UTF-7 ends on the first half of a surrogate pair, and the 2nd half of the surrogate pair is missing... but instead of the string ending, it goes back to ASCII mode?

I will check what both mb_convert_encoding and mb_check_encoding do in this case.

@alexdowad
Contributor

@pakutoma I have run the fuzzer now for 30 minutes without finding anything else.

@alexdowad
Contributor

@pakutoma Do you mind if I edit NEWS and UPGRADING to make the entries more informative when merging this PR?

@alexdowad
Contributor

@pakutoma Do you mind if I edit NEWS and UPGRADING to make the entries more informative when merging this PR?

Never mind... I think what you have written is fine.

@alexdowad
Contributor

Landed on PHP-8.2, now merging down to master.

@alexdowad
Contributor

GitHub's RSA host key has changed! Interesting. They mentioned this on their company blog.

@alexdowad
Contributor

Merged locally into master, fixed merge conflicts, just building and running tests.

@alexdowad
Contributor

Landed on master.

@alexdowad alexdowad closed this Mar 24, 2023
@alexdowad
Contributor

@pakutoma Are you interested in learning how to use libfuzzer to fuzz your own PRs? If you do not wish to do this, that is fine.

@alexdowad
Contributor

Sigh... looks like this merge has broken the CI build for Windows. I know what the problem is, but don't have time to fix it right this moment. Hoping to fix it tomorrow.

@youkidearitai
Contributor

@pakutoma Are you interested in learning how to use libfuzzer to fuzz your own PRs? If you do not wish to do this, that is fine.

I am interested in how to use libFuzzer when reviewing other PRs.

alexdowad referenced this pull request Mar 25, 2023
When I built and tested 0779950 locally, the build was successful
and all tests passed. However, in CI, some CI jobs are failing due to
compile errors. Fix those.
@alexdowad
Contributor

I believe Windows CI build should be fixed now.

@alexdowad
Contributor

I am interested in how to use libFuzzer when reviewing other PRs.

Fair enough.

@pakutoma
Contributor Author

@alexdowad

Are you interested in learning how to use libfuzzer to fuzz your own PRs? If you do not wish to do this, that is fine.

I am very interested! I would like to know how.

@alexdowad
Contributor

I am very interested! I would like to know how.

OK, that is good! Let me type up a few instructions on how to get started...

  1. libFuzzer relies on special instrumentation code which can be inserted into your program at compile-time, by clang. So basically it only works if you build PHP with clang. If you don't have clang on your dev machine, you will need to install it. You might also need to install one or two other packages.

  2. When building PHP, you will need to provide arguments to ./configure telling it that you want to build the fuzzers in sapi/fuzzer. To make this easy for myself, I have packaged up the invocation of ./configure as a shell script: https://gist.github.com/alexdowad/27b13fce5a3d55e5f9bc5a054424e514

You may wish to download it and store it in the root folder of the PHP source tree.

  3. Once you have clang installed, try to build the PHP fuzzers with make clean, ./buildconf, ./config.fuzz, make. If this fails, try to see if there is a missing package you need to install.

  4. Once that succeeds, try running the fuzzer for 10 or 20 seconds just to get a feel: ./sapi/fuzzer/php-fuzz-mbstring. Use CTRL+C to stop it.

  5. Rather than modifying the build scripts, I usually just modify fuzzer-mbstring.c and recompile php-fuzz-mbstring. Try looking over the code for fuzzer-mbstring.c now...

Notice some key features:

  A) LLVMFuzzerTestOneInput is called for each test case generated by the fuzzer.
  B) It receives a vector of bytes; you have to somehow convert those bytes into the arguments for the function you are testing. For example, on lines 52-60, you can see that we scan the input bytes for a comma, cut out the part before the comma, and take it as the name of the input text encoding.
  C) If the input bytes cannot be converted to function arguments, we just free any dynamically allocated memory and return zero.
  D) Before we call any functions from PHP, we need to call fuzzer_request_startup first.
  E) After calling the function we want to test, check the results and use ZEND_ASSERT to cause a crash if the results are wrong.
  F) If the results are correct, free all dynamically allocated memory (otherwise the fuzzer will think that our code has a memory leak), then call fuzzer_request_shutdown.
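
The structure described above can be sketched as a stand-alone harness. This is a simplified illustration, not the contents of fuzzer-mbstring.c: toy_check_a and toy_check_b are hypothetical stand-ins for two implementations that are expected to agree, plain assert stands in for ZEND_ASSERT, and the PHP-specific calls (fuzzer_request_startup, fuzzer_request_shutdown) appear only as comments because they need the PHP fuzzer SAPI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical implementation 1: is every byte 7-bit? (early-exit loop) */
static int toy_check_a(const char *enc, const uint8_t *s, size_t len)
{
    (void)enc; /* the encoding name is unused in this toy */
    for (size_t i = 0; i < len; i++) {
        if (s[i] & 0x80) {
            return 0;
        }
    }
    return 1;
}

/* Hypothetical implementation 2: same question, answered by OR-ing all bytes. */
static int toy_check_b(const char *enc, const uint8_t *s, size_t len)
{
    (void)enc;
    uint8_t all = 0;
    for (size_t i = 0; i < len; i++) {
        all |= s[i];
    }
    return (all & 0x80) == 0;
}

/* libFuzzer calls this once per generated input. */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
{
    if (size == 0) {
        return 0;
    }
    /* Split "NAME,BYTES": the part before the first comma is the encoding name. */
    const uint8_t *comma = memchr(data, ',', size);
    if (comma == NULL) {
        return 0; /* cannot build arguments from this input */
    }
    size_t name_len = (size_t)(comma - data);
    char *name = malloc(name_len + 1);
    if (name == NULL) {
        return 0;
    }
    memcpy(name, data, name_len);
    name[name_len] = '\0';

    const uint8_t *str = comma + 1;
    size_t str_len = size - name_len - 1;

    /* A real PHP harness would call fuzzer_request_startup() here. */
    int good1 = toy_check_a(name, str, str_len);
    int good2 = toy_check_b(name, str, str_len);
    assert(good1 == good2); /* a mismatch aborts, which the fuzzer reports as a crash */

    free(name); /* free everything, or the fuzzer reports a leak */
    /* A real PHP harness would call fuzzer_request_shutdown() here. */
    return 0;
}
```

Compiled with clang and -fsanitize=fuzzer, libFuzzer supplies main and feeds inputs to this function; the comparison of two results is the same pattern as the good1 == good2 assertion visible in the crash output further down.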

  6. Now, for the code which I used to fuzz your recent PR, see: https://github.com/alexdowad/php-src/blob/fuzzpaku/sapi/fuzzer/fuzzer-mbstring.c

  7. Get fuzzer-mbstring.c from that branch. Then you might try deliberately introducing a bug in your recently added code; for example, in mb_check_iso2022jp, you could try changing } else if (state == JISX_0208 && (c > 0x20 && c < 0x7F)) { to } else if (state == JISX_0208 && (c > 0x20 && c < 0x70)) {, or anything else that you prefer. Build php-fuzz-mbstring and run it. How long does it take to find a crash?

  8. Here is what the output of the crash looked like for me:

Encoding: ISO-2022-JP
String: J\x14\x1b$BIG6-utQf\x10\x10\x1033333333333333333333JY\x1b(BI;x\x00I\x00,J\x00\x1b(BI+xxxo\x1b(Bu],e+owxUCu\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00tf6ZI\x1b(B%:022JP\x00\x04J\x00\x00\x00\x00\x00\x00],e+owxU-58ZI\x1b(B%:\x00I\x00Y\x1b(BI+xxxo\x1b(BI+x\x00I\x00,\x00JY\x1b(BI+xxxo\x1b(Butf7\x00,\x0d\x00\x00\x00\x1b(BI+xxxo\x1b(Bu],e+owxUCS-16ZI\x1b(B%:022-JP\x0033utf1333$BIG6utQf\x10\x10\x103333333333333333333333utf13333333333333333333JIS\x0833;3333tf133333333333JIS\x0833;3333333333333333333333333333333333\x10\x10\x10\x10\x10\x10\x104
php-fuzz-mbstring: /home/alex/Programming/php/php-src/sapi/fuzzer/fuzzer-mbstring.c:319: int LLVMFuzzerTestOneInput(const uint8_t *, size_t): Assertion `good1 == good2' failed.
==1616743== ERROR: libFuzzer: deadly signal
    #0 0x5593798c33b1 in __sanitizer_print_stack_trace (/home/alex/Programming/php/php-src/sapi/fuzzer/php-fuzz-mbstring+0x26c33b1) (BuildId: d178d9cec7246ea5cf3e7c1ae74acc1af4007d80)
    #1 0x559379835c48 in fuzzer::PrintStackTrace() (/home/alex/Programming/php/php-src/sapi/fuzzer/php-f
uzz-mbstring+0x2635c48) (BuildId: d178d9cec7246ea5cf3e7c1ae74acc1af4007d80)
    #2 0x55937981b6c3 in fuzzer::Fuzzer::CrashCallback() (/home/alex/Programming/php/php-src/sapi/fuzzer
/php-fuzz-mbstring+0x261b6c3) (BuildId: d178d9cec7246ea5cf3e7c1ae74acc1af4007d80)
    #3 0x7f41f4c4251f  (/lib/x86_64-linux-gnu/libc.so.6+0x4251f) (BuildId: 69389d485a9793dbe873f0ea2c93e02efaa9aa3d)
    #4 0x7f41f4c96a7b in __pthread_kill_implementation nptl/./nptl/pthread_kill.c:43:17
    #5 0x7f41f4c96a7b in __pthread_kill_internal nptl/./nptl/pthread_kill.c:78:10
    #6 0x7f41f4c96a7b in pthread_kill nptl/./nptl/pthread_kill.c:89:10
    #7 0x7f41f4c42475 in gsignal signal/../sysdeps/posix/raise.c:26:13
    #8 0x7f41f4c287f2 in abort stdlib/./stdlib/abort.c:79:7
    #9 0x7f41f4c2871a in __assert_fail_base assert/./assert/assert.c:92:3
    #10 0x7f41f4c39e95 in __assert_fail assert/./assert/assert.c:101:3
    #11 0x55937da057e4 in LLVMFuzzerTestOneInput /home/alex/Programming/php/php-src/sapi/fuzzer/fuzzer-mbstring.c:319:4
    #12 0x55937981cc53 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) (/home/alex/Programming/php/php-src/sapi/fuzzer/php-fuzz-mbstring+0x261cc53) (BuildId: d178d9cec7246ea5cf3e7c1ae74acc1af4007d80)
    #13 0x55937981c3a9 in fuzzer::Fuzzer::RunOne(unsigned char const*, unsigned long, bool, fuzzer::InputInfo*, bool, bool*) (/home/alex/Programming/php/php-src/sapi/fuzzer/php-fuzz-mbstring+0x261c3a9) (BuildId: d178d9cec7246ea5cf3e7c1ae74acc1af4007d80)
    #14 0x55937981db99 in fuzzer::Fuzzer::MutateAndTestOne() (/home/alex/Programming/php/php-src/sapi/fuzzer/php-fuzz-mbstring+0x261db99) (BuildId: d178d9cec7246ea5cf3e7c1ae74acc1af4007d80)
    #15 0x55937981e715 in fuzzer::Fuzzer::Loop(std::vector<fuzzer::SizedFile, std::allocator<fuzzer::SizedFile> >&) (/home/alex/Programming/php/php-src/sapi/fuzzer/php-fuzz-mbstring+0x261e715) (BuildId: d178d9cec7246ea5cf3e7c1ae74acc1af4007d80)
    #16 0x55937980c852 in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) (/home/alex/Programming/php/php-src/sapi/fuzzer/php-fuzz-mbstring+0x260c852) (BuildId: d178d9cec7246ea5cf3e7c1ae74acc1af4007d80)
    #17 0x559379836542 in main (/home/alex/Programming/php/php-src/sapi/fuzzer/php-fuzz-mbstring+0x2636542) (BuildId: d178d9cec7246ea5cf3e7c1ae74acc1af4007d80)
    #18 0x7f41f4c29d8f in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
    #19 0x7f41f4c29e3f in __libc_start_main csu/../csu/libc-start.c:392:3
    #20 0x559379801294 in _start (/home/alex/Programming/php/php-src/sapi/fuzzer/php-fuzz-mbstring+0x2601294) (BuildId: d178d9cec7246ea5cf3e7c1ae74acc1af4007d80)

NOTE: libFuzzer has rudimentary signal handlers.
      Combine libFuzzer with AddressSanitizer or similar for better crash reports.
SUMMARY: libFuzzer: deadly signal
MS: 4 PersAutoDict-CopyPart-CopyPart-CrossOver- DE: "utf1"-; base unit: 918af7bb847db7a7cd7566b260504ec30b3bc7b6
artifact_prefix='./'; Test unit written to ./crash-41ea6e371a5bdbefc3e08e1453760b4bae1192c4
sapi/fuzzer/php-fuzz-mbstring  296.05s user 34.93s system 99% cpu 5:31.19 total

A very important part is here: Test unit written to ./crash-41ea6e371a5bdbefc3e08e1453760b4bae1192c4. This file stores the input which provoked the crash, so you can reproduce it.
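The assertion good1 == good2 in the output above suggests a differential check: two validation paths run on the same input, and any disagreement aborts the process so that the input gets saved. Below is a minimal sketch of that idea, not the code from fuzzer-mbstring.c. The names check_reference, check_mutated and validators_agree are hypothetical, and the byte-range test only mimics the JISX_0208 condition from step 2.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference check: every byte must lie in 0x21..0x7E. */
static bool check_reference(const uint8_t *data, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (!(data[i] > 0x20 && data[i] < 0x7F)) {
            return false;
        }
    }
    return true;
}

/* Same check with the deliberate bug: upper bound 0x70 instead of 0x7F. */
static bool check_mutated(const uint8_t *data, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (!(data[i] > 0x20 && data[i] < 0x70)) {
            return false;
        }
    }
    return true;
}

/* True when both checks return the same verdict for this input. */
static bool validators_agree(const uint8_t *data, size_t len)
{
    return check_reference(data, len) == check_mutated(data, len);
}

/* libFuzzer entry point: abort on any disagreement so the input is saved. */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
{
    assert(validators_agree(data, size));
    return 0;
}
```

In this sketch, any input containing a byte in 0x70..0x7E (for example 0x24 0x74) makes the two checks disagree, which is the kind of input the fuzzer has to discover on its own.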

  4. Make sure the crash is reproducible: sapi/fuzzer/php-fuzz-mbstring ./crash-41ea6e371a5bdbefc3e08e1453760b4bae1192c4. On your machine, substitute the name of the test unit file that was created there.

  5. The next step is very important: sapi/fuzzer/php-fuzz-mbstring -minimize_crash=1 -max_total_time=10 ./crash-41ea6e371a5bdbefc3e08e1453760b4bae1192c4

That will make the fuzzer keep trying to shrink the failing test case until it can't find a shorter input string which still crashes. This makes it far, far easier to analyze the crash and figure out the cause.

After minimization is done, the fuzzer will print the name of the file where the minimized case is stored.
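To make the goal of minimization concrete, here is a small sketch of a greedy shrinker: it deletes one byte at a time and keeps the deletion whenever the shorter input still fails. This is not libFuzzer's actual algorithm, only an illustration of what "shrink until nothing shorter still crashes" means. The names shrink_input, fails_fn and has_high_byte are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Predicate type: returns true when the input still triggers the failure. */
typedef bool (*fails_fn)(const uint8_t *data, size_t len);

/* Example predicate (hypothetical): "fails" if any byte is 0x70 or above. */
bool has_high_byte(const uint8_t *data, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (data[i] >= 0x70) {
            return true;
        }
    }
    return false;
}

/*
 * Greedy shrink, in place. Repeats full passes over the buffer until a pass
 * removes nothing. Returns the new length.
 */
size_t shrink_input(uint8_t *buf, size_t len, fails_fn still_fails)
{
    /* Precondition: the starting input must already fail. */
    assert(still_fails(buf, len));

    bool progress = true;
    while (progress) {
        progress = false;
        size_t i = 0;
        while (i < len) {
            uint8_t removed = buf[i];
            /* Delete byte i by shifting the tail one position left. */
            memmove(buf + i, buf + i + 1, len - i - 1);
            if (still_fails(buf, len - 1)) {
                len--;            /* keep the deletion, retry same index */
                progress = true;
            } else {
                /* Undo: shift the tail back and restore the byte. */
                memmove(buf + i + 1, buf + i, len - i - 1);
                buf[i] = removed;
                i++;
            }
        }
    }
    return len;
}
```

With the example predicate, the four bytes 0x24 0x22 0x74 0x41 shrink to the single byte 0x74, which points much more directly at the cause than the original input does.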

  6. Now you can run the minimized test case: sapi/fuzzer/php-fuzz-mbstring ./minimized-from-948a628e390ab73d4f56214a6efe4959d7b9f525. You can insert debug printfs and run it again, run it in a debugger and print the values of variables, and so on. Any technique which you would normally use to analyze a failing C program can be used.

You may find that your new mbstring C code is faulty, in which case you can fix the bug and add one or more regression test cases to ext/mbstring/tests. Or, if you are comparing the results of an old C function and the new version, the bug might be in the old code under comparison. In that case, you can still add a regression test case, and perhaps temporarily fix the old code so that fuzzing can continue. Or, it may be that the problem is in the fuzzer code. You might need to adjust the assertions in the fuzzer, or add an if clause which short-circuits the fuzzer.

As an example of "short-circuiting" the fuzzer, perhaps you know that your assertions will fail for encoding HTML-ENTITIES, but you don't care about that. Then you can add fuzzer code like:

#include "ext/mbstring/libmbfl/filters/mbfilter_htmlent.h"
if (Encoding == &mbfl_encoding_html_ent) {
  // free allocated memory first
  return 0;
}

That will prevent the fuzzer from testing any cases involving HTML-ENTITIES. You can do something similar for any known failures which you don't care about and don't want the fuzzer to report.
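The "free allocated memory first" comment matters because the fuzzer may already hold buffers when it decides to skip a case. The following is a self-contained sketch of that shape, assuming a simplified harness: struct demo_encoding, demo_encoding_html_ent, demo_encoding_jis, demo_cases_checked and demo_fuzz_one are all hypothetical stand-ins, not the real mbfl_encoding API.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an encoding descriptor (not mbfl_encoding). */
struct demo_encoding {
    const char *name;
};

const struct demo_encoding demo_encoding_html_ent = { "HTML-ENTITIES" };
const struct demo_encoding demo_encoding_jis = { "JIS" };

/* Counts inputs that reached the assertion stage, so skips are observable. */
unsigned long demo_cases_checked = 0;

/* Always returns 0, as a libFuzzer target does for an uninteresting input. */
int demo_fuzz_one(const struct demo_encoding *enc,
                  const uint8_t *data, size_t size)
{
    assert(enc != NULL);

    /* Work buffer allocated before we know whether the case is skipped. */
    uint8_t *copy = malloc(size + 1);
    if (copy == NULL) {
        return 0;
    }
    memcpy(copy, data, size);
    copy[size] = '\0';

    if (enc == &demo_encoding_html_ent) {
        /* Known failure we do not care about: release memory, then bail. */
        free(copy);
        return 0;
    }

    /* ... the real assertions on the converted or validated data go here ... */
    demo_cases_checked++;

    free(copy);
    return 0;
}
```

Skipping without the free would still return 0, but each skipped case would leak its buffer, and a leak detector enabled in the fuzzer build would then report failures unrelated to the code under test.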

  7. Once you have fixed the code, recompile and do this:

sapi/fuzzer/php-fuzz-mbstring ./crash-*
sapi/fuzzer/php-fuzz-mbstring ./minimized-*

This will help you make sure that the crash is actually fixed. Also, when minimizing a test case, the fuzzer sometimes stumbles on a different crash; in that case, running ./crash-* will show that the original crash is still there. If so, you can minimize it again.


This is how I do it, but maybe you will find a better way!

@pakutoma
Contributor Author

pakutoma commented Mar 27, 2023

@alexdowad
Thank you so much!
I am going to work on implementing check functions for the ISO-2022-JP variants in the future, so I will try to use it.

@Girgias
Member

Girgias commented Mar 27, 2023

@alexdowad could you maybe document those steps in the PHP Internals Book? As I always wondered how the fuzzer worked too!
