-
Notifications
You must be signed in to change notification settings - Fork 7.9k
mb_check_encoding() returns true for incorrect but interpretable ISO-2022-JP byte sequences #10648
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Yes, probably indeed. @alexdowad Would you look this issue? |
@pakutoma Thanks very much for raising this issue. This is a result of unifying the code for converting text encoding and checking the validity of text encoding. I would like to hear your opinion. What do you think What if an ISO-2022-JP string ends with something like |
Indeed, I think there are many situations where mb_check_encoding would be happier if it were true, since the $jis_bytes_without_esc example is simply missing the tail. I found this problem when I discovered the behavior that mb_check_encoding returns true when interpreting the byte sequence 0xb1 0xc7 in ISO-2022-JP. Alternatively, I think it would make sense if you could clarify the use of this function. |
@pakutoma Thanks for explaining the background of this report. I wouldn't say that Let me try to explain more about the background of my comments above. Perhaps what I am asking will make more sense that way. I understand that as a programmer and a user of mbstring, you have a very specific problem with PHP 8.1 right now. You are interested primarily in solving this specific problem. As the library maintainer, my view is a bit different. I am interested in helping you with your problem, but even more, I am interested in fixing all instances of the same problem, for all possible uses of mbstring. As library maintainer, I am also thinking: should I only adjust the behavior of I think that your issue is a real problem and should be fixed. Depending on how I fix it, it is possible that |
To be clear, my inclination here is to fix only |
@youkidearitai Any opinion on this? |
@alexdowad I read "insert an error marker" to mean insert MBFL_BAD_INPUT which is converted to "?". On the other hand, I think mb_check_encoding() looks good to return "false" if a broken escape sequence is passed. To make a short summary, I agree with you. |
@pakutoma OK, well, this sounds good. I saw you opened a PR today. Are you interested in doing some coding on mbstring? If you would like to work on this issue, I can give you some guidance on what to do. I think this problem should be fixed on PHP-8.2 first, merged down to |
Thanks for the great suggestions! I understand about the merge destination. I think we will start implementation only after the discussion is over, so I will try to learn about mbfilter until then. I first thought of doing |
@pakutoma Thanks for participating in mbstring development. These are the first things you should look at:
int mbfl_conv_function(int c, mbfl_convert_filter *filter) Each such function would take a single The problem with this is that we have to make one function call for each byte which is converted to Unicode, and one call for each Unicode codepoint which is converted back to bytes. This is very slow. The new conversion code takes an entire input string, fills a buffer with wchars (Unicode codepoints), then we convert the entire buffer to output bytes, then go back and refill the buffer with wchars again, etc. On average, this is about 3 times faster than the older libmbfl code.
Important question: Can you see what each of the arguments to
|
(You should also examine the definition of |
@alexdowad
The answer to your question seems to be good with the following:
And I also read Please let me know if I am understanding something incorrectly. |
@pakutoma Everything you said above is correct. I would just like to mention a small difference between a 'wchar' and a 'UTF-32 code unit'... while UTF-32LE code units are always stored in memory in little-endian byte order, and UTF-32BE code units are always stored in big-endian byte order... wchars are stored in the host byte order used by the current machine's CPU. There are other stateful encodings other than ISO-2022 variants. There is also HZ, UTF-7, UTF7-IMAP, CP50220, CP50221, CP50222... and there are a few other places where we do need to remember state between calls to I think there is still a way to make the new code recognize when an input string ends in an illegal state, without adding an extra It will mean adding a new member to The new member of
|
@alexdowad I was not fully aware of the By the way, just to clarify, in which |
Certainly, if you are able to learn more about the other encodings, that is better. I believe that the mechanism I proposed above will do what is needed, but if you discover some reason why it will not, I would love to hear about it.
I don't know if there is any relevant standard or specification which states when GR-invoked kana can legally appear in a ISO-2022-JP string. In any case, my experience with mbstring is that our Japanese users are very sensitive to any change of behavior which affects Japanese text encodings, so my policy now with such encodings is to maintain the existing behavior whenever possible, unless it is very clear that a certain behavior is wrong. For the issue you have raised here about Regarding GR-invoked kana, I would suggest that the most practical thing to do would be to check the behavior for PHP 8.0 and earlier. If there is any change in PHP 8.1/8.2, we can consider changing it back to restore compatibility with PHP 8.0. Incidentally, when refactoring the ISO-2022-JP code, I did accidentally break the handling of GR-invoked kana at one point, and because there were not enough unit tests covering this aspect, the problem was not noticed right away. Later, when I noticed it, I tried to restore the same behavior as PHP 8.0 and earlier. (See 8f84192.) |
I'm behind on updates, so I'll write what I think now.
Then we should fix the following behavior. If such a problem exists in many other encodings, other methods should be used to validate the encoding. If the problem is unique to this encoding, then it may be possible to solve the problem in a tricky way, such as adding a state to indicate that a GR-invoked kana has appeared. To choose between these two methods, I first need to read how mb_check_encoding judges ISO-2022-JP in PHP 8.0 and and check its criteria. Next, I also need to read what other stateful encodings it judges. |
Hmm! This is interesting. |
I have learned that the change in mb_check_encoding behavior was introduced by the following commit. And before the commit, characters with multiple corresponding codes (this includes Japanese halfwidth kana in ISO-2022-JP) were always judged There are two types of behavior changes that have occurred in mb_check_encoding as a result of this commit.
I think these issues should be treated separately, so I will consider only the former. |
@pakutoma Thank you for the thorough research!
Hmm, I had forgotten about this commit! I only remembered the later refactoring by which
True. Looking at the old implementation of It also appears illogical to always judge ISO-2022-JP with GR kana as 'invalid'; if that was true, then why do we accept and convert GR kana in Anyways, if there are logical reasons why we should judge some strings with GR kana as 'invalid', then certainly, it can be discussed.
For clarity, when you say "incorrect but interpretable" codes, are you referring specifically to GR kana in ISO-2022-JP? Are there other known examples? |
I think you are correct.
I am referring to GR kana in ISO-2022-JP. Also, I do not currently know of any other examples. |
Hmm. This is an interesting point. Not sure what to do about it right now. |
@pakutoma I do know of one other issue with "incorrect but interpretable" input for UTF-7 and UTF7-IMAP. It will be nice to fix that problem using the same mechanism as we use to fix this one. I think what I suggested earlier about using the ending
|
@alexdowad
I understand that the UTF-7 issue refers to the following change. I will try to implement the fix using the new 'check' function pointer. |
Previously, mbstring used the same logic for encoding validation as for encoding conversion. However, there are cases where we want to use different logic for validation and conversion. For example, if a string ends up with missing input required by the encoding, or if a character is input that is invalid as an encoding but can be converted, the conversion should succeed and the validation should fail. To achieve this, a function pointer mb_check_fn has been added to struct mbfl_encoding to implement the logic used for validation. Also, added implementation of validation logic for UTF-7, UTF7-IMAP, and JIS.
Previously, mbstring used the same logic for encoding validation as for encoding conversion. However, there are cases where we want to use different logic for validation and conversion. For example, if a string ends up with missing input required by the encoding, or if a character is input that is invalid as an encoding but can be converted, the conversion should succeed and the validation should fail. To achieve this, a function pointer mb_check_fn has been added to struct mbfl_encoding to implement the logic used for validation. Also, added implementation of validation logic for UTF-7, UTF7-IMAP, and JIS.
Previously, mbstring used the same logic for encoding validation as for encoding conversion. However, there are cases where we want to use different logic for validation and conversion. For example, if a string ends up with missing input required by the encoding, or if a character is input that is invalid as an encoding but can be converted, the conversion should succeed and the validation should fail. To achieve this, a function pointer mb_check_fn has been added to struct mbfl_encoding to implement the logic used for validation. Also, added implementation of validation logic for UTF-7, UTF7-IMAP, and JIS.
@alexdowad The pull request includes a new function pointer and implementations for UTF-7, UTF7-IMAP, and JIS using it.
I couldn't think of a case where I would use unsigned int *state, so I removed it. |
As also, Probably (easy reference)
|
Yes, I think that is probably true. |
Previously, mbstring used the same logic for encoding validation as for encoding conversion. However, there are cases where we want to use different logic for validation and conversion. For example, if a string ends up with missing input required by the encoding, or if a character is input that is invalid as an encoding but can be converted, the conversion should succeed and the validation should fail. To achieve this, a function pointer mb_check_fn has been added to struct mbfl_encoding to implement the logic used for validation. Also, added implementation of validation logic for UTF-7, UTF7-IMAP, and JIS.
Previously, mbstring used the same logic for encoding validation as for encoding conversion. However, there are cases where we want to use different logic for validation and conversion. For example, if a string ends up with missing input required by the encoding, or if a character is input that is invalid as an encoding but can be converted, the conversion should succeed and the validation should fail. To achieve this, a function pointer mb_check_fn has been added to struct mbfl_encoding to implement the logic used for validation. Also, added implementation of validation logic for UTF-7, UTF7-IMAP, and JIS.
Previously, mbstring used the same logic for encoding validation as for encoding conversion. However, there are cases where we want to use different logic for validation and conversion. For example, if a string ends up with missing input required by the encoding, or if a character is input that is invalid as an encoding but can be converted, the conversion should succeed and the validation should fail. To achieve this, a function pointer mb_check_fn has been added to struct mbfl_encoding to implement the logic used for validation. Also, added implementation of validation logic for UTF-7, UTF7-IMAP, and JIS.
Previously, mbstring used the same logic for encoding validationas for encoding conversion. However, there are cases where we want to use different logic for validation and conversion. For example, if a string ends up with missing input required by the encoding, or if a character is input that is invalid as an encoding but can be converted, the conversion should succeed and the validation should fail. To achieve this, a function pointer mb_check_fn has been added to struct mbfl_encoding to implement the logic used for validation. Also, added implementation of validation logic for UTF-7, UTF7-IMAP, and JIS.
Previously, mbstring used the same logic for encoding validationas for encoding conversion. However, there are cases where we want to use different logic for validation and conversion. For example, if a string ends up with missing input required by the encoding, or if a character is input that is invalid as an encoding but can be converted, the conversion should succeed and the validation should fail. To achieve this, a function pointer mb_check_fn has been added to struct mbfl_encoding to implement the logic used for validation. Also, added implementation of validation logic for UTF-7, UTF7-IMAP, and JIS.
Previously, mbstring used the same logic for encoding validationas for encoding conversion. However, there are cases where we want to use different logic for validation and conversion. For example, if a string ends up with missing input required by the encoding, or if a character is input that is invalid as an encoding but can be converted, the conversion should succeed and the validation should fail. To achieve this, a function pointer mb_check_fn has been added to struct mbfl_encoding to implement the logic used for validation. Also, added implementation of validation logic for UTF-7, UTF7-IMAP, ISO-2022-JP and JIS.
Previously, mbstring used the same logic for encoding validationas for encoding conversion. However, there are cases where we want to use different logic for validation and conversion. For example, if a string ends up with missing input required by the encoding, or if a character is input that is invalid as an encoding but can be converted, the conversion should succeed and the validation should fail. To achieve this, a function pointer mb_check_fn has been added to struct mbfl_encoding to implement the logic used for validation. Also, added implementation of validation logic for UTF-7, UTF7-IMAP, ISO-2022-JP and JIS.
Previously, mbstring used the same logic for encoding validationas for encoding conversion. However, there are cases where we want to use different logic for validation and conversion. For example, if a string ends up with missing input required by the encoding, or if a character is input that is invalid as an encoding but can be converted, the conversion should succeed and the validation should fail. To achieve this, a function pointer mb_check_fn has been added to struct mbfl_encoding to implement the logic used for validation. Also, added implementation of validation logic for UTF-7, UTF7-IMAP, ISO-2022-JP and JIS.
Previously, mbstring used the same logic for encoding validationas for encoding conversion. However, there are cases where we want to use different logic for validation and conversion. For example, if a string ends up with missing input required by the encoding, or if a character is input that is invalid as an encoding but can be converted, the conversion should succeed and the validation should fail. To achieve this, a function pointer mb_check_fn has been added to struct mbfl_encoding to implement the logic used for validation. Also, added implementation of validation logic for UTF-7, UTF7-IMAP, ISO-2022-JP and JIS.
Previously, mbstring used the same logic for encoding validationas for encoding conversion. However, there are cases where we want to use different logic for validation and conversion. For example, if a string ends up with missing input required by the encoding, or if a character is input that is invalid as an encoding but can be converted, the conversion should succeed and the validation should fail. To achieve this, a function pointer mb_check_fn has been added to struct mbfl_encoding to implement the logic used for validation. Also, added implementation of validation logic for UTF-7, UTF7-IMAP, ISO-2022-JP and JIS.
Previously, mbstring used the same logic for encoding validationas for encoding conversion. However, there are cases where we want to use different logic for validation and conversion. For example, if a string ends up with missing input required by the encoding, or if a character is input that is invalid as an encoding but can be converted, the conversion should succeed and the validation should fail. To achieve this, a function pointer mb_check_fn has been added to struct mbfl_encoding to implement the logic used for validation. Also, added implementation of validation logic for UTF-7, UTF7-IMAP, ISO-2022-JP and JIS.
Previously, mbstring used the same logic for encoding validationas for encoding conversion. However, there are cases where we want to use different logic for validation and conversion. For example, if a string ends up with missing input required by the encoding, or if a character is input that is invalid as an encoding but can be converted, the conversion should succeed and the validation should fail. To achieve this, a function pointer mb_check_fn has been added to struct mbfl_encoding to implement the logic used for validation. Also, added implementation of validation logic for UTF-7, UTF7-IMAP, ISO-2022-JP and JIS.
* PHP-8.2: Fix GH-10648: add check function pointer into mbfl_encoding
Previously, mbstring used the same logic for encoding validation as for encoding conversion. However, there are cases where we want to use different logic for validation and conversion. For example, if a string ends up with missing input required by the encoding, or if a character is input that is invalid as an encoding but can be converted, the conversion should succeed and the validation should fail. To achieve this, a function pointer mb_check_fn has been added to struct mbfl_encoding to implement the logic used for validation. Also, added implementation of validation logic for UTF-7, UTF7-IMAP, ISO-2022-JP and JIS. (The same change has already been made to PHP 8.2 and 8.3; see 6fc8d01. This commit is backporting the change to PHP 8.1.)
* PHP-8.1: Fix GH-10648: add check function pointer into mbfl_encoding
* PHP-8.2: Fix GH-10648: add check function pointer into mbfl_encoding
Description
Since PHP 8.1, mb_check_encoding returns true for many incorrect but interpretable ISO-2022-JP (JIS) byte sequences.
For example, IETF RFC 1468, often referenced as the definition of ISO-2022-JP, says "the text must end in ASCII." https://datatracker.ietf.org/doc/html/rfc1468
This means that an ISO-2022-JP byte sequence must end with the escape sequence 0x1b 0x28 0x42 to switch to ASCII.
However, mb_check_encoding() returns true without the escape sequence in PHP 8.1 and later.
The documentation says it returns true when "valid", but what should mb_check_encoding return in such a case?
https://www.php.net/manual/en/function.mb-check-encoding.php
3v4l:
https://3v4l.org/9i19F
The following code:
Resulted in this output:
But I expected this output instead:
PHP Version
PHP 8.1.16
Operating System
No response
The text was updated successfully, but these errors were encountered: