mb_check_encoding() returns true for incorrect but interpretable ISO-2022-JP byte sequences #10648

Closed
pakutoma opened this issue Feb 21, 2023 · 26 comments

Comments

@pakutoma
Contributor

Description

Since PHP 8.1, mb_check_encoding returns true for many incorrect but interpretable ISO-2022-JP (JIS) byte sequences.
For example, IETF RFC 1468, often referenced as the definition of ISO-2022-JP, says "the text must end in ASCII." https://datatracker.ietf.org/doc/html/rfc1468
This means that an ISO-2022-JP byte sequence must end with the escape sequence 0x1b 0x28 0x42 to switch to ASCII.
However, mb_check_encoding() returns true without the escape sequence in PHP 8.1 and later.

The documentation says it returns true when "valid", but what should mb_check_encoding return in such a case?
https://www.php.net/manual/en/function.mb-check-encoding.php

3v4l:
https://3v4l.org/9i19F

The following code:

<?php

$jis_bytes = '1b244224221b2842'; // 'あ' in ISO-2022-JP
$jis_bytes_without_esc = '1b24422422'; // 'あ' in ISO-2022-JP without escape sequence
var_dump(mb_check_encoding(hex2bin($jis_bytes), 'JIS'));
var_dump(mb_check_encoding(hex2bin($jis_bytes_without_esc), 'JIS'));

Resulted in this output:

bool(true)
bool(true)

But I expected this output instead:

bool(true)
bool(false)
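
The expectation above can be sketched in C as a small scanner that tracks the character set selected by the most recent escape sequence and reports whether the text ends in ASCII. This is only an illustration under stated assumptions: the name `ends_in_ascii` is hypothetical, only the four escape sequences listed in RFC 1468 are recognized, and nothing here is taken from the mbstring source.

```c
#include <stdbool.h>
#include <stddef.h>

/* Character sets selectable by the escape sequences listed in RFC 1468. */
enum charset { CS_ASCII, CS_JISX0201_ROMAN, CS_JISX0208 };

/* Hypothetical helper: true if the byte sequence ends in the ASCII state.
 * Truncated or unrecognized escape sequences make it return false. */
bool ends_in_ascii(const unsigned char *in, size_t len)
{
    enum charset cs = CS_ASCII; /* ISO-2022-JP text starts in ASCII */
    size_t i = 0;

    while (i < len) {
        if (in[i] != 0x1B) {
            /* JIS X 0208 characters take two bytes, the others one. */
            i += (cs == CS_JISX0208) ? 2 : 1;
            continue;
        }
        if (len - i < 3) {
            return false; /* escape sequence cut off at end of input */
        }
        unsigned char a = in[i + 1], b = in[i + 2];
        if (a == '(' && b == 'B') {
            cs = CS_ASCII;            /* ESC ( B */
        } else if (a == '(' && b == 'J') {
            cs = CS_JISX0201_ROMAN;   /* ESC ( J */
        } else if (a == '$' && (b == '@' || b == 'B')) {
            cs = CS_JISX0208;         /* ESC $ @ or ESC $ B */
        } else {
            return false;             /* not an RFC 1468 escape sequence */
        }
        i += 3;
    }
    return cs == CS_ASCII;
}
```

For the two byte sequences in the report, this sketch returns true for 1b 24 42 24 22 1b 28 42 and false for 1b 24 42 24 22.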

PHP Version

PHP 8.1.16

Operating System

No response

@youkidearitai
Contributor

> Also, the text must end in ASCII.

Yes, that is probably right. @alexdowad, would you take a look at this issue?

@alexdowad
Contributor

@pakutoma Thanks very much for raising this issue. This is a result of unifying the code for converting text encoding and checking the validity of text encoding.

I would like to hear your opinion. What do you think mb_convert_encoding should return when converting $jis_bytes_without_esc? Should it insert an error marker and increment mb_get_info('illegal_chars')? I think the answer is no.

What if an ISO-2022-JP string ends with something like "\x1B"? That is, an incomplete escape sequence. Should we insert an error marker and increment mb_get_info('illegal_chars')? Or should we ignore the incomplete trailing escape sequence for conversion, but still treat it as an error for validity checking?

@pakutoma
Contributor Author

pakutoma commented Feb 21, 2023

Indeed, I think there are many situations where it would be more convenient for mb_check_encoding to return true, since the $jis_bytes_without_esc example is simply missing the trailing escape sequence.

I found this problem when I noticed that mb_check_encoding returns true for the byte sequence 0xb1 0xc7 when it is interpreted as ISO-2022-JP.
https://3v4l.org/kMuBG
This byte sequence represents the character "映" in EUC-JP, but it does not appear to be a valid ISO-2022-JP string.
However, since each byte is a 1-byte character within the scope of JIS X 0201, the current judgement is "true".
I would like this to be judged as "false".
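
To see why these bytes are interpretable at all: JIS X 0201 assigns halfwidth katakana to the 8-bit range 0xA1 to 0xDF, which corresponds to U+FF61 through U+FF9F in Unicode. A minimal sketch in C follows; the function name `gr_kana_to_unicode` is made up for illustration and is not part of mbstring.

```c
#include <stdint.h>

/* Hypothetical helper: maps a GR-invoked JIS X 0201 kana byte (0xA1..0xDF)
 * to its Unicode halfwidth katakana codepoint (U+FF61..U+FF9F).
 * Returns 0 for bytes outside that range. */
uint32_t gr_kana_to_unicode(unsigned char b)
{
    if (b >= 0xA1 && b <= 0xDF) {
        return 0xFF61u + (uint32_t)(b - 0xA1);
    }
    return 0;
}
```

Under this mapping, 0xb1 becomes U+FF71 and 0xc7 becomes U+FF87, so a decoder that accepts GR-invoked kana sees two valid characters.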

Alternatively, I think it would make sense if you could clarify the use of this function.
If mb_check_encoding is a function to avoid conversions that result in abnormal byte sequences, then this judgment is also correct.

@alexdowad
Contributor

@pakutoma Thanks for explaining the background of this report.

I wouldn't say that mb_check_encoding is a function "to avoid conversions that result in abnormal byte sequences". Whether conversion results in an abnormal byte sequence or not, mb_check_encoding should simply tell us: is this sequence of bytes legal in this text encoding, or not?

Let me try to explain more about the background of my comments above. Perhaps what I am asking will make more sense that way.

I understand that as a programmer and a user of mbstring, you have a very specific problem with PHP 8.1 right now. You are interested primarily in solving this specific problem.

As the library maintainer, my view is a bit different. I am interested in helping you with your problem, but even more, I am interested in fixing all instances of the same problem, for all possible uses of mbstring.

As library maintainer, I am also thinking: should I only adjust the behavior of mb_check_encoding here? Is there any other function in mbstring which might also need to be changed? What else will be affected by the fix?

I think that your issue is a real problem and should be fixed. Depending on how I fix it, it is possible that mb_convert_encoding will also be affected; or, it is possible to fix this problem without making any changes to mb_convert_encoding.

@alexdowad
Contributor

To be clear, my inclination here is to fix only mb_check_encoding and mb_detect_encoding and to leave the behavior of mb_convert_encoding unchanged.

@alexdowad
Contributor

> What if an ISO-2022-JP string ends with something like "\x1B"? That is, an incomplete escape sequence. Should we insert an error marker and increment mb_get_info('illegal_chars')? Or should we ignore the incomplete trailing escape sequence for conversion, but still treat it as an error for validity checking?

@youkidearitai Any opinion on this?

@pakutoma
Contributor Author

pakutoma commented Feb 21, 2023

@alexdowad
Thank you for your detailed explanation.
I agree with you that libraries are used for many purposes and we need to be consistent between those purposes.
If I am not interrupting, let me also comment on this discussion.

I read "insert an error marker" to mean inserting MBFL_BAD_INPUT, which is converted to "?".
If that is correct, then people who use mb_convert_encoding() will see unexplained characters in their output and be confused.

On the other hand, I think it would be good for mb_check_encoding() to return "false" if a broken escape sequence is passed.
In my opinion, it is also acceptable to return "false" even if the escape sequence is completely missing.
I think it would also be useful for mb_detect_encoding() to check, as a heuristic, whether the correct escape sequences are used.

In short, I agree with you.

@alexdowad
Contributor

@pakutoma OK, well, this sounds good.

I saw you opened a PR today. Are you interested in doing some coding on mbstring? If you would like to work on this issue, I can give you some guidance on what to do.

I think this problem should be fixed on PHP-8.2 first, merged down to master (PHP-8.3), then finally, we can also fix it on PHP-8.1. This is because the code has changed in PHP-8.2 and the fix for PHP-8.1 will not apply to PHP-8.2.

@pakutoma
Contributor Author

pakutoma commented Feb 22, 2023

@alexdowad

> Are you interested in doing some coding on mbstring?

Thanks for the great suggestions!
I'd love to contribute to this issue!

I understand about the merge destination.
I also understand that the active branch for PHP 8.2 is PHP-8.2.
I will send a PR there.

I think we will start implementation only after the discussion is over, so I will try to learn about mbfilter until then.

I first thought of doing filter->num_illegalchar++; when the input is inconsistent with the current filter->status (or when input is missing at flush time).
I thought that this way I could make mb_check_encoding fail without affecting mb_convert_encoding.
However, such code does not seem to be used elsewhere.
There seems to be a mechanism I am unaware of for handling these issues.

@alexdowad
Contributor

@pakutoma Thanks for participating in mbstring development. These are the first things you should look at:

  1. First, check out the latest PHP-8.2. You will find that in PHP-8.2, new code for converting text encodings was added. The original libmbfl conversion filters had an interface like:
int mbfl_conv_function(int c, mbfl_convert_filter *filter)

Each such function would take a single int, which would represent either a byte value or a Unicode codepoint value. It would convert that single value, pass another value to filter->output_function, then generally return 0.

The problem with this is that we have to make one function call for each byte which is converted to Unicode, and one call for each Unicode codepoint which is converted back to bytes. This is very slow.

The new conversion code takes an entire input string, fills a buffer with wchars (Unicode codepoints), then we convert the entire buffer to output bytes, then go back and refill the buffer with wchars again, etc. On average, this is about 3 times faster than the older libmbfl code.

  2. See the interface for the new conversion functions in mbfl_encoding.h. Pick any specific encoding to see an example of their implementation; perhaps you might look at mbfilter_jis.c, since you seem to be interested in ISO-2022-JP.

Important question: Can you see what each of the arguments to mb_to_wchar_fn and mb_from_wchar_fn are for?

  3. For an example of code using the new conversion functions, go to the bottom of mbfl_convert.c and see mb_fast_convert. Make sure you understand the usage of the mb_to_wchar_fn and mb_from_wchar_fn.

  4. Now see the implementation of mb_check_encoding. The key part is in mbstring.c, in the function php_mb_check_encoding.

  5. Ask questions about anything which you don't understand so far.
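
The buffered approach described in the first step can be sketched as follows. This is a simplified illustration, not code from php-src: `latin1_to_wchar`, `count_non_ascii`, and the `CHUNK` size are hypothetical, and the decoder signature is only loosely modeled on the description in this thread. The point is that one call decodes a whole chunk into a codepoint buffer, instead of one call per byte.

```c
#include <stddef.h>
#include <stdint.h>

#define CHUNK 128 /* size of the intermediate codepoint buffer (arbitrary) */

/* Toy decoder for Latin-1: fills buf with up to buf_len codepoints,
 * advancing the caller's input pointer and remaining length. */
static size_t latin1_to_wchar(const unsigned char **in, size_t *in_len,
                              uint32_t *buf, size_t buf_len)
{
    size_t n = 0;
    while (*in_len > 0 && n < buf_len) {
        buf[n++] = *(*in)++; /* in Latin-1, byte value == codepoint */
        (*in_len)--;
    }
    return n;
}

/* Decodes the input chunk by chunk and counts codepoints >= 0x80.
 * The buffer is refilled until the input is exhausted. */
size_t count_non_ascii(const unsigned char *in, size_t in_len)
{
    uint32_t buf[CHUNK];
    size_t count = 0;

    while (in_len > 0) {
        size_t n = latin1_to_wchar(&in, &in_len, buf, CHUNK);
        for (size_t i = 0; i < n; i++) {
            if (buf[i] >= 0x80) {
                count++;
            }
        }
    }
    return count;
}
```

A real converter would pass each filled buffer to a second function that encodes the codepoints into output bytes; the counting loop stands in for that step here.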

@alexdowad
Contributor

(You should also examine the definition of mb_convert_buf in mbfl_encoding.h. Note, this is another reason for the speed of the new conversion code; the older libmbfl code would emit output bytes into a malloc'd buffer, and then copy that buffer to create a zend_string; the new code directly builds a zend_string, so no copy is required.)

@pakutoma
Contributor Author

pakutoma commented Feb 22, 2023

@alexdowad
Thank you so much!
I now have a better understanding of how the new encoding conversion works.

> Can you see what each of the arguments to mb_to_wchar_fn and mb_from_wchar_fn are for?

I think the answer to your question is as follows:

mb_to_wchar_fn
unsigned char **in: start position of input byte sequence.
size_t *in_len: number of bytes remaining in the input string.
uint32_t *out: an output buffer to store intermediate string in wchar, possibly UTF-32.
size_t out_len: length of the output buffer.
unsigned int *state: current state of the conversion function.
return value of mb_to_wchar_fn: length of intermediate string. It will always be less than or equal to out_len.
The reason in and in_len are passed as pointers is for copy reduction. mb_to_wchar_fn directly advances the caller's starting position and directly reduces the number of remaining bytes.
The same strategy cannot be used for out and out_len, because the final length of the intermediate string cannot be known from the input byte sequence.
It seems that state is used in stateful encodings such as ISO-2022-JP to preserve its state. It does not appear to be used in other encodings (I checked only UTF-8).

mb_from_wchar_fn
uint32_t *in: same as mb_to_wchar_fn's out buffer.
size_t in_len: same as mb_to_wchar_fn's return value.
mb_convert_buf *out: holds the output string, the state, the error count, the replacement character (? or another), and a flag to show replacement characters.
bool end: flag indicating that this call is the end of the entire intermediate string.
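
The role of in, in_len, and state described above can be illustrated with a toy stateful decoder. This is a sketch under stated assumptions: `toy_to_wchar` is hypothetical, and the toy encoding only understands Shift Out (0x0E) and Shift In (0x0F) for 7-bit JIS X 0201 kana, where bytes 0x21 to 0x5F map to U+FF61 through U+FF9F. Because the state lives in the caller's variable, it survives even when the output buffer fills up between calls.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy decoder: *state is 0 for ASCII and 1 after Shift Out (kana).
 * Advances *in and decrements *in_len directly, and returns the number
 * of codepoints written to out (never more than out_len). */
size_t toy_to_wchar(const unsigned char **in, size_t *in_len,
                    uint32_t *out, size_t out_len, unsigned int *state)
{
    size_t n = 0;

    while (*in_len > 0 && n < out_len) {
        unsigned char b = *(*in)++;
        (*in_len)--;

        if (b == 0x0E) {
            *state = 1;                            /* Shift Out: kana */
        } else if (b == 0x0F) {
            *state = 0;                            /* Shift In: ASCII */
        } else if (*state == 1 && b >= 0x21 && b <= 0x5F) {
            out[n++] = 0xFF61u + (uint32_t)(b - 0x21);
        } else {
            out[n++] = b;
        }
    }
    return n;
}
```

Calling it repeatedly with a one-element output buffer on the bytes 0e 31 47 0f 41 yields U+FF71, U+FF87, and U+0041 in turn, with the caller's state variable carrying the shift status from one call to the next.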

And I also read php_mb_check_encoding. It is very plain and easy to read compared to PHP 8.1.
By understanding this much, I think I finally understand what you have been thinking.
It seems to me that the current implementation in PHP 8.2 does not allow changing the result of mb_check_encoding without adding MBFL_BAD_INPUT. It is not just a problem that mb_check_encoding counts only MBFL_BAD_INPUT. It also means that there is no interface other than MBFL_BAD_INPUT to notify that mb_to_wchar_fn is in a bad state when interpreting byte sequences. It would take a major change to make mb_check_encoding return the expected result without changing the result of mb_convert_encoding.
If I understand correctly, apparently I was about to do something very serious...

Please let me know if I am understanding something incorrectly.
Also, my English is poor and I rely on machine translation for some parts, so please feel free to point out anything that is difficult to understand.

@alexdowad
Contributor

@pakutoma Everything you said above is correct. I would just like to mention a small difference between a 'wchar' and a 'UTF-32 code unit'... while UTF-32LE code units are always stored in memory in little-endian byte order, and UTF-32BE code units are always stored in big-endian byte order... wchars are stored in the host byte order used by the current machine's CPU.
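
The byte-order distinction can be made concrete with a short C sketch (the helper names are hypothetical). UTF-32BE and UTF-32LE fix the byte layout regardless of the machine, while copying a `uint32_t` out of memory gives whichever layout the host CPU uses.

```c
#include <stdint.h>
#include <string.h>

/* Serializes a codepoint as UTF-32BE: most significant byte first. */
void put_utf32be(uint32_t cp, unsigned char out[4])
{
    out[0] = (unsigned char)(cp >> 24);
    out[1] = (unsigned char)(cp >> 16);
    out[2] = (unsigned char)(cp >> 8);
    out[3] = (unsigned char)cp;
}

/* Serializes a codepoint as UTF-32LE: least significant byte first. */
void put_utf32le(uint32_t cp, unsigned char out[4])
{
    out[0] = (unsigned char)cp;
    out[1] = (unsigned char)(cp >> 8);
    out[2] = (unsigned char)(cp >> 16);
    out[3] = (unsigned char)(cp >> 24);
}

/* Copies the in-memory representation of a codepoint, i.e. host order. */
void put_host_order(uint32_t cp, unsigned char out[4])
{
    memcpy(out, &cp, 4);
}
```

For U+3042, `put_utf32be` always produces 00 00 30 42 and `put_utf32le` always produces 42 30 00 00; `put_host_order` matches one or the other depending on the machine.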

There are stateful encodings besides the ISO-2022 variants. There is also HZ, UTF-7, UTF7-IMAP, CP50220, CP50221, CP50222... and there are a few other places where we do need to remember state between calls to mb_to_wchar_fn, such as some of the SJIS variants.

I think there is still a way to make the new code recognize when an input string ends in an illegal state, without adding an extra MBFL_BAD_INPUT error marker to the output.

It will mean adding a new member to struct mbfl_encoding. The new member should be a function pointer which takes unsigned int state as input, and returns a bool telling the caller if state is valid for string ending or not.

The new member of struct mbfl_encoding can be left NULL for non-stateful encodings. In mb_check_encoding, when the end of the input string is reached, check if the new function pointer is non-NULL, and if so, call it and see if it returns true or false.
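
A rough sketch of this proposal in C, with loud caveats: `struct sketch_encoding`, `end_state_ok`, and the convention that state 0 means ASCII are all assumptions made for illustration, and the real struct mbfl_encoding has many more members.

```c
#include <stdbool.h>
#include <stddef.h>

/* Proposed shape: takes the decoder state at end of input and says
 * whether a string may legally end in that state. */
typedef bool (*end_state_ok_fn)(unsigned int state);

/* Hypothetical, heavily simplified stand-in for struct mbfl_encoding. */
struct sketch_encoding {
    const char *name;
    end_state_ok_fn end_state_ok; /* NULL for non-stateful encodings */
};

/* Assumption for this sketch: state 0 is the ASCII state. */
static bool jis_end_state_ok(unsigned int state)
{
    return state == 0;
}

const struct sketch_encoding sketch_jis  = { "JIS",   jis_end_state_ok };
const struct sketch_encoding sketch_utf8 = { "UTF-8", NULL };

/* To be called once the whole input has been decoded without errors. */
bool final_state_is_valid(const struct sketch_encoding *enc,
                          unsigned int state)
{
    return enc->end_state_ok == NULL || enc->end_state_ok(state);
}
```

With this shape, a non-stateful encoding pays only for a NULL check, while a stateful one can reject input that stops in the middle of a non-ASCII mode.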

mb_detect_encoding will also have to be updated.

@pakutoma
Contributor Author

pakutoma commented Feb 23, 2023

@alexdowad
Thanks for the detailed explanation about wchar, the world of C is mysterious...
And thanks for the advice on other stateful encodings. It looks like I will need to know how the other encodings work in order to create a mechanism that works with all of them.

I was not fully aware of the state. Surely, this is available from outside the function, so it can be used in other functions.
I have considered what can be detected by using this variable, taking ISO-2022-JP as an example. First, missing or broken escape sequences at the end can obviously be detected. On the other hand, for Shift Out 0x0E and Shift In 0x0F, using 0x0F instead of the last escape sequence is valid as ISO-2022-JP. However, this should not be a problem since such a byte sequence should rarely appear.

It seems difficult to determine that GR-invoked kana is invalid when it appears in ASCII state. No matter which state the decoder is in, such a byte is always uniquely interpreted as half-width kana because its MSB is 1. If I were to change the state when GR-invoked kana appeared, the result of mb_convert_encoding would also change.
I feel we need to check on other encodings as well.

By the way, just to clarify, in which state is it correct for GR-invoked kana to appear?
I thought the answer to this was JIS X 0201 latin and JIS X 0201 kana, but after the discussion in #10651, I am not so sure about this idea. Should we allow GR-invoked kana to appear in ASCII state in ISO-2022-JP?
As I have remarked in the past, I have been complaining since PHP 8.1 that mb_check_encoding returns true when GR-invoked kana appears in an ISO-2022-JP byte sequence. However, if that is the correct implementation of ISO-2022-JP I can accept it.
By clarifying this, I think I can focus on more important issues.

@alexdowad
Contributor

> @alexdowad Thanks for the detailed explanation about wchar, the world of C is mysterious... And thanks for the advice on other stateful encodings. It looks like I will need to know how the other encodings work in order to create a mechanism that works with all of them.

Certainly, if you are able to learn more about the other encodings, that is better. I believe that the mechanism I proposed above will do what is needed, but if you discover some reason why it will not, I would love to hear about it.

> By the way, just to clarify, in which state is it correct for GR-invoked kana to appear? I thought the answer to this was JIS X 0201 latin and JIS X 0201 kana, but after the discussion in #10651, I am not so sure about this idea. Should we allow GR-invoked kana to appear in ASCII state in ISO-2022-JP? As I have remarked in the past, I have been complaining since PHP 8.1 that mb_check_encoding returns true when GR-invoked kana appears in an ISO-2022-JP byte sequence. However, if that is the correct implementation of ISO-2022-JP I can accept it. By clarifying this, I think I can focus on more important issues.

I don't know if there is any relevant standard or specification which states when GR-invoked kana can legally appear in an ISO-2022-JP string. In any case, my experience with mbstring is that our Japanese users are very sensitive to any change of behavior which affects Japanese text encodings, so my policy now with such encodings is to maintain the existing behavior whenever possible, unless it is very clear that a certain behavior is wrong.

For the issue you have raised here about mb_check_encoding on ISO-2022-JP strings, I think we are restoring backwards compatibility with PHP 8.0. The Japanese users should be happy about that.

Regarding GR-invoked kana, I would suggest that the most practical thing to do would be to check the behavior for PHP 8.0 and earlier. If there is any change in PHP 8.1/8.2, we can consider changing it back to restore compatibility with PHP 8.0.

Incidentally, when refactoring the ISO-2022-JP code, I did accidentally break the handling of GR-invoked kana at one point, and because there were not enough unit tests covering this aspect, the problem was not noticed right away. Later, when I noticed it, I tried to restore the same behavior as PHP 8.0 and earlier. (See 8f84192.)

@pakutoma
Contributor Author

I'm behind on updates, so I'll write what I think now.

> In any case, my experience with mbstring is that our Japanese users are very sensitive to any change of behavior which affects Japanese text encodings, so my policy now with such encodings is to maintain the existing behavior whenever possible, unless it is very clear that a certain behavior is wrong.

Then we should fix the following behavior.
https://3v4l.org/CeDbY
However, as mentioned previously, this is difficult to detect using the exit state, because GR-invoked kana is valid in ASCII state.

If such a problem exists in many other encodings, other methods should be used to validate the encoding. If the problem is unique to this encoding, then it may be possible to solve the problem in a tricky way, such as adding a state to indicate that a GR-invoked kana has appeared.

To choose between these two methods, I first need to read how mb_check_encoding judges ISO-2022-JP in PHP 8.0 and check its criteria. Next, I also need to read how it judges other stateful encodings.
This will take a little more time.

@alexdowad
Contributor

> Then we should fix the following behavior. https://3v4l.org/CeDbY However, as mentioned previously, this is difficult to detect using the exit state, because GR-invoked kana is valid in ASCII state.

Hmm! This is interesting.

@pakutoma
Contributor Author

pakutoma commented Mar 5, 2023

I have learned that the change in mb_check_encoding behavior was introduced by the following commit.
be1a215

Before that commit, for characters with multiple corresponding codes (this includes Japanese halfwidth kana in ISO-2022-JP), only certain codes were accepted, so GR-invoked kana was always judged false.
https://3v4l.org/BMNs0
Since ISO-2022-JP is a 7-bit encoding, the behavior of judging 8-bit codes as false is not a bug. Therefore, I think this change in behavior should be discussed.
However, this change in behavior seems to be completely different from the bug I am raising here.
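
Treating ISO-2022-JP as a strictly 7-bit encoding amounts to rejecting any byte with the high bit set. A minimal sketch of that idea in C; `is_7bit` is a hypothetical name and the code is not taken from the old mbstring implementation.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if no byte in the input has its most significant bit set. */
bool is_7bit(const unsigned char *in, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (in[i] & 0x80) {
            return false; /* 8-bit byte, e.g. GR-invoked kana */
        }
    }
    return true;
}
```

Under such a rule, the bytes 0xb1 0xc7 are rejected, while any sequence made only of escape sequences and 7-bit codes passes this particular test.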

There are two types of behavior changes that have occurred in mb_check_encoding as a result of this commit.

  1. Byte sequences terminated with an incorrect state are marked as true.
  2. "Incorrect but interpretable" codes are marked as true.

I think these issues should be treated separately, so I will consider only the former.

@alexdowad
Contributor

@pakutoma Thank you for the thorough research!

> I have learned that the change in mb_check_encoding behavior was introduced by the following commit. be1a215

Hmm, I had forgotten about this commit! I only remembered the later refactoring by which mbfl_identify_filters were eliminated.

> Before that commit, for characters with multiple corresponding codes (this includes Japanese halfwidth kana in ISO-2022-JP), only certain codes were accepted, so GR-invoked kana was always judged false. https://3v4l.org/BMNs0 Since ISO-2022-JP is a 7-bit encoding, the behavior of judging 8-bit codes as false is not a bug. Therefore, I think this change in behavior should be discussed. However, this change in behavior seems to be completely different from the bug I am raising here.

True.

Looking at the old implementation of mb_check_encoding (before be1a215), it appears to me that the way it would always reject GR kana was not intentional. However, only the original author would know for sure.

It also appears illogical to always judge ISO-2022-JP with GR kana as 'invalid'; if that was true, then why do we accept and convert GR kana in mb_convert_encoding?

Anyways, if there are logical reasons why we should judge some strings with GR kana as 'invalid', then certainly, it can be discussed.

> There are two types of behavior changes that have occurred in mb_check_encoding as a result of this commit.
>
>   1. Byte sequences terminated with an incorrect state are marked as true.
>   2. "Incorrect but interpretable" codes are marked as true.
>
> I think these issues should be treated separately, so I will consider only the former.

For clarity, when you say "incorrect but interpretable" codes, are you referring specifically to GR kana in ISO-2022-JP? Are there other known examples?

@pakutoma
Contributor Author

pakutoma commented Mar 6, 2023

> It also appears illogical to always judge ISO-2022-JP with GR kana as 'invalid'; if that was true, then why do we accept and convert GR kana in mb_convert_encoding?

I think you are correct.
I am probably just sticking to the old behavior of this function.
The reason is that ISO-2022-JP is mainly used today for sending e-mail that needs to be compatible with older clients, and only 7-bit codes are available for e-mail. In that case, if 8-bit codes are mixed in, unintended decoding will occur. The old behavior is a convenient way to avoid this.

> For clarity, when you say "incorrect but interpretable" codes, are you referring specifically to GR kana in ISO-2022-JP? Are there other known examples?

I am referring to GR kana in ISO-2022-JP. Also, I do not currently know of any other examples.

@alexdowad
Contributor

> It also appears illogical to always judge ISO-2022-JP with GR kana as 'invalid'; if that was true, then why do we accept and convert GR kana in mb_convert_encoding?

> I think you are correct. I am probably just sticking to the old behavior of this function. The reason is that ISO-2022-JP is mainly used today for sending e-mail that needs to be compatible with older clients, and only 7-bit codes are available for e-mail. In that case, if 8-bit codes are mixed in, unintended decoding will occur. The old behavior is a convenient way to avoid this.

Hmm. This is an interesting point. Not sure what to do about it right now.

@alexdowad
Contributor

@pakutoma I do know of one other issue with "incorrect but interpretable" input for UTF-7 and UTF7-IMAP. It will be nice to fix that problem using the same mechanism as we use to fix this one.

I think what I suggested earlier about using the ending state value is not flexible enough. Instead, it may be better to add a new 'check' function pointer into struct mbfl_encoding. The signature could be like:

typedef bool (*mb_check_fn)(unsigned char *in, size_t in_len, unsigned int *state);

@pakutoma
Contributor Author

pakutoma commented Mar 7, 2023

@alexdowad
Thank you!

> one other issue with "incorrect but interpretable" input for UTF-7 and UTF7-IMAP.

I understand that the UTF-7 issue refers to the following change.
https://3v4l.org/TJRZF (as reported on #10192)

I will try to implement the fix using the new 'check' function pointer.
It's going to take some time, but I'll work on it for a week or so!

pakutoma added a commit to pakutoma/php-src that referenced this issue Mar 10, 2023
Previously, mbstring used the same logic for encoding validation as for encoding conversion. However, there are cases where we want to use different logic for validation and conversion. For example, if a string ends up with missing input required by the encoding, or if a character is input that is invalid as an encoding but can be converted, the conversion should succeed and the validation should fail. To achieve this, a function pointer mb_check_fn has been added to struct mbfl_encoding to implement the logic used for validation. Also, added implementation of validation logic for UTF-7, UTF7-IMAP, and JIS.
@pakutoma
Contributor Author

@alexdowad
I created a pull request to solve this issue.
#10828

The pull request includes a new function pointer and implementations for UTF-7, UTF7-IMAP, and JIS using it.

typedef bool (*mb_check_fn)(unsigned char *in, size_t in_len);

I couldn't think of a case where I would use unsigned int *state, so I removed it.
I would appreciate a review if you would be so kind.
Thank you in advance.
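
For readers following along, here is one way a per-encoding check function with the signature above might be wired in. Everything except the `mb_check_fn` typedef is an assumption made for illustration: `struct toy_encoding`, the toy check functions, and the dispatch order are hypothetical and are not taken from the pull request.

```c
#include <stdbool.h>
#include <stddef.h>

/* Signature as given in this comment. */
typedef bool (*mb_check_fn)(unsigned char *in, size_t in_len);

/* Hypothetical, simplified stand-in for struct mbfl_encoding. */
struct toy_encoding {
    const char *name;
    mb_check_fn check; /* NULL: fall back to conversion-based validation */
};

/* Toy strict validator: accepts only 7-bit bytes. */
static bool toy_7bit_check(unsigned char *in, size_t in_len)
{
    for (size_t i = 0; i < in_len; i++) {
        if (in[i] & 0x80) {
            return false;
        }
    }
    return true;
}

/* Stand-in for "decode and look for error markers"; in this sketch it
 * accepts everything, like a decoder that can interpret any byte. */
static bool toy_convert_based_check(unsigned char *in, size_t in_len)
{
    (void)in;
    (void)in_len;
    return true;
}

const struct toy_encoding toy_strict  = { "toy-strict",  toy_7bit_check };
const struct toy_encoding toy_lenient = { "toy-lenient", NULL };

/* Validation uses the dedicated check when the encoding provides one. */
bool toy_check_encoding(const struct toy_encoding *enc,
                        unsigned char *in, size_t in_len)
{
    if (enc->check != NULL) {
        return enc->check(in, in_len);
    }
    return toy_convert_based_check(in, in_len);
}
```

The design point is that conversion stays untouched: only validation consults the extra function, so input that is interpretable but incorrect can still be converted while failing the check.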

@youkidearitai
Contributor

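Also, ISO-2022-JP-2004 text probably must end in ASCII as well (I haven't read the specification properly yet). 3v4l here

(easy reference)
ISO-2022-JP-2004 ‐ 通信用語の基礎知識

> Before a line break (CR/LF), the text must always be switched back to ASCII.
> Before the end of the data, the text must also always be switched back to ASCII.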

@pakutoma
Contributor Author

pakutoma commented Mar 15, 2023

Yes, I think that is probably true.
Also for other ISO-2022-JP variants (ISO-2022-JP-MS, CP5022X, ISO-2022-JP-MOBILE#KDDI) we will need to create check functions.
https://3v4l.org/6enim
The same goes for the other stateful encodings.
It is a daunting task...

pakutoma added a commit to pakutoma/php-src that referenced this issue Mar 15, 2023
Previously, mbstring used the same logic for encoding validation as for encoding conversion. However, there are cases where we want to use different logic for validation and conversion. For example, if a string ends up with missing input required by the encoding, or if a character is input that is invalid as an encoding but can be converted, the conversion should succeed and the validation should fail. To achieve this, a function pointer mb_check_fn has been added to struct mbfl_encoding to implement the logic used for validation. Also, added implementation of validation logic for UTF-7, UTF7-IMAP, and JIS.
pakutoma added a commit to pakutoma/php-src that referenced this issue Mar 16, 2023
Previously, mbstring used the same logic for encoding validation as for encoding conversion. However, there are cases where we want to use different logic for validation and conversion. For example, if a string ends up with missing input required by the encoding, or if a character is input that is invalid as an encoding but can be converted, the conversion should succeed and the validation should fail. To achieve this, a function pointer mb_check_fn has been added to struct mbfl_encoding to implement the logic used for validation. Also, added implementation of validation logic for UTF-7, UTF7-IMAP, and JIS.
pakutoma added a commit to pakutoma/php-src that referenced this issue Mar 17, 2023
Previously, mbstring used the same logic for encoding validation as for encoding
conversion.
However, there are cases where we want to use different logic for validation
and conversion. For example, if a string ends up with missing input
required by the encoding, or if a character is input that is invalid
as an encoding but can be converted, the conversion should succeed and
the validation should fail.
To achieve this, a function pointer mb_check_fn has been added to
struct mbfl_encoding to implement the logic used for validation.
Also, added implementation of validation logic for UTF-7, UTF7-IMAP, and JIS.
pakutoma added a commit to pakutoma/php-src that referenced this issue Mar 17, 2023
Previously, mbstring used the same logic for encoding validation as for encoding
conversion.
However, there are cases where we want to use different logic for validation
and conversion. For example, if a string ends up with missing input
required by the encoding, or if a character is input that is invalid
as an encoding but can be converted, the conversion should succeed and
the validation should fail.
To achieve this, a function pointer mb_check_fn has been added to
struct mbfl_encoding to implement the logic used for validation.
Also, added implementation of validation logic for UTF-7, UTF7-IMAP,
ISO-2022-JP and JIS.
pakutoma added a commit to pakutoma/php-src that referenced this issue Mar 19, 2023
Previously, mbstring used the same logic for encoding validation as for encoding
conversion.
However, there are cases where we want to use different logic for validation
and conversion. For example, if a string ends up with missing input
required by the encoding, or if a character is input that is invalid
as an encoding but can be converted, the conversion should succeed and
the validation should fail.
To achieve this, a function pointer mb_check_fn has been added to
struct mbfl_encoding to implement the logic used for validation.
Also, added implementation of validation logic for UTF-7, UTF7-IMAP,
ISO-2022-JP and JIS.
pakutoma added a commit to pakutoma/php-src that referenced this issue Mar 21, 2023
Previously, mbstring used the same logic for encoding validation as for encoding
conversion.
However, there are cases where we want to use different logic for validation
and conversion. For example, if a string ends up with missing input
required by the encoding, or if a character is input that is invalid
as an encoding but can be converted, the conversion should succeed and
the validation should fail.
To achieve this, a function pointer mb_check_fn has been added to
struct mbfl_encoding to implement the logic used for validation.
Also, added implementation of validation logic for UTF-7, UTF7-IMAP,
ISO-2022-JP and JIS.
pakutoma added a commit to pakutoma/php-src that referenced this issue Mar 23, 2023
Previously, mbstring used the same logic for encoding validation as for encoding
conversion.
However, there are cases where we want to use different logic for validation
and conversion. For example, if a string ends up with missing input
required by the encoding, or if a character is input that is invalid
as an encoding but can be converted, the conversion should succeed and
the validation should fail.
To achieve this, a function pointer mb_check_fn has been added to
struct mbfl_encoding to implement the logic used for validation.
Also, added implementation of validation logic for UTF-7, UTF7-IMAP,
ISO-2022-JP and JIS.
alexdowad added a commit that referenced this issue Mar 24, 2023
* PHP-8.2:
  Fix GH-10648: add check function pointer into mbfl_encoding
alexdowad pushed a commit that referenced this issue Mar 25, 2023
Previously, mbstring used the same logic for encoding validation as for
encoding conversion.

However, there are cases where we want to use different logic for validation
and conversion. For example, if a string ends up with missing input
required by the encoding, or if a character is input that is invalid
as an encoding but can be converted, the conversion should succeed and
the validation should fail.

To achieve this, a function pointer mb_check_fn has been added to
struct mbfl_encoding to implement the logic used for validation.
Also, added implementation of validation logic for UTF-7, UTF7-IMAP,
ISO-2022-JP and JIS.

(The same change has already been made to PHP 8.2 and 8.3; see
6fc8d01. This commit is backporting the change to PHP 8.1.)
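
The idea described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the actual mbstring code: `struct encoding_sketch`, `check_fn_sketch`, `check_jis_sketch` and `validate_sketch` are hypothetical names, and the check only understands the `ESC ( B` (ASCII) and `ESC $ B` (JIS X 0208) escape sequences. It shows how a per-encoding validation hook can reject input that a converter could still interpret, such as text that does not end in ASCII.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical validation hook; a stand-in for the function pointer the
 * commit adds to struct mbfl_encoding, not its real declaration. */
typedef bool (*check_fn_sketch)(const unsigned char *in, size_t len);

/* Hypothetical, heavily reduced encoding descriptor. */
struct encoding_sketch {
    const char *name;
    check_fn_sketch check; /* NULL: no dedicated validation logic */
};

/* Simplified strict check for an ISO-2022-JP-like stream.
 * Assumption: only ESC ( B and ESC $ B are recognized here. */
static bool check_jis_sketch(const unsigned char *in, size_t len)
{
    bool ascii = true; /* the stream starts in ASCII mode */
    size_t i = 0;

    while (i < len) {
        if (in[i] == 0x1B) {
            /* An escape sequence needs two more bytes. */
            if (len - i < 3) return false;
            if (in[i + 1] == '(' && in[i + 2] == 'B') ascii = true;
            else if (in[i + 1] == '$' && in[i + 2] == 'B') ascii = false;
            else return false;
            i += 3;
        } else if (ascii) {
            if (in[i] > 0x7F) return false;
            i += 1;
        } else {
            /* Double-byte mode: both bytes must be in 0x21..0x7E. */
            if (len - i < 2) return false;
            if (in[i] < 0x21 || in[i] > 0x7E) return false;
            if (in[i + 1] < 0x21 || in[i + 1] > 0x7E) return false;
            i += 2;
        }
    }
    /* RFC 1468: the text must end in ASCII. */
    return ascii;
}

/* Use the dedicated check when the encoding provides one. */
static bool validate_sketch(const struct encoding_sketch *enc,
                            const unsigned char *in, size_t len)
{
    if (enc->check != NULL) {
        return enc->check(in, len);
    }
    /* Placeholder for the older conversion-based validation path. */
    return true;
}

static const struct encoding_sketch jis_sketch = { "JIS", check_jis_sketch };
```

With the two byte strings from the original report, `validate_sketch(&jis_sketch, ...)` accepts `1b 24 42 24 22 1b 28 42` and rejects `1b 24 42 24 22`, because the second one never switches back to ASCII.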
alexdowad added a commit that referenced this issue Mar 25, 2023
* PHP-8.1:
  Fix GH-10648: add check function pointer into mbfl_encoding
alexdowad added a commit that referenced this issue Mar 25, 2023
* PHP-8.2:
  Fix GH-10648: add check function pointer into mbfl_encoding