mb_detect_encoding is more accurate on strings with UTF-8/16 BOM #10373

alexdowad · 2023-01-18T19:32:59Z

Thanks to the GitHub user 'titanz35' for pointing out that the new implementation of mb_detect_encoding had poor detection accuracy on UTF-8 and UTF-16 strings with a byte-order mark.

This relates to #7871. That issue is primarily about the detection accuracy of mb_detect_encoding on strings which contain emoji, but one commenter also pointed out a problem with strings that start with a byte-order mark. Once the emoji issue is also worked out, then #7871 can be closed.
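For readers unfamiliar with byte-order marks, the following is a minimal sketch of what recognizing one looks like. It is not the code from this patch; `bom_encoding` is a hypothetical helper written only for illustration, and how the detector actually weighs a BOM against the rest of the string is not shown here.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of mbstring): return the name of the
 * encoding implied by a leading byte-order mark, or NULL if the buffer
 * does not start with a UTF-8 or UTF-16 BOM.
 *
 * Caveat: FF FE followed by 00 00 is also the UTF-32LE BOM; this sketch
 * only covers the UTF-8/16 cases discussed in this PR.
 */
static const char *bom_encoding(const unsigned char *s, size_t len)
{
    if (len >= 3 && s[0] == 0xEF && s[1] == 0xBB && s[2] == 0xBF) {
        return "UTF-8";    /* U+FEFF encoded as UTF-8 */
    }
    if (len >= 2 && s[0] == 0xFF && s[1] == 0xFE) {
        return "UTF-16LE"; /* U+FEFF, little-endian */
    }
    if (len >= 2 && s[0] == 0xFE && s[1] == 0xFF) {
        return "UTF-16BE"; /* U+FEFF, big-endian */
    }
    return NULL;
}
```

A BOM is only a hint: the bytes FF FE are also a valid pair of characters in single-byte encodings such as ISO-8859-1, so a detector would presumably treat the BOM as strong evidence rather than proof.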

@cmb69 @Girgias @nikic @kamil-tekiela @youkidearitai

Girgias

Seems to make sense, should this be backported to PHP 8.1?

alexdowad · 2023-01-18T20:46:50Z

> Seems to make sense, should this be backported to PHP 8.1?

That's a good idea.

youkidearitai · 2023-01-18T21:11:26Z

Looks good to me.

alexdowad · 2023-01-19T06:30:58Z

Hmm, the patch doesn't apply on PHP-8.1 because the code has been refactored on PHP-8.2...

alexdowad · 2023-01-19T06:34:25Z

Hmm, it doesn't even apply on PHP-8.2.

alexdowad · 2023-01-19T06:44:30Z

Merging into PHP 8.3 for now. Applying the same fix to 8.1 and 8.2 might be done later (though it's more difficult to do this with the legacy code).

mb_detect_encoding is more accurate on strings with UTF-8/16 BOM

2655b39

github-actions bot added the Extension: mbstring label Jan 18, 2023

Girgias reviewed Jan 18, 2023

alexdowad closed this Jan 19, 2023

alexdowad deleted the bom branch January 19, 2023 06:44
