mb_detect_encoding() detects UTF-8 emoji byte sequence as ISO-8859-1 since PHP 8.1 #7871
Comments
Hi, @filecage, and thanks very much for this report. Hope to look into it very soon. |
Hmm. One point to start: Legacy versions of PHP return The string is actually valid under both encodings; if interpreted as ISO-8859-1, it's an accented "o", followed by an accented "Y", followed by a Yen sign, followed by a superscript digit 3. A funny combination, but not invalid.
The current detection code treats all codepoints higher than U+FFFF as "rare", and penalizes candidate encodings under which the string contains such codepoints. That is what is making it prefer ISO-8859-1 over UTF-8. @nikic Do you think I need to teach it that some emoji over U+FFFF are not "rare"?
Another point... no matter what we do, encoding detection will always be very, very unreliable for short strings. If you want to auto-detect text encoding with any level of reliability, you really need to give it more input text to work with.
Even so, let us think a bit about this issue of emoji. It is a fact that some emoji are used by a lot of people, and they do occupy ranges of Unicode codepoints over U+FFFF. |
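To make the "valid under both encodings" point concrete, here is a minimal sketch that inspects the four bytes of U+1F973 (the party emoji from the report). It only uses `mb_check_encoding()`, `mb_strlen()` and `mb_convert_encoding()`; the variable names are illustrative. One detail worth noting: strict ISO-8859-1 maps byte 0x9F to the control character U+009F, so the "accented Y" reading corresponds to Windows-1252, which is what the sketch uses for display.

```php
<?php
// UTF-8 encoding of U+1F973 (party emoji).
$bytes = "\xF0\x9F\xA5\xB3";

// The same four bytes pass validation in both candidate encodings.
$validUtf8   = mb_check_encoding($bytes, 'UTF-8');
$validLatin1 = mb_check_encoding($bytes, 'ISO-8859-1');

// One character as UTF-8, four characters as a single-byte encoding.
$lengthUtf8   = mb_strlen($bytes, 'UTF-8');
$lengthLatin1 = mb_strlen($bytes, 'ISO-8859-1');

// Re-interpret the bytes as Windows-1252 and convert to UTF-8 for display.
$asCp1252 = mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');

echo bin2hex($bytes), "\n"; // f09fa5b3
var_dump($validUtf8, $validLatin1, $lengthUtf8, $lengthLatin1);
echo $asCp1252, "\n";
```

Since validation succeeds either way, a detector has to fall back on judging which decoded character sequence looks more plausible.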
Thanks a lot @alexdowad for looking into this!
When I saw that the output was different, I did not think that previous versions might be accidentally returning the correct results.
I haven't yet had a chance to look deeply into it but I could imagine that if we're somehow able to reliably detect emojis in a string, it's a very strong indicator for a string to be UTF-8, even for short strings. It might not be worth the effort though. |
I have the same problem, if text starts with emoji, it's not detected as UTF-8 (I am using Docker php:8.1.6-fpm) |
I have a similar problem too, with a custom test suite in PHPUnit that throws an error. The test checks whether mb_detect_encoding finds the correct encoding for UTF-8 file content with a BOM header. You can see the simplified test here : The problem is present in PHP 8.1 and 8.2 too. |
@titanz35's problem has nothing to do with emoji; rather, it's about detecting encoding for strings which use a BOM. I have just prepared a fix for that. |
@filecage, I am just amending Do you think that qualifies as a 'common' and thus expected character to be found in input strings for |
@alexdowad according to unicode, U+1F973 is categorized as basic emoji: https://unicode.org/Public/emoji/15.0/emoji-sequences.txt When it comes to what I'd expect from Just out of personal interest, what would be a good reason not to include all unicode standardized emojis? Is it that it would slow down the detection itself or that we have no reliable way of having an exhaustive list of existing emojis? Or is it something else? |
@filecage The reason would be both to keep the detection code tight and fast and also to reduce the chances of wrongly detecting some random byte sequence as 'emoji'. |
@filecage Thanks for sharing that link. The file states that there are 1386 'basic emoji'. Hmm... That's about 1% of the space from U+0000 up to U+1FFFF.
Thanks for sharing those thoughts. Here's the thing: "detecting" the encoding of some random string of bytes which one finds on the Internet is actually impossible. ( Given that fact, if we see that a sequence of bytes decodes to U+1F600 GRINNING FACE in one possible text encoding, but decodes to U+1F9FF NAZAR AMULET in another possible text encoding, would it really make sense to say that both choices are "equally likely"? (Both are included in the list of "basic emoji"!) If we wanted to make I think there may be another misunderstanding here, and I have probably contributed to it. You talked about For example, please see https://3v4l.org/gVQXJ.
For your test case, the reason why (As it turns out, encoding detection is extremely problematic when the ISO-8859-X encodings are involved. That is because the strongest signal which helps us to "detect" the correct encoding is when the input bytes do not form a valid string in most of the candidate encodings. However, every sequence of bytes is always a valid string in ISO-8859-1, so our strongest test is completely ineffective.) Hopefully all of this helps you understand that the answer is not "all or nothing". We still do need to determine what emoji are more likely to appear in |
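The remark that validity checking is ineffective against ISO-8859-1 can be checked empirically. The following is a small sketch, not part of mbstring: `countValid()` is a hypothetical helper that feeds random byte strings to `mb_check_encoding()` and counts how many are accepted.

```php
<?php
// Hypothetical helper: count how many of $trials random byte strings
// of $length bytes are accepted as valid text in $encoding.
function countValid(string $encoding, int $trials = 200, int $length = 16): int
{
    $valid = 0;
    for ($i = 0; $i < $trials; $i++) {
        if (mb_check_encoding(random_bytes($length), $encoding)) {
            $valid++;
        }
    }
    return $valid;
}

// ISO-8859-1 assigns a character to every byte value, so nothing is rejected.
echo 'ISO-8859-1: ', countValid('ISO-8859-1'), " of 200 valid\n";
// UTF-8 has strict multi-byte rules, so random bytes are almost always rejected.
echo 'UTF-8:      ', countValid('UTF-8'), " of 200 valid\n";
```

Because ISO-8859-1 never fails validation, it stays in the candidate pool for every input, and only the scoring heuristics can rule it out.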
@filecage ...So we still come back to the question of whether U+1F973 should be on our "common" list, and I guess your answer is probably "yes"... I should have mentioned, you can find this list in |
I have some of the same issues. But find it hard to reason out, why/when the function returns what I want and when it does not. I get that the old way was broken, but why/when mb_detect_encoding returns UTF-8 or not in the two cases here is beyond me. This is mixed, but would expect $text = 'abcde æøå æøå ð Ð ≗ 䎇 顃 ‰';
echo mb_detect_encoding($text, ['UTF-8', 'CP1252', 'ISO-8859-1'], true); Do I remove the last $text = 'abcde æøå æøå ð Ð ≗ 䎇 顃 ‰';
echo mb_detect_encoding($text, ['UTF-8', 'CP1252', 'ISO-8859-1'], true); What I would expect was |
@mrbase, I will be happy to have a look at these test cases as soon as I find some time and explain exactly what
Well, that's not quite the way to think about it. You need to think of the string as a sequence of bytes, not characters. Until we know how a string is encoded, it is not meaningful to talk about "characters". It's just raw bytes. To guess the encoding, we need to look at what those bytes might mean if they were interpreted as UTF-8 text, CP1252 text, or ISO-8859-1 text. Sure, when interpreted as UTF-8, that string might include characters not in the Latin charset. But that doesn't mean anything. When interpreted as ISO-8859-1, it will most definitely not include characters outside the Latin charset. |
@mrbase Just one more point... if you want to provide more information about what you are actually doing, I may (or may not) be able to suggest a better way to achieve your goal. If your interest is just in |
@alexdowad Yeah, I get it - it is not easy at all. We have a system that runs on both Windows and Linux, and historically all files have been CP1252 encoded. On Linux you can (in most cases) run Also we use it to find the encoding on csv/txt/.. import files we do not know the encoding of. And yes, it would be super to have a better understanding of how it works, but I also think you are right in, that we should not use it as we do. |
@mrbase Thanks for sharing more info. I haven't analyzed your test cases yet, but first, if UTF-8 is the "default" encoding which most of your input files are encoded in, I would like to suggest you first try If it's not valid in your 'default' encoding, then using If you want to make things work as well as possible for your users, you could provide a button which they can click if their file contents look 'corrupted'. (For example, say you guessed that the file was the default CP1252, but it's actually UTF-8.) When they click that button, show them a snippet of text from their file when interpreted as UTF-8, CP1252, ISO-8859-1, or whatever other text encodings you support. Let them click on the one which looks right. |
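A rough sketch of the workflow described above, under stated assumptions: the function name elided from the comment is assumed here to be `mb_check_encoding()`, and both helper functions (`encodingOrNull()`, `encodingPreviews()`) are hypothetical names introduced only for illustration.

```php
<?php
// Hypothetical helper: return $default if the bytes are valid in it,
// otherwise null so the caller can guess or ask the user.
function encodingOrNull(string $bytes, string $default = 'UTF-8'): ?string
{
    return mb_check_encoding($bytes, $default) ? $default : null;
}

// Hypothetical helper: build one UTF-8 preview per candidate encoding,
// for a "click the one that looks right" dialog.
// Note: substr() cuts on a byte boundary, so the last character of a
// multi-byte preview may be damaged; that is acceptable for a preview.
function encodingPreviews(string $bytes, array $candidates, int $maxBytes = 80): array
{
    $snippet = substr($bytes, 0, $maxBytes);
    $previews = [];
    foreach ($candidates as $encoding) {
        $previews[$encoding] = mb_convert_encoding($snippet, 'UTF-8', $encoding);
    }
    return $previews;
}

$bytes = "\xE6\xF8\xE5"; // "æøå" stored as CP1252 / ISO-8859-1
if (encodingOrNull($bytes) === null) {
    foreach (encodingPreviews($bytes, ['UTF-8', 'Windows-1252', 'ISO-8859-1']) as $name => $text) {
        echo $name, ': ', $text, "\n";
    }
}
```

The design choice is that automatic guessing only happens after the cheap, reliable check has failed, and a human makes the final call when the guess looks wrong.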
@mrbase If you have any ideas of heuristics which could be used to distinguish between CP1252 and ISO-8859-1, I am very open to adding those. |
In our old test case we have the following comment, don't know if it helps or not:
Using |
OK, I'll add your test case to the official test cases for the PHP interpreter.
I am suggesting that you use |
Fantastic, will have to restructure some of our code, but with the pointers I guess we are close :) |
Are you allowed to share more of the test case code or not? If so, and we add it to the official test cases for PHP, it will help to ensure that future development on mbstring does not break your application. |
@mrbase Thanks very much... Hmm. Well, on Take, for example, the final U+2030 PER MILLE SIGN (‰) character in the test files. In UTF-8, that's 0xE280B0. When interpreted as CP1252, the same bytes represent the three characters â€°. The current implementation of And I don't know if I can really fault it. Neither a bare "‰" nor the nonsense string "â€°" really looks like natural text. If I, a human, was given those bytes with no other context, I would probably guess that they were UTF-8, simply because they are valid as UTF-8 and most text on the Internet nowadays is UTF-8. So... does that mean that we should give UTF-8 a 'bonus' so that I am also open to giving a 'bonus' to the first candidate encoding in the provided array. Again, it's just a matter of working out the details. |
Giving UTF-8 per se a bonus might work today, but maybe a new charset will come up in the future and take the lead. Changing an implicit charset which gets a bonus (and deciding at which point in time that change is made) is a big BC break.
I think this makes more sense, as the consumer of the API is in control and can make better decisions for the use case at hand than PHP alone can for the whole ecosystem |
@alexdowad But to go back to your original answer to @filecage, regarding returning UTF-8 and not ISO. Why would you not return the first match, when the list of test charsets is prioritized? Is it because some (in this case UTF-8) characters are also valid ISO characters? I might be talking out of my ass here, the topic is a little over my head :) |
The documentation for If the string is only valid in one of the provided candidate encodings, then definitely, we return that one. If it's valid in more than one candidate encoding, then we have to guess which is the "most likely character encoding".
Again, when we don't know the encoding of a string, it's just a sequence of bytes. And yes, the same sequence of bytes may encode one sequence of characters as UTF-8, and encode a different sequence of characters as ISO-8859-1. We need to look at those two sequences of characters and guess which one is "right". This is, of course, an impossible job, so all we can do is try to return results which are usually what the user intended. In this case, we need to guess whether the user wants ‰ or whether they want â€°. If you just want to pick the "first match" in the list, then it is better to call |
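For readers who do want plain "first match" semantics, a minimal sketch follows. The function named at the end of the comment above was lost from this copy of the thread; the sketch assumes `mb_check_encoding()` and wraps it in a hypothetical helper, `firstValidEncoding()`.

```php
<?php
// Hypothetical helper: return the first candidate in which the bytes are
// valid, in the caller's order of preference, or null if none match.
function firstValidEncoding(string $bytes, array $candidates): ?string
{
    foreach ($candidates as $encoding) {
        if (mb_check_encoding($bytes, $encoding)) {
            return $encoding;
        }
    }
    return null;
}

$candidates = ['UTF-8', 'CP1252', 'ISO-8859-1'];

// E2 80 B0 is valid UTF-8 (U+2030), so the first candidate wins.
var_dump(firstValidEncoding("\xE2\x80\xB0", $candidates));

// E6 F8 E5 is not valid UTF-8, so the next candidate is returned.
var_dump(firstValidEncoding("\xE6\xF8\xE5", $candidates));
```

Unlike `mb_detect_encoding()`, this never weighs plausibility: list order alone decides between encodings that both accept the input.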
@alexdowad Yeah - that's what I figured :) |
@mrbase You are very welcome. Thanks also for supporting the PHP project by providing feedback to the developers. |
@alexdowad Have been thinking about this a while.
Simplified I know, but conceptually ... I guess the score could be the same, if a string can be both |
@mrbase Thanks for the comment. It is unlikely that two candidate encodings will have exactly the same score. In a few cases it will happen, but not much. You used an interesting expression... "weight". If we treat the order of supplied encodings as "weight", that could mean (for example) that if 3 encodings are listed, we multiply the score for the 2nd one by 1.05 and the 3rd one by 1.1, or something along those lines. The big question is how heavily the order should affect the results of the operation. Many users simply pass |
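The order-as-weight idea can be sketched outside of mbstring. This is purely illustrative: `pickWeighted()` is a hypothetical function, the penalty scores are made-up inputs (lower is better), and nothing here reflects how the extension actually computes or stores its scores.

```php
<?php
// Hypothetical sketch of order-as-weight.
// $penalties maps encoding name => penalty score (lower is better), listed
// in the caller's order of preference. The candidate at position n has its
// penalty multiplied by (1 + $step * n): 1.0, 1.05, 1.1, ...
function pickWeighted(array $penalties, float $step = 0.05): ?string
{
    $best = null;
    $bestScore = INF;
    $position = 0;
    foreach ($penalties as $encoding => $penalty) {
        $weighted = $penalty * (1.0 + $step * $position);
        if ($weighted < $bestScore) {
            $best = $encoding;
            $bestScore = $weighted;
        }
        $position++;
    }
    return $best;
}

// A near-tie goes to the encoding listed first...
var_dump(pickWeighted(['UTF-8' => 100.0, 'ISO-8859-1' => 98.0]));
// ...but a clear difference in score still overrides the order.
var_dump(pickWeighted(['UTF-8' => 100.0, 'ISO-8859-1' => 50.0]));
```

The `$step` parameter is exactly the open question raised above: the larger it is, the more the caller's ordering dominates the heuristic score.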
If I read the mb_detect_order docs correctly, the order/weight is first is highest - so I would prefer the same to be true for the encodings supplied :) But you are correct, just supplying the full list does not make much sense.. Would it be an idea to compare the I'm all for the weighted solution tho :) |
@iluuu1994 or @arnaud-lb would probably know more about those things than I do :/ |
@alexdowad It depends on whether you'd like to share the value in the same request, across all requests, or even across processes. Either way,
The first option is probably enough. Is this function particularly slow? Or what's the motivation behind it? |
@iluuu1994 The actual motivation is so that a cheap pointer equality check can be used to check whether a list of text encodings passed in to |
@iluuu1994 What's the best way to free the memory used by such an array on module shutdown? |
@alexdowad You don't have to do anything special, if you set the |
@iluuu1994 Thanks! I seem to be making some progress here. However, something seems to be dawning on me here... Is it correct that the "immutable array" feature is only to be used for arrays which user code will never attempt to modify? Because with my current test code, this causes an interpreter crash on shutdown...
@alexdowad It should be fine to "mutate" immutable arrays. The point is that the This means, after using I think, if you only share the value in the same process, you could also just not make the array immutable and set the RC to 2. This should also prevent it from being deallocated, while keeping refcounting. |
@iluuu1994 Thanks for that. I need to dig in more and see if I can figure out why this crash is happening. It looks like on The persistent string is freed from zend_vm_execute.h:42445:
If that analysis is true, then I need to figure out why the array is not being copied. If you have time to have a look, this is my test code: alexdowad@c8cf129 |
@alexdowad Here's a simpler implementation that only uses a request-bound allocation. master...iluuu1994:php-src:mbstring-shared-mb_list_encodings |
@iluuu1994 Wow, thanks! Now that I think about it, this will also achieve the same purpose... we can still tell if |
@alexdowad Yes. You can just compare |
So I don't forget, you'll also need to remove |
Thank you very much!! I totally would have missed that. |
I have just opened a PR which adds @mrbase's test case to the official test suite. |
@alexdowad Is this issue resolved or not? |
@Girgias It's resolved. Thanks. |
Description
I have a piece of code that tries to normalise the encoding of incoming strings using mb_detect_encoding(). While upgrading to PHP 8.1, I've noticed that a test covering a URL-encoded UTF-8 sequence (party hat emoji) now fails. It all comes down to a behaviour change of mb_detect_encoding() when passing UTF-8 and ISO-8859-1 (in whichever order) as $encodings in PHP 8.1.
I know that the way this method works (or the way that determining the encoding of a string works at all) cannot be 100% reliable, so I'd also agree on not classifying this as a bug. However, it is an undocumented behaviour change introduced in 8.1 that might break existing code as it did with mine.
I assume that this change has been introduced with 28b346b.
Example
See https://3v4l.org/RgdfE
Resulted in this output:
But I expected this output instead:
PHP Version
8.1.1
Operating System
No response