
mb_detect_encoding() detects UTF-8 emoji byte sequence as ISO-8859-1 since PHP 8.1 #7871


Closed
filecage opened this issue Jan 2, 2022 · 53 comments

Comments

@filecage

filecage commented Jan 2, 2022

Description

I have a piece of code that tries to normalise the encoding of incoming strings using mb_detect_encoding(). While upgrading to PHP 8.1, I've noticed that a test which ensures that a urlencoded UTF-8 sequence (party hat emoji) is detected correctly now fails. It all comes down to a behaviour change of mb_detect_encoding() in PHP 8.1 when passing UTF-8 and ISO-8859-1 (in whichever order) as $encodings.

I know that the way this method works (or the way that determining the encoding of a string works at all) cannot be 100% reliable, so I'd also agree on not classifying this as a bug. However, it is an undocumented behaviour change introduced in 8.1 that might break existing code, as it did with mine.

I assume that this change has been introduced with 28b346b.

Example

See https://3v4l.org/RgdfE

<?php
echo mb_detect_encoding('🥳', ['UTF-8', 'ISO-8859-1']);

Resulted in this output:

ISO-8859-1

But I expected this output instead:

UTF-8

PHP Version

8.1.1

Operating System

No response

@alexdowad
Contributor

Hi, @filecage, and thanks so much for this report. Hope to look into it very soon.

@alexdowad
Contributor

Hmm. One point to start:

Legacy versions of PHP return UTF-8, not because they are smart enough to tell that the string is actually UTF-8 text, but because you put UTF-8 first in the list of candidate encodings. If you put ISO-8859-1 first, then all versions return ISO-8859-1.

The string is actually valid under both encodings; if interpreted as ISO-8859-1, it's an accented "o", followed by an accented "Y", followed by a Yen sign, followed by a superscript digit 3. A funny combination, but not invalid.

The current detection code treats all codepoints higher than U+FFFF as "rare", and penalizes candidate encodings under which the string contains such codepoints. That is what is making it prefer ISO-8859-1 over UTF-8.

@nikic Do you think I need to teach it that some emoji over U+FFFF are not "rare"?

Another point... no matter what we do, encoding detection will always be very, very unreliable for short strings. If you want to auto-detect text encoding with any level of reliability, you really need to give it more input text to work with.

Even so, let us think a bit about this issue of emoji. It is a fact that some emoji are used by a lot of people, and they do occupy ranges of Unicode codepoints over U+FFFF.
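To make the "valid under both encodings" point concrete, here is a small sketch (my own illustration, not part of the original report) that reinterprets the emoji's four UTF-8 bytes as ISO-8859-1:

<?php
// The party-face emoji U+1F973 as raw UTF-8 bytes.
$bytes = "\xF0\x9F\xA5\xB3";
var_dump(mb_check_encoding($bytes, 'UTF-8'));             // bool(true)
var_dump(mb_check_encoding($bytes, 'ISO-8859-1'));        // bool(true) -- also a valid single-byte string
echo mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');  // four unrelated single-byte characters, not an emoji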

@filecage
Author

filecage commented Jan 4, 2022

Thanks a lot @alexdowad for looking into this!

Legacy versions of PHP return UTF-8, not because they are smart enough to tell that the string is actually UTF-8 text, but because you put UTF-8 first in the list of candidate encodings. If you put ISO-8859-1 first, then all versions return ISO-8859-1.

When I saw that the output was different, I did not think that previous versions might be accidentally returning the correct results.

Even so, let us think a bit about this issue of emoji. It is a fact that some emoji are used by a lot of people, and they do occupy ranges of Unicode codepoints over U+FFFF.

I haven't yet had a chance to look deeply into it but I could imagine that if we're somehow able to reliably detect emojis in a string, it's a very strong indicator for a string to be UTF-8, even for short strings. It might not be worth the effort though.

@jurihahn

I have the same problem: if the text starts with an emoji, it's not detected as UTF-8 (I am using Docker php:8.1.6-fpm).

@titanz35

titanz35 commented Sep 6, 2022

I have a similar problem too, with a custom test suite in PHPUnit that throws an error.

The test checks whether mb_detect_encoding finds the correct encoding for UTF-8 file content with a BOM header.

You can see the simplified test here:
https://3v4l.org/83vWZ

The problem is present in PHP 8.1 and 8.2.

@alexdowad
Contributor

@titanz35's problem has nothing to do with emoji; rather, it's about detecting encoding for strings which use a BOM. I have just prepared a fix for that.

@alexdowad
Contributor

@filecage, I am just amending mb_detect_encoding so it accepts the most basic emoji like "smiling face" and "frowning face". But I have seen that your test case is actually U+1F973 FACE WITH PARTY HORN AND PARTY HAT.

Do you think that qualifies as a 'common' and thus expected character to be found in input strings for mb_detect_encoding?

@filecage
Author

@alexdowad according to Unicode, U+1F973 is categorized as a basic emoji: https://unicode.org/Public/emoji/15.0/emoji-sequences.txt

When it comes to what I'd expect from mb_detect_encoding(), I would probably go with: either all or nothing. I know that's a tough one, but I assume that people (including myself) won't expect mb_detect_encoding() to work with some emoji, but not all.

Just out of personal interest, what would be a good reason not to include all Unicode-standardized emoji? Is it that it would slow down the detection itself, or that we have no reliable way of having an exhaustive list of existing emoji? Or is it something else?

@alexdowad
Contributor

@filecage The reason would be both to keep the detection code tight and fast and also to reduce the chances of wrongly detecting some random byte sequence as 'emoji'.

@alexdowad
Contributor

@filecage Thanks for sharing that link. The file states that there are 1386 'basic emoji'. Hmm... That's about 1% of the space from U+0000 up to U+1FFFF.

When it comes to what I'd expect from mb_detect_encoding(), I would probably go with: either all or nothing. I know that's a tough one, but I assume that people (including myself) won't expect mb_detect_encoding() to work with some emoji, but not all.

Thanks for sharing those thoughts.

Here's the thing: "detecting" the encoding of some random string of bytes which one finds on the Internet is actually impossible. (mb_detect_encoding is a misnomer; it should have been called mb_guess_encoding.) The process is entirely probabilistic, and we can only ever say that this sequence of bytes looks more likely to be a UTF-8 string than an ISO-8859-1 string, or vice versa.

Given that fact, if we see that a sequence of bytes decodes to U+1F600 GRINNING FACE in one possible text encoding, but decodes to U+1F9FF NAZAR AMULET in another possible text encoding, would it really make sense to say that both choices are "equally likely"? (Both are included in the list of "basic emoji"!)

If we wanted to make mb_detect_encoding smarter, we could teach it that this codepoint is in the top 100 most commonly used, but that one is only in the top 10,000. The biggest obstacle to doing this right now is that we would need a good corpus of text (in various languages) to derive the frequencies from. For the moment, we have just divided the entire body of all Unicode codepoints into two categories: "rare" and "common" (1 bit of information per codepoint).

I think there may be another misunderstanding here, and I have probably contributed to it. You talked about mb_detect_encoding "working with some, but not all" emoji, but to describe the function as "working" or "not working" on such characters is not really accurate. Currently, mb_detect_encoding's heuristics cause it to treat strings containing emoji as "less likely" than those not containing emoji, but if all other possible interpretations of the same bytes are even more unlikely, then mb_detect_encoding will pick the emoji.

For example, please see https://3v4l.org/gVQXJ.

<?php
echo mb_detect_encoding("🥳", ['UTF-32BE', 'UTF-16BE', 'UTF-16LE', 'UTF-8', 'SJIS', 'EUC-KR', 'BIG-5']), "\n";

For your test case, the reason why mb_detect_encoding picks ISO-8859-1 is because 1) the string is actually valid in ISO-8859-1, and 2) it decodes entirely to "common" codepoints under that interpretation.

(As it turns out, encoding detection is extremely problematic when the ISO-8859-X encodings are involved. That is because the strongest signal which helps us to "detect" the correct encoding is when the input bytes do not form a valid string in most of the candidate encodings. However, every sequence of bytes is always a valid string in ISO-8859-1, so our strongest test is completely ineffective.)

Hopefully all of this helps you understand that the answer is not "all or nothing". We still do need to determine what emoji are more likely to appear in mb_detect_encoding's input, and which are less likely to appear. I strongly feel that some of the "basic" emoji, like U+1F316 WANING GIBBOUS MOON, should be deemed less likely.
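As a side note, the "ISO-8859-1 is always valid" problem described above is easy to demonstrate (a sketch, not part of the original comment):

<?php
// Validity alone can never rule out ISO-8859-1: every possible byte
// sequence decodes to *something* under it.
var_dump(mb_check_encoding(random_bytes(32), 'ISO-8859-1'));  // bool(true), always
var_dump(mb_check_encoding(random_bytes(32), 'UTF-8'));       // almost certainly bool(false)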

@alexdowad
Contributor

@filecage ...So we still come back to the question of whether U+1F973 should be on our "common" list, and I guess your answer is probably "yes"...

I should have mentioned, you can find this list in ext/mbstring/common_codepoints.txt, in the PHP source tree.

@mrbase

mrbase commented Mar 9, 2023

I have some of the same issues, but I find it hard to reason out why/when the function returns what I want and when it does not.

I get that the old way was broken, but why/when mb_detect_encoding returns UTF-8 or not in the two cases here is beyond me.

This string is mixed, but I would expect UTF-8. (https://3v4l.org/96Wl1)

$text = 'abcde æøå æøå ð Ð ≗ 䎇 顃 ‰';
echo mb_detect_encoding($text, ['UTF-8', 'CP1252', 'ISO-8859-1'], true);

If I remove the last character from the text, the detection returns UTF-8 in all versions. (https://3v4l.org/Y7BQH)

$text = 'abcde æøå æøå ð Ð ≗ 䎇 顃';
echo mb_detect_encoding($text, ['UTF-8', 'CP1252', 'ISO-8859-1'], true);

What I would expect is UTF-8 when the string contains characters outside the Windows/Latin charset, but ...

@alexdowad
Contributor

I have some of the same issues, but I find it hard to reason out why/when the function returns what I want and when it does not.

@mrbase, I will be happy to have a look at these test cases as soon as I find some time and explain exactly what mb_detect_encoding is doing and why.

What I would expect is UTF-8 when the string contains characters outside the Windows/Latin charset, but ...

Well, that's not quite the way to think about it. You need to think of the string as a sequence of bytes, not characters.

Until we know how a string is encoded, it is not meaningful to talk about "characters". It's just raw bytes. To guess the encoding, we need to look at what those bytes might mean if they were interpreted as UTF-8 text, CP1252 text, or ISO-8859-1 text.

Sure, when interpreted as UTF-8, that string might include characters not in the Latin charset. But that doesn't mean anything. When interpreted as ISO-8859-1, it will most definitely not include characters outside the Latin charset.

@alexdowad
Contributor

@mrbase Just one more point... if you want to provide more information about what you are actually doing, I may (or may not) be able to suggest a better way to achieve your goal.

If your interest is just in mb_detect_encoding and how it works, then I am happy to restrict my comments to that subject and not offer commentary on whether your application should be using mb_detect_encoding or not (hint: it probably shouldn't).

@mrbase

mrbase commented Mar 9, 2023

@alexdowad Yeah, I get it - it is not easy at all.

We have a system that runs on both Windows and Linux, and historically all files have been CP1252 encoded.

On Linux you can (in most cases) run file -i some-file.php to find out what the system thinks the file is encoded as. This does not work on Windows, so there we read the file and use mb_detect_encoding to see if the content needs to be converted in order to save it as UTF-8.

We also use it to find the encoding of csv/txt/... import files whose encoding we do not know.

And yes, it would be super to have a better understanding of how it works, but I also think you are right that we should not use it as we do.

@alexdowad
Contributor

@mrbase Thanks for sharing more info.

I haven't analyzed your test cases yet, but first, if UTF-8 is the "default" encoding which most of your input files are encoded in, I would like to suggest you first try mb_check_encoding to see if the file is valid UTF-8. If it is, just treat it as UTF-8. (Likewise, if CP1252 is the "default", use mb_check_encoding first to see if it's valid as CP1252. Hmm... I don't remember right now whether all strings are valid as CP1252 or not, though...)

If it's not valid in your 'default' encoding, then using mb_detect_encoding to guess from the other possible encodings would be reasonable.

If you want to make things work as well as possible for your users, you could provide a button which they can click if their file contents look 'corrupted'. (For example, say you guessed that the file was the default CP1252, but it's actually UTF-8.) When they click that button, show them a snippet of text from their file when interpreted as UTF-8, CP1252, ISO-8859-1, or whatever other text encodings you support. Let them click on the one which looks right.
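A rough sketch of that suggested flow (the helper name and candidate list here are made up for illustration, assuming CP1252 is the legacy default):

<?php
// Hypothetical helper: prefer the expected "default" encoding, and only
// fall back to guessing when the input is not valid in it.
function normalise_to_utf8(string $raw, string $default = 'Windows-1252'): string {
    if (mb_check_encoding($raw, 'UTF-8')) {
        return $raw;                                          // already valid UTF-8
    }
    if (mb_check_encoding($raw, $default)) {
        return mb_convert_encoding($raw, 'UTF-8', $default);  // valid in the default encoding
    }
    // Last resort: guess among the remaining candidates (strict mode).
    $guess = mb_detect_encoding($raw, ['UTF-8', $default, 'ISO-8859-1'], true);
    return mb_convert_encoding($raw, 'UTF-8', $guess ?: 'ISO-8859-1');
}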

@alexdowad
Contributor

@mrbase If you have any ideas of heuristics which could be used to distinguish between CP1252 and ISO-8859-1, I am very open to adding those.

@mrbase

mrbase commented Mar 9, 2023

@alexdowad

In our old test case we have the following comment, don't know if it helps or not:

# The following characters are special characters that have different code points in ISO-8859-1 and Windows-1252
# - to verify that we differentiate correctly between ISO-8859-1 and Windows-1252
Œ
‰

Using mb_check_encoding seems to do the trick, though.
Just to be clear: mb_check_encoding is more strict than mb_detect_encoding because it "knows" what char ranges to validate against, correct?

@alexdowad
Contributor

OK, I'll add your test case to the official test cases for the PHP interpreter.

Using mb_check_encoding seems to do the trick, though. Just to be clear: mb_check_encoding is more strict than mb_detect_encoding because it "knows" what char ranges to validate against, correct?

mb_detect_encoding has 'strict' and 'non-strict' modes. In its 'strict' mode, it rejects any candidate encoding if the input string is invalid in that encoding, which is exactly when mb_check_encoding returns false.

I am suggesting that you use mb_check_encoding not because it is more strict; the point is to give your "default" encoding first chance to be picked. If the input is valid in your "default" encoding, then you can just use that instead of guessing from a list of several candidates.
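A minimal illustration of that distinction (a sketch, not from the thread): an invalid UTF-8 sequence is rejected by strict-mode detection exactly where mb_check_encoding returns false.

<?php
$truncated = "\xF0\x9F\xA5";   // first three bytes of a four-byte UTF-8 character
var_dump(mb_check_encoding($truncated, 'UTF-8'));                    // bool(false)
echo mb_detect_encoding($truncated, ['UTF-8', 'ISO-8859-1'], true);  // ISO-8859-1 -- the only candidate not rejected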

@mrbase

mrbase commented Mar 9, 2023

@alexdowad

Fantastic, will have to restructure some of our code, but with the pointers I guess we are close :)
Thanks a bunch!

@alexdowad
Contributor

In our old test case we have the following comment, don't know if it helps or not:

Are you allowed to share more of the test case code or not? If so, and we add it to the official test cases for PHP, it will help to ensure that future development on mbstring does not break your application.

@alexdowad
Contributor

alexdowad commented Mar 28, 2023

@mrbase Thanks very much...

Hmm. Well, on master, mb_detect_encoding guesses that both of those files are encoded as CP1252.

Take, for example, the final U+2030 PER MILLE SIGN (‰) character in the test files. In UTF-8, that's 0xE280B0. When interpreted as CP1252, the same bytes represent the three characters â€°. The current implementation of mb_detect_encoding looks at that, and says: "‰ is a rare character, but â, €, and ° are all common, so CP1252 looks like the more likely choice".

And I don't know if I can really fault it. Neither a bare "‰" nor the nonsense string "â€°" really looks like natural text.

If I, a human, were given those bytes with no other context, I would probably guess that they were UTF-8, simply because they are valid as UTF-8 and most text on the Internet nowadays is UTF-8. So... does that mean we should give UTF-8 a 'bonus' so that mb_detect_encoding is biased towards choosing UTF-8 when possible? I am very much open to that, but am not sure exactly how the details would work.

I am also open to giving a 'bonus' to the first candidate encoding in the provided array. Again, it's just a matter of working out the details.
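For reference, the byte-level view of that per-mille example (a sketch, not part of the original comment):

<?php
// U+2030 PER MILLE SIGN as UTF-8 bytes, reinterpreted as CP1252.
$per_mille = "\xE2\x80\xB0";
echo mb_convert_encoding($per_mille, 'UTF-8', 'CP1252');  // "â€°" -- three individually common CP1252 characters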

@staabm
Contributor

staabm commented Mar 29, 2023

Giving UTF-8 per se a bonus might work today, but maybe a new charset will come up in the future and take the lead; changing which charset implicitly gets a bonus (and deciding at which point in time that decision is made) is a big BC break.

I am also open to giving a 'bonus' to the first candidate encoding in the provided array

I think this makes more sense, as the consumer of the API is in control and can make better decisions for the use case at hand than PHP alone can for the whole ecosystem.

@mrbase

mrbase commented Mar 29, 2023

@alexdowad
Hmm, it might be some weird test thing we do that gets different results - it might be that one test is "is this UTF-8" and the other is "is this something else". I'll have to dig into the actual tests...

But to go back to your original answer to @filecage, regarding returning UTF-8 and not ISO.

Why would you not return the first match, when the list of test charsets is prioritized? Is it because some (in this case UTF-8) characters are also valid ISO characters?
Is the guess done x times, e.g. mb_detect_encoding('xxx', ['a', 'b', 'c']) results in 3 full tests, and then the "best" result is returned?

I might be talking out of my ass here, the topic is a little over my head :)

@alexdowad
Contributor

Why would you not return the first match, when the list of test-charsets is prioritized?

The documentation for mb_detect_encoding states that the function "Detects the most likely character encoding for string string from an ordered list of candidates."

If the string is only valid in one of the provided candidate encodings, then definitely, we return that one. If it's valid in more than one candidate encoding, then we have to guess which is the "most likely character encoding".

Is it because some (in this case UTF-8) characters are also valid ISO characters?

Again, when we don't know the encoding of a string, it's just a sequence of bytes. And yes, the same sequence of bytes may encode one sequence of characters as UTF-8, and encode a different sequence of characters as ISO-8859-1. We need to look at those two sequences of characters and guess which one is "right".

This is, of course, an impossible job, so all we can do is try to return results which are usually what the user intended. In this case, we need to guess whether the user wants ‰ or whether they want â€°.

If you just want to pick the "first match" in the list, then it is better to call mb_check_encoding in a loop.
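A sketch of that "first match wins" alternative (hypothetical helper name):

<?php
// Return the first candidate encoding the string is valid in, or null.
function first_valid_encoding(string $raw, array $candidates): ?string {
    foreach ($candidates as $encoding) {
        if (mb_check_encoding($raw, $encoding)) {
            return $encoding;
        }
    }
    return null;
}

echo first_valid_encoding('🥳', ['UTF-8', 'ISO-8859-1']);  // UTF-8, because it is listed first and the bytes are valid UTF-8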

@mrbase

mrbase commented Mar 29, 2023

@alexdowad Yeah - that's what I figured :)
Fantastic job you do, btw. Thank you for taking the time to explain the issues in such detail.

@alexdowad
Contributor

@mrbase You are very welcome. Thanks also for supporting the PHP project by providing feedback to the developers.

@mrbase

mrbase commented Apr 19, 2023

@alexdowad I have been thinking about this for a while.
The process of the guesser is roughly:

  • loop over the provided encodings (if any)
  • give each encoding a score
  • return the encoding with the highest score

Simplified, I know, but conceptually ...

I guess the score could be the same if a string can be both CP1252 and UTF-8.
So, if supplied, use the supplied encodings' order as a weight when the scores are equal - or something like that...?

@alexdowad
Contributor

@mrbase Thanks for the comment.

It is unlikely that two candidate encodings will have exactly the same score. In a few cases it will happen, but not often.

You used an interesting expression... "weight". If we treat the order of supplied encodings as "weight", that could mean (for example) that if 3 encodings are listed, we multiply the score for the 2nd one by 1.05 and the 3rd one by 1.1, or something along those lines.

The big question is how heavily the order should affect the results of the operation. Many users simply pass mb_list_encodings(), rather than an explicit list, and the order of mb_list_encodings means nothing at all. (Incidentally, it is not a good idea to pass mb_list_encodings() to this function!)

@mrbase

mrbase commented Apr 19, 2023

If I read the mb_detect_order docs correctly, the order/weight is "first is highest" - so I would prefer the same to be true for the encodings supplied :)

But you are correct, just supplying the full list does not make much sense. Would it be an idea to compare the $encodings to mb_list_encodings() and fall back to null if they match?

I'm all for the weighted solution tho :)

@alexdowad
Contributor

@Girgias @cmb69 I would like to modify mb_list_encodings so it returns a single, immutable array rather than initializing a new, mutable array on each call.

Just poking around in Zend core trying to figure out the best way to initialize an immutable PHP array. Do you know?

@Girgias
Member

Girgias commented Apr 29, 2023

@iluuu1994 or @arnaud-lb would probably know more about those things than I do :/

@iluuu1994
Member

@alexdowad It depends on whether you'd like to share the value in the same request, across all requests, or even across processes. Either way, type should be set to IS_ARRAY | ((GC_IMMUTABLE|GC_PERSISTENT|GC_NOT_COLLECTABLE) << GC_FLAGS_SHIFT), refcount to 2 (so that the array always gets separated), and it should be allocated with malloc.

  • Share in the same request
    • Once mb_list_encodings is called, you can check in the globals (which are unique per process/thread) whether the array has been initialized, and if not do so. You can clean up the array on RSHUTDOWN.
  • Share across requests
    • Same as above, except you clean up the array in MSHUTDOWN.
  • Share across all processes/threads
    • You could initialize the array in MINIT in the same way, and then also clean it up in MSHUTDOWN.

The first option is probably enough. Is this function particularly slow? Or what's the motivation behind it?

@alexdowad
Contributor

The first option is probably enough. Is this function particularly slow? Or what's the motivation behind it?

@iluuu1994 The actual motivation is so that a cheap pointer equality check can be used to check whether a list of text encodings passed in to mb_detect_encoding (or even other functions) is the complete list of all supported encodings returned by mb_list_encodings.

@alexdowad
Contributor

@iluuu1994 What's the best way to free the memory used by such an array on module shutdown?

@iluuu1994
Member

@alexdowad You don't have to do anything special; if you set the IS_ARRAY_PERSISTENT flag, zend_hash_destroy should correctly call free instead of efree.

@alexdowad
Contributor

@iluuu1994 Thanks! I seem to be making some progress. However, something is dawning on me here...

Is it correct that the "immutable array" feature is only to be used for arrays which user code will never attempt to modify?

Because with my current test code, this causes an interpreter crash on shutdown...

$s = mb_list_encodings();
$s[0] = 'blahblah';

@iluuu1994
Member

iluuu1994 commented Apr 29, 2023

@alexdowad It should be fine to "mutate" immutable arrays. The point is that the refcount = 2 guarantees that the array will be copied before actually being mutated. One gotcha: GC_TRY_ADDREF checks if the element is immutable, and only adds a ref if it isn't. Z_TRY_ADDREF doesn't do that. Instead, it checks if the element is Z_REFCOUNTED_P, which it is not considered to be when (zval).u1.v.type_flags == 0.

This means, after using ZVAL_ARR (or whatever variation of that) you need to set Z_TYPE_INFO_P(zval) = IS_ARRAY (or Z_TYPE_FLAGS_P(zval) = 0).

I think, if you only share the value in the same process, you could also just not make the array immutable and set the RC to 2. This should also prevent it from being deallocated, while keeping refcounting.

@alexdowad
Contributor

@iluuu1994 Thanks for that. I need to dig in more and see if I can figure out why this crash is happening.

It looks like on $a[0] = 'blahblah';, the Zend VM replaces the persistent string which is stored in my (supposedly) immutable array, decrements the refcount for the persistent string, notices that it is now unreferenced, tries to free it, and then an assertion fails and the interpreter crashes because the string is persistent.

The persistent string is freed from zend_vm_execute.h:42445:

if (garbage) {
  GC_DTOR_NO_REF(garbage);
}

If that analysis is true, then I need to figure out why the array is not being copied.

If you have time to have a look, this is my test code: alexdowad@c8cf129

@iluuu1994
Member

@alexdowad Here's a simpler implementation that only uses a request-bound allocation.

master...iluuu1994:php-src:mbstring-shared-mb_list_encodings

@alexdowad
Contributor

@iluuu1994 Wow, thanks!

Now that I think about it, this will also achieve the same purpose... we can still tell if mb_list_encodings() is being passed to other mbstring functions, even if it's a per-request shared array and not a per-process shared array...

@iluuu1994
Member

@alexdowad Yes. You can just compare MBSTRG(all_encodings_list) == ht (or zend_is_identical/zend_compare if that's still faster than operating on the user-provided array).

@alexdowad
Contributor

Very early draft of the proposed change to text encoding detection: be5aa65

If 0.2 is changed to 0.3 in the following code, that is enough to make @mrbase's test case return the results which he wants:

https://github.com/alexdowad/php-src/blob/be5aa651685cdf7f8b7838fb459a58343c1bb89a/ext/mbstring/mbstring.c#L3022-L3025

@iluuu1994
Member

So I don't forget, you'll also need to remove @refcount 1 from function mb_list_encodings() in ext/mbstring/mbstring.stub.php.

@alexdowad
Contributor

So I don't forget, you'll also need to remove @refcount 1 from function mb_list_encodings() in ext/mbstring/mbstring.stub.php.

Thank you very much!! I totally would have missed that.

@alexdowad
Contributor

I have just opened a PR which adds @mrbase's test case to the official test suite.

@Girgias
Member

Girgias commented May 30, 2024

@alexdowad Is this issue resolved or not?

@alexdowad
Contributor

@Girgias It's resolved. Thanks.
