grapheme_extract should pass over invalid surrogate halves #17568

dmsnell · 2025-01-24T21:57:19Z

Many systems incorrectly encode surrogate halves from a UTF-16 stream into UTF-8 as two three-byte characters instead of the proper four-byte sequence. These are invalid charaters in UTF-8 and should be skipped when decoding with grapheme_extract but it’s not currently handling these properly.

If offset does not point to the first byte of a UTF-8 character,
the start position is moved to the next character boundary.

For example, U+1F170 (d83c dd70) should encode as F0 9F 85 B0, but when applying the UTF-8 encoder invalidly to d83c, the output would be ED A0 BD. This entire span of bytes is invalid UTF-8.

grapheme_extract( "\xED\xA0\xBDa", 1, GRAPHEME_EXTR_COUNT, 0, $next );
// returns "\xED", an invalid UTF-8 byte sequence
// $next === 1, pointing into the middle of the invalid sequence

Instead, it should return “a” and point $next to the end of the string.

Many systems incorrectly encode surrogate halves from a UTF-16 stream into UTF-8 as two three-byte characters instead of the proper four-byte sequence. These are invalid charaters in UTF-8 and should be skipped when decoding with `grapheme_extract` but it’s not currently handling these properly. > If offset does not point to the first byte of a UTF-8 character, > the start position is moved to the next character boundary. For example, U+1F170 (d83c dd70) should encode as F0 9F 85 B0, but when applying the UTF-8 encoder invalidly to d83c, the output would be ED A0 BD. This entire span of bytes is invalid UTF-8. ```php grapheme_extract( "\xED\xA0\xBDa", 1, GRAPHEME_EXTR_COUNT, 0, $next ); // returns "\xED", an invalid UTF-8 byte sequence // $next === 1, pointing into the middle of the invalid sequence ``` Instead, it should return “a” and point `$next` to the end of the string.

youkidearitai · 2025-01-25T10:38:15Z

My understand is surrogate pair that only UTF-16. UTF-8 can cover all code points.
Therefore, this result of $next is 1. Second byte (\xA0) is only invalid byte.

dmsnell · 2025-01-26T18:13:20Z

UTF-8 can cover all code points.

This is almost correct, as UTF-8 encodes Unicode Scalar Values, which prohibit the unassigned code points in the surrogate range.

Because surrogate code points are not Unicode scalar values, any UTF-8 byte sequence that would otherwise map to code points U+D800..U+DFFF is ill-formed.
UTF-8

Further, both %ED and %A0 (and %BD for this matter) are invalid UTF-8 sequences which should not be returned from this function because they are not grapheme clusters, extended grapheme clusters, or even code points. They are simply invalid bytes on their own.

Particularly since the documentation states that grapheme_extract operates on UTF-8 sequences, it should not return invalid UTF-8 as if it were a character, and should skip over all invalid bytes, of which the surrogate range cannot be represented.

cmb69 · 2025-01-27T18:10:33Z

Hmm, the current behavior does not really make sense, but I'm not sure what the proper behavior would be. The grapheme_extract() man page states:

Function to extract a sequence of default grapheme clusters from a text buffer, which must be encoded in UTF-8

Now, invalid UTF-8 is not UTF-8.

If offset does not point to the first byte of a UTF-8 character, the start position is moved to the next character boundary.

This may still make sense even if we assume valid UTF-8, because a user might just start at a wrong position.

And then we have: https://3v4l.org/iLSL9. Given that these are not (valid) code points, what am I missing?

dmsnell · 2025-01-27T20:56:52Z

what am I missing?

@cmb69 I believe it’s the ->getPartsIterator()

$bi = IntlBreakIterator::createCodePointInstance();
$bi->setText("A\xED\xA0\xBDa");
foreach ($bi->getPartsIterator() as $cp) {
    var_dump($bi->getErrorMessage(), bin2hex($cp));
}

this produces

string(12) "U_ZERO_ERROR"
string(2) "41"
string(12) "U_ZERO_ERROR"
string(2) "ed"
string(12) "U_ZERO_ERROR"
string(2) "a0"
string(12) "U_ZERO_ERROR"
string(2) "bd"
string(12) "U_ZERO_ERROR"
string(2) "61"

https://3v4l.org/73D52

dmsnell · 2025-01-27T21:02:09Z

invalid UTF-8 is not UTF-8.

This is a fair point, and if it comes to the function rejecting these inputs that would be tolerable, though less useful.

Doing so would require that the string have already been pre-scanned for having a valid UTF-8 encoding, or scan through before reading the first grapheme/code-point/bytes.

Given that this is inherently a streaming function, it’s a useful property to return valid sequences where they exist without having to read the entire string first and without rejecting a valid prefix because of later problems. Being able to identify those invalid byte sequences also helps, giving user-space code the choice of whether to replace the sequence with U+FFFD for display, or pass through the invalid bytes unaltered, or remove them.

cmb69 · 2025-01-27T22:50:10Z

I believe it’s the ->getPartsIterator()

That doesn't really matter here. Iterating over the "CodePointIterator" gives the byte offsets (int is converted to string when passed to bin2hex()), and single bytes are reported as code points, although they are clearly invalid. Is this a bug in ICU (unlikely), or in PHP, or do we deliberate do this? If the latter, I think that grapheme_extract() should do the same (i.e. behave as it is behaving now). But I still think this behavior does not make much sense.

This is a fair point, and if it comes to the function rejecting these inputs that would be tolerable, though less useful.

I can see valid arguments for both approaches, but I'm unsure what I'd prefer. I would like some feedback from others.

dmsnell · 2025-08-23T02:41:34Z

After many months I came back to this with some insight from the note about substitution of maximal subparts but am now more confused.

At the heart of this seems to be the definition of “character boundary.” If this were to follow the behavior of mb_scrub() then we might expect three advances in the example string.

var_dump( mb_scrub( "\xED\xA0\xBDa", 'UTF-8' ) );
string(4) "???a"

This is because none of these three bytes represents a valid prefix to a proper Unicode scalar value. In contrast we could have three out of four valid bytes and treat those as a single invalid byte sequence.

var_dump( mb_scrub( "\xF0\x9F\x85a", 'UTF-8' ) );
string(2) "?a"

On this basis we could expect grapheme_extract() to make three stops in the first case on invalid UTF-8 and then a single stop in the second. However, this isn’t what occurs.

$next = 0;
var_dump( bin2hex( grapheme_extract( "\xED\xA0\xBDa", 1, GRAPHEME_EXTR_COUNT, $next, $next ) ), $next );
var_dump( bin2hex( grapheme_extract( "\xED\xA0\xBDa", 1, GRAPHEME_EXTR_COUNT, $next, $next ) ), $next );
var_dump( bin2hex( grapheme_extract( "\xED\xA0\xBDa", 1, GRAPHEME_EXTR_COUNT, $next, $next ) ), $next );

string(2) "ed"
int(1)
string(2) "61"
int(4)
string(0) ""
int(4)

$next = 0;
var_dump( bin2hex( grapheme_extract( "\xF0\x9F\x85a", 1, GRAPHEME_EXTR_COUNT, $next, $next ) ), $next );
var_dump( bin2hex( grapheme_extract( "\xF0\x9F\x85a", 1, GRAPHEME_EXTR_COUNT, $next, $next ) ), $next );
var_dump( bin2hex( grapheme_extract( "\xF0\x9F\x85a", 1, GRAPHEME_EXTR_COUNT, $next, $next ) ), $next );

string(6) "f09f85"
int(3)
string(2) "61"
int(4)
string(0) ""
int(4)

It looks like the intent here it to behave the way it does for the truncated four-byte character: the return should be a character as determined by the maximal subpart. In fact, in the broken test case it does do that for the first call, but then the updated $offset value is wrong. It suggests that the a is found at offset 1 which is just wrong. Further, if it were to be following this algorithm then it should be pausing two more times on the "a0" and "bd" bytes, but it isn’t.

I’m confirming these on 3v4l using git.master_jit but maybe I’m missing work already merged in.

Surely this seems inconsistent and likely a pointer to an issue in the PHP implementation? Either way, in hindsight and with some learning, it seems reasonable to expect that this function should likely match the behavior of mb_scrub(), which properly follows maximal subpart substitution.

dmsnell requested a review from devnexen as a code owner January 24, 2025 21:57

github-actions bot added the Extension: intl label Jan 24, 2025

dmsnell force-pushed the fix/grapheme-extract-invalid-surrogate-half branch from e0a5736 to e7f9aec Compare January 24, 2025 21:57

devnexen requested a review from youkidearitai January 24, 2025 23:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

grapheme_extract should pass over invalid surrogate halves #17568

grapheme_extract should pass over invalid surrogate halves #17568

dmsnell commented Jan 24, 2025

Uh oh!

youkidearitai commented Jan 25, 2025

Uh oh!

dmsnell commented Jan 26, 2025

Uh oh!

cmb69 commented Jan 27, 2025

Uh oh!

dmsnell commented Jan 27, 2025

Uh oh!

dmsnell commented Jan 27, 2025

Uh oh!

cmb69 commented Jan 27, 2025

Uh oh!

dmsnell commented Aug 23, 2025

Uh oh!

Uh oh!

grapheme_extract should pass over invalid surrogate halves #17568

Are you sure you want to change the base?

grapheme_extract should pass over invalid surrogate halves #17568

Conversation

dmsnell commented Jan 24, 2025

Uh oh!

youkidearitai commented Jan 25, 2025

Uh oh!

dmsnell commented Jan 26, 2025

Uh oh!

cmb69 commented Jan 27, 2025

Uh oh!

dmsnell commented Jan 27, 2025

Uh oh!

dmsnell commented Jan 27, 2025

Uh oh!

cmb69 commented Jan 27, 2025

Uh oh!

dmsnell commented Aug 23, 2025

Uh oh!

Uh oh!