Commit 7914b8c

Use pakutoma's encoding check functions for mb_detect_encoding even in non-strict mode
In 6fc8d01, pakutoma added specialized validity checking functions for some legacy text encodings like ISO-2022-JP and UTF-7. These check functions perform a more strict validity check than the encoding conversion functions for the same text encodings. For example, the check function for ISO-2022-JP verifies that the string ends in the correct state required by the specification for ISO-2022-JP.

These check functions are already being used to make detection of text encoding more accurate when 'strict' detection mode is enabled. However, since the default is 'non-strict' detection (a bad API design but we're stuck with it now), most users will not benefit from pakutoma's work.

I was previously reluctant to enable this new logic for non-strict detection mode. My intention was to reduce the scope of behavior changes, since almost *any* behavior change may affect *some* user in a way we don't expect. However, we definitely have users whose (production) code was broken by the changes I made in 28b346b, and enabling pakutoma's check functions for non-strict detection mode would un-break it. (See GH-10192 as an example.) The added checks do also make sense.

In non-strict detection mode, we will not immediately reject candidate encodings whose validity check function returns false; but they will be much less likely to be selected. However, failure of the validity check function is weighted less heavily than an encoding error detected by the encoding conversion function.
1 parent 3ab10da commit 7914b8c

3 files changed: +25 -13 lines changed

ext/mbstring/mbstring.c

Lines changed: 10 additions & 6 deletions
@@ -1816,7 +1816,6 @@ static size_t mb_get_strlen(zend_string *string, const mbfl_encoding *encoding)
 		return mb_fast_strlen_utf8((unsigned char*)ZSTR_VAL(string), ZSTR_LEN(string));
 	}
 
-
 	uint32_t wchar_buf[128];
 	unsigned char *in = (unsigned char*)ZSTR_VAL(string);
 	size_t in_len = ZSTR_LEN(string);
@@ -3006,19 +3005,24 @@ static size_t init_candidate_array(struct candidate *array, size_t length, const
 	for (size_t i = 0; i < length; i++) {
 		const mbfl_encoding *enc = encodings[i];
 
+		array[j].enc = enc;
+		array[j].state = 0;
+		array[j].demerits = 0;
+
 		/* If any candidate encodings have specialized validation functions, use them
 		 * to eliminate as many candidates as possible */
-		if (strict && enc->check != NULL) {
+		if (enc->check != NULL) {
 			for (size_t k = 0; k < n; k++) {
 				if (!enc->check((unsigned char*)in[k], in_len[k])) {
-					goto skip_to_next;
+					if (strict) {
+						goto skip_to_next;
+					} else {
+						array[j].demerits += 500;
+					}
 				}
 			}
 		}
 
-		array[j].enc = enc;
-		array[j].state = 0;
-		array[j].demerits = 0;
 		/* This multiplier can optionally be used to make candidate encodings listed
 		 * first more likely to be chosen. It is a weight factor which multiplies
 		 * the number of demerits counted for each candidate. */
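
The behavior change in the hunk above can be modeled in isolation. The following is a minimal sketch, not the mbstring implementation: `struct encoding`, `struct candidate`, `toy_utf7_check` and `init_candidates` are hypothetical stand-ins, the toy check is far cruder than the real UTF-7 check, and only a single input string is considered (the real code loops over `n` strings). Only the penalty of 500 demerits and the strict/non-strict branching are taken from the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for mbstring's internal types; not the real definitions. */
typedef bool (*check_fn)(const unsigned char *in, size_t len);

struct encoding {
    const char *name;
    check_fn check; /* NULL when the encoding has no specialized validity check */
};

struct candidate {
    const struct encoding *enc;
    unsigned int demerits;
};

/* Toy check (assumption for illustration): any '+' makes the input "invalid". */
bool toy_utf7_check(const unsigned char *in, size_t len)
{
    return memchr(in, '+', len) == NULL;
}

/* Fills `out` with the surviving candidates and returns how many were kept.
 * `out` must have room for `n_encs` entries. */
size_t init_candidates(struct candidate *out, const struct encoding *encs, size_t n_encs,
                       const unsigned char *in, size_t in_len, bool strict)
{
    assert(out != NULL && encs != NULL && in != NULL);

    size_t j = 0;
    for (size_t i = 0; i < n_encs; i++) {
        /* Initialize the slot first so that demerits can be added below. */
        out[j].enc = &encs[i];
        out[j].demerits = 0;

        if (encs[i].check != NULL && !encs[i].check(in, in_len)) {
            if (strict) {
                continue; /* strict: drop the candidate; slot j is reused */
            }
            out[j].demerits += 500; /* non-strict: keep it, but penalize it */
        }
        j++;
    }
    return j;
}
```

With the candidates `UTF-7` (using the toy check) and `UTF-8` (no check) and the input `a + b`, strict mode keeps only `UTF-8`, while non-strict mode keeps both and starts `UTF-7` off with 500 demerits. This also shows why the diff moves the `array[j]` initialization ahead of the check: the penalty has to be added to a slot that has already been zeroed.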

ext/mbstring/tests/gh10192_utf7.phpt

Lines changed: 7 additions & 7 deletions
@@ -75,7 +75,7 @@ foreach ($testcases as $title => $case) {
 --EXPECT--
 non-base64 character after +
 string(5) "UTF-8"
-string(5) "UTF-7"
+string(5) "UTF-8"
 bool(false)
 string(5) "UTF-7"
 bool(false)
@@ -93,7 +93,7 @@ int(0)
 
 base64 character before +
 string(5) "UTF-8"
-string(5) "UTF-7"
+string(5) "UTF-8"
 bool(false)
 string(5) "UTF-7"
 bool(false)
@@ -174,7 +174,7 @@ int(2)
 
 - and +
 string(5) "UTF-8"
-string(5) "UTF-7"
+string(5) "UTF-8"
 bool(false)
 string(5) "UTF-7"
 bool(false)
@@ -219,7 +219,7 @@ int(2)
 
 valid direct encoding character = after +
 string(5) "UTF-8"
-string(5) "UTF-7"
+string(5) "UTF-8"
 bool(false)
 string(5) "UTF-7"
 bool(false)
@@ -228,7 +228,7 @@ int(2)
 
 invalid direct encoding character ~ after +
 string(5) "UTF-8"
-string(5) "UTF-7"
+string(5) "UTF-8"
 bool(false)
 string(5) "UTF-7"
 bool(false)
@@ -237,7 +237,7 @@ int(2)
 
 invalid direct encoding character \ after +
 string(5) "UTF-8"
-string(5) "UTF-7"
+string(5) "UTF-8"
 bool(false)
 string(5) "UTF-7"
 bool(false)
@@ -246,7 +246,7 @@ int(2)
 
 invalid direct encoding character ESC after +
 string(5) "UTF-8"
-string(5) "UTF-7"
+string(5) "UTF-8"
 bool(false)
 string(5) "UTF-7"
 bool(false)

ext/mbstring/tests/mb_detect_encoding.phpt

Lines changed: 8 additions & 0 deletions
@@ -78,6 +78,13 @@ echo mb_detect_encoding($test, ['UTF-8', 'ISO-8859-1']), "\n"; // Should be UTF-
 echo mb_detect_encoding('abc', ['UUENCODE', 'UTF-8']), "\n";
 echo mb_detect_encoding('abc', ['UUENCODE', 'QPrint', 'HTML-ENTITIES', 'Base64', '7bit', '8bit', 'SJIS']), "\n";
 
+// This test case courtesy of Adrien Foulon
+// It depends on the below use of '+' being recognized as invalid UTF-7
+$css = 'input[type="radio"]:checked + img {
+border: 5px solid #0083ca;
+}';
+echo mb_detect_encoding($css, mb_list_encodings(), true), "\n";
+
 echo "== DETECT ORDER ==\n";
 
 mb_detect_order('auto');
@@ -400,6 +407,7 @@ UTF-8
 UTF-8
 UTF-8
 SJIS
+UTF-8
 == DETECT ORDER ==
 JIS: JIS
 EUC-JP: EUC-JP
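
The new test case relies on `+` followed by a space being rejected as UTF-7. As a rough illustration of that one rule, here is a deliberately partial sketch in C. It is not pakutoma's check function: `utf7_plus_usage_ok` and `utf7_plus_usage_ok_str` are hypothetical names, the contents of base64 blocks are not validated at all, and rejecting a `+` at the very end of the input is an assumption of this sketch. It only encodes the RFC 2152 rule that a `+` opening a shifted block must be followed by `-` or by a base64 character.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if c is in the base64 alphabet used inside UTF-7 shifted blocks. */
static bool is_base64_char(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
           (c >= '0' && c <= '9') || c == '+' || c == '/';
}

/* Partial check: looks only at the byte that follows each block-opening '+'. */
bool utf7_plus_usage_ok(const unsigned char *in, size_t len)
{
    bool in_base64 = false;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = in[i];

        if (in_base64) {
            if (!is_base64_char(c)) {
                /* Any non-base64 byte ends the block; a '-' is simply absorbed. */
                in_base64 = false;
            }
            continue;
        }

        if (c == '+') {
            if (i + 1 == len) {
                return false; /* sketch assumption: a dangling '+' is rejected */
            }
            unsigned char next = in[i + 1];
            if (next == '-') {
                i++; /* "+-" is the escape for a literal '+' */
            } else if (is_base64_char(next)) {
                in_base64 = true; /* a shifted block starts here */
            } else {
                return false; /* e.g. "+ " as in the CSS selector of the test */
            }
        }
    }
    return true;
}

/* Convenience wrapper for NUL-terminated strings. */
bool utf7_plus_usage_ok_str(const char *s)
{
    assert(s != NULL);
    return utf7_plus_usage_ok((const unsigned char *)s, strlen(s));
}
```

Under these assumptions the selector line `input[type="radio"]:checked + img {` is rejected because a space follows the `+`, while `a+-b` and `Hi Mom +Jjo-!` are accepted.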
