Improve mb_detect_encoding accuracy for text containing the word Māori (with accent) #12025

Closed
wants to merge 1 commit into from
5 changes: 5 additions & 0 deletions ext/mbstring/common_codepoints.txt
@@ -3,18 +3,23 @@
 0x0020 0x007E # ASCII
 0x00A1 0x00AC # Pound sign, Yen sign, copyright sign...
 0x00AE 0x00FF # Accented Latin characters
+0x0101 0x0101 # a with macron
 0x0104 0x0107 # Polish
 0x010C 0x010F # Czech
+0x0113 0x0113 # e with macron
 0x0118 0x011B # Polish, Czech
 0x011F 0x011F # Turkish
+0x012B 0x012B # i with macron
 0x0130 0x0131 # Turkish
 0x0141 0x0144 # Polish
 0x0147 0x0148 # Czech
+0x014D 0x014D # o with macron
 0x0150 0x0151 # Hungarian
 0x0158 0x015B # Czech, Polish
 0x015F 0x015F # Turkish
 0x0160 0x0161 # Used in Slavic names
 0x0164 0x0165 # Czech
+0x016B 0x016B # u with macron
 0x016E 0x016F # Czech
 0x0170 0x0171 # Hungarian
 0x0179 0x017E # Polish, Czech, other Slavic languages
2 changes: 1 addition & 1 deletion ext/mbstring/rare_cp_bitvec.h
@@ -11,7 +11,7 @@

 static const uint32_t rare_codepoint_bitvec[] = {
 0xffffd9ff, 0x00000000, 0x00000000, 0x80000000, 0xffffffff, 0x00002001, 0x00000000, 0x00000000,
-0x70ff0f0f, 0xfffcffff, 0x70fcfe61, 0x81fc3fcc, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
+0x70f70f0d, 0xfffcf7ff, 0x70fcde61, 0x81fc37cc, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff, 0xffffffff,
 0xfffff800, 0xffffffff, 0xffffffff, 0x0300ffff, 0x0000280f, 0x00000004, 0x00000000, 0x00000000,
 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000, 0x00000000,
20 changes: 20 additions & 0 deletions ext/mbstring/tests/mb_detect_encoding.phpt
@@ -85,6 +85,18 @@ $css = 'input[type="radio"]:checked + img {
 }';
 echo mb_detect_encoding($css, mb_list_encodings(), true), "\n";

+// Test cases courtesy of Kirill Roskolii and Chris Burgess
+echo "-- Māori text --\n";
+
+echo mb_detect_encoding("Total Māori,31.5,33.3,31.8,33,36.4,33.2,33.2", ['UTF-8', 'ISO-8859-1', 'Windows-1251']), "\n";
+// Names of native birds from Aotearoa:
+echo mb_detect_encoding("Kākā", ['UTF-8', 'ISO-8859-1', 'Windows-1251']), "\n";
+echo mb_detect_encoding("Whēkau", ['UTF-8', 'ISO-8859-1', 'Windows-1251']), "\n";
+echo mb_detect_encoding("Tīwaiwaka", ['UTF-8', 'ISO-8859-1', 'Windows-1251']), "\n";
+echo mb_detect_encoding("Kōtuku", ['UTF-8', 'ISO-8859-1', 'Windows-1251']), "\n";
+echo mb_detect_encoding("Kererū", ['UTF-8', 'ISO-8859-1', 'Windows-1251']), "\n";
+echo mb_detect_encoding("Tūī", ['UTF-8', 'ISO-8859-1', 'Windows-1251']), "\n";
+
 echo "== DETECT ORDER ==\n";

 mb_detect_order('auto');
@@ -408,6 +420,14 @@ UTF-8
 UTF-8
 SJIS
 UTF-8
+-- Māori text --
+UTF-8
+UTF-8
+UTF-8
+UTF-8
+UTF-8
+UTF-8
+UTF-8
 == DETECT ORDER ==
 JIS: JIS
 EUC-JP: EUC-JP