mb_convert_encoding() returns inconsistent output before and after 0x0e 0x0f when converting ISO-2022-JP #10651

pakutoma · 2023-02-21T14:28:39Z

Description

When mb_convert_encoding() decodes an ISO-2022-JP byte sequence in JIS X 0201 mode, the byte 0x5c converts to a different character depending on whether it appears before or after the pair 0x0e ("Shift Out") 0x0f ("Shift In").
Since these control characters cannot switch from the JIS X 0201 character set to another character set, I believe the same byte must map to the same character before and after 0x0e 0x0f.

3v4l: https://3v4l.org/XrFGE

The following code:

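<?php

$jis_x_0201 = '1b284a';     // ESC ( J: designate JIS X 0201 Roman
$yen = '5c';                // 0x5c: yen sign in JIS X 0201 Roman, backslash in ASCII
$shiftout_shiftin = '0e0f'; // SO (0x0e) immediately followed by SI (0x0f)
$ascii = '1b2842';          // ESC ( B: designate ASCII

// Same byte 0x5c before and after the SO/SI pair, then return to ASCII
$byte_seq = $jis_x_0201 . $yen . $shiftout_shiftin . $yen . $ascii;
var_dump($byte_seq);
var_dump(mb_convert_encoding(hex2bin($byte_seq), 'UTF-8', 'JIS'));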

Resulted in this output:

string(20) "1b284a5c0e0f5c1b2842"
string(3) "¥\"

But I expected this output instead:

string(20) "1b284a5c0e0f5c1b2842"
string(4) "¥¥"
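Because the yen sign and the backslash can look alike in some fonts and terminals, it may help to compare the raw UTF-8 bytes instead. The following is a minimal sketch using the same input as above; it only adds a bin2hex() call and assumes the mbstring extension is loaded. Going by the outputs reported above, the actual result would correspond to the bytes c2a5 5c and the expected result to c2a5 c2a5.

```php
<?php

// Same input as in the report: ESC ( J, 0x5c, SO, SI, 0x5c, ESC ( B
$bytes = hex2bin('1b284a5c0e0f5c1b2842');

$utf8 = mb_convert_encoding($bytes, 'UTF-8', 'JIS');

// bin2hex() shows the exact UTF-8 bytes:
// c2a5 is U+00A5 (yen sign), 5c is U+005C (backslash).
var_dump(bin2hex($utf8));
```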

PHP Version

PHP 8.1.16

Operating System

No response


pakutoma · 2023-02-21T15:30:23Z

PR: #10652

youkidearitai · 2023-02-21T15:57:10Z

Just a moment, this behavior may be historical.
According to the English Wikipedia, Shift In (0x0f) means "Latin letters" and Shift Out (0x0e) means "Japanese letters". So "Shift In" can be read as "change to Latin letters (ASCII)", and the second \ (0x5c) may correctly convert to \ (0x5c).

This is not part of RFC 1468, but it is one technique that has been used to put half-width (hankaku) katakana into email.

refs (sorry, these links are Japanese pages):

alexdowad · 2023-02-21T16:31:40Z

@pakutoma Thank you for raising this issue!

I must say this is surprising to me. My understanding was that in JIS7/8, 0x0E is used to start a section of JISX 0201 kana, and that 0x0F is used to end a section of JISX 0201 kana.

Since the default mode for all ISO-2022-JP variants (the mode which we start every string in) is ASCII mode, I thought that 0x0F should return back to the default (ASCII) mode.

I think this is what mbstring has always done, for more than 10 years. But if that is not true, please let me know.
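To make the two readings of 0x0F concrete, here is a toy decoder. It is only a sketch for illustration and is not mbstring's implementation: the function name toy_decode(), the MODE_* constants, and the $siReturnsToAscii flag are all hypothetical. It handles just ESC ( J, ESC ( B, SO, SI, and single bytes.

```php
<?php

// Toy model only: NOT how mbstring is implemented. All names are hypothetical.
const MODE_ASCII = 0;
const MODE_ROMAN = 1; // JIS X 0201 Roman, designated by ESC ( J
const MODE_KANA  = 2; // JIS X 0201 katakana, entered by SO (0x0E)

function toy_decode(string $bytes, bool $siReturnsToAscii): string
{
    $mode = MODE_ASCII;       // every string starts in ASCII mode
    $designated = MODE_ASCII; // set chosen by the most recent escape sequence
    $out = '';
    $len = strlen($bytes);
    for ($i = 0; $i < $len; $i++) {
        $b = ord($bytes[$i]);
        if ($b === 0x1b && substr($bytes, $i + 1, 2) === '(J') {
            $mode = $designated = MODE_ROMAN;
            $i += 2; // skip the two bytes after ESC
        } elseif ($b === 0x1b && substr($bytes, $i + 1, 2) === '(B') {
            $mode = $designated = MODE_ASCII;
            $i += 2;
        } elseif ($b === 0x0e) {
            $mode = MODE_KANA; // SO: start a run of katakana
        } elseif ($b === 0x0f) {
            // SI: the point under discussion.
            // true  = fall back to the default ASCII mode
            // false = fall back to the set designated by the last escape
            $mode = $siReturnsToAscii ? MODE_ASCII : $designated;
        } elseif ($mode === MODE_KANA && $b >= 0x21 && $b <= 0x5f) {
            $out .= mb_chr(0xff61 + $b - 0x21, 'UTF-8'); // half-width kana
        } elseif ($mode === MODE_ROMAN && $b === 0x5c) {
            $out .= "\u{A5}";   // yen sign
        } elseif ($mode === MODE_ROMAN && $b === 0x7e) {
            $out .= "\u{203E}"; // overline
        } else {
            $out .= chr($b);    // everything else passes through unchanged
        }
    }
    return $out;
}

$input = hex2bin('1b284a5c0e0f5c1b2842');
var_dump(bin2hex(toy_decode($input, true)));  // string(6) "c2a55c"
var_dump(bin2hex(toy_decode($input, false))); // string(8) "c2a5c2a5"
```

With the flag set to true, the sketch reproduces the output in the report (yen sign, then backslash); with it set to false, it produces the output the report expected (two yen signs).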

Since this would be a BC (backward compatibility) break, we would really need to think about it carefully. From past experience, I know that our Japanese users are very sensitive to any BC breaks involving Japanese text encodings. 😉

pakutoma · 2023-02-21T16:34:05Z

Hmm, it certainly seems that way when I read the web page @youkidearitai sent me.
I think that, in the JIS X 0201 specification, SO/SI switch only between character sets within JIS X 0201, but I don't know how they work in ISO-2022-JP.
Reading IETF RFC 1468, there is no mention of JIS X 0201, and I don't think there are many ISO-2022-JP implementations that use SO/SI to begin with.
It might be better not to change it carelessly.
Thank you!

pakutoma added the Bug and Status: Needs Triage labels Feb 21, 2023

pakutoma closed this as completed Feb 21, 2023

pakutoma mentioned this issue Feb 23, 2023

mb_check_encoding() returns true for incorrect but interpretable ISO-2022-JP byte sequences #10648

Closed
