mb_convert_encoding() returns inconsistent output before and after 0x0e 0x0f when converting ISO-2022-JP #10651

Closed
pakutoma opened this issue Feb 21, 2023 · 4 comments

Comments

@pakutoma
Contributor

Description

When mb_convert_encoding() interprets an ISO-2022-JP byte sequence in JIS X 0201 mode, 0x5c is converted differently before and after the control characters 0x0e "Shift out" and 0x0f "Shift in".
Since these control characters cannot switch from the JIS X 0201 character set to another one, I believe the same byte must map to the same character before and after 0x0e 0x0f.

3v4l: https://3v4l.org/XrFGE

The following code:

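<?php

$jis_x_0201 = '1b284a';       // ESC ( J: designate JIS X 0201 Latin
$yen = '5c';                  // 0x5c is the yen sign in JIS X 0201 Latin
$shiftout_shiftin = '0e0f';   // SO (0x0e) followed by SI (0x0f)
$ascii = '1b2842';            // ESC ( B: return to ASCII

$byte_seq = $jis_x_0201 . $yen . $shiftout_shiftin . $yen . $ascii;
var_dump($byte_seq);
var_dump(mb_convert_encoding(hex2bin($byte_seq), 'UTF-8', 'JIS'));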

Resulted in this output:

string(20) "1b284a5c0e0f5c1b2842"
string(3) "¥\"

But I expected this output instead:

string(20) "1b284a5c0e0f5c1b2842"
string(4) "¥¥"
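Because `¥` and `\` are easy to confuse in some fonts, comparing the raw UTF-8 bytes makes the difference unambiguous. The following is a minimal sketch, not part of the original report; the variable names are illustrative, and the `actual` line requires ext-mbstring and may differ between PHP versions.

```php
<?php

// Same byte sequence as in the report: ESC ( J, 0x5c, SO, SI, 0x5c, ESC ( B.
$input = hex2bin('1b284a' . '5c' . '0e0f' . '5c' . '1b2842');

// U+00A5 YEN SIGN is two bytes in UTF-8 (c2 a5); U+005C is one byte (5c).
$expected = "\u{00A5}\u{00A5}"; // both 0x5c bytes read as JIS X 0201 Latin
$reported = "\u{00A5}\\";       // what PHP 8.1.16 returned according to this report

printf("expected: %s (%d bytes)\n", bin2hex($expected), strlen($expected));
// expected: c2a5c2a5 (4 bytes)
printf("reported: %s (%d bytes)\n", bin2hex($reported), strlen($reported));
// reported: c2a55c (3 bytes)

// Only runs when mbstring is loaded; no particular result is assumed here.
if (function_exists('mb_convert_encoding')) {
    $actual = mb_convert_encoding($input, 'UTF-8', 'JIS');
    printf("actual:   %s (%d bytes)\n", bin2hex($actual), strlen($actual));
}
```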

PHP Version

PHP 8.1.16

Operating System

No response

@pakutoma
Contributor Author

PR: #10652

@youkidearitai
Contributor

Just a moment, this behavior may be historical.
According to Wikipedia (English version), Shift In (0x0f) means "Latin letters" and Shift Out (0x0e) means "Japanese letters". Therefore, "Shift In" can be read as "change to Latin letters (ASCII)", so converting the second \ (0x5c) to \ (0x5c) may be intended.

This is not part of RFC 1468, but it is one of the techniques used to put half-width (han-kaku) katakana in email.

refs (sorry, these links are Japanese pages):

@alexdowad
Contributor

@pakutoma Thank you for raising this issue!

I must say this is surprising to me. My understanding was that in JIS7/8, 0x0E is used to start a section of JIS X 0201 kana, and that 0x0F is used to end a section of JIS X 0201 kana.

Since the default mode for all ISO-2022-JP variants (the mode which we start every string in) is ASCII mode, I thought that 0x0F should return back to the default (ASCII) mode.

I think this is what mbstring has always done, for more than 10 years. But if that is not true, please let me know.

Since this would be a BC (backward compatibility) break, we would really need to think about it carefully. From past experience, I know that our Japanese users are very sensitive to any BC breaks involving Japanese text encodings. 😉
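The two readings of 0x0F discussed in this thread can be compared with a toy decoder. This is only a sketch under stated assumptions: `decodeSketch()` and `utf8()` are hypothetical helpers, not mbstring's implementation, and they handle only the single-byte modes (ASCII, JIS X 0201 Latin, JIS X 0201 kana), ignoring JIS X 0208 entirely.

```php
<?php

// Encode a code point below U+10000 as UTF-8 (enough for this sketch).
function utf8(int $cp): string
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xE0 | ($cp >> 12))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

/**
 * $siRestoresPrevious = false: SI always returns to ASCII mode.
 * $siRestoresPrevious = true:  SI returns to the set active before SO.
 */
function decodeSketch(string $bytes, bool $siRestoresPrevious): string
{
    $mode = 'ascii';     // strings start in ASCII mode
    $beforeSo = 'ascii'; // set that was active when SO was seen
    $out = '';
    $len = strlen($bytes);

    for ($i = 0; $i < $len; $i++) {
        $b = ord($bytes[$i]);
        $seq = substr($bytes, $i, 3);

        if ($seq === "\x1b(B") { $mode = 'ascii'; $i += 2; continue; }
        if ($seq === "\x1b(J") { $mode = 'latin'; $i += 2; continue; }
        if ($b === 0x0e) { // SO: start of a kana section
            if ($mode !== 'kana') { $beforeSo = $mode; }
            $mode = 'kana';
            continue;
        }
        if ($b === 0x0f) { // SI: end of a kana section
            $mode = $siRestoresPrevious ? $beforeSo : 'ascii';
            continue;
        }

        if ($mode === 'kana' && $b >= 0x21 && $b <= 0x5f) {
            $out .= utf8(0xFF61 + $b - 0x21); // half-width katakana block
        } elseif ($mode === 'latin' && $b === 0x5c) {
            $out .= utf8(0x00A5);             // yen sign
        } elseif ($mode === 'latin' && $b === 0x7e) {
            $out .= utf8(0x203E);             // overline
        } else {
            $out .= chr($b);
        }
    }
    return $out;
}

$input = hex2bin('1b284a5c0e0f5c1b2842');
echo bin2hex(decodeSketch($input, false)), "\n"; // c2a55c   ("¥\")
echo bin2hex(decodeSketch($input, true)), "\n";  // c2a5c2a5 ("¥¥")
```

With the flag off, the sketch reproduces the output in the report; with it on, it produces the output the reporter expected. Which of the two a real decoder should choose is exactly the BC question raised above.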

@pakutoma
Contributor Author

Hmm, it certainly seems that way when I read the web page @youkidearitai sent me.
I think that, under the JIS X 0201 specification, SO/SI switch between character sets within JIS X 0201, but I don't know how they are supposed to work in ISO-2022-JP.
Reading IETF RFC 1468, there is no mention of JIS X 0201, and I don't think there are many ISO-2022-JP implementations that use SO/SI to begin with.
It might be better not to change it carelessly.
Thank you!
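For callers who want to avoid the ambiguity rather than depend on either reading, one cautious option is to check for SO/SI before converting. This is a rough sketch; `usesShiftCodes()` is a hypothetical helper name. It relies on the fact that in 7-bit ISO-2022-JP data the bytes 0x0e and 0x0f are not part of escape sequences or two-byte characters, so a plain byte scan is sufficient.

```php
<?php

// Returns true when the input contains SO (0x0e) or SI (0x0f).
function usesShiftCodes(string $bytes): bool
{
    return strpbrk($bytes, "\x0e\x0f") !== false;
}

$input = hex2bin('1b284a5c0e0f5c1b2842');

if (usesShiftCodes($input)) {
    // How to proceed is application-specific: reject, log, or strip the codes.
    echo "input uses SO/SI; conversion of later bytes may be ambiguous\n";
}
```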
