Commit 91106cd

bpo-29240: PEP 540: Add a new UTF-8 Mode (#855)
* Add -X utf8 command line option, PYTHONUTF8 environment variable and a new sys.flags.utf8_mode flag.
* If the LC_CTYPE locale is "C" at startup: enable automatically the UTF-8 mode.
* Add _winapi.GetACP(). encodings._alias_mbcs() now calls _winapi.GetACP() to get the ANSI code page.
* locale.getpreferredencoding() now returns 'UTF-8' in the UTF-8 mode. As a side effect, open() now uses the UTF-8 encoding by default in this mode.
* Py_DecodeLocale() and Py_EncodeLocale() now use the UTF-8 encoding in the UTF-8 Mode.
* Update subprocess._args_from_interpreter_flags() to handle -X utf8.
* Skip some tests relying on the current locale if the UTF-8 mode is enabled.
* Add test_utf8mode.py.
* _Py_DecodeUTF8_surrogateescape() gets a new optional parameter to return also the length (number of wide characters).
* pymain_get_global_config() and pymain_set_global_config() now always copy flag values, rather than only copying if the new value is greater than the old value.
1 parent c3e070f commit 91106cd
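
As a rough illustration of the behaviour listed in the commit message, the sketch below reads the values that the UTF-8 mode is meant to influence in the running interpreter. It assumes Python 3.7 or later; `utf8_mode_report` is a hypothetical helper name and is not part of the commit.

```python
import codecs
import locale
import sys


def utf8_mode_report():
    """Collect the values that the commit ties to the UTF-8 mode."""
    return {
        "utf8_mode": sys.flags.utf8_mode,
        "preferred": locale.getpreferredencoding(False),
        "filesystem": sys.getfilesystemencoding(),
    }


report = utf8_mode_report()
if report["utf8_mode"]:
    # In the UTF-8 mode the preferred encoding should resolve to the
    # UTF-8 codec, whatever spelling the interpreter reports.
    assert codecs.lookup(report["preferred"]).name == "utf-8"
print(report)
```

Comparing through `codecs.lookup()` avoids depending on the exact spelling (`'UTF-8'` or `'utf-8'`) of the returned name.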

27 files changed: +593 -178 lines changed

Doc/c-api/sys.rst

Lines changed: 11 additions & 2 deletions
@@ -127,6 +127,9 @@ Operating System Utilities
 
    .. versionadded:: 3.5
 
+   .. versionchanged:: 3.7
+      The function now uses the UTF-8 encoding in the UTF-8 mode.
+
 
 
 .. c:function:: char* Py_EncodeLocale(const wchar_t *text, size_t *error_pos)
@@ -138,19 +141,25 @@ Operating System Utilities
    to free the memory. Return ``NULL`` on encoding error or memory allocation
    error
 
-   If error_pos is not ``NULL``, ``*error_pos`` is set to the index of the
-   invalid character on encoding error, or set to ``(size_t)-1`` otherwise.
+   If error_pos is not ``NULL``, ``*error_pos`` is set to ``(size_t)-1`` on
+   success, or set to the index of the invalid character on encoding error.
 
    Use the :c:func:`Py_DecodeLocale` function to decode the bytes string back
    to a wide character string.
 
+   .. versionchanged:: 3.7
+      The function now uses the UTF-8 encoding in the UTF-8 mode.
+
    .. seealso::
 
       The :c:func:`PyUnicode_EncodeFSDefault` and
       :c:func:`PyUnicode_EncodeLocale` functions.
 
    .. versionadded:: 3.5
 
+   .. versionchanged:: 3.7
+      The function now supports the UTF-8 mode.
+
 
 .. _systemfunctions:
 
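
The documented `error_pos` contract and the decode/encode round trip can be approximated at the Python level. The sketch below is only an analogue of the C functions, under the assumption that the `surrogateescape` error handler is what carries undecodable bytes; it does not call `Py_DecodeLocale()` or `Py_EncodeLocale()` themselves.

```python
# Valid UTF-8 for "café-" followed by one byte that is not valid UTF-8.
raw = b"caf\xc3\xa9-\xff"

# surrogateescape maps the undecodable byte 0xFF to the lone surrogate U+DCFF.
text = raw.decode("utf-8", "surrogateescape")
assert text == "caf\u00e9-\udcff"

# Encoding with the same handler restores the original bytes.
assert text.encode("utf-8", "surrogateescape") == raw

# A strict encode fails instead; exc.start plays a role similar to *error_pos
# by giving the index of the offending character.
try:
    text.encode("utf-8")
except UnicodeEncodeError as exc:
    error_index = exc.start
else:
    error_index = -1  # loosely mirrors (size_t)-1 on success

print(error_index)
```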

Doc/library/locale.rst

Lines changed: 7 additions & 0 deletions
@@ -316,6 +316,13 @@ The :mod:`locale` module defines the following exception and functions:
    preferences, so this function is not thread-safe. If invoking setlocale is not
    necessary or desired, *do_setlocale* should be set to ``False``.
 
+   On Android or in the UTF-8 mode (:option:`-X` ``utf8`` option), always
+   return ``'UTF-8'``, the locale and the *do_setlocale* argument are ignored.
+
+   .. versionchanged:: 3.7
+      The function now always returns ``UTF-8`` on Android or if the UTF-8 mode
+      is enabled.
+
 
 .. function:: normalize(localename)
 

Doc/library/sys.rst

Lines changed: 12 additions & 1 deletion
@@ -313,6 +313,9 @@ always available.
    has caught :exc:`SystemExit` (such as an error flushing buffered data
    in the standard streams), the exit status is changed to 120.
 
+   .. versionchanged:: 3.7
+      Added ``utf8_mode`` attribute for the new :option:`-X` ``utf8`` flag.
+
 
 .. data:: flags
 
@@ -335,6 +338,7 @@ always available.
    :const:`quiet`                :option:`-q`
    :const:`hash_randomization`   :option:`-R`
    :const:`dev_mode`             :option:`-X` ``dev``
+   :const:`utf8_mode`            :option:`-X` ``utf8``
    ============================= =============================
 
    .. versionchanged:: 3.2
@@ -347,7 +351,8 @@ always available.
       Removed obsolete ``division_warning`` attribute.
 
    .. versionchanged:: 3.7
-      Added ``dev_mode`` attribute for the new :option:`-X` ``dev`` flag.
+      Added ``dev_mode`` attribute for the new :option:`-X` ``dev`` flag
+      and ``utf8_mode`` attribute for the new :option:`-X` ``utf8`` flag.
 
 
 .. data:: float_info
@@ -492,6 +497,8 @@ always available.
    :func:`os.fsencode` and :func:`os.fsdecode` should be used to ensure that
    the correct encoding and errors mode are used.
 
+   * In the UTF-8 mode, the encoding is ``utf-8`` on any platform.
+
    * On Mac OS X, the encoding is ``'utf-8'``.
 
    * On Unix, the encoding is the locale encoding.
@@ -506,6 +513,10 @@ always available.
       Windows is no longer guaranteed to return ``'mbcs'``. See :pep:`529`
       and :func:`_enablelegacywindowsfsencoding` for more information.
 
+   .. versionchanged:: 3.7
+      Return 'utf-8' in the UTF-8 mode.
+
+
 .. function:: getfilesystemencodeerrors()
 
    Return the name of the error mode used to convert between Unicode filenames
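
The documented effect on `sys.getfilesystemencoding()` can be observed from a child interpreter started with `-X utf8`. This is a minimal sketch assuming Python 3.7 or later; `child_output` is a hypothetical helper name.

```python
import subprocess
import sys


def child_output(*python_args):
    """Run a child interpreter and return its stripped standard output."""
    result = subprocess.run(
        [sys.executable, *python_args],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.strip()


# Ask a child started in the UTF-8 mode for its filesystem encoding.
code = "import sys; print(sys.getfilesystemencoding())"
fs_encoding = child_output("-X", "utf8", "-c", code)
print(fs_encoding)
```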

Doc/using/cmdline.rst

Lines changed: 12 additions & 1 deletion
@@ -439,6 +439,9 @@ Miscellaneous options
      * Set the :attr:`~sys.flags.dev_mode` attribute of :attr:`sys.flags` to
        ``True``
 
+   * ``-X utf8`` enables the UTF-8 mode, whereas ``-X utf8=0`` disables the
+     UTF-8 mode.
+
    It also allows passing arbitrary values and retrieving them through the
    :data:`sys._xoptions` dictionary.
 
@@ -455,7 +458,7 @@ Miscellaneous options
       The ``-X showalloccount`` option.
 
    .. versionadded:: 3.7
-      The ``-X importtime`` and ``-X dev`` options.
+      The ``-X importtime``, ``-X dev`` and ``-X utf8`` options.
 
 
 Options you shouldn't use
@@ -816,6 +819,14 @@ conflict.
 
    .. versionadded:: 3.7
 
+.. envvar:: PYTHONUTF8
+
+   If set to ``1``, enable the UTF-8 mode. If set to ``0``, disable the UTF-8
+   mode. Any other non-empty string cause an error.
+
+   .. versionadded:: 3.7
+
+
 Debug-mode variables
 ~~~~~~~~~~~~~~~~~~~~
 
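
The two switches documented in this file, `-X utf8` / `-X utf8=0` and `PYTHONUTF8`, can be exercised from a parent process. The sketch below assumes Python 3.7 or later; `utf8_mode_in_child` is a hypothetical helper, and any inherited `PYTHONUTF8` value is dropped so that each call tests one switch at a time.

```python
import os
import subprocess
import sys


def utf8_mode_in_child(xoption=None, env_value=None):
    """Return sys.flags.utf8_mode as seen by a freshly started interpreter."""
    env = dict(os.environ)
    env.pop("PYTHONUTF8", None)          # start from a known state
    if env_value is not None:
        env["PYTHONUTF8"] = env_value
    cmd = [sys.executable]
    if xoption is not None:
        cmd += ["-X", xoption]
    cmd += ["-c", "import sys; print(sys.flags.utf8_mode)"]
    out = subprocess.run(cmd, env=env, stdout=subprocess.PIPE, check=True).stdout
    return int(out.decode("ascii").strip())


print(utf8_mode_in_child(xoption="utf8"))      # enabled on the command line
print(utf8_mode_in_child(xoption="utf8=0"))    # disabled on the command line
print(utf8_mode_in_child(env_value="1"))       # enabled through the environment
print(utf8_mode_in_child(env_value="0"))       # disabled through the environment
```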

Doc/whatsnew/3.7.rst

Lines changed: 21 additions & 0 deletions
@@ -185,6 +185,23 @@ resolution on Linux and Windows.
       PEP written and implemented by Victor Stinner
 
 
+PEP 540: Add a new UTF-8 mode
+-----------------------------
+
+Add a new UTF-8 mode to ignore the locale, use the UTF-8 encoding, and change
+:data:`sys.stdin` and :data:`sys.stdout` error handlers to ``surrogateescape``.
+This mode is enabled by default in the POSIX locale, but otherwise disabled by
+default.
+
+The new :option:`-X` ``utf8`` command line option and :envvar:`PYTHONUTF8`
+environment variable are added to control the UTF-8 mode.
+
+.. seealso::
+
+   :pep:`540` -- Add a new UTF-8 mode
+      PEP written and implemented by Victor Stinner
+
+
 New Development Mode: -X dev
 ----------------------------
 
@@ -353,6 +370,10 @@ Added another argument *monetary* in :meth:`format_string` of :mod:`locale`.
 If *monetary* is true, the conversion uses monetary thousands separator and
 grouping strings. (Contributed by Garvit in :issue:`10379`.)
 
+The :func:`locale.getpreferredencoding` function now always returns ``'UTF-8'``
+on Android or in the UTF-8 mode (:option:`-X` ``utf8`` option), the locale and
+the *do_setlocale* argument are ignored.
+
 math
 ----
 

Include/fileobject.h

Lines changed: 4 additions & 0 deletions
@@ -28,6 +28,10 @@ PyAPI_DATA(const char *) Py_FileSystemDefaultEncodeErrors;
 #endif
 PyAPI_DATA(int) Py_HasFileSystemDefaultEncoding;
 
+#if !defined(Py_LIMITED_API) || Py_LIMITED_API+0 >= 0x03070000
+PyAPI_DATA(int) Py_UTF8Mode;
+#endif
+
 /* Internal API
 
    The std printer acts as a preliminary sys.stderr until the new io

Include/pystate.h

Lines changed: 1 addition & 0 deletions
@@ -38,6 +38,7 @@ typedef struct {
     int show_alloc_count;   /* -X showalloccount */
     int dump_refs;          /* PYTHONDUMPREFS */
     int malloc_stats;       /* PYTHONMALLOCSTATS */
+    int utf8_mode;          /* -X utf8 or PYTHONUTF8 environment variable */
 } _PyCoreConfig;
 
 #define _PyCoreConfig_INIT (_PyCoreConfig){.use_hash_seed = -1}

Lib/_bootlocale.py

Lines changed: 6 additions & 0 deletions
@@ -9,6 +9,8 @@
 
 if sys.platform.startswith("win"):
     def getpreferredencoding(do_setlocale=True):
+        if sys.flags.utf8_mode:
+            return 'UTF-8'
         return _locale._getdefaultlocale()[1]
 else:
     try:
@@ -21,13 +23,17 @@ def getpreferredencoding(do_setlocale=True):
                 return 'UTF-8'
         else:
             def getpreferredencoding(do_setlocale=True):
+                if sys.flags.utf8_mode:
+                    return 'UTF-8'
                 # This path for legacy systems needs the more complex
                 # getdefaultlocale() function, import the full locale module.
                 import locale
                 return locale.getpreferredencoding(do_setlocale)
     else:
         def getpreferredencoding(do_setlocale=True):
             assert not do_setlocale
+            if sys.flags.utf8_mode:
+                return 'UTF-8'
             result = _locale.nl_langinfo(_locale.CODESET)
             if not result and sys.platform == 'darwin':
                 # nl_langinfo can return an empty string

Lib/encodings/__init__.py

Lines changed: 3 additions & 2 deletions
@@ -158,8 +158,9 @@ def search_function(encoding):
 if sys.platform == 'win32':
     def _alias_mbcs(encoding):
         try:
-            import _bootlocale
-            if encoding == _bootlocale.getpreferredencoding(False):
+            import _winapi
+            ansi_code_page = "cp%s" % _winapi.GetACP()
+            if encoding == ansi_code_page:
                 import encodings.mbcs
                 return encodings.mbcs.getregentry()
         except ImportError:

Lib/locale.py

Lines changed: 6 additions & 0 deletions
@@ -617,6 +617,8 @@ def resetlocale(category=LC_ALL):
     # On Win32, this will return the ANSI code page
     def getpreferredencoding(do_setlocale = True):
         """Return the charset that the user is likely using."""
+        if sys.flags.utf8_mode:
+            return 'UTF-8'
         import _bootlocale
         return _bootlocale.getpreferredencoding(False)
 else:
@@ -634,6 +636,8 @@ def getpreferredencoding(do_setlocale = True):
             def getpreferredencoding(do_setlocale = True):
                 """Return the charset that the user is likely using,
                 by looking at environment variables."""
+                if sys.flags.utf8_mode:
+                    return 'UTF-8'
                 res = getdefaultlocale()[1]
                 if res is None:
                     # LANG not set, default conservatively to ASCII
@@ -643,6 +647,8 @@ def getpreferredencoding(do_setlocale = True):
         def getpreferredencoding(do_setlocale = True):
             """Return the charset that the user is likely using,
             according to the system configuration."""
+            if sys.flags.utf8_mode:
+                return 'UTF-8'
             import _bootlocale
             if do_setlocale:
                 oldloc = setlocale(LC_CTYPE)
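
Because `open()` takes its default encoding from `getpreferredencoding()`, the early `return 'UTF-8'` added here should be visible in the bytes a child process writes when it omits `encoding=`. The sketch below assumes Python 3.7 or later and uses a temporary file; the file name `demo.txt` is arbitrary.

```python
import os
import subprocess
import sys
import tempfile

# Child script: write U+00E9 without passing encoding= to open().
CHILD_CODE = (
    "import sys\n"
    "with open(sys.argv[1], 'w') as f:\n"
    "    f.write('\\u00e9')\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")
    # Run the child in the UTF-8 mode so the locale encoding is ignored.
    subprocess.run([sys.executable, "-X", "utf8", "-c", CHILD_CODE, path],
                   check=True)
    with open(path, "rb") as f:
        data = f.read()

print(data)
```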

Lib/subprocess.py

Lines changed: 1 addition & 1 deletion
@@ -280,7 +280,7 @@ def _args_from_interpreter_flags():
     if dev_mode:
         args.extend(('-X', 'dev'))
     for opt in ('faulthandler', 'tracemalloc', 'importtime',
-                'showalloccount', 'showrefcount'):
+                'showalloccount', 'showrefcount', 'utf8'):
         if opt in xoptions:
             value = xoptions[opt]
             if value is True:
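
One way to see the effect of adding `'utf8'` to this tuple is to ask a child started with `-X utf8` which flags it would forward to its own children. The sketch below relies on `subprocess._args_from_interpreter_flags()`, a private helper whose behaviour may change between releases; `forwarded_flags` is a hypothetical name.

```python
import ast
import subprocess
import sys

# Child script: print the flags that subprocess would pass on to a grandchild.
CHILD_CODE = "import subprocess; print(subprocess._args_from_interpreter_flags())"


def forwarded_flags(*python_args):
    """Return the forwarded-flag list computed inside a child interpreter."""
    result = subprocess.run(
        [sys.executable, *python_args, "-c", CHILD_CODE],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    # The child prints the repr of a list of strings; parse it back safely.
    return ast.literal_eval(result.stdout.strip())


flags = forwarded_flags("-X", "utf8")
print(flags)
```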

Lib/test/test_builtin.py

Lines changed: 1 addition & 0 deletions
@@ -1022,6 +1022,7 @@ def test_open(self):
         self.assertRaises(ValueError, open, 'a\x00b')
         self.assertRaises(ValueError, open, b'a\x00b')
 
+    @unittest.skipIf(sys.flags.utf8_mode, "utf-8 mode is enabled")
     def test_open_default_encoding(self):
         old_environ = dict(os.environ)
         try:

Lib/test/test_c_locale_coercion.py

Lines changed: 1 addition & 1 deletion
@@ -130,7 +130,7 @@ def get_child_details(cls, env_vars):
         that.
         """
         result, py_cmd = run_python_until_end(
-            "-c", cls.CHILD_PROCESS_SCRIPT,
+            "-X", "utf8=0", "-c", cls.CHILD_PROCESS_SCRIPT,
             __isolated=True,
             **env_vars
         )

Lib/test/test_codecs.py

Lines changed: 2 additions & 8 deletions
@@ -5,6 +5,7 @@
 import sys
 import unittest
 import encodings
+from unittest import mock
 
 from test import support
 
@@ -3180,16 +3181,9 @@ def test_incremental(self):
     def test_mbcs_alias(self):
         # Check that looking up our 'default' codepage will return
         # mbcs when we don't have a more specific one available
-        import _bootlocale
-        def _get_fake_codepage(*a):
-            return 'cp123'
-        old_getpreferredencoding = _bootlocale.getpreferredencoding
-        _bootlocale.getpreferredencoding = _get_fake_codepage
-        try:
+        with mock.patch('_winapi.GetACP', return_value=123):
             codec = codecs.lookup('cp123')
             self.assertEqual(codec.name, 'mbcs')
-        finally:
-            _bootlocale.getpreferredencoding = old_getpreferredencoding
 
 
 class ASCIITest(unittest.TestCase):
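
The test change above swaps a hand-written save/patch/restore sequence for `unittest.mock`. Since `_winapi` exists only on Windows, the sketch below shows the same pattern against a hypothetical stand-in object (`fake_winapi`) so that it runs on any platform; `ansi_code_page_name` is likewise an illustrative name.

```python
import types
from unittest import mock

# Hypothetical stand-in for the Windows-only _winapi module.
fake_winapi = types.SimpleNamespace(GetACP=lambda: 1252)


def ansi_code_page_name():
    """Build the codec name the same way the patched _alias_mbcs() does."""
    return "cp%s" % fake_winapi.GetACP()


# Inside the block GetACP() returns 123; the original is restored on exit,
# which is what the removed try/finally used to do by hand.
with mock.patch.object(fake_winapi, "GetACP", return_value=123):
    patched_name = ansi_code_page_name()

restored_name = ansi_code_page_name()
print(patched_name, restored_name)
```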

Lib/test/test_io.py

Lines changed: 2 additions & 0 deletions
@@ -2580,6 +2580,7 @@ def test_reconfigure_line_buffering(self):
         t.reconfigure(line_buffering=None)
         self.assertEqual(t.line_buffering, True)
 
+    @unittest.skipIf(sys.flags.utf8_mode, "utf-8 mode is enabled")
     def test_default_encoding(self):
         old_environ = dict(os.environ)
         try:
@@ -2599,6 +2600,7 @@ def test_default_encoding(self):
             os.environ.update(old_environ)
 
     @support.cpython_only
+    @unittest.skipIf(sys.flags.utf8_mode, "utf-8 mode is enabled")
     def test_device_encoding(self):
         # Issue 15989
         import _testcapi

Lib/test/test_sys.py

Lines changed: 5 additions & 3 deletions
@@ -527,14 +527,16 @@ def test_sys_flags(self):
                  "inspect", "interactive", "optimize", "dont_write_bytecode",
                  "no_user_site", "no_site", "ignore_environment", "verbose",
                  "bytes_warning", "quiet", "hash_randomization", "isolated",
-                 "dev_mode")
+                 "dev_mode", "utf8_mode")
         for attr in attrs:
             self.assertTrue(hasattr(sys.flags, attr), attr)
             attr_type = bool if attr == "dev_mode" else int
             self.assertEqual(type(getattr(sys.flags, attr)), attr_type, attr)
         self.assertTrue(repr(sys.flags))
         self.assertEqual(len(sys.flags), len(attrs))
 
+        self.assertIn(sys.flags.utf8_mode, {0, 1, 2})
+
     def assert_raise_on_new_sys_type(self, sys_attr):
         # Users are intentionally prevented from creating new instances of
         # sys.flags, sys.version_info, and sys.getwindowsversion.
@@ -710,8 +712,8 @@ def test_c_locale_surrogateescape(self):
         # have no any effect
         out = self.c_locale_get_error_handler(encoding=':')
         self.assertEqual(out,
-                         'stdin: surrogateescape\n'
-                         'stdout: surrogateescape\n'
+                         'stdin: strict\n'
+                         'stdout: strict\n'
                          'stderr: backslashreplace\n')
         out = self.c_locale_get_error_handler(encoding='')
         self.assertEqual(out,
