bpo-29240: PEP 540: Add a new UTF-8 mode #855

Merged
merged 6 commits into from
Dec 13, 2017
13 changes: 11 additions & 2 deletions Doc/c-api/sys.rst
@@ -127,6 +127,9 @@ Operating System Utilities

.. versionadded:: 3.5

.. versionchanged:: 3.7
The function now uses the UTF-8 encoding in the UTF-8 mode.


.. c:function:: char* Py_EncodeLocale(const wchar_t *text, size_t *error_pos)

@@ -138,19 +141,25 @@ Operating System Utilities
to free the memory. Return ``NULL`` on encoding error or memory allocation
error

-If error_pos is not ``NULL``, ``*error_pos`` is set to the index of the
-invalid character on encoding error, or set to ``(size_t)-1`` otherwise.
+If error_pos is not ``NULL``, ``*error_pos`` is set to ``(size_t)-1`` on
+success, or set to the index of the invalid character on encoding error.

Use the :c:func:`Py_DecodeLocale` function to decode the bytes string back
to a wide character string.

.. versionchanged:: 3.7
The function now uses the UTF-8 encoding in the UTF-8 mode.

.. seealso::

The :c:func:`PyUnicode_EncodeFSDefault` and
:c:func:`PyUnicode_EncodeLocale` functions.

.. versionadded:: 3.5

.. versionchanged:: 3.7
The function now supports the UTF-8 mode.


.. _systemfunctions:

7 changes: 7 additions & 0 deletions Doc/library/locale.rst
@@ -316,6 +316,13 @@ The :mod:`locale` module defines the following exception and functions:
preferences, so this function is not thread-safe. If invoking setlocale is not
necessary or desired, *do_setlocale* should be set to ``False``.

On Android or in the UTF-8 mode (:option:`-X` ``utf8`` option), always
return ``'UTF-8'``; the locale and the *do_setlocale* argument are ignored.

.. versionchanged:: 3.7
The function now always returns ``UTF-8`` on Android or if the UTF-8 mode
is enabled.
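
A minimal sketch of how the documented behaviour could be checked, assuming a Python 3.7+ interpreter is reachable through `sys.executable`. The helper name `preferred_encoding_in_child` is hypothetical and not part of this PR. Because the exact spelling of the returned name (`'UTF-8'` versus `'utf-8'`) may differ between Python versions, the comparison goes through `codecs.lookup`.

```python
import codecs
import subprocess
import sys


def preferred_encoding_in_child(*interpreter_args):
    """Return locale.getpreferredencoding(False) as seen by a child interpreter.

    Hypothetical helper for illustration only; it is not part of the PR.
    """
    cmd = [sys.executable, *interpreter_args, "-c",
           "import locale; print(locale.getpreferredencoding(False))"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()


if __name__ == "__main__":
    name = preferred_encoding_in_child("-X", "utf8")
    # Compare the normalized codec name rather than the raw spelling.
    print(name, codecs.lookup(name).name == "utf-8")
```

Running the child with `-X utf8` keeps the parent process untouched, which matters because the mode is fixed at interpreter startup.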


.. function:: normalize(localename)

13 changes: 12 additions & 1 deletion Doc/library/sys.rst
@@ -313,6 +313,9 @@ always available.
has caught :exc:`SystemExit` (such as an error flushing buffered data
in the standard streams), the exit status is changed to 120.

.. versionchanged:: 3.7
Added ``utf8_mode`` attribute for the new :option:`-X` ``utf8`` flag.


.. data:: flags

@@ -335,6 +338,7 @@ always available.
:const:`quiet` :option:`-q`
:const:`hash_randomization` :option:`-R`
:const:`dev_mode` :option:`-X` ``dev``
:const:`utf8_mode` :option:`-X` ``utf8``
============================= =============================

.. versionchanged:: 3.2
@@ -347,7 +351,8 @@ always available.
Removed obsolete ``division_warning`` attribute.

.. versionchanged:: 3.7
-Added ``dev_mode`` attribute for the new :option:`-X` ``dev`` flag.
+Added ``dev_mode`` attribute for the new :option:`-X` ``dev`` flag
+and ``utf8_mode`` attribute for the new :option:`-X` ``utf8`` flag.


.. data:: float_info
@@ -492,6 +497,8 @@ always available.
:func:`os.fsencode` and :func:`os.fsdecode` should be used to ensure that
the correct encoding and errors mode are used.

* In the UTF-8 mode, the encoding is ``utf-8`` on any platform.

* On Mac OS X, the encoding is ``'utf-8'``.

* On Unix, the encoding is the locale encoding.
@@ -506,6 +513,10 @@ always available.
Windows is no longer guaranteed to return ``'mbcs'``. See :pep:`529`
and :func:`_enablelegacywindowsfsencoding` for more information.

.. versionchanged:: 3.7
Return 'utf-8' in the UTF-8 mode.
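
As a rough illustration of the new bullet and the 3.7 note, the sketch below asks a child interpreter started with `-X utf8` for its filesystem encoding. It assumes Python 3.7 or later; `fs_encoding_in_child` is a hypothetical helper name, not something defined by this PR.

```python
import subprocess
import sys


def fs_encoding_in_child(*interpreter_args):
    """Return sys.getfilesystemencoding() as reported by a child interpreter.

    Hypothetical helper for illustration only.
    """
    cmd = [sys.executable, *interpreter_args, "-c",
           "import sys; print(sys.getfilesystemencoding())"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()


if __name__ == "__main__":
    # With the UTF-8 mode forced on, the platform default is bypassed.
    print(fs_encoding_in_child("-X", "utf8"))
```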


.. function:: getfilesystemencodeerrors()

Return the name of the error mode used to convert between Unicode filenames
13 changes: 12 additions & 1 deletion Doc/using/cmdline.rst
@@ -439,6 +439,9 @@ Miscellaneous options
* Set the :attr:`~sys.flags.dev_mode` attribute of :attr:`sys.flags` to
``True``

* ``-X utf8`` enables the UTF-8 mode, whereas ``-X utf8=0`` disables the
UTF-8 mode.

It also allows passing arbitrary values and retrieving them through the
:data:`sys._xoptions` dictionary.
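
The two spellings of the option can be compared with a short sketch, assuming Python 3.7 or later. It launches child interpreters and reports both `sys.flags.utf8_mode` and the raw `sys._xoptions` entry; the helper `inspect_child` and the JSON hand-off are illustrative choices, not part of this PR.

```python
import json
import subprocess
import sys

# Script run in the child: dump the flag and the raw -X options as JSON.
CHILD_SCRIPT = (
    "import json, sys; "
    "print(json.dumps({'utf8_mode': sys.flags.utf8_mode, "
    "'xoptions': sys._xoptions}))"
)


def inspect_child(*interpreter_args):
    """Return the child's utf8_mode flag and -X options as a dict.

    Hypothetical helper for illustration only.
    """
    cmd = [sys.executable, *interpreter_args, "-c", CHILD_SCRIPT]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


if __name__ == "__main__":
    print(inspect_child("-X", "utf8"))    # option given without a value
    print(inspect_child("-X", "utf8=0"))  # option given with an explicit value
```

An option passed without a value is stored as `True` in `sys._xoptions`, while `utf8=0` is stored as the string `'0'`.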

@@ -455,7 +458,7 @@ Miscellaneous options
The ``-X showalloccount`` option.

.. versionadded:: 3.7
-The ``-X importtime`` and ``-X dev`` options.
+The ``-X importtime``, ``-X dev`` and ``-X utf8`` options.


Options you shouldn't use
@@ -816,6 +819,14 @@ conflict.

.. versionadded:: 3.7

.. envvar:: PYTHONUTF8

If set to ``1``, enable the UTF-8 mode. If set to ``0``, disable the UTF-8
mode. Any other non-empty string causes an error.

.. versionadded:: 3.7
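
A small sketch of the environment variable in use, assuming Python 3.7 or later and that no `-X utf8` option is passed to the child (the command-line option would take precedence). The helper `utf8_mode_with_env` is hypothetical and only illustrates the `1` and `0` values described above.

```python
import os
import subprocess
import sys


def utf8_mode_with_env(value):
    """Return sys.flags.utf8_mode of a child started with PYTHONUTF8=value.

    Hypothetical helper for illustration only.
    """
    env = dict(os.environ)      # copy so the parent environment is untouched
    env["PYTHONUTF8"] = value
    cmd = [sys.executable, "-c", "import sys; print(sys.flags.utf8_mode)"]
    result = subprocess.run(cmd, env=env, capture_output=True, text=True,
                            check=True)
    return int(result.stdout)


if __name__ == "__main__":
    print(utf8_mode_with_env("1"))
    print(utf8_mode_with_env("0"))
```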


Debug-mode variables
~~~~~~~~~~~~~~~~~~~~

21 changes: 21 additions & 0 deletions Doc/whatsnew/3.7.rst
@@ -185,6 +185,23 @@ resolution on Linux and Windows.
PEP written and implemented by Victor Stinner


PEP 540: Add a new UTF-8 mode
-----------------------------

Add a new UTF-8 mode to ignore the locale, use the UTF-8 encoding, and change
:data:`sys.stdin` and :data:`sys.stdout` error handlers to ``surrogateescape``.
This mode is enabled by default in the POSIX locale, but otherwise disabled by
default.

The new :option:`-X` ``utf8`` command line option and :envvar:`PYTHONUTF8`
environment variable are added to control the UTF-8 mode.

.. seealso::

:pep:`540` -- Add a new UTF-8 mode
PEP written and implemented by Victor Stinner
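
The error-handler change described in this section can be observed with a sketch like the following, assuming Python 3.7 or later. `PYTHONIOENCODING` is removed from the child's environment because it would override the handlers; the helper name `stream_error_handlers` is hypothetical.

```python
import os
import subprocess
import sys


def stream_error_handlers(*interpreter_args):
    """Return the (stdin, stdout, stderr) error handlers of a child interpreter.

    Hypothetical helper for illustration only.
    """
    env = dict(os.environ)
    env.pop("PYTHONIOENCODING", None)  # would otherwise override the handlers
    cmd = [sys.executable, *interpreter_args, "-c",
           "import sys; "
           "print(sys.stdin.errors, sys.stdout.errors, sys.stderr.errors)"]
    result = subprocess.run(cmd, env=env, stdin=subprocess.DEVNULL,
                            capture_output=True, text=True, check=True)
    return tuple(result.stdout.split())


if __name__ == "__main__":
    print(stream_error_handlers("-X", "utf8"))
```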


New Development Mode: -X dev
----------------------------

@@ -353,6 +370,10 @@ Added another argument *monetary* in :meth:`format_string` of :mod:`locale`.
If *monetary* is true, the conversion uses monetary thousands separator and
grouping strings. (Contributed by Garvit in :issue:`10379`.)

The :func:`locale.getpreferredencoding` function now always returns ``'UTF-8'``
on Android or in the UTF-8 mode (:option:`-X` ``utf8`` option), the locale and
the *do_setlocale* argument are ignored.

math
----

4 changes: 4 additions & 0 deletions Include/fileobject.h
@@ -28,6 +28,10 @@ PyAPI_DATA(const char *) Py_FileSystemDefaultEncodeErrors;
#endif
PyAPI_DATA(int) Py_HasFileSystemDefaultEncoding;

#if !defined(Py_LIMITED_API) || Py_LIMITED_API+0 >= 0x03070000
PyAPI_DATA(int) Py_UTF8Mode;
[Review comment from a Member] Is this an important piece of information? Is it so important that it has to be in the limited API? What would I use this for?

#endif

/* Internal API

The std printer acts as a preliminary sys.stderr until the new io
1 change: 1 addition & 0 deletions Include/pystate.h
@@ -38,6 +38,7 @@ typedef struct {
int show_alloc_count; /* -X showalloccount */
int dump_refs; /* PYTHONDUMPREFS */
int malloc_stats; /* PYTHONMALLOCSTATS */
int utf8_mode; /* -X utf8 or PYTHONUTF8 environment variable */
} _PyCoreConfig;

#define _PyCoreConfig_INIT (_PyCoreConfig){.use_hash_seed = -1}
6 changes: 6 additions & 0 deletions Lib/_bootlocale.py
@@ -9,6 +9,8 @@

 if sys.platform.startswith("win"):
     def getpreferredencoding(do_setlocale=True):
+        if sys.flags.utf8_mode:
+            return 'UTF-8'
         return _locale._getdefaultlocale()[1]
 else:
     try:
@@ -21,13 +23,17 @@ def getpreferredencoding(do_setlocale=True):
                 return 'UTF-8'
         else:
             def getpreferredencoding(do_setlocale=True):
+                if sys.flags.utf8_mode:
+                    return 'UTF-8'
                 # This path for legacy systems needs the more complex
                 # getdefaultlocale() function, import the full locale module.
                 import locale
                 return locale.getpreferredencoding(do_setlocale)
     else:
         def getpreferredencoding(do_setlocale=True):
             assert not do_setlocale
+            if sys.flags.utf8_mode:
+                return 'UTF-8'
             result = _locale.nl_langinfo(_locale.CODESET)
             if not result and sys.platform == 'darwin':
                 # nl_langinfo can return an empty string
5 changes: 3 additions & 2 deletions Lib/encodings/__init__.py
@@ -158,8 +158,9 @@ def search_function(encoding):
 if sys.platform == 'win32':
     def _alias_mbcs(encoding):
         try:
-            import _bootlocale
-            if encoding == _bootlocale.getpreferredencoding(False):
+            import _winapi
+            ansi_code_page = "cp%s" % _winapi.GetACP()
+            if encoding == ansi_code_page:
                 import encodings.mbcs
                 return encodings.mbcs.getregentry()
         except ImportError:
6 changes: 6 additions & 0 deletions Lib/locale.py
@@ -617,6 +617,8 @@ def resetlocale(category=LC_ALL):
     # On Win32, this will return the ANSI code page
     def getpreferredencoding(do_setlocale = True):
         """Return the charset that the user is likely using."""
+        if sys.flags.utf8_mode:
+            return 'UTF-8'
         import _bootlocale
         return _bootlocale.getpreferredencoding(False)
 else:
@@ -634,6 +636,8 @@ def getpreferredencoding(do_setlocale = True):
             def getpreferredencoding(do_setlocale = True):
                 """Return the charset that the user is likely using,
                 by looking at environment variables."""
+                if sys.flags.utf8_mode:
+                    return 'UTF-8'
                 res = getdefaultlocale()[1]
                 if res is None:
                     # LANG not set, default conservatively to ASCII
@@ -643,6 +647,8 @@ def getpreferredencoding(do_setlocale = True):
         def getpreferredencoding(do_setlocale = True):
             """Return the charset that the user is likely using,
             according to the system configuration."""
+            if sys.flags.utf8_mode:
+                return 'UTF-8'
             import _bootlocale
             if do_setlocale:
                 oldloc = setlocale(LC_CTYPE)
2 changes: 1 addition & 1 deletion Lib/subprocess.py
@@ -280,7 +280,7 @@ def _args_from_interpreter_flags():
     if dev_mode:
         args.extend(('-X', 'dev'))
     for opt in ('faulthandler', 'tracemalloc', 'importtime',
-                'showalloccount', 'showrefcount'):
+                'showalloccount', 'showrefcount', 'utf8'):
         if opt in xoptions:
             value = xoptions[opt]
             if value is True:
1 change: 1 addition & 0 deletions Lib/test/test_builtin.py
@@ -1022,6 +1022,7 @@ def test_open(self):
         self.assertRaises(ValueError, open, 'a\x00b')
         self.assertRaises(ValueError, open, b'a\x00b')

+    @unittest.skipIf(sys.flags.utf8_mode, "utf-8 mode is enabled")
     def test_open_default_encoding(self):
         old_environ = dict(os.environ)
         try:
2 changes: 1 addition & 1 deletion Lib/test/test_c_locale_coercion.py
@@ -130,7 +130,7 @@ def get_child_details(cls, env_vars):
         that.
         """
         result, py_cmd = run_python_until_end(
-            "-c", cls.CHILD_PROCESS_SCRIPT,
+            "-X", "utf8=0", "-c", cls.CHILD_PROCESS_SCRIPT,
             __isolated=True,
             **env_vars
         )
10 changes: 2 additions & 8 deletions Lib/test/test_codecs.py
@@ -5,6 +5,7 @@
 import sys
 import unittest
 import encodings
+from unittest import mock

 from test import support

@@ -3180,16 +3181,9 @@ def test_incremental(self):
     def test_mbcs_alias(self):
         # Check that looking up our 'default' codepage will return
         # mbcs when we don't have a more specific one available
-        import _bootlocale
-        def _get_fake_codepage(*a):
-            return 'cp123'
-        old_getpreferredencoding = _bootlocale.getpreferredencoding
-        _bootlocale.getpreferredencoding = _get_fake_codepage
-        try:
+        with mock.patch('_winapi.GetACP', return_value=123):
             codec = codecs.lookup('cp123')
             self.assertEqual(codec.name, 'mbcs')
-        finally:
-            _bootlocale.getpreferredencoding = old_getpreferredencoding


 class ASCIITest(unittest.TestCase):
2 changes: 2 additions & 0 deletions Lib/test/test_io.py
@@ -2580,6 +2580,7 @@ def test_reconfigure_line_buffering(self):
         t.reconfigure(line_buffering=None)
         self.assertEqual(t.line_buffering, True)

+    @unittest.skipIf(sys.flags.utf8_mode, "utf-8 mode is enabled")
     def test_default_encoding(self):
         old_environ = dict(os.environ)
         try:
@@ -2599,6 +2600,7 @@ def test_default_encoding(self):
             os.environ.update(old_environ)

     @support.cpython_only
+    @unittest.skipIf(sys.flags.utf8_mode, "utf-8 mode is enabled")
     def test_device_encoding(self):
         # Issue 15989
         import _testcapi
8 changes: 5 additions & 3 deletions Lib/test/test_sys.py
@@ -527,14 +527,16 @@ def test_sys_flags(self):
                  "inspect", "interactive", "optimize", "dont_write_bytecode",
                  "no_user_site", "no_site", "ignore_environment", "verbose",
                  "bytes_warning", "quiet", "hash_randomization", "isolated",
-                 "dev_mode")
+                 "dev_mode", "utf8_mode")
         for attr in attrs:
             self.assertTrue(hasattr(sys.flags, attr), attr)
             attr_type = bool if attr == "dev_mode" else int
             self.assertEqual(type(getattr(sys.flags, attr)), attr_type, attr)
         self.assertTrue(repr(sys.flags))
         self.assertEqual(len(sys.flags), len(attrs))

+        self.assertIn(sys.flags.utf8_mode, {0, 1, 2})
+
     def assert_raise_on_new_sys_type(self, sys_attr):
         # Users are intentionally prevented from creating new instances of
         # sys.flags, sys.version_info, and sys.getwindowsversion.
@@ -710,8 +712,8 @@ def test_c_locale_surrogateescape(self):
         # have no any effect
         out = self.c_locale_get_error_handler(encoding=':')
         self.assertEqual(out,
-                         'stdin: surrogateescape\n'
-                         'stdout: surrogateescape\n'
+                         'stdin: strict\n'
+                         'stdout: strict\n'
                          'stderr: backslashreplace\n')
         out = self.c_locale_get_error_handler(encoding='')
         self.assertEqual(out,