From b7db073b2318fe9340ef9f9caa81166636174b0c Mon Sep 17 00:00:00 2001 From: animalize Date: Sat, 14 Sep 2019 17:26:56 +0800 Subject: [PATCH 01/19] overhaul Error Handlers section in codecs documentation * Some handlers were wrongly described as text-encoding only, but actually they can also be used in text-decoding. * Add more description to each handler. * Add two REPL examples. * Add indexes for Error Handler's name. --- Doc/library/codecs.rst | 161 +++++++++++------- .../2019-09-12-08-28-17.bpo-38056.6ktYkc.rst | 1 + 2 files changed, 101 insertions(+), 61 deletions(-) create mode 100644 Misc/NEWS.d/next/Documentation/2019-09-12-08-28-17.bpo-38056.6ktYkc.rst diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst index f071057293eece..5bdf3119eab4e9 100644 --- a/Doc/library/codecs.rst +++ b/Doc/library/codecs.rst @@ -290,58 +290,55 @@ codec will handle encoding and decoding errors. Error Handlers ^^^^^^^^^^^^^^ -To simplify and standardize error handling, -codecs may implement different error handling schemes by -accepting the *errors* string argument. The following string values are -defined and implemented by all standard Python codecs: +To simplify and standardize error handling, codecs may implement different +error handling schemes by accepting the *errors* string argument: -.. tabularcolumns:: |l|L| - -+-------------------------+-----------------------------------------------+ -| Value | Meaning | -+=========================+===============================================+ -| ``'strict'`` | Raise :exc:`UnicodeError` (or a subclass); | -| | this is the default. Implemented in | -| | :func:`strict_errors`. | -+-------------------------+-----------------------------------------------+ -| ``'ignore'`` | Ignore the malformed data and continue | -| | without further notice. Implemented in | -| | :func:`ignore_errors`. 
| -+-------------------------+-----------------------------------------------+ - -The following error handlers are only applicable to -:term:`text encodings <text encoding>`: + >>> 'ß ♬'.encode(encoding='ascii', errors='backslashreplace') + b'\\xdf \\u266c' + >>> 'ß ♬'.encode(encoding='ascii', errors='xmlcharrefreplace') + b'&#223; &#9836;' .. index:: + pair: strict; error handler's name + pair: ignore; error handler's name + pair: replace; error handler's name + pair: backslashreplace; error handler's name + pair: surrogateescape; error handler's name single: ? (question mark); replacement character single: \ (backslash); escape sequence single: \x; escape sequence single: \u; escape sequence single: \U; escape sequence - single: \N; escape sequence + +The following string values can be used with all Python built-in codecs: + +.. tabularcolumns:: |l|L| +-------------------------+-----------------------------------------------+ | Value | Meaning | +=========================+===============================================+ -| ``'replace'`` | Replace with a suitable replacement | -| | marker; Python will use the official | -| | ``U+FFFD`` REPLACEMENT CHARACTER for the | -| | built-in codecs on decoding, and '?' on | -| | encoding. Implemented in | -| | :func:`replace_errors`. | +| ``'strict'`` | Raise :exc:`UnicodeError` (or a subclass), | +| | this is the default. Implemented in | +| | :func:`strict_errors`. | +-------------------------+-----------------------------------------------+ -| ``'xmlcharrefreplace'`` | Replace with the appropriate XML character | -| | reference (only for encoding). Implemented | -| | in :func:`xmlcharrefreplace_errors`. | +| ``'ignore'`` | Ignore the malformed data and continue without| +| | further notice. Implemented in | +| | :func:`ignore_errors`. | ++-------------------------+-----------------------------------------------+ +| ``'replace'`` | Replace with a replacement marker. On | +| | encoding, use ``?`` (ASCII character). 
On | +| | decoding, use ``U+FFFD`` (the official | +| | REPLACEMENT CHARACTER). Implemented in | +| | :func:`replace_errors`. | +-------------------------+-----------------------------------------------+ | ``'backslashreplace'`` | Replace with backslashed escape sequences. | +| | On encoding, use hexadecimal form of Unicode | +| | code point with formats ``\xhh`` ``\uxxxx`` | +| | ``\Uxxxxxxxx``. On decoding, use hexadecimal | +| | form of byte value with format ``\xhh``. | | | Implemented in | | | :func:`backslashreplace_errors`. | +-------------------------+-----------------------------------------------+ -| ``'namereplace'`` | Replace with ``\N{...}`` escape sequences | -| | (only for encoding). Implemented in | -| | :func:`namereplace_errors`. | -+-------------------------+-----------------------------------------------+ | ``'surrogateescape'`` | On decoding, replace byte with individual | | | surrogate code ranging from ``U+DC80`` to | | | ``U+DCFF``. This code will then be turned | @@ -351,6 +348,31 @@ The following error handlers are only applicable to | | more.) | +-------------------------+-----------------------------------------------+ +.. index:: + pair: xmlcharrefreplace; error handler's name + pair: namereplace; error handler's name + single: \N; escape sequence + +The following error handlers are only applicable to +:term:`text encodings `: + ++-------------------------+-----------------------------------------------+ +| Value | Meaning | ++=========================+===============================================+ +| ``'xmlcharrefreplace'`` | Replace with XML/HTML numeric character | +| | reference, which is a decimal form of Unicode | +| | code point with format ``&#num;`` Implemented | +| | in :func:`xmlcharrefreplace_errors`. 
| ++-------------------------+-----------------------------------------------+ +| ``'namereplace'`` | Replace with ``\N{...}`` escape sequences, | +| | what appears in the brace is the Name property| +| | from Unicode Character Database. Implemented | +| | in :func:`namereplace_errors`. | ++-------------------------+-----------------------------------------------+ + +.. index:: + pair: surrogatepass; error handler's name + In addition, the following error handler is specific to the given codecs: +-------------------+------------------------+-------------------------------------------+ @@ -365,13 +387,14 @@ In addition, the following error handler is specific to the given codecs: The ``'surrogateescape'`` and ``'surrogatepass'`` error handlers. .. versionchanged:: 3.4 - The ``'surrogatepass'`` error handlers now works with utf-16\* and utf-32\* codecs. + The ``'surrogatepass'`` error handler now works with utf-16\* and utf-32\* + codecs. .. versionadded:: 3.5 The ``'namereplace'`` error handler. .. versionchanged:: 3.5 - The ``'backslashreplace'`` error handlers now works with decoding and + The ``'backslashreplace'`` error handler now works with decoding and translating. The set of allowed values can be extended by registering a new named error @@ -414,42 +437,58 @@ functions: .. function:: strict_errors(exception) - Implements the ``'strict'`` error handling: each encoding or - decoding error raises a :exc:`UnicodeError`. + Implements the ``'strict'`` error handling. + + Each encoding or decoding error raises a :exc:`UnicodeError`. + + +.. function:: ignore_errors(exception) + + Implements the ``'ignore'`` error handling. + + Malformed data is ignored and encoding or decoding is continued without + further notice. .. 
function:: replace_errors(exception) - Implements the ``'replace'`` error handling (for :term:`text encodings - ` only): substitutes ``'?'`` for encoding errors - (to be encoded by the codec), and ``'\ufffd'`` (the Unicode replacement - character) for decoding errors. + Implements the ``'replace'`` error handling. + Substitutes ``?`` (ASCII character) for encoding errors, or ``U+FFFD`` (the + official REPLACEMENT CHARACTER) for decoding errors. -.. function:: ignore_errors(exception) - Implements the ``'ignore'`` error handling: malformed data is ignored and - encoding or decoding is continued without further notice. +.. function:: backslashreplace_errors(exception) + + Implements the ``'backslashreplace'`` error handling. + + Malformed data is replaced by a backslashed escape sequence. + On encoding, use hexadecimal form of Unicode code point with formats + ``\xhh`` ``\uxxxx`` ``\Uxxxxxxxx``. On decoding, use hexadecimal form of + byte value with format ``\xhh``. + + .. versionchanged:: 3.5 + now works with decoding and translating. .. function:: xmlcharrefreplace_errors(exception) Implements the ``'xmlcharrefreplace'`` error handling (for encoding with - :term:`text encodings ` only): the - unencodable character is replaced by an appropriate XML character reference. + :term:`text encodings ` only). + The unencodable character is replaced by an appropriate XML/HTML numeric + character reference, which is a decimal form of Unicode code point with + format ``&#num;`` -.. function:: backslashreplace_errors(exception) - - Implements the ``'backslashreplace'`` error handling (for - :term:`text encodings ` only): malformed data is - replaced by a backslashed escape sequence. .. function:: namereplace_errors(exception) Implements the ``'namereplace'`` error handling (for encoding with - :term:`text encodings ` only): the - unencodable character is replaced by a ``\N{...}`` escape sequence. + :term:`text encodings ` only). 
+ + The unencodable character is replaced by a ``\N{...}`` escape sequence, + what appears in the brace is the Name property from Unicode Character + Database. .. versionadded:: 3.5 @@ -463,7 +502,7 @@ The base :class:`Codec` class defines these methods which also define the function interfaces of the stateless encoder and decoder: -.. method:: Codec.encode(input[, errors]) +.. method:: Codec.encode(input[, errors='strict']) Encodes the object *input* and returns a tuple (output object, length consumed). For instance, :term:`text encoding` converts @@ -481,7 +520,7 @@ function interfaces of the stateless encoder and decoder: of the output object type in this situation. -.. method:: Codec.decode(input[, errors]) +.. method:: Codec.decode(input[, errors='strict']) Decodes the object *input* and returns a tuple (output object, length consumed). For instance, for a :term:`text encoding`, decoding converts @@ -548,7 +587,7 @@ define in order to be compatible with the Python codec registry. object. - .. method:: encode(object[, final]) + .. method:: encode(object[, final=False]) Encodes *object* (taking the current state of the encoder into account) and returns the resulting encoded object. If this is the last call to @@ -605,7 +644,7 @@ define in order to be compatible with the Python codec registry. object. - .. method:: decode(object[, final]) + .. method:: decode(object[, final=False]) Decodes *object* (taking the current state of the decoder into account) and returns the resulting decoded object. If this is the last call to @@ -738,7 +777,7 @@ compatible with the Python codec registry. :func:`register_error`. - .. method:: read([size[, chars, [firstline]]]) + .. method:: read([size=-1[, chars=-1, [firstline=False]]]) Decodes data from the stream and returns the resulting object. @@ -764,7 +803,7 @@ compatible with the Python codec registry. available on the stream, these should be read too. - .. method:: readline([size[, keepends]]) + .. 
method:: readline([size=None[, keepends=True]]) Read one line from the input stream and return the decoded data. @@ -775,7 +814,7 @@ compatible with the Python codec registry. returned. - .. method:: readlines([sizehint[, keepends]]) + .. method:: readlines([sizehint=None[, keepends=True]]) Read all lines available on the input stream and return them as a list of lines. diff --git a/Misc/NEWS.d/next/Documentation/2019-09-12-08-28-17.bpo-38056.6ktYkc.rst b/Misc/NEWS.d/next/Documentation/2019-09-12-08-28-17.bpo-38056.6ktYkc.rst new file mode 100644 index 00000000000000..bfbdf3242f31b6 --- /dev/null +++ b/Misc/NEWS.d/next/Documentation/2019-09-12-08-28-17.bpo-38056.6ktYkc.rst @@ -0,0 +1 @@ +Overhaul :ref:`error-handlers` section in :mod:`codecs` module documentation. From 96cc186b6402c8cdb32fd1b9a237e828c069b455 Mon Sep 17 00:00:00 2001 From: animalize Date: Sat, 14 Sep 2019 21:40:06 +0800 Subject: [PATCH 02/19] September 17th --- Doc/glossary.rst | 11 ++++++++++- Doc/library/codecs.rst | 31 ++++++++++++++++--------------- 2 files changed, 26 insertions(+), 16 deletions(-) diff --git a/Doc/glossary.rst b/Doc/glossary.rst index e601e8b3698410..0e2bb73712a177 100644 --- a/Doc/glossary.rst +++ b/Doc/glossary.rst @@ -1044,7 +1044,16 @@ Glossary as :keyword:`if`, :keyword:`while` or :keyword:`for`. text encoding - A codec which encodes Unicode strings to bytes. + Strings are stored internally as sequences of Unicode code points in + range ``0x0``--``0x10FFFF``. Once a string object is used outside of CPU + and memory, how these arrays are stored as bytes become an issue. + + Serializing a string into a sequence of bytes is known as "encoding", and + recreating the string from the sequence of bytes is known as "decoding". + + There are a variety of different text serialization + :ref:`codecs `, which are collectivity referred to as + "text encodings". text file A :term:`file object` able to read and write :class:`str` objects. 
diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst index 5bdf3119eab4e9..c4f9efee398f27 100644 --- a/Doc/library/codecs.rst +++ b/Doc/library/codecs.rst @@ -23,10 +23,10 @@ This module defines base classes for standard Python codecs (encoders and decoders) and provides access to the internal Python codec registry, which manages the codec and error handling lookup process. Most standard codecs -are :term:`text encodings <text encoding>`, which encode text to bytes, -but there are also codecs provided that encode text to text, and bytes to -bytes. Custom codecs may encode and decode between arbitrary types, but some -module features are restricted to use specifically with +are :term:`text encodings <text encoding>`, which encode text to bytes (and +reverse), but there are also codecs provided that encode text to text, and +bytes to bytes. Custom codecs may encode and decode between arbitrary types, +but some module features are restricted to use specifically with :term:`text encodings <text encoding>`, or with codecs that encode to :class:`bytes`. @@ -293,10 +293,10 @@ Error Handlers To simplify and standardize error handling, codecs may implement different error handling schemes by accepting the *errors* string argument: - >>> 'ß ♬'.encode(encoding='ascii', errors='backslashreplace') - b'\\xdf \\u266c' - >>> 'ß ♬'.encode(encoding='ascii', errors='xmlcharrefreplace') - b'&#223; &#9836;' + >>> 'German ß, ♬'.encode(encoding='ascii', errors='backslashreplace') + b'German \\xdf, \\u266c' + >>> 'German ß, ♬'.encode(encoding='ascii', errors='xmlcharrefreplace') + b'German &#223;, &#9836;' .. index:: pair: strict; error handler's name @@ -310,7 +310,8 @@ error handling schemes by accepting the *errors* string argument: single: \u; escape sequence single: \U; escape sequence -The following string values can be used with all Python built-in codecs: +The following error handlers can be used with all :ref:`standard-encodings` +codecs: .. 
tabularcolumns:: |l|L| @@ -353,8 +354,8 @@ The following string values can be used with all Python built-in codecs: pair: namereplace; error handler's name single: \N; escape sequence -The following error handlers are only applicable to -:term:`text encodings `: +The following error handlers are only applicable to encoding (within +:term:`text encodings `): +-------------------------+-----------------------------------------------+ | Value | Meaning | @@ -473,8 +474,8 @@ functions: .. function:: xmlcharrefreplace_errors(exception) - Implements the ``'xmlcharrefreplace'`` error handling (for encoding with - :term:`text encodings ` only). + Implements the ``'xmlcharrefreplace'`` error handling (for encoding within + :term:`text encoding` only). The unencodable character is replaced by an appropriate XML/HTML numeric character reference, which is a decimal form of Unicode code point with @@ -483,8 +484,8 @@ functions: .. function:: namereplace_errors(exception) - Implements the ``'namereplace'`` error handling (for encoding with - :term:`text encodings ` only). + Implements the ``'namereplace'`` error handling (for encoding within + :term:`text encoding` only). The unencodable character is replaced by a ``\N{...}`` escape sequence, what appears in the brace is the Name property from Unicode Character From 96cd2f8249907d822f7d7f1d978bbbafe6c71aaf Mon Sep 17 00:00:00 2001 From: animalize Date: Wed, 18 Sep 2019 09:15:41 +0800 Subject: [PATCH 03/19] rework the first paragraph of term "text encoding" --- Doc/glossary.rst | 6 +++--- Doc/library/codecs.rst | 2 +- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/Doc/glossary.rst b/Doc/glossary.rst index 0e2bb73712a177..03b1ffc511e991 100644 --- a/Doc/glossary.rst +++ b/Doc/glossary.rst @@ -1044,9 +1044,9 @@ Glossary as :keyword:`if`, :keyword:`while` or :keyword:`for`. text encoding - Strings are stored internally as sequences of Unicode code points in - range ``0x0``--``0x10FFFF``. 
Once a string object is used outside of CPU - and memory, how these arrays are stored as bytes become an issue. + A string in Python is a sequence of Unicode code points (in range + ``0x0``--``0x10FFFF``). To store or transfer a string, it needs to be + serialized as a sequence of bytes. Serializing a string into a sequence of bytes is known as "encoding", and recreating the string from the sequence of bytes is known as "decoding". diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst index c4f9efee398f27..f4d0f8ab83dfbb 100644 --- a/Doc/library/codecs.rst +++ b/Doc/library/codecs.rst @@ -469,7 +469,7 @@ functions: byte value with format ``\xhh``. .. versionchanged:: 3.5 - now works with decoding and translating. + Now works with decoding and translating. .. function:: xmlcharrefreplace_errors(exception) From 8f6da1467886862f6be84c58422e970b217fa70b Mon Sep 17 00:00:00 2001 From: animalize Date: Thu, 19 Sep 2019 12:54:37 +0800 Subject: [PATCH 04/19] Apply suggestions from code review Co-Authored-By: Kyle Stanley --- Doc/glossary.rst | 2 +- Doc/library/codecs.rst | 20 +++++++++---------- .../2019-09-12-08-28-17.bpo-38056.6ktYkc.rst | 2 +- 3 files changed, 12 insertions(+), 12 deletions(-) diff --git a/Doc/glossary.rst b/Doc/glossary.rst index 03b1ffc511e991..3b1c5cafc81002 100644 --- a/Doc/glossary.rst +++ b/Doc/glossary.rst @@ -1052,7 +1052,7 @@ Glossary recreating the string from the sequence of bytes is known as "decoding". There are a variety of different text serialization - :ref:`codecs `, which are collectivity referred to as + :ref:`codecs `, which are collectively referred to as "text encodings". 
text file diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst index f4d0f8ab83dfbb..c0bc274e397a4c 100644 --- a/Doc/library/codecs.rst +++ b/Doc/library/codecs.rst @@ -24,10 +24,10 @@ This module defines base classes for standard Python codecs (encoders and decoders) and provides access to the internal Python codec registry, which manages the codec and error handling lookup process. Most standard codecs are :term:`text encodings `, which encode text to bytes (and -reverse), but there are also codecs provided that encode text to text, and +the reverse), but there are also codecs provided that encode text to text, and bytes to bytes. Custom codecs may encode and decode between arbitrary types, -but some module features are restricted to use specifically with -:term:`text encodings `, or with codecs that encode to +but some module features are restricted to be used specifically with + :term:`text encodings ` or with codecs that encode to :class:`bytes`. The module defines the following functions for encoding and decoding with @@ -310,8 +310,8 @@ error handling schemes by accepting the *errors* string argument: single: \u; escape sequence single: \U; escape sequence -The following error handlers can be used with all :ref:`standard-encodings` -codecs: +The following error handlers can be used with all Python +:ref:`standard-encodings` codecs: .. tabularcolumns:: |l|L| @@ -447,7 +447,7 @@ functions: Implements the ``'ignore'`` error handling. - Malformed data is ignored and encoding or decoding is continued without + Malformed data is ignored; encoding or decoding is continued without further notice. @@ -455,7 +455,7 @@ functions: Implements the ``'replace'`` error handling. - Substitutes ``?`` (ASCII character) for encoding errors, or ``U+FFFD`` (the + Substitutes ``?`` (ASCII character) for encoding errors or ``U+FFFD`` (the official REPLACEMENT CHARACTER) for decoding errors. 
@@ -464,12 +464,12 @@ functions: Implements the ``'backslashreplace'`` error handling. Malformed data is replaced by a backslashed escape sequence. - On encoding, use hexadecimal form of Unicode code point with formats - ``\xhh`` ``\uxxxx`` ``\Uxxxxxxxx``. On decoding, use hexadecimal form of + On encoding, use the hexadecimal form of Unicode code point with formats + ``\xhh`` ``\uxxxx`` ``\Uxxxxxxxx``. On decoding, use the hexadecimal form of byte value with format ``\xhh``. .. versionchanged:: 3.5 - Now works with decoding and translating. + Works with decoding and translating. .. function:: xmlcharrefreplace_errors(exception) diff --git a/Misc/NEWS.d/next/Documentation/2019-09-12-08-28-17.bpo-38056.6ktYkc.rst b/Misc/NEWS.d/next/Documentation/2019-09-12-08-28-17.bpo-38056.6ktYkc.rst index bfbdf3242f31b6..2e6b70fd84b6d9 100644 --- a/Misc/NEWS.d/next/Documentation/2019-09-12-08-28-17.bpo-38056.6ktYkc.rst +++ b/Misc/NEWS.d/next/Documentation/2019-09-12-08-28-17.bpo-38056.6ktYkc.rst @@ -1 +1 @@ -Overhaul :ref:`error-handlers` section in :mod:`codecs` module documentation. +Overhaul the :ref:`error-handlers` documentation in :mod:`codecs`. From 9a6d3781803b3e076f442e41670c97322d8858ea Mon Sep 17 00:00:00 2001 From: animalize Date: Thu, 19 Sep 2019 13:58:03 +0800 Subject: [PATCH 05/19] fix unexpected indentation --- Doc/library/codecs.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst index c0bc274e397a4c..7f36b18389477f 100644 --- a/Doc/library/codecs.rst +++ b/Doc/library/codecs.rst @@ -27,7 +27,7 @@ are :term:`text encodings `, which encode text to bytes (and the reverse), but there are also codecs provided that encode text to text, and bytes to bytes. 
Custom codecs may encode and decode between arbitrary types, but some module features are restricted to be used specifically with - :term:`text encodings ` or with codecs that encode to +:term:`text encodings ` or with codecs that encode to :class:`bytes`. The module defines the following functions for encoding and decoding with @@ -479,7 +479,7 @@ functions: The unencodable character is replaced by an appropriate XML/HTML numeric character reference, which is a decimal form of Unicode code point with - format ``&#num;`` + format ``&#num;`` . .. function:: namereplace_errors(exception) From f9a082a2d4c652951a72ae1a32dd71372c815d6b Mon Sep 17 00:00:00 2001 From: animalize Date: Fri, 20 Sep 2019 10:36:02 +0800 Subject: [PATCH 06/19] namereplace Co-Authored-By: Kyle Stanley --- Doc/library/codecs.rst | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst index 7f36b18389477f..67bdefa5657e07 100644 --- a/Doc/library/codecs.rst +++ b/Doc/library/codecs.rst @@ -366,9 +366,9 @@ The following error handlers are only applicable to encoding (within | | in :func:`xmlcharrefreplace_errors`. | +-------------------------+-----------------------------------------------+ | ``'namereplace'`` | Replace with ``\N{...}`` escape sequences, | -| | what appears in the brace is the Name property| -| | from Unicode Character Database. Implemented | -| | in :func:`namereplace_errors`. | +| | what appears in the braces is the Name | +| | property from Unicode Character Database. | +| | Implemented in :func:`namereplace_errors`. | +-------------------------+-----------------------------------------------+ .. index:: @@ -487,9 +487,10 @@ functions: Implements the ``'namereplace'`` error handling (for encoding within :term:`text encoding` only). - The unencodable character is replaced by a ``\N{...}`` escape sequence, - what appears in the brace is the Name property from Unicode Character - Database. 
+ The unencodable character is replaced by a ``\N{...}`` escape sequence. The + set of characters that appear in the braces is the Name property from + Unicode Character Database. For example, the German lowercase letter ``"ß"`` + will be converted to byte sequence ``\N{LATIN SMALL LETTER SHARP S}`` . .. versionadded:: 3.5 From e8844a835168aa266a340a8d190ab6ad1d43c4d1 Mon Sep 17 00:00:00 2001 From: animalize Date: Fri, 20 Sep 2019 12:10:27 +0800 Subject: [PATCH 07/19] remove Python "2.5" version --- Doc/library/codecs.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst index 67bdefa5657e07..a3620f23f70419 100644 --- a/Doc/library/codecs.rst +++ b/Doc/library/codecs.rst @@ -988,7 +988,7 @@ encoding was used for encoding a string. Each charmap encoding can decode any random byte sequence. However that's not possible with UTF-8, as UTF-8 byte sequences have a structure that doesn't allow arbitrary byte sequences. To increase the reliability with which a UTF-8 encoding can be -detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls +detected, Microsoft invented a variant of UTF-8 (that Python calls ``"utf-8-sig"``) for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: ``0xef``, ``0xbb``, ``0xbf``) is written. 
As it's rather improbable From 503400bbc5d51e13a8f96431485b5a7b1fbdc0b7 Mon Sep 17 00:00:00 2001 From: animalize Date: Sun, 29 Sep 2019 19:35:51 +0800 Subject: [PATCH 08/19] clarify the description of `surrogatepass` --- Doc/library/codecs.rst | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst index a3620f23f70419..d1d99d9e08d2d1 100644 --- a/Doc/library/codecs.rst +++ b/Doc/library/codecs.rst @@ -379,9 +379,11 @@ In addition, the following error handler is specific to the given codecs: +-------------------+------------------------+-------------------------------------------+ | Value | Codecs | Meaning | +===================+========================+===========================================+ -|``'surrogatepass'``| utf-8, utf-16, utf-32, | Allow encoding and decoding of surrogate | -| | utf-16-be, utf-16-le, | codes. These codecs normally treat the | -| | utf-32-be, utf-32-le | presence of surrogates as an error. | +|``'surrogatepass'``| utf-8, utf-16, utf-32, | Allow encoding and decoding Surrogate code| +| | utf-16-be, utf-16-le, | point (``U+D800`` - ``U+DFFF``) as normal | +| | utf-32-be, utf-32-le | code point. Otherwise these codecs treat | +| | | the presence of Surrogate code point in | +| | | :class:`str` as an error. | +-------------------+------------------------+-------------------------------------------+ .. versionadded:: 3.1 @@ -489,7 +491,7 @@ functions: The unencodable character is replaced by a ``\N{...}`` escape sequence. The set of characters that appear in the braces is the Name property from - Unicode Character Database. For example, the German lowercase letter ``"ß"`` + Unicode Character Database. For example, the German lowercase letter ``'ß'`` will be converted to byte sequence ``\N{LATIN SMALL LETTER SHARP S}`` . .. 
versionadded:: 3.5 From d81e5b19f4ae1fb1f24c7f1d1c9b48889560f752 Mon Sep 17 00:00:00 2001 From: animalize Date: Wed, 2 Oct 2019 09:24:33 +0800 Subject: [PATCH 09/19] improve replace description --- Doc/library/codecs.rst | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst index d1d99d9e08d2d1..ff9be6b2f5a049 100644 --- a/Doc/library/codecs.rst +++ b/Doc/library/codecs.rst @@ -328,7 +328,7 @@ The following error handlers can be used with all Python +-------------------------+-----------------------------------------------+ | ``'replace'`` | Replace with a replacement marker. On | | | encoding, use ``?`` (ASCII character). On | -| | decoding, use ``U+FFFD`` (the official | +| | decoding, use ``�`` (U+FFFD, the official | | | REPLACEMENT CHARACTER). Implemented in | | | :func:`replace_errors`. | +-------------------------+-----------------------------------------------+ @@ -457,8 +457,8 @@ functions: Implements the ``'replace'`` error handling. - Substitutes ``?`` (ASCII character) for encoding errors or ``U+FFFD`` (the - official REPLACEMENT CHARACTER) for decoding errors. + Substitutes ``?`` (ASCII character) for encoding errors or ``�`` (U+FFFD, + the official REPLACEMENT CHARACTER) for decoding errors. .. function:: backslashreplace_errors(exception) From deafda35eb334b23faaf9b43ca170fc6cb2f8fc1 Mon Sep 17 00:00:00 2001 From: Ma Lin Date: Thu, 17 Oct 2019 09:10:44 +0800 Subject: [PATCH 10/19] Update Doc/glossary.rst Co-Authored-By: Victor Stinner --- Doc/glossary.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Doc/glossary.rst b/Doc/glossary.rst index 3b1c5cafc81002..c047cf15076e02 100644 --- a/Doc/glossary.rst +++ b/Doc/glossary.rst @@ -1045,7 +1045,7 @@ Glossary text encoding A string in Python is a sequence of Unicode code points (in range - ``0x0``--``0x10FFFF``). To store or transfer a string, it needs to be + ``U+0000``--``U+10FFFF``). 
To store or transfer a string, it needs to be serialized as a sequence of bytes. Serializing a string into a sequence of bytes is known as "encoding", and

From 28e20754122a9c8f702615a6d6b0caa6cb02cfe5 Mon Sep 17 00:00:00 2001
From: animalize
Date: Thu, 17 Oct 2019 09:18:56 +0800
Subject: [PATCH 11/19] (and the reverse) => (and decode bytes to text)

---
 Doc/library/codecs.rst | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst
index ff9be6b2f5a049..828be6790cc554 100644
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -24,9 +24,9 @@ This module defines base classes for standard Python codecs (encoders and
 decoders) and provides access to the internal Python codec registry, which
 manages the codec and error handling lookup process. Most standard codecs
 are :term:`text encodings <text encoding>`, which encode text to bytes (and
-the reverse), but there are also codecs provided that encode text to text, and
-bytes to bytes. Custom codecs may encode and decode between arbitrary types,
-but some module features are restricted to be used specifically with
+decode bytes to text), but there are also codecs provided that encode text to
+text, and bytes to bytes. Custom codecs may encode and decode between arbitrary
+types, but some module features are restricted to be used specifically with
 :term:`text encodings <text encoding>` or with codecs that encode to
 :class:`bytes`.

@@ -909,7 +909,7 @@ Encodings and Unicode
 ---------------------

 Strings are stored internally as sequences of code points in
-range ``0x0``--``0x10FFFF``. (See :pep:`393` for
+range ``U+0000``--``U+10FFFF``. (See :pep:`393` for
 more details about the implementation.)
 Once a string object is used outside of CPU and memory, endianness
 and how these arrays are stored as bytes become an issue. As with other

From ee2bc200d4897bad9bb4625449ec0b82a3c186b7 Mon Sep 17 00:00:00 2001
From: Ma Lin
Date: Mon, 9 May 2022 07:44:05 +0800
Subject: [PATCH 12/19] Surrogate -> surrogate

Co-authored-by: Jelle Zijlstra
---
 Doc/library/codecs.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst
index 096a2f8b06662b..99db6a76ee9c51 100644
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -389,10 +389,10 @@ In addition, the following error handler is specific to the given codecs:
 +-------------------+------------------------+-------------------------------------------+
 | Value             | Codecs                 | Meaning                                   |
 +===================+========================+===========================================+
-|``'surrogatepass'``| utf-8, utf-16, utf-32, | Allow encoding and decoding Surrogate code|
+|``'surrogatepass'``| utf-8, utf-16, utf-32, | Allow encoding and decoding surrogate code|
 |                   | utf-16-be, utf-16-le,  | point (``U+D800`` - ``U+DFFF``) as normal |
 |                   | utf-32-be, utf-32-le   | code point. Otherwise these codecs treat  |
-|                   |                        | the presence of Surrogate code point in   |
+|                   |                        | the presence of surrogate code point in   |
 |                   |                        | :class:`str` as an error.                 |
 +-------------------+------------------------+-------------------------------------------+


From 31158f30ecf2936b4bbda96ff3b86a6382a9142e Mon Sep 17 00:00:00 2001
From: Ma Lin
Date: Mon, 9 May 2022 07:46:50 +0800
Subject: [PATCH 13/19] Update Doc/library/codecs.rst

Co-authored-by: Jelle Zijlstra
---
 Doc/library/codecs.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst
index 99db6a76ee9c51..3d2ffde17ba5f1 100644
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -516,7 +516,7 @@ The base :class:`Codec` class defines these methods which also define the
 function interfaces of the stateless encoder and decoder:


-.. method:: Codec.encode(input[, errors='strict'])
+.. method:: Codec.encode(input, errors='strict')

    Encodes the object *input* and returns a tuple (output object, length consumed).
    For instance, :term:`text encoding` converts

From 5656c3d8593d0bc3eeb180baf5846a0679d916e9 Mon Sep 17 00:00:00 2001
From: Ma Lin
Date: Mon, 9 May 2022 07:46:56 +0800
Subject: [PATCH 14/19] Update Doc/library/codecs.rst

Co-authored-by: Jelle Zijlstra
---
 Doc/library/codecs.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst
index 3d2ffde17ba5f1..920f0a29db826a 100644
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -601,7 +601,7 @@ define in order to be compatible with the Python codec registry.
    object.


-   .. method:: encode(object[, final=False])
+   .. method:: encode(object, final=False)

       Encodes *object* (taking the current state of the encoder into account)
       and returns the resulting encoded object. If this is the last call to

From 078815311b5dce182cf7d52634c890200d816b70 Mon Sep 17 00:00:00 2001
From: Ma Lin
Date: Mon, 9 May 2022 07:47:01 +0800
Subject: [PATCH 15/19] Update Doc/library/codecs.rst

Co-authored-by: Jelle Zijlstra
---
 Doc/library/codecs.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst
index 920f0a29db826a..1e30407521cd48 100644
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -534,7 +534,7 @@ function interfaces of the stateless encoder and decoder:
    of the output object type in this situation.


-.. method:: Codec.decode(input[, errors='strict'])
+.. method:: Codec.decode(input, errors='strict')

    Decodes the object *input* and returns a tuple (output object, length
    consumed). For instance, for a :term:`text encoding`, decoding converts

From 5ef2131e0c3969db9c8c75b358027d7e234d1bdc Mon Sep 17 00:00:00 2001
From: Ma Lin
Date: Mon, 9 May 2022 07:47:12 +0800
Subject: [PATCH 16/19] Update Doc/library/codecs.rst

Co-authored-by: Jelle Zijlstra
---
 Doc/library/codecs.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst
index 1e30407521cd48..424dc23099da00 100644
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -792,7 +792,7 @@ compatible with the Python codec registry.
    :func:`register_error`.


-   .. method:: read([size=-1[, chars=-1, [firstline=False]]])
+   .. method:: read(size=-1, chars=-1, firstline=False)

       Decodes data from the stream and returns the resulting object.


From 85a3021f9bc4244de5248a116f037a36eafc2f7e Mon Sep 17 00:00:00 2001
From: Ma Lin
Date: Mon, 9 May 2022 07:47:23 +0800
Subject: [PATCH 17/19] Update Doc/library/codecs.rst

Co-authored-by: Jelle Zijlstra
---
 Doc/library/codecs.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst
index 424dc23099da00..22797d4fc77b0f 100644
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -829,7 +829,7 @@ compatible with the Python codec registry.
       returned.


-   .. method:: readlines([sizehint=None[, keepends=True]])
+   .. method:: readlines(sizehint=None, keepends=True)

       Read all lines available on the input stream and return them as a list of
       lines.
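
Reviewer note, not part of the patch series: patches 12 to 15 touch the ``'surrogatepass'`` table entry and the stateless ``Codec.encode``/``Codec.decode`` signatures. The following is a minimal sketch of the behaviour those passages describe, assuming the built-in ``utf-8`` codec obtained through ``codecs.lookup``. All variable names are illustrative only.

```python
import codecs

# Stateless codec functions return a tuple: (output object, length consumed).
info = codecs.lookup("utf-8")
encoded, chars_consumed = info.encode("ß ♬", "strict")
decoded, bytes_consumed = info.decode(encoded, "strict")

# A lone surrogate code point is rejected by utf-8 under the default handler.
lone_surrogate = "\ud800"
try:
    lone_surrogate.encode("utf-8")
    strict_rejected = False
except UnicodeEncodeError:
    strict_rejected = True

# With 'surrogatepass' the same code point is encoded and decoded as a normal one.
passed = lone_surrogate.encode("utf-8", errors="surrogatepass")
round_trip = passed.decode("utf-8", errors="surrogatepass")
```

The length reported by the encoder counts input characters, while the length reported by the decoder counts input bytes, which is why the two values differ for non-ASCII text.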
From 434de51679a71dc5a2069883a8b770bbf911e256 Mon Sep 17 00:00:00 2001
From: Ma Lin
Date: Mon, 9 May 2022 07:53:12 +0800
Subject: [PATCH 18/19] Update Doc/library/codecs.rst

Co-authored-by: Jelle Zijlstra
---
 Doc/library/codecs.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst
index 22797d4fc77b0f..80a026852a8955 100644
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -658,7 +658,7 @@ define in order to be compatible with the Python codec registry.
    object.


-   .. method:: decode(object[, final=False])
+   .. method:: decode(object, final=False)

       Decodes *object* (taking the current state of the decoder into account)
       and returns the resulting decoded object. If this is the last call to

From 912933fd80672e7aee887e5bd18f5a7381f3882f Mon Sep 17 00:00:00 2001
From: Ma Lin
Date: Mon, 9 May 2022 07:53:25 +0800
Subject: [PATCH 19/19] Update Doc/library/codecs.rst

Co-authored-by: Jelle Zijlstra
---
 Doc/library/codecs.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst
index 80a026852a8955..d131408175fd16 100644
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -818,7 +818,7 @@ compatible with the Python codec registry.
       available on the stream, these should be read too.


-   .. method:: readline([size=None[, keepends=True]])
+   .. method:: readline(size=None, keepends=True)

       Read one line from the input stream and return the decoded data.

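
Reviewer note, not part of the patch series: the last two patches rewrite the signatures of the incremental ``decode(object, final=False)`` and of ``StreamReader.readline(size=None, keepends=True)``. The following is a small sketch of how those keyword arguments behave, assuming the built-in ``utf-8`` codec and an in-memory ``io.BytesIO`` stream. The variable names are made up for illustration.

```python
import codecs
import io

# Incremental decoding: an incomplete multi-byte sequence is buffered
# until the remaining bytes arrive; final=True marks the last call.
decoder = codecs.getincrementaldecoder("utf-8")()
note_bytes = "♬".encode("utf-8")  # three bytes
first_part = decoder.decode(note_bytes[:2])
second_part = decoder.decode(note_bytes[2:], final=True)

# StreamReader.readline: keepends controls whether the line ending is kept.
stream = io.BytesIO("one\ntwo\n".encode("utf-8"))
reader = codecs.getreader("utf-8")(stream)
with_ending = reader.readline()
without_ending = reader.readline(keepends=False)
```

Because the defaults are now spelled out in the signatures, both arguments can be passed by keyword, as the sketch does.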