bpo-38056: overhaul Error Handlers section in codecs documentation #15732

Merged
merged 20 commits into from May 9, 2022
Changes from 2 commits
11 changes: 10 additions & 1 deletion Doc/glossary.rst
@@ -1044,7 +1044,16 @@ Glossary
as :keyword:`if`, :keyword:`while` or :keyword:`for`.

text encoding
A codec which encodes Unicode strings to bytes.
Strings are stored internally as sequences of Unicode code points in
range ``0x0``--``0x10FFFF``. Once a string object is used outside of CPU
and memory, how these sequences are stored as bytes becomes an issue.

Serializing a string into a sequence of bytes is known as "encoding", and
recreating the string from the sequence of bytes is known as "decoding".

There are a variety of different text serialization
:ref:`codecs <standard-encodings>`, which are collectively referred to as
"text encodings".
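
As a minimal illustration of the encoding and decoding round trip described in this entry (the sample string and the choice of UTF-8 are arbitrary and not part of the patch):

```python
# Encode a str to bytes, then decode the bytes back to the same str.
text = "German ß, ♬"
data = text.encode("utf-8")      # serialize the code points to bytes
restored = data.decode("utf-8")  # recreate the string from the bytes

print(len(text), len(data))  # 11 code points, 14 bytes
print(restored == text)      # True
```

The two lengths differ because UTF-8 uses two bytes for ``ß`` and three for ``♬``.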

text file
A :term:`file object` able to read and write :class:`str` objects.
174 changes: 107 additions & 67 deletions Doc/library/codecs.rst
@@ -23,10 +23,10 @@
This module defines base classes for standard Python codecs (encoders and
decoders) and provides access to the internal Python codec registry, which
manages the codec and error handling lookup process. Most standard codecs
are :term:`text encodings <text encoding>`, which encode text to bytes,
but there are also codecs provided that encode text to text, and bytes to
bytes. Custom codecs may encode and decode between arbitrary types, but some
module features are restricted to use specifically with
are :term:`text encodings <text encoding>`, which encode text to bytes (and
decode bytes to text), but there are also codecs provided that encode text to text, and
bytes to bytes. Custom codecs may encode and decode between arbitrary types,
but some module features are restricted to use specifically with
:term:`text encodings <text encoding>`, or with codecs that encode to
:class:`bytes`.

@@ -290,58 +290,56 @@ codec will handle encoding and decoding errors.
Error Handlers
^^^^^^^^^^^^^^

To simplify and standardize error handling,
codecs may implement different error handling schemes by
accepting the *errors* string argument. The following string values are
defined and implemented by all standard Python codecs:
To simplify and standardize error handling, codecs may implement different
error handling schemes by accepting the *errors* string argument:

.. tabularcolumns:: |l|L|

+-------------------------+-----------------------------------------------+
| Value | Meaning |
+=========================+===============================================+
| ``'strict'`` | Raise :exc:`UnicodeError` (or a subclass); |
| | this is the default. Implemented in |
| | :func:`strict_errors`. |
+-------------------------+-----------------------------------------------+
| ``'ignore'`` | Ignore the malformed data and continue |
| | without further notice. Implemented in |
| | :func:`ignore_errors`. |
+-------------------------+-----------------------------------------------+

The following error handlers are only applicable to
:term:`text encodings <text encoding>`:
>>> 'German ß, ♬'.encode(encoding='ascii', errors='backslashreplace')
b'German \\xdf, \\u266c'
>>> 'German ß, ♬'.encode(encoding='ascii', errors='xmlcharrefreplace')
b'German &#223;, &#9836;'
Member

I like the idea of examples. Would you mind adding an example for each available error handler?

It may be interesting to add an example for surrogatepass which is an uncommon case.

Author

@ghost ghost Oct 17, 2019

Currently there are only two REPL examples, which just demonstrate this sentence:

codecs may implement different error handling schemes by accepting the errors string argument

If we add an example for every available error handler, won't the page become ugly?

surrogatepass is so uncommon that people who need it probably already know it.
IMHO we can just describe surrogatepass clearly. Most readers don't need it, so they don't have to be disturbed by an example.

In addition, a surrogatepass example involves the encoding algorithm, and I'm afraid readers will not see the point from it:

>>> '\uD8AA'.encode(encoding='utf-8', errors='surrogatepass')
b'\xed\xa2\xaa'

Contributor

@aeros aeros Oct 17, 2019

@vstinner

I like the idea of examples. Would you mind adding an example for each available error handler?

I'm in agreement with adding an example for anything commonly utilized, but I don't think we should necessarily add one for all of the error handlers.

It may be interesting to add an example for surrogatepass which is an uncommon case.

IMO, we should try to focus on having examples for the common cases. Code examples can be very helpful, but in excess they can become distracting to readers.

Contributor

I like the idea of example errors as well. I wonder if the table is making this more cluttered.

Perhaps something like:

Error Name
    Definition
    Example

would be more helpful.

Alternatively, a blank line between REPL examples would increase readability.

Author

It seems you all like the examples, I will try to plan a new layout to suit this idea.

My idea about the current change:

If an error handler is very easy to understand, maybe there is no need to give an example.
IMHO strict/ignore/replace are such cases.

Or if an error handler is uncommon, maybe the reader doesn't need to be disturbed by an example; we can just use the text to describe it clearly.
IMHO these are not very common in real code: namereplace/surrogateescape/surrogatepass.

Then only two remain (backslashreplace/xmlcharrefreplace).
These two examples also teach some Unicode knowledge in passing: the value of a code point, and that ß is not an ASCII character.

Contributor

@animalize

If an error handler is very easy to understand, maybe there is no need to give an example.

I think the simple ones could still benefit from an example, just to show the basics of how it works. Even if it's fairly simple, it may not be quite as easy to understand for someone reading over the codecs documentation for the first time.

Or if an error handler is uncommon, maybe the reader doesn't need to be disturbed by an example, we can just use the text to describe it clearly.
IMHO these are not very common in real code: namereplace/surrogateescape/surrogatepass.

Then only two remain (backslashreplace/xmlcharrefreplace).
These two examples also teach some Unicode knowledge in passing: the value of a code point, and that ß is not an ASCII character.

👍

Author

It seems you all like the examples, I will try to plan a new layout to suit this idea.

Sorry, I've been too busy recently, very intense.
I will do this when I have time.


.. index::
pair: strict; error handler's name
pair: ignore; error handler's name
pair: replace; error handler's name
pair: backslashreplace; error handler's name
pair: surrogateescape; error handler's name
single: ? (question mark); replacement character
single: \ (backslash); escape sequence
single: \x; escape sequence
single: \u; escape sequence
single: \U; escape sequence
single: \N; escape sequence

The following error handlers can be used with all :ref:`standard-encodings`
codecs:

.. tabularcolumns:: |l|L|

+-------------------------+-----------------------------------------------+
| Value | Meaning |
+=========================+===============================================+
| ``'replace'`` | Replace with a suitable replacement |
| | marker; Python will use the official |
| | ``U+FFFD`` REPLACEMENT CHARACTER for the |
| | built-in codecs on decoding, and '?' on |
| | encoding. Implemented in |
| | :func:`replace_errors`. |
| ``'strict'`` | Raise :exc:`UnicodeError` (or a subclass); |
| | this is the default. Implemented in |
| | :func:`strict_errors`. |
+-------------------------+-----------------------------------------------+
| ``'xmlcharrefreplace'`` | Replace with the appropriate XML character |
| | reference (only for encoding). Implemented |
| | in :func:`xmlcharrefreplace_errors`. |
| ``'ignore'`` | Ignore the malformed data and continue without|
| | further notice. Implemented in |
| | :func:`ignore_errors`. |
+-------------------------+-----------------------------------------------+
| ``'replace'`` | Replace with a replacement marker. On |
| | encoding, use ``?`` (ASCII character). On |
| | decoding, use ``U+FFFD`` (the official |
| | REPLACEMENT CHARACTER). Implemented in |
| | :func:`replace_errors`. |
+-------------------------+-----------------------------------------------+
| ``'backslashreplace'`` | Replace with backslashed escape sequences. |
| | On encoding, use hexadecimal form of Unicode |
| | code point with formats ``\xhh`` ``\uxxxx`` |
| | ``\Uxxxxxxxx``. On decoding, use hexadecimal |
| | form of byte value with format ``\xhh``. |
| | Implemented in |
| | :func:`backslashreplace_errors`. |
+-------------------------+-----------------------------------------------+
| ``'namereplace'`` | Replace with ``\N{...}`` escape sequences |
| | (only for encoding). Implemented in |
| | :func:`namereplace_errors`. |
+-------------------------+-----------------------------------------------+
| ``'surrogateescape'`` | On decoding, replace byte with individual |
| | surrogate code ranging from ``U+DC80`` to |
| | ``U+DCFF``. This code will then be turned |
@@ -351,6 +349,31 @@ The following error handlers are only applicable to
| | more.) |
+-------------------------+-----------------------------------------------+
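
A short sketch of how the handlers in this table behave with the built-in ``ascii`` and ``utf-8`` codecs. The sample string is borrowed from the REPL example in this patch; the variable names are illustrative only.

```python
text = "German ß, ♬"

# Encoding: each handler treats the two non-ASCII characters differently.
ignored = text.encode("ascii", errors="ignore")            # b'German , '
replaced = text.encode("ascii", errors="replace")          # b'German ?, ?'
escaped = text.encode("ascii", errors="backslashreplace")  # b'German \\xdf, \\u266c'

# 'strict' (the default) raises a UnicodeError subclass instead.
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    first_bad_index = exc.start  # index of the first unencodable character

# Decoding: 'surrogateescape' keeps an undecodable byte as a lone surrogate,
# so encoding with the same handler restores the original bytes.
raw = b"abc\xff"
decoded = raw.decode("utf-8", errors="surrogateescape")    # 'abc\udcff'
round_trip = decoded.encode("utf-8", errors="surrogateescape")

print(ignored, replaced, escaped)
print(first_bad_index, round_trip == raw)
```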

.. index::
pair: xmlcharrefreplace; error handler's name
pair: namereplace; error handler's name
single: \N; escape sequence

The following error handlers are only applicable to encoding (within
:term:`text encodings <text encoding>`):

+-------------------------+-----------------------------------------------+
| Value | Meaning |
+=========================+===============================================+
| ``'xmlcharrefreplace'`` | Replace with XML/HTML numeric character |
| | reference, which is a decimal form of Unicode |
| | code point with format ``&#num;``. Implemented|
| | in :func:`xmlcharrefreplace_errors`. |
+-------------------------+-----------------------------------------------+
| ``'namereplace'`` | Replace with ``\N{...}`` escape sequences; |
| | what appears in the braces is the Name |
| | property from the Unicode Character Database. |
| | Implemented in :func:`namereplace_errors`. |
+-------------------------+-----------------------------------------------+
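
To make the two encoding-only handlers concrete, here is a small sketch using the same sample string as the REPL example in this patch. It only uses ``str.encode`` and the standard ``unicodedata`` module; the variable names are illustrative.

```python
import unicodedata

text = "German ß, ♬"

# 'namereplace' writes the Unicode character name inside \N{...}.
named = text.encode("ascii", errors="namereplace")
names = [unicodedata.name(ch) for ch in "ß♬"]

# 'xmlcharrefreplace' writes the code point in decimal as &#num;.
xml = text.encode("ascii", errors="xmlcharrefreplace")
decimal_refs = ["&#%d;" % ord(ch) for ch in "ß♬"]

print(named)         # b'German \\N{LATIN SMALL LETTER SHARP S}, \\N{BEAMED SIXTEENTH NOTES}'
print(names)         # ['LATIN SMALL LETTER SHARP S', 'BEAMED SIXTEENTH NOTES']
print(xml)           # b'German &#223;, &#9836;'
print(decimal_refs)  # ['&#223;', '&#9836;']
```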

.. index::
pair: surrogatepass; error handler's name

In addition, the following error handler is specific to the given codecs:

+-------------------+------------------------+-------------------------------------------+
@@ -365,13 +388,14 @@ In addition, the following error handler is specific to the given codecs:
The ``'surrogateescape'`` and ``'surrogatepass'`` error handlers.

.. versionchanged:: 3.4
The ``'surrogatepass'`` error handlers now works with utf-16\* and utf-32\* codecs.
The ``'surrogatepass'`` error handler now works with utf-16\* and utf-32\*
codecs.

.. versionadded:: 3.5
The ``'namereplace'`` error handler.

.. versionchanged:: 3.5
The ``'backslashreplace'`` error handlers now works with decoding and
The ``'backslashreplace'`` error handler now works with decoding and
translating.
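
A brief sketch of the decoding side mentioned in this note, assuming Python 3.5 or later; the byte string is an arbitrary example:

```python
# 0xff can never start a valid UTF-8 sequence, so it triggers the handler.
raw = b"abc\xffdef"
text = raw.decode("utf-8", errors="backslashreplace")

print(text)       # abc\xffdef
print(len(text))  # 10: the single bad byte became the four characters \xff
```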

The set of allowed values can be extended by registering a new named error
@@ -414,42 +438,58 @@ functions:

.. function:: strict_errors(exception)

Implements the ``'strict'`` error handling: each encoding or
decoding error raises a :exc:`UnicodeError`.
Implements the ``'strict'`` error handling.

Each encoding or decoding error raises a :exc:`UnicodeError`.

.. function:: replace_errors(exception)

Implements the ``'replace'`` error handling (for :term:`text encodings
<text encoding>` only): substitutes ``'?'`` for encoding errors
(to be encoded by the codec), and ``'\ufffd'`` (the Unicode replacement
character) for decoding errors.
.. function:: ignore_errors(exception)

Implements the ``'ignore'`` error handling.

.. function:: ignore_errors(exception)
Malformed data is ignored and encoding or decoding is continued without
further notice.

Implements the ``'ignore'`` error handling: malformed data is ignored and
encoding or decoding is continued without further notice.

.. function:: replace_errors(exception)

.. function:: xmlcharrefreplace_errors(exception)
Implements the ``'replace'`` error handling.

Implements the ``'xmlcharrefreplace'`` error handling (for encoding with
:term:`text encodings <text encoding>` only): the
unencodable character is replaced by an appropriate XML character reference.
Substitutes ``?`` (ASCII character) for encoding errors, or ``U+FFFD`` (the
official REPLACEMENT CHARACTER) for decoding errors.


.. function:: backslashreplace_errors(exception)

Implements the ``'backslashreplace'`` error handling (for
:term:`text encodings <text encoding>` only): malformed data is
replaced by a backslashed escape sequence.
Implements the ``'backslashreplace'`` error handling.

Malformed data is replaced by a backslashed escape sequence.
On encoding, use the hexadecimal form of the Unicode code point with formats
``\xhh``, ``\uxxxx`` or ``\Uxxxxxxxx``. On decoding, use the hexadecimal form
of the byte value with format ``\xhh``.

.. versionchanged:: 3.5
Now works with decoding and translating.


.. function:: xmlcharrefreplace_errors(exception)

Implements the ``'xmlcharrefreplace'`` error handling (for encoding within
:term:`text encoding` only).

The unencodable character is replaced by an appropriate XML/HTML numeric
character reference, which is the decimal form of the Unicode code point with
format ``&#num;``.


.. function:: namereplace_errors(exception)

Implements the ``'namereplace'`` error handling (for encoding with
:term:`text encodings <text encoding>` only): the
unencodable character is replaced by a ``\N{...}`` escape sequence.
Implements the ``'namereplace'`` error handling (for encoding within
:term:`text encoding` only).

The unencodable character is replaced by a ``\N{...}`` escape sequence;
what appears in the braces is the Name property from the Unicode Character
Database.

.. versionadded:: 3.5
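
The built-in handler functions above all receive the exception instance, and a custom handler registered with :func:`register_error` can follow the same shape. The sketch below is hypothetical: the function ``underscore_errors`` and the handler name ``"underscore"`` are made up for illustration and are not part of the module.

```python
import codecs

def underscore_errors(exc):
    """Hypothetical handler: replace each unencodable character with '_'."""
    if isinstance(exc, UnicodeEncodeError):
        # Return the replacement text and the position at which to resume.
        return ("_" * (exc.end - exc.start), exc.end)
    # This sketch does not handle decoding or translating errors.
    raise exc

codecs.register_error("underscore", underscore_errors)

result = "German ß, ♬".encode("ascii", errors="underscore")
print(result)  # b'German _, _'
```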

@@ -463,7 +503,7 @@ The base :class:`Codec` class defines these methods which also define the
function interfaces of the stateless encoder and decoder:


.. method:: Codec.encode(input[, errors])
.. method:: Codec.encode(input[, errors='strict'])

Encodes the object *input* and returns a tuple (output object, length consumed).
For instance, :term:`text encoding` converts
@@ -481,7 +521,7 @@ function interfaces of the stateless encoder and decoder:
of the output object type in this situation.
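
As a rough illustration of the ``(output object, length consumed)`` return value, the stateless functions of a standard codec can be obtained through ``codecs.lookup()``. The sample string is arbitrary.

```python
import codecs

info = codecs.lookup("utf-8")

# Stateless encode: returns the bytes and the number of characters consumed.
encoded, chars_consumed = info.encode("German ß", "strict")

# Stateless decode: returns the str and the number of bytes consumed.
decoded, bytes_consumed = info.decode(encoded, "strict")

print(encoded, chars_consumed)  # b'German \xc3\x9f' 8
print(decoded, bytes_consumed)  # German ß 9
```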


.. method:: Codec.decode(input[, errors])
.. method:: Codec.decode(input[, errors='strict'])

Decodes the object *input* and returns a tuple (output object, length
consumed). For instance, for a :term:`text encoding`, decoding converts
@@ -548,7 +588,7 @@ define in order to be compatible with the Python codec registry.
object.


.. method:: encode(object[, final])
.. method:: encode(object[, final=False])

Encodes *object* (taking the current state of the encoder into account)
and returns the resulting encoded object. If this is the last call to
@@ -605,7 +645,7 @@ define in order to be compatible with the Python codec registry.
object.
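
A small sketch of why the *final* argument of the incremental ``decode`` method matters, using the standard UTF-8 incremental decoder. The split point is chosen on purpose to cut a two-byte sequence in half.

```python
import codecs

data = "ß".encode("utf-8")  # b'\xc3\x9f', a two-byte sequence

decoder = codecs.getincrementaldecoder("utf-8")()
first = decoder.decode(data[:1])               # '' (incomplete, kept as state)
second = decoder.decode(data[1:], final=True)  # 'ß'

# With final=True, a dangling partial sequence is an error instead.
strict_decoder = codecs.getincrementaldecoder("utf-8")()
try:
    strict_decoder.decode(data[:1], final=True)
    truncated_ok = True
except UnicodeDecodeError:
    truncated_ok = False

print(repr(first), repr(second), truncated_ok)  # '' 'ß' False
```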


.. method:: decode(object[, final])
.. method:: decode(object[, final=False])

Decodes *object* (taking the current state of the decoder into account)
and returns the resulting decoded object. If this is the last call to
@@ -738,7 +778,7 @@ compatible with the Python codec registry.
:func:`register_error`.


.. method:: read([size[, chars, [firstline]]])
.. method:: read([size=-1[, chars=-1, [firstline=False]]])

Decodes data from the stream and returns the resulting object.

@@ -764,7 +804,7 @@ compatible with the Python codec registry.
available on the stream, these should be read too.


.. method:: readline([size[, keepends]])
.. method:: readline([size=None[, keepends=True]])

Read one line from the input stream and return the decoded data.

@@ -775,7 +815,7 @@ compatible with the Python codec registry.
returned.


.. method:: readlines([sizehint[, keepends]])
.. method:: readlines([sizehint=None[, keepends=True]])

Read all lines available on the input stream and return them as a list of
lines.
@@ -0,0 +1 @@
Overhaul :ref:`error-handlers` section in :mod:`codecs` module documentation.