Commit 03aa752
miss-islington and Ma Lin authored
bpo-38056: overhaul Error Handlers section in codecs documentation (GH-15732)
* Some handlers were wrongly described as text-encoding only, but actually they can also be used in text-decoding.
* Add more description to each handler.
* Add two REPL examples.
* Add indexes for Error Handler's name.

Co-authored-by: Kyle Stanley <[email protected]>
Co-authored-by: Victor Stinner <[email protected]>
Co-authored-by: Jelle Zijlstra <[email protected]>
(cherry picked from commit 5bc2390)

Co-authored-by: Ma Lin <[email protected]>
1 parent bf5fc2a commit 03aa752

3 files changed: +127 -74 lines changed

Doc/glossary.rst

Lines changed: 10 additions & 1 deletion
@@ -1072,7 +1072,16 @@ Glossary
       as :keyword:`if`, :keyword:`while` or :keyword:`for`.

    text encoding
-      A codec which encodes Unicode strings to bytes.
+      A string in Python is a sequence of Unicode code points (in range
+      ``U+0000``--``U+10FFFF``). To store or transfer a string, it needs to be
+      serialized as a sequence of bytes.
+
+      Serializing a string into a sequence of bytes is known as "encoding", and
+      recreating the string from the sequence of bytes is known as "decoding".
+
+      There are a variety of different text serialization
+      :ref:`codecs <standard-encodings>`, which are collectively referred to as
+      "text encodings".

    text file
       A :term:`file object` able to read and write :class:`str` objects.
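
As a small illustration of the revised glossary entry (not part of the commit), the round trip below serializes a string to bytes and recreates it. UTF-8 is used here only as an example codec.

```python
# Encoding serializes a str (a sequence of Unicode code points) into bytes;
# decoding recreates the str from those bytes.
text = "German ß, ♬"

data = text.encode("utf-8")      # str -> bytes
restored = data.decode("utf-8")  # bytes -> str

print(data)
print(restored == text)
```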

Doc/library/codecs.rst

Lines changed: 116 additions & 73 deletions
@@ -23,11 +23,11 @@
 This module defines base classes for standard Python codecs (encoders and
 decoders) and provides access to the internal Python codec registry, which
 manages the codec and error handling lookup process. Most standard codecs
-are :term:`text encodings <text encoding>`, which encode text to bytes,
-but there are also codecs provided that encode text to text, and bytes to
-bytes. Custom codecs may encode and decode between arbitrary types, but some
-module features are restricted to use specifically with
-:term:`text encodings <text encoding>`, or with codecs that encode to
+are :term:`text encodings <text encoding>`, which encode text to bytes (and
+decode bytes to text), but there are also codecs provided that encode text to
+text, and bytes to bytes. Custom codecs may encode and decode between arbitrary
+types, but some module features are restricted to be used specifically with
+:term:`text encodings <text encoding>` or with codecs that encode to
 :class:`bytes`.

 The module defines the following functions for encoding and decoding with
@@ -294,58 +294,56 @@ codec will handle encoding and decoding errors.
294294
Error Handlers
295295
^^^^^^^^^^^^^^
296296

297-
To simplify and standardize error handling,
298-
codecs may implement different error handling schemes by
299-
accepting the *errors* string argument. The following string values are
300-
defined and implemented by all standard Python codecs:
297+
To simplify and standardize error handling, codecs may implement different
298+
error handling schemes by accepting the *errors* string argument:
301299

302-
.. tabularcolumns:: |l|L|
303-
304-
+-------------------------+-----------------------------------------------+
305-
| Value | Meaning |
306-
+=========================+===============================================+
307-
| ``'strict'`` | Raise :exc:`UnicodeError` (or a subclass); |
308-
| | this is the default. Implemented in |
309-
| | :func:`strict_errors`. |
310-
+-------------------------+-----------------------------------------------+
311-
| ``'ignore'`` | Ignore the malformed data and continue |
312-
| | without further notice. Implemented in |
313-
| | :func:`ignore_errors`. |
314-
+-------------------------+-----------------------------------------------+
315-
316-
The following error handlers are only applicable to
317-
:term:`text encodings <text encoding>`:
300+
>>> 'German ß, ♬'.encode(encoding='ascii', errors='backslashreplace')
301+
b'German \\xdf, \\u266c'
302+
>>> 'German ß, ♬'.encode(encoding='ascii', errors='xmlcharrefreplace')
303+
b'German &#223;, &#9836;'
318304

319305
.. index::
306+
pair: strict; error handler's name
307+
pair: ignore; error handler's name
308+
pair: replace; error handler's name
309+
pair: backslashreplace; error handler's name
310+
pair: surrogateescape; error handler's name
320311
single: ? (question mark); replacement character
321312
single: \ (backslash); escape sequence
322313
single: \x; escape sequence
323314
single: \u; escape sequence
324315
single: \U; escape sequence
325-
single: \N; escape sequence
316+
317+
The following error handlers can be used with all Python
318+
:ref:`standard-encodings` codecs:
319+
320+
.. tabularcolumns:: |l|L|
326321

327322
+-------------------------+-----------------------------------------------+
328323
| Value | Meaning |
329324
+=========================+===============================================+
330-
| ``'replace'`` | Replace with a suitable replacement |
331-
| | marker; Python will use the official |
332-
| | ``U+FFFD`` REPLACEMENT CHARACTER for the |
333-
| | built-in codecs on decoding, and '?' on |
334-
| | encoding. Implemented in |
335-
| | :func:`replace_errors`. |
325+
| ``'strict'`` | Raise :exc:`UnicodeError` (or a subclass), |
326+
| | this is the default. Implemented in |
327+
| | :func:`strict_errors`. |
336328
+-------------------------+-----------------------------------------------+
337-
| ``'xmlcharrefreplace'`` | Replace with the appropriate XML character |
338-
| | reference (only for encoding). Implemented |
339-
| | in :func:`xmlcharrefreplace_errors`. |
329+
| ``'ignore'`` | Ignore the malformed data and continue without|
330+
| | further notice. Implemented in |
331+
| | :func:`ignore_errors`. |
332+
+-------------------------+-----------------------------------------------+
333+
| ``'replace'`` | Replace with a replacement marker. On |
334+
| | encoding, use ``?`` (ASCII character). On |
335+
| | decoding, use ```` (U+FFFD, the official |
336+
| | REPLACEMENT CHARACTER). Implemented in |
337+
| | :func:`replace_errors`. |
340338
+-------------------------+-----------------------------------------------+
341339
| ``'backslashreplace'`` | Replace with backslashed escape sequences. |
340+
| | On encoding, use hexadecimal form of Unicode |
341+
| | code point with formats ``\xhh`` ``\uxxxx`` |
342+
| | ``\Uxxxxxxxx``. On decoding, use hexadecimal |
343+
| | form of byte value with format ``\xhh``. |
342344
| | Implemented in |
343345
| | :func:`backslashreplace_errors`. |
344346
+-------------------------+-----------------------------------------------+
345-
| ``'namereplace'`` | Replace with ``\N{...}`` escape sequences |
346-
| | (only for encoding). Implemented in |
347-
| | :func:`namereplace_errors`. |
348-
+-------------------------+-----------------------------------------------+
349347
| ``'surrogateescape'`` | On decoding, replace byte with individual |
350348
| | surrogate code ranging from ``U+DC80`` to |
351349
| | ``U+DCFF``. This code will then be turned |
@@ -355,27 +353,55 @@ The following error handlers are only applicable to
 |                         | more.)                                        |
 +-------------------------+-----------------------------------------------+

+.. index::
+   pair: xmlcharrefreplace; error handler's name
+   pair: namereplace; error handler's name
+   single: \N; escape sequence
+
+The following error handlers are only applicable to encoding (within
+:term:`text encodings <text encoding>`):
+
++-------------------------+-----------------------------------------------+
+| Value                   | Meaning                                       |
++=========================+===============================================+
+| ``'xmlcharrefreplace'`` | Replace with XML/HTML numeric character       |
+|                         | reference, which is a decimal form of Unicode |
+|                         | code point with format ``&#num;`` Implemented |
+|                         | in :func:`xmlcharrefreplace_errors`.          |
++-------------------------+-----------------------------------------------+
+| ``'namereplace'``       | Replace with ``\N{...}`` escape sequences,    |
+|                         | what appears in the braces is the Name        |
+|                         | property from Unicode Character Database.     |
+|                         | Implemented in :func:`namereplace_errors`.    |
++-------------------------+-----------------------------------------------+
+
+.. index::
+   pair: surrogatepass; error handler's name
+
 In addition, the following error handler is specific to the given codecs:

 +-------------------+------------------------+-------------------------------------------+
 | Value             | Codecs                 | Meaning                                   |
 +===================+========================+===========================================+
-|``'surrogatepass'``| utf-8, utf-16, utf-32, | Allow encoding and decoding of surrogate  |
-|                   | utf-16-be, utf-16-le,  | codes. These codecs normally treat the    |
-|                   | utf-32-be, utf-32-le   | presence of surrogates as an error.       |
+|``'surrogatepass'``| utf-8, utf-16, utf-32, | Allow encoding and decoding surrogate code|
+|                   | utf-16-be, utf-16-le,  | point (``U+D800`` - ``U+DFFF``) as normal |
+|                   | utf-32-be, utf-32-le   | code point. Otherwise these codecs treat  |
+|                   |                        | the presence of surrogate code point in   |
+|                   |                        | :class:`str` as an error.                 |
 +-------------------+------------------------+-------------------------------------------+

 .. versionadded:: 3.1
    The ``'surrogateescape'`` and ``'surrogatepass'`` error handlers.

 .. versionchanged:: 3.4
-   The ``'surrogatepass'`` error handlers now works with utf-16\* and utf-32\* codecs.
+   The ``'surrogatepass'`` error handler now works with utf-16\* and utf-32\*
+   codecs.

 .. versionadded:: 3.5
    The ``'namereplace'`` error handler.

 .. versionchanged:: 3.5
-   The ``'backslashreplace'`` error handlers now works with decoding and
+   The ``'backslashreplace'`` error handler now works with decoding and
    translating.

 The set of allowed values can be extended by registering a new named error
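
The trailing context line above mentions registering new named error handlers. A minimal sketch of that mechanism follows; it is not part of the commit, and the handler name `underscore_example` and the function `underscore_errors` are hypothetical names chosen for illustration. It also calls two built-in handler functions directly to show the `(replacement, resume position)` tuple they return.

```python
import codecs

# Built-in handlers are plain callables: they receive the exception and
# return a (replacement, position to resume at) tuple.
err = UnicodeEncodeError("ascii", "ß", 0, 1, "ordinal not in range(128)")
replace_result = codecs.replace_errors(err)
backslash_result = codecs.backslashreplace_errors(err)
print(replace_result, backslash_result)


def underscore_errors(exc):
    """Hypothetical handler: substitute '_' for each unencodable character."""
    if isinstance(exc, UnicodeEncodeError):
        return "_" * (exc.end - exc.start), exc.end
    raise exc  # decoding and translating are not handled in this sketch


# "underscore_example" is an illustrative name, not a standard handler.
codecs.register_error("underscore_example", underscore_errors)

result = "German ß, ♬".encode("ascii", errors="underscore_example")
print(result)
```

Once registered, the new name can be passed as *errors* anywhere a built-in handler name is accepted.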
@@ -418,42 +444,59 @@ functions:
418444

419445
.. function:: strict_errors(exception)
420446

421-
Implements the ``'strict'`` error handling: each encoding or
422-
decoding error raises a :exc:`UnicodeError`.
447+
Implements the ``'strict'`` error handling.
423448

449+
Each encoding or decoding error raises a :exc:`UnicodeError`.
424450

425-
.. function:: replace_errors(exception)
426451

427-
Implements the ``'replace'`` error handling (for :term:`text encodings
428-
<text encoding>` only): substitutes ``'?'`` for encoding errors
429-
(to be encoded by the codec), and ``'\ufffd'`` (the Unicode replacement
430-
character) for decoding errors.
452+
.. function:: ignore_errors(exception)
431453

454+
Implements the ``'ignore'`` error handling.
432455

433-
.. function:: ignore_errors(exception)
456+
Malformed data is ignored; encoding or decoding is continued without
457+
further notice.
434458

435-
Implements the ``'ignore'`` error handling: malformed data is ignored and
436-
encoding or decoding is continued without further notice.
437459

460+
.. function:: replace_errors(exception)
438461

439-
.. function:: xmlcharrefreplace_errors(exception)
462+
Implements the ``'replace'`` error handling.
440463

441-
Implements the ``'xmlcharrefreplace'`` error handling (for encoding with
442-
:term:`text encodings <text encoding>` only): the
443-
unencodable character is replaced by an appropriate XML character reference.
464+
Substitutes ``?`` (ASCII character) for encoding errors or ```` (U+FFFD,
465+
the official REPLACEMENT CHARACTER) for decoding errors.
444466

445467

446468
.. function:: backslashreplace_errors(exception)
447469

448-
Implements the ``'backslashreplace'`` error handling (for
449-
:term:`text encodings <text encoding>` only): malformed data is
450-
replaced by a backslashed escape sequence.
470+
Implements the ``'backslashreplace'`` error handling.
471+
472+
Malformed data is replaced by a backslashed escape sequence.
473+
On encoding, use the hexadecimal form of Unicode code point with formats
474+
``\xhh`` ``\uxxxx`` ``\Uxxxxxxxx``. On decoding, use the hexadecimal form of
475+
byte value with format ``\xhh``.
476+
477+
.. versionchanged:: 3.5
478+
Works with decoding and translating.
479+
480+
481+
.. function:: xmlcharrefreplace_errors(exception)
482+
483+
Implements the ``'xmlcharrefreplace'`` error handling (for encoding within
484+
:term:`text encoding` only).
485+
486+
The unencodable character is replaced by an appropriate XML/HTML numeric
487+
character reference, which is a decimal form of Unicode code point with
488+
format ``&#num;`` .
489+
451490

452491
.. function:: namereplace_errors(exception)
453492

454-
Implements the ``'namereplace'`` error handling (for encoding with
455-
:term:`text encodings <text encoding>` only): the
456-
unencodable character is replaced by a ``\N{...}`` escape sequence.
493+
Implements the ``'namereplace'`` error handling (for encoding within
494+
:term:`text encoding` only).
495+
496+
The unencodable character is replaced by a ``\N{...}`` escape sequence. The
497+
set of characters that appear in the braces is the Name property from
498+
Unicode Character Database. For example, the German lowercase letter ``'ß'``
499+
will be converted to byte sequence ``\N{LATIN SMALL LETTER SHARP S}`` .
457500

458501
.. versionadded:: 3.5
459502

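
The handler descriptions in the hunks above can be checked with a short script. The sketch below is not part of the commit; the sample byte string `b"caf\xe9 ok"` is an assumption chosen because `0xE9` is not valid UTF-8 in that position.

```python
text = "German ß, ♬"

# Handlers shown on encoding: ASCII cannot represent 'ß' or '♬'.
encoded = {
    name: text.encode("ascii", errors=name)
    for name in ("ignore", "replace", "backslashreplace",
                 "xmlcharrefreplace", "namereplace")
}
for name, value in encoded.items():
    print(f"{name:18} {value!r}")

# Handlers shown on decoding: the byte 0xE9 is malformed UTF-8 here.
raw = b"caf\xe9 ok"
decoded = {
    name: raw.decode("utf-8", errors=name)
    for name in ("ignore", "replace", "backslashreplace", "surrogateescape")
}
for name, value in decoded.items():
    print(f"{name:18} {value!r}")

# 'surrogateescape' keeps the undecodable byte as a surrogate code point,
# so encoding with the same handler restores the original bytes.
round_trip = decoded["surrogateescape"].encode("utf-8", errors="surrogateescape")

# 'surrogatepass' lets the UTF codecs encode a lone surrogate code point.
lone = "\ud800".encode("utf-8", errors="surrogatepass")
```

Note that `'backslashreplace'` appears in both halves, which matches the commit's point that it is not an encoding-only handler.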
@@ -467,7 +510,7 @@ The base :class:`Codec` class defines these methods which also define the
 function interfaces of the stateless encoder and decoder:


-.. method:: Codec.encode(input[, errors])
+.. method:: Codec.encode(input, errors='strict')

    Encodes the object *input* and returns a tuple (output object, length consumed).
    For instance, :term:`text encoding` converts
@@ -485,7 +528,7 @@ function interfaces of the stateless encoder and decoder:
    of the output object type in this situation.


-.. method:: Codec.decode(input[, errors])
+.. method:: Codec.decode(input, errors='strict')

    Decodes the object *input* and returns a tuple (output object, length
    consumed). For instance, for a :term:`text encoding`, decoding converts
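
To make the `(output object, length consumed)` return value concrete, here is a hedged sketch (not part of the commit) that uses `codecs.getencoder` and `codecs.getdecoder` to obtain the stateless functions for UTF-8. The choice of UTF-8 and of the sample character is only for illustration.

```python
import codecs

# Stateless encoder/decoder functions follow the Codec.encode/Codec.decode
# interface: they return (output object, length consumed).
encode = codecs.getencoder("utf-8")
decode = codecs.getdecoder("utf-8")

encoded, chars_consumed = encode("ß", "strict")
decoded, bytes_consumed = decode(encoded, "strict")

print(encoded, chars_consumed)
print(decoded, bytes_consumed)
```

The consumed length is counted in units of the input: one character on encoding, two bytes on decoding.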
@@ -552,7 +595,7 @@ define in order to be compatible with the Python codec registry.
    object.


-   .. method:: encode(object[, final])
+   .. method:: encode(object, final=False)

       Encodes *object* (taking the current state of the encoder into account)
       and returns the resulting encoded object. If this is the last call to
@@ -609,7 +652,7 @@ define in order to be compatible with the Python codec registry.
    object.


-   .. method:: decode(object[, final])
+   .. method:: decode(object, final=False)

       Decodes *object* (taking the current state of the decoder into account)
       and returns the resulting decoded object. If this is the last call to
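
The *final* argument shown in the new signatures matters when input arrives in pieces. The following sketch (not part of the commit) splits the two-byte UTF-8 sequence for `'ß'` across two calls to an incremental decoder, then runs an incremental encoder for comparison.

```python
import codecs

# An incremental decoder keeps state between calls, so a multi-byte
# sequence may be split across chunks.
decoder = codecs.getincrementaldecoder("utf-8")()
first = decoder.decode(b"\xc3")               # incomplete sequence, buffered
second = decoder.decode(b"\x9f", final=True)  # sequence completed

# The incremental encoder takes the same final flag on its last call.
encoder = codecs.getincrementalencoder("utf-8")()
encoded = encoder.encode("ß", final=True)

print(repr(first), repr(second), encoded)
```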
@@ -743,7 +786,7 @@ compatible with the Python codec registry.
    :func:`register_error`.


-   .. method:: read([size[, chars, [firstline]]])
+   .. method:: read(size=-1, chars=-1, firstline=False)

       Decodes data from the stream and returns the resulting object.

@@ -769,7 +812,7 @@ compatible with the Python codec registry.
       available on the stream, these should be read too.


-   .. method:: readline([size[, keepends]])
+   .. method:: readline(size=None, keepends=True)

       Read one line from the input stream and return the decoded data.

@@ -780,7 +823,7 @@ compatible with the Python codec registry.
       returned.


-   .. method:: readlines([sizehint[, keepends]])
+   .. method:: readlines(sizehint=None, keepends=True)

       Read all lines available on the input stream and return them as a list of
       lines.
@@ -871,7 +914,7 @@ Encodings and Unicode
 ---------------------

 Strings are stored internally as sequences of code points in
-range ``0x0``--``0x10FFFF``. (See :pep:`393` for
+range ``U+0000``--``U+10FFFF``. (See :pep:`393` for
 more details about the implementation.)
 Once a string object is used outside of CPU and memory, endianness
 and how these arrays are stored as bytes become an issue. As with other
@@ -952,7 +995,7 @@ encoding was used for encoding a string. Each charmap encoding can
 decode any random byte sequence. However that's not possible with UTF-8, as
 UTF-8 byte sequences have a structure that doesn't allow arbitrary byte
 sequences. To increase the reliability with which a UTF-8 encoding can be
-detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls
+detected, Microsoft invented a variant of UTF-8 (that Python calls
 ``"utf-8-sig"``) for its Notepad program: Before any of the Unicode characters
 is written to the file, a UTF-8 encoded BOM (which looks like this as a byte
 sequence: ``0xef``, ``0xbb``, ``0xbf``) is written. As it's rather improbable
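
The BOM behaviour described in this hunk's context can be observed directly. The sketch below is not part of the commit; it compares `"utf-8-sig"` with plain `"utf-8"` on a short sample string.

```python
import codecs

# "utf-8-sig" writes the UTF-8 encoded BOM (0xef, 0xbb, 0xbf) before the text.
data = "abc".encode("utf-8-sig")
print(data)

# Decoding with "utf-8-sig" strips the BOM; plain "utf-8" keeps it as U+FEFF.
stripped = data.decode("utf-8-sig")
kept = data.decode("utf-8")

# Input without a BOM is still accepted by "utf-8-sig".
no_bom = b"abc".decode("utf-8-sig")
```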
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+Overhaul the :ref:`error-handlers` documentation in :mod:`codecs`.