Commit 96b2b8b

gh-111089: Add PyUnicode_AsUTF8Safe() function
Revert the PyUnicode_AsUTF8() change: the function no longer rejects embedded null characters. The new PyUnicode_AsUTF8Safe() function should be used instead.
1 parent cd6b2ce commit 96b2b8b

43 files changed, +270 -201 lines changed

Doc/c-api/unicode.rst

+29-5
@@ -979,6 +979,15 @@ These are the UTF-8 codec APIs:
    responsible for deallocating the buffer. The buffer is deallocated and
    pointers to it become invalid when the Unicode object is garbage collected.
 
+   If *size* is NULL and the *unicode* string contains null characters, the
+   UTF-8 encoded string contains embedded null bytes and the caller is not
+   aware since the string size is not stored. C functions processing null
+   terminated ``char*`` truncate the string at the first embedded null byte, and
+   so ignore bytes after the null byte. The :c:func:`PyUnicode_AsUTF8Safe`
+   function can be used to raise an exception rather than truncating the string.
+   Or :c:func:`PyUnicode_AsUTF8AndSize(unicode, &size) <PyUnicode_AsUTF8AndSize>`
+   can be used to store the size.
+
    .. versionadded:: 3.3
 
    .. versionchanged:: 3.7
@@ -990,12 +999,13 @@ These are the UTF-8 codec APIs:
 
 .. c:function:: const char* PyUnicode_AsUTF8(PyObject *unicode)
 
-   As :c:func:`PyUnicode_AsUTF8AndSize`, but does not store the size.
+   Similar to :c:func:`PyUnicode_AsUTF8AndSize`, but does not store the size.
 
-   Raise an exception if the *unicode* string contains embedded null
-   characters. To accept embedded null characters and truncate on purpose
-   at the first null byte, ``PyUnicode_AsUTF8AndSize(unicode, NULL)`` can be
-   used instead.
+   If the *unicode* string contains null characters, the UTF-8 encoded string
+   contains embedded null bytes. C functions processing null terminated ``char*``
+   truncate the string at the first embedded null byte, and so ignore bytes
+   after the null byte. The :c:func:`PyUnicode_AsUTF8Safe` function can be used
+   to raise an exception rather than truncating the string.
 
    .. versionadded:: 3.3
 
@@ -1005,6 +1015,20 @@ These are the UTF-8 codec APIs:
    .. versionchanged:: 3.13
       Raise an exception if the string contains embedded null characters.
 
+.. c:function:: const char* PyUnicode_AsUTF8Safe(PyObject *unicode)
+
+   Similar to :c:func:`PyUnicode_AsUTF8`, but raise :exc:`ValueError` if the
+   string contains embedded null characters.
+
+   The Unicode Character Set contains characters which can cause bugs or even
+   security issues depending on how they are processed. See for example `Unicode
+   Technical Report #36: Unicode Security Considerations
+   <https://unicode.org/reports/tr36/>`_. This function implements a single
+   check: only test if the string contains null characters. Additional checks
+   are needed to prevent further issues caused by Unicode characters.
+
+   .. versionadded:: 3.13
+
 
 UTF-32 Codecs
 """""""""""""

Doc/whatsnew/3.13.rst

+5-6
@@ -1127,6 +1127,11 @@ New Features
 * Add :c:func:`PyUnicode_AsUTF8` function to the limited C API.
   (Contributed by Victor Stinner in :gh:`111089`.)
 
+* Add :c:func:`PyUnicode_AsUTF8Safe` function: similar to
+  :c:func:`PyUnicode_AsUTF8`, but raise :exc:`ValueError` if the string
+  contains embedded null characters.
+  (Contributed by Victor Stinner in :gh:`111089`.)
+
 
 Porting to Python 3.13
 ----------------------
@@ -1197,12 +1202,6 @@ Porting to Python 3.13
   Note that ``Py_TRASHCAN_BEGIN`` has a second argument which
   should be the deallocation function it is in.
 
-* The :c:func:`PyUnicode_AsUTF8` function now raises an exception if the string
-  contains embedded null characters. To accept embedded null characters and
-  truncate on purpose at the first null byte,
-  ``PyUnicode_AsUTF8AndSize(unicode, NULL)`` can be used instead.
-  (Contributed by Victor Stinner in :gh:`111089`.)
-
 * On Windows, ``Python.h`` no longer includes the ``<stddef.h>`` standard
   header file. If needed, it should now be included explicitly. For example, it
   provides ``offsetof()`` function, and ``size_t`` and ``ptrdiff_t`` types.

Include/cpython/unicodeobject.h

+4
@@ -440,6 +440,10 @@ PyAPI_FUNC(PyObject*) PyUnicode_FromKindAndData(
     const void *buffer,
     Py_ssize_t size);
 
+// Similar to PyUnicode_AsUTF8(), but raise ValueError if the string contains
+// embedded null characters.
+PyAPI_FUNC(const char *) PyUnicode_AsUTF8Safe(PyObject *unicode);
+
 
 /* === Characters Type APIs =============================================== */
 
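As a rough illustration of the difference the two declarations describe, the following pure-Python sketch models the behaviour. It is not the C implementation: the helper names `utf8_as_c_string_sees_it` and `as_utf8_safe_model` and the error message text are hypothetical, chosen only for this example. It relies on the fact that U+0000 is the only code point whose UTF-8 encoding contains a zero byte.

```python
def utf8_as_c_string_sees_it(text: str) -> bytes:
    """Return the bytes a consumer of a null-terminated char* would read.

    Everything from the first 0x00 byte onwards is silently ignored.
    """
    encoded = text.encode("utf-8")
    return encoded.split(b"\0", 1)[0]


def as_utf8_safe_model(text: str) -> bytes:
    """Hypothetical model of the check described for PyUnicode_AsUTF8Safe().

    Raises ValueError instead of letting the caller truncate by accident.
    The message text is an assumption, not taken from the commit.
    """
    if "\0" in text:
        raise ValueError("embedded null character")
    return text.encode("utf-8")


if __name__ == "__main__":
    print(utf8_as_c_string_sees_it("abc\0def"))  # the "def" part is lost
    print(as_utf8_safe_model("abc"))
    try:
        as_utf8_safe_model("abc\0def")
    except ValueError as exc:
        print("rejected:", exc)
```

The second helper mirrors the documented contract only: it performs the single null-character check and nothing else, so any further Unicode validation remains the caller's job.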

Lib/test/test_capi/test_unicode.py

+23-10
@@ -905,25 +905,38 @@ def test_fromordinal(self):
         self.assertRaises(ValueError, fromordinal, 0x110000)
         self.assertRaises(ValueError, fromordinal, -1)
 
-    @support.cpython_only
-    @unittest.skipIf(_testcapi is None, 'need _testcapi module')
-    def test_asutf8(self):
-        """Test PyUnicode_AsUTF8()"""
-        from _testcapi import unicode_asutf8
-
+    def check_asutf8(self, unicode_asutf8):
         self.assertEqual(unicode_asutf8('abc', 4), b'abc\0')
         self.assertEqual(unicode_asutf8('абв', 7), b'\xd0\xb0\xd0\xb1\xd0\xb2\0')
         self.assertEqual(unicode_asutf8('\U0001f600', 5), b'\xf0\x9f\x98\x80\0')
 
-        # disallow embedded null characters
-        self.assertRaises(ValueError, unicode_asutf8, 'abc\0', 0)
-        self.assertRaises(ValueError, unicode_asutf8, 'abc\0def', 0)
-
         self.assertRaises(UnicodeEncodeError, unicode_asutf8, '\ud8ff', 0)
         self.assertRaises(TypeError, unicode_asutf8, b'abc', 0)
         self.assertRaises(TypeError, unicode_asutf8, [], 0)
         # CRASHES unicode_asutf8(NULL, 0)
 
+    @support.cpython_only
+    @unittest.skipIf(_testcapi is None, 'need _testcapi module')
+    def test_asutf8(self):
+        """Test PyUnicode_AsUTF8()"""
+        from _testcapi import unicode_asutf8
+        self.check_asutf8(unicode_asutf8)
+
+        # allow embedded null characters
+        self.assertEqual(unicode_asutf8('abc\0', 5), b'abc\0\0')
+        self.assertEqual(unicode_asutf8('abc\0def', 8), b'abc\0def\0')
+
+    @support.cpython_only
+    @unittest.skipIf(_testcapi is None, 'need _testcapi module')
+    def test_asutf8safe(self):
+        """Test PyUnicode_AsUTF8Safe()"""
+        from _testcapi import unicode_asutf8safe
+        self.check_asutf8(unicode_asutf8safe)
+
+        # disallow embedded null characters
+        self.assertRaises(ValueError, unicode_asutf8safe, 'abc\0', 0)
+        self.assertRaises(ValueError, unicode_asutf8safe, 'abc\0def', 0)
+
     @support.cpython_only
     @unittest.skipIf(_testcapi is None, 'need _testcapi module')
     def test_asutf8andsize(self):
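The "allow embedded null characters" expectations in the tests above can be reproduced without `_testcapi`, using only the standard `ctypes` module. This is a sketch for illustration, not part of the commit. The helper name `c_string_views` is hypothetical. It shows why a buffer that holds `b'abc\0def\0'` still looks like `b'abc'` to code that treats it as a null-terminated string.

```python
import ctypes


def c_string_views(text: str):
    """Return (all bytes in the buffer, bytes seen by a null-terminated reader)."""
    data = text.encode("utf-8")
    # create_string_buffer() copies the bytes and appends one trailing NUL,
    # like the UTF-8 buffer returned by the C API.
    buf = ctypes.create_string_buffer(data)
    full = buf.raw      # every byte, including embedded and trailing NULs
    c_view = buf.value  # stops at the first NUL, as strlen()-based code would
    return full, c_view


if __name__ == "__main__":
    full, c_view = c_string_views("abc\0def")
    print(full)    # 8 bytes: the encoded text plus the terminator
    print(c_view)  # only the part before the embedded NUL
```

Reading `full` corresponds to a caller that knows the size, as with `PyUnicode_AsUTF8AndSize(unicode, &size)`. Reading `c_view` corresponds to a caller that only has the `char*`.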
@@ -0,0 +1,3 @@
+Add :c:func:`PyUnicode_AsUTF8Safe` function: similar to
+:c:func:`PyUnicode_AsUTF8`, but raise :exc:`ValueError` if the string
+contains embedded null characters. Patch by Victor Stinner.

Modules/_io/clinic/_iomodule.c.h

+5-5
Generated file; diff not rendered by default.

Modules/_io/clinic/fileio.c.h

+2-2
Generated file; diff not rendered by default.

Modules/_io/clinic/textio.c.h

+4-4
Generated file; diff not rendered by default.

Modules/_io/clinic/winconsoleio.c.h

+2-2
Generated file; diff not rendered by default.

Modules/_multiprocessing/clinic/multiprocessing.c.h

+2-2
Generated file; diff not rendered by default.

Modules/_multiprocessing/clinic/semaphore.c.h

+2-2
Generated file; diff not rendered by default.

Modules/_sqlite/clinic/connection.c.h

+13-13
Generated file; diff not rendered by default.
