gh-111089: Add PyUnicode_AsUTF8NoNUL() function #111688

Closed
wants to merge 3 commits into from
34 changes: 29 additions & 5 deletions Doc/c-api/unicode.rst
@@ -979,6 +979,15 @@ These are the UTF-8 codec APIs:
responsible for deallocating the buffer. The buffer is deallocated and
pointers to it become invalid when the Unicode object is garbage collected.

If *size* is NULL and the *unicode* string contains null characters, the
UTF-8 encoded string contains embedded null bytes, and the caller cannot
detect them because the string size is not stored. C functions processing
null-terminated ``char*`` strings truncate the string at the first embedded
null byte, and so ignore the bytes after it. The
:c:func:`PyUnicode_AsUTF8NoNUL` function can be used to raise an exception
rather than truncating the string. Alternatively,
:c:func:`PyUnicode_AsUTF8AndSize(unicode, &size) <PyUnicode_AsUTF8AndSize>`
can be used to store the size.

.. versionadded:: 3.3

.. versionchanged:: 3.7
@@ -990,12 +999,13 @@ These are the UTF-8 codec APIs:

.. c:function:: const char* PyUnicode_AsUTF8(PyObject *unicode)

As :c:func:`PyUnicode_AsUTF8AndSize`, but does not store the size.
Similar to :c:func:`PyUnicode_AsUTF8AndSize`, but does not store the size.

Raise an exception if the *unicode* string contains embedded null
characters. To accept embedded null characters and truncate on purpose
at the first null byte, ``PyUnicode_AsUTF8AndSize(unicode, NULL)`` can be
used instead.
If the *unicode* string contains null characters, the UTF-8 encoded string
contains embedded null bytes. C functions processing null-terminated ``char*``
strings truncate the string at the first embedded null byte, and so ignore
the bytes after it. The :c:func:`PyUnicode_AsUTF8NoNUL` function can be used
to raise an exception rather than truncating the string.

.. versionadded:: 3.3

@@ -1005,6 +1015,20 @@ These are the UTF-8 codec APIs:
.. versionchanged:: 3.13
Raise an exception if the string contains embedded null characters.

.. c:function:: const char* PyUnicode_AsUTF8NoNUL(PyObject *unicode)

Similar to :c:func:`PyUnicode_AsUTF8`, but raise :exc:`ValueError` if the
string contains embedded null characters.

The Unicode Character Set contains characters which can cause bugs or even
security issues depending on how they are processed. See for example `Unicode
Technical Report #36: Unicode Security Considerations
<https://unicode.org/reports/tr36/>`_. This function implements a single
check: it only tests whether the string contains null characters. Additional
checks are needed to prevent further issues caused by Unicode characters.

.. versionadded:: 3.13


UTF-32 Codecs
"""""""""""""
1 change: 1 addition & 0 deletions Doc/data/stable_abi.dat

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

11 changes: 5 additions & 6 deletions Doc/whatsnew/3.13.rst
@@ -1137,6 +1137,11 @@ New Features
* Add :c:func:`PyUnicode_AsUTF8` function to the limited C API.
(Contributed by Victor Stinner in :gh:`111089`.)

* Add :c:func:`PyUnicode_AsUTF8NoNUL` function: similar to
:c:func:`PyUnicode_AsUTF8`, but raise :exc:`ValueError` if the string
contains embedded null characters.
(Contributed by Victor Stinner in :gh:`111089`.)


Porting to Python 3.13
----------------------
@@ -1207,12 +1212,6 @@ Porting to Python 3.13
Note that ``Py_TRASHCAN_BEGIN`` has a second argument which
should be the deallocation function it is in.

* The :c:func:`PyUnicode_AsUTF8` function now raises an exception if the string
contains embedded null characters. To accept embedded null characters and
truncate on purpose at the first null byte,
``PyUnicode_AsUTF8AndSize(unicode, NULL)`` can be used instead.
(Contributed by Victor Stinner in :gh:`111089`.)

* On Windows, ``Python.h`` no longer includes the ``<stddef.h>`` standard
header file. If needed, it should now be included explicitly. For example, it
provides ``offsetof()`` function, and ``size_t`` and ``ptrdiff_t`` types.
4 changes: 4 additions & 0 deletions Include/unicodeobject.h
@@ -453,6 +453,10 @@ PyAPI_FUNC(PyObject*) PyUnicode_AsUTF8String(
// when the Unicode object is deallocated.
PyAPI_FUNC(const char *) PyUnicode_AsUTF8(PyObject *unicode);

// Similar to PyUnicode_AsUTF8(), but raise ValueError if the string contains
// embedded null characters.
PyAPI_FUNC(const char *) PyUnicode_AsUTF8NoNUL(PyObject *unicode);

// Returns a pointer to the UTF-8 encoding of the
// Unicode object unicode and the size of the encoded representation
// in bytes stored in `*size` (if size is not NULL).
33 changes: 23 additions & 10 deletions Lib/test/test_capi/test_unicode.py
@@ -905,25 +905,38 @@ def test_fromordinal(self):
self.assertRaises(ValueError, fromordinal, 0x110000)
self.assertRaises(ValueError, fromordinal, -1)

@support.cpython_only
@unittest.skipIf(_testcapi is None, 'need _testcapi module')
def test_asutf8(self):
"""Test PyUnicode_AsUTF8()"""
from _testcapi import unicode_asutf8

def check_asutf8(self, unicode_asutf8):
self.assertEqual(unicode_asutf8('abc', 4), b'abc\0')
self.assertEqual(unicode_asutf8('абв', 7), b'\xd0\xb0\xd0\xb1\xd0\xb2\0')
self.assertEqual(unicode_asutf8('\U0001f600', 5), b'\xf0\x9f\x98\x80\0')

# disallow embedded null characters
self.assertRaises(ValueError, unicode_asutf8, 'abc\0', 0)
self.assertRaises(ValueError, unicode_asutf8, 'abc\0def', 0)

self.assertRaises(UnicodeEncodeError, unicode_asutf8, '\ud8ff', 0)
self.assertRaises(TypeError, unicode_asutf8, b'abc', 0)
self.assertRaises(TypeError, unicode_asutf8, [], 0)
# CRASHES unicode_asutf8(NULL, 0)

@support.cpython_only
@unittest.skipIf(_testcapi is None, 'need _testcapi module')
def test_asutf8(self):
"""Test PyUnicode_AsUTF8()"""
from _testcapi import unicode_asutf8
self.check_asutf8(unicode_asutf8)

# allow embedded null characters
self.assertEqual(unicode_asutf8('abc\0', 5), b'abc\0\0')
self.assertEqual(unicode_asutf8('abc\0def', 8), b'abc\0def\0')

@support.cpython_only
@unittest.skipIf(_testcapi is None, 'need _testcapi module')
def test_asutf8nonul(self):
"""Test PyUnicode_AsUTF8NoNUL()"""
from _testcapi import unicode_asutf8nonul
self.check_asutf8(unicode_asutf8nonul)

# disallow embedded null characters
self.assertRaises(ValueError, unicode_asutf8nonul, 'abc\0', 0)
self.assertRaises(ValueError, unicode_asutf8nonul, 'abc\0def', 0)

@support.cpython_only
@unittest.skipIf(_testcapi is None, 'need _testcapi module')
def test_asutf8andsize(self):
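
The behaviour that ``check_asutf8`` and ``test_asutf8nonul`` expect can be summarised with a small pure-Python model. This is only a sketch inferred from the assertions in this hunk, not the C implementation. The function name ``as_utf8_no_nul`` and the exception messages are assumptions made for illustration.

```python
def as_utf8_no_nul(text):
    """Hypothetical pure-Python model of the checks exercised by the tests."""
    if not isinstance(text, str):
        # The tests expect TypeError for bytes and list arguments.
        raise TypeError("expected str, got %s" % type(text).__name__)
    # str.encode() raises UnicodeEncodeError for lone surrogates ('\ud8ff').
    data = text.encode("utf-8")
    if b"\0" in data:
        # Message text is an assumption; the tests only check the type.
        raise ValueError("embedded null character")
    # The C buffer is null-terminated; model that with one trailing null byte.
    return data + b"\0"


def raises(exc_type, value):
    """Return True if as_utf8_no_nul(value) raises exc_type."""
    try:
        as_utf8_no_nul(value)
    except exc_type:
        return True
    return False
```

Because U+0000 is the only code point whose UTF-8 encoding contains a zero byte, checking the encoded bytes is equivalent to checking the string for null characters.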
1 change: 1 addition & 0 deletions Lib/test/test_stable_abi_ctypes.py

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

@@ -0,0 +1,3 @@
Add :c:func:`PyUnicode_AsUTF8NoNUL` function: similar to
:c:func:`PyUnicode_AsUTF8`, but raise :exc:`ValueError` if the string
contains embedded null characters. Patch by Victor Stinner.
2 changes: 2 additions & 0 deletions Misc/stable_abi.toml
@@ -2480,6 +2480,8 @@
added = '3.13'
[function.PyUnicode_AsUTF8]
added = '3.13'
[function.PyUnicode_AsUTF8NoNUL]
added = '3.13'
[function._Py_SetRefcnt]
added = '3.13'
abi_only = true
10 changes: 5 additions & 5 deletions Modules/_io/clinic/_iomodule.c.h

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

4 changes: 2 additions & 2 deletions Modules/_io/clinic/fileio.c.h

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

8 changes: 4 additions & 4 deletions Modules/_io/clinic/textio.c.h

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

4 changes: 2 additions & 2 deletions Modules/_io/clinic/winconsoleio.c.h

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

4 changes: 2 additions & 2 deletions Modules/_multiprocessing/clinic/multiprocessing.c.h

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

4 changes: 2 additions & 2 deletions Modules/_multiprocessing/clinic/semaphore.c.h

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
