Skip to content

bpo-40120: Fix unbounded struct char[] undefined behavior. #19232

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 5 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion Include/cpython/bytesobject.h
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@
typedef struct {
PyObject_VAR_HEAD
Py_hash_t ob_shash;
char ob_sval[1];
char ob_sval[];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to:

  • Leave the structure as it is
  • Use a code search to check which C extension access directly PyBytesObject structure members (well, first check Cython and pybind11)
  • Prepare affected C extensions
  • Once the number of affected C extensions is low enough: make the structure opaque, enforce the usage of public C functions.


/* Invariants:
* ob_sval contains space for 'ob_size+1' elements.
Expand Down
2 changes: 1 addition & 1 deletion Lib/test/test_sys.py
Original file line number Diff line number Diff line change
Expand Up @@ -1197,7 +1197,7 @@ class C(object): pass
# EncodingMap
import codecs, encodings.iso8859_3
x = codecs.charmap_build(encodings.iso8859_3.decoding_table)
check(x, size('32B2iB'))
check(x, size('32B2i'))
# enumerate
check(enumerate([]), size('n3P'))
# reverse
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
Fixed internal structure definitions for structs such as PyBytesObject and
unicode's encoding_map to not rely on C undefined behavior for access to
their trailing unbounded character array in favor of C99 approved flexible
array member syntax.
3 changes: 3 additions & 0 deletions Objects/bytesobject.c
Original file line number Diff line number Diff line change
Expand Up @@ -28,6 +28,9 @@ _Py_IDENTIFIER(__bytes__);

Using PyBytesObject_SIZE instead of sizeof(PyBytesObject) saves
3 bytes per string allocation on a typical system.

The + 1 accounts for the trailing \0 byte that we include as a safety
measure for code that treats the underlying char * as a C string.
*/
#define PyBytesObject_SIZE (offsetof(PyBytesObject, ob_sval) + 1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, you made PyBytesObject 1 byte smaller, and you didn't have to substract 1 somewhere? Does it mean that Python overallocates 1 byte since forever?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thankfully not. Because this code used offsetof the macro did not need to change. The offset never changes. The existing + 1 in the macro is easy to misinterpret as having been there to account for the [1] but that isn't how it was used - PyBytesObject is always allocated with that one extra byte that is filled with a NUL for C safety for PyBytes_AsString users.


Expand Down
6 changes: 3 additions & 3 deletions Objects/unicodeobject.c
Original file line number Diff line number Diff line change
Expand Up @@ -8208,14 +8208,14 @@ struct encoding_map {
PyObject_HEAD
unsigned char level1[32];
int count2, count3;
unsigned char level23[1];
unsigned char level23[];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gpshead: Can you please write a PR which only contains this change? Since it doesn't touch the public C API, C++ compatibility is not an issue (I don't think that Python can be built with a C++ compiler). We can start with this change and see how it goes? See also the discussion at python/peps#1349

};

static PyObject*
encoding_map_size(PyObject *obj, PyObject* args)
{
struct encoding_map *map = (struct encoding_map*)obj;
return PyLong_FromLong(sizeof(*map) - 1 + 16*map->count2 +
return PyLong_FromLong(sizeof(*map) + 16*map->count2 +
128*map->count3);
}

Expand Down Expand Up @@ -8347,7 +8347,7 @@ PyUnicode_BuildEncodingMap(PyObject* string)

/* Create a three-level trie */
result = PyObject_MALLOC(sizeof(struct encoding_map) +
16*count2 + 128*count3 - 1);
16*count2 + 128*count3);
if (!result)
return PyErr_NoMemory();
PyObject_Init(result, &EncodingMapType);
Expand Down