Skip to content

intern_static is not thread-safe with multiple interpreters #122291

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
colesbury opened this issue Jul 25, 2024 · 4 comments
Closed

intern_static is not thread-safe with multiple interpreters #122291

colesbury opened this issue Jul 25, 2024 · 4 comments
Assignees
Labels
type-bug An unexpected behavior, bug, or error

Comments

@colesbury
Copy link
Contributor

colesbury commented Jul 25, 2024

Bug report

Most static strings are interned during Python initialization in _PyUnicode_InitStaticStrings. However, the _Py_LATIN1_CHR characters (code points 0-255) are static, but not interned. They may be interned later while the Python is running. This can happen for various reasons, including calls to sys.intern.

This isn't thread-safe: it modifies the hashtable _PyRuntime.cached_objects.interned_strings, which is shared across threads and interpreters, without any synchronization.

It also can break the interning identity invariant. You can have a non-static, interned 1-characters string later shadowed by the global interning of the static 1-character string.

Suggestions

  • The _PyRuntime.cached_objects.interned_strings should be immutable. We should not modify it after Py_Initialize() until shutdown (i.e., _PyUnicode_ClearInterned called from finalize_interp_types())
  • The 1-character latin1 strings should be interned. This can either be by explicitly interning them during startup, or by handling 1-character strings specially in intern_common.

cc @encukou @ericsnowcurrently

Linked PRs

@colesbury colesbury added the type-bug An unexpected behavior, bug, or error label Jul 25, 2024
@encukou
Copy link
Member

encukou commented Jul 26, 2024

Interning them makes sense. I'll send a PR.

@encukou
Copy link
Member

encukou commented Jul 27, 2024

FWIW:

This isn't thread-safe: it modifies the hashtable _PyRuntime.cached_objects.interned_strings, which is shared across threads and interpreters, without any synchronization.

Yup, that's a bug, so the fix should be backported.

It also can break the interning identity invariant. You can have a non-static, interned 1-characters string later shadowed by the global interning of the static 1-character string.

Hmm, interning a short string should give you the static singleton.
But the fix will definitely make this easier to follow, and harder to hide any bugs.

@encukou
Copy link
Member

encukou commented Jul 27, 2024

interning a short string should give you the static singleton.

Before the PR for this issue we had:

cpython/Objects/unicodeobject.c

Lines 15480 to 15493 in c086962

/* if it's a short string, get the singleton -- and intern it */
if (PyUnicode_GET_LENGTH(s) == 1 &&
PyUnicode_KIND(s) == PyUnicode_1BYTE_KIND) {
PyObject *r = LATIN1(*(unsigned char*)PyUnicode_DATA(s));
if (!PyUnicode_CHECK_INTERNED(r)) {
r = intern_static(interp, r);
}
Py_DECREF(s);
return r;
}
#ifdef Py_DEBUG
assert(!unicode_is_singleton(s));
#endif

encukou added a commit that referenced this issue Jul 27, 2024
@colesbury
Copy link
Contributor Author

Looks like this has been fixed for a while.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

2 participants