Skip to content

test_weakref.test_threaded_weak_key_dict_deepcopy crash: merged objects should have ob_tid == 0 #135106

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
colesbury opened this issue Jun 3, 2025 · 5 comments
Labels
3.14 bugs and security fixes 3.15 new features, bugs and security fixes release-blocker tests Tests in the Lib/test dir topic-free-threading type-bug An unexpected behavior, bug, or error

Comments

@colesbury
Copy link
Contributor

colesbury commented Jun 3, 2025

Bug report

0:08:41 load avg: 0.10 [402/491/1] test_weakref worker non-zero exit code (Exit code 2147483651) -- running (7): test_largefile (2 min 38 sec), test_zipimport (2 min 25 sec), test_compileall (2 min 14 sec), test_dbm (1 min 54 sec), test_free_threading (2 min 32 sec), test_regrtest (1 min 36 sec), test_pathlib (2 min 44 sec)
...
test.test_weakref.MappingTestCase.test_threaded_weak_key_dict_deepcopy) ... C:\Users\Administrator\buildarea\pull_request.itamaro-win64-srv-22-aws.x64.nogil\build\Python\gc_free_threading.c:1081: validate_refcounts: Assertion "op->ob_tid == 0" failed: merged objects should have ob_tid == 0
Enable tracemalloc to get the memory block allocation traceback

object address  : 0000020002444810
object refcount : 0
object type     : 00007FFC8D8A1620
object type name: weakref.ReferenceType
object repr     : <refcnt 0 at 0000020002444810>

Fatal Python error: _PyObject_AssertFailed: _PyObject_AssertFailed
Python runtime state: initialized

Stack (most recent call first):
  File "C:\Users\Administrator\buildarea\pull_request.itamaro-win64-srv-22-aws.x64.nogil\build\Lib\test\test_weakref.py", line 1999 in pop_and_collect
  File "C:\Users\Administrator\buildarea\pull_request.itamaro-win64-srv-22-aws.x64.nogil\build\Lib\threading.py", line 1016 in run
  File "C:\Users\Administrator\buildarea\pull_request.itamaro-win64-srv-22-aws.x64.nogil\build\Lib\threading.py", line 1074 in _bootstrap_inner
  File "C:\Users\Administrator\buildarea\pull_request.itamaro-win64-srv-22-aws.x64.nogil\build\Lib\threading.py", line 1036 in _bootstrap

Looks like it happens during a gc.collect() call:

def pop_and_collect(lst):
gc_ctr = 0
while lst:
i = random.randint(0, len(lst) - 1)
gc_ctr += 1
lst.pop(i)
if gc_ctr % 10000 == 0:
gc.collect() # just in case

https://buildbot.python.org/#/builders/1295/builds/591

The test passed on retry, but this seems like a bug.

Seen in main (3.15) on Windows debug build with free threading

Linked PRs

@colesbury colesbury added type-bug An unexpected behavior, bug, or error tests Tests in the Lib/test dir topic-free-threading labels Jun 3, 2025
@vstinner
Copy link
Member

vstinner commented Jun 5, 2025

Seen in main (3.15) on Windows debug build with free threading

The bug was also seen on the 3.14 branch on Windows with Free Threading: https://buildbot.python.org/#/builders/1717/builds/206

@vstinner vstinner changed the title test_weakref.test_threaded_weak_key_dict_deepcopy crash test_weakref.test_threaded_weak_key_dict_deepcopy crash: merged objects should have ob_tid == 0 Jun 5, 2025
@colesbury colesbury added release-blocker 3.14 bugs and security fixes labels Jun 16, 2025
@colesbury
Copy link
Contributor Author

colesbury commented Jun 16, 2025

The trashcan mechanism is broken in the 3.14 free threaded build. _PyTrash_thread_deposit_object overwrites the ob_tid in the free threading build, but that's not safe if the object is still tracked by the GC.

cpython/Objects/object.c

Lines 3166 to 3176 in 21bac3a

void
_Py_Dealloc(PyObject *op)
{
PyTypeObject *type = Py_TYPE(op);
destructor dealloc = type->tp_dealloc;
PyThreadState *tstate = _PyThreadState_GET();
intptr_t margin = _Py_RecursionLimit_GetMargin(tstate);
if (margin < 2) {
_PyTrash_thread_deposit_object(tstate, (PyObject *)op);
return;
}
(from #132280)

We're only seeing this currently on Windows probably because of the smaller stack size.

cc @nascheme @markshannon

@colesbury colesbury added the 3.15 new features, bugs and security fixes label Jun 16, 2025
colesbury added a commit to colesbury/cpython that referenced this issue Jun 16, 2025
Returns `1` a little less than 1% of the time in
_Py_RecursionLimit_GetMargin() to force the use of the trashcan
mechanism.
colesbury added a commit to colesbury/cpython that referenced this issue Jun 16, 2025
Returns `1` a little less than 1% of the time in
_Py_RecursionLimit_GetMargin() to force the use of the trashcan
mechanism.
@colesbury
Copy link
Contributor Author

I think this is also broken in the default build. If you change _Py_RecursionLimit_GetMargin to occasionally return 1 as in #135589 you get lots of crashes.

@markshannon
Copy link
Member

How does the FT GC use ob_tid?
ob_tid should only be overwritten when the refcount reaches zero, so the GC shouldn't be looking at the object.

@markshannon
Copy link
Member

In the default build because we disguise the pointer as a large, but mortal reference count, when storing it in the refcount field. Do we need to do some similar mangling in the FT GC to make ob_tid have the correct form?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.14 bugs and security fixes 3.15 new features, bugs and security fixes release-blocker tests Tests in the Lib/test dir topic-free-threading type-bug An unexpected behavior, bug, or error
Projects
Development

No branches or pull requests

3 participants