s390x RHEL7 got crash on the refleak test #102351

Closed
corona10 opened this issue Mar 1, 2023 · 21 comments
Labels
type-crash A hard crash of the interpreter, possibly with a core dump

Comments

@corona10
Member

corona10 commented Mar 1, 2023

link: https://buildbot.python.org/all/#/builders/217/builds/1082/steps/5/logs/stdio

test_multiple_watchers (test.test_capi.test_watchers.TestFuncWatchers.test_multiple_watchers) ... ok
./Include/object.h:654: _Py_NegativeRefcount: Assertion failed: object has negative ref count
<object at 0x3ff7e7ffd80 is freed>
Fatal Python error: _PyObject_AssertFailed: _PyObject_AssertFailed
Python runtime state: initialized
Current thread 0x000003ff86cf7700 (most recent call first):
  Garbage-collecting
  File "/home/dje/cpython-buildarea/pull_request.edelsohn-rhel-z.refleak/build/Lib/test/support/__init__.py", line 2085 in _hook
  File "/home/dje/cpython-buildarea/pull_request.edelsohn-rhel-z.refleak/build/Lib/test/libregrtest/setup.py", line 84 in _test_audit_hook
  File "/home/dje/cpython-buildarea/pull_request.edelsohn-rhel-z.refleak/build/Lib/test/test_capi/test_watchers.py", line 479 in test_watcher_raises_error
  File "/home/dje/cpython-buildarea/pull_request.edelsohn-rhel-z.refleak/build/Lib/unittest/case.py", line 579 in _callTestMethod
  File "/home/dje/cpython-buildarea/pull_request.edelsohn-rhel-z.refleak/build/Lib/unittest/case.py", line 623 in run
  File "/home/dje/cpython-buildarea/pull_request.edelsohn-rhel-z.refleak/build/Lib/unittest/case.py", line 678 in __call__
  File "/home/dje/cpython-buildarea/pull_request.edelsohn-rhel-z.refleak/build/Lib/unittest/suite.py", line 122 in run
  File "/home/dje/cpython-buildarea/pull_request.edelsohn-rhel-z.refleak/build/Lib/unittest/suite.py", line 84 in __call__
  File "/home/dje/cpython-buildarea/pull_request.edelsohn-rhel-z.refleak/build/Lib/unittest/suite.py", line 122 in run
  File "/home/dje/cpython-buildarea/pull_request.edelsohn-rhel-z.refleak/build/Lib/unittest/suite.py", line 84 in __call__
  File "/home/dje/cpython-buildarea/pull_request.edelsohn-rhel-z.refleak/build/Lib/unittest/suite.py", line 122 in run
  File "/home/dje/cpython-buildarea/pull_request.edelsohn-rhel-z.refleak/build/Lib/unittest/suite.py", line 84 in __call__
  File "/home/dje/cpython-buildarea/pull_request.edelsohn-rhel-z.refleak/build/Lib/unittest/suite.py", line 122 in run
  File "/home/dje/cpython-buildarea/pull_request.edelsohn-rhel-z.refleak/build/Lib/unittest/suite.py", line 84 in __call__
  File "/home/dje/cpython-buildarea/pull_request.edelsohn-rhel-z.refleak/build/Lib/unittest/suite.py", line 122 in run
  File "/home/dje/cpython-buildarea/pull_request.edelsohn-rhel-z.refleak/build/Lib/unittest/suite.py", line 84 in __call__
  File "/home/dje/cpython-buildarea/pull_request.edelsohn-rhel-z.refleak/build/Lib/unittest/runner.py", line 208 in run
  File "/home/dje/cpython-buildarea/pull_request.edelsohn-rhel-z.refleak/build/Lib/test/support/__init__.py", line 1104 in _run_suite
  File "/home/dje/cpython-buildarea/pull_request.edelsohn-rhel-z.refleak/build/Lib/test/support/__init__.py", line 1230 in run_unittest
  File "/home/dje/cpython-buildarea/pull_request.edelsohn-rhel-z.refleak/build/Lib/test/libregrtest/runtest.py", line 281 in _test_module
  File "/home/dje/cpython-buildarea/pull_request.edelsohn-rhel-z.refleak/build/Lib/test/libregrtest/refleak.py", line 89 in dash_R
  File "/home/dje/cpython-buildarea/pull_request.edelsohn-rhel-z.refleak/build/Lib/test/libregrtest/runtest.py", line 315 in _runtest_inner2
  File "/home/dje/cpython-buildarea/pull_request.edelsohn-rhel-z.refleak/build/Lib/test/libregrtest/runtest.py", line 360 in _runtest_inner
  File "/home/dje/cpython-buildarea/pull_request.edelsohn-rhel-z.refleak/build/Lib/test/libregrtest/runtest.py", line 235 in _runtest
  File "/home/dje/cpython-buildarea/pull_request.edelsohn-rhel-z.refleak/build/Lib/test/libregrtest/runtest.py", line 265 in runtest
  File "/home/dje/cpython-buildarea/pull_request.edelsohn-rhel-z.refleak/build/Lib/test/libregrtest/main.py", line 353 in rerun_failed_tests
  File "/home/dje/cpython-buildarea/pull_request.edelsohn-rhel-z.refleak/build/Lib/test/libregrtest/main.py", line 756 in _main
  File "/home/dje/cpython-buildarea/pull_request.edelsohn-rhel-z.refleak/build/Lib/test/libregrtest/main.py", line 711 in main
  File "/home/dje/cpython-buildarea/pull_request.edelsohn-rhel-z.refleak/build/Lib/test/libregrtest/main.py", line 775 in main
  File "/home/dje/cpython-buildarea/pull_request.edelsohn-rhel-z.refleak/build/Lib/test/__main__.py", line 2 in <module>
  File "/home/dje/cpython-buildarea/pull_request.edelsohn-rhel-z.refleak/build/Lib/runpy.py", line 88 in _run_code
  File "/home/dje/cpython-buildarea/pull_request.edelsohn-rhel-z.refleak/build/Lib/runpy.py", line 198 in _run_module_as_main
Extension modules: _testcapi, _testmultiphase, _testsinglephase, _testinternalcapi (total: 4)
make: *** [buildbottest] Aborted
program finished with exit code 2

@corona10 corona10 added the type-crash A hard crash of the interpreter, possibly with a core dump label Mar 1, 2023
@corona10 corona10 changed the title s390x RHEL7 got crash on the refleak test. s390x RHEL7 got crash on the refleak test Mar 1, 2023
@corona10
Member Author

corona10 commented Mar 1, 2023

I will try to debug the issue with QEMU.

@corona10
Member Author

corona10 commented Mar 1, 2023

Okay, there is no way to set this up through QEMU, since I cannot get an s390x RHEL image.
So I will try to contact the machine's maintainer.

@corona10
Member Author

corona10 commented Mar 1, 2023

@corona10 corona10 assigned corona10 and unassigned corona10 Mar 1, 2023
@iritkatriel
Member

@erlend-aasland

I've reproduced it on a Mac and bisected to:

efc985a714b6f43c43ae629183f95618054422ae is the first bad commit
commit efc985a714b6f43c43ae629183f95618054422ae
Author: Erlend E. Aasland <[email protected]>
Date:   Thu Feb 23 16:03:13 2023 +0100

    gh-93649: Split exception tests from _testcapimodule.c (GH-102173)
    
    
    
    Automerge-Triggered-By: GH:erlend-aasland

 Lib/test/test_capi/test_exceptions.py | 145 +++++++++++++++++
 Lib/test/test_capi/test_misc.py       | 131 +--------------
 Modules/Setup.stdlib.in               |   2 +-
 Modules/_testcapi/exceptions.c        | 277 ++++++++++++++++++++++++++++++++
 Modules/_testcapi/parts.h             |   1 +
 Modules/_testcapimodule.c             | 290 +---------------------------------
 PCbuild/_testcapi.vcxproj             |   1 +
 PCbuild/_testcapi.vcxproj.filters     |   3 +
 8 files changed, 434 insertions(+), 416 deletions(-)
 create mode 100644 Lib/test/test_capi/test_exceptions.py
 create mode 100644 Modules/_testcapi/exceptions.c

@iritkatriel
Member

iritkatriel commented Mar 1, 2023

But I think it may be a red herring (when I touch stuff the crash goes away). Maybe we just need to bump the magic number.

iritkatriel added a commit to iritkatriel/cpython that referenced this issue Mar 1, 2023
@iritkatriel
Member

I suspect this PR needed the magic number bump: #101933

@iritkatriel
Member

> But I think it may be a red herring (when I touch stuff the crash goes away). Maybe we just need to bump the magic number.

Bumping magic number didn't solve this: #102359.
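
For context on what "bumping the magic number" refers to: CPython stamps every cached `.pyc` file with a magic number, and changing it forces stale bytecode to be recompiled. A minimal sketch for inspecting the value of the running interpreter follows; the variable names are illustrative, and the snippet only reads the value, it does not reproduce the change tried in #102359.

```python
import importlib.util

# The 4-byte tag written at the start of every .pyc file produced by this
# interpreter. A .pyc whose tag differs is treated as stale and recompiled.
magic = importlib.util.MAGIC_NUMBER

# The first two bytes hold a little-endian version word; the last two are
# always b"\r\n".
version_word = int.from_bytes(magic[:2], "little")

print(magic, version_word)
```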

@erlend-aasland
Contributor

@iritkatriel, thanks for the heads-up, I'll have a look!

@erlend-aasland
Contributor

@iritkatriel: what's your repro on macOS? I can't reproduce it (yet) on my M1 running Ventura 13.1.

@iritkatriel
Member

I was running python -m test -j2 -R3:3 test_capi

Running just a submodule of test_capi didn't fail. I think it has something to do with module import/setup?

@corona10
Member Author

corona10 commented Mar 2, 2023

When I debug, the refcount is: -2459565876494606883
It could be an overflow or an uninitialized value.

@corona10
Member Author

corona10 commented Mar 2, 2023

> When I debug, the refcount is: -2459565876494606883

Ah, it's because the object has already been freed.
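
The odd value is consistent with freed memory. Assuming the debug build's allocator had overwritten the freed block with its "dead" fill byte 0xDD, the 64-bit signed refcount field would read as eight 0xDD bytes. A quick check of that arithmetic (the names below are illustrative):

```python
# Assumption: the freed object's memory was overwritten with the debug
# allocator's dead-byte pattern (0xDD), so the refcount field holds 8 such bytes.
DEAD_BYTE = 0xDD
raw_refcnt_field = bytes([DEAD_BYTE]) * 8

# Interpret the field as a signed 64-bit integer. Every byte is identical,
# so the byte order does not affect the result.
refcnt = int.from_bytes(raw_refcnt_field, byteorder="big", signed=True)

print(refcnt)  # -2459565876494606883
```

This matches the value seen in the debugger, which points to a use-after-free rather than an overflow.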

@iritkatriel
Member

I think it might be the same issue as #102350.
I narrowed it down to that test, and if I update my branch with that last commit the crash is gone.

@corona10
Member Author

corona10 commented Mar 2, 2023

> I narrowed it down to that test, and if I update my branch with that last commit the crash is gone.

Ah, in my case it's still happening with the latest commit.

@iritkatriel
Member

Clean build?

@corona10
Member Author

corona10 commented Mar 2, 2023

> Clean build?

Yes, with a clean build.
Please see my comment from when I triggered the refleak test on that PR:
#102350 (comment)
This issue was created from that comment.

@iritkatriel
Member

Yes, I just got the crash again with main too. Sorry about the noise.

@iritkatriel
Member

iritkatriel commented Mar 2, 2023

I think the problem comes from test_watcher_raises_error in Lib/test/test_capi/test_watchers.py.

This test registers a dictionary watcher which raises an exception. That takes us to _PyErr_WriteUnraisableMsg, where Py_DECREF(hook_args); fails because the last item in hook_args (the obj, i.e., the dictionary) has refcnt 0.

This only happens when I run it with ./python.exe -m test -v -j2 -R3:3 test_capi. If I remove the -j2 it doesn't fail.

I added assert(Py_REFCNT((PyObject*)mp) > 0); at the beginning of _PyDict_NotifyEvent, and it failed (during make), which seemed wrong at first (apparently this is fine for PyDict_EVENT_DEALLOCATED events).

@carljm carljm self-assigned this Mar 3, 2023
@carljm
Member

carljm commented Mar 3, 2023

Further debugging confirmed that the root cause of this issue is #102381. Fixing all code/func/dict dealloc callbacks to temporarily resurrect the object before calling the callback, and to not save references to the object in unraisablehook, makes this issue go away.
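
To illustrate why temporary resurrection helps, here is a toy model in Python. It is a sketch under stated assumptions, not CPython's implementation: `ToyObject`, `incref`, `decref`, `watcher`, and both dealloc functions are hypothetical stand-ins for the C-level machinery. The callback briefly takes and releases a reference, much as packing the object into hook arguments would.

```python
class ToyObject:
    """Toy stand-in for a C object with a manual reference count."""
    def __init__(self, dealloc):
        self.refcnt = 1
        self.dealloc = dealloc
        self.dealloc_calls = 0

def incref(obj):
    obj.refcnt += 1

def decref(obj):
    obj.refcnt -= 1
    if obj.refcnt == 0:
        obj.dealloc(obj)

def watcher(obj):
    # Callback that temporarily holds a reference to the object.
    incref(obj)
    decref(obj)

def naive_dealloc(obj):
    obj.dealloc_calls += 1
    if obj.dealloc_calls > 1:
        return  # in real C code this re-entry would be a double free
    watcher(obj)  # called while refcnt == 0: incref/decref re-enters dealloc

def resurrecting_dealloc(obj):
    obj.dealloc_calls += 1
    obj.refcnt = 1   # temporarily resurrect before notifying
    watcher(obj)     # refcnt goes 1 -> 2 -> 1, never reaching 0
    obj.refcnt -= 1  # drop the temporary reference without re-entering dealloc

naive = ToyObject(naive_dealloc)
decref(naive)

safe = ToyObject(resurrecting_dealloc)
decref(safe)

print(naive.dealloc_calls, safe.dealloc_calls)  # 2 1
```

In the naive variant the deallocator runs twice for one object; with the temporary resurrection it runs once.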

@carljm
Member

carljm commented Mar 3, 2023

For posterity: the specific repro here requiring -j2 -R seems to be an artifact of those options causing a GC run to tend to occur while an error-raising func watcher is active, and that GC run deallocates a cycle of unrelated objects from the datetime.py pure-Python fallbacks, including some functions.
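
The kind of deallocation described here, where a GC run frees functions that are part of a reference cycle, can be sketched in a few lines. This is only an illustration with hypothetical names (`make_function_cycle`, `helper`); it does not involve datetime.py or any watcher.

```python
import gc
import weakref

def make_function_cycle():
    def helper():
        return None
    # The function's attribute dict refers back to the function, forming a
    # cycle that reference counting alone cannot free.
    helper.me = helper
    return weakref.ref(helper)

was_enabled = gc.isenabled()
gc.disable()  # keep automatic collections from interfering with the demo
try:
    ref = make_function_cycle()
    alive_before = ref() is not None  # still alive: only the cycle holds it
    gc.collect()                      # the collector frees the cycle here
    alive_after = ref() is not None
finally:
    if was_enabled:
        gc.enable()

print(alive_before, alive_after)  # True False
```

Any function watcher active at the time of such a collection is notified when the function is destroyed, which is how unrelated objects ended up triggering the crash.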

@carljm
Member

carljm commented Mar 8, 2023

Fixed in #102382

@carljm carljm closed this as completed Mar 8, 2023