-
Notifications
You must be signed in to change notification settings - Fork 17
Fix hang on GC in Colorer._write() #275
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
original bug has been reproduced, test-run with proposed commit fixes the problem
LGTM
I'll update the reduced test case a bit: diff --git a/test/xlog/test-run-hang-gh-qa-96.test.lua b/test/xlog/test-run-hang-gh-qa-96.test.lua
index ca58a1acb..9c06910c6 100644
--- a/test/xlog/test-run-hang-gh-qa-96.test.lua
+++ b/test/xlog/test-run-hang-gh-qa-96.test.lua
@@ -1,5 +1,7 @@
test_run = require('test_run').new()
+box.schema.user.grant('guest', 'replication')
test_run:cmd('create server replica with rpl_master=default, script="xlog/replica.lua"')
test_run:cmd('start server replica')
test_run:cmd('cleanup server replica')
test_run:cmd('delete server replica')
+box.schema.user.revoke('guest', 'replication') The original one leads to another 'hang worker' problem, irrelevant here. Sergey filed #276 regarding this. Don't know how I missed it. |
Ouch, the reduced test case does not actually test the problem. I'll update it. |
One more incremental update for the test case. Sorry for the mess. diff --git a/test/xlog/test-run-hang-gh-qa-96.test.lua b/test/xlog/test-run-hang-gh-qa-96.test.lua
index 9c06910c6..fc26459d9 100644
--- a/test/xlog/test-run-hang-gh-qa-96.test.lua
+++ b/test/xlog/test-run-hang-gh-qa-96.test.lua
@@ -2,6 +2,7 @@ test_run = require('test_run').new()
box.schema.user.grant('guest', 'replication')
test_run:cmd('create server replica with rpl_master=default, script="xlog/replica.lua"')
test_run:cmd('start server replica')
+test_run:cmd('stop server replica')
test_run:cmd('cleanup server replica')
test_run:cmd('delete server replica')
box.schema.user.revoke('guest', 'replication') |
How to spot the problem visually: | [Main process] No output from workers. It seems that we hang. Send | SIGKILL to workers; exiting... How to reproduce: | (Patch Python to trigger GC inside Colorer._write().) | $ diff -u /usr/lib/python3.9/multiprocessing/connection.py{.orig,} | --- /usr/lib/python3.9/multiprocessing/connection.py.orig | +++ /usr/lib/python3.9/multiprocessing/connection.py | @@ -202,6 +202,8 @@ | raise ValueError("size is negative") | elif offset + size > n: | raise ValueError("buffer length < offset + size") | + import gc | + gc.collect() | self._send_bytes(m[offset:offset + size]) | | def send(self, obj): | | (Just in case, my tarantool version.) | $ ./src/tarantool --version | head -n 1 | Tarantool 2.8.0-134-g81c663335 | | (Add the reduced test case.) | $ cat test/xlog/test-run-hang-gh-qa-96.test.lua | test_run = require('test_run').new() | box.schema.user.grant('guest', 'replication') | test_run:cmd('create server replica with rpl_master=default, script="xlog/replica.lua"') | test_run:cmd('start server replica') | test_run:cmd('stop server replica') | test_run:cmd('cleanup server replica') | test_run:cmd('delete server replica') | box.schema.user.revoke('guest', 'replication') | | (Run the reduced test case.) | $ ./test/test-run.py xlog/test-run-hang-gh-qa-96.test.lua | | (Or run existing test with instance managing.) | $ ./test/test-run.py xlog/panic_on_broken_lsn.test.lua The problem appears, when GC is triggered inside Colorer._write() (more precisely, in multiprocessing.SimpleQueue#put()), and TarantoolServer instance is collected. __del__() calls stop(), which calls color_log(), which calls SimpleQueue#put(), which blocks on a lock. The process stucks. In fact, test-run should stop instances correctly without this __del__() method. If it is not so, it is a bug in test-run, which should be fixed anyway. So, I just removed this __del__() method. The problem looks related to [1], but it is unclear, whether it is the only problem, so I'll leave the issue open for a while. [1]: tarantool/tarantool-qa#96
75361c4
to
53ce01c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've successfully checked your investigation locally, patch LGTM.
Just to remember:
- reproduced on Alexander's test hang on color_log():
diff --git a/lib/tarantool_server.py b/lib/tarantool_server.py
index f611dba..2de01d3 100644
--- a/lib/tarantool_server.py
+++ b/lib/tarantool_server.py
@@ -1020,11 +1020,17 @@ class TarantoolServer(Server):
if not silent:
raise Exception('Server is not started')
else:
+ print("CHECK 0")
+ # exit helped to avoid of hang
+ #os.exit(0)
color_log(
'Server [{0.name}] is not started '
'(status:{0.status}) ...\n'.format(self),
schema='test_var'
)
+ print("CHECK 1")
+ # exit not helped - hanged on color_log()
+ os.exit(0)
return
if not silent:
color_stdout('[Instance {}] Stopping the server...\n'.format(
- got stack trace of the blocked place:
^CException ignored in: <function TarantoolServer.__del__ at 0x7f9650b4cb80>
Traceback (most recent call last):
File "/home/avtikhon/Workspaces/tarantool/test-run/lib/tarantool_server.py", line 662, in __del__
self.stop()
File "/home/avtikhon/Workspaces/tarantool/test-run/lib/tarantool_server.py", line 1026, in stop
color_log(
File "/home/avtikhon/Workspaces/tarantool/test-run/lib/colorer.py", line 20, in color_log
color_stdout(*args, **kwargs)
File "/home/avtikhon/Workspaces/tarantool/test-run/lib/colorer.py", line 254, in __call__
self.write(*args, **kwargs)
File "/home/avtikhon/Workspaces/tarantool/test-run/lib/colorer.py", line 250, in write
self._write(data, kwargs.get('log_only', False))
File "/home/avtikhon/Workspaces/tarantool/test-run/lib/colorer.py", line 217, in _write
self.queue.put(obj)
File "/usr/lib/python3.8/multiprocessing/queues.py", line 367, in put
with self._wlock:
File "/usr/lib/python3.8/multiprocessing/synchronize.py", line 95, in __enter__
return self._semlock.__enter__()
KeyboardInterrupt:
[001] xlog/test-run-hang-gh-qa-96.test.lua [ fail ]
This update fixes a sporadic problem with hanging test-run workers. The reason is an incorrect garbage collector handler. See [1] for details. This is not the last test-run problem, which leads to a hang worker: at least there is known problem [2]. [1]: tarantool/test-run#275 [2]: tarantool/test-run#276 Part of tarantool/tarantool-qa#96
This update fixes a sporadic problem with hanging test-run workers. The reason is an incorrect garbage collector handler. See [1] for details. This is not the last test-run problem, which leads to a hang worker: at least there is known problem [2]. [1]: tarantool/test-run#275 [2]: tarantool/test-run#276 Part of tarantool/tarantool-qa#96 (cherry picked from commit 680990a)
This update fixes a sporadic problem with hanging test-run workers. The reason is an incorrect garbage collector handler. See [1] for details. This is not the last test-run problem, which leads to a hang worker: at least there is known problem [2]. [1]: tarantool/test-run#275 [2]: tarantool/test-run#276 Part of tarantool/tarantool-qa#96 (cherry picked from commit 680990a)
This update fixes a sporadic problem with hanging test-run workers. The reason is an incorrect garbage collector handler. See [1] for details. This is not the last test-run problem, which leads to a hang worker: at least there is known problem [2]. [1]: tarantool/test-run#275 [2]: tarantool/test-run#276 Part of tarantool/tarantool-qa#96 (cherry picked from commit 680990a)
Updated the test-run submodule in tarantool in 2.8.0-147-g680990a08, 2.7.1-131-gd3268bc82, 2.6.2-128-g404458f51, 1.10.9-80-g5be2b57c3. |
How to spot the problem visually:
How to reproduce:
The problem appears, when GC is triggered inside Colorer._write() (more
precisely, in multiprocessing.SimpleQueue#put()), and TarantoolServer
instance is collected. __del__() calls stop(), which calls color_log(),
which calls SimpleQueue#put(), which blocks on a lock. The process
stucks.
In fact, test-run should stop instances correctly without this __del__()
method. If it is not so, it is a bug in test-run, which should be fixed
anyway.
So, I just removed this __del__() method.
The problem looks related to 1, but it is unclear, whether it is the
only problem, so I'll leave the issue open for a while.