Fix hang on GC in Colorer._write() #275


Merged
merged 1 commit into from
Mar 22, 2021

Conversation

Totktonada
Member

@Totktonada Totktonada commented Mar 18, 2021

How to spot the problem visually:

[Main process] No output from workers. It seems that we hang. Send
SIGKILL to workers; exiting...

How to reproduce:

(Patch Python to trigger GC inside Colorer._write().)
$ diff -u /usr/lib/python3.9/multiprocessing/connection.py{.orig,}
--- /usr/lib/python3.9/multiprocessing/connection.py.orig
+++ /usr/lib/python3.9/multiprocessing/connection.py
@@ -202,6 +202,8 @@
             raise ValueError("size is negative")
         elif offset + size > n:
             raise ValueError("buffer length < offset + size")
+        import gc
+        gc.collect()
         self._send_bytes(m[offset:offset + size])

     def send(self, obj):

(Just in case, my tarantool version.)
$ ./src/tarantool --version | head -n 1
Tarantool 2.8.0-134-g81c663335

(Add the reduced test case.)
$ cat test/xlog/test-run-hang-gh-qa-96.test.lua
test_run = require('test_run').new()
box.schema.user.grant('guest', 'replication')
test_run:cmd('create server replica with rpl_master=default, script="xlog/replica.lua"')
test_run:cmd('start server replica')
test_run:cmd('stop server replica')
test_run:cmd('cleanup server replica')
test_run:cmd('delete server replica')
box.schema.user.revoke('guest', 'replication')

(Run the reduced test case.)
$ ./test/test-run.py xlog/test-run-hang-gh-qa-96.test.lua

(Or run existing test with instance managing.)
$ ./test/test-run.py xlog/panic_on_broken_lsn.test.lua

The problem appears when GC is triggered inside Colorer._write() (more
precisely, in multiprocessing.SimpleQueue#put()) and a TarantoolServer
instance is collected. __del__() calls stop(), which calls color_log(),
which calls SimpleQueue#put(), which blocks on a lock. The process gets
stuck.

In fact, test-run should stop instances correctly without this __del__()
method. If it does not, that is a bug in test-run, which should be fixed
anyway.

So, I just removed this __del__() method.
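The failure mode can be sketched in isolation. Below is a minimal, hypothetical model (not test-run's actual code): an object whose finalizer logs through a lock sits in a reference cycle, so only the cycle collector can reclaim it, and the collection happens to run while that very lock is held — the same situation as GC firing inside Colorer._write(). The demo uses a non-blocking acquire so it reports the would-be deadlock instead of actually hanging:

```python
import gc
import threading

log_lock = threading.Lock()   # stands in for SimpleQueue's internal write lock
events = []

class Server:
    """Hypothetical stand-in for TarantoolServer: a finalizer that logs."""
    def __init__(self):
        self._cycle = self    # reference cycle, so only the GC can reclaim it

    def __del__(self):
        # Non-blocking acquire, so the demo reports instead of hanging.
        # A blocking acquire here is exactly the deadlock described above.
        if log_lock.acquire(blocking=False):
            events.append("logged")
            log_lock.release()
        else:
            events.append("would deadlock")

Server()                      # immediately garbage (only the cycle keeps it)
with log_lock:                # the "writer" holds the lock...
    gc.collect()              # ...and GC runs the finalizer on this thread

Server()
gc.collect()                  # same finalizer with the lock free: no problem

print(events)                 # ['would deadlock', 'logged']
```

Removing the finalizer sidesteps the whole class of problems: finalizers run at unpredictable points, including inside arbitrary lock-holding code.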

The problem looks related to [1], but it is unclear whether it is the
only problem, so I'll leave the issue open for a while.

[1]: tarantool/tarantool-qa#96

@Totktonada Totktonada added the bug Something isn't working label Mar 18, 2021
@Totktonada Totktonada requested review from avtikhon and ligurio March 18, 2021 19:29
Member

@ligurio ligurio left a comment


The original bug has been reproduced; test-run with the proposed commit fixes the problem.

LGTM

@Totktonada
Member Author

I'll update the reduced test case a bit:

diff --git a/test/xlog/test-run-hang-gh-qa-96.test.lua b/test/xlog/test-run-hang-gh-qa-96.test.lua
index ca58a1acb..9c06910c6 100644
--- a/test/xlog/test-run-hang-gh-qa-96.test.lua
+++ b/test/xlog/test-run-hang-gh-qa-96.test.lua
@@ -1,5 +1,7 @@
 test_run = require('test_run').new()
+box.schema.user.grant('guest', 'replication')
 test_run:cmd('create server replica with rpl_master=default, script="xlog/replica.lua"')
 test_run:cmd('start server replica')
 test_run:cmd('cleanup server replica')
 test_run:cmd('delete server replica')
+box.schema.user.revoke('guest', 'replication')

The original one leads to another hung-worker problem, which is irrelevant here. Sergey filed #276 regarding it.

Don't know how I missed it.

@Totktonada
Member Author

Ouch, the reduced test case does not actually test the problem. I'll update it.

@Totktonada
Member Author

One more incremental update for the test case. Sorry for the mess.

diff --git a/test/xlog/test-run-hang-gh-qa-96.test.lua b/test/xlog/test-run-hang-gh-qa-96.test.lua
index 9c06910c6..fc26459d9 100644
--- a/test/xlog/test-run-hang-gh-qa-96.test.lua
+++ b/test/xlog/test-run-hang-gh-qa-96.test.lua
@@ -2,6 +2,7 @@ test_run = require('test_run').new()
 box.schema.user.grant('guest', 'replication')
 test_run:cmd('create server replica with rpl_master=default, script="xlog/replica.lua"')
 test_run:cmd('start server replica')
+test_run:cmd('stop server replica')
 test_run:cmd('cleanup server replica')
 test_run:cmd('delete server replica')
 box.schema.user.revoke('guest', 'replication')

@Totktonada Totktonada force-pushed the Totktonada/fix-hang-on-gc-in-colorer-write branch from 75361c4 to 53ce01c on March 19, 2021 16:04
Contributor

@avtikhon avtikhon left a comment


I've successfully checked your investigation locally; the patch LGTM.

Just for the record:

  1. Reproduced the hang on color_log() with Alexander's test:
diff --git a/lib/tarantool_server.py b/lib/tarantool_server.py
index f611dba..2de01d3 100644
--- a/lib/tarantool_server.py
+++ b/lib/tarantool_server.py
@@ -1020,11 +1020,17 @@ class TarantoolServer(Server):
             if not silent:
                 raise Exception('Server is not started')
             else:
+                print("CHECK 0")
+                # exiting here avoided the hang
+                #os._exit(0)
                 color_log(
                     'Server [{0.name}] is not started '
                     '(status:{0.status}) ...\n'.format(self),
                     schema='test_var'
                 )
+                print("CHECK 1")
+                # exiting here did not help - it hung on color_log()
+                os._exit(0)
             return
         if not silent:
             color_stdout('[Instance {}] Stopping the server...\n'.format(
  2. Got a stack trace of the blocked place:
^CException ignored in: <function TarantoolServer.__del__ at 0x7f9650b4cb80>
Traceback (most recent call last):
  File "/home/avtikhon/Workspaces/tarantool/test-run/lib/tarantool_server.py", line 662, in __del__
    self.stop()
  File "/home/avtikhon/Workspaces/tarantool/test-run/lib/tarantool_server.py", line 1026, in stop
    color_log(
  File "/home/avtikhon/Workspaces/tarantool/test-run/lib/colorer.py", line 20, in color_log
    color_stdout(*args, **kwargs)
  File "/home/avtikhon/Workspaces/tarantool/test-run/lib/colorer.py", line 254, in __call__
    self.write(*args, **kwargs)
  File "/home/avtikhon/Workspaces/tarantool/test-run/lib/colorer.py", line 250, in write
    self._write(data, kwargs.get('log_only', False))
  File "/home/avtikhon/Workspaces/tarantool/test-run/lib/colorer.py", line 217, in _write
    self.queue.put(obj)
  File "/usr/lib/python3.8/multiprocessing/queues.py", line 367, in put
    with self._wlock:
  File "/usr/lib/python3.8/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt: 
[001] xlog/test-run-hang-gh-qa-96.test.lua                            [ fail ]
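Removing __del__() is the actual fix merged here, but the stack trace also suggests a general hardening idea (not something this patch does): a logger that may be reached from a finalizer can use a bounded wait instead of an unbounded lock wait. A hypothetical helper sketch, assuming a multiprocessing.Queue in place of SimpleQueue (SimpleQueue#put() has no timeout parameter, while Queue.put() does):

```python
import multiprocessing
import queue

def safe_put(q, obj, timeout=1.0):
    """Hypothetical bounded-wait put: returns False instead of hanging.

    multiprocessing.Queue.put() accepts a timeout and raises queue.Full
    when it expires, unlike SimpleQueue.put(), which waits on its write
    lock indefinitely.
    """
    try:
        q.put(obj, timeout=timeout)
        return True
    except queue.Full:
        return False

if __name__ == "__main__":
    q = multiprocessing.Queue(maxsize=1)
    print(safe_put(q, "first"))                # True: the queue has room
    print(safe_put(q, "second", timeout=0.2))  # False: full, gave up after 0.2 s
```

A bounded wait would turn this kind of deadlock into a dropped log line, which is a much milder failure for test infrastructure.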

@Totktonada Totktonada merged commit cb5adde into master Mar 22, 2021
@Totktonada Totktonada deleted the Totktonada/fix-hang-on-gc-in-colorer-write branch March 22, 2021 19:58
Totktonada added a commit to tarantool/tarantool that referenced this pull request Mar 22, 2021
This update fixes a sporadic problem with hanging test-run workers. The
reason is an incorrect garbage collector handler. See [1] for details.

This is not the last test-run problem that leads to a hung worker: at
least one known problem [2] remains.

[1]: tarantool/test-run#275
[2]: tarantool/test-run#276

Part of tarantool/tarantool-qa#96
Totktonada added a commit to tarantool/tarantool that referenced this pull request Mar 22, 2021
This update fixes a sporadic problem with hanging test-run workers. The
reason is an incorrect garbage collector handler. See [1] for details.

This is not the last test-run problem that leads to a hung worker: at
least one known problem [2] remains.

[1]: tarantool/test-run#275
[2]: tarantool/test-run#276

Part of tarantool/tarantool-qa#96

(cherry picked from commit 680990a)
@Totktonada
Member Author

Updated the test-run submodule in tarantool in 2.8.0-147-g680990a08, 2.7.1-131-gd3268bc82, 2.6.2-128-g404458f51, 1.10.9-80-g5be2b57c3.
