-
Notifications
You must be signed in to change notification settings - Fork 18
Add check in wait_lsn for nil value #269
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hi! Thank you for the patch. |
I wondered how tarantool> require('fiber').new(function() if box.info.vclock[1] == nil then print(require('json').encode(box.info)) else print('success') end end) box.cfg{} The Also I found that when an instance is freshly bootstrapped, it also has The code does not fix this particular problem. You check whether It seems, you (again, and again, and again...) trying to blindly fix a problem without any reproducer. Ask for help if you need, but don't produce meaningless patches. Reproducer: diff --git a/test/app-tap/test-run-pr-269.test.lua b/test/app-tap/test-run-pr-269.test.lua
new file mode 100755
index 000000000..2ebb108b0
--- /dev/null
+++ b/test/app-tap/test-run-pr-269.test.lua
@@ -0,0 +1,17 @@
+#!/usr/bin/env tarantool
+
+local json = require('json')
+local test_run = require('test_run').new()
+
+test_run:cmd("create server foo with script='box/box.lua'")
+test_run:cmd('start server foo')
+--test_run:eval('foo', "box.once('first dml', function() end)")
+
+local res = test_run:get_param('foo', 'vclock')
+print(json.encode(res))
+local res = test_run:get_lsn('foo', 1)
+print(json.encode(res))
+
+test_run:cmd('stop server foo')
+test_run:cmd('cleanup server foo')
+test_run:cmd('delete server foo') How to run it: $ ./test/test-run.py --verbose app-tap/test-run-pr-269.test.lua
<...>
[001] [{}]
[001] null I'll add inline comments regarding the code. |
43b2f23
to
cc60664
Compare
Found issue in testing: [006] --- replication/election_qsync.result Fri Oct 16 00:12:02 2020 [006] +++ replication/election_qsync.reject Sat Oct 17 19:08:11 2020 [006] @@ -89,6 +89,8 @@ [006] -- Wait replication to the other instance. [006] test_run:wait_lsn('default', 'replica') [006] | --- [006] + | - error: '...sitories/tarantool/test/var/006_replication/test_run.lua:68: attempt [006] + | to compare nil with number' [006] | ... Found that wait_lsn() routine from test_run.lua script failed on comparing returned result from get_lsn() routine with real integer. It happened because get_lsn() returned 'nil' on the freshly bootstrapped instance. To avoid of it in wait_lsn() routine added check that get_lsn() routine returned not 'nil' value. Complete explanation of the issue and reproducer provided in [1]. Closes #226 Co-authored-by: Alexander Turenko <[email protected]> [1]: #269 (comment)
cc60664
to
1317013
Compare
Found issue in testing: [006] --- replication/election_qsync.result Fri Oct 16 00:12:02 2020 [006] +++ replication/election_qsync.reject Sat Oct 17 19:08:11 2020 [006] @@ -89,6 +89,8 @@ [006] -- Wait replication to the other instance. [006] test_run:wait_lsn('default', 'replica') [006] | --- [006] + | - error: '...sitories/tarantool/test/var/006_replication/test_run.lua:68: attempt [006] + | to compare nil with number' [006] | ... Found that wait_lsn() routine from test_run.lua script failed on comparing returned result from get_lsn() routine with real integer. It happened because get_lsn() returned 'nil' on the freshly bootstrapped instance. To avoid of it in wait_lsn() routine added check that get_lsn() routine returned not 'nil' value. Complete explanation of the issue and reproducer provided in [1]. Closes #226 Co-authored-by: Alexander Turenko <[email protected]> [1]: #269 (comment)
1317013
to
74f013f
Compare
Found issue in testing: [006] --- replication/election_qsync.result Fri Oct 16 00:12:02 2020 [006] +++ replication/election_qsync.reject Sat Oct 17 19:08:11 2020 [006] @@ -89,6 +89,8 @@ [006] -- Wait replication to the other instance. [006] test_run:wait_lsn('default', 'replica') [006] | --- [006] + | - error: '...sitories/tarantool/test/var/006_replication/test_run.lua:68: attempt [006] + | to compare nil with number' [006] | ... Found that wait_lsn() routine from test_run.lua script failed on comparing returned result from get_lsn() routine with real integer. It happened because get_lsn() returned 'nil' on the freshly bootstrapped instance. To avoid of it in wait_lsn() routine added check that get_lsn() routine returned not 'nil' value. Complete explanation of the issue and reproducer provided in [1]. Closes #226 Co-authored-by: Alexander Turenko <[email protected]> [1]: #269 (comment)
74f013f
to
cef0855
Compare
Thanks a lot for the explanation. I've added your comment link to commit message and used you reproducer to check the fix. |
I've changed the patch as @Totktonada suggested. |
Bumped test run with the following fix. Found issue in testing: [006] --- replication/election_qsync.result Fri Oct 16 00:12:02 2020 [006] +++ replication/election_qsync.reject Sat Oct 17 19:08:11 2020 [006] @@ -89,6 +89,8 @@ [006] -- Wait replication to the other instance. [006] test_run:wait_lsn('default', 'replica') [006] | --- [006] + | - error: '...sitories/tarantool/test/var/006_replication/test_run.lua:68: attempt [006] + | to compare nil with number' [006] | ... Found that wait_lsn() routine from test_run.lua script failed on comparing returned result from get_lsn() routine with real integer. It happened because get_lsn() returned 'nil' on the freshly bootstrapped instance. To avoid of it in wait_lsn() routine added check that get_lsn() routine returned not 'nil' value. Complete explanation of the issue and reproducer provided in [1]. Closes tarantool/test-run#226 Co-authored-by: Alexander Turenko <[email protected]> [1]: tarantool/test-run#269 (comment)
Hi! About the patch: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you agree with #269 (comment) - update the commit message.
Please, don't ignore:"The terminology was broken before your patch, but, please, fix it." from #269 (comment)
Found issue in testing: [006] --- replication/election_qsync.result Fri Oct 16 00:12:02 2020 [006] +++ replication/election_qsync.reject Sat Oct 17 19:08:11 2020 [006] @@ -89,6 +89,8 @@ [006] -- Wait replication to the other instance. [006] test_run:wait_lsn('default', 'replica') [006] | --- [006] + | - error: '...sitories/tarantool/test/var/006_replication/test_run.lua:68: attempt [006] + | to compare nil with number' [006] | ... Found that wait_lsn() routine from test_run.lua script failed on comparing returned result from get_lsn() routine with real integer. It happened because get_lsn() returned 'nil' on the freshly bootstrapped instance. To avoid of it in wait_lsn() routine added check that get_lsn() routine returned not 'nil' value. Complete explanation of the issue and reproducer provided in [1]. This patch does not fix the found issue in tests, but changes the way it works with the issue. Instead of breaking the test run, for now it will wait for the lsn value different from nil in loop till available timeout. Closes #226 Co-authored-by: Alexander Turenko <[email protected]> [1]: #269 (comment)
cef0855
to
5981144
Compare
Added comment in commit message and PR.
Seems this fix need to be done in separate commit. |
Yep. The PR can contain 2 commits. |
Hi!
|
Found issue in testing: [006] --- replication/election_qsync.result Fri Oct 16 00:12:02 2020 [006] +++ replication/election_qsync.reject Sat Oct 17 19:08:11 2020 [006] @@ -89,6 +89,8 @@ [006] -- Wait replication to the other instance. [006] test_run:wait_lsn('default', 'replica') [006] | --- [006] + | - error: '...sitories/tarantool/test/var/006_replication/test_run.lua:68: attempt [006] + | to compare nil with number' [006] | ... Found that wait_lsn() routine from test_run.lua script failed on comparing returned result from get_lsn() routine with real integer. It happened because get_lsn() returned 'nil' on the freshly bootstrapped instance. To avoid of it in wait_lsn() routine added check that get_lsn() routine returned not 'nil' value. Complete explanation of the issue and reproducer provided in [1]. This patch does not fix the found issue in tests, but changes the way it works with the issue. Instead of breaking the test run, for now it will wait for the lsn value different from nil in loop till available timeout. Needed for #226 Co-authored-by: Alexander Turenko <[email protected]> [1]: #269 (comment)
5981144
to
2da38e7
Compare
Found issue in testing: [006] --- replication/election_qsync.result Fri Oct 16 00:12:02 2020 [006] +++ replication/election_qsync.reject Sat Oct 17 19:08:11 2020 [006] @@ -89,6 +89,8 @@ [006] -- Wait replication to the other instance. [006] test_run:wait_lsn('default', 'replica') [006] | --- [006] + | - error: '...sitories/tarantool/test/var/006_replication/test_run.lua:68: attempt [006] + | to compare nil with number' [006] | ... Found that wait_lsn() routine from test_run.lua script failed on comparing returned result from get_lsn() routine with real integer. It happened because get_lsn() returned 'nil' on the freshly bootstrapped instance. To avoid of it in wait_lsn() routine added check that get_lsn() routine returned not 'nil' value. Complete explanation of the issue and reproducer provided in [1]. This patch does not fix the found issue in tests, but changes the way it works with the issue. Instead of breaking the test run, for now it will wait for the lsn value different from nil in loop till available timeout. Needed for #226 Co-authored-by: Alexander Turenko <[email protected]> [1]: #269 (comment)
Found issue in testing: [006] --- replication/election_qsync.result Fri Oct 16 00:12:02 2020 [006] +++ replication/election_qsync.reject Sat Oct 17 19:08:11 2020 [006] @@ -89,6 +89,8 @@ [006] -- Wait replication to the other instance. [006] test_run:wait_lsn('default', 'replica') [006] | --- [006] + | - error: '...sitories/tarantool/test/var/006_replication/test_run.lua:68: attempt [006] + | to compare nil with number' [006] | ... Found that wait_lsn() routine from test_run.lua script failed on comparing returned result from get_lsn() routine with real integer. It happened because get_lsn() returned 'nil' on the freshly bootstrapped instance. To avoid of it in wait_lsn() routine added check that get_lsn() routine returned not 'nil' value. Complete explanation of the issue and reproducer provided in [1]. This patch does not fix the found issue in tests, but changes the way it works with the issue. Instead of breaking the test run, for now it will wait for the lsn value different from nil in loop till available timeout. Needed for #226 Co-authored-by: Alexander Turenko <[email protected]> [1]: #269 (comment)
2da38e7
to
5a8ee8d
Compare
This patch does not fix the found issue in tests, but changes the |
Yep, I wrote the same in the first paragraph. |
Since the change is described in context of the |
Found issue in testing: [006] --- replication/election_qsync.result Fri Oct 16 00:12:02 2020 [006] +++ replication/election_qsync.reject Sat Oct 17 19:08:11 2020 [006] @@ -89,6 +89,8 @@ [006] -- Wait replication to the other instance. [006] test_run:wait_lsn('default', 'replica') [006] | --- [006] + | - error: '...sitories/tarantool/test/var/006_replication/test_run.lua:68: attempt [006] + | to compare nil with number' [006] | ... Found that wait_lsn() routine from test_run.lua script failed on comparing returned result from get_lsn() routine with real integer. It happened because get_lsn() returned 'nil' on the freshly bootstrapped instance. It happened when replica vclock [1:x, 2:1], while master vclock [1:x, 2:0], but in box.info.vclock[2] is nil instead of 0. To avoid of it in wait_lsn() routine added check that get_lsn() routine returned not 'nil' value. Complete explanation of the issue and reproducer provided in [1]. This test showed the issue in test-run which could happen with any test that uses test_run:wait_lsn() routine. This patch fixes this routine to be able to work correctly with 'nil' value returned by get_lsn() routine. Needed for #226 Co-authored-by: Alexander Turenko <[email protected]> [1]: #269 (comment)
5a8ee8d
to
02ae582
Compare
Once more discussed this issue with @sergepetrenko and found that this situation could happen Discussed once more with @Totktonada this fix and decided to correct commit message with more clear |
I propose to consider #226 as the problem with inability to wait until a vclock component will going above zero and close the issue with this PR. If there are other problems in the test, let's investigate them out of scope of the issue. |
Since @sergepetrenko approved that the described situation is possible in the test, I'm okay with wording that describes zero vclock component problem and the particular test that was hit by the problem. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM (aside that I vote for closing the issue from this PR).
Found issue in testing: [006] --- replication/election_qsync.result Fri Oct 16 00:12:02 2020 [006] +++ replication/election_qsync.reject Sat Oct 17 19:08:11 2020 [006] @@ -89,6 +89,8 @@ [006] -- Wait replication to the other instance. [006] test_run:wait_lsn('default', 'replica') [006] | --- [006] + | - error: '...sitories/tarantool/test/var/006_replication/test_run.lua:68: attempt [006] + | to compare nil with number' [006] | ... Found that wait_lsn() routine from test_run.lua script failed on comparing returned result from get_lsn() routine with real integer. It happened because get_lsn() returned 'nil' on the freshly bootstrapped instance. It happened when replica vclock [1:x, 2:1], while master vclock [1:x, 2:0], but in box.info.vclock[2] is nil instead of 0. To avoid of it in wait_lsn() routine added check that get_lsn() routine returned not 'nil' value. Complete explanation of the issue and reproducer provided in [1]. This test showed the issue in test-run which could happen with any test that uses test_run:wait_lsn() routine. This patch fixes this routine to be able to work correctly with 'nil' value returned by get_lsn() routine. Closes #226 Co-authored-by: Alexander Turenko <[email protected]> [1]: #269 (comment)
02ae582
to
1c95830
Compare
Agree to close the issue, corrected commit message. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
I wrote optional fixes for the commit message in a private conversation.
Found issue in testing: [006] --- replication/election_qsync.result Fri Oct 16 00:12:02 2020 [006] +++ replication/election_qsync.reject Sat Oct 17 19:08:11 2020 [006] @@ -89,6 +89,8 @@ [006] -- Wait replication to the other instance. [006] test_run:wait_lsn('default', 'replica') [006] | --- [006] + | - error: '...sitories/tarantool/test/var/006_replication/test_run.lua:68: attempt [006] + | to compare nil with number' [006] | ... Found that wait_lsn() routine from test_run.lua script failed on comparing returned result from get_lsn() routine with real integer. It happened because get_lsn() returned 'nil' on the freshly bootstrapped instance. This situation happens after master's re-election happened, when replica vclock [1:x, 2:1], while master vclock [1:x, 2:0], but in box.info.vclock[2] is nil instead of 0. To avoid of it in wait_lsn() routine added check that get_lsn() routine returned not 'nil' value. Complete explanation of the issue provided in [1]. This test showed the issue in test-run which could happen with any test that uses test_run:wait_lsn() routine. This patch fixes this routine to be able to work correctly with 'nil' value returned by get_lsn() routine. Closes #226 Co-authored-by: Alexander Turenko <[email protected]> [1]: #269 (comment)
1c95830
to
c9ad2d6
Compare
Updated the test-run submodule in tarantool in 2.8.0-134-g81c663335, 2.7.1-123-ge3d1d9e7a, 2.6.2-120-g790b55a92, 1.10.9-78-g1a4e6d963. |
Found issue in testing:
Found that wait_lsn() routine from test_run.lua script failed on
comparing returned result from get_lsn() routine with real integer.
It happened because get_lsn() returned 'nil' on the freshly
bootstrapped instance. It happened when replica vclock [1:x, 2:1],
while master vclock [1:x, 2:0], but in box.info.vclock[2] is nil
instead of 0. To avoid of it in wait_lsn() routine added check that
get_lsn() routine returned not 'nil' value. Complete explanation of
the issue and reproducer provided in 1.
This test showed the issue in test-run which could happen with any
test that uses test_run:wait_lsn() routine. This patch fixes this
routine to be able to work correctly with 'nil' value returned by
get_lsn() routine.
Closes #226
Co-authored-by: Alexander Turenko [email protected]