Add check in wait_lsn for nil value #269

Merged: Totktonada merged 1 commit into master from avtikhon/gh-226-wait_lsn on Mar 18, 2021

@avtikhon
Contributor

avtikhon commented Mar 1, 2021

An issue was found in testing:

      [006] --- replication/election_qsync.result   Fri Oct 16 00:12:02 2020
      [006] +++ replication/election_qsync.reject   Sat Oct 17 19:08:11 2020
      [006] @@ -89,6 +89,8 @@
      [006]  -- Wait replication to the other instance.
      [006]  test_run:wait_lsn('default', 'replica')
      [006]   | ---
      [006] + | - error: '...sitories/tarantool/test/var/006_replication/test_run.lua:68: attempt
      [006] + |     to compare nil with number'
      [006]   | ...

The wait_lsn() routine in the test_run.lua script failed when comparing
the result of get_lsn() with a number. This happened because get_lsn()
returned nil on a freshly bootstrapped instance: the replica vclock was
[1:x, 2:1] while the master vclock was [1:x, 2:0], and in that state
box.info.vclock[2] is nil instead of 0. To avoid the error, wait_lsn()
now checks that get_lsn() returned a non-nil value. A complete
explanation of the issue and a reproducer are provided in #269 (comment).

This test exposed an issue in test-run that can affect any test using
the test_run:wait_lsn() routine. This patch makes the routine handle a
nil value returned by get_lsn() correctly.
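
The idea of the check can be illustrated with a minimal sketch in plain Lua. It is not the code from test_run.lua: `get_lsn` and `wait_lsn` below are simplified stand-ins, and `read_vclock`, `max_polls` and the simulated `snapshots` are assumptions made only so the example runs outside of Tarantool.

```lua
-- Hypothetical stand-in for test_run:get_lsn(): returns one vclock component,
-- or nil when the vclock has no entry for that server id.
local function get_lsn(vclock, server_id)
    return vclock[server_id]
end

-- Nil-tolerant wait (sketch). `read_vclock` is a hypothetical callback that
-- returns the waiter's current vclock; `max_polls` bounds the loop so the
-- example terminates (a real helper would sleep between polls instead).
local function wait_lsn(read_vclock, server_id, target_lsn, max_polls)
    for poll = 1, max_polls do
        local lsn = get_lsn(read_vclock(), server_id)
        -- Guard first: comparing nil with a number raises an error in Lua.
        if lsn ~= nil and lsn >= target_lsn then
            return true, poll
        end
    end
    return false, max_polls
end

-- Simulated waiter: component 2 is absent for two polls, then appears.
local snapshots = { {[1] = 5}, {[1] = 5}, {[1] = 5, [2] = 1} }
local cursor = 0
local function read_vclock()
    cursor = math.min(cursor + 1, #snapshots)
    return snapshots[cursor]
end

local reached, polls = wait_lsn(read_vclock, 2, 1, 10)
print(reached, polls)
```

In this sketch an absent component is treated as "not there yet" rather than as an error, which is the behavior the description above asks for.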

Closes #226

Co-authored-by: Alexander Turenko [email protected]

@avtikhon avtikhon requested a review from LeonidVas March 1, 2021 15:45
@avtikhon avtikhon self-assigned this Mar 1, 2021
@avtikhon avtikhon added the teamQ label Mar 1, 2021
@LeonidVas

LeonidVas commented Mar 2, 2021

Hi! Thank you for the patch.
AFAIU in the "election_qsync.test.lua" wait_lsn() is used for the old master, and the "bootstrap" problem does not seem relevant in this case.
Also in this test, the replica was started with options wait=True, wait_load=True, and if I understand correctly, this means that replicaset_connect() was done and vclock must not be empty.
In other words, I don’t understand why the vclock became empty on the old master.

@Totktonada
Member

I wondered how box.info.vclock can return a result (rather than raise an error) whose first item is nil. It seems this is possible at an early stage of box configuration (for an instance with an existing snap/xlog that contains some operations). It can be verified from a console:

tarantool> require('fiber').new(function() if box.info.vclock[1] == nil then print(require('json').encode(box.info)) else print('success') end end) box.cfg{}

The box.info.vclock value is {}.

Also I found that when an instance is freshly bootstrapped, it also has box.info.vclock value {} (easy to check).


The code does not fix this particular problem. You check whether box.info.vclock is nil (not {}). If it were nil, you would get the 'attempt to index nil value' error on line 44, not 'attempt to compare nil with number' on line 68.

It seems you are (again, and again, and again...) trying to blindly fix a problem without any reproducer. Ask for help if you need it, but don't produce meaningless patches.

Reproducer:

diff --git a/test/app-tap/test-run-pr-269.test.lua b/test/app-tap/test-run-pr-269.test.lua
new file mode 100755
index 000000000..2ebb108b0
--- /dev/null
+++ b/test/app-tap/test-run-pr-269.test.lua
@@ -0,0 +1,17 @@
+#!/usr/bin/env tarantool
+
+local json = require('json')
+local test_run = require('test_run').new()
+
+test_run:cmd("create server foo with script='box/box.lua'")
+test_run:cmd('start server foo')
+--test_run:eval('foo', "box.once('first dml', function() end)")
+
+local res = test_run:get_param('foo', 'vclock')
+print(json.encode(res))
+local res = test_run:get_lsn('foo', 1)
+print(json.encode(res))
+
+test_run:cmd('stop server foo')
+test_run:cmd('cleanup server foo')
+test_run:cmd('delete server foo')

How to run it:

$ ./test/test-run.py --verbose app-tap/test-run-pr-269.test.lua
<...>
[001] [{}]
[001] null

I'll add inline comments regarding the code.
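
The distinction drawn in the comment above, between an empty vclock table and a missing one, can be reproduced in plain Lua. This is only a sketch: the `info_*` tables are hypothetical stand-ins for box.info, and `get_lsn` here is a simplified illustration, not the test_run.lua function.

```lua
-- Simplified illustration: read one component of a vclock stored in `info`.
local function get_lsn(info, server_id)
    return info.vclock[server_id]
end

local info_empty_vclock = { vclock = {} }  -- vclock exists but has no components
local info_no_vclock = {}                  -- vclock itself is nil

-- Empty vclock: indexing succeeds and yields nil, so the comparison fails.
local ok_compare, err_compare = pcall(function()
    return get_lsn(info_empty_vclock, 1) < 1
end)

-- Missing vclock: the failure happens earlier, while indexing.
local ok_index, err_index = pcall(function()
    return get_lsn(info_no_vclock, 1) < 1
end)

print(ok_compare, err_compare)
print(ok_index, err_index)
```

The first message mentions a comparison of nil with a number and the second an attempt to index a nil value, so a check for `vclock == nil` cannot prevent the first kind of error.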

@avtikhon avtikhon force-pushed the avtikhon/gh-226-wait_lsn branch from 43b2f23 to cc60664 Compare March 3, 2021 08:43
avtikhon added a commit that referenced this pull request Mar 3, 2021
Found issue in testing:

  [006] --- replication/election_qsync.result	Fri Oct 16 00:12:02 2020
  [006] +++ replication/election_qsync.reject	Sat Oct 17 19:08:11 2020
  [006] @@ -89,6 +89,8 @@
  [006]  -- Wait replication to the other instance.
  [006]  test_run:wait_lsn('default', 'replica')
  [006]   | ---
  [006] + | - error: '...sitories/tarantool/test/var/006_replication/test_run.lua:68: attempt
  [006] + |     to compare nil with number'
  [006]   | ...

Found that wait_lsn() routine from test_run.lua script failed on
comparing returned result from get_lsn() routine with real integer.
It happened because get_lsn() returned 'nil' on the freshly
bootstrapped instance. To avoid of it in wait_lsn() routine added
check that get_lsn() routine returned not 'nil' value.
Complete explanation of the issue and reproducer provided in [1].

Closes #226

Co-authored-by: Alexander Turenko <[email protected]>

[1]: #269 (comment)
@avtikhon avtikhon force-pushed the avtikhon/gh-226-wait_lsn branch from cc60664 to 1317013 Compare March 3, 2021 08:55
@avtikhon avtikhon force-pushed the avtikhon/gh-226-wait_lsn branch from 1317013 to 74f013f Compare March 3, 2021 08:56
@avtikhon avtikhon changed the title Add check in get_lsn for wait_lsn Add check in wait_lsn for nil value Mar 3, 2021
@avtikhon avtikhon force-pushed the avtikhon/gh-226-wait_lsn branch from 74f013f to cef0855 Compare March 3, 2021 09:22
@avtikhon
Contributor Author

avtikhon commented Mar 3, 2021

Thanks a lot for the explanation. I've added a link to your comment to the commit message and used your reproducer to check the fix.

@avtikhon
Contributor Author

avtikhon commented Mar 3, 2021

I've changed the patch as @Totktonada suggested.

@avtikhon avtikhon requested a review from Totktonada March 3, 2021 09:29
avtikhon added a commit to tarantool/tarantool that referenced this pull request Mar 3, 2021
Bumped test run with the following fix.

Found issue in testing:

  [006] --- replication/election_qsync.result   Fri Oct 16 00:12:02 2020
  [006] +++ replication/election_qsync.reject   Sat Oct 17 19:08:11 2020
  [006] @@ -89,6 +89,8 @@
  [006]  -- Wait replication to the other instance.
  [006]  test_run:wait_lsn('default', 'replica')
  [006]   | ---
  [006] + | - error: '...sitories/tarantool/test/var/006_replication/test_run.lua:68: attempt
  [006] + |     to compare nil with number'
  [006]   | ...

Found that wait_lsn() routine from test_run.lua script failed on
comparing returned result from get_lsn() routine with real integer.
It happened because get_lsn() returned 'nil' on the freshly
bootstrapped instance. To avoid of it in wait_lsn() routine added
check that get_lsn() routine returned not 'nil' value.
Complete explanation of the issue and reproducer provided in [1].

Closes tarantool/test-run#226

Co-authored-by: Alexander Turenko <[email protected]>

[1]: tarantool/test-run#269 (comment)
@LeonidVas

Hi!
My question was: "Why is the vclock empty on the old master?"
Alexander's comments don't answer this question.

About the patch:
I had a conversation with @Totktonada and I don't object to updating the wait_lsn() function, but I think this change doesn't fix the problem with the election_qsync.test.lua test.

@LeonidVas LeonidVas left a comment

If you agree with #269 (comment), update the commit message.
Please don't ignore "The terminology was broken before your patch, but, please, fix it." from #269 (comment).

avtikhon added a commit that referenced this pull request Mar 4, 2021
Found issue in testing:

  [006] --- replication/election_qsync.result	Fri Oct 16 00:12:02 2020
  [006] +++ replication/election_qsync.reject	Sat Oct 17 19:08:11 2020
  [006] @@ -89,6 +89,8 @@
  [006]  -- Wait replication to the other instance.
  [006]  test_run:wait_lsn('default', 'replica')
  [006]   | ---
  [006] + | - error: '...sitories/tarantool/test/var/006_replication/test_run.lua:68: attempt
  [006] + |     to compare nil with number'
  [006]   | ...

Found that wait_lsn() routine from test_run.lua script failed on
comparing returned result from get_lsn() routine with real integer.
It happened because get_lsn() returned 'nil' on the freshly
bootstrapped instance. To avoid of it in wait_lsn() routine added
check that get_lsn() routine returned not 'nil' value.
Complete explanation of the issue and reproducer provided in [1].

This patch does not fix the found issue in tests, but changes the
way it works with the issue. Instead of breaking the test run, for
now it will wait for the lsn value different from nil in loop till
available timeout.

Closes #226

Co-authored-by: Alexander Turenko <[email protected]>

[1]: #269 (comment)
@avtikhon avtikhon force-pushed the avtikhon/gh-226-wait_lsn branch from cef0855 to 5981144 Compare March 4, 2021 14:59
@avtikhon
Contributor Author

avtikhon commented Mar 4, 2021

> If you agree with #269 (comment) - update the commit message.

Added a comment to the commit message and the PR.

> Please, don't ignore:"The terminology was broken before your patch, but, please, fix it." from #269 (comment)

It seems this fix needs to be done in a separate commit.

@LeonidVas

LeonidVas commented Mar 4, 2021

> > Please, don't ignore:"The terminology was broken before your patch, but, please, fix it." from #269 (comment)
>
> Seems this fix need to be done in separate commit.

Yep. The PR can contain 2 commits.
It's not a fix; it's something like a "stylistic correction".
Up to you.

@LeonidVas

> > If you agree with #269 (comment) - update the commit message.
>
> Added comment in commit message and PR.

Hi!
About the commit message:

  • At the moment this change doesn't seem to fix anything; it just changes the behavior, and I don't believe the new behavior is better: before, you got an error at the place where something went wrong, but now you will get a hanging test that is killed after 120 seconds. To me, the old behavior was clearer. But this change could (potentially) make it easier to write tests, so I don't mind.
  • If the first point is true, this change is not related to issue #226 (test_run:wait_lsn() fails with "attempt to compare nil with number") at all, and the commit message should only describe the behavior change and its motivation.
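
The trade-off described in the first point can be sketched in plain Lua. Both functions below are hypothetical simplifications, not the test_run.lua code, and `max_polls` stands in for the test timeout so that the example terminates.

```lua
-- Old behavior (sketch): compare without a guard, so a nil LSN raises at once.
local function strict_wait(read_lsn, target_lsn, max_polls)
    for _ = 1, max_polls do
        if not (read_lsn() < target_lsn) then
            return true
        end
    end
    return false
end

-- New behavior (sketch): treat nil as "not ready yet" and keep polling.
local function tolerant_wait(read_lsn, target_lsn, max_polls)
    for _ = 1, max_polls do
        local lsn = read_lsn()
        if lsn ~= nil and lsn >= target_lsn then
            return true
        end
    end
    return false
end

-- A component that never appears, as if replication never happened.
local function never_replicated()
    return nil
end

local strict_ok, strict_err = pcall(strict_wait, never_replicated, 1, 1000)
local tolerant_ok, tolerant_result = pcall(tolerant_wait, never_replicated, 1, 1000)

print(strict_ok, strict_err)          -- fails immediately with an error message
print(tolerant_ok, tolerant_result)   -- no error, but the whole budget is spent
```

With a component that never appears, the strict variant points at the failing comparison right away, while the tolerant variant only reports failure after its polling budget is exhausted, which in a real test run corresponds to waiting for the timeout.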

avtikhon added a commit that referenced this pull request Mar 5, 2021
Found issue in testing:

  [006] --- replication/election_qsync.result	Fri Oct 16 00:12:02 2020
  [006] +++ replication/election_qsync.reject	Sat Oct 17 19:08:11 2020
  [006] @@ -89,6 +89,8 @@
  [006]  -- Wait replication to the other instance.
  [006]  test_run:wait_lsn('default', 'replica')
  [006]   | ---
  [006] + | - error: '...sitories/tarantool/test/var/006_replication/test_run.lua:68: attempt
  [006] + |     to compare nil with number'
  [006]   | ...

Found that wait_lsn() routine from test_run.lua script failed on
comparing returned result from get_lsn() routine with real integer.
It happened because get_lsn() returned 'nil' on the freshly
bootstrapped instance. To avoid of it in wait_lsn() routine added
check that get_lsn() routine returned not 'nil' value.
Complete explanation of the issue and reproducer provided in [1].

This patch does not fix the found issue in tests, but changes the
way it works with the issue. Instead of breaking the test run, for
now it will wait for the lsn value different from nil in loop till
available timeout.

Needed for #226

Co-authored-by: Alexander Turenko <[email protected]>

[1]: #269 (comment)
@avtikhon avtikhon force-pushed the avtikhon/gh-226-wait_lsn branch from 5981144 to 2da38e7 Compare March 5, 2021 12:36
avtikhon added a commit that referenced this pull request Mar 8, 2021
avtikhon added a commit that referenced this pull request Mar 15, 2021
@avtikhon avtikhon force-pushed the avtikhon/gh-226-wait_lsn branch from 2da38e7 to 5a8ee8d Compare March 15, 2021 07:33
@avtikhon
Contributor Author

This patch does not fix the found issue in tests, but changes the
way it works with the issue. Instead of breaking the test run, for
now it will wait for the lsn value different from nil in loop till
available timeout.

@LeonidVas

> This patch does not fix the found issue in tests, but changes the
> way it works with the issue. Instead of breaking the test run, for
> now it will wait for the lsn value different from nil in loop till
> available timeout.

Yep, I wrote the same in the first paragraph.

@Totktonada
Member

> This patch does not fix the found issue in tests, but changes the
> way it works with the issue.

Since the change is described in context of the replication/election_qsync.test.lua problem and it does not fix the problem, I don't see any reason to push it.

avtikhon added a commit that referenced this pull request Mar 15, 2021
Found issue in testing:

  [006] --- replication/election_qsync.result	Fri Oct 16 00:12:02 2020
  [006] +++ replication/election_qsync.reject	Sat Oct 17 19:08:11 2020
  [006] @@ -89,6 +89,8 @@
  [006]  -- Wait replication to the other instance.
  [006]  test_run:wait_lsn('default', 'replica')
  [006]   | ---
  [006] + | - error: '...sitories/tarantool/test/var/006_replication/test_run.lua:68: attempt
  [006] + |     to compare nil with number'
  [006]   | ...

Found that wait_lsn() routine from test_run.lua script failed on
comparing returned result from get_lsn() routine with real integer.
It happened because get_lsn() returned 'nil' on the freshly
bootstrapped instance. It happened when replica vclock [1:x, 2:1],
while master vclock [1:x, 2:0], but in box.info.vclock[2] is nil
instead of 0. To avoid of it in wait_lsn() routine added check that
get_lsn() routine returned not 'nil' value. Complete explanation of
the issue and reproducer provided in [1].

This test showed the issue in test-run which could happen with any
test that uses test_run:wait_lsn() routine. This patch fixes this
routine to be able to work correctly with 'nil' value returned by
get_lsn() routine.

Needed for #226

Co-authored-by: Alexander Turenko <[email protected]>

[1]: #269 (comment)
@avtikhon avtikhon force-pushed the avtikhon/gh-226-wait_lsn branch from 5a8ee8d to 02ae582 Compare March 15, 2021 12:28
@avtikhon avtikhon requested a review from Totktonada March 15, 2021 12:29
@avtikhon
Contributor Author

> > This patch does not fix the found issue in tests, but changes the
> > way it works with the issue.
>
> Since the change is described in context of the replication/election_qsync.test.lua problem and it does not fix the problem, I don't see any reason to push it.

I discussed this issue once more with @sergepetrenko and found that this situation can happen
when the replica vclock is [1:x, 2:1] while the master vclock is [1:x, 2:0], but box.info.vclock[2] is nil instead of 0.

I also discussed this fix once more with @Totktonada, and we decided to correct the commit message
so that it states more clearly what this fix is really for.

@Totktonada
Member

I propose to treat #226 as the inability to wait until a vclock component rises above zero, and to close the issue with this PR. If there are other problems in the test, let's investigate them outside the scope of this issue.

@Totktonada
Member

Since @sergepetrenko confirmed that the described situation is possible in the test, I'm okay with wording that describes the zero vclock component problem and the particular test that was hit by it.

Member

@Totktonada Totktonada left a comment

LGTM (aside that I vote for closing the issue from this PR).

avtikhon added a commit that referenced this pull request Mar 17, 2021
Found issue in testing:

  [006] --- replication/election_qsync.result	Fri Oct 16 00:12:02 2020
  [006] +++ replication/election_qsync.reject	Sat Oct 17 19:08:11 2020
  [006] @@ -89,6 +89,8 @@
  [006]  -- Wait replication to the other instance.
  [006]  test_run:wait_lsn('default', 'replica')
  [006]   | ---
  [006] + | - error: '...sitories/tarantool/test/var/006_replication/test_run.lua:68: attempt
  [006] + |     to compare nil with number'
  [006]   | ...

Found that wait_lsn() routine from test_run.lua script failed on
comparing returned result from get_lsn() routine with real integer.
It happened because get_lsn() returned 'nil' on the freshly
bootstrapped instance. It happened when replica vclock [1:x, 2:1],
while master vclock [1:x, 2:0], but in box.info.vclock[2] is nil
instead of 0. To avoid of it in wait_lsn() routine added check that
get_lsn() routine returned not 'nil' value. Complete explanation of
the issue and reproducer provided in [1].

This test showed the issue in test-run which could happen with any
test that uses test_run:wait_lsn() routine. This patch fixes this
routine to be able to work correctly with 'nil' value returned by
get_lsn() routine.

Closes #226

Co-authored-by: Alexander Turenko <[email protected]>

[1]: #269 (comment)
@avtikhon avtikhon force-pushed the avtikhon/gh-226-wait_lsn branch from 02ae582 to 1c95830 Compare March 17, 2021 05:06
@avtikhon
Contributor Author

> LGTM (aside that I vote for closing the issue from this PR).

Agree to close the issue, corrected commit message.

@LeonidVas LeonidVas left a comment

LGTM.
I wrote optional fixes for the commit message in a private conversation.

Found issue in testing:

  [006] --- replication/election_qsync.result	Fri Oct 16 00:12:02 2020
  [006] +++ replication/election_qsync.reject	Sat Oct 17 19:08:11 2020
  [006] @@ -89,6 +89,8 @@
  [006]  -- Wait replication to the other instance.
  [006]  test_run:wait_lsn('default', 'replica')
  [006]   | ---
  [006] + | - error: '...sitories/tarantool/test/var/006_replication/test_run.lua:68: attempt
  [006] + |     to compare nil with number'
  [006]   | ...

Found that wait_lsn() routine from test_run.lua script failed on
comparing returned result from get_lsn() routine with real integer.
It happened because get_lsn() returned 'nil' on the freshly
bootstrapped instance. This situation happens after master's
re-election happened, when replica vclock [1:x, 2:1], while master
vclock [1:x, 2:0], but in box.info.vclock[2] is nil instead of 0.
To avoid of it in wait_lsn() routine added check that get_lsn()
routine returned not 'nil' value. Complete explanation of the issue
provided in [1].

This test showed the issue in test-run which could happen with any
test that uses test_run:wait_lsn() routine. This patch fixes this
routine to be able to work correctly with 'nil' value returned by
get_lsn() routine.

Closes #226

Co-authored-by: Alexander Turenko <[email protected]>

[1]: #269 (comment)
@avtikhon avtikhon force-pushed the avtikhon/gh-226-wait_lsn branch from 1c95830 to c9ad2d6 Compare March 18, 2021 08:15
@Totktonada Totktonada merged commit 5941741 into master Mar 18, 2021
@Totktonada Totktonada deleted the avtikhon/gh-226-wait_lsn branch March 18, 2021 12:53
@Totktonada
Member

Updated the test-run submodule in tarantool in 2.8.0-134-g81c663335, 2.7.1-123-ge3d1d9e7a, 2.6.2-120-g790b55a92, 1.10.9-78-g1a4e6d963.
