
Conversation

@p-mongo (Contributor) commented Nov 4, 2019

…tion errors

This is a follow-up PR to #665. This PR adds prose tests for clearing the connection pool when a monitoring connection experiences network errors.

'''''''''''''''''''''''''

Scenario: mock a non-timeout network error on a monitoring connection.
Alternatively, set a `failCommand fail point`_ on ``isMaster`` command
@prashantmital (Contributor) commented Nov 6, 2019


Additionally, should we also specify that the failPoint be set in "alwaysOn" mode? Using any other mode could produce non-deterministic results, since SDAM might eventually be able to reconnect to the server and add it back to the topology.

Also, if we are trying to emulate loss of connectivity to a server, should we also set failInternalCommands: true?

Member

I don't think we can use alwaysOn to fail isMaster because then the server would always be set to unknown and the driver would never be able to disable the failpoint.

I also don't think we should use failInternalCommands because that might cause unexpected errors or state changes on other members.

Member

Could one set a sufficiently high value for heartbeatFrequencyMS to avoid a race condition? This might be similar to the sessions spec tests using heartbeatFrequencyMS to avoid cluster time getting updated as the test is running.
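For illustration, heartbeatFrequencyMS is a standard URI option, so a test could raise it when constructing its client; the host and the 60-second value below are only placeholders, not part of the spec text:

    // Hypothetical mongosh sketch: the monitor waits 60 seconds between topology checks.
    const client = new Mongo("mongodb://localhost:27017/?heartbeatFrequencyMS=60000");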

Contributor Author

I think this requires changing the interval when the client is already instantiated, because the client must be operational prior to failing the ismaster. A higher heartbeat interval I think simply will make the test take longer.

The default of 10 seconds should be enough to connect to localhost in CI I would expect, thoughts?

Member

We've wasted time in the past by merging tests which were not possible to implement correctly (SPEC-1200). In order to avoid that, I think we should have someone POC this test (and run the test in Evergreen to be certain it works) before we merge this change. Does anyone want to try it?

Member

I'm having a hard time reading the commit for RUBY-1894, but is mongodb/mongo-ruby-driver@e14fddd#diff-129c36528d5fb0bd3e6019c87d8c4db1R174 the bit where these errors are being mocked? Should we be concerned that these tests fail intermittently on Evergreen and are wrapped in retry loops?

Contributor Author

Yes, these are the tests. These tests are configured with low timeouts (1 second), which works reliably on my machine but not in Evergreen. There are quite a few tests that fail intermittently in Evergreen and are retried in the Ruby driver. Until there is a way to run tests in Evergreen, or an identical environment with shell access, troubleshooting Evergreen-only failures is difficult.

Member

To address numerous outstanding questions in this thread:

should we also specify that the failPoint be set in "alwaysOn" mode

I believe Shane successfully explained why we should not use alwaysOn, nor should we use failInternalCommands.

A higher heartbeat interval I think simply will make the test take longer.

My understanding is that a heartbeat frequency takes effect after the initial discovery. So if the test constructs a MongoClient with a 10-second heartbeatFrequencyMS, no I/O will have been performed yet. The driver would then select a server to run configureFailPoint and set up an isMaster failure, and that server selection will initiate SDAM. Ten seconds later, we'd expect the next monitoring round to fail.
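A rough mongosh-style sketch of that sequence (the host, the values, and the closeConnection field are assumptions for illustration, not text from this PR):

    // 1. Construct the client; no monitoring I/O has happened yet.
    const client = new Mongo("mongodb://localhost:27017/?heartbeatFrequencyMS=10000");
    const admin = client.getDB("admin");

    // 2. Server selection for this command performs the initial discovery,
    //    and the command then arms the fail point against isMaster.
    admin.runCommand({
        configureFailPoint: "failCommand",
        mode: {times: 2},  // the "times" value is discussed later in the thread
        data: {failCommands: ["isMaster"], closeConnection: true}
    });

    // 3. Roughly heartbeatFrequencyMS later, the next monitoring round should
    //    hit the fail point and observe a non-timeout network error.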

Great, then can someone also POC the failCommand alternative?

If the Ruby tests are unreliable in Evergreen, I don't think we can consider the RUBY-1894 commit a viable POC for mocking.

I'm willing to POC the configureFailPoint alternative in PHP, which uses single-threaded SDAM; however, I'd need a day or two to do so. I would plan to use heartbeatFrequencyMS for that POC.

That said, I expect we cannot use failCommand to test the "network timeout error" case below.

Contributor Author

My understanding is that a heartbeat frequency takes effect after the initial discovery.

I think this is correct, making the fail point a viable strategy.

That said, I expect we cannot use failCommand to test the "network timeout error" case below.

Should I request this functionality in a server ticket?

Member

Sorry for losing track of this thread.

heartbeatFrequencyMS

Regarding my comments on tweaking heartbeatFrequencyMS, I was expecting that this test would construct its own MongoClient and that a reduced heartbeat frequency could allow the alternative test approach (i.e. isMaster configured to drop its connection) to run faster. I'm not sure if @p-mongo and I were on the same page when he wrote:

I think this requires changing the interval when the client is already instantiated, because the client must be operational prior to failing the ismaster.

Having said that, my idea about tweaking heartbeatFrequencyMS is just an optimization and something I'd propose adding as a note in this section. If so, we could also advise users to construct a fresh MongoClient just for this test. Beyond that advice, I don't have any objection to the current text as written and am happy to omit my suggestion entirely.

Fail point for network timeout error

Should I request this functionality (fail point for network timeout error) in a server ticket?

I think the PR is OK as written, asking drivers to mock that error, and I would rather not hold up this PR or SPEC-1396 on a request for new fail point behavior.

That said, if you think there's value in having a general mechanism for delaying commands via a fail point (noting that the $where/sleep() trick only works for queries), perhaps you can sync up with @ShaneHarvey to formulate that request. He may need something similar when the time comes to write integration tests for client-side operation timeouts. If you go this route, I'd suggest creating a new SPEC ticket for the follow-up work that would remove the need to mock a network timeout error.

@ShaneHarvey (Member) left a comment

LGTM


@p-mongo p-mongo requested a review from jmikola November 7, 2019 16:00


{ok: 0, errmsg: "not master"}

Network error on monitoring connection
Member

The SDAM test plan already had syntax for simulating a "network error":

If a response is the empty object {}, simulate a network error.

Is there any reason we need a new prose test for "Non-timeout network error" and cannot instead rely on a spec test? Ideally, we'd also have some way to denote a "network timeout error" via a spec test.
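For context, the spec-test syntax being referenced looks roughly like this; a sketch of the SDAM test file format written as a JS object literal, with the outcome assertions omitted:

    // One phase of an SDAM spec test: an empty response document for a server
    // address tells the test runner to simulate a (non-timeout) network error.
    const phase = {
        responses: [
            ["a:27017", {}]  // {} => simulate a network error
        ]
        // outcome: topology assertions omitted here
    };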

Member

@p-mongo: I'm not sure if this question was answered elsewhere in the PR, but can you follow up here? If the SDAM test plan already has syntax for simulating a non-timeout network error, can't we use that instead of introducing prose tests here?

I realize we might still need a prose test for the "network timeout error" case unless you'd rather add new spec test syntax for that.

Contributor Author

I believe the non-timeout test is now handled via the fail points and events. The timeout test currently specifies the use of mocking; should I create a server ticket requesting a timeout feature for fail points?

Member

I'm happy to rely on mocking, rather than make another request for server changes. Mocking also means the tests will conceivably run much quicker without having to wait for an actual timeout (and any flakiness that may entail).

@jmikola (Member) left a comment

Summarizing today's quick meeting between @mbroadst, @p-mongo, and myself: @mbroadst suggested revising this test plan to leverage SDAM and CMAP events. After (or alongside) the unified test runner project, we can consider how to translate that prose into spec tests that have assertions on SDAM and CMAP events.

Meanwhile, we should consider whether SPEC-1396 should be pushed out from the 4.4 fixVersion.

@p-mongo (Contributor Author) commented Mar 16, 2020

I revised the fail point instructions to make two changes that make the tests operate correctly:

  1. The fail point is set to trigger twice, to account for the monitor retrying failing isMaster calls.
  2. The fail point is required to use the isMaster spelling.

With these changes made in the Ruby driver, these tests no longer need to be retried (i.e., they pass on the first attempt every time).

@p-mongo p-mongo requested review from mbroadst and removed request for mbroadst March 16, 2020 19:20
@jmikola (Member) left a comment

Some small suggestions but the underlying content LGTM.


Subscribe to `TopologyDescriptionChanged SDAM event`_ on the MongoClient.

Subsribe to `PoolCleared CMAP event`_ on the MongoClient.
Member

Typo: "Subscribe"
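For the quoted subscription steps, a hedged Node.js-flavored sketch; the event names are assumed from that driver's API, and other drivers expose these events differently:

    // Watch for the topology change and for the pool being cleared
    // after the failed heartbeat.
    const { MongoClient } = require("mongodb");
    const client = new MongoClient("mongodb://localhost:27017");

    client.on("topologyDescriptionChanged", (event) => {
        console.log("topology changed:", event.newDescription);
    });
    client.on("connectionPoolCleared", (event) => {
        console.log("pool cleared for", event.address);
    });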

configureFailPoint: 'failCommand',
mode: {times: 2},
data: {
failCommands: %w(isMaster),
Member

Can we stick to JS and/or a shell example here? I'm not familiar with %w (literal array syntax?) but the more generic we can keep this the better.
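For what it's worth, a shell-syntax rendering of the quoted snippet might look like this; the quote cuts off after failCommands, so closeConnection: true is filled in only as an assumption for the non-timeout case:

    // The quoted example in mongosh/JS syntax (%w(isMaster) becomes a plain array).
    db.adminCommand({
        configureFailPoint: "failCommand",
        mode: {times: 2},
        data: {
            failCommands: ["isMaster"],
            closeConnection: true  // assumed completion; not shown in the quote
        }
    });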


- The fail point MUST be set to trigger twice because the server monitor is
required to retry failing ``isMaster`` calls.
- The "m" in ``isMaster`` MUST be capitalized.
Member

Could also rephrase as "the isMaster command name is case-sensitive."


Perform a server scan manually or wait for the driver to scan the server.

disable the fail point to avoid spurious failures in subsequent tests.
Member

Capitalization of "Disable" if this is a new sentence.
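For the quoted step about disabling the fail point, the usual shell form is along these lines (a minimal sketch):

    // Turn the fail point off so subsequent tests are not affected.
    db.adminCommand({configureFailPoint: "failCommand", mode: "off"});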


@p-mongo p-mongo closed this Jul 25, 2020