
Conversation

@p-mongo (Contributor) commented Nov 4, 2019

…tion errors

This is a follow-up PR to #665. This PR adds prose tests for clearing the connection pool when a monitoring connection experiences network errors.

'''''''''''''''''''''''''

Scenario: mock a non-timeout network error on a monitoring connection.
Alternatively, set a `failCommand fail point`_ on ``isMaster`` command
@prashantmital (Contributor) commented Nov 6, 2019


Additionally, should we also specify that the failPoint be set in "alwaysOn" mode? Using any other mode could produce non-deterministic results, since SDAM might eventually be able to reconnect to the server and add it back to the topology.

Also, if we are trying to emulate loss of connectivity to a server, should we also set failInternalCommands: true?

Member

I don't think we can use alwaysOn to fail isMaster because then the server would always be set to unknown and the driver would never be able to disable the failpoint.

I also don't think we should use failInternalCommands because that might cause unexpected errors or state changes on other members.

Member

Could one set a sufficiently high value for heartbeatFrequencyMS to avoid a race condition? This might be similar to the sessions spec tests using heartbeatFrequencyMS to avoid cluster time getting updated as the test is running.
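For illustration, heartbeatFrequencyMS is a standard URI option, so a test could raise it when constructing its client; the host and the 60-second value below are only placeholders, not part of the spec text:

    // Hypothetical mongosh sketch: the monitor waits 60 seconds between topology checks.
    const client = new Mongo("mongodb://localhost:27017/?heartbeatFrequencyMS=60000");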

Contributor Author

I think this requires changing the interval when the client is already instantiated, because the client must be operational prior to failing the ismaster. A higher heartbeat interval I think simply will make the test take longer.

The default of 10 seconds should be enough to connect to localhost in CI I would expect, thoughts?

Member

We've wasted time in the past by merging tests which were not possible to implement correctly (SPEC-1200). In order to avoid that, I think we should have someone POC this test (and run the test in Evergreen to be certain it works) before we merge this change. Does anyone want to try it?

Member

I'm having a hard time reading the commit for RUBY-1894, but is mongodb/mongo-ruby-driver@e14fddd#diff-129c36528d5fb0bd3e6019c87d8c4db1R174 the bit where these errors are being mocked? Should we be concerned that these tests fail intermittently on Evergreen and are wrapped in retry loops?

Contributor Author

Yes, these are the tests. These tests are configured with low timeouts (1 second), which works reliably on my machine but not in Evergreen. There are quite a few tests that fail intermittently in Evergreen and are retried in the Ruby driver. Until there is a way to run tests in Evergreen, or an identical environment with shell access, troubleshooting Evergreen-only failures is difficult.

Member

To address numerous outstanding questions in this thread:

should we also specify that the failPoint be set in "alwaysOn" mode

I believe Shane successfully explained why we should not use alwaysOn, nor should we use failInternalCommands.

A higher heartbeat interval I think simply will make the test take longer.

My understanding is that a heartbeat frequency takes effect after the initial discovery. So if the test constructs a MongoClient with a 10-second heartbeatFrequencyMS, no I/O will have been performed yet. The driver would then select a server to run configureFailPoint and set up an isMaster failure, and that server selection will initiate SDAM. Ten seconds later, we'd expect the next monitoring round to fail.
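A rough mongosh-style sketch of that sequence (the host, the values, and the closeConnection field are assumptions for illustration, not text from this PR):

    // 1. Construct the client; no monitoring I/O has happened yet.
    const client = new Mongo("mongodb://localhost:27017/?heartbeatFrequencyMS=10000");
    const admin = client.getDB("admin");

    // 2. Server selection for this command performs the initial discovery,
    //    and the command then arms the fail point against isMaster.
    admin.runCommand({
        configureFailPoint: "failCommand",
        mode: {times: 2},  // the "times" value is discussed later in the thread
        data: {failCommands: ["isMaster"], closeConnection: true}
    });

    // 3. Roughly heartbeatFrequencyMS later, the next monitoring round should
    //    hit the fail point and observe a non-timeout network error.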

Great, then can someone also POC the failCommand alternative?

If the Ruby tests are unreliable in Evergreen, I don't think we can consider the RUBY-1894 commit a viable POC for mocking.

I'm willing to POC the configureFailPoint alternative in PHP, which uses single-threaded SDAM; however, I'd need a day or two to do so. I would plan to use heartbeatFrequencyMS for that POC.

That said, I expect we cannot use failCommand to test the "network timeout error" case below.

Contributor Author

My understanding is that a heartbeat frequency takes effect after the initial discovery.

I think this is correct, making the fail point a viable strategy.

That said, I expect we cannot use failCommand to test the "network timeout error" case below.

Should I request this functionality in a server ticket?

Member

Sorry for losing track of this thread.

heartbeatFrequencyMS

Regarding my comments on tweaking heartbeatFrequencyMS, I was expecting that this test would construct its own MongoClient and that a reduced heartbeat frequency could allow the alternative test approach (i.e. isMaster configured to drop its connection) to run faster. I'm not sure if @p-mongo and I were on the same page when he wrote:

I think this requires changing the interval when the client is already instantiated, because the client must be operational prior to failing the ismaster.

Having said that, my idea about tweaking heartbeatFrequencyMS is just an optimization and something I'd propose adding as a note in this section. If so, we could also advise users to construct a fresh MongoClient just for this test. Beyond that advice, I don't have any objection to the current text as written and am happy to omit my suggestion entirely.

Fail point for network timeout error

Should I request this functionality (fail point for network timeout error) in a server ticket?

I think the PR is OK as written, asking drivers to mock that error, and I would rather not hold up this PR or SPEC-1396 on a request for new fail point behavior.

That said, if you think there's value in having a general mechanism for delaying commands via a fail point (noting that the $where/sleep() trick only works for queries), perhaps you can sync up with @ShaneHarvey to formulate that request. He may need something similar when the time comes to write integration tests for client-side operation timeouts. If you go this route, I'd suggest creating a new SPEC ticket for the follow-up work that would remove the need to mock a network timeout error.

@ShaneHarvey (Member) left a comment

LGTM


@p-mongo p-mongo requested a review from jmikola November 7, 2019 16:00


{ok: 0, errmsg: "not master"}

Network error on monitoring connection
Member

The SDAM test plan already had syntax for simulating a "network error":

If a response is the empty object {}, simulate a network error.

Is there any reason we need a new prose test for "Non-timeout network error" and cannot instead rely on a spec test? Ideally, we'd also have some way to denote a "network timeout error" via a spec test.
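For context, the spec-test syntax being referenced looks roughly like this; a sketch of the SDAM test file format written as a JS object literal, with the outcome assertions omitted:

    // One phase of an SDAM spec test: an empty response document for a server
    // address tells the test runner to simulate a (non-timeout) network error.
    const phase = {
        responses: [
            ["a:27017", {}]  // {} => simulate a network error
        ]
        // outcome: topology assertions omitted here
    };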

Member

@p-mongo: I'm not sure if this question was answered elsewhere in the PR, but can you follow up here? If the SDAM test plan already has syntax for simulating a non-timeout network error, can't we use that instead of introducing prose tests here?

I realize we might still need a prose test for the "network timeout error" case unless you'd rather add new spec test syntax for that.

Contributor Author

I believe the non-timeout test is now handled via the fail points and events. The timeout test currently specifies the use of mocking; should I create a server ticket requesting a timeout feature for fail points?

Member

I'm happy to rely on mocking, rather than make another request for server changes. Mocking also means the tests will conceivably run much quicker without having to wait for an actual timeout (and any flakiness that may entail).

@jmikola (Member) left a comment

Summarizing today's quick meeting between @mbroadst, @p-mongo, and myself: @mbroadst suggested revising this test plan to leverage SDAM and CMAP events. After (or alongside) the unified test runner project, we can consider how to translate that prose into spec tests that have assertions on SDAM and CMAP events.

Meanwhile, we should consider whether SPEC-1396 should be pushed out from the 4.4 fixVersion.

@p-mongo (Contributor Author) commented Mar 16, 2020

I revised the fail point instructions to make two changes that make the tests operate correctly:

  1. The fail point is set to trigger twice, to account for the monitor retrying failing isMaster calls.
  2. The fail point is required to use the isMaster spelling.

With these changes made in the Ruby driver, these tests no longer need to be retried (i.e., they pass on the first attempt every time).

@p-mongo p-mongo requested review from mbroadst and removed request for mbroadst March 16, 2020 19:20
@jmikola (Member) left a comment

Some small suggestions but the underlying content LGTM.


Subscribe to `TopologyDescriptionChanged SDAM event`_ on the MongoClient.

Subsribe to `PoolCleared CMAP event`_ on the MongoClient.
Member

Typo: "Subscribe"
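For the quoted subscription steps, a hedged Node.js-flavored sketch; the event names are assumed from that driver's API, and other drivers expose these events differently:

    // Watch for the topology change and for the pool being cleared
    // after the failed heartbeat.
    const { MongoClient } = require("mongodb");
    const client = new MongoClient("mongodb://localhost:27017");

    client.on("topologyDescriptionChanged", (event) => {
        console.log("topology changed:", event.newDescription);
    });
    client.on("connectionPoolCleared", (event) => {
        console.log("pool cleared for", event.address);
    });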

configureFailPoint: 'failCommand',
mode: {times: 2},
data: {
failCommands: %w(isMaster),
Member

Can we stick to JS and/or a shell example here? I'm not familiar with %w (literal array syntax?) but the more generic we can keep this the better.
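For what it's worth, a shell-syntax rendering of the quoted snippet might look like this; the quote cuts off after failCommands, so closeConnection: true is filled in only as an assumption for the non-timeout case:

    // The quoted example in mongosh/JS syntax (%w(isMaster) becomes a plain array).
    db.adminCommand({
        configureFailPoint: "failCommand",
        mode: {times: 2},
        data: {
            failCommands: ["isMaster"],
            closeConnection: true  // assumed completion; not shown in the quote
        }
    });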


- The fail point MUST be set to trigger twice because the server monitor is
required to retry failing ``isMaster`` calls.
- The "m" in ``isMaster`` MUST be capitalized.
Member

Could also rephrase as "the isMaster command name is case-sensitive."


Perform a server scan manually or wait for the driver to scan the server.

disable the fail point to avoid spurious failures in subsequent tests.
Member

Capitalization of "Disable" if this is a new sentence.
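For the quoted step about disabling the fail point, the usual shell form is along these lines (a minimal sketch):

    // Turn the fail point off so subsequent tests are not affected.
    db.adminCommand({configureFailPoint: "failCommand", mode: "off"});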


@p-mongo p-mongo closed this Jul 25, 2020