SPEC-1396 test plan for connection pool clearing on monitoring connection errors #679

This is a follow-up PR to #665. It adds prose tests for clearing the connection pool when a monitoring connection experiences network errors.
Conversation
> '''''''''''''''''''''''''
>
> Scenario: mock a non-timeout network error on a monitoring connection.
> Alternatively, set a `failCommand fail point`_ on ``isMaster`` command
Additionally, should we also specify that the fail point be set in `alwaysOn` mode? Using any other mode could result in non-deterministic behavior, since SDAM might eventually be able to reconnect to the server and add it back to the topology.

Also, if we are trying to emulate loss of connectivity to a server, should we also set `failInternalCommands: true`?
I don't think we can use `alwaysOn` to fail `isMaster`, because then the server would always be set to Unknown and the driver would never be able to disable the fail point.

I also don't think we should use `failInternalCommands`, because that might cause unexpected errors or state changes on other members.
Could one set a sufficiently high value for `heartbeatFrequencyMS` to avoid a race condition? This might be similar to the sessions spec tests using `heartbeatFrequencyMS` to avoid the cluster time getting updated as the test is running.
I think this requires changing the interval when the client is already instantiated, because the client must be operational prior to failing the isMaster. I think a higher heartbeat interval will simply make the test take longer.

The default of 10 seconds should be enough to connect to localhost in CI, I would expect. Thoughts?
We've wasted time in the past by merging tests which were not possible to implement correctly (SPEC-1200). In order to avoid that, I think we should have someone POC this test (and run the test in Evergreen to be certain it works) before we merge this change. Does anyone want to try it?
I'm having a hard time reading the commit for RUBY-1894, but is mongodb/mongo-ruby-driver@e14fddd#diff-129c36528d5fb0bd3e6019c87d8c4db1R174 the bit where these errors are being mocked? Should we be concerned that these tests fail intermittently on Evergreen and are wrapped in retry loops?
Yes, these are the tests. They are configured with low timeouts (1 second), which works reliably on my machine but not in Evergreen. There are quite a few tests that fail intermittently in Evergreen and are retried in the Ruby driver. Until there is a way to run tests in Evergreen, or in an identical environment with shell access, troubleshooting Evergreen-only failures is difficult.
To address numerous outstanding questions in this thread:

> should we also specify that the failPoint be set in "alwaysOn" mode

I believe Shane successfully explained why we should not use `alwaysOn`, nor should we use `failInternalCommands`.

> I think a higher heartbeat interval will simply make the test take longer.

My understanding is that the heartbeat frequency takes effect after the initial discovery. So if the test constructs a MongoClient with a 10-second `heartbeatFrequencyMS`, no I/O will have been performed yet. The driver would then select a server to run `configureFailPoint` and set up an `isMaster` failure, and that server selection will initiate SDAM. Ten seconds later, we'd expect the next monitoring round to fail.
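A minimal sketch of that flow, assuming the Node.js driver (the `heartbeatFrequencyMS` URI option and the `failCommand` fail point are standard; the surrounding harness is hypothetical):

```js
const { MongoClient } = require('mongodb');

async function main() {
  // Constructing the client performs no I/O yet.
  const client = new MongoClient(
    'mongodb://localhost:27017/?heartbeatFrequencyMS=10000'
  );
  await client.connect();

  // Running this command selects a server (initiating SDAM) and arms the
  // fail point: the next two isMaster checks will have their connections
  // closed. Two, because the monitor retries a failed check.
  await client.db('admin').command({
    configureFailPoint: 'failCommand',
    mode: { times: 2 },
    data: { failCommands: ['isMaster'], closeConnection: true }
  });

  // Roughly heartbeatFrequencyMS later, the next monitoring round should
  // fail, marking the server Unknown and clearing its connection pool.
}

main().catch(console.error);
```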
> Great, then can someone also POC the failCommand alternative?

If the Ruby tests are unreliable in Evergreen, I don't think we can consider the RUBY-1894 commit a viable POC for mocking.

I'm willing to POC the `configureFailPoint` alternative in PHP, which uses single-threaded SDAM; however, I'd need a day or two to do so. I would plan to use `heartbeatFrequencyMS` for that POC.

That said, I expect we cannot use `failCommand` to test the "network timeout error" case below.
> My understanding is that the heartbeat frequency takes effect after the initial discovery.

I think this is correct, making the fail point a viable strategy.

> That said, I expect we cannot use `failCommand` to test the "network timeout error" case below.

Should I request this functionality in a server ticket?
Sorry for losing track of this thread.

heartbeatFrequencyMS

Regarding my comments on tweaking `heartbeatFrequencyMS`: I was expecting that this test would construct its own MongoClient, and that a reduced heartbeat frequency could allow the alternative test approach (i.e. `isMaster` configured to drop its connection) to run faster. I'm not sure @p-mongo and I were on the same page when he wrote:

> I think this requires changing the interval when the client is already instantiated, because the client must be operational prior to failing the isMaster.

Having said that, my idea about tweaking `heartbeatFrequencyMS` is just an optimization and something I'd propose adding as a note in this section. If so, we could also advise users to construct a fresh MongoClient just for this test. Beyond that advice, I don't have any objection to the current text as written and am happy to omit my suggestion entirely.

Fail point for network timeout error

> Should I request this functionality (fail point for network timeout error) in a server ticket?

I think the PR is OK as written, asking a driver to mock that error, and I would rather not hold up this PR or SPEC-1396 on a request for new fail point behavior.

That said, if you think there's value in having a general mechanism for delaying commands via a fail point (noting that the `$where`/`sleep()` trick only works for queries), perhaps you can sync up with @ShaneHarvey to formulate that request. He may need something similar when the time comes to write integration tests for client-side operation timeouts. If you go this route, I'd suggest creating a new SPEC ticket for the follow-up work that would remove the need to mock a network timeout error.
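(As an aside, the `$where`/`sleep()` trick mentioned above delays a query by evaluating server-side JavaScript per document; a shell sketch against a hypothetical collection:)

```js
// Each document evaluated against this predicate sleeps server-side for
// one second; this only works for queries, not for arbitrary commands.
db.test.find({ $where: 'sleep(1000) || true' }).itcount();
```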
LGTM
> {ok: 0, errmsg: "not master"}
>
> Network error on monitoring connection
The SDAM test plan already had syntax for simulating a "network error":

> If a response is the empty object {}, simulate a network error.

Is there any reason we need a new prose test for "Non-timeout network error" and cannot instead rely on a spec test? Ideally, we'd also have some way to denote a "network timeout error" via a spec test.
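(For context, a sketch of that syntax in the SDAM monitoring test format; the server address and outcome are illustrative:)

```json
{
  "phases": [
    {
      "responses": [
        ["a:27017", {}]
      ],
      "outcome": {
        "servers": {
          "a:27017": { "type": "Unknown", "setName": null }
        }
      }
    }
  ]
}
```

Here the empty `{}` response simulates a network error during monitoring, and the outcome asserts the server transitions to Unknown.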
@p-mongo: I'm not sure if this question was answered elsewhere in the PR, but can you follow up here? If the SDAM test plan already has syntax for simulating a non-timeout network error, can't we use that instead of introducing prose tests here?
I realize we might still need a prose test for the "network timeout error" case unless you'd rather add new spec test syntax for that.
I believe the non-timeout test is now handled via the fail points and events. The timeout test currently specifies the use of mocking; should I create a server ticket requesting a timeout feature for fail points?
I'm happy to rely on mocking, rather than make another request for server changes. Mocking also means the tests will conceivably run much quicker without having to wait for an actual timeout (and any flakiness that may entail).
Summarizing today's quick meeting between @mbroadst, @p-mongo, and myself: @mbroadst suggested revising this test plan to leverage SDAM and CMAP events. After (or alongside) the unified test runner project, we can consider how to translate that prose into spec tests that have assertions on SDAM and CMAP events.
Meanwhile, we should consider whether SPEC-1396 should be pushed out from the 4.4 fixVersion.
I revised the fail point instructions to make two changes that allow the tests to operate correctly:

With these changes made in the Ruby driver, it is no longer necessary to retry these tests (i.e. they pass on the first attempt every time).
Some small suggestions but the underlying content LGTM.
> Subscribe to `TopologyDescriptionChanged SDAM event`_ on the MongoClient.
>
> Subsribe to `PoolCleared CMAP event`_ on the MongoClient.
Typo: "Subscribe"
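(For illustration, a sketch of subscribing to both events with the Node.js driver, whose SDAM/CMAP monitoring emits them as `topologyDescriptionChanged` and `connectionPoolCleared`:)

```js
// Collect events so the test can assert on them later.
const sdamEvents = [];
const cmapEvents = [];
client.on('topologyDescriptionChanged', (event) => sdamEvents.push(event));
client.on('connectionPoolCleared', (event) => cmapEvents.push(event));
```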
> configureFailPoint: 'failCommand',
> mode: {times: 2},
> data: {
>   failCommands: %w(isMaster),
Can we stick to JS and/or a shell example here? I'm not familiar with `%w` (literal array syntax?), but the more generic we can keep this, the better.
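(`%w(isMaster)` is Ruby's word-array literal, equivalent to `['isMaster']`. A shell/JS rendering of the quoted fragment might look like the following, with the remainder of the `data` document elided as in the original:)

```js
db.adminCommand({
  configureFailPoint: 'failCommand',
  mode: { times: 2 },
  data: {
    failCommands: ['isMaster'],
    // ...
  }
});
```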
> - The fail point MUST be set to trigger twice because the server monitor is
>   required to retry failing ``isMaster`` calls.
> - The "m" in ``isMaster`` MUST be capitalized.
Could also rephrase as "the `isMaster` command name is case-sensitive."
> Perform a server scan manually or wait for the driver to scan the server.
>
> disable the fail point to avoid spurious failures in subsequent tests.
Capitalization of "Disable" if this is a new sentence.
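(Disabling a fail point is done by setting its mode to "off"; a shell sketch:)

```js
// Turn the fail point off so it cannot affect subsequent tests.
db.adminCommand({ configureFailPoint: 'failCommand', mode: 'off' });
```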
I think this PR can be closed. In DRIVERS-640 we added SDAM tests which assert that the pool is cleared when a check fails with a network error.