Add shard-aware reconnection policies with support for scheduling constraints #473

Status: Open. Wants to merge 2 commits into base branch master.

@dkropachev
Collaborator

dkropachev commented May 30, 2025

Introduce ShardReconnectionPolicy and its implementations:

  • NoDelayShardReconnectionPolicy: avoids reconnection delay and ensures at most one reconnection per host+shard.
  • NoConcurrentShardReconnectionPolicy: limits concurrent reconnections to 1 per scope (Cluster or Host) using a backoff policy.

This feature enables finer control over shard reconnection behavior, helping prevent reconnection storms.
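
To make the "one reconnection at a time per scope" idea concrete, here is a minimal, self-contained sketch. It does not use the driver's classes: `Scope`, `exponential_delays` and `ToySerializedReconnector` are hypothetical names invented for this illustration, and the draining is done synchronously on the caller's thread, whereas a real implementation would presumably run it in the background.

```python
import time
from collections import OrderedDict, defaultdict
from enum import Enum
from typing import Iterator


class Scope(Enum):
    """Hypothetical scope selector: one queue per cluster or one per host."""
    CLUSTER = "cluster"
    HOST = "host"


def exponential_delays(base: float = 0.1, cap: float = 5.0) -> Iterator[float]:
    """Hypothetical backoff: yields base, 2*base, 4*base, ... capped at `cap`."""
    delay = base
    while True:
        yield min(delay, cap)
        delay *= 2


class ToySerializedReconnector:
    """Illustration only, NOT the driver's implementation."""

    def __init__(self, scope: Scope, delays=exponential_delays):
        self._scope = scope
        self._delays = delays
        # scope key -> ordered map of (host, shard) -> reconnect callable
        self._queues = defaultdict(OrderedDict)

    def _scope_key(self, host):
        return host if self._scope is Scope.HOST else "cluster"

    def schedule(self, host, shard, reconnect) -> bool:
        """Queue a reconnection; refuse duplicates for the same host+shard."""
        queue = self._queues[self._scope_key(host)]
        if (host, shard) in queue:
            return False
        queue[(host, shard)] = reconnect
        return True

    def drain(self, sleep=time.sleep) -> None:
        """Run queued reconnections one at a time per scope, with a backoff
        delay between consecutive reconnections in the same scope."""
        for queue in self._queues.values():
            delays = self._delays()
            while queue:
                _, reconnect = queue.popitem(last=False)
                reconnect()
                if queue:
                    sleep(next(delays))


# Usage: record the calls and the requested sleeps instead of really sleeping.
calls, slept = [], []
reconnector = ToySerializedReconnector(Scope.HOST)
for host, shard in [("h1", 0), ("h1", 0), ("h1", 1), ("h2", 0)]:
    reconnector.schedule(host, shard, lambda h=host, s=shard: calls.append((h, s)))
reconnector.drain(sleep=slept.append)
```

With `Scope.HOST`, the duplicate request for `("h1", 0)` is dropped and only the second reconnection on `h1` waits for a backoff delay. With `Scope.CLUSTER`, all three reconnections would share one queue and be separated by two delays.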

Fixes: #483

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks and pass test.
  • PR description sums up the changes and reasons why they should be introduced.
  • I have provided docstrings for the public items that I want to introduce.
  • I have adjusted the documentation in ./docs/source/.
  • I added appropriate Fixes: annotations to PR description.

@dkropachev dkropachev force-pushed the dk/add-connection-pool-delay branch 4 times, most recently from 0b80886 to f62dfa3 Compare June 3, 2025 03:42
@dkropachev dkropachev changed the title from "1" to "Add shard-aware reconnection policies with support for scheduling constraints" Jun 3, 2025
@dkropachev dkropachev requested a review from Lorak-mmk June 3, 2025 03:45
@dkropachev dkropachev marked this pull request as ready for review June 3, 2025 03:45
@dkropachev dkropachev force-pushed the dk/add-connection-pool-delay branch 2 times, most recently from dbb3ad1 to cbb4719 Compare June 4, 2025 17:53
@mykaul

mykaul commented Jun 5, 2025

Shouldn't we have some warning / info level log when backoff is taking place?

@dkropachev
Collaborator Author

dkropachev commented Jun 5, 2025

> Shouldn't we have some warning / info level log when backoff is taking place?

I would rather not: it is not useful and could pollute the log.

@Lorak-mmk

Do you know what caused the test failure?

  =================================== FAILURES ===================================
  ___________________________ TypeTests.test_datetype ____________________________
  
  self = <tests.unit.test_types.TypeTests testMethod=test_datetype>
  
      def test_datetype(self):
          now_time_seconds = time.time()
          now_datetime = datetime.datetime.fromtimestamp(now_time_seconds, tz=datetime.timezone.utc)
      
          # Cassandra timestamps in millis
          now_timestamp = now_time_seconds * 1e3
      
          # same results serialized
  >       self.assertEqual(DateType.serialize(now_datetime, 0), DateType.serialize(now_timestamp, 0))
  E       AssertionError: b'\x00\x00\x01\x97<\x17\xda\xf9' != b'\x00\x00\x01\x97<\x17\xda\xf8'

It is a unit test that at first glance should be fully deterministic, so the failure is unexpected.
From the assertion it looks like an off-by-one error.
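
For reference, the two payloads in the assertion can be decoded directly: Cassandra timestamps are serialized as 8-byte big-endian signed integers holding milliseconds since the Unix epoch. The sketch below decodes both values and then shows one possible way a one-millisecond gap can arise, namely that `datetime.fromtimestamp` rounds to the nearest microsecond while `int(seconds * 1e3)` truncates. This mechanism is an assumption for illustration and has not been confirmed as the cause in the driver; `millis_via_datetime`, `millis_via_float` and the `sample` value are hypothetical stand-ins, not driver code.

```python
import calendar
import datetime

# Payloads copied from the failing assertion (first argument came from the
# datetime, second from the float timestamp in milliseconds).
from_datetime = int.from_bytes(b"\x00\x00\x01\x97<\x17\xda\xf9", "big", signed=True)
from_float = int.from_bytes(b"\x00\x00\x01\x97<\x17\xda\xf8", "big", signed=True)
payload_gap_ms = from_datetime - from_float  # the two payloads are 1 ms apart


def millis_via_datetime(seconds: float) -> int:
    """Hypothetical path: go through a datetime, which rounds the fractional
    part to the nearest microsecond, then cut down to whole milliseconds."""
    dt = datetime.datetime.fromtimestamp(seconds, tz=datetime.timezone.utc)
    return calendar.timegm(dt.utctimetuple()) * 1000 + dt.microsecond // 1000


def millis_via_float(seconds: float) -> int:
    """Hypothetical path: scale the float to milliseconds and truncate."""
    return int(seconds * 1e3)


# A timestamp whose fraction (999.6 microseconds) sits just below a
# millisecond boundary: rounding to microseconds pushes it across the
# boundary, truncation does not.
sample = 1_000_000.0009996
gap = millis_via_datetime(sample) - millis_via_float(sample)
print(payload_gap_ms, gap)
```

If something like this is happening, the test would only fail when `time.time()` happens to land within a microsecond of a millisecond boundary, which would match an intermittent failure.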

@dkropachev
Collaborator Author

> Do you know what caused the test failure?

It is a known issue; the conversion goes wrong somewhere.

@dkropachev dkropachev force-pushed the dk/add-connection-pool-delay branch 4 times, most recently from a43ccd1 to b0fd069 Compare June 7, 2025 04:47
@dkropachev dkropachev requested a review from Lorak-mmk June 7, 2025 04:48
Add abstract classes `ShardReconnectionPolicy` and `ShardReconnectionScheduler`,
and two implementations:
`NoDelayShardReconnectionPolicy` - policy that preserves the old behavior
of no delay and no concurrency restriction.
`NoConcurrentShardReconnectionPolicy` - policy that limits concurrent
reconnections to 1 per scope and introduces a delay between reconnections
within the scope.
Inject the shard reconnection policy into the cluster, session, connection and
host pool.
Drop the pending-connections tracking logic, since the policy now handles that.
Fix some tests that mock the Cluster, session, connection or host pool.
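
The policy/scheduler split described in the commit message could look roughly like the sketch below. It is not taken from the patch: the `*Sketch` class names, the `new_scheduler` and `schedule` method names, and their signatures are all assumptions made for illustration. The no-delay variant simply hands each reconnection to an executor right away, which is how "no delay and no concurrency restriction" is read here.

```python
from abc import ABC, abstractmethod
from concurrent.futures import Executor, Future, ThreadPoolExecutor


class ShardReconnectionSchedulerSketch(ABC):
    """Hypothetical scheduler interface: decides when a reconnection runs."""

    @abstractmethod
    def schedule(self, host_id, shard_id, method, *args, **kwargs) -> Future:
        ...


class ShardReconnectionPolicySketch(ABC):
    """Hypothetical policy interface: a factory for schedulers, so that the
    policy object itself can be shared while each owner gets its own state."""

    @abstractmethod
    def new_scheduler(self, executor: Executor) -> ShardReconnectionSchedulerSketch:
        ...


class NoDelaySchedulerSketch(ShardReconnectionSchedulerSketch):
    """Runs every reconnection immediately, with no queueing or backoff."""

    def __init__(self, executor: Executor):
        self._executor = executor

    def schedule(self, host_id, shard_id, method, *args, **kwargs) -> Future:
        # host_id and shard_id are unused here; a limiting scheduler would
        # use them to pick a queue and to drop duplicate requests.
        return self._executor.submit(method, *args, **kwargs)


class NoDelayPolicySketch(ShardReconnectionPolicySketch):
    def new_scheduler(self, executor: Executor) -> ShardReconnectionSchedulerSketch:
        return NoDelaySchedulerSketch(executor)


# Usage: the owner of the executor asks the policy for a scheduler once and
# then routes every shard reconnection through it.
with ThreadPoolExecutor(max_workers=2) as pool:
    scheduler = NoDelayPolicySketch().new_scheduler(pool)
    future = scheduler.schedule("host-1", 0, lambda: "reconnected")
    print(future.result())
```

Keeping the scheduling decision behind one interface is what lets the pool drop its own pending-connection bookkeeping: the pool only asks for a reconnection, and the scheduler decides whether and when it runs.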
@dkropachev dkropachev force-pushed the dk/add-connection-pool-delay branch from b0fd069 to f47313f Compare June 7, 2025 04:54

Successfully merging this pull request may close these issues.

Delay for per-shard reconnection