Add shard connection backoff policy #473

dkropachev · 2025-05-30T08:03:57Z

Introduce ShardReconnectionPolicy and its implementations:

NoDelayShardConnectionBackoffPolicy: no delay or concurrency limit, ensures at most one pending connection per host+shard.
LimitedConcurrencyShardConnectionBackoffPolicy: limits pending concurrent connections to max_concurrent per scope (Cluster or Host) using a backoff policy.

The idea of this PR is to shift the responsibility for scheduling HostConnection._open_connection_to_missing_shard from HostConnection to ShardConnectionBackoffPolicy, which gives ShardConnectionBackoffPolicy control over the process of opening connections.

This feature enables finer control over the process of creating per-shard connections, helping to prevent connection storms.

Fixes: #483

Solutions tested and rejected

Naive delay

Description

The policy would introduce a delay instead of executing the connection creation request right away.
The policy would remember the last time a connection creation was scheduled, and when scheduling the next request it would make sure that the time between the old and new request executions is equal to or greater than the configured delay.

Results

It worked fine when the cluster operated normally.

However, during testing with artificial delays, it became clear that this approach breaks down when the time to establish a connection exceeds the configured delay.
In such cases, connections begin to pile up: the greater the connection initialization time relative to the delay, the faster they accumulate.

This becomes especially problematic during connection storms.
As the cluster becomes overloaded and connection initialization slows down, the delay-based throttling loses its effectiveness. In other words, the more the cluster suffers, the less effective the policy becomes.
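
The pile-up can be illustrated with a small deterministic simulation. This is a sketch, not code from the PR: the function name `peak_pending` and all numbers are made up for illustration. It starts one connection attempt every `delay` seconds and reports the peak number of attempts that are in flight at the same time.

```python
def peak_pending(delay: float, connect_time: float, attempts: int) -> int:
    """Peak number of simultaneously pending attempts under a naive fixed delay.

    Attempt i starts at i * delay and finishes at i * delay + connect_time.
    """
    starts = [i * delay for i in range(attempts)]
    peak = 0
    for t in starts:
        # An attempt started at s is still pending at time t if s <= t < s + connect_time.
        pending = sum(1 for s in starts if s <= t < s + connect_time)
        peak = max(peak, pending)
    return peak


# Healthy cluster: connecting (0.5 s) is faster than the delay (1 s).
print(peak_pending(delay=1.0, connect_time=0.5, attempts=20))  # 1
# Overloaded cluster: connecting (5 s) is much slower than the delay (1 s).
print(peak_pending(delay=1.0, connect_time=5.0, attempts=20))  # 5
```

In this simplified model the number of pending attempts grows with the ratio of connection time to delay, so the delay alone puts no upper bound on it.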

Solution

The solution was to give the policy direct control over the connection initialization process.
This allows the policy to track how many connections are currently pending and apply delays after connections are created, rather than before.
That change ensures the policy remains effective even under heavy load.

This behavior is exactly what has been implemented in this PR.
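
A minimal sketch of the pending-count idea follows. It is not the implementation from this PR: `PendingLimiter`, `submit` and `finished` are hypothetical names, and backoff delays, threading and locking are left out to keep the core bookkeeping visible.

```python
from collections import deque
from typing import Callable, Deque, List


class PendingLimiter:
    """Hypothetical helper: allows at most `max_concurrent` pending tasks."""

    def __init__(self, max_concurrent: int) -> None:
        self.max_concurrent = max_concurrent
        self.pending = 0
        self.waiting: Deque[Callable[[], None]] = deque()

    def submit(self, start: Callable[[], None]) -> None:
        """Start the task now if there is capacity, otherwise queue it."""
        if self.pending < self.max_concurrent:
            self.pending += 1
            start()
        else:
            self.waiting.append(start)

    def finished(self) -> None:
        """Called when a pending task completes; starts the next queued one."""
        self.pending -= 1
        if self.waiting:
            self.pending += 1
            self.waiting.popleft()()


started: List[int] = []
limiter = PendingLimiter(max_concurrent=2)
for shard_id in range(5):
    limiter.submit(lambda s=shard_id: started.append(s))

print(started, limiter.pending)  # only the first two attempts have started
limiter.finished()               # one attempt completes, the next one starts
print(started, limiter.pending)
```

Because a new attempt starts only when an earlier one reports completion, the number of pending attempts stays bounded no matter how slow each connection is.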

Pre-review checklist

I have split my patch into logically separate commits.
All commit messages clearly explain what they change and why.
I added relevant tests for new features and bug fixes.
All commits compile, pass static checks and pass test.
PR description sums up the changes and reasons why they should be introduced.
I have provided docstrings for the public items that I want to introduce.
I have adjusted the documentation in ./docs/source/.
I added appropriate Fixes: annotations to PR description.

mykaul · 2025-06-05T06:57:15Z

Shouldn't we have some warning / info level log when backoff is taking place?

dkropachev · 2025-06-05T10:26:00Z

Shouldn't we have some warning / info level log when backoff is taking place?

I would rather not do it, it is not useful and can potentially pollute the log

Lorak-mmk · 2025-06-06T10:41:09Z

Do you know what caused the test failure?

  =================================== FAILURES ===================================
  ___________________________ TypeTests.test_datetype ____________________________
  
  self = <tests.unit.test_types.TypeTests testMethod=test_datetype>
  
      def test_datetype(self):
          now_time_seconds = time.time()
          now_datetime = datetime.datetime.fromtimestamp(now_time_seconds, tz=datetime.timezone.utc)
      
          # Cassandra timestamps in millis
          now_timestamp = now_time_seconds * 1e3
      
          # same results serialized
  >       self.assertEqual(DateType.serialize(now_datetime, 0), DateType.serialize(now_timestamp, 0))
  E       AssertionError: b'\x00\x00\x01\x97<\x17\xda\xf9' != b'\x00\x00\x01\x97<\x17\xda\xf8'

It is a unit test that at first glance should be fully deterministic, so the failure is unexpected.
From the assertion it looks like an off-by-one error.

dkropachev · 2025-06-06T10:44:03Z

It is a known issue; the conversion goes wrong somewhere.
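
The thread does not identify the cause. One way two millisecond conversions of the same instant can disagree by one is rounding versus truncation. The sketch below shows that effect using only the standard library; it is an assumption about the kind of discrepancy involved and makes no claim about what `DateType.serialize` does internally. The timestamp value is chosen for illustration.

```python
import calendar
import datetime

# Chosen so the fractional part sits just below a millisecond boundary.
t = 1749206469.3759997

# Path 1: go through datetime, which rounds the fraction to whole microseconds.
dt = datetime.datetime.fromtimestamp(t, tz=datetime.timezone.utc)
ms_via_datetime = calendar.timegm(dt.utctimetuple()) * 1000 + dt.microsecond // 1000

# Path 2: scale the float to milliseconds and truncate.
ms_via_float = int(t * 1e3)

# The microsecond field was rounded up across the millisecond boundary,
# so the two results differ by one millisecond.
print(dt.microsecond, ms_via_datetime - ms_via_float)
```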

cassandra/cluster.py

cassandra/policies.py

Lorak-mmk

General comment: integration tests for new policies are definitely needed here.

Lorak-mmk · 2025-06-13T12:06:31Z

cassandra/policies.py

+    session: Session
+    reconnection_policy: ReconnectionPolicy
+    lock = threading.Lock
+    schedule: Optional[Iterator[float]]


It should be lock: threading.Lock
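
The difference between the two spellings, as a small standalone sketch (the class names `Broken` and `Fixed` are illustrative only):

```python
import threading


class Broken:
    # Assignment: the class attribute is the lock factory itself, not a lock.
    lock = threading.Lock


class Fixed:
    # Annotation only: declares the type; the instance is created in __init__.
    lock: threading.Lock

    def __init__(self) -> None:
        self.lock = threading.Lock()


print(Broken.lock is threading.Lock)  # True: no lock object was ever created
fixed = Fixed()
with fixed.lock:
    print("lock acquired")
```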

Lorak-mmk · 2025-06-13T12:24:18Z

cassandra/policies.py

+        if self.shard_reconnection_scope == ShardReconnectionPolicyScope.Cluster:
+            scope_hash = "global-cluster-scope"
+        else:
+            scope_hash = host_id


When operating on enums, it is usually good to perform exhaustiveness checks.
If in the future someone adds a new variant to this enum, then your code would (incorrectly) treat it as Host scope. Instead make an else if branch for Host, and then else that throws an error.
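
A sketch of the suggested shape. The enum below is a stand-in with assumed member values, and `scope_key` is a hypothetical free function rather than the method from the diff:

```python
from enum import Enum


class ShardReconnectionPolicyScope(Enum):
    # Stand-in for the enum in the diff; the member values are assumed.
    Cluster = 0
    Host = 1


def scope_key(scope: ShardReconnectionPolicyScope, host_id: str) -> str:
    if scope == ShardReconnectionPolicyScope.Cluster:
        return "global-cluster-scope"
    elif scope == ShardReconnectionPolicyScope.Host:
        return host_id
    else:
        # A newly added variant fails loudly instead of being treated as Host.
        raise ValueError(f"unsupported scope: {scope!r}")


print(scope_key(ShardReconnectionPolicyScope.Cluster, "host-1"))
print(scope_key(ShardReconnectionPolicyScope.Host, "host-1"))
```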

Lorak-mmk · 2025-06-13T12:26:24Z

cassandra/policies.py

+
+            scope_info = self.scopes.get(scope_hash, 0)
+            if not scope_info:
+                scope_info = _ScopeBucket(self.session, self.reconnection_policy)
+                self.scopes[scope_hash] = scope_info
+            scope_info.add(self._execute, scheduled_key, method, *args, **kwargs)
+            return True


So scope_info here is at first either _ScopeBucket or int. I think it would be more idiomatic to use None.
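
For illustration, the lookup with `None` as the "missing" marker could look like this. `_ScopeBucket` here is a bare stand-in and `get_bucket` is a hypothetical helper, not code from the diff:

```python
from typing import Dict, Optional


class _ScopeBucket:
    """Stand-in for the class in the diff; only what the sketch needs."""

    def __init__(self) -> None:
        self.items: list = []


scopes: Dict[str, _ScopeBucket] = {}


def get_bucket(scope_hash: str) -> _ScopeBucket:
    # dict.get returns None when the key is missing, so the variable is
    # always either a _ScopeBucket or None, never an int.
    bucket: Optional[_ScopeBucket] = scopes.get(scope_hash)
    if bucket is None:
        bucket = _ScopeBucket()
        scopes[scope_hash] = bucket
    return bucket


first = get_bucket("global-cluster-scope")
print(get_bucket("global-cluster-scope") is first)  # the same bucket is reused
```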

Lorak-mmk · 2025-06-13T12:33:26Z

cassandra/policies.py

+
+    def schedule(
+            self,
+            host_id: str,
+            shard_id: int,
+            method: Callable[..., None],


shard_id is int here, interesting. What is going to be passed for Cassandra? 0? Or maybe None and the type hint is just wrong?

For Cassandra this code is not used; it is used only when the host has shard info.

Lorak-mmk · 2025-06-13T12:45:01Z

cassandra/policies.py

+class NoConcurrentShardReconnectionPolicy(ShardReconnectionPolicy):
+    """
+    A shard reconnection policy that allows only one pending connection per scope, where scope could be `Host`, `Cluster`
+    For backoff it uses `ReconnectionPolicy`, when there is no more reconnections to scheduled backoff policy is reminded
+    For all scopes does not allow schedule multiple reconnections for same host+shard, it silently ignores attempts to do that.
+
+    On `new_scheduler` instantiate a scheduler that behaves according to the policy
+    """
+    shard_reconnection_scope: ShardReconnectionPolicyScope
+    reconnection_policy: ReconnectionPolicy


Ok I really tried to get the hang of the code here, but failed.
What I thought before:

ReconnectionPolicy, according to its comments, defines the schedules when trying to reconnect to DOWN node.

For some reason (don't know if a good one, as there is no discussion about it in PR), instead of extending the driver to use it for populating the connection pool too, you decided to introduce a new mechanism for that, totally separate from ReconnectionPolicy.

But now I see ReconnectionPolicy used inside ShardReconnectionPolicy?! So a policy that steers reconnections to failed node now is used inside policy that re-fills connection pool. I cannot make sense of it.

This PR needs thorough explanation of newly introduced interfaces.

what are the things that are passed to schedule? What is this method, when and how many times are we supposed to call it? Which APIs can block and which cannot? How about thread safety - what they can assume?

How is ReconnectionPolicy different from ShardReconnectionPolicy? Names differ only in "Shard", so initially I thought it is shard-aware version of ReconnectionPolicy, but that does not seem to be the case.

What are the pros and cons of taken approach, what other approaches did you consider?

Ok I really tried to get the hang of the code here, but failed. What I thought before:

ReconnectionPolicy, according to its comments, defines the schedules when trying to reconnect to DOWN node.

For some reason (don't know if a good one, as there is no discussion about it in PR), instead of extending the driver to use it for populating the connection pool too, you decided to introduce a new mechanism for that, totally separate from ReconnectionPolicy.

But now I see ReconnectionPolicy used inside ShardReconnectionPolicy?! So a policy that steers reconnections to failed node now is used inside policy that re-fills connection pool. I cannot make sense of it.

This PR needs thorough explanation of newly introduced interfaces.

what are the things that are passed to schedule? What is this method, when and how many times are we supposed to call it? Which APIs can block and which cannot? How about thread safety - what they can assume?

How is ReconnectionPolicy different from ShardReconnectionPolicy? Names differ only in "Shard", so initially I thought it is shard-aware version of ReconnectionPolicy, but that does not seem to be the case.

I have changed the name for ReconnectionPolicy, added another type to it, and added a description of why it accepts both types.
I have also added documentation to the interfaces and implementations.
I have also renamed all the classes and abstract classes involved.

What are the pros and cons of taken approach, what other approaches did you consider?

I will add this information to PR description.

Lorak-mmk · 2025-06-13T12:46:08Z

cassandra/policies.py

+    """
+    items: List[Tuple[Callable[..., None], Tuple[Any, ...], dict[str, Any]]]
+    session: Session


When I see such complicated type, I immediately think that it should be simplified.
Here if I understand this code well, you could introduce Callback type that has fields callable, args, kwargs.
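
One possible shape for such a type, as a sketch. The name `Callback` comes from the comment above; the field is called `method` rather than `callable` to avoid shadowing the builtin, and the `run` helper is an addition for illustration:

```python
from typing import Any, Callable, Dict, List, NamedTuple, Tuple


class Callback(NamedTuple):
    """Hypothetical record bundling a callable with its arguments."""
    method: Callable[..., None]
    args: Tuple[Any, ...]
    kwargs: Dict[str, Any]

    def run(self) -> None:
        self.method(*self.args, **self.kwargs)


# The list annotation becomes readable: a list of callbacks.
items: List[Callback] = []
calls: List[Tuple[str, int]] = []
items.append(Callback(lambda host, shard=0: calls.append((host, shard)), ("host-1",), {"shard": 3}))
for item in items:
    item.run()
print(calls)  # [('host-1', 3)]
```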

Lorak-mmk · 2025-06-13T12:46:52Z

tests/unit/test_policies.py

+class MockLock:
+    def __init__(self):
+        self.acquire_calls = 0
+        self.release_calls = 0
+
+    def __enter__(self):
+        self.acquire_calls += 1
+
+    def __exit__(self, exc_type, exc_value, traceback):
+        self.release_calls += 1


I don't see this used anywhere. Why is it here?

A leftover from an old test; removed.

Lorak-mmk · 2025-06-13T12:50:19Z

cassandra/policies.py

+        scheduled_key = f'{host_id}-{shard_id}'
+        if self.already_scheduled.get(scheduled_key):
+            return
+
+        self.already_scheduled[scheduled_key] = True
+        if not self.session.is_shutdown:


For example here, in _NoDelayShardReconnectionScheduler: It performs the check in obviously non-thread-safe way. So if it can be called concurrently, then multiple schedules for the same key are possible, despite already_scheduled trying to prevent that. So now I'm thinking that maybe it can't be called concurrently?

OTOH already_scheduled uses a lock, which is extremely strong signal that concurrency is at play here. And now I have no idea what to think, because nothing is explained anywhere.

Yeah, the assumption was that it is not a big deal, since _open_connection_to_missing_shard will take care of the second connection.
But after looking at it I realised that it will close the old one, which can lead to lost responses.
Added a lock here.
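
A sketch of a check-then-set guarded by a lock. `OncePerKey` is a hypothetical helper, not the scheduler from the diff; a real scheduler would also need to clear the key once the attempt finishes, which is omitted here.

```python
import threading
from typing import Callable, Dict, List


class OncePerKey:
    """Hypothetical helper: runs a callback at most once per host+shard key."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._already_scheduled: Dict[str, bool] = {}

    def schedule(self, host_id: str, shard_id: int, method: Callable[[], None]) -> bool:
        key = f"{host_id}-{shard_id}"
        with self._lock:
            # Check and set happen under one lock, so two threads cannot
            # both see the key as missing.
            if self._already_scheduled.get(key):
                return False
            self._already_scheduled[key] = True
        method()
        return True


runs: List[int] = []
scheduler = OncePerKey()
threads = [
    threading.Thread(target=scheduler.schedule, args=("host-1", 0, lambda: runs.append(1)))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(runs))  # 1: only one of the eight concurrent calls ran the callback
```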

Lorak-mmk · 2025-06-13T12:53:27Z

cassandra/policies.py

+    def _get_delay(self) -> float:
+        if self.schedule is None:
+            self.schedule = self.reconnection_policy.new_schedule()
+        try:
+            return next(self.schedule)
+        except StopIteration:
+            self.schedule = self.reconnection_policy.new_schedule()
+            return next(self.schedule)


Is there a situation where self.schedule can really be None here, or is it just a precaution condition that should never really be entered? If it is a precaution, it is fine to have it but there should be a comment explaining that.

I thought that `self.schedule` can only be None when `running` is false (btw the opposite is not true: `running` is initialized to False, but `schedule` is initialized to non-None), and I only see calls to `_get_delay` when `running` should be `True`.

got rid of None case completely.
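
A sketch of what the method can look like once the schedule is always created up front. `DelaySource` is a hypothetical wrapper, and the factory passed in stands in for `reconnection_policy.new_schedule`:

```python
from typing import Callable, Iterator


class DelaySource:
    """Hypothetical helper wrapping a schedule factory such as `new_schedule`."""

    def __init__(self, new_schedule: Callable[[], Iterator[float]]) -> None:
        self._new_schedule = new_schedule
        # Created eagerly, so there is no None state to handle later.
        self._schedule = new_schedule()

    def get_delay(self) -> float:
        try:
            return next(self._schedule)
        except StopIteration:
            # The schedule ran out: start a fresh one.
            self._schedule = self._new_schedule()
            return next(self._schedule)


source = DelaySource(lambda: iter([1.0, 2.0]))
print([source.get_delay() for _ in range(5)])  # [1.0, 2.0, 1.0, 2.0, 1.0]
```

If the factory ever returns an empty schedule, the second `next` still raises `StopIteration`; the sketch does not handle that case.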

Lorak-mmk · 2025-06-13T12:57:48Z

cassandra/cluster.py

+    def empty(self):
+        return len(self._scheduled_tasks) == 0 and self._queue.empty()
+


Where is this used?

it used to be part of tests, now it is unused, removed.

mykaul · 2025-06-15T11:30:27Z

The patchset lacks documentation, which would have helped to understand the feature and when/how to use it. Is documentation a separate repo / commit?

mykaul · 2025-06-15T11:32:43Z

cassandra/policies.py

+    A scope for `ShardConnectionBackoffPolicy`, in particular ``LimitedConcurrencyShardConnectionBackoffPolicy``
+
+    Scope defines concurrency limitation scope, for instance :
+     ``LimitedConcurrencyShardConnectionBackoffPolicy`` - allows only one pending connection per scope, if you set it to Cluster,


Was there any ask for 1 connection per cluster? What's the usefulness? I can understand 1 per host, 1 per rack, maybe even 1 per DC. 1 per cluster is not performant, not highly available.

I will update description, it limits concurrency to 'max_concurrency' per scope

mykaul · 2025-06-15T11:33:11Z

cassandra/policies.py

+    """
+    A shard connection backoff policy that allows only ``max_concurrent`` concurrent connection per scope.
+    Scope could be ``Host``or ``Cluster``
+    For backoff calculation ir needs ``ShardConnectionBackoffSchedule`` or ``ReconnectionPolicy``, since both share same API.


Copilot

Pull Request Overview

This PR adds shard‐aware reconnection policies with support for scheduling constraints. Key changes include new policy implementations and schedulers in cassandra/policies.py, modifications to connection management in cassandra/pool.py and cassandra/cluster.py, and comprehensive tests in both unit and integration suites to validate the new behavior.

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
tests/unit/test_shard_aware.py	Adds tests for both immediate and delayed reconnection behavior using new policies.
tests/unit/test_policies.py	Introduces extensive tests for scope bucket and scheduler behavior.
tests/unit/test_host_connection_pool.py	Updates HostConnectionPool tests to integrate the new scheduler.
tests/integration/long/test_policies.py	Validates backoff policies and correct connection formation across shards.
tests/integration/__init__.py	Adds a marker for tests designed for Scylla-specific behavior.
cassandra/pool.py	Refactors connection replacements to use the new scheduler instead of direct submission.
cassandra/policies.py	Implements new scheduler classes and backoff policies for shard connections.
cassandra/cluster.py	Exposes a new property and uses the scheduler for initializing shard connections.

Add abstract classes: `ShardReconnectionPolicy` and `ShardReconnectionScheduler`.
And implementations:
`NoDelayShardReconnectionPolicy` - policy that represents old behavior of having no delay and no concurrency restriction.
`NoConcurrentShardReconnectionPolicy` - policy that limits concurrent reconnections to 1 per scope and introduces delay between reconnections within the scope.

Inject shard reconnection policy into cluster, session, connection and host pool. Drop pending connections tracking logic, since policy does that. Fix some tests that mocks Cluster, session, connection or host pool.

dkropachev force-pushed the dk/add-connection-pool-delay branch 4 times, most recently from 0b80886 to f62dfa3 Compare June 3, 2025 03:42

dkropachev changed the title 1 Add shard-aware reconnection policies with support for scheduling constraints Jun 3, 2025

dkropachev requested a review from Lorak-mmk June 3, 2025 03:45

dkropachev marked this pull request as ready for review June 3, 2025 03:45

dkropachev mentioned this pull request Jun 4, 2025

Delay for per-shard reconnection #483

Open

dkropachev force-pushed the dk/add-connection-pool-delay branch 2 times, most recently from dbb3ad1 to cbb4719 Compare June 4, 2025 17:53

Lorak-mmk requested changes Jun 6, 2025

dkropachev force-pushed the dk/add-connection-pool-delay branch 4 times, most recently from a43ccd1 to b0fd069 Compare June 7, 2025 04:47

dkropachev requested a review from Lorak-mmk June 7, 2025 04:48

dkropachev force-pushed the dk/add-connection-pool-delay branch 2 times, most recently from f47313f to 9dfd9ec Compare June 13, 2025 06:20

Lorak-mmk requested changes Jun 13, 2025

dkropachev force-pushed the dk/add-connection-pool-delay branch 2 times, most recently from aebc540 to 61668de Compare June 13, 2025 17:58

dkropachev requested a review from Lorak-mmk June 13, 2025 18:02

dkropachev self-assigned this Jun 13, 2025

mykaul reviewed Jun 15, 2025

mykaul requested a review from Copilot June 15, 2025 11:33

Copilot AI reviewed Jun 15, 2025

dkropachev added 2 commits June 17, 2025 00:07

feat(cluster): inject shard reconnection policy

806aba9

Inject shard reconnection policy into cluster, session, connection and host pool. Drop pending connections tracking logic, since policy does that. Fix some tests that mocks Cluster, session, connection or host pool.

dkropachev force-pushed the dk/add-connection-pool-delay branch from 61668de to 806aba9 Compare June 17, 2025 04:07

dkropachev changed the title ~~Add shard-aware reconnection policies with support for scheduling constraints~~ Add shard connection backoff policy Jun 17, 2025
