Prefer primary replica for the token for LWT queries #24

Closed
kostja opened this issue Apr 20, 2020 · 3 comments
Comments

kostja commented Apr 20, 2020

The way the Paxos protocol works, if two queries attempt to update the same key through different coordinators, they start two independent Paxos rounds. Each round is assigned a timestamp, and the coordinator with the highest timestamp wins.
The issue is that the loser queues up behind the winner (using a semaphore) only if both rounds are coordinated by the same node. If the rounds are started on different nodes, the only option is to sleep for an increasingly long random interval and retry (this is what our implementation does).
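
Purely as an illustration of that fallback (this is not Scylla's actual code, and the names are made up), the "sleep an increasingly long random interval and retry" pattern looks roughly like this:

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative sketch only: back off for a random interval whose upper bound
// grows with each failed attempt, before retrying a contended Paxos round.
final class ContentionBackoff {
    static void sleepBeforeRetry(int attempt) throws InterruptedException {
        long capMillis = Math.min(1000L, 10L << Math.min(attempt, 6)); // widen the window, cap at 1 s
        Thread.sleep(ThreadLocalRandom.current().nextLong(1, capMillis + 1));
    }
}
```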

If the key is contended and the driver is neither shard nor token aware, this leads to a lot of retries before an update succeeds.
If the key is contended and the driver is token and shard aware, it will send the query to one of the replicas for the partition, but will round-robin between them.

This still means that at least 50% of the queries will lose and have to retry before they can commit. Retries against a contended key in a multi-DC setup lead to ever-growing delays and timeouts. If a round takes 100 milliseconds due to network topology and the key is contended, we get as little as 1 query per second for such a key; if all queries queue up on the same coordinator, we can get up to 10 QPS per key.

This is why, for LWT queries, the driver should choose replicas in a pre-defined order, so that in case of contention they queue up at the same replica rather than compete: choose the primary replica first, then, if the primary is known to be down, the first secondary, then the second secondary, and so on.

This will reduce contention over hot keys and thus increase LWT performance.
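
A minimal sketch of that ordering, assuming the driver can obtain the token's replicas in ring order from its token metadata (the class and method names below are illustrative, not the driver's actual API):

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Illustrative sketch only: order replicas deterministically (primary first,
// then secondaries in ring order) and pick the first live one, so that every
// client sends the LWT for a given key to the same coordinator.
final class PrimaryFirstReplicaSelector {
    static <Node> Optional<Node> pick(List<Node> replicasInRingOrder, Predicate<Node> isUp) {
        // replicasInRingOrder is assumed to start with the primary replica
        // for the partition's token, followed by the secondaries.
        return replicasInRingOrder.stream().filter(isUp).findFirst();
    }
}
```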

If LOCAL_SERIAL serial consistency is used, we should prefer the primary in the local DC, because only the local DC's endpoints will participate in the Paxos round. If SERIAL consistency is specified in a multi-DC setup, we should use any DC's primary, but consistently use the same primary for all queries on all clients. The key to avoiding contention is that all clients consistently choose the same replica for the same key, as long as it is available/alive.
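
For example, with the Java Driver 3.x API (the table and values here are made up), the serial consistency set on the statement determines which endpoints take part in the Paxos round:

```java
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

class LwtSerialConsistencyExample {
    static void update(Session session) {
        SimpleStatement stmt = new SimpleStatement(
                "UPDATE accounts SET balance = ? WHERE id = ? IF balance = ?", 90, 1, 100);
        // LOCAL_SERIAL: the Paxos round involves only the local DC, so the
        // driver should prefer the primary replica in the local DC.
        stmt.setSerialConsistencyLevel(ConsistencyLevel.LOCAL_SERIAL);
        // With SERIAL instead, the round spans all DCs, and all clients should
        // agree on the same primary (in any DC) for the key.
        stmt.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
        session.execute(stmt);
    }
}
```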

See also scylladb/gocql#40

kostja commented May 14, 2020

avelanarius commented:
Support for this optimization is already implemented in Java Driver 3.x, via the following commits (no PR was opened; the changes were committed directly by @haaawk):

47552b7
d2cfc30
f724fdc
674f33f
9aebd64
f91f3b1
cd5faa9
5be42b6
f4aeff6

Support for this feature is missing in Java Driver 4.x. It seems that the work can be split into a few smaller subtasks: parsing LwtInfo, parsing the prepared-statement response to determine whether a query is an LWT, and picking the correct replica when executing a query.
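
A hypothetical sketch of the second subtask (the flag mask and names below are made up for illustration; the real driver types and Scylla protocol extension differ):

```java
// Hypothetical sketch: record at prepare time whether a statement is an LWT,
// based on a flag Scylla is assumed to return in the PREPARED response metadata.
final class PreparedLwtMark {
    // Assumed bit mask; the actual value is defined by the protocol extension.
    private static final int LWT_FLAG_MASK = 0x1;

    static boolean isLwt(int metadataFlags) {
        return (metadataFlags & LWT_FLAG_MASK) != 0;
    }
}
```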

avelanarius assigned Lorak-mmk and unassigned haaawk on Apr 4, 2022
Lorak-mmk added a commit to Lorak-mmk/java-driver that referenced this issue Apr 28, 2022
To avoid contention, we want to try primary replica first, then secondary and so on.

More information: scylladb#24
Lorak-mmk added a commit to Lorak-mmk/java-driver that referenced this issue May 17, 2022
To avoid contention, we want to try primary replica first, then secondary and so on.

More information: scylladb#24
avelanarius pushed a commit that referenced this issue May 26, 2022
To avoid contention, we want to try primary replica first, then secondary and so on.

More information: #24
Gor027 commented Mar 3, 2023

@avelanarius It seems that PR #125 has resolved this issue, so can it be closed now?
