-
Notifications
You must be signed in to change notification settings - Fork 5.2k
Eliminate cancellation deadlock in RateLimiter implementations #85523
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
In System.Threading.RateLimiting.ConcurrencyLimiter and the other three implementations using a synchronously locked deque, there is a deadlock between an outstanding wait-for-lease operation being canceled by user and releasing an existing lease. The internal Release() method has to clean up wait-for-lease operations in some circumstances. This involves disposing the operation's CancellationTokenRegistration, which blocks if the registered callback is running. This callback can be invoked on another thread if the external cancellation token fires. It locks the rate limiter to return the operation's permits to the rate limiter. If the external cancellation token fires after Release() runs but before it disposes of the CancellationTokenRegistration, there will be a deadlock as Release() holds the lock while waiting for the callback to complete while the callback blocks on the lock to return its permits. This change eliminates this deadlock by moving the cleanup of any wait-for-lease operations in Release() outside the lock.
Tagging subscribers to this area: @mangod9 Issue DetailsIn This change eliminates this deadlock by moving the cleanup of any wait-for-lease operations in
|
@dotnet-policy-bot @dotnet-policy-service agree company="Apptane Inc." |
Nice find! I am curious how you found it. Did you hit the problem in an app? Or were you reviewing the code and noticed this pattern? |
Thank you. We observed the deadlock in a situation where a lot of long-running operations were being queued by the rate limiter and canceled by a request timeout. I then went over the code and came up with this fix. Having experienced similar problems with disposing shared objects and canceling cancellation tokens under lock, I now always avoid it. |
// is going to invoke the callback synchronously, but this does not create | ||
// a deadlock because lock are reentrant | ||
if (cancellationToken.CanBeCanceled) | ||
#if NETCOREAPP || NETSTANDARD2_1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#if NETCOREAPP || NETSTANDARD2_1 | |
#if NETCOREAPP || NETSTANDARD2_1_OR_GREATER |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just keep it as is. Most libraries in this repo use NETSTANDARD2_1
, and NETSTANDARD2_1_OR_GREATER
is supported starting with the 5.0 SDK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated before reading your comment 🤣 5.0 has been EOL for more than a year too.
...raries/System.Threading.RateLimiting/src/System/Threading/RateLimiting/ConcurrencyLimiter.cs
Outdated
Show resolved
Hide resolved
...raries/System.Threading.RateLimiting/src/System/Threading/RateLimiting/ConcurrencyLimiter.cs
Outdated
Show resolved
Hide resolved
_cancellationToken = cancellationToken; | ||
if (state is RequestRegistration registration && registration.TrySetCanceled(registration._cancellationToken)) | ||
{ | ||
var limiter = (ConcurrencyLimiter)registration.Task.AsyncState!; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never seen this before, kind of interesting since you can "avoid" the extra field. Not really avoiding really since it's there whether you use it or not 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, might as well use it and save a pointer.
Interlocked.Increment(ref _successfulLeasesCount); | ||
} | ||
nextPendingRequest.CancellationTokenRegistration.Dispose(); | ||
disposer.Add(nextPendingRequest); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was just double checking things and realized the Debug.Assert(_queueCount == 0);
on line 285 can now be wrong (for other limiters too).
I'm kind of tempted to just remove it. Maybe add some asserts elsewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And other asserts about _queueCount
too. I don't know what dotnet/runtime policy on asserts is, i.e. how important it is to keep them. It should be possible to keep these asserts as is by moving _queueCount -= registration.Count
inside the lock in case of disposal, and setting registration.Count
to zero (it's only accessed from within the lock so it's safe to modify), but it makes the code a bit more complicated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this one is extremely important. It's there to verify that when the limiter is idle, there are no queued items.
We could check that the _queue
is empty instead, we do lose the verification that our math is correct, but like you noted, it would be more complex to verify that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not as complex as I thought. How about something like this?
diff --git a/src/libraries/System.Threading.RateLimiting/src/System/Threading/RateLimiting/ConcurrencyLimiter.cs b/src/libraries/System.Threading.RateLimiting/src/System/Threading/RateLimiting/ConcurrencyLimiter.cs
index c2ba9cac..4faf4210 100644
--- a/src/libraries/System.Threading.RateLimiting/src/System/Threading/RateLimiting/ConcurrencyLimiter.cs
+++ b/src/libraries/System.Threading.RateLimiting/src/System/Threading/RateLimiting/ConcurrencyLimiter.cs
@@ -178,7 +178,7 @@ protected override ValueTask<RateLimitLease> AcquireAsyncCore(int permitCount, C
var request = new RequestRegistration(permitCount, this, cancellationToken);
_queue.EnqueueTail(request);
_queueCount += permitCount;
- Debug.Assert(_queueCount <= _options.QueueLimit);
+ Debug.Assert(_queueCount - disposer.QueueCount <= _options.QueueLimit);
return new ValueTask<RateLimitLease>(request.Task);
}
@@ -271,7 +271,7 @@ private void Release(int releaseCount)
Interlocked.Increment(ref _successfulLeasesCount);
}
disposer.Add(nextPendingRequest);
- Debug.Assert(_queueCount >= 0);
+ Debug.Assert(_queueCount - disposer.QueueCount >= 0);
}
else
{
@@ -282,7 +282,7 @@ private void Release(int releaseCount)
if (_permitCount == _options.PermitLimit)
{
Debug.Assert(_idleSince is null);
- Debug.Assert(_queueCount == 0);
+ Debug.Assert(_queueCount - disposer.QueueCount == 0);
_idleSince = Stopwatch.GetTimestamp();
}
}
@@ -424,10 +424,13 @@ public struct Disposer : IDisposable
{
private RequestRegistration? _next;
+ public int QueueCount { get; set; }
+
public void Add(RequestRegistration request)
{
request._next = _next;
_next = request;
+ QueueCount += request.Count;
}
public void Dispose()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh that's not bad! Maybe a comment on public int QueueCount { get; set; }
that it's just for asserts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Process terminated. Assertion failed.
at System.Threading.RateLimiting.TokenBucketRateLimiter.ReplenishInternal(Int64 nowTicks) in /_/src/libraries/System.Threading.RateLimiting/src/System/Threading/RateLimiting/TokenBucketRateLimiter.cs:line 358
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Completely reworked _queueCount
handling and added unit tests exercising the cancellation code paths which used to produce the deadlock (only in ConcurrentLimiter, other three implementations are identical in this respect). I put this test and relevant code behind the DEBUG
conditional compilation symbol because it involves adding callbacks at two strategic points in Release()
, as there seems to be no other way to create a deterministically behaving unit test. If there is a test-specific symbol I'll use it instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is getting a bit complicated. Can we just remove Debug.Assert(_queueCount == 0);
on line 285 and undo the recent changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@atykhyy Are you able to update the PR or should I try to finish it up? We have a week or two before 8.0 starts locking down and I want to get this change in 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I was out of office. I don't quite understand why you wanted to remove this assert. It's not hit in unit tests, and I think it can be proved that it is safe. But if you're okay removing it then I won't bother :)
65011af
to
bf4c180
Compare
bf4c180
to
25b17d7
Compare
I took the commits and opened a new PR for the fix since I don't think your fork allows pushing. The new PR is at #90285 and your name should be preserved on the commit when we merge. |
In
System.Threading.RateLimiting.ConcurrencyLimiter
and the other three implementations using a synchronously locked deque, there is a deadlock between an outstanding wait-for-lease operation being canceled by user and releasing an existing lease. The internalRelease()
method has to clean up wait-for-lease operations in some circumstances. This involves disposing the operation'sCancellationTokenRegistration
, which blocks if the registered callback is already running. This callback can be invoked on another thread if the external cancellation token fires. It locks the rate limiter to return the operation's permits to the rate limiter. If the external cancellation token fires afterRelease()
runs but before it disposes of theCancellationTokenRegistration
, there will be a deadlock asRelease()
holds the lock while waiting for the callback to complete while the callback blocks on the lock to return its permits.This change eliminates this deadlock by moving the cleanup of any wait-for-lease operations in
Release()
outside the lock.It also combines the private struct
RequestRegistration
with the private classCancelQueueState
as they are used only together, increasing the size of the deque element for no discernible gain.