Skip to content

Conversation

atykhyy
Copy link

@atykhyy atykhyy commented Apr 28, 2023

In System.Threading.RateLimiting.ConcurrencyLimiter and the other three implementations using a synchronously locked deque, there is a deadlock between an outstanding wait-for-lease operation being canceled by user and releasing an existing lease. The internal Release() method has to clean up wait-for-lease operations in some circumstances. This involves disposing the operation's CancellationTokenRegistration, which blocks if the registered callback is already running. This callback can be invoked on another thread if the external cancellation token fires. It locks the rate limiter to return the operation's permits to the rate limiter. If the external cancellation token fires after Release() runs but before it disposes of the CancellationTokenRegistration, there will be a deadlock as Release() holds the lock while waiting for the callback to complete while the callback blocks on the lock to return its permits.

This change eliminates this deadlock by moving the cleanup of any wait-for-lease operations in Release() outside the lock.
It also combines the private struct RequestRegistration with the private class CancelQueueState as they are used only together, increasing the size of the deque element for no discernible gain.

In System.Threading.RateLimiting.ConcurrencyLimiter and the other three
implementations using a synchronously locked deque, there is a deadlock
between an outstanding wait-for-lease operation being canceled by user
and releasing an existing lease. The internal Release() method has to
clean up wait-for-lease operations in some circumstances. This involves
disposing the operation's CancellationTokenRegistration, which blocks
if the registered callback is running. This callback can be invoked on
another thread if the external cancellation token fires. It locks the
rate limiter to return the operation's permits to the rate limiter.
If the external cancellation token fires after Release() runs
but before it disposes of the CancellationTokenRegistration, there will
be a deadlock as Release() holds the lock while waiting for the callback
to complete while the callback blocks on the lock to return its permits.

This change eliminates this deadlock by moving the cleanup of any
wait-for-lease operations in Release() outside the lock.
@ghost ghost added community-contribution Indicates that the PR has been added by a community member area-System.Threading labels Apr 28, 2023
@ghost
Copy link

ghost commented Apr 28, 2023

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

Issue Details

In System.Threading.RateLimiting.ConcurrencyLimiter and the other three implementations using a synchronously locked deque, there is a deadlock between an outstanding wait-for-lease operation being canceled by user and releasing an existing lease. The internal Release() method has to clean up wait-for-lease operations in some circumstances. This involves disposing the operation's CancellationTokenRegistration, which blocks if the registered callback is already running. This callback can be invoked on another thread if the external cancellation token fires. It locks the rate limiter to return the operation's permits to the rate limiter. If the external cancellation token fires after Release() runs but before it disposes of the CancellationTokenRegistration, there will be a deadlock as Release() holds the lock while waiting for the callback to complete while the callback blocks on the lock to return its permits.

This change eliminates this deadlock by moving the cleanup of any wait-for-lease operations in Release() outside the lock.
It also combines the private struct RequestRegistration with the private class CancelQueueState as they are used only together, increasing the size of the deque element for no discernible gain.

Author: atykhyy
Assignees: -
Labels:

area-System.Threading, community-contribution

Milestone: -

@atykhyy
Copy link
Author

atykhyy commented Apr 29, 2023

@dotnet-policy-bot @dotnet-policy-service agree company="Apptane Inc."

@davidfowl davidfowl requested a review from BrennanConroy April 29, 2023 16:00
@BrennanConroy
Copy link
Member

Nice find! I am curious how you found it. Did you hit the problem in an app? Or were you reviewing the code and noticed this pattern?

@atykhyy
Copy link
Author

atykhyy commented May 21, 2023

Thank you. We observed the deadlock in a situation where a lot of long-running operations were being queued by the rate limiter and canceled by a request timeout. I then went over the code and came up with this fix. Having experienced similar problems with disposing shared objects and canceling cancellation tokens under lock, I now always avoid it. TaskCompletionSource mutators used to be another fertile source of these problems but TaskCreationOptions.RunContinuationsAsynchronously has provided an escape hatch when it is needed.

// is going to invoke the callback synchronously, but this does not create
// a deadlock because lock are reentrant
if (cancellationToken.CanBeCanceled)
#if NETCOREAPP || NETSTANDARD2_1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#if NETCOREAPP || NETSTANDARD2_1
#if NETCOREAPP || NETSTANDARD2_1_OR_GREATER

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just keep it as is. Most libraries in this repo use NETSTANDARD2_1, and NETSTANDARD2_1_OR_GREATER is supported starting with the 5.0 SDK.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated before reading your comment 🤣 5.0 has been EOL for more than a year too.

_cancellationToken = cancellationToken;
if (state is RequestRegistration registration && registration.TrySetCanceled(registration._cancellationToken))
{
var limiter = (ConcurrencyLimiter)registration.Task.AsyncState!;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never seen this before, kind of interesting since you can "avoid" the extra field. Not really avoiding really since it's there whether you use it or not 😄

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, might as well use it and save a pointer.

Interlocked.Increment(ref _successfulLeasesCount);
}
nextPendingRequest.CancellationTokenRegistration.Dispose();
disposer.Add(nextPendingRequest);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was just double checking things and realized the Debug.Assert(_queueCount == 0); on line 285 can now be wrong (for other limiters too).

I'm kind of tempted to just remove it. Maybe add some asserts elsewhere?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And other asserts about _queueCount too. I don't know what dotnet/runtime policy on asserts is, i.e. how important it is to keep them. It should be possible to keep these asserts as is by moving _queueCount -= registration.Count inside the lock in case of disposal, and setting registration.Count to zero (it's only accessed from within the lock so it's safe to modify), but it makes the code a bit more complicated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this one is extremely important. It's there to verify that when the limiter is idle, there are no queued items.

We could check that the _queue is empty instead, we do lose the verification that our math is correct, but like you noted, it would be more complex to verify that.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not as complex as I thought. How about something like this?

diff --git a/src/libraries/System.Threading.RateLimiting/src/System/Threading/RateLimiting/ConcurrencyLimiter.cs b/src/libraries/System.Threading.RateLimiting/src/System/Threading/RateLimiting/ConcurrencyLimiter.cs
index c2ba9cac..4faf4210 100644
--- a/src/libraries/System.Threading.RateLimiting/src/System/Threading/RateLimiting/ConcurrencyLimiter.cs
+++ b/src/libraries/System.Threading.RateLimiting/src/System/Threading/RateLimiting/ConcurrencyLimiter.cs
@@ -178,7 +178,7 @@ protected override ValueTask<RateLimitLease> AcquireAsyncCore(int permitCount, C
                 var request = new RequestRegistration(permitCount, this, cancellationToken);
                 _queue.EnqueueTail(request);
                 _queueCount += permitCount;
-                Debug.Assert(_queueCount <= _options.QueueLimit);
+                Debug.Assert(_queueCount - disposer.QueueCount <= _options.QueueLimit);
 
                 return new ValueTask<RateLimitLease>(request.Task);
             }
@@ -271,7 +271,7 @@ private void Release(int releaseCount)
                             Interlocked.Increment(ref _successfulLeasesCount);
                         }
                         disposer.Add(nextPendingRequest);
-                        Debug.Assert(_queueCount >= 0);
+                        Debug.Assert(_queueCount - disposer.QueueCount >= 0);
                     }
                     else
                     {
@@ -282,7 +282,7 @@ private void Release(int releaseCount)
                 if (_permitCount == _options.PermitLimit)
                 {
                     Debug.Assert(_idleSince is null);
-                    Debug.Assert(_queueCount == 0);
+                    Debug.Assert(_queueCount - disposer.QueueCount == 0);
                     _idleSince = Stopwatch.GetTimestamp();
                 }
             }
@@ -424,10 +424,13 @@ public struct Disposer : IDisposable
             {
                 private RequestRegistration? _next;
 
+                public int QueueCount { get; set; }
+
                 public void Add(RequestRegistration request)
                 {
                     request._next = _next;
                     _next = request;
+                    QueueCount += request.Count;
                 }
 
                 public void Dispose()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh that's not bad! Maybe a comment on public int QueueCount { get; set; } that it's just for asserts

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Process terminated. Assertion failed.
at System.Threading.RateLimiting.TokenBucketRateLimiter.ReplenishInternal(Int64 nowTicks) in /_/src/libraries/System.Threading.RateLimiting/src/System/Threading/RateLimiting/TokenBucketRateLimiter.cs:line 358

Copy link
Author

@atykhyy atykhyy Jun 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Completely reworked _queueCount handling and added unit tests exercising the cancellation code paths which used to produce the deadlock (only in ConcurrentLimiter, other three implementations are identical in this respect). I put this test and relevant code behind the DEBUG conditional compilation symbol because it involves adding callbacks at two strategic points in Release(), as there seems to be no other way to create a deterministically behaving unit test. If there is a test-specific symbol I'll use it instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is getting a bit complicated. Can we just remove Debug.Assert(_queueCount == 0); on line 285 and undo the recent changes?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@atykhyy Are you able to update the PR or should I try to finish it up? We have a week or two before 8.0 starts locking down and I want to get this change in 😄

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I was out of office. I don't quite understand why you wanted to remove this assert. It's not hit in unit tests, and I think it can be proved that it is safe. But if you're okay removing it then I won't bother :)

@BrennanConroy
Copy link
Member

I took the commits and opened a new PR for the fix since I don't think your fork allows pushing.

The new PR is at #90285 and your name should be preserved on the commit when we merge.

@ghost ghost locked as resolved and limited conversation to collaborators Sep 11, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Threading community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants