
Server non responsive under small load (thread-pool starvation) #17090


Closed
arthurvaverko-kaltura opened this issue Nov 14, 2019 · 22 comments
Labels
area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions feature-kestrel question
Milestone

Comments

@arthurvaverko-kaltura

I have a very large legacy codebase that we recently converted to .NET Core 2.2 from .NET Framework 4.6.1.
The entire legacy code is fully synchronous blocking calls, starting from the database with dataset.Fill()
up to WebRequests with stream readers and so on.

Under not-so-heavy load (50 concurrent users), the Kestrel server becomes unresponsive.
I can clearly see that we have the thread-pool starvation issue described here: aspnet/KestrelHttpServer#2104

During a simple load test with ab at 50 concurrent users, the number of threads in htop starts to climb by 1-2 per second. When I increase the concurrency it climbs even further, but the client gets timeouts (lots of them...).

The middleware pipeline I have is fully async up until the point it calls the controllers (they are sync and legacy).

I am not at a point where we can convert the code into a pure async-all-the-way architecture, so I am looking for a workaround.

I have set the ThreadPool minimum like this in my Startup class:

System.Threading.ThreadPool.SetMinThreads(10000, 10000); // or higher

but the app still starts with 25 threads, and when hit with a load of even 50 users the threads keep growing at 1-2 per second without letting requests in.

I am running from a container and tried adding -e ComPlus_ThreadPool_ForceMinWorkerThreads=1000, but got the same behavior.
I also set --sysctl net.core.somaxconn=4096; it didn't help.

I would love to get some pointers from you as to how I can avoid threadpool starvation without going full async on the legacy code.

@davidfowl
Member

The entire legacy code is fully synchronous blocking calls, starting from the database with dataset.Fill()
up to WebRequests with stream readers and so on.

Ouch 😢. Just an FYI, WebRequest is implemented very poorly on .NET Core and performs much worse than on .NET Framework (it's basically a trap).

Bumping the min number of threads should help with code like this, and the injection should be fast (more than 1-2 per second). Are you sure the min-threads fix is deployed? .NET Core 3.0 has counters you can use to look at this; before 3.x it's hard to see these kinds of metrics.

@arthurvaverko-kaltura
Author

@davidfowl I am sure it's deployed, and it's logging out the new min threads that I have set, but for some odd reason htop inside the container shows a very slow injection of new threads.

Is it important to set it before I create the host builder? Maybe setting the thread counts during ConfigureServices is an issue?

@arthurvaverko-kaltura
Author

Oh, and thanks for the WebRequest tip; I have encountered it already :)

@davidfowl
Member

Is it important to set it before I create the host builder? Maybe setting the thread counts during ConfigureServices is an issue?

That shouldn't matter.

@stephentoub @kouvel Any ideas on this?

@kouvel
Contributor

kouvel commented Nov 14, 2019

System.Threading.ThreadPool.SetMinThreads(10000, 10000);

The default maximum IO completion thread count is 1000, so the call above may fail (min can't be greater than the current max). Check the result of the call to SetMinThreads to see if it's succeeding.

To set the min worker and IO completion thread counts, this should work:

int workerThreads, ioCompletionThreads;
ThreadPool.GetMaxThreads(out workerThreads, out ioCompletionThreads);
// Raise the max first: the min can never exceed the current max (default IO-completion max is 1000).
ThreadPool.SetMaxThreads(Math.Max(10000, workerThreads), Math.Max(10000, ioCompletionThreads));
bool ok = ThreadPool.SetMinThreads(10000, 10000); // check the return value

To set just the min worker thread count, this should work:

int workerThreads, ioCompletionThreads;
ThreadPool.GetMinThreads(out workerThreads, out ioCompletionThreads);
bool ok = ThreadPool.SetMinThreads(10000, ioCompletionThreads); // keep the current IO-completion min

ComPlus_ThreadPool_ForceMinWorkerThreads=1000

On Unix-like systems the environment variable name is case-sensitive, and the correct casing is COMPlus_ThreadPool_ForceMinWorkerThreads.
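With the corrected casing, the container invocation from earlier in the thread would look something like this (image name and thread count are placeholders):

```shell
# Note the "COMPlus" prefix casing: on Linux the variable name is case-sensitive.
docker run -e COMPlus_ThreadPool_ForceMinWorkerThreads=1000 my-app-image
```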

@halter73
Member

While using HttpClientFactory and avoiding any blocking I/O would be ideal, as a stopgap, instead of increasing the min thread count, it might be better just to limit the number of concurrent requests. If the requests complete quickly enough under normal load, limiting the number of requests executing at once could help a lot.

You can try out our ConcurrencyLimiter package which allows you to do exactly this. Once the limit is reached, new requests are queued without blocking any threads. You can learn more about it and why it can be so helpful from this community standup discussion.
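For reference, wiring up the Microsoft.AspNetCore.ConcurrencyLimiter package (3.0+) looks roughly like this; the specific limits shown are illustrative placeholders, not recommendations:

```csharp
// In Startup.ConfigureServices: register a FIFO queue policy.
services.AddQueuePolicy(options =>
{
    options.MaxConcurrentRequests = Environment.ProcessorCount * 2; // requests allowed to execute at once
    options.RequestQueueLimit = 100; // queued requests beyond this are rejected (503 by default)
});

// In Startup.Configure: add the middleware early in the pipeline,
// before the expensive blocking work.
app.UseConcurrencyLimiter();
```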

@arthurvaverko-kaltura
Author

@kouvel Thank you so much! I didn't notice that I had set the min above the max, or that this method has a bool return value :\
I'm going to raise the min threadpool count now and re-test; I'll update soon.

@arthurvaverko-kaltura
Author

Yesss, now we have threads!
Requests are going in, no freezes. We will have to tweak the min number according to the average load, but that solved it. Thank you!

@arthurvaverko-kaltura
Author

@halter73 Thanks for the suggestion, I'm definitely going to look that up!
But it's basically attacking the problem with a different tool.

Is there a difference between queuing requests (non-blocking) and queuing requests inside the threadpool using the min-threads property? Basically the end result would be the same.

Am I missing something?
I mean, we do spend some CPU time on context switches, but is there something else?

@arthurvaverko-kaltura
Author

Huh :( ConcurrencyLimiter is .NET Core 3.0 only.
And under 3.0 I got a lot of errors:

Microsoft.AspNetCore.Server.Kestrel.Core.BadHttpRequestException: Unexpected end of request content.

@halter73
Member

Is there a difference between queuing requests (non-blocking) and queuing requests inside the threadpool using the min-threads property? Basically the end result would be the same.
Am I missing something?
I mean, we do spend some CPU time on context switches, but is there something else?

The biggest difference is that even when the request queue limit is reached, you should still have threadpool threads available as long as you set the concurrent request limit low enough. This is important in order to avoid a vicious cycle where the threads being injected into the threadpool immediately add more threadpool workitems before blocking yet another thread. In this case (which is common for ASP.NET Core apps that do blocking I/O), injecting threads can paradoxically make the threadpool's workitem backlog even larger.

In many cases, the blocked threads are really just waiting on another thread to become available to unblock the Task/operation it's waiting on rather than the actual I/O which has already completed. Since this is an example of code limiting access to the very resource (threadpool threads) needed to complete the operation it's waiting on, this is sometimes referred to as an "async deadlock". I go over a specific instance of this in aspnet/IISIntegration#245 (comment).

Another big difference with the ConcurrencyLimiter is that you can set both a concurrent request limit and a queue size limit. When the queue size limit is reached, it will start responding with 503s, or do whatever you specify in a custom OnRejected RequestDelegate. This way you can proactively tell clients to back off when you're overloaded. With the threadpool, there's no upper bound that I know of on the number of workitems. If the backlog becomes too large, by the time the workitem is dequeued to process the request, the client has probably already timed out the request, making all the time spent executing that workitem wasted.

I got a lot of errors on

Microsoft.AspNetCore.Server.Kestrel.Core.BadHttpRequestException: Unexpected end of request content.

Is that when you upgrade to 3.0 in general or when you upgrade to 3.0 and enable the ConcurrencyLimiter? If it's the latter, it's possible that the clients are timing out the request in the middle of uploading the body. While the request is queued, Kestrel will only buffer up to about 1MB of the request body. So if the client is trying to upload more than that and gets queued for a long time, it makes sense that the client could close the connection mid-request-body.

What do the client logs say? Do you have any more server logs? For diagnosing bad request failures, setting the min log level to debug is often helpful because we try to log client errors with low severity.

@arthurvaverko-kaltura
Author

Thanks for the valuable info!
I'm definitely going to give it a shot. Regarding server logs: as always, time pressure didn't let me investigate further; I had to drop back to 2.2 and search for other workarounds.

For now I have tried raising the min threads to 10K. This helped somewhat, but given the load our servers require and the amount of blocking I/O causing slow responses, we are still hitting starvation and timeouts.

I'll update soon on progress, thanks.

@analogrelay analogrelay added this to the Discussions milestone Nov 15, 2019
@stephenlautier
Contributor

stephenlautier commented Nov 20, 2019

I was going to open an issue, but then I found this one, which is very similar.

After updating our app to .NET Core 3.0 from 2.2, we started randomly getting several errors; on 2.2 we never had this error.

This is the stack we are getting:

Unexpected end of request content.
--
   at Microsoft.AspNetCore.Server.Kestrel.Core.BadHttpRequestException.Throw(RequestRejectionReason reason)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.Http1ContentLengthMessageBody.ReadAsyncInternal(CancellationToken cancellationToken)
   at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpRequestStream.ReadAsyncInternal(Memory`1 buffer, CancellationToken cancellationToken)
   at Microsoft.AspNetCore.WebUtilities.FileBufferingReadStream.ReadAsync(Byte[] buffer, Int32 offset, Int32 count, CancellationToken cancellationToken)
   at Microsoft.AspNetCore.WebUtilities.StreamHelperExtensions.DrainAsync(Stream stream, ArrayPool`1 bytePool, Nullable`1 limit, CancellationToken cancellationToken)
   at Microsoft.AspNetCore.Mvc.Formatters.NewtonsoftJsonInputFormatter.ReadRequestBodyAsync(InputFormatterContext context, Encoding encoding)
   at Microsoft.AspNetCore.Mvc.ModelBinding.Binders.BodyModelBinder.BindModelAsync(ModelBindingContext bindingContext)
   at Microsoft.AspNetCore.Mvc.ModelBinding.ParameterBinder.BindModelAsync(ActionContext actionContext, IModelBinder modelBinder, IValueProvider valueProvider, ParameterDescriptor parameter, ModelMetadata metadata, Object value)
   at Microsoft.AspNetCore.Mvc.Controllers.ControllerBinderDelegateProvider.<>c__DisplayClass0_0.<<CreateBinderDelegate>g__Bind|0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.<InvokeInnerFilterAsync>g__Awaited|13_0(ControllerActionInvoker invoker, Task lastTask, State next, Scope scope, Object state, Boolean isCompleted)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.<InvokeFilterPipelineAsync>g__Awaited|19_0(ResourceInvoker invoker, Task lastTask, State next, Scope scope, Object state, Boolean isCompleted)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.<InvokeAsync>g__Awaited|17_0(ResourceInvoker invoker, Task task, IDisposable scope)
   at Microsoft.AspNetCore.Routing.EndpointMiddleware.<Invoke>g__AwaitRequestTask|6_0(Endpoint endpoint, Task requestTask, ILogger logger)
   at Midgard.AspNet.Shared.User.UserContextMiddleware.Invoke(HttpContext httpContext, IUserContext userContext)
   at Microsoft.AspNetCore.Authorization.AuthorizationMiddleware.Invoke(HttpContext context)
   at Microsoft.AspNetCore.Authentication.AuthenticationMiddleware.Invoke(HttpContext context)
   at Odin.AspNet.Shared.Brand.BrandContextMiddleware.Invoke(HttpContext httpContext, RequestContext requestContext, IBrandContext brandContext)
   at Odin.AspNet.Shared.Request.RequestContextMiddleware.Invoke(HttpContext httpContext, RequestContext requestContext, ICountryService countryService)
   at Odin.AspNet.Shared.Tenancy.MultiTenantMiddleware`1.Invoke(HttpContext httpContext, IExportLocatorScope locatorScope)
   at Contentful.AspNetCore.MiddleWare.WebhookMiddleware.Invoke(HttpContext context)
   at Microsoft.AspNetCore.Builder.Extensions.MapMiddleware.Invoke(HttpContext context)
   at Microsoft.AspNetCore.Builder.Extensions.MapMiddleware.Invoke(HttpContext context)
   at Odin.AspNet.Shared.Error.ErrorHandlingMiddleware.Invoke(HttpContext httpContext, RequestContext requestContext)


Unable to read data from the transport connection: Connection reset by peer.

   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.GetResult(Int16 token)
   at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, CancellationToken cancellationToken)

We are using

  • Docker (linux) based on mcr.microsoft.com/dotnet/core/aspnet:3.0-buster-slim
  • netcore3

@arthurvaverko-kaltura
Author

arthurvaverko-kaltura commented Nov 20, 2019

Okay, so I can confirm that while increasing the min threadpool threads helped, it was not enough; we were still hitting starvation and getting timeouts.

What helped was adding the concurrency limiter and setting it to the default of Environment.ProcessorCount * 2.
In addition, we had to increase the min threads to compensate for the *2 concurrency.

Anyway, playing with these two did the trick; the legacy code now works as expected and holds the load it should.

Keep in mind that the old .NET Framework had the IIS queue and concurrency limit to protect from this.

@arthurvaverko-kaltura
Author

arthurvaverko-kaltura commented Nov 25, 2019

Edit
Looks like issue is resolved, see latest comment (you can skip this message)

@halter73 Hello again...
In one of our applications there seems to be a problem.
We configured the concurrency limiter according to Environment.ProcessorCount.
We use version 3.0.0.
On our test machine, when I load it with 10 concurrent users I get good behavior.
I used dotnet-counters monitor -p 6 Microsoft.AspNetCore.ConcurrencyLimiter System.Runtime

using ab -p api-data.txt -T application/json -c 10 -n 1000 http://10.10.11.123:8080/api
we got this:

Press p to pause, r to resume, q to quit.
    Status: Running

[System.Runtime]
    # of Assemblies Loaded                           171
    % Time in GC (since last GC)                       2
    Allocation Rate (Bytes / sec)            217,305,968
    CPU Usage (%)                                     22
    Exceptions / sec                                 112
    GC Heap Size (MB)                                125
    Gen 0 GC / sec                                     1
    Gen 0 Size (B)                            35,018,320
    Gen 1 GC / sec                                     1
    Gen 1 Size (B)                             1,613,168
    Gen 2 GC / sec                                     1
    Gen 2 Size (B)                            11,982,944
    LOH Size (B)                               5,833,416
    Monitor Lock Contention Count / sec              130
    Number of Active Timers                           39
    ThreadPool Completed Work Items / sec            811
    ThreadPool Queue Length                            0
    ThreadPool Threads Count                          31
    Working Set (MB)                                 408
[Microsoft.AspNetCore.ConcurrencyLimiter]
    Average Time in Queue                             62.235
    Queue Length                                       8
    Rejected Requests                                  0

and the API is alive and all okay.
But when I raise the concurrency in ab to 12, everything locks up; the server is non-responsive and will not recover until it is restarted.
ab -p api-data.txt -T application/json -c 12 -n 1000 http://10.10.11.123:8080/api
we got this:

[System.Runtime]
    # of Assemblies Loaded                           171
    % Time in GC (since last GC)                       2
    Allocation Rate (Bytes / sec)              1,000,168
    CPU Usage (%)                                      0
    Exceptions / sec                                   0
    GC Heap Size (MB)                                246
    Gen 0 GC / sec                                     0
    Gen 0 Size (B)                            19,326,952
    Gen 1 GC / sec                                     0
    Gen 1 Size (B)                             3,634,272
    Gen 2 GC / sec                                     0
    Gen 2 Size (B)                             9,619,736
    LOH Size (B)                               7,751,656
    Monitor Lock Contention Count / sec                0
    Number of Active Timers                           39
    ThreadPool Completed Work Items / sec             13
    ThreadPool Queue Length                            0
    ThreadPool Threads Count                          30
    Working Set (MB)                                 436
[Microsoft.AspNetCore.ConcurrencyLimiter]
    Average Time in Queue                              0
    Queue Length                                      12
    Rejected Requests                                  0
                                                          

Looks like all 12 requests get locked and nothing is getting in. Every new request increases the counter, and nothing else happens.
The thread count stays the same (min threads is set to 1K), and there doesn't seem to be any lock contention.

I tried to "disable" the concurrency limiter by setting the concurrency to 1K, and the API started working fine again (with min threads set to 1K). I fear there might be a deadlock somewhere.
From my look at the code I could not figure out where it could happen, but I suspect the double await might cause some trouble:

private async ValueTask<bool> SemaphoreAwaited(Task task)
{
    await task;
    return true;
}

This method awaits the semaphore while the middleware itself awaits the wrapping task,
but I'm not entirely sure; I would love your help on this.

@halter73
Member

I'll admit SemaphoreAwaited is a bit weird. It probably makes more sense just to call the ValueTask ctor that takes a Task so the conversion doesn't require SemaphoreAwaited to run on the threadpool. That said, the ConcurrencyLimiterMiddleware also needs to continue on the threadpool as soon as the awaited task completes, and it should continue on the same thread as SemaphoreAwaited, so this shouldn't make much of a difference in performance.

Additionally, if you aren't threadpool starved because the number of concurrent requests is limited enough, it shouldn't be a problem that SemaphoreAwaited needs a thread. I would recommend collecting a dump and running clrstack -all to see what all your threads are blocked on. It could be there's a Task.Result or something similar that can be turned into an await.
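The ValueTask-constructor alternative halter73 mentions can be sketched in a self-contained way (this is illustrative code, not the actual middleware; the type and method names here are made up, and the Task&lt;bool&gt;-returning WaitAsync(TimeSpan) overload is used so the constructor conversion applies directly):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class ValueTaskWrapDemo
{
    public static readonly SemaphoreSlim Semaphore = new SemaphoreSlim(2);

    // The helper style quoted above: a separate async state machine that
    // resumes on the threadpool when the awaited task completes.
    public static async ValueTask<bool> SemaphoreAwaited(Task task)
    {
        await task;
        return true;
    }

    // The Task<bool>-returning WaitAsync(TimeSpan) overload can instead be
    // wrapped directly by the ValueTask<bool>(Task<bool>) constructor,
    // with no extra async method at all.
    public static ValueTask<bool> TryEnter(TimeSpan timeout) =>
        new ValueTask<bool>(Semaphore.WaitAsync(timeout));

    public static async Task Main()
    {
        bool viaHelper = await SemaphoreAwaited(Semaphore.WaitAsync());
        bool viaCtor = await TryEnter(TimeSpan.FromSeconds(1));
        Console.WriteLine(viaHelper && viaCtor); // both acquisitions succeed (initial count is 2)
        Semaphore.Release(2);
    }
}
```

Either way the semaphore semantics are identical; the constructor form just skips one state machine allocation and resumption.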

@arthurvaverko-kaltura
Author

arthurvaverko-kaltura commented Nov 26, 2019

Edit
Looks like issue is resolved, see latest comment (you can skip this message)

Okay, I used dotnet-dump to collect and analyze the clrstack.
I got this:

OS Thread Id: 0x6 (0)
        Child SP               IP Call Site
00007FFCDABC9140 00007f879781c00c [GCFrame: 00007ffcdabc9140]
00007FFCDABC9220 00007f879781c00c [HelperMethodFrame_1OBJ: 00007ffcdabc9220] System.Threading.Monitor.ObjWait(Boolean, Int32, System.Object)
00007FFCDABC9350 00007F871D060874 System.Threading.ManualResetEventSlim.Wait(Int32, System.Threading.CancellationToken)
00007FFCDABC93E0 00007F871D075120 System.Threading.Tasks.Task.SpinThenBlockingWait(Int32, System.Threading.CancellationToken) [/_/src/System.Private.CoreLib/shared/System/Threading/Tasks/Task.cs @ 2922]
00007FFCDABC9440 00007F871D074FC4 System.Threading.Tasks.Task.InternalWaitCore(Int32, System.Threading.CancellationToken) [/_/src/System.Private.CoreLib/shared/System/Threading/Tasks/Task.cs @ 2861]
00007FFCDABC9490 00007F871D0CB246 System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task) [/_/src/System.Private.CoreLib/shared/System/Runtime/CompilerServices/TaskAwaiter.cs @ 143]
00007FFCDABC94B0 00007F871D0CB1ED System.Runtime.CompilerServices.TaskAwaiter.GetResult() [/_/src/System.Private.CoreLib/shared/System/Runtime/CompilerServices/TaskAwaiter.cs @ 107]
00007FFCDABC94C0 00007F871D7D5238 Microsoft.AspNetCore.Hosting.WebHostExtensions.Run(Microsoft.AspNetCore.Hosting.IWebHost) [/_/src/Hosting/Hosting/src/WebHostExtensions.cs @ 69]
00007FFCDABC94E0 00007F871D540665 TVPApi.Web.Program.Main(System.String[]) [/build/tvpapi/TVPApi.Api/Program.cs @ 24]
00007FFCDABC9818 00007f879695ccaf [GCFrame: 00007ffcdabc9818]
00007FFCDABC9D00 00007f879695ccaf [GCFrame: 00007ffcdabc9d00]

I also have the trace file:
trace.nettrace

clrstack -all
result is here: https://gist.github.com/arthurvaverko/29aea49782d96d31dc0e44795e293637

@arthurvaverko-kaltura
Author

Okay... facepalm.
Seems like it was all on me: I had added the ConcurrencyLimiter middleware twice.
I'm not entirely sure why that caused it to lock up at a certain load, but the fact is that the outer concurrency limiter never hit OnExit and never released its SemaphoreSlim,
while the inner ConcurrencyLimiter's semaphore was released.

I am running a full load test to verify this indeed fixes the issue; I'll update as we go.

In addition, I think it would be a good idea to "throw the devs into the pit of success" and throw some kind of new StupidDeveloperException() if someone tries to register the ConcurrencyLimiter twice :)

@halter73
Member

halter73 commented Nov 26, 2019

Interesting. The app probably broke when the ConcurrencyLimiter middleware was added twice because both instances of the middleware were using the same singleton IQueuePolicy instance. Once the max number of concurrent requests were let through the outer middleware, the inner middleware must have checked the now already-throttling IQueuePolicy and seen that it shouldn't let any more requests through until more ongoing requests finished. Unfortunately, at this point, no requests could finish because the inner middleware was queuing them all leading to a deadlock that happened not to block any threads.

I agree that registering the ConcurrencyLimiter twice is a pit of failure, and we should throw an exception early to make this clear to developers. We would welcome a PR that does this, though I'm not sure what the best mechanism for identifying this issue is. I'm not sure if any other middleware currently throws if added twice. Maybe the UseConcurrencyLimiter extension method could add an item to IApplicationBuilder.Properties and throw if that item is already set. This would not save you if you added the middleware manually instead of with the extension method, but it might be good enough.
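The Properties-based guard described here could look roughly like the following. This is only a sketch: IApplicationBuilder is stood in for by a plain dictionary so the snippet is self-contained, and the marker key name is hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class ConcurrencyLimiterGuard
{
    // Hypothetical marker key; a real implementation would pick its own.
    const string MarkerKey = "__ConcurrencyLimiterRegistered";

    // `properties` stands in for IApplicationBuilder.Properties, which is
    // shared by all Use* calls on the same pipeline builder.
    public static void EnsureSingleRegistration(IDictionary<string, object> properties)
    {
        if (properties.ContainsKey(MarkerKey))
            throw new InvalidOperationException(
                "UseConcurrencyLimiter was already called on this pipeline; " +
                "registering it twice shares one IQueuePolicy and can deadlock.");
        properties[MarkerKey] = true;
    }
}
```

A UseConcurrencyLimiter extension method would call EnsureSingleRegistration(app.Properties) before adding the middleware, turning the silent deadlock into an immediate startup failure.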

We could also think about moving the IQueuePolicy out of DI and just making it a regular settable property on ConcurrencyLimiterOptions. This would make accidentally sharing IQueuePolicy between two instances of ConcurrencyLimiterMiddleware almost impossible. Though I still don't think it's worth the effort and possibly breaking people to do this.

Edit: I now agree with @Tratcher that making the IQueuePolicy transient is the better fix. You might still take a slight performance hit from registering the ConcurrencyLimiter middleware multiple times unintentionally, but at least everything should still work as intended.

@halter73
Member

@Tratcher pointed out that making the IQueuePolicy a transient instead of a singleton would have stopped the deadlock issue. Since the middleware itself has a singleton lifetime and resolves the IQueuePolicy once, this could work. This still wouldn't make it easy to configure multiple instances of the middleware differently, but that's true for an IOptions-based approach too.
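In DI terms this is just a lifetime change. Since ConcurrencyLimiterMiddleware resolves IQueuePolicy once at construction, each middleware instance would then get its own policy and its own semaphore (sketch only; the concrete QueuePolicy type is internal to the package, so the registration shown is approximate):

```csharp
// Before (shared state: two middleware instances deadlock each other):
services.AddSingleton<IQueuePolicy, QueuePolicy>();

// After (each resolution gets a fresh policy with its own semaphore):
services.AddTransient<IQueuePolicy, QueuePolicy>();
```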

Out of curiosity, @arthurvaverko is there anything that made it easy to accidentally add the ConcurrencyLimiter middleware multiple times other than the fact that it didn't throw when you did?

@arthurvaverko-kaltura
Author

Nothing made it easy...
We have multiple applications that share a common middleware pipeline, so I created a wrapping common middleware that included the concurrency limiter.
So we have app.UseCommon(), and on top of that I added a call to UseConcurrencyLimiter().

I agree that making the IQueuePolicy transient is a better approach.

I'll try to find some time to maybe make my first contribution to the AspNetCore repo :)

@ghost

ghost commented Jan 26, 2020

Thank you for contacting us. Due to no activity on this issue we're closing it in an effort to keep our backlog clean. If you believe there is a concern related to the ASP.NET Core framework, which hasn't been addressed yet, please file a new issue.

@ghost ghost closed this as completed Jan 26, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Jan 26, 2020
@amcasey amcasey added area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions and removed area-runtime labels Jun 2, 2023

8 participants