IIS Application Pool Recycle returns unexpected 503 #10117

Closed

FlorianRainer opened this issue May 9, 2019 · 32 comments
Labels
affected-few This issue impacts only small number of customers area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions enhancement This issue represents an ask for new feature or an enhancement to an existing one severity-minor This label is used by an internal tool
Milestone

Comments

@FlorianRainer

Describe the bug / Steps To Reproduce

I have an ASP.NET Full Framework MVC application (Visual Studio default MVC template) hosted on my local IIS 10 (Win10), not IIS Express, with "Disable Overlapped Recycle = false" (the default value).
If I recycle the application pool while using a tool to continuously send HTTP requests, the web server responds correctly to all requests: no errors (500 / 503) and no timeouts.

If I do the same with ASP.NET Core 2.2 (in-process hosting), I get 503 responses.

I have the same problem if I change the physical path of my website.
Full Framework: no 503s. AspNetCoreModuleV2: a lot of 503s.
The old AspNetCoreModule (ASP.NET Core 2.1) had a similar problem but returned fewer 503 responses.

Expected behavior

The expected behavior would be:

  • no 503 errors on recycle (with "Disable Overlapped Recycle = false")
  • no 503 responses if the physical path changes.

This is important for high availability and deployment in my case.

Screenshots

Full Framework 4.7.2 with overlapped recycling not hosted via AspNetCoreModule
[screenshot]
(left: baseline, right: recycle test)

AspNetCore 2.2 with overlapped recycling via AspNetCoreModuleV2
[screenshot]
(left: baseline, right: recycle test)

@rockerinthelocker

rockerinthelocker commented May 9, 2019

Maybe the application pool's 'Queue Length' is too small? The default is 1,000, and when it is exhausted, 503s are returned.
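
For reference, a minimal sketch of inspecting and raising that queue limit, assuming the WebAdministration module and a hypothetical pool named DefaultAppPool (adjust the pool name as needed):

Import-Module WebAdministration

# Inspect the current HTTP.sys request queue limit for the pool (default is 1000)
Get-ItemProperty IIS:\AppPools\DefaultAppPool -Name queueLength

# Raise the limit if the queue is being exhausted during recycles
Set-ItemProperty IIS:\AppPools\DefaultAppPool -Name queueLength -Value 65000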

@FlorianRainer
Author

@rockerinthelocker the 'Queue Length' for my tests was 1000 (the default) in both cases, .NET Full Framework and Core. I made one more test with Queue Length = 65000, but nothing changed; nearly the same number of request errors.
[screenshot]

One more thing I've noticed is the difference in max request time; this could explain the problem, or at least give us a hint...

  • .NET Full Framework
    if I hit recycle, the already started requests are processed.
    the new worker process is started (overlapped)
    new incoming requests are queued
    once the new process is ready, they are processed there without errors

    result: some requests with a long request time (long wait time in the queue)

  • .NET Core
    if I hit recycle, the already started requests are processed.
    the new worker process is started (overlapped)
    new incoming requests are either queued OR still handled by the old worker, which returns 503 while in its shutdown phase
    once the new process is ready, the queued requests are processed by the new worker, but this happens too early; the process is not yet ready to process requests and returns 503

    result: in both cases we get some 503s, but fewer long response times (less time in the queue)

Of course this is only a possible explanation, not a concrete result of debugging!
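
A minimal sketch of the kind of load test described above, assuming a hypothetical local site at http://localhost/myapp; run it while triggering a recycle (e.g. Restart-WebAppPool) in another window and count the failed requests:

$url = 'http://localhost/myapp'   # hypothetical test URL
$failures = 0
for ($i = 0; $i -lt 5000; $i++) {
    try {
        $r = Invoke-WebRequest $url -UseBasicParsing -TimeoutSec 10
        if ($r.StatusCode -ne 200) { $failures++ }
    } catch {
        # Invoke-WebRequest throws on 5xx responses, so count those here
        $failures++
    }
}
"Failed requests: $failures"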

@analogrelay
Contributor

We'll look into whether we can improve this. However, in general we recommend a pattern like blue-green deployment for zero-downtime deployments. Features like Overlapped Recycle help, but they don't guarantee that you can do a zero-downtime deployment.

@analogrelay analogrelay added this to the 3.0.0-preview7 milestone May 14, 2019
@FlorianRainer
Author

@anurse Thanks for taking a look at this.

It's OK if it doesn't guarantee zero downtime for deployment; it would be nice if it worked like the old .NET, but you're right, there will be no guarantee.

But I think zero downtime should be guaranteed for an application pool recycle (by time interval, by memory limit, or a manual recycle via the IIS UI), right?
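
For reference, a minimal sketch of the recycle triggers mentioned above, assuming the WebAdministration module and a hypothetical pool named MyAppPool:

Import-Module WebAdministration

# Recycle on a fixed time interval
Set-ItemProperty IIS:\AppPools\MyAppPool -Name recycling.periodicRestart.time -Value '1.05:00:00'

# Recycle when private memory exceeds a limit (value in kilobytes)
Set-ItemProperty IIS:\AppPools\MyAppPool -Name recycling.periodicRestart.privateMemory -Value 1048576

# Manual recycle, equivalent to the IIS Manager "Recycle..." action
Restart-WebAppPool MyAppPool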

@analogrelay
Contributor

But I think zero downtime should be guaranteed for an application pool recycle (by time interval, by memory limit, or a manual recycle via the IIS UI), right?

Nothing is guaranteed when it comes to downtime; that's what makes high-availability applications so tricky :). But my understanding is that yes, we can try to make this better. When the app pool is recycling, IIS should be emitting the appropriate events so that we keep downtime low.

We'll look into this in 3.0 if we can, but we have a lot of high-priority work, so I can't guarantee how far we'll get.

@analogrelay
Contributor

Moving to the Backlog. We've done some initial investigation, and this is going to be very challenging. For high-availability scenarios we would need this plus some kind of shadow-copying mechanism, which is a very costly feature. We might be able to revisit it later. Using separate app pools and a load balancer is our recommended approach for high availability, as it gives you full flexibility over the deployment process and the ability to easily revert versions.

@analogrelay analogrelay modified the milestones: 3.0.0-preview7, Backlog Jun 4, 2019
@FlorianRainer
Author

@anurse thanks for the investigation. I agree with your conclusions about high-availability scenarios and deployment.

But for me the more important case is the normal periodic application pool recycle.
Correct me if I'm wrong, but in this case no shadow copying would be necessary because nothing changes in the web folder.

During the recycle process IIS returns "IIS 10.0 Detailed Error - 503.0 - Service Unavailable". This could be caused by ERROR_SERVER_SHUTDOWN_IN_PROGRESS in https://github.com/aspnet/AspNetCore/blob/3477daf3c4f530dff80f197e0642cb39a26fb07b/src/Servers/IIS/AspNetCoreModuleV2/AspNetCore/proxymodule.cpp#L121, but I'm not sure about this.

It seems like some requests are still processed by the worker process in its shutdown phase instead of being queued in the IIS request queue for the new worker process. This issue is probably different from the shadow-copy problem during a deployment, and it happens much more often (every 29 hours with the default config).

@ferrarimartin

So, given this issue and #8775, is there a way to have zero-downtime application pool recycles and deployments with IIS-only tools, or should we use other tools? We're facing the exact same issue.

@FlorianRainer
Author

FlorianRainer commented Jul 30, 2020

@ferrarimartin FYI: for our company we have decided to stay on .NET 4.x Full Framework to provide these zero-downtime services.
We were not able to work around this issue for application pool recycles. For publishing we used a kind of blue-green deployment strategy, but for the periodic recycle we were not able to find a workaround.

@jkotalik jkotalik added affected-few This issue impacts only small number of customers enhancement This issue represents an ask for new feature or an enhancement to an existing one severity-minor This label is used by an internal tool labels Nov 12, 2020 — with ASP.NET Core Issue Ranking
@EricMKaufman

The recycle problem seems like a considerable blocker for a lot of apps considering moving from Full Framework to .NET Core. Any updates?

@FlorianRainer
Author

IMHO at least the application pool recycle behavior should be considered for a fix.
The problem in this case is the 503 responses returned (from Kestrel, I guess) during shutdown, not in the startup phase.
It would help if there were at least a configuration setting or an opt-in (in code) to keep incoming requests in the IIS queue, or to let them finish successfully.

The deployment case is far more complicated but can be resolved with workarounds.

The only workaround for this issue is to disable the periodic app pool recycle (not possible for all apps out there). @EricMKaufman
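
A minimal sketch of that workaround, assuming the WebAdministration module and a hypothetical pool named MyAppPool; setting the regular time interval to zero disables the periodic recycle:

Import-Module WebAdministration

# Disable the time-based periodic recycle
Set-ItemProperty IIS:\AppPools\MyAppPool -Name recycling.periodicRestart.time -Value '00:00:00'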

@EricMKaufman

EricMKaufman commented Dec 18, 2020

@FlorianRainer, I agree it would be great to have Overlapped Recycle fixed or implemented. This was surprising for me to learn. Thank you for submitting this issue.

I think it would prevent unpleasant surprises for folks if this were added to the IIS documentation.
dotnet/AspNetCore.Docs#20992

@Santas

Santas commented Feb 19, 2021

@anurse Any update on the planning for this?

I agree that it is possible to work around the deployment scenario, but not all apps can work without regular recycling. It is also easy to miss in the documentation that this long-supported feature is no longer available.

@Santas

Santas commented Apr 19, 2021

You could use IIS ARR locally to load balance between two websites and switch between them without downtime. Take a look at https://kevinareed.com/2015/11/07/how-to-deploy-anything-in-iis-with-zero-downtime-on-a-single-server/ for inspiration.

@siberianbot

There is no problem with deployment. There is a problem with app recycling.

We just can't go to PROD and say "hey, could you use ARR to balance load between two instances on one machine?"
We are interested in recycling because it allows us to finish all requests on the old instance and use the new instance for handling new requests. With ARR we can lose some requests. There could also be a problem when ARR doesn't yet know about the new instance.

@siberianbot

Any updates?
This problem with recycling is critical for us.

@ArcanoxDragon

ArcanoxDragon commented May 5, 2021

This issue is impacting my company as well. We have two separate instances of the same ASP.NET Core site running on two different servers, and load balanced using a dedicated load balancer. Whenever the App Pool recycles, any requests to the site get a 503 error for about 3-4 seconds as the site restarts. The load balancer sees this as "downtime" and sends a downtime report e-mail every time the pool recycles, and if both sites happen to recycle at the same time (for whatever reason), the site appears offline briefly for outside users.

Something else we're observing which may be related to this:

If there are "long-running" connections open to the ASP.NET Core API (such as an image upload from a client with a slow connection) when the App Pool recycles, those connections are killed and an exception is thrown in the API saying "An existing connection was forcibly closed by the remote host" immediately before the pool restarts.

@ferrarimartin

@ArcanoxDragon yes, your second problem is probably related, as there's no overlapping of processes to drain the old connections. So the old process will be killed and a new one will be started. Your old connections will be killed, and some connections will be lost while the new process starts. By default the IIS time limit for shutting down the worker process is 90 seconds (not sure how that works with .NET Core though).
In your case, maybe you can have the load balancer remove traffic just before the recycles (say 30 seconds or so earlier if there are long-running connections), but you should set the recycling to occur at a specified time (different for each server) so you know when to remove traffic.
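
A minimal sketch of that setup, assuming the WebAdministration module and a hypothetical pool named MyAppPool: replace the rolling restart interval with a fixed schedule (staggered per server) and give long-running requests more time to drain:

Import-Module WebAdministration

# Replace the rolling periodic restart with a fixed daily schedule (03:00 here)
Set-ItemProperty IIS:\AppPools\MyAppPool -Name recycling.periodicRestart.time -Value '00:00:00'
Set-ItemProperty IIS:\AppPools\MyAppPool -Name recycling.periodicRestart.schedule -Value @{value='03:00'}

# Allow in-flight requests more time before the old worker process is terminated
Set-ItemProperty IIS:\AppPools\MyAppPool -Name processModel.shutdownTimeLimit -Value '00:03:00'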

@alex-jitbit
Contributor

alex-jitbit commented Oct 6, 2021

More recycle examples that have nothing to do with deployments:

Adding/modifying hostname bindings causes an IIS recycle.

Modifying web.config - causes a recycle.

Adding SSL-certificates - causes a recycle.

Adding a new IP address to listen to - causes a recycle.

Adding/removing other websites on the IIS server - causes a recycle.

Modifying any IIS setting, like "Dynamic IP restrictions" or URL Rewrite rules - causes a recycle.

etc.

In other words, IIS can't be touched at all, otherwise users get 503 errors in Core apps.

Sure, there are ways to work around those, like terminating SSL connections on a load balancer, etc. The point is: things that "just worked" in .NET Framework now require 3 extra servers in .NET Core.

It looks like .NET Core is pushing us towards Linux. Which is... unexpected! After all we're paying you guys for Windows Server licenses ;)

Is this something that's being looked into in .NET 6?

@alex-jitbit
Contributor

alex-jitbit commented Nov 8, 2021

I'm gonna try .NET 6 later today and see if it's been addressed.

P.S. It's unfortunate that this issue has the "affected-few" label applied. It affects literally everyone who hosts on IIS.

@alex-jitbit
Contributor

alex-jitbit commented Nov 10, 2021

The issue is still present with .NET 6 and the new hosting module 16.0.21299.

I upgraded the project to .NET 6 in VS 2022, installed the new hosting bundle, deployed to IIS and ran a load test using the Netling tool, just like the OP. Same results.

😥

@ghost

ghost commented Dec 22, 2021

Thanks for contacting us.

We're moving this issue to the .NET 7 Planning milestone for future evaluation / consideration. We would like to keep this around to collect more feedback, which can help us with prioritizing this work. We will re-evaluate this issue during our next planning meeting(s).
If we later determine that the issue has no community involvement, or that it is a very rare and low-impact issue, we will close it so that the team can focus on more important and high-impact issues.
To learn more about what to expect next and how this issue will be handled, you can read more about our triage process here.

@ferrarimartin

@Santas, have you tried this? I don't think it would work, because of this issue: #8775

You could use IIS ARR locally to load balance between two websites and switch between them without downtime. Take a look at https://kevinareed.com/2015/11/07/how-to-deploy-anything-in-iis-with-zero-downtime-on-a-single-server/ for inspiration.

@c0shea

c0shea commented Jan 5, 2022

This is an issue for us as well. We have two servers that are load balanced through a separate load balancer appliance. Even with the load balancer, it will still allow at least one request on each server to return a 503 to the client. I would expect the old behavior of queuing up the requests during a recycle rather than returning 503 to clients. Short of clients having to know to retry requests that return a 503, how else can we avoid their requests failing?
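
Absent a server-side fix, a minimal client-side retry sketch against a hypothetical endpoint (an assumption, not an endorsed solution; the same pattern applies to any HTTP client):

# Retry a request a few times when the server answers 503 during a recycle
$url = 'https://myapp.example.com/api/values'   # hypothetical endpoint
for ($attempt = 1; $attempt -le 4; $attempt++) {
    try {
        $response = Invoke-WebRequest $url -UseBasicParsing -TimeoutSec 30
        break   # success, stop retrying
    } catch {
        $status = $_.Exception.Response.StatusCode.value__
        if ($status -ne 503 -or $attempt -eq 4) { throw }
        Start-Sleep -Seconds (2 * $attempt)   # simple backoff before the next try
    }
}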

@triynko

triynko commented Jan 11, 2022

It's not just ASP.NET Core. Overlapped recycling in IIS appears to be completely broken. No kidding. Windows Server 2019, AWS, Docker, .NET Framework 4.7.2 web application. The "Application Initialization" Windows feature is confirmed to be installed: Get-WindowsFeature 'Web-AppInit' shows Install State = Installed.

With PowerShell, I make absolutely sure my runtime settings within the Docker container are configured for overlapped recycling. I made sure that: 1. the idle timeout is set to zero, 2. the startup time limit is high enough (even the shutdown time limit is more than enough), 3. periodic restart is disabled (set to zero) and instead 4. a scheduled restart time is configured (in the near future, for testing the behavior), 5. 'preloadEnabled' is set to true, and 6. the 'old' auto-start provider mechanism is disabled by setting serviceAutoStartEnabled = False and serviceAutoStartProvider = "" (by the way, having them enabled does not fix anything), and I made sure they're set explicitly at the web application level, in addition to the site-level defaults.

Import-Module WebAdministration;
# No idle timeout; always keep the worker process running
Set-ItemProperty IIS:\AppPools\DefaultAppPool -Name 'processModel.idleTimeout' -Value "00:00:00";
Set-ItemProperty IIS:\AppPools\DefaultAppPool -Name 'processModel.startupTimeLimit' -Value "00:05:00";
Set-ItemProperty IIS:\AppPools\DefaultAppPool -Name 'autoStart' -Value True;
Set-ItemProperty IIS:\AppPools\DefaultAppPool -Name 'startMode' -Value "AlwaysRunning";
# Disable the rolling periodic restart and use a fixed schedule instead
Set-ItemProperty -Path 'IIS:\AppPools\DefaultAppPool' -Name 'recycling.periodicRestart.time' -Value '0.00:00:00';
Set-ItemProperty -Path 'IIS:\AppPools\DefaultAppPool' -Name 'recycling.periodicRestart.schedule' -Value @{value="01:35"};
# Turn off the legacy auto-start provider at the site and application level
Set-ItemProperty IIS:\Sites\DefaultWebSite -Name applicationDefaults -Value @{"serviceAutoStartEnabled"="False";"serviceAutoStartProvider"=""};
Set-ItemProperty IIS:\Sites\DefaultWebSite\MyWebApp -Name serviceAutoStartEnabled -Value False;
Set-ItemProperty IIS:\Sites\DefaultWebSite\MyWebApp -Name serviceAutoStartProvider -Value "";

Not only am I seeing double-runs of Application_Start (in global.asax) where only one of the calls finishes (the other shows no indication of completing before being terminated), I see basically immediate blocking of all requests. It's limited to a single server node and processModel.maxProcesses is set to 1.

In a scenario with 4 threads constantly calling the app node when a recycle event occurs, all 4 requests block right away. 2 of the 4 experience a 504 gateway timeout after 60s, retry, and complete in 30s (so 90s total). The other 2 threads' calls (which started at the same time) complete after about 90s (about 10s more than the 80s runtime measured by Application_Start). So the overall wait time of all 4 threads is 90s, but 2 of 4 experience a 504 after 60 seconds, while the other two see no such disruption and complete cleanly after a 90s blocking wait. I'm not even sure where the 60s timeout leading to the 504s is coming from, since the process model shutdown/startup time limits are 1:30 and 5:00 respectively. No timeout settings that I'm aware of are at 60s.

Somehow it's like there are 2 processes started (confirmed by logging output from Application_Start); half the requests (2 of 4) end up at one that ultimately gets terminated (confirmed by the absence of logs written at the end of Application_Start), and the other two are sent to the 'good' one that eventually persists (and writes logs at the end of Application_Start). It makes no sense. On top of that, the 'initializationPage' I have set is never called. The path is correctly set to '/MyWebApp/health/warmup', and I confirmed I can reach it from within Docker by calling a similarly placed echo endpoint like: $body = @{"status_code"=200;"payload"="abc"}; Invoke-WebRequest 'http://localhost:80/MyWebApp/health/echo' -Method 'POST' -Body ($body|ConvertTo-Json) -ContentType 'application/json' -UseBasicParsing;

It's like overlapped recycling just DOES NOT WORK AT ALL. The moment the new process starts, requests are immediately routed to it (or a bad copy of a duplicate new process) and block until it's completely warmed up. That's not overlapping.

I've tested this behavior on settings changes (process model changes are most disruptive: I get 500s, 502s, 503s, and 504s; app pool settings changes show duplicate initialization, as do scheduled recycling events; meanwhile, app-level changes don't seem to exhibit the duplicate initialization behavior, but the blocking behavior is still there; no overlapped recycling is apparent). There's just nothing about this setup that works as documented, and it's very frustrating that this is heavily impacting production environments.

@sbakharev

sbakharev commented Jan 13, 2022

@triynko although I completely understand what you're trying to achieve, and I share your pain, there is only one potential real IIS bug I see described in your post:

2 of the 4 experience a 504 gateway timeout after 60s

To me this looks like some sort of miscommunication between overlapped recycling and preloading (or autoStart, or AlwaysRunning, idk).

Speaking of overlapping: IIS does provide it to some extent by letting requests which have already started finish in the old process and directing all new requests to a fresh process. This is what IIS means by "overlapping": instead of blocking/503ing all new incoming requests, waiting for all the existing requests to finish, and only then creating a new process and delegating all new requests there, it instantly creates a new process which overlaps with the existing one. This feature (or, more accurately, the ability to disable it) does make sense, since there are applications which do not work properly with multiple process instances; yes, in many cases this is caused by smelly application design, but not always.

And IIS has no way to understand whether your application has or hasn't completely "warmed up". In modern DI/LazyLoading-driven applications the warm-up stage may even never complete since some resources would only get initialized during specific requests which may simply never happen for the whole process lifetime. Attempts to build a universal warmup-finished-detection technique would most likely either just fail, or would be ugly/buggy, or would require customizations within applications themselves (providing externally usable warmup methods). It is too individual.

P.S. All I've written about overlapped recycling is about .NET Framework apps, not ASP.NET Core. The latter does have a problem and we even had to completely disable scheduled recycling of ASP.NET Core applications to prevent regular 500 errors on frontends due to backends responding 503 during recycles.

@FlorianRainer
Author

I can confirm the double-run of Application_Start (in global.asax) problem described by @triynko.

@HaoK HaoK modified the milestones: .NET 7 Planning, Discussions Feb 16, 2022
@HaoK HaoK removed their assignment Feb 16, 2022
@triynko

triynko commented Feb 18, 2022

Update since my last post. This is way more complex than I originally thought it would be. The overlapped recycling does work like it's supposed to, but there is a very specific issue that causes double or even triple initialization runs.

First, on the topic of detecting when an app has warmed up:

And IIS has no way to understand whether your application has or hasn't completely "warmed up". In modern DI/LazyLoading-driven applications the warm-up stage may even never complete since some resources would only get initialized during specific requests which may simply never happen for the whole process lifetime. Attempts to build a universal warmup-finished-detection technique would most likely either just fail, or would be ugly/buggy, or would require customizations within applications themselves (providing externally usable warmup methods). It is too individual.

It actually can and does detect when it's warmed up correctly. For example, preloadEnabled sends an initial request, which returns only after Application_Start has completed. Asynchronous warmups aside, as long as it waits for Application_Start to complete, it's fine. That means the app is able to serve requests. We do all our initialization in Application_Start and increase the startup/ping timeouts to allow it enough time to complete. In addition, IIS supports 'warmup' pages, and I've confirmed this works correctly too. During overlapped recycling, it not only finishes Application_Start, it also finishes executing the warmup pages before it starts handing requests to the new/overlapped process. The result is a completely undisrupted call chain with a sudden transition to the new process serving requests. The returned responses also confirm the value of my 'WarmedUp' flag set in the response header (sourced from a static variable assigned in my warmup endpoint), with no delayed requests at all.
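
As a point of reference, a minimal sketch of the preload/warmup configuration described above, assuming the WebAdministration module, a hypothetical application 'Default Web Site/MyWebApp', and a /health/warmup endpoint; the paths and names are assumptions:

Import-Module WebAdministration

# Preload the application when the worker process starts
Set-ItemProperty 'IIS:\Sites\Default Web Site\MyWebApp' -Name preloadEnabled -Value True

# Register a warmup page that IIS calls before the new process takes traffic
Add-WebConfigurationProperty -PSPath 'MACHINE/WEBROOT/APPHOST' -Location 'Default Web Site/MyWebApp' -Filter 'system.webServer/applicationInitialization' -Name '.' -Value @{initializationPage='/health/warmup'}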

Now, here's where things get interesting.

Microsoft's Docker entrypoint executable, ServiceMonitor, is what's coming in and disrupting everything. Because our apps are set to use autoStart and preloadEnabled, there's a container-system-initiated run of the W3SVC service that starts up the app pool and application right away, before ServiceMonitor even gets involved. That run is DOOMED and pointless: it's missing the environment variables that ServiceMonitor will eventually inject, and so on. So the first thing ServiceMonitor does is try to stop the service, then inject environment variables into the DefaultAppPool's config section, then restart the service. To work around that doomed run, we simply set the startup type of the W3SVC service to 'Manual' (via a PowerShell script) when we build the Docker container. That way it's not initially running, there is no doomed run, and ServiceMonitor starts up, configures what it needs to, starts W3SVC (for the first time) on its own, and then we see just a single run of Application_Start! 😄
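
A minimal sketch of that workaround as described, assuming it runs as a build step inside the container image:

# Run at image build time so W3SVC is not auto-started before ServiceMonitor
# has injected its environment variables into the app pool configuration
Set-Service -Name W3SVC -StartupType Manual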

Except in ECS. We were seeing triple, not double, app startup runs there. So while we were able to reduce from 2 extra startups to 1 extra startup, we have no idea why this extra startup run is being triggered. It's something unique to the ECS environment that doesn't happen when running the image locally in docker desktop. The other strange thing about it is that it's an in-process recycle -- the kind you'd see if you modified a web.config file, and unlike the new-process, overlapped recycle that would happen if you modified some application pool setting. This in-process recycle is particularly bad for us, because it can crash the process when we're using SignalFx tracing libraries that hook into the CLR; we get sharing violations when it tries to load the domain-neutral assemblies inside the same process under a different AppDomain. Therefore, the only valid recycling we can support is new-process / overlapped recycling, such as scheduled recycling, and that's all we really want to support anyway.

Another strange thing: if I dump all environment variables to disk in Application_Start (i.e. via Environment.GetEnvironmentVariables), the first Application_Start run is missing 2 variables that the 2nd one has. So... same process, but a new run of Application_Start, and it lists the APP_POOL_CONFIG and APP_POOL_ID variables. And then, if I manually trigger a recycle with PowerShell's Restart-WebAppPool DefaultAppPool, the new process is missing those two variables just like the original one was. So there's this very inconsistent behavior in Docker and across recycles, particularly in ECS-deployed apps, that's really, really bothering me. This double/triple recycling and triple cache hydration is even leading to OutOfMemory exceptions on these already memory-constrained systems (but that's another subject to be dealt with separately).

@ghost

ghost commented Apr 19, 2022

Thank you for contacting us. Due to a lack of activity on this discussion issue we're closing it in an effort to keep our backlog clean. If you believe there is a concern related to the ASP.NET Core framework, which hasn't been addressed yet, please file a new issue.

This issue will be locked after 30 more days of inactivity. If you still wish to discuss this subject after then, please create a new issue!

@ghost ghost closed this as completed Apr 19, 2022
@alex-jitbit
Contributor

Please reopen this issue, it still hasn't been fixed after 2.5 years.

@alex-jitbit
Contributor

@triynko your comments about .NET Framework are completely off topic, by the way; this repo is about .NET Core.

@benjamin-stern

Agreed. We're experiencing this issue currently, and there doesn't even seem to be a known workaround. Maybe @HaoK can comment on what has been decided for this issue, or whether there is a viable workaround?

@ghost ghost locked as resolved and limited conversation to collaborators May 26, 2022
@amcasey amcasey added area-networking Includes servers, yarp, json patch, bedrock, websockets, http client factory, and http abstractions and removed area-runtime labels Aug 24, 2023
This issue was closed.