reopen issue #10117 - IIS app pool recycle throws 503 errors #41340
Comments
Trying to achieve high-availability with a single instance is not recommended. |
@Tratcher like I indicated above, this is not about deployments. There are a lot of scenarios when IIS recycles the pool (see above) |
Deployments are just one example that disrupt availability. A single instance is not advised for high-availability for many reasons. |
We've moved this issue to the Backlog milestone. This means that it is not going to be worked on for the coming release. We will reassess the backlog following the current release and consider this item at that time. To learn more about our issue management process and to have better expectation regarding different types of issues you can read our Triage Process. |
@Tratcher Even having more than a single instance this would still strongly affect a service, as all the requests going to the server that's recycling would be returned the 503 error. |
We have two instances running behind a load balancer and have still experienced this issue intermittently. When the app pool inevitably recycles (due to deployment, config change, etc), it starts returning 503 instead of queuing up the requests. The load balancer doesn't immediately treat the 503s as the server being down and take it out of the rotation. Instead, it uses a polling mechanism that calls an endpoint (i.e. /status) on each instance and checks for a successful response. While that status endpoint is monitored frequently enough, there is obviously plenty of time where a bunch of requests will fail with 503 while the recycle is happening. We can't have the load balancer take the instance out of the rotation if it sees 503s being returned because (1) it's not an available option in NetScaler and (2) if both happen to recycle at the same time, both servers would be taken out of the rotation and the service would be completely down until manual intervention told the load balancer that the requests aren't failing anymore. |
In this scenario (two or more instances behind a load balancer), since you're using "/status" to check availability, I suggest adding a step to your deployment or maintenance routine that, before anything else is changed, forces "/status" to report "unhealthy", so the load balancer removes the node from rotation before you make the changes. About the issue itself: the IIS default behavior on a recycle is to first start a new worker process, route new requests to it, wait the default configured time for the in-flight requests on the old process to finish, and then shut down the old process, keeping only the new one. Recycling an application pool could be ASP.NET Core "expected behavior", but from the IIS side, throwing 503s during a recycle isn't "normal behavior". |
The problem is that ASP.NET Core doesn't use the overlapping recycle behavior that .NET Framework did. While the old worker process is being shut down (especially if there were a lot of in-flight requests being handled by it), the new process isn't yet started and those requests in the middle get the 503. |
Is there any way to minimize this problem, or at least to detect what is causing it? |
High availability and throwing 503s from otherwise "perfectly fine servers" are separate concerns. Most load balancers do not hide HTTP errors! If the IIS process responds to a HTTP request with 503, then that's what the user will see. In particular, none of the Azure load balancer offerings hide errors from the users. They pass them on faithfully. If a previously working server throws 503 errors then it will take significant time for the load balancers to detect this. Minutes even, or 10+ minutes if using CDN-type solutions such as Azure Front Door. This behavior is triggered by many actions, not all of which are resolved via hosting on multiple server instances. Scheduled recycling, for example, has been mentioned by many people as a common trigger. Similarly, many people have pointed out in the previous thread that it's not just 503 errors that are seen, but slow uploads are also unceremoniously terminated. |
My problem seems to be that the IIS recycle sometimes takes too long. It's not related to CPU or memory. I asked IT infrastructure to add more performance counters to Grafana, but they haven't added them yet. I'm inclined to think the problem is with network connections. What do you think I should monitor in Grafana? |
In my experience, this also happens when simply setting the physical path of a website. The response is somewhat different though: a recycle renders the text "The service is unavailable", while setting the physical path just gives an empty 503 result. |
In my case it's not for seconds, after the first 503 only iisreset solves the problem |
We are also experiencing this issue. Is there any expectation of a fix? This is going to keep us from moving forward with .NET migration / any new work in .NET 6+. It's not the deployments, blue/green can handle that, it's the "unexpected" recycles in the run of a day that are the issue. |
We are also seeing this with our nightly app pool recycles in the production environment, and with all our ASP.NET Core microservices. Example of IIS logs (anonymized): HTTP 503 with win32 substatus 1255. According to https://learn.microsoft.com/en-us/windows/win32/debug/system-error-codes--1000-1299- code 1255 matches: 1255 (0x4E7) ERROR_SERVER_SHUTDOWN_IN_PROGRESS: The server machine is shutting down. This is a critical issue for us, since we are running a 24/7 platform. |
I did a few tests with an almost empty aspnetcore 6 application and with some of our production applications on IIS10. Of all my tests the problem seemed to be exacerbated the most if the application is doing things during the Application Stopped event:
Using this snippet in my almost empty aspnetcore 6 application leads to 20-50 times the amount of HTTP503 responses during apppool recycle compared to using no Application Stopped event handler. |
@cun-dp that makes sense, the application has stopped serving traffic and can't re-start until the current process exits. |
@Tratcher It makes sense only in the way that my test further confirms the bug with aspnetcore app pool recycling: This supports the fact that IIS (or the aspnet core v2 module?) does not handle app pool recycles correctly by overlapping both application instances and routing the new requests to the new application instance the moment the application pool is asked to recycle, like IIS does with aspdotnet framework 4.x (and previous) applications. And to make this abundantly clear, because it has been misinterpreted in #10117 a lot: The behaviour can be reproduced by just running |
But wasn't the correct approach to start a new process, redirect new connections to that process, and let the old process die in peace for as long as it takes? Is there a way for me to see what's blocking the old process from dying? Is there a way to force all the threads in the old process to exit? Is there any workaround to help with this problem? |
We see the exact same thing and easy to reproduce as explained by @cun-dp , just recycle while under load and requests will fail with 503's. This behaviour is not seen in any of our framework api's, only .net core. I've tried all different combinations of IIS/App pool settings and nothing has worked. Ran my .net core api continuously using K6 and always get hit with 503's when recycling under load (this api is a port from .net framework which never has this issue running same load tests). While this doesn't fix the problem, it has helped, in the app pools advanced settings, setting Disable Overlapped Recycle to |
Nice, I'll try that solution. Is there a way to find out what is locked in the old process, whether it is a file lock or another resource? |
In our work, the exact same thing happens to us when we install the .NET 6 application on a server with IIS, after the first 503 error it does not come out without an iisreset. but if we raise this site with kestrel we have no problem. |
Looking at the ANCM commit history I see it had 2 commits in 1 year I'm thinking MS priorities are elsewhere (or the C++ guru who wrote it has left the company and now everyone's just afraid to touch it) |
I saw that @BrennanConroy is committing code. It would be great if we could get a better error message showing what is preventing the shutdown of the ASP.NET Core app. |
@BrennanConroy Perfect, got it. After testing the dll for a few hours in production, no 503 error appeared. It seems to be working perfectly. Thank you. |
The experimental dll works for us only if we extend the shutdownDelay. Before that our Umbraco CMS(ver 9) site would still return 503s on recycle so the dll seems to be working. Thanks. Now it only goes down when we deploy the site and I don't have a fix for that yet because it's not happening every time. I tried installing Application Initialization and turning on preload but that seems to make it worse in that I have to recycle or shutdown/restart the app pool more times to bring it back online. We are making use of app_offline.htm on deploy and I can see a recycle event occur in the event log when it detects app_offline.htm. It'll show the app shutdown and restart successfully most of the time but sometimes I get an error that it failed to shutdown gracefully. The only recourse is to shut down the app pool and then restart it and normally when this error occurs I have to shutdown/restart multiple times. I'll keep working on it but just wanted to put my experience out there. |
@BrennanConroy It's great to hear this really old issue is finally seeing a resolution. Since work is being done on this, is there any way to improve the error:
Like adding something to indicate what request or ApplicationStopped callback is holding up the shutdown? It would really go a long way to help address problematic code. Request Route, stack trace, anything to know what code is blocking the shutdown. Thanks! Edit: Credit to @luizfbicalho for making the same request earlier in this thread |
@randyshoopman what a brilliant idea. We've always suffered with that error, and despite wasting hours trying to figure out what's happening, have not been successful in finding the source. We just grudgingly accept it. |
An update to my last post. So I thought the DLL fix was working for us but it turns out that manual recycles no longer cause crashes but automatic recycles do. I left automatic recycling on with the default values for the last 3 days. Checking the IIS logs, the site went down 3 days ago until this morning when the normal daily 4 a.m. recycle brought it back online. We have another vanilla .net core site that is not crashing at all so this leads me to believe it's either something that we have implemented ourselves on this particular site or Umbraco itself. |
Still waiting and still affecting production. What's the status of this? What do we have to do to get some priority and resources on it? Is it on the team's radar at all? Is there a plan to gain confidence in the approach in the modified v2 module and then to get it published officially in a hosting bundle update? |
Yes, as mentioned above, it's not only on our radar, but we intend to release this fix in .NET 9. We're also trying to figure out a way to include the fix in a .NET 8 servicing release. |
Thanks, that sounds promising. We would really appreciate a fix for .NET8, as we are only able to run LTS releases. |
We're in the same situation, only LTS releases are permitted. What would happen if it is only released for .NET 9 and later, but a .NET 8 application was running behind the new version? If it isn't available for .NET 8, this is actually pretty likely to happen for us once .NET 10 is approved for internal use. I suppose the alternate question is, can .NET 9 and later run under the old (now current) ANCM until everything is updated to a version supported by the one with recycle improvements? (Hopefully it'll make the cut and my questions will be completely irrelevant!) |
Good question @MV10. The particular component that the fix is in is actually shared across installs, so our approach to servicing it is to ensure that it is backwards compatible with older supported versions. This comes with ups and downs obviously... on the bright side, fixes that we make can apply down-level easily. On the other hand, we have to be super careful to ensure that we don't break existing apps. We don't want anyone's existing app to break without them doing anything just because they installed a newer hosting bundle on their server. This is why we're being careful and ensuring we have enough bake time and testing (and considering things like opt-in switches for down-level) before getting this out there. We have heard loud and clear that people want this fix and we can't wait to solve this problem for everyone. |
Thanks for starting the PR for this issue @BrennanConroy. Keen to know which versions of DotNet will get this fix once merged. Seeing updates to dotnet6+ would be awesome. |
This will be opt-out in 9.0 and we're backporting as opt-in for 8.0. Since the change is in the IIS module which is installed globally on the machine and the module is compatible with all supported versions of ASP.NET Core you can use the 8.0 or 9.0 module with 6.0 and still get the fix. When installing the hosting bundle (which is what installs the IIS module) you can choose to ignore the runtime that comes with the installer if you want to continue using 6.0 but get the newer module. See https://learn.microsoft.com/aspnet/core/host-and-deploy/iis/hosting-bundle?view=aspnetcore-8.0#options for details. |
Great to see this problem being solved! Regarding the Hosting Bundle packages: some time ago I tried that "choose to ignore the runtime" option, but for some reason "install the Hosting Bundle with runtimes and then uninstall both runtimes" gave me a different setup than "install the Hosting Bundle only, without runtimes". Even the size of the Hosting Bundle was different. This "bug" seems to persist: on our servers all the DotNet Core apps only work with the 145MB Server Hosting (so I have to do a full install and then remove the runtimes). Since you are working on ANCM: some time ago I posted about this problem involving IIS, DotNet Core and Windows Update (#41377). The problem persists; for some reason I need to put IIS in manual mode, reboot the server, then do the full Windows Update install (Windows security patches, DotNet Framework security patches and DotNet Core patches). |
A little bit off-topic, but until we have this fix deployed, our monitoring is being configured to ignore Failed to gracefully shut down application XYZ error messages (which correlate to the recycle 503s, of which there are many since corporate policy still requires early-AM staggered-schedule recycles of all pools). Is there a reference somewhere of the various warnings and errors ANCM can emit? |
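For readers unfamiliar with where scheduled recycles like these are configured, here is a minimal sketch of staggered schedules in applicationHost.config. The pool names `ServiceA` and `ServiceB` and the times are hypothetical placeholders, not values from anyone's setup in this thread; only the `recycling` / `periodicRestart` / `schedule` elements are standard IIS configuration.

```xml
<!-- Fragment of applicationHost.config (sketch; pool names and times are hypothetical) -->
<system.applicationHost>
  <applicationPools>
    <add name="ServiceA">
      <recycling>
        <!-- time="00:00:00" disables the interval-based recycle so only the schedule applies -->
        <periodicRestart time="00:00:00">
          <schedule>
            <clear />
            <add value="03:00:00" />
          </schedule>
        </periodicRestart>
      </recycling>
    </add>
    <add name="ServiceB">
      <recycling>
        <periodicRestart time="00:00:00">
          <schedule>
            <clear />
            <!-- Offset from ServiceA so the two pools do not recycle at the same moment -->
            <add value="03:20:00" />
          </schedule>
        </periodicRestart>
      </recycling>
    </add>
  </applicationPools>
</system.applicationHost>
```

Staggering only spreads the recycles out. Each pool can still return 503s during its own recycle, as described above.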
Did this fix get into .NET 8.0.5 hosting bundle? |
The way I read it, it'll be released with .NET 9 in November, and backported to .NET 8.x.x at that time... |
According to milestone tags the fix is slated for the next release 8.0.6: https://github.com/dotnet/aspnetcore/issues?q=milestone%3A8.0.6%20is%3Aclosed%20label%3Aservicing-approved 🤞 |
@cun-dp is right, it'll ship in the 8.0.6 release in June. |
This fix is out in 9.0-preview4 now. And will be available in 8.0.7 in July-2024. I'm going to close and lock this issue so that this comment, which will include detailed instructions below, is the last one. For those of you who have been posting semi-unrelated issues, please file separate issues for them.
Instructions
When installing the hosting bundle (which is what installs the global IIS module) you can choose to ignore the runtime that comes with the installer if you want to continue using an older runtime but get the newer module's improvements. See https://learn.microsoft.com/aspnet/core/host-and-deploy/iis/hosting-bundle?view=aspnetcore-8.0#options for details. Otherwise, install the hosting bundle as normal to get the IIS module installed. There are 2 ways to modify the behavior of this change.
<aspNetCore processPath="dotnet" arguments="myapp.dll" stdoutLogEnabled="false" stdoutLogFile=".\logs\stdout">
  <handlerSettings>
    <!--
      Milliseconds to delay shutdown of the old app instance while the new instance starts.
      Note: This doesn't delay the handling of incoming requests.
    -->
    <handlerSetting name="shutdownDelay" value="5000" />
  </handlerSettings>
</aspNetCore>
If you set the value to 0, then it goes back to the old behavior. The default is 1000 milliseconds (1 second). For busy/slow machines you may want to increase this value. |
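As a rough illustration of where this setting sits in a complete file, here is a minimal web.config sketch that sets `shutdownDelay` to 0 to return to the old behavior. The surrounding elements follow the usual layout of a published ASP.NET Core web.config. The `myapp.dll` name is carried over from the snippet above, and the `hostingModel` value is an assumption, so adjust both to match your own app.

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- Sketch only: typical published web.config layout; hostingModel is an assumption -->
<configuration>
  <location path="." inheritInChildApplications="false">
    <system.webServer>
      <handlers>
        <add name="aspNetCore" path="*" verb="*" modules="AspNetCoreModuleV2" resourceType="Unspecified" />
      </handlers>
      <aspNetCore processPath="dotnet" arguments="myapp.dll" stdoutLogEnabled="false" stdoutLogFile=".\logs\stdout" hostingModel="inprocess">
        <handlerSettings>
          <!-- 0 restores the pre-fix recycle behavior; omit the setting to use the 1000 ms default -->
          <handlerSetting name="shutdownDelay" value="0" />
        </handlerSettings>
      </aspNetCore>
    </system.webServer>
  </location>
</configuration>
```

Because the setting lives in the app's own web.config, it applies per application rather than machine-wide.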
Is there an existing issue for this?
Describe the bug
IIS app pool throws 503 errors during recycles. This is a known issue with the ANCM module that was previously reported in #10117, which has 33 likes and a 3-year discussion. It was never fixed, but it was automatically closed by a bot "as a clean-up due to lack of discussion".
P.S. This is not a deployment problem. There are many scenarios when IIS app pool is being recycled outside of our control (adding/removing SSL certificates, changing IP addresses to listen to, etc... basically, touching any IIS setting causes a recycle - and 503 errors are unacceptable for high-availability scenarios).
.NET Framework was free of this bug.
Expected Behavior
No errors during recycles.
Steps To Reproduce
see the issue linked #10117
Exceptions (if any)
No response
.NET Version
5.0, 6.0, 7.0, 8.0
Anything else?
No response