Data Protection Refresh causing site hang #30752

Closed

michaelpaulus opened this issue Mar 8, 2021 · 18 comments
Labels
affected-medium (This issue impacts approximately half of our customers), area-dataprotection (Includes: DataProtection), enhancement (This issue represents an ask for a new feature or an enhancement to an existing one), severity-major (This label is used by an internal tool)

Milestone: Backlog

Comments

@michaelpaulus

Describe the bug

Our site hangs, sometimes unrecoverably, during a refresh of the data protection provider. I was able to capture a memory dump while it was occurring. I have been trying to reproduce the issue with local load testing, without any luck. Most of the time this happens, our response time goes way up:

[graph: response_time]

as does our 500 count:

[graph: server_errors]

and our thread count:

[graph: thread_count]

Here is a picture of the stack trace from the memory dump:

[screenshot: refresh_thread stack trace]

and one of the threads waiting for that one:

[screenshot: wait_thread stack trace]

To Reproduce

Sorry, but I've not been able to repro this in a testing environment.

Further technical details

.NET SDK (reflecting any global.json):
Version: 5.0.103
Commit: 72dec52dbd

Runtime Environment:
OS Name: Windows
OS Version: 10.0.14393
OS Platform: Windows
RID: win10-x86
Base Path: D:\Program Files (x86)\dotnet\sdk\5.0.103\

Host (useful for support):
Version: 5.0.3
Commit: c636bbdc8a

.NET SDKs installed:
1.1.14 [D:\Program Files (x86)\dotnet\sdk]
2.1.518 [D:\Program Files (x86)\dotnet\sdk]
2.1.521 [D:\Program Files (x86)\dotnet\sdk]
2.2.109 [D:\Program Files (x86)\dotnet\sdk]
3.1.111 [D:\Program Files (x86)\dotnet\sdk]
3.1.112 [D:\Program Files (x86)\dotnet\sdk]
5.0.103 [D:\Program Files (x86)\dotnet\sdk]

.NET runtimes installed:
Microsoft.AspNetCore.All 2.1.25 [D:\Program Files (x86)\dotnet\shared\Microsoft.AspNetCore.All]
Microsoft.AspNetCore.All 2.2.14 [D:\Program Files (x86)\dotnet\shared\Microsoft.AspNetCore.All]
Microsoft.AspNetCore.App 2.1.25 [D:\Program Files (x86)\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 2.2.14 [D:\Program Files (x86)\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 3.0.3 [D:\Program Files (x86)\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 3.1.8 [D:\Program Files (x86)\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 3.1.12 [D:\Program Files (x86)\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 5.0.3 [D:\Program Files (x86)\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.NETCore.App 1.0.16 [D:\Program Files (x86)\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 1.1.13 [D:\Program Files (x86)\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 2.0.9 [D:\Program Files (x86)\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 2.1.25 [D:\Program Files (x86)\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 2.2.14 [D:\Program Files (x86)\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 3.0.3 [D:\Program Files (x86)\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 3.1.12 [D:\Program Files (x86)\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 5.0.3 [D:\Program Files (x86)\dotnet\shared\Microsoft.NETCore.App]

@blowdart added the area-dataprotection label Mar 8, 2021
@blowdart (Contributor) commented Mar 8, 2021

You're using the AzureExtensions package? You'll need to post in their repo, I'm afraid; we can't transfer issues over.

@blowdart added the External and ✔️ Resolution: Answered labels Mar 8, 2021
@ghost added the Status: Resolved label Mar 8, 2021
@ghost commented Mar 10, 2021

This issue has been resolved and has not had any activity for 1 day. It will be closed for housekeeping purposes.

See our Issue Management Policies for more information.

@pakrym (Contributor) commented Mar 12, 2021

This looks like thread pool starvation caused by data protection taking a lock on the request thread to wait for the refresh to complete. The blocked request threads starve the thread pool, so the continuations for the blob requests cannot run.

cc @davidfowl
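
In miniature, the problem pattern looks something like this (names are illustrative, not the actual data protection internals):

```csharp
using System.Threading.Tasks;

// Illustration only (made-up names, not the real data protection internals):
// a synchronous read path that blocks a request thread on an async refresh.
// Under load, many request threads pile up here while the refresh continuation
// waits for a free thread-pool thread that the blocked requests are occupying.
public sealed class KeyRingCache
{
    private readonly object _refreshLock = new object();

    public object GetCurrentKeyRing()
    {
        lock (_refreshLock)
        {
            // Sync-over-async: blocks the request thread until the blob
            // fetch completes.
            return FetchKeyRingFromBlobAsync().GetAwaiter().GetResult();
        }
    }

    // Stands in for the real network call to blob storage.
    private static Task<object> FetchKeyRingFromBlobAsync()
        => Task.FromResult<object>(new object());
}
```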

@pakrym reopened this Mar 12, 2021
@davidfowl (Member)

the gift that keeps on giving 😢

@pakrym (Contributor) commented Mar 12, 2021

DataProtection needs asyncification.

@davidfowl (Member)

Maybe we can do something like configuration does and fire a change token when new data is ready. I need to investigate this. I don't think it makes sense to change Protect and Unprotect to be async.
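
Roughly the shape configuration uses today, as a general sketch (not an actual data protection API):

```csharp
using System.Threading;
using Microsoft.Extensions.Primitives;

// General sketch of the configuration-style pattern: a background producer
// publishes refreshed data and fires a change token; consumers always read
// the latest published value and never block on the refresh itself.
public sealed class RefreshedDataSource<T> where T : class
{
    private T _current;
    private CancellationTokenSource _cts = new CancellationTokenSource();

    public RefreshedDataSource(T initial) => _current = initial;

    // Always returns the last successfully published value.
    public T Current => Volatile.Read(ref _current);

    public IChangeToken GetChangeToken() => new CancellationChangeToken(_cts.Token);

    // Called from a background refresh loop when new data is ready.
    public void Publish(T refreshed)
    {
        Volatile.Write(ref _current, refreshed);
        var previous = Interlocked.Exchange(ref _cts, new CancellationTokenSource());
        previous.Cancel(); // signals everyone registered via GetChangeToken()
    }
}
```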

@pakrym (Contributor) commented Mar 12, 2021

What if we add a TryRollKeyRingAsync() method that we unconditionally run as part of every request? This would ensure that during the request, Protect and Unprotect calls never need to make any async calls.
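
Something like this, where IKeyRingManager and TryRollKeyRingAsync are hypothetical names just to show the shape:

```csharp
// Hypothetical sketch of the proposal; IKeyRingManager and TryRollKeyRingAsync
// do not exist today. The idea: refresh/roll the key ring asynchronously at
// the start of each request so that Protect/Unprotect later in the request
// never have to block on I/O. Assumes an IApplicationBuilder named "app" and
// using Microsoft.Extensions.DependencyInjection for GetRequiredService.
app.Use(async (context, next) =>
{
    var keyRingManager = context.RequestServices
        .GetRequiredService<IKeyRingManager>();

    // Would be a no-op unless the key ring is missing or close to expiry.
    await keyRingManager.TryRollKeyRingAsync(context.RequestAborted);

    await next();
});
```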

@michaelpaulus (Author)

Some more background on our particular issue, because it doesn't make any sense to me:

For a while now, this issue has popped up here and there: 500 errors just spike on an instance, but usually auto-heal recycles the instance and we are fine. Not that I love that it happens at all, but we couldn't figure out the cause, and the impact on customers was limited.

Last week, two of our apps spiked at different times, and no matter what we did (restart the app, stop and start it, ...), we always had an instance that would just return 500 errors. Auto-heal would fire constantly, so we had to turn it off. This reminded me of the early days of ASP.NET Core, when we had an issue with data protection initializing under heavy load and causing thread pool starvation. Our workaround back then was to encrypt something before leaving the Startup class, so that data protection got initialized before requests actually started hitting the service. That worked, but I noticed in one of the releases that the issue was fixed, so we removed the workaround.

After we opened a support case, Azure support asked us to change auto-heal to capture a memory dump when it heals instead of just restarting. We did, and that is how I noticed it had to do with the refresh code. The confusing part is that it would happen immediately after a restart, as soon as load hit; that made no sense, and I was unable to repro it locally with load testing. I even downloaded parts of your code and replaced your services locally (this was never deployed) to change the refresh interval from the hard-coded 24 hours to every 1 minute, and put a 10-second delay in the blob request, but no luck.

What we ended up doing, and this makes no sense at all, is putting our old workaround back into startup to encrypt a string (essentially the snippet below). We deployed it, swapped, and the sites started running perfectly. That was last Friday, and we have not had a single issue all week.
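
Simplified, with a placeholder purpose string:

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.DataProtection;

public class Startup
{
    public void Configure(IApplicationBuilder app, IDataProtectionProvider dataProtectionProvider)
    {
        // Encrypt a throwaway value so the key ring is fully loaded before
        // the service starts taking traffic, instead of lazily on the first
        // request under load.
        var protector = dataProtectionProvider.CreateProtector("StartupWarmup");
        _ = protector.Protect("warmup");

        // ... rest of the pipeline configuration ...
    }
}
```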

[graph: server error count]

I can guarantee we didn't see a drop in load at this time, but you can clearly see the drop in server errors at 3:07 PM, when we deployed the change.

As a side note, we did find that swapping the slots fixed the issue temporarily one day, but on this day we swapped 3 or 4 times on this site before making the code change, with no resolution. Only after we deployed the encrypt-on-startup trick was the issue solved.

I know it makes no sense, especially since the refresh shouldn't happen at startup at all, and the stack trace clearly shows it is coming from the refresh and not the initial startup.

@pakrym (Contributor) commented Mar 12, 2021

> What we ended up doing, and this makes no sense at all, is putting our old workaround back into startup to encrypt a string.

It actually makes all kinds of sense. When you run an encrypt during startup, the key management operations that usually take the lock on a request thread happen before the service receives any requests, avoiding the thread pool starvation.

Otherwise, as soon as the service starts, it tries to load/rotate keys while under load, ending up with a bunch of requests taking up threads and preventing the rotation/loading process from finishing.

@michaelpaulus (Author)

@pakrym The reason I say this makes no sense is that at startup the hosted service does the initialization:

https://github.com/dotnet/aspnetcore/blob/main/src/DataProtection/DataProtection/src/Internal/DataProtectionHostedService.cs

and from my tests, it appears to be working correctly.
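
Paraphrased, the hosted service boils down to roughly this (not a verbatim copy of the linked source):

```csharp
// Rough paraphrase of DataProtectionHostedService.StartAsync (see the link
// above; _keyRingProvider and _logger are injected fields): eagerly load the
// key ring once at startup, but swallow failures so a bad key ring doesn't
// prevent the host from starting.
public Task StartAsync(CancellationToken cancellationToken)
{
    try
    {
        // Loads (and caches) the key ring before any requests are served.
        _ = _keyRingProvider.GetCurrentKeyRing();
    }
    catch (Exception ex)
    {
        _logger.LogWarning(ex, "Failed to preload the data protection key ring.");
    }

    return Task.CompletedTask;
}
```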

@pakrym (Contributor) commented Mar 12, 2021

You are right. This should've mitigated the issue on startup.

@ghost commented Mar 13, 2021

This issue has been resolved and has not had any activity for 1 day. It will be closed for housekeeping purposes.

See our Issue Management Policies for more information.

@ghost closed this as completed Mar 13, 2021
@davidfowl reopened this Mar 13, 2021
@davidfowl removed the ✔️ Resolution: Answered, External, and Status: Resolved labels Mar 13, 2021
@HaoK added the affected-medium, enhancement, and severity-major labels Mar 15, 2021 (via ASP.NET Core Issue Ranking)
@michaelpaulus (Author)

This issue popped up again for us yesterday. We had 4 auto-heal events that each gave us a memory dump, and all of the dumps indicate the same problem: thread pool exhaustion while waiting on the refresh of the keys.

[screenshots: matching stack traces from the memory dumps]

@mkArtakMSFT added this to the Backlog milestone Mar 18, 2021
@ghost commented Mar 18, 2021

We've moved this issue to the Backlog milestone. This means that it is not going to be worked on in the coming release. We will reassess the backlog following the current release and consider this item at that time. To learn more about our issue management process and to set better expectations for different types of issues, you can read our Triage Process.

@blowdart (Contributor)

> What if we add a TryRollKeyRingAsync() method that we unconditionally run as part of every request? This would ensure that during the request, Protect and Unprotect calls never need to make any async calls.

This seems like an easy DoS; you want to minimize calls to roll keys as much as possible, and we throttle them by design.

Blocking on startup is also by design; there's no point in letting other requests run until the key ring is filled. The block is, of course, fine with the other providers, because they're all local resource hits. @pakrym, maybe you could fudge it by doing a protect inside your own initialization once it's all set?

@pakrym (Contributor) commented Mar 18, 2021

> This seems like an easy DoS; you want to minimize calls to roll keys as much as possible, and we throttle them by design.

We would only roll/refresh when the key is close to expiry, and the subsequent Protect/Unprotect call would've rolled it anyway.

> @pakrym, maybe you could fudge it by doing a protect inside your own initialization once it's all set?

Nope, we don't use protect inside the key provider.

@Pikeman commented Feb 15, 2022

Our site is also hanging from this sometimes. We see the same call stack.

We use ProtectKeysWithAzureKeyVault, and we tend to see some HTTP 401s from Azure Key Vault around the time our site hangs.

Even with the DataProtectionHostedService in play, we find that restarting our process does not reliably restore service. Since the hosted service makes a single attempt at loading the key ring and, on error, lets the process start handling requests anyway, maybe that initial attempt fails (for whatever reason) and the process then runs into the same kind of deadlock seen above?

putting our old work around back into the startup

We will try a workaround as well, along these lines:

- using Microsoft.AspNetCore.DataProtection.KeyManagement.Internal;
- On startup, inject an IKeyRingProvider keyRingProvider and call _ = keyRingProvider.GetCurrentKeyRing();
- If it crashes, we'll let the hosting environment restart the process and repeat until it comes up.
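
Roughly like this (a sketch; IKeyRingProvider lives in an Internal namespace, so it may change between releases):

```csharp
using System;
using Microsoft.AspNetCore.DataProtection.KeyManagement.Internal;
using Microsoft.Extensions.DependencyInjection;

public static class DataProtectionWarmup
{
    // Call from startup, after the container is built. If the key ring can't
    // be loaded this throws, the process dies, and the hosting environment
    // restarts it until it comes up healthy.
    public static void WarmUp(IServiceProvider services)
    {
        var keyRingProvider = services.GetRequiredService<IKeyRingProvider>();
        _ = keyRingProvider.GetCurrentKeyRing();
    }
}
```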

> Maybe we can do something like configuration does and fire a change token when new data is ready. I need to investigate this. I don't think it makes sense to change Protect and Unprotect to be async.

That sounds sensible. During key rotation, the new key should be available a good while before the previous one expires anyway, so it would be nice to get the new key ring ready in the background instead of letting requests grab that lock 😅

Cheers

@amcasey (Member) commented Oct 4, 2024

I'm going to close this issue. Please post additional feedback in #36157 so we can track it centrally and nothing gets dropped.

@amcasey closed this as completed Oct 4, 2024
@github-actions bot locked and limited conversation to collaborators Nov 4, 2024