Data Protection Refresh causing site hang #30752

Closed

michaelpaulus opened this issue Mar 8, 2021 · 18 comments
Labels
affected-medium (This issue impacts approximately half of our customers), area-dataprotection (Includes: DataProtection), enhancement (This issue represents an ask for a new feature or an enhancement to an existing one), severity-major (This label is used by an internal tool)

Milestone: Backlog

Comments

@michaelpaulus

Describe the bug

Our site hangs, sometimes unrecoverably, during a refresh of the data protection provider. I was able to capture a memory dump while it was occurring. I have been trying to reproduce the issue with local load testing, without any luck. Most of the time this happens, our response time goes way up:

[graph: response_time]

as does our 500 count:

[graph: server_errors]

and our thread count:

[graph: thread_count]

Here is a picture of the stack trace from the memory dump:

[screenshot: refresh_thread stack trace]

and one of the threads waiting for that one:

[screenshot: wait_thread stack trace]

To Reproduce

Sorry, but I've not been able to repro this in a testing environment.

Further technical details

.NET SDK (reflecting any global.json):
Version: 5.0.103
Commit: 72dec52dbd

Runtime Environment:
OS Name: Windows
OS Version: 10.0.14393
OS Platform: Windows
RID: win10-x86
Base Path: D:\Program Files (x86)\dotnet\sdk\5.0.103\

Host (useful for support):
Version: 5.0.3
Commit: c636bbdc8a

.NET SDKs installed:
1.1.14 [D:\Program Files (x86)\dotnet\sdk]
2.1.518 [D:\Program Files (x86)\dotnet\sdk]
2.1.521 [D:\Program Files (x86)\dotnet\sdk]
2.2.109 [D:\Program Files (x86)\dotnet\sdk]
3.1.111 [D:\Program Files (x86)\dotnet\sdk]
3.1.112 [D:\Program Files (x86)\dotnet\sdk]
5.0.103 [D:\Program Files (x86)\dotnet\sdk]

.NET runtimes installed:
Microsoft.AspNetCore.All 2.1.25 [D:\Program Files (x86)\dotnet\shared\Microsoft.AspNetCore.All]
Microsoft.AspNetCore.All 2.2.14 [D:\Program Files (x86)\dotnet\shared\Microsoft.AspNetCore.All]
Microsoft.AspNetCore.App 2.1.25 [D:\Program Files (x86)\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 2.2.14 [D:\Program Files (x86)\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 3.0.3 [D:\Program Files (x86)\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 3.1.8 [D:\Program Files (x86)\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 3.1.12 [D:\Program Files (x86)\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.AspNetCore.App 5.0.3 [D:\Program Files (x86)\dotnet\shared\Microsoft.AspNetCore.App]
Microsoft.NETCore.App 1.0.16 [D:\Program Files (x86)\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 1.1.13 [D:\Program Files (x86)\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 2.0.9 [D:\Program Files (x86)\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 2.1.25 [D:\Program Files (x86)\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 2.2.14 [D:\Program Files (x86)\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 3.0.3 [D:\Program Files (x86)\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 3.1.12 [D:\Program Files (x86)\dotnet\shared\Microsoft.NETCore.App]
Microsoft.NETCore.App 5.0.3 [D:\Program Files (x86)\dotnet\shared\Microsoft.NETCore.App]

@blowdart added the area-dataprotection label Mar 8, 2021
@blowdart (Contributor) commented Mar 8, 2021

You're using the AzureExtensions package? You'll need to post in their repo, I'm afraid; we can't transfer issues over.

@blowdart added the External and ✔️ Resolution: Answered labels Mar 8, 2021
@ghost added the Status: Resolved label Mar 8, 2021
@ghost commented Mar 10, 2021

This issue has been resolved and has not had any activity for 1 day. It will be closed for housekeeping purposes.

See our Issue Management Policies for more information.

@pakrym (Contributor) commented Mar 12, 2021

This looks like thread pool starvation caused by data protection taking a lock on the request thread to wait for the refresh to complete. The blocked request threads starve the thread pool, so the continuations for the blob requests cannot run.

cc @davidfowl
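
In miniature, the problem pattern looks something like this (names are illustrative, not the actual data protection internals):

```csharp
using System.Threading.Tasks;

// Illustration only (made-up names, not the real data protection internals):
// a synchronous read path that blocks a request thread on an async refresh.
// Under load, many request threads pile up here while the refresh continuation
// waits for a free thread-pool thread that the blocked requests are occupying.
public sealed class KeyRingCache
{
    private readonly object _refreshLock = new object();

    public object GetCurrentKeyRing()
    {
        lock (_refreshLock)
        {
            // Sync-over-async: blocks the request thread until the blob
            // fetch completes.
            return FetchKeyRingFromBlobAsync().GetAwaiter().GetResult();
        }
    }

    // Stands in for the real network call to blob storage.
    private static Task<object> FetchKeyRingFromBlobAsync()
        => Task.FromResult<object>(new object());
}
```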

@pakrym reopened this Mar 12, 2021
@davidfowl (Member)

the gift that keeps on giving 😢

@pakrym (Contributor) commented Mar 12, 2021

DataProtection needs asyncification.

@davidfowl (Member)

Maybe we can do something like configuration does and fire a change token when new data is ready. I need to investigate this. I don't think it makes sense to change Protect and Unprotect to be async.
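
Roughly the shape configuration uses today, as a general sketch (not an actual data protection API):

```csharp
using System.Threading;
using Microsoft.Extensions.Primitives;

// General sketch of the configuration-style pattern: a background producer
// publishes refreshed data and fires a change token; consumers always read
// the latest published value and never block on the refresh itself.
public sealed class RefreshedDataSource<T> where T : class
{
    private T _current;
    private CancellationTokenSource _cts = new CancellationTokenSource();

    public RefreshedDataSource(T initial) => _current = initial;

    // Always returns the last successfully published value.
    public T Current => Volatile.Read(ref _current);

    public IChangeToken GetChangeToken() => new CancellationChangeToken(_cts.Token);

    // Called from a background refresh loop when new data is ready.
    public void Publish(T refreshed)
    {
        Volatile.Write(ref _current, refreshed);
        var previous = Interlocked.Exchange(ref _cts, new CancellationTokenSource());
        previous.Cancel(); // signals everyone registered via GetChangeToken()
    }
}
```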

@pakrym (Contributor) commented Mar 12, 2021

What if we add a TryRollKeyRingAsync() method that we unconditionally run as part of every request? This would ensure that during the request, Protect and Unprotect calls never need to make any async calls.
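
Something like this, where IKeyRingManager and TryRollKeyRingAsync are hypothetical names just to show the shape:

```csharp
// Hypothetical sketch of the proposal; IKeyRingManager and TryRollKeyRingAsync
// do not exist today. The idea: refresh/roll the key ring asynchronously at
// the start of each request so that Protect/Unprotect later in the request
// never have to block on I/O. Assumes an IApplicationBuilder named "app" and
// using Microsoft.Extensions.DependencyInjection for GetRequiredService.
app.Use(async (context, next) =>
{
    var keyRingManager = context.RequestServices
        .GetRequiredService<IKeyRingManager>();

    // Would be a no-op unless the key ring is missing or close to expiry.
    await keyRingManager.TryRollKeyRingAsync(context.RequestAborted);

    await next();
});
```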

@michaelpaulus (Author)

Some more background on our particular issue, because it doesn't make any sense to me:

For a while now, this issue has popped up here and there: 500 errors just spike on an instance, but usually auto-heal recycles the instance and we are fine. Not that I love that it happens at all, but we couldn't figure out the cause, and the impact on customers was limited.

Last week, two of our apps spiked at different times, and no matter what we did (restart the app, stop and start it, ...), we always had an instance that would just return 500 errors. Auto-heal would fire constantly, so we had to turn it off. This reminded me of the early days of ASP.NET Core, when we had an issue with data protection initializing under heavy load and causing thread pool starvation. Our workaround back then was to encrypt something before leaving the Startup class, so that data protection got initialized before requests actually started hitting the service. That worked, but I noticed in one of the releases that the issue was fixed, so we removed the workaround.

After we opened a support case, Azure support asked us to change auto-heal to capture a memory dump when it heals instead of just restarting. We did, and that is how I noticed it had to do with the refresh code. The confusing part is that it would happen immediately after a restart, as soon as load hit; that made no sense, and I was unable to repro it locally with load testing. I even downloaded parts of your code and replaced your services locally (this was never deployed) to change the refresh interval from the hard-coded 24 hours to every 1 minute, and put a 10-second delay in the blob request, but no luck.

What we ended up doing, and this makes no sense at all, is putting our old workaround back into startup to encrypt a string (essentially the snippet below). We deployed it, swapped, and the sites started running perfectly. That was last Friday, and we have not had a single issue all week.
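
Simplified, with a placeholder purpose string:

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.DataProtection;

public class Startup
{
    public void Configure(IApplicationBuilder app, IDataProtectionProvider dataProtectionProvider)
    {
        // Encrypt a throwaway value so the key ring is fully loaded before
        // the service starts taking traffic, instead of lazily on the first
        // request under load.
        var protector = dataProtectionProvider.CreateProtector("StartupWarmup");
        _ = protector.Protect("warmup");

        // ... rest of the pipeline configuration ...
    }
}
```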

[graph: server error count]

I can guarantee we didn't see a drop in load at this time, but you can clearly see the drop in server errors at 3:07 PM, when we deployed the change.

As a side note, we did find that swapping the slots fixed the issue temporarily one day, but on this day we swapped 3 or 4 times on this site before making the code change, with no resolution. Only after we deployed the encrypt-on-startup trick was the issue solved.

I know it makes no sense, especially since the refresh shouldn't happen at startup at all, and the stack trace clearly shows it is coming from the refresh and not the initial startup.

@pakrym (Contributor) commented Mar 12, 2021

> What we ended up doing, and this makes no sense at all, is putting our old workaround back into startup to encrypt a string.

It actually makes all kinds of sense. When you run an encrypt during startup, the key management operations that usually take the lock on a request thread happen before the service receives any requests, avoiding the thread pool starvation.

Otherwise, as soon as the service starts, it tries to load/rotate keys while under load, ending up with a bunch of requests taking up threads and preventing the rotation/loading process from finishing.

@michaelpaulus (Author)

@pakrym The reason I say this makes no sense is that at startup the hosted service does the initialization:

https://github.com/dotnet/aspnetcore/blob/main/src/DataProtection/DataProtection/src/Internal/DataProtectionHostedService.cs

and from my tests, it appears to be working correctly.
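
Paraphrased, the hosted service boils down to roughly this (not a verbatim copy of the linked source):

```csharp
// Rough paraphrase of DataProtectionHostedService.StartAsync (see the link
// above; _keyRingProvider and _logger are injected fields): eagerly load the
// key ring once at startup, but swallow failures so a bad key ring doesn't
// prevent the host from starting.
public Task StartAsync(CancellationToken cancellationToken)
{
    try
    {
        // Loads (and caches) the key ring before any requests are served.
        _ = _keyRingProvider.GetCurrentKeyRing();
    }
    catch (Exception ex)
    {
        _logger.LogWarning(ex, "Failed to preload the data protection key ring.");
    }

    return Task.CompletedTask;
}
```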

@pakrym (Contributor) commented Mar 12, 2021

You are right. This should've mitigated the issue on startup.

@ghost commented Mar 13, 2021

This issue has been resolved and has not had any activity for 1 day. It will be closed for housekeeping purposes.

See our Issue Management Policies for more information.

@ghost closed this as completed Mar 13, 2021
@davidfowl reopened this Mar 13, 2021
@davidfowl removed the ✔️ Resolution: Answered, External, and Status: Resolved labels Mar 13, 2021
@HaoK added the affected-medium, enhancement, and severity-major labels Mar 15, 2021 (via ASP.NET Core Issue Ranking)
@michaelpaulus (Author)

This issue popped up again for us yesterday. We had 4 auto-heal events that each gave us a memory dump, and all of the dumps indicate the same problem: thread pool exhaustion while waiting on the refresh of the keys.

[screenshots: matching stack traces from the memory dumps]

@mkArtakMSFT added this to the Backlog milestone Mar 18, 2021
@ghost commented Mar 18, 2021

We've moved this issue to the Backlog milestone. This means that it is not going to be worked on in the coming release. We will reassess the backlog following the current release and consider this item at that time. To learn more about our issue management process and to set better expectations for different types of issues, you can read our Triage Process.

@blowdart (Contributor)

> What if we add a TryRollKeyRingAsync() method that we unconditionally run as part of every request? This would ensure that during the request, Protect and Unprotect calls never need to make any async calls.

This seems like an easy DoS; you want to minimize calls to roll keys as much as possible, and we throttle them by design.

Blocking on startup is also by design; there's no point in letting other requests run until the key ring is filled. The block is, of course, fine with the other providers, because they're all local resource hits. @pakrym, maybe you could fudge it by doing a protect inside your own initialization once it's all set?

@pakrym (Contributor) commented Mar 18, 2021

> This seems like an easy DoS; you want to minimize calls to roll keys as much as possible, and we throttle them by design.

We would only roll/refresh when the key is close to expiry, and the subsequent Protect/Unprotect call would've rolled it anyway.

> @pakrym, maybe you could fudge it by doing a protect inside your own initialization once it's all set?

Nope, we don't use protect inside the key provider.

@Pikeman commented Feb 15, 2022

Our site is also hanging from this sometimes. We see the same call stack.

We use ProtectKeysWithAzureKeyVault, and we tend to see some HTTP 401s from Azure Key Vault around the time our site hangs.

Even with the DataProtectionHostedService in play, we find that restarting our process does not reliably restore service. Since the hosted service makes a single attempt at loading the key ring and, on error, lets the process start handling requests anyway, maybe that initial attempt fails (for whatever reason) and the process then runs into the same kind of deadlock seen above?

putting our old work around back into the startup

We will try a workaround as well, along these lines:

- using Microsoft.AspNetCore.DataProtection.KeyManagement.Internal;
- On startup, inject an IKeyRingProvider keyRingProvider and call _ = keyRingProvider.GetCurrentKeyRing();
- If it crashes, we'll let the hosting environment restart the process and repeat until it comes up.
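
Roughly like this (a sketch; IKeyRingProvider lives in an Internal namespace, so it may change between releases):

```csharp
using System;
using Microsoft.AspNetCore.DataProtection.KeyManagement.Internal;
using Microsoft.Extensions.DependencyInjection;

public static class DataProtectionWarmup
{
    // Call from startup, after the container is built. If the key ring can't
    // be loaded this throws, the process dies, and the hosting environment
    // restarts it until it comes up healthy.
    public static void WarmUp(IServiceProvider services)
    {
        var keyRingProvider = services.GetRequiredService<IKeyRingProvider>();
        _ = keyRingProvider.GetCurrentKeyRing();
    }
}
```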

> Maybe we can do something like configuration does and fire a change token when new data is ready. I need to investigate this. I don't think it makes sense to change Protect and Unprotect to be async.

That sounds sensible. During key rotation, the new key should be available a good while before the previous one expires anyway, so it would be nice to get the new key ring ready in the background instead of letting requests grab that lock 😅

Cheers

@amcasey (Member) commented Oct 4, 2024

I'm going to close this issue. Please post additional feedback in #36157 so we can track it centrally and nothing gets dropped.

@amcasey closed this as completed Oct 4, 2024
@github-actions bot locked and limited conversation to collaborators Nov 4, 2024