
Data Protection Key Generation Race #52678


Closed
amcasey opened this issue Dec 8, 2023 · 8 comments
Labels
area-dataprotection Includes: DataProtection

Comments

@amcasey
Member

amcasey commented Dec 8, 2023

Data Protection normally generates a new key 48 hours before the current default key expires, so that all instances will refresh their keyrings before it is adopted. However, there's a corner case where the app isn't running at that time and an activated key is required immediately, in which case a key is generated with activation time equal(ish) to creation time. If multiple instances do this at the same time, it's possible for whichever publishes first to fail to observe the keys generated by other instances (even in the absence of clock skew), resulting in one or more instances being unable to decrypt data from other instances.

Idea: When an immediately-activated key is generated, arrange to resync the keyring a few minutes later to heal.
Idea: Allow users to increase the 48 hour window to account for services that are (largely) inactive on weekends.

Extracted from #52561
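
For context, here is a rough sketch of the two scenarios described above in terms of the public `IKeyManager` API (illustrative only; the library performs these steps internally, and the `services` provider here is just scaffolding):

```csharp
using System;
using Microsoft.AspNetCore.DataProtection.KeyManagement;
using Microsoft.Extensions.DependencyInjection;

// 'services' comes from an app configured with AddDataProtection(); this method is scaffolding.
static void IllustrateKeyCreation(IServiceProvider services)
{
    var keyManager = services.GetRequiredService<IKeyManager>();
    var now = DateTimeOffset.UtcNow;

    // Normal rollover: the replacement key is created ~48h before it activates, giving every
    // instance time to refresh its key ring cache before the key is actually used.
    keyManager.CreateNewKey(now.AddHours(48), now.AddDays(92)); // activation, expiration

    // Cold-start corner case: no activated key exists yet, so a key is created with
    // activation ≈ creation; peers may start using their own such keys before they see this one.
    keyManager.CreateNewKey(now, now.AddDays(90)); // activation, expiration
}
```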

@ghost

ghost commented Jan 26, 2024

Thanks for contacting us.

We're moving this issue to the .NET 9 Planning milestone for future evaluation/consideration. We would like to keep this around to collect more feedback, which can help us prioritize this work. We will re-evaluate this issue during our next planning meeting(s).
If we later determine that the issue has no community involvement, or that it's a very rare and low-impact issue, we will close it so that the team can focus on more important and high-impact issues.
To learn more about what to expect next and how this issue will be handled, you can read more about our triage process here.

amcasey added a commit to amcasey/aspnetcore that referenced this issue Mar 12, 2024
… key is generated

If a key is required and none is available, one will be generated and immediately activated. Immediately-activated keys are problematic because they may be used before they have propagated to other app instances. What's more, in a scenario where one instance needs to generate an immediately-activated key, it's likely that other instances will need to do the same. A given instance has no control over when other instances pick up its immediately-activated keys, but it can do its best to pick up theirs. Therefore, every time it generates an immediately-activated key, it assumes other instances may have done so as well and pre-emptively schedules a refresh of its own key ring cache.

This problem is already partially (and complementarily) addressed by `KeyRingProvider.InAutoRefreshWindow`, which allows missing keys to be re-fetched from the backing repository for two minutes after app startup. This covers the case where the short expiration period is too short to catch all updates, but doesn't help with cases where an immediately-activated key is required after initial startup.

Sadly, this still does not completely address the issue: instances could still use keys unknown to their peers during the 15-second refresh window. To fix the issue properly, we'd need to know the order in which the keys had been persisted to the repository.

Part of dotnet#52678, which is part of dotnet#36157
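
As a rough sketch of the refresh-scheduling idea described above (not the actual `KeyRingProvider` code; the type name and the 24-hour/early-refresh figures are placeholders):

```csharp
using System;

// Conceptual sketch only: when an immediately-activated key is generated, pull the next
// key ring cache refresh forward so this instance re-reads the repository soon and picks
// up keys its peers may have generated at the same time.
sealed class KeyRingCacheClock
{
    private readonly TimeSpan _normalCacheLifetime = TimeSpan.FromHours(24);
    private DateTimeOffset _cacheExpiration;

    public void OnKeyRingRefreshed(DateTimeOffset now)
        => _cacheExpiration = now + _normalCacheLifetime;

    public bool ShouldRefresh(DateTimeOffset now)
        => now >= _cacheExpiration;

    // Called after this instance generates an immediately-activated key: assume peers may
    // have done the same and schedule an early repository read instead of waiting ~24h.
    public void OnImmediatelyActivatedKeyCreated(DateTimeOffset now, TimeSpan earlyRefreshDelay)
    {
        var early = now + earlyRefreshDelay;
        if (early < _cacheExpiration)
        {
            _cacheExpiration = early;
        }
    }
}
```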
amcasey added a commit to amcasey/aspnetcore that referenced this issue Mar 23, 2024
This code is trying to ensure that the selected key can be decrypted (i.e. is usable). It may fail if, for example, Azure Key Vault is unreachable due to connectivity issues. If it fails, there's a log message and then an immediately-activated key will be generated. An immediately-activated key can cause problems for sessions making requests to multiple app instances, and those problems won't obviously be connected to the (almost silent) failure in CanCreateAuthenticatedEncryptor. Rather than effectively swallowing such errors, we should allow some retries.

Part of dotnet#52678, which is part of dotnet#36157
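
As a rough sketch of the retry pattern described above (the method, the probe delegate, the attempt count, and the delays are assumptions, not the actual implementation):

```csharp
using System;
using System.Threading.Tasks;

// Illustrative retry loop around a "can this key be used?" probe.
static async Task<bool> CanUseKeyWithRetriesAsync(Func<bool> canCreateAuthenticatedEncryptor, int maxAttempts = 3)
{
    for (var attempt = 1; attempt <= maxAttempts; attempt++)
    {
        if (canCreateAuthenticatedEncryptor())
        {
            return true; // the selected key is usable; no fallback key needed
        }

        if (attempt < maxAttempts)
        {
            // Transient failures (e.g. a briefly unreachable key vault) often clear up quickly.
            await Task.Delay(TimeSpan.FromSeconds(2 * attempt));
        }
    }

    // Only after exhausting retries fall back to generating an immediately-activated key.
    return false;
}
```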
@amcasey
Member Author

amcasey commented Aug 15, 2024

To actually achieve consensus among app instances, instances would have to know how many peers they have, which they don't.

Another approach might be to associate an insertion order with each key ring entry for consistent tie-breaking across instances. Unfortunately, that would require a format change, which has its own drawbacks.

Currently, the main mitigation is the two-minute window after app startup during which cache misses trigger a cache refresh, which is reasonably effective in practice.

@MichaelRomer

MichaelRomer commented Sep 10, 2024

We also noticed this issue in our business application, which runs in a multi-replica Kubernetes context with encryption keys persisted in a database. Requests are distributed round-robin, so it is quite common for multiple instances to require a new key at roughly the same time after a few days of inactivity (weekend).

In our case the issue disrupts the OIDC login flow because the OIDC message state can only be decrypted by the instance that created the message originally.

@amcasey
Member Author

amcasey commented Sep 17, 2024

One possible mitigation, if your app is inactive weekends, would be to increase the key lifetime from 90 days to 91 days (divisible by 7) and then kick off key generation on (say) Thursday, so that refreshes consistently happen on Monday or Tuesday (depending on your dotnet version and some fudge factors).
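
For illustration, the lifetime part of that mitigation might look like this (a sketch assuming a standard minimal-hosting app with implicit usings; the shared key directory path is a placeholder, and the initial key would still need to be created on the chosen weekday):

```csharp
using Microsoft.AspNetCore.DataProtection;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddDataProtection()
    // Placeholder shared location; all instances must see the same key repository.
    .PersistKeysToFileSystem(new DirectoryInfo("/shared/dataprotection-keys"))
    // 91 is divisible by 7, so subsequent rollovers stay pinned to the same weekday.
    .SetDefaultKeyLifetime(TimeSpan.FromDays(91));

var app = builder.Build();
app.Run();
```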

@MichaelRomer

We also considered that mitigation. However, since we run multiple environments, it would also have to be applied when deploying to new environments. Besides that, we try to reduce the number of manual ops processes as much as possible. Therefore, we decided to try out another mitigation: using a health check.

We already had a health check in place which checks at startup whether encryption works using the Data Protection API (it could fail due to misconfiguration of the certificate files). Now we are running this health check regularly (every 24 hours). This should trigger the key refresh on weekends as well. However, we are still testing this approach.

@amcasey
Member Author

amcasey commented Sep 17, 2024

Besides that, we try to reduce the number of manual ops processes as much as possible

Can you elaborate? I was thinking of this (which I'm not, incidentally, claiming is a particularly good mitigation) as a one-time code change. Is there an ongoing cost I hadn't considered?

Now we are running this health check regularly (every 24 hours)

Calling Protect at least once every 24 hours should suffice to trigger key re-generation at an appropriate time so that you don't end up with instances racing to make immediately-activated keys.
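
For illustration, a minimal sketch of wiring that up as a hosted service (assuming the standard `IDataProtectionProvider` registration; the class name, purpose string, and interval are placeholders):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.AspNetCore.DataProtection;
using Microsoft.Extensions.Hosting;

// Periodically exercises Protect/Unprotect so key generation/rollover happens during
// normal operation rather than on the first cold request after a quiet weekend.
public sealed class DataProtectionKeepAlive : BackgroundService
{
    private readonly IDataProtectionProvider _provider;

    public DataProtectionKeepAlive(IDataProtectionProvider provider) => _provider = provider;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        var protector = _provider.CreateProtector("keep-alive");
        while (!stoppingToken.IsCancellationRequested)
        {
            // Protect resolves the current default key, which triggers key generation when due.
            _ = protector.Unprotect(protector.Protect("ping"));
            await Task.Delay(TimeSpan.FromHours(24), stoppingToken);
        }
    }
}

// Registration: builder.Services.AddHostedService<DataProtectionKeepAlive>();
```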

@MichaelRomer

Can you elaborate? I was thinking of this (which I'm not, incidentally, claiming is a particularly good mitigation) as a one-time code change. Is there an ongoing cost I hadn't considered?

I thought about the "kick off key generation on a certain day" part. Not really an ongoing cost, but a "manual" step/consideration that could easily be missed when setting up new environments.

Calling Protect at least once every 24 hours should suffice to trigger key re-generation at an appropriate time so that you don't end up with instances racing to make immediately-activated keys.

That was our assumption for why the health check should be a suitable workaround. Thanks for the confirmation!

@amcasey
Member Author

amcasey commented Oct 4, 2024

I'm going to close this issue. Please post additional feedback in #36157 so we can track it centrally and don't drop any.

@amcasey amcasey closed this as completed Oct 4, 2024