-
Notifications
You must be signed in to change notification settings - Fork 10.3k
Data Protection Key Generation Race #52678
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Thanks for contacting us. We're moving this issue to the |
… key is generated If a key is required and none is available, one will be generated and immediately activated. Immediately-activated keys are problematic because they may be used before they have propagated to other app instances. What's more, in a scenario where one instance needs to generate an immediately-activated key, it's likely that other instances will need to do the same. A given instance has no control over when other instances pick up its immediately-activated keys, but it can do its best to pick up theirs. Therefore, every time it generates an immediately-activated key, it assumes other instances may have as well and pre-emptively schedules a refresh of its own key ring cache. This problem is already partially (and complementarily) addressed by `KeyRingProvider.InAutoRefreshWindow`, which allows missing keys to be re-fetched from the backing repository for two minutes after app-startup. This will cover the case where the short expiration period is too short to catch all updates but doesn't help with cases where an immediately-activated key is required after initial startup. Sadly, this still does not completely address the issue: instances could still use keys unknown to their peers during the 15 second refresh window. To fix the issue properly, we'd need to know the order in which the keys had been persisted to the repository. Part of dotnet#52678, which is part of dotnet#36157
This code is trying to ensure that the selected key can be decrypted (i.e. is usable). It may fail if, for example, Azure KeyVault is unreachable due to connectivity issues. If it fails, there's a log message and then an immediately-activated key will be generated. An immediately-activated key can cause problems for sessions making requests to multiple app instances and those problems won't obviously be connected to the (almost silent) failure in CanCreateAuthenticatedEncryptor. Rather than effectively swallowing such errors, we should allow some retries. Part of dotnet#52678, which is part of dotnet#36157
This code is trying to ensure that the selected key can be decrypted (i.e. is usable). It may fail if, for example, Azure KeyVault is unreachable due to connectivity issues. If it fails, there's a log message and then an immediately-activated key will be generated. An immediately-activated key can cause problems for sessions making requests to multiple app instances and those problems won't obviously be connected to the (almost silent) failure in CanCreateAuthenticatedEncryptor. Rather than effectively swallowing such errors, we should allow some retries. Part of dotnet#52678, which is part of dotnet#36157
To actually achieve consensus among app instances, instances would have to know how many peers they have, which they don't. Another approach might be to associate an insertion order with each key ring entry for consistent tie-breaking across instances. Unfortunately, that would require a format change, which has its own drawbacks. Currently, the main mitigation is the 2 minutes window after app startup during which cache misses trigger a cache refresh, which is reasonably effective, in practice. |
We also noticed this issue within our business application which is running in a multi-replica Kubernetes context with encryption keys persisted in a database. The requests are distributed in a round robin manner and therefore it is quite common that multiple instances require a new key at ~ the same time after a few days of inactivity (weekend). In our case the issue disrupts the OIDC login flow because the OIDC message state can only be decrypted by the instance that created the message originally. |
One possible mitigation, if your app is inactive weekends, would be to increase the key lifetime from 90 days to 91 days (divisible by 7) and then kick off key generation on (say) Thursday, so that refreshes consistently happen on Monday or Tuesday (depending on your dotnet version and some fudge factors). |
We also considered that mitigation. However, as we are running multiple environments this has to be considered also when deploying to new environments. Besides that we try to reduce the number manual ops processes as much as possible. Therefore, we decided to try out another mitigation: using a healthcheck. we already had a healthcheck in place which tries on startup if encryption works using the DataProtectionAPI (could fail due to misconfiguration of the certificate files). Now we are running this healtcheck regularly (each 24 hours). This should trigger the key refresh on weekend as well. However, we are still testing this approach. |
Can you elaborate? I was thinking of this (which I'm not, incidentally, claiming is a particularly good mitigation) as a one-time code change. Is there an ongoing cost I hadn't considered?
Calling |
I tought about the "kick off key generation on a certain day" part. Not really ongoing costs but a "manual" step/consideration that could be easily missed on setup of new environments.
This was our assumption why the healthcheck should be a suitable workaround. Thanks for the confirmation! |
I'm going to close this issue. Please post additional feedback in #36157 so we can track it centrally and don't drop any. |
Data Protection normally generates a new key 48 hours before the current default key expires, so that all instances will refresh their keyrings before it is adopted. However, there's a corner case where the app isn't running at that time and an activated key is required immediately, in which case a key is generated with activation time equal(ish) to creation time. If multiple instances do this at the same time, it's possible for whichever publishes first to fail to observe the keys generated by other instances (even in the absence of clock skew), resulting in one or more instances being unable to decrypt data from other instances.
Idea: When an immediately-active key is generated, arrange to resync the keyring a few minutes later to heal.
Idea: Allow users to increase the 48 hour window to account for services that are (largely) inactive on weekends.
Extracted from #52561
The text was updated successfully, but these errors were encountered: