Data protection: Near-simultaneous keys created in key ring #52561
Comments
Hey @mikegoatly, thanks for the detailed report! I guess my immediate reaction is that it's not obvious to me how two instances creating keys at nearly the same time could break unprotect. Given that they're writing to the same blob, I take it they're reading from the same blob as well. That means that both should have access to both keys, and it shouldn't matter which instance the protected auth token is sent to.

When you say that the keys are duplicates, do you have something more specific in mind than that they were created at nearly the same time? From your screenshot, it looks like they have different IDs, so I wouldn't expect them to collide.

It sounds like you may have deleted one of the keys before restarting your app. Did you, by any chance, save it? I'm wondering if it was corrupted somehow.
This appears to be the only place that error message is reported. Unfortunately, that doesn't imply anything about which of several failure modes was observed.
Hi @amcasey - apologies, I could have been clearer about this. By "duplicates" I was trying to convey that they overlapped in their timeframes: both were valid for exactly the same date range, give or take 100ms. The keys themselves were unique. My intuition is that it was a classic race condition, along these lines:

1. Both instances handled a request at almost the same time and found no valid key in the key ring.
2. Each generated a new key and persisted it to the blob.
3. Each instance then carried on using the key it had created, unaware of the other's.
As I say, I'm not sure how frequently the key information is re-read from the backing store, or what happens when the keys' timeframes overlap, so there's a chance that the system would eventually "self-heal", but even after updating the key ring I still needed to restart the services to get them working again. It's possible I might be able to get the dependency telemetry for when the keys were written to blob storage - I'll have a look later if I get some time.
So, I don't know offhand how new keys are synced to other instances (obviously, I'll find out), but what's bothering me is that I wouldn't expect this scenario to be rare (e.g. once in five years). It must be quite normal for two instances to receive their first requests at the same time and each make a key. It seems hard to believe that the design wouldn't accommodate that, but I'll have to dig in further to understand how it's supposed to work.
Thanks @amcasey. For what it's worth, I had a look at the logs for the affected service running in another environment, which is under higher load. I can see that the key was recreated on the 30th Nov, and only one of the two instances logged the event. The key rotation happened as part of a request occurring at almost exactly the same time as another request on the other instance.

My only takeaway from this is that it definitely doesn't happen all the time. Given the keys are rotated once every 3 months, these are the only examples I can offer, because I only have a 3-month retention on the logs. If there's anything else you need from me let me know, and I'll help if I can.
This is the document describing key rolling/regeneration. By my reading, what I'd expect to happen is:

1. As the default key comes within the ~48-hour propagation window of its expiration, the first instance to notice generates a replacement key whose activation date is the old key's expiration date.
2. The other instance picks up the new key the next time it re-reads the key ring (roughly every 24 hours), well before the new key activates.
3. Both instances switch to the new key at activation, so they never disagree about which key is the default.

To make sure we're talking about the same thing, that timing logic is sketched in code below.
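A minimal sketch of that decision, assuming the documented defaults (90-day lifetime, 24-hour key-ring refresh, 48-hour propagation window). The names here are invented for illustration and are not the actual `KeyRingProvider` internals:

```csharp
using System;

// Illustrative sketch of the documented key-rolling decision; not the real
// ASP.NET Core implementation. Assumed defaults: 24-hour key ring refresh,
// 48-hour propagation (activation) window.
static class KeyRollingSketch
{
    static readonly TimeSpan PropagationWindow = TimeSpan.FromHours(48);

    // Called whenever an instance refreshes its cached key ring.
    public static (DateTimeOffset activation, DateTimeOffset expiration)? MaybeCreateKey(
        DateTimeOffset now, DateTimeOffset defaultKeyExpiration, TimeSpan keyLifetime)
    {
        if (defaultKeyExpiration - now > PropagationWindow)
        {
            return null; // Default key is healthy; nothing to do.
        }

        if (defaultKeyExpiration > now)
        {
            // Normal rolling: the new key activates when the old one expires,
            // giving other instances ~48 hours to observe it first.
            return (defaultKeyExpiration, defaultKeyExpiration + keyLifetime);
        }

        // Caveat case: no key is active right now (e.g. after an idle weekend),
        // so a key is created with *immediate* activation - the race in this issue.
        return (now, now + keyLifetime);
    }
}
```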
It's certainly possible there are bugs in the implementation, but I'm still under the impression that the near-simultaneous key generation you've described is supposed to work.
I suppose I should mention that I'm just assuming you haven't tweaked any of the relevant defaults (e.g. disabled key generation or adjusted the keyring refresh frequency).
Ah, just noticed this caveat: per the docs (paraphrasing), if the key ring contains no key that is currently active - for example, because the application sat idle past the default key's expiration - a new key is generated with immediate activation instead of the usual ~48-hour delay.
Any chance you had no activity on any instance for a long enough period that both instances would have concluded an immediate-activation key was required? That certainly seems like it could lead to the behavior you were seeing.

Edit: just looked at your original description, and those creation and activation dates match - that would do it. Any idea how the app got into a state where it believed that was necessary?
Were you seeing failures on both instances?
One possible mitigation (on our end) might be to schedule a new fetch a few minutes in the future when an immediately-activated key is created. Instances could still get out of sync, but they wouldn't stay out of sync very long. Assuming we understand the problem correctly, of course.
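Sketching that idea (the type and timings here are hypothetical, invented for illustration - this is not actual Data Protection code):

```csharp
using System;

// Hypothetical sketch of the proposed mitigation: after creating an
// immediately-activated key, schedule the next key-ring refresh much sooner
// than the normal cadence so out-of-sync instances converge quickly.
sealed class KeyRingRefreshScheduler
{
    static readonly TimeSpan NormalRefreshPeriod = TimeSpan.FromHours(24);
    static readonly TimeSpan PostImmediateActivationRefresh = TimeSpan.FromMinutes(5);

    public DateTimeOffset NextRefreshUtc { get; private set; }

    public void OnKeyRingRefreshed(DateTimeOffset nowUtc, bool createdImmediatelyActivatedKey)
    {
        // If this instance just created an immediate-activation key, another
        // instance may have done the same; re-read the backing store soon so
        // both settle on the same default key.
        NextRefreshUtc = nowUtc + (createdImmediatelyActivatedKey
            ? PostImmediateActivationRefresh
            : NormalRefreshPeriod);
    }
}
```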
Thanks for your thoughts on this and the links to the documentation.
No - it's all the default settings.
I was, but I don't think that necessarily rules it out - the cookie could have been encrypted on either one of the instances, and the other one wouldn't have been able to read it.
I think this might be it. This is our test environment, so it's not usually used over the weekend, and the issue occurred first thing on Monday. Looking back over the logs, there was some very limited activity, but none of it would have required interaction with data protection. This explains why I haven't seen this before.

Assuming no activity on the system between 5pm on Friday and 9am on Monday, the problem would only occur if the key expired between 5pm Sunday and 9am Monday - in other words, a 16-hour window following the 48-hour window that was covered when the key ring was last checked.

Your suggestion would help with this edge case, allowing for a more rapid "self-heal", and may be sensible as a back-up, but I wonder if it's worth revisiting the 2-day activation design. Extending beyond the 48-hour timeframe could help in detecting expiring keys over weekends when systems are not actively used. The primary downside is more frequent key generation. Alternatively, could the activation window be configurable in the same way that the lifetime is? E.g.:

```csharp
services.AddDataProtection()
    // Ensure that a new key is generated 7 days before the previous one expires
    .SetKeyActivationWindow(TimeSpan.FromDays(7));
```

There is obviously the edge case of someone configuring it badly (i.e. a lifetime of 7 days and an activation window of 7 days), but that could be detected as it is configured.
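For what it's worth, that bad-configuration check could be as simple as the following (hypothetical, since `SetKeyActivationWindow` doesn't exist today):

```csharp
using System;

// Hypothetical validation for the proposed activation-window setting: an
// activation window that meets or exceeds the key lifetime would mean a key
// is due for replacement the moment it's created.
static class KeyManagementOptionsValidation
{
    public static void Validate(TimeSpan keyLifetime, TimeSpan activationWindow)
    {
        if (activationWindow >= keyLifetime)
        {
            throw new ArgumentOutOfRangeException(
                nameof(activationWindow),
                "The key activation window must be shorter than the key lifetime.");
        }
    }
}
```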
That setting exists, but it's internal and readonly. The docs plausibly claim that this is because it's too easy to get yourself into a bad state if you change it.

Two mitigations that come to mind, especially if you're not ready to update to a version that includes a fix, would be to perform some trivial activity over the weekend - I would expect basically any key usage to trigger generation if expiration is close - or to adjust the default key lifetime so that rotation is less likely to land in an idle window.

I would definitely be interested to know if a lot of users see idle periods over weekends - that's not something I'd considered.
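A minimal sketch of the first mitigation, assuming a hosted service is acceptable (the interval, purpose string, and class name are arbitrary choices for illustration):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.AspNetCore.DataProtection;
using Microsoft.Extensions.Hosting;

// Keep-alive that performs a trivial protect/unprotect round trip on a timer.
// Exercising a key keeps the key ring in use, so an approaching expiration
// should be noticed even when no user traffic arrives over a weekend.
public sealed class DataProtectionKeepAlive : BackgroundService
{
    private readonly IDataProtector _protector;

    public DataProtectionKeepAlive(IDataProtectionProvider provider)
    {
        // The purpose string is arbitrary; it just isolates this protector.
        _protector = provider.CreateProtector("KeepAlive");
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            _protector.Unprotect(_protector.Protect("ping"));
            await Task.Delay(TimeSpan.FromHours(6), stoppingToken);
        }
    }
}

// Registration, e.g. in Program.cs:
// builder.Services.AddHostedService<DataProtectionKeepAlive>();
```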
Thanks for the suggestions. Now that I properly understand what's going on, I'm confident it's only our test system that's ever likely to be affected, and even then the probability of it recurring is low and a simple pod restart would fix it.
Sounds good. Introducing a healing mechanism and increasing configurability are on our radar but may not happen in the near future.
Extracted #52678. Thanks again for the feedback!
Is there an existing issue for this?
Describe the bug
In one of our services today, we started running into an error (see Exceptions below), which ultimately led to investigating the stored key ring. We use PersistKeysToAzureBlobStorage, and after opening the key ring blob we noticed it contained two overlapping keys, created within ~100ms of each other.
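For context, the persistence is configured along these lines (the URI and credential here are placeholders, not our actual values):

```csharp
using System;
using Azure.Identity;
using Microsoft.AspNetCore.DataProtection;
using Microsoft.Extensions.DependencyInjection;

// Sketch of pointing data protection at a shared key ring blob. Every
// instance reads and writes this one blob, which is why near-simultaneous
// key creation on two instances races on the same file.
var services = new ServiceCollection();

services.AddDataProtection()
    .PersistKeysToAzureBlobStorage(
        new Uri("https://<storage-account>.blob.core.windows.net/dataprotection/keys.xml"),
        new DefaultAzureCredential());
```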
Digging into the application logs, we can see that two requests were handled on separate instances of the service, each creating and persisting a separate key.
The upshot of this is that the two instances were operating with different keys.
I'm not sure what the expected behavior of the data protection system is when there are overlapping keys like this, and I guess there's a chance that it would have "just worked" had I restarted all the instances, but my solution was to ensure that there was only one valid key before restarting.
Expected Behavior
Even when a service is scaled out, it should not be possible for duplicate keys to be created.
Steps To Reproduce
Given that this is the first time this has happened in 5 years of running our services with this code, reproducing will be hard. That said, when it does go wrong, the errors are critical.
Exceptions (if any)
For us the error occurs when decrypting an auth cookie. The only useful bit of information from the logs is:
.NET Version
6
Anything else?
Application is running in a Linux container in AKS.
The service was running on .NET 6, with the library versions:
- Azure.Extensions.AspNetCore.DataProtection.Blobs 1.3.2