
Data protection: Near-simultaneous keys created in key ring #52561


Closed
mikegoatly opened this issue Dec 4, 2023 · 16 comments
Labels
area-dataprotection Includes: DataProtection

Comments

@mikegoatly

Is there an existing issue for this?

  • I have searched the existing issues

Describe the bug

In one of our services today, we started running into this error:

Failure message: Unprotect ticket failed

This ultimately led us to investigate the stored key ring. We use PersistKeysToAzureBlobStorage, and after opening the key ring blob we noticed it contained two overlapping keys, created within ~100ms of each other:

        <creationDate>2023-12-04T08:39:23.4797282Z</creationDate>
        <activationDate>2023-12-04T08:39:23.4189143Z</activationDate>
        <expirationDate>2024-03-03T08:39:23.4189143Z</expirationDate>

and

        <creationDate>2023-12-04T08:39:23.5924921Z</creationDate>
        <activationDate>2023-12-04T08:39:23.4697987Z</activationDate>
        <expirationDate>2024-03-03T08:39:23.4697987Z</expirationDate>

Digging into the application logs, we can see that two requests were handled on separate instances of the service, each creating and persisting a separate key:

[Screenshot: application logs showing the two requests, one per instance, each creating and persisting a key]

The upshot of this is that the two instances were operating with different keys.

I'm not sure what the expected behavior of the data protection system is when there are overlapping keys like this, and I guess there's a chance that it would have "just worked" had I restarted all the instances, but my solution was to ensure that there was only one valid key before restarting.

Expected Behavior

Even when a service is scaled out, it should not be possible for duplicate keys to be created.

Steps To Reproduce

Given this is the first time this has happened in 5 years of running our services with this code, reproducing it will be hard. That said, when it does go wrong the errors are critical.

Exceptions (if any)

For us the error occurs when decrypting an auth cookie. The only useful bit of information from the logs is:

idsrv.external was not authenticated. Failure message: Unprotect ticket failed

.NET Version

6

Anything else?

Application is running in a Linux container in AKS.

The service was running on .NET 6:

Host:
  Version:      6.0.23
  Architecture: x64
  Commit:       e0f0de876a

.NET SDKs installed:
  No SDKs were found.

.NET runtimes installed:
  Microsoft.AspNetCore.App 6.0.23 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 6.0.23 [/usr/share/dotnet/shared/Microsoft.NETCore.App]

With the library versions:

Azure.Extensions.AspNetCore.DataProtection.Blobs 1.3.2
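
For completeness, the key ring persistence is wired up roughly like this (a sketch - the blob URI and credential below are placeholders rather than our real configuration):

    using System;
    using Azure.Identity;
    using Azure.Storage.Blobs;
    using Microsoft.AspNetCore.DataProtection;
    using Microsoft.Extensions.DependencyInjection;

    var services = new ServiceCollection();

    // Placeholder URI and credential - illustrative only.
    var keyRingBlob = new BlobClient(
        new Uri("https://examplestorage.blob.core.windows.net/dataprotection/keyring.xml"),
        new DefaultAzureCredential());

    services.AddDataProtection()
        .PersistKeysToAzureBlobStorage(keyRingBlob);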

@ghost added the area-dataprotection label Dec 4, 2023
@amcasey
Member

amcasey commented Dec 4, 2023

Hey @mikegoatly, thanks for the detailed report!

I guess my immediate reaction is that it's not obvious to me how two instances creating keys at nearly the same time could break unprotect. Given that they're writing to the same blob, I take it they're reading from the same blob as well. That means that both should have access to both keys and it shouldn't matter which instance the protected auth token is sent to.

When you say that the keys are duplicates, do you have something more specific in mind than that they were created at nearly the same time? From your screenshot, it looks like they have different IDs, so I wouldn't expect them to collide.

It sounds like you may have deleted one of the keys before restarting your app. Did you, by any chance, save it? I'm wondering if it was corrupted somehow.

@amcasey
Member

amcasey commented Dec 4, 2023

This appears to be the only place that error message is reported. Unfortunately, that doesn't imply anything about which of several failure modes was observed.

@mikegoatly
Author

Hi @amcasey - apologies, I could have been clearer about this.

By duplicates I was trying to convey that they overlapped in their timeframes. Both were valid for exactly the same date range, give or take 100ms. The keys themselves were unique.

My intuition is that it was a classic race condition, and the following happened:

  • Both instances were hit simultaneously, each detecting that the key needed regenerating, which they did
  • Both saved their keys to the key ring (within ~100ms)
  • Each instance continued to run with its own generated key, meaning that cookies encrypted on one couldn't be decrypted on the other

As I say, I'm not sure how frequently the key information is re-read from the backing store, or what happens when the keys' timeframes overlap, so there's a chance that the system would eventually "self-heal", but even after updating the key ring, I still needed to restart the services to get them working again.

It's possible I might be able to get the dependency telemetry for when the keys were written to blob storage - I'll have a look later if I get some time.

@mikegoatly
Author

mikegoatly commented Dec 5, 2023

Yeah, I can see the sequence of events by drilling into the dependencies:

[Screenshot: dependency telemetry showing the sequence of blob storage calls from each instance]

And slightly more helpfully as a table:

Instance A              Instance B
GET current config
                        GET current config
PUT updated config
GET config
                        PUT updated config -> 412 CONFLICT
                        GET config again
                        PUT updated config <- now includes both keys
                        GET config

In this case it would have been better if the conflict hadn't been handled by the blob storage provider and the exception had instead been bubbled up to the core data protection logic, at which point it could have re-read the keys and used the updated value.

Edit: I realise I've made some gross assumptions in that statement; I don't actually know for sure which layer handled the 412...
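
Either way, the pattern I'm assuming looks roughly like this (a sketch written against Azure.Storage.Blobs directly, not the provider's actual code):

    using System;
    using System.Threading.Tasks;
    using Azure;
    using Azure.Storage.Blobs;
    using Azure.Storage.Blobs.Models;

    // Optimistic concurrency sketch: read the blob and remember its ETag, write back
    // with If-Match, and on a 412 re-read the latest content and merge before retrying.
    static async Task StoreElementWithRetryAsync(BlobClient blob, Func<BinaryData, BinaryData> addKey)
    {
        while (true)
        {
            var current = await blob.DownloadContentAsync();
            var updated = addKey(current.Value.Content);

            try
            {
                await blob.UploadAsync(updated, new BlobUploadOptions
                {
                    Conditions = new BlobRequestConditions { IfMatch = current.Value.Details.ETag }
                });
                return;
            }
            catch (RequestFailedException ex) when (ex.Status == 412)
            {
                // Another instance wrote first: loop, re-read, and retry so the
                // final payload contains both keys.
            }
        }
    }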

@amcasey
Member

amcasey commented Dec 6, 2023

So, I don't know offhand how new keys are synced to other instances (obviously, I'll find out), but what's bothering me is that I wouldn't expect this scenario to be rare (e.g. once in five years). It must be quite normal for two instances to receive their first requests at the same time and each make a key. It seems hard to believe that the design wouldn't accommodate that, but I'll have to dig in further to understand how it's supposed to work.

@mikegoatly
Author

Thanks @amcasey.

For what it's worth, I had a look at the logs for the affected service running in another environment which is under higher load. I can see that the key was recreated on the 30th Nov, and only one of the two instances logged the event:

[Screenshot: logs showing only one of the two instances logging the key creation on 30th Nov]

The key rotation happened as part of a request occurring at almost exactly the same time as another request on the other instance:

[Screenshot: the key-rotating request alongside a near-simultaneous request on the other instance]

My only takeaway from this is that it definitely doesn't happen all the time. Given the keys are rotated once every 3 months, these are the only examples I can offer because I only have a 3 month retention on the logs.

If there's anything else you need from me let me know, and I'll help if I can.

@amcasey
Member

amcasey commented Dec 6, 2023

This is the document describing key rolling/regeneration. By my reading, what I'd expect to happen is:

  1. Both instances see that the default key is going to expire within 48 hours
  2. Both instances generate a new key (sub-optimal, but not problematic, AFAICT)
  3. Both instances write their new keys to the IXmlRepository (blob storage, in this case)
  4. Sometime within the next 24 hours, each instance fetches the keyring from blob storage and sees both new keys
  5. Neither instance consumes the new keys since they are not yet Active
  6. 48 hours after generation, both new keys transition from Created to Active
  7. Both instances pick the newer of the two (almost but not actually) simultaneously created keys as the new default - the older is effectively ignored

It's certainly possible there are bugs in the implementation, but I'm still under the impression that the near-simultaneous key generation you've described is supposed to work.
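
To make step 7 concrete, the selection I'd expect is roughly the following (a simplified sketch, not the actual KeyRingProvider implementation):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Among keys that are already activated and not yet expired, the most recently
    // activated one becomes the default, so the older of two near-simultaneous keys
    // is effectively ignored.
    public static class DefaultKeySelection
    {
        public record KeyInfo(Guid Id, DateTimeOffset ActivationDate, DateTimeOffset ExpirationDate);

        public static KeyInfo? PickDefaultKey(IEnumerable<KeyInfo> keys, DateTimeOffset now) =>
            keys.Where(k => k.ActivationDate <= now && k.ExpirationDate > now)
                .OrderByDescending(k => k.ActivationDate)
                .FirstOrDefault();
    }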

@amcasey
Member

amcasey commented Dec 6, 2023

I suppose I should mention that I'm just assuming you haven't tweaked any of the relevant defaults (e.g. disabled key generation or adjusted the keyring refresh frequency).
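
For reference, the kind of tweaks I have in mind look like this (illustrative only - the key ring refresh interval itself isn't publicly configurable):

    using System;
    using Microsoft.AspNetCore.DataProtection;
    using Microsoft.AspNetCore.DataProtection.KeyManagement;
    using Microsoft.Extensions.DependencyInjection;

    var services = new ServiceCollection();

    // Non-default settings that would change the analysis above.
    services.AddDataProtection()
        .SetDefaultKeyLifetime(TimeSpan.FromDays(14));   // default is 90 days

    services.Configure<KeyManagementOptions>(options =>
    {
        options.AutoGenerateKeys = false;                // disables automatic key generation
    });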

@amcasey
Member

amcasey commented Dec 6, 2023

Ah, just noticed this caveat:

There might be circumstances where a key will be created with immediate activation. One example would be when the application hasn't run for a time and all keys in the key ring are expired. When this happens, the key is given an activation date of { now } without the normal 2-day activation delay.

Any chance you had no instances for a long enough period that both instances would have concluded an immediate-activation key was required? That certainly seems like it could lead to the behavior you were seeing.

Edit: just looked at your original description and those creation and activation dates match - that would do it. Any idea how the app got into a state where it believed that was necessary?

@amcasey changed the title from "Data protection: Duplicate keys created in key ring" to "Data protection: Near-simultaneous keys created in key ring" Dec 6, 2023
@amcasey
Member

amcasey commented Dec 7, 2023

Were you seeing Unprotect failures on both instances or just one? Both would punch a hole in this theory.

@amcasey
Member

amcasey commented Dec 7, 2023

One possible mitigation (on our end) might be to schedule a new fetch a few minutes in the future when an immediately-activated key is created. Instances could still get out of sync, but they wouldn't stay out of sync very long. Assuming we understand the problem correctly, of course.

@mikegoatly
Author

mikegoatly commented Dec 7, 2023

Thanks for your thoughts on this and the links to the documentation.

I suppose I should mention that I'm just assuming you haven't tweaked any of the relevant defaults (e.g. disabled key generation or adjusted the keyring refresh frequency).

No - it's all the default settings.

Were you seeing Unprotect failures on both instances or just one? Both would punch a hole in this theory.

I was, but I don't think it necessarily rules it out - the cookie could have been encrypted on either one of the instances, and the other one wouldn't have been able to read it.

Ah, just noticed this caveat:

There might be circumstances where a key will be created with immediate activation. One example would be when the application hasn't run for a time and all keys in the key ring are expired. When this happens, the key is given an activation date of { now } without the normal 2-day activation delay.

Any chance you had no instances for a long enough period that both instances would have concluded an immediate-activation key was required? That certainly seems like it could lead to the behavior you were seeing.

I think this might be it. This is our test environment so it's not usually used over the weekend and the issue occurred first thing on Monday. Looking back over the logs, there was some very limited activity, but none of it would have required interaction with data protection.

This explains why I haven't seen this before. Assuming no activity on the system between 5pm on Friday and 9:00am on Monday, the problem would only occur if the key expired between 5pm Sunday and 9:00am Monday, i.e. more than 48 hours after the last request on Friday but before the first request on Monday: a 16-hour window.

Your suggestion would help with this edge case, allowing for a more rapid "self-heal" and may be sensible as a back-up, but I wonder if it's worth revisiting the 2 day activation design. Extending beyond the 48-hour timeframe could help in detecting expiring keys during weekends when systems are not actively used. The primary downside is more frequent key generation.

Alternatively, could the activation window be configurable in the same way that the lifetime is, e.g.

services.AddDataProtection()
       // Ensure that a new key is generated 7 days before the previous one expires
       .SetKeyActivationWindow(TimeSpan.FromDays(7));

There is obviously the edge case of someone configuring it badly (e.g. a lifetime of 7 days and an activation window of 7 days), but that could be detected as it is configured.
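
Purely to illustrate that check (hypothetical, since SetKeyActivationWindow doesn't exist today):

    using System;

    // Hypothetical validation for the proposed activation window - not an existing API.
    static void ValidateActivationWindow(TimeSpan activationWindow, TimeSpan keyLifetime)
    {
        if (activationWindow <= TimeSpan.Zero || activationWindow >= keyLifetime)
        {
            throw new ArgumentOutOfRangeException(nameof(activationWindow),
                "The activation window must be positive and shorter than the key lifetime.");
        }
    }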

@amcasey
Member

amcasey commented Dec 7, 2023

That setting exists, but it's internal and readonly. The docs plausibly claim that this is because it's too easy to get yourself into a bad state if you change it.

Two mitigations that come to mind, especially if you're not ready to update to a version that includes a fix, would be to perform some trivial activity over the weekend - I would expect basically any key usage to trigger generation if expiration is close - or to change NewKeyLifetime from 90 days to something divisible by 7 so that expiration always falls on (e.g.) Thursday and re-generation always falls on (e.g.) Tuesday.
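
The latter is a one-liner (sketch; SetDefaultKeyLifetime is what sets NewKeyLifetime):

    using System;
    using Microsoft.AspNetCore.DataProtection;
    using Microsoft.Extensions.DependencyInjection;

    var services = new ServiceCollection();

    // 91 days is divisible by 7, so key expiration (and the 48-hour regeneration
    // window before it) always lands on the same weekdays.
    services.AddDataProtection()
        .SetDefaultKeyLifetime(TimeSpan.FromDays(91));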

I would definitely be interested to know if a lot of users see idle periods over weekends - that's not something I'd considered.

@mikegoatly
Author

Thanks for the suggestions. Now that I properly understand what's going on, I'm confident it's only our test system that's ever likely to be affected, and even then the probability of it recurring is low and a simple pod restart would fix it.

@amcasey
Member

amcasey commented Dec 8, 2023

Sounds good. Introducing a healing mechanism and increasing configurability are on our radar but may not happen in the near future.

@amcasey
Member

amcasey commented Dec 8, 2023

Extracted #52678. Thanks again for the feedback!

@amcasey closed this as completed Dec 8, 2023
@ghost locked as resolved and limited conversation to collaborators Feb 7, 2024