
Data protection: Near-simultaneous keys created in key ring #52561


Closed
mikegoatly opened this issue Dec 4, 2023 · 16 comments
Labels
area-dataprotection Includes: DataProtection

Comments

@mikegoatly

Is there an existing issue for this?

  • I have searched the existing issues

Describe the bug

In one of our services today, we started running into this error:

Failure message: Unprotect ticket failed

This ultimately led us to investigate the stored key ring. We use PersistKeysToAzureBlobStorage, and after opening the key ring blob we noticed it contained two overlapping keys, created within ~100ms of each other:

        <creationDate>2023-12-04T08:39:23.4797282Z</creationDate>
        <activationDate>2023-12-04T08:39:23.4189143Z</activationDate>
        <expirationDate>2024-03-03T08:39:23.4189143Z</expirationDate>

and

        <creationDate>2023-12-04T08:39:23.5924921Z</creationDate>
        <activationDate>2023-12-04T08:39:23.4697987Z</activationDate>
        <expirationDate>2024-03-03T08:39:23.4697987Z</expirationDate>

Digging into the application logs, we can see that two requests were handled on separate instances of the service, each creating and persisting a separate key:

[Screenshot: application logs showing the two requests, one per instance, each creating and persisting a key]

The upshot of this is that the two instances were operating with different keys.

I'm not sure what the expected behavior of the data protection system is when there are overlapping keys like this, and I guess there's a chance that it would have "just worked" had I restarted all the instances, but my solution was to ensure that there was only one valid key before restarting.

Expected Behavior

Even when a service is scaled out, it should not be possible for duplicate keys to be created.

Steps To Reproduce

Given this is the first time this has happened in 5 years of running our services with this code, reproducing it will be hard. That said, when it does go wrong the errors are critical.

Exceptions (if any)

For us the error occurs when decrypting an auth cookie. The only useful bit of information from the logs is:

idsrv.external was not authenticated. Failure message: Unprotect ticket failed

.NET Version

6

Anything else?

Application is running in a Linux container in AKS.

The service was running on .NET 6:

Host:
  Version:      6.0.23
  Architecture: x64
  Commit:       e0f0de876a

.NET SDKs installed:
  No SDKs were found.

.NET runtimes installed:
  Microsoft.AspNetCore.App 6.0.23 [/usr/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 6.0.23 [/usr/share/dotnet/shared/Microsoft.NETCore.App]

With the library versions:

Azure.Extensions.AspNetCore.DataProtection.Blobs 1.3.2
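
For completeness, the key ring persistence is wired up roughly like this (a sketch - the blob URI and credential below are placeholders rather than our real configuration):

    using System;
    using Azure.Identity;
    using Azure.Storage.Blobs;
    using Microsoft.AspNetCore.DataProtection;
    using Microsoft.Extensions.DependencyInjection;

    var services = new ServiceCollection();

    // Placeholder URI and credential - illustrative only.
    var keyRingBlob = new BlobClient(
        new Uri("https://examplestorage.blob.core.windows.net/dataprotection/keyring.xml"),
        new DefaultAzureCredential());

    services.AddDataProtection()
        .PersistKeysToAzureBlobStorage(keyRingBlob);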

@ghost added the area-dataprotection label Dec 4, 2023
@amcasey
Member

amcasey commented Dec 4, 2023

Hey @mikegoatly, thanks for the detailed report!

I guess my immediate reaction is that it's not obvious to me how two instances creating keys at nearly the same time could break unprotect. Given that they're writing to the same blob, I take it they're reading from the same blob as well. That means that both should have access to both keys and it shouldn't matter which instance the protected auth token is sent to.

When you say that the keys are duplicates, do you have something more specific in mind than that they were created at nearly the same time? From your screenshot, it looks like they have different IDs, so I wouldn't expect them to collide.

It sounds like you may have deleted one of the keys before restarting your app. Did you, by any chance, save it? I'm wondering if it was corrupted somehow.

@amcasey
Member

amcasey commented Dec 4, 2023

This appears to be the only place that error message is reported. Unfortunately, that doesn't imply anything about which of several failure modes was observed.

@mikegoatly
Author

Hi @amcasey - apologies, I could have been clearer about this.

By duplicates I was trying to convey that they overlapped in their timeframes. Both were valid for exactly the same date range, give or take 100ms. The keys themselves were unique.

My intuition is that it was a classic race condition, and the following happened:

  • Both instances were hit simultaneously, each detecting that the key needed regenerating, which they did
  • Both saved their keys to the key ring (within ~100ms)
  • Each instance continued to run with its own generated key, meaning that cookies encrypted on one couldn't be decrypted on the other

As I say, I'm not sure how frequently the key information is re-read from the backing store, or what happens when the keys' timeframes overlap, so there's a chance that the system would eventually "self-heal", but even after updating the key ring, I still needed to restart the services to get them working again.

It's possible I might be able to get the dependency telemetry for when the keys were written to blob storage - I'll have a look later if I get some time.

@mikegoatly
Author

mikegoatly commented Dec 5, 2023

Yeah, I can see the sequence of events by drilling into the dependencies:

[Screenshot: dependency telemetry showing the sequence of blob storage calls from each instance]

And slightly more helpfully as a table:

Instance A              Instance B
GET current config
                        GET current config
PUT updated config
GET config
                        PUT updated config -> 412 CONFLICT
                        GET config again
                        PUT updated config <- now includes both keys
                        GET config

In this case it would have been better if the conflict hadn't been handled by the blob storage provider and the exception had instead been bubbled up to the core data protection logic, at which point it could have re-read the keys and used the updated value.

Edit: I realise I've made some gross assumptions in that statement; I don't actually know for sure which layer handled the 412...
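
Either way, the pattern I'm assuming looks roughly like this (a sketch written against Azure.Storage.Blobs directly, not the provider's actual code):

    using System;
    using System.Threading.Tasks;
    using Azure;
    using Azure.Storage.Blobs;
    using Azure.Storage.Blobs.Models;

    // Optimistic concurrency sketch: read the blob and remember its ETag, write back
    // with If-Match, and on a 412 re-read the latest content and merge before retrying.
    static async Task StoreElementWithRetryAsync(BlobClient blob, Func<BinaryData, BinaryData> addKey)
    {
        while (true)
        {
            var current = await blob.DownloadContentAsync();
            var updated = addKey(current.Value.Content);

            try
            {
                await blob.UploadAsync(updated, new BlobUploadOptions
                {
                    Conditions = new BlobRequestConditions { IfMatch = current.Value.Details.ETag }
                });
                return;
            }
            catch (RequestFailedException ex) when (ex.Status == 412)
            {
                // Another instance wrote first: loop, re-read, and retry so the
                // final payload contains both keys.
            }
        }
    }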

@amcasey
Member

amcasey commented Dec 6, 2023

So, I don't know offhand how new keys are synced to other instances (obviously, I'll find out), but what's bothering me is that I wouldn't expect this scenario to be rare (e.g. once in five years). It must be quite normal for two instances to receive their first requests at the same time and each make a key. It seems hard to believe that the design wouldn't accommodate that, but I'll have to dig in further to understand how it's supposed to work.

@mikegoatly
Author

Thanks @amcasey.

For what it's worth, I had a look at the logs for the affected service running in another environment which is under higher load. I can see that the key was recreated on the 30th Nov, and only one of the two instances logged the event:

[Screenshot: logs showing only one of the two instances logging the key creation on 30th Nov]

The key rotation happened as part of a request occurring at almost exactly the same time as another request on the other instance:

[Screenshot: the key-rotating request alongside a near-simultaneous request on the other instance]

My only takeaway from this is that it definitely doesn't happen all the time. Given the keys are rotated once every 3 months, these are the only examples I can offer because I only have a 3 month retention on the logs.

If there's anything else you need from me let me know, and I'll help if I can.

@amcasey
Member

amcasey commented Dec 6, 2023

This is the document describing key rolling/regeneration. By my reading, what I'd expect to happen is:

  1. Both instances see that the default key is going to expire within 48 hours
  2. Both instances generate a new key (sub-optimal, but not problematic, AFAICT)
  3. Both instances write their new keys to the IXmlRepository (blob storage, in this case)
  4. Sometime within the next 24 hours, each instance fetches the keyring from blob storage and sees both new keys
  5. Neither instance consumes the new keys since they are not yet Active
  6. 48 hours after generation, both new keys transition from Created to Active
  7. Both instances pick the newer of the two (almost but not actually) simultaneously created keys as the new default - the older is effectively ignored

It's certainly possible there are bugs in the implementation, but I'm still under the impression that the near-simultaneous key generation you've described is supposed to work.
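
To make step 7 concrete, the selection I'd expect is roughly the following (a simplified sketch, not the actual KeyRingProvider implementation):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Among keys that are already activated and not yet expired, the most recently
    // activated one becomes the default, so the older of two near-simultaneous keys
    // is effectively ignored.
    public static class DefaultKeySelection
    {
        public record KeyInfo(Guid Id, DateTimeOffset ActivationDate, DateTimeOffset ExpirationDate);

        public static KeyInfo? PickDefaultKey(IEnumerable<KeyInfo> keys, DateTimeOffset now) =>
            keys.Where(k => k.ActivationDate <= now && k.ExpirationDate > now)
                .OrderByDescending(k => k.ActivationDate)
                .FirstOrDefault();
    }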

@amcasey
Member

amcasey commented Dec 6, 2023

I suppose I should mention that I'm just assuming you haven't tweaked any of the relevant defaults (e.g. disabled key generation or adjusted the keyring refresh frequency).
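
For reference, the kind of tweaks I have in mind look like this (illustrative only - the key ring refresh interval itself isn't publicly configurable):

    using System;
    using Microsoft.AspNetCore.DataProtection;
    using Microsoft.AspNetCore.DataProtection.KeyManagement;
    using Microsoft.Extensions.DependencyInjection;

    var services = new ServiceCollection();

    // Non-default settings that would change the analysis above.
    services.AddDataProtection()
        .SetDefaultKeyLifetime(TimeSpan.FromDays(14));   // default is 90 days

    services.Configure<KeyManagementOptions>(options =>
    {
        options.AutoGenerateKeys = false;                // disables automatic key generation
    });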

@amcasey
Member

amcasey commented Dec 6, 2023

Ah, just noticed this caveat:

There might be circumstances where a key will be created with immediate activation. One example would be when the application hasn't run for a time and all keys in the key ring are expired. When this happens, the key is given an activation date of { now } without the normal 2-day activation delay.

Any chance you had no instances for a long enough period that both instances would have concluded an immediate-activation key was required? That certainly seems like it could lead to the behavior you were seeing.

Edit: just looked at your original description and those creation and activation dates match - that would do it. Any idea how the app got into a state where it believed that was necessary?

@amcasey changed the title from "Data protection: Duplicate keys created in key ring" to "Data protection: Near-simultaneous keys created in key ring" Dec 6, 2023
@amcasey
Member

amcasey commented Dec 7, 2023

Were you seeing Unprotect failures on both instances or just one? Both would punch a hole in this theory.

@amcasey
Member

amcasey commented Dec 7, 2023

One possible mitigation (on our end) might be to schedule a new fetch a few minutes in the future when an immediately-activated key is created. Instances could still get out of sync, but they wouldn't stay out of sync very long. Assuming we understand the problem correctly, of course.

@mikegoatly
Author

mikegoatly commented Dec 7, 2023

Thanks for your thoughts on this and the links to the documentation.

I suppose I should mention that I'm just assuming you haven't tweaked any of the relevant defaults (e.g. disabled key generation or adjusted the keyring refresh frequency).

No - it's all the default settings.

Were you seeing Unprotect failures on both instances or just one? Both would punch a hole in this theory.

I was, but I don't think it necessarily rules it out - the cookie could have been encrypted on either one of the instances, and the other one wouldn't have been able to read it.

Ah, just noticed this caveat:

There might be circumstances where a key will be created with immediate activation. One example would be when the application hasn't run for a time and all keys in the key ring are expired. When this happens, the key is given an activation date of { now } without the normal 2-day activation delay.

Any chance you had no instances for a long enough period that both instances would have concluded an immediate-activation key was required? That certainly seems like it could lead to the behavior you were seeing.

I think this might be it. This is our test environment so it's not usually used over the weekend and the issue occurred first thing on Monday. Looking back over the logs, there was some very limited activity, but none of it would have required interaction with data protection.

This explains why I haven't seen this before. Assuming no activity on the system between 5pm on Friday and 9:00am on Monday, the problem would only occur if the key expired between 5pm Sunday and 9:00am Monday, i.e. more than 48 hours after the last request on Friday but before the first request on Monday: a 16-hour window.

Your suggestion would help with this edge case, allowing for a more rapid "self-heal" and may be sensible as a back-up, but I wonder if it's worth revisiting the 2 day activation design. Extending beyond the 48-hour timeframe could help in detecting expiring keys during weekends when systems are not actively used. The primary downside is more frequent key generation.

Alternatively, could the activation window be configurable in the same way that the lifetime is, e.g.

services.AddDataProtection()
       // Ensure that a new key is generated 7 days before the previous one expires
       .SetKeyActivationWindow(TimeSpan.FromDays(7));

There is obviously the edge case of someone configuring it badly (e.g. a lifetime of 7 days and an activation window of 7 days), but that could be detected as it is configured.
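
Purely to illustrate that check (hypothetical, since SetKeyActivationWindow doesn't exist today):

    using System;

    // Hypothetical validation for the proposed activation window - not an existing API.
    static void ValidateActivationWindow(TimeSpan activationWindow, TimeSpan keyLifetime)
    {
        if (activationWindow <= TimeSpan.Zero || activationWindow >= keyLifetime)
        {
            throw new ArgumentOutOfRangeException(nameof(activationWindow),
                "The activation window must be positive and shorter than the key lifetime.");
        }
    }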

@amcasey
Member

amcasey commented Dec 7, 2023

That setting exists, but it's internal and readonly. The docs plausibly claim that this is because it's too easy to get yourself into a bad state if you change it.

Two mitigations that come to mind, especially if you're not ready to update to a version that includes a fix, would be to perform some trivial activity over the weekend - I would expect basically any key usage to trigger generation if expiration is close - or to change NewKeyLifetime from 90 days to something divisible by 7 so that expiration always falls on (e.g.) Thursday and re-generation always falls on (e.g.) Tuesday.
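
The latter is a one-liner (sketch; SetDefaultKeyLifetime is what sets NewKeyLifetime):

    using System;
    using Microsoft.AspNetCore.DataProtection;
    using Microsoft.Extensions.DependencyInjection;

    var services = new ServiceCollection();

    // 91 days is divisible by 7, so key expiration (and the 48-hour regeneration
    // window before it) always lands on the same weekdays.
    services.AddDataProtection()
        .SetDefaultKeyLifetime(TimeSpan.FromDays(91));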

I would definitely be interested to know if a lot of users see idle periods over weekends - that's not something I'd considered.

@mikegoatly
Author

Thanks for the suggestions. Now that I properly understand what's going on, I'm confident it's only our test system that's ever likely to be affected, and even then the probability of it recurring is low and a simple pod restart would fix it.

@amcasey
Member

amcasey commented Dec 8, 2023

Sounds good. Introducing a healing mechanism and increasing configurability are on our radar but may not happen in the near future.

@amcasey
Member

amcasey commented Dec 8, 2023

Extracted #52678. Thanks again for the feedback!

@amcasey closed this as completed Dec 8, 2023
@ghost locked as resolved and limited conversation to collaborators Feb 7, 2024