-
Notifications
You must be signed in to change notification settings - Fork 18k
crypto/x509: intermittent failures on darwin/amd64 with ‘x509: “…” certificate is expired’ since 2021-11-06 #51209
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Is it possible the time/date on those machines is occasionally very wrong? I use proxy.golang.org on my Mac a bunch (as do many many users) |
@rolandshoemaker, can you have a look? I wonder if there's something horrible happening at the macOS API call level, since that error is coming from the platform, I believe. |
Thanks. The new ones seem to be
I guess the proxy must be issuing redirects for those packages rather than serving bytes itself? The final error still has the Unicode quotes indicative of a macOS error. |
If for some reason Go knows the correct time (which it does because the time in the log lines is correct), while macOS somehow doesn't, this being noticed for the first time would make sense because we switched to using the macOS verifier in Go 1.18. Alternatively, there might be an intermittent bug that corrupts the time provided to the verification API. (For example if we're getting the ABI wrong it might depend on uninitialized memory. There's an assembly trampoline involved.) |
I don't see any evidence of what Go thinks the time is. I don't see a single timestamp in that entire log. Maybe I am missing it? |
(The file names are assigned later, by the coordinator.) |
|
Looks like we only pass an explicit time in during testing. If we can reproduce it, we could pass a time in unconditionally. |
I can reproduce this on my M1 Mac Mini. I will try to get more information about the failure. |
I am working on establishing an accurate failure rate, but anecdotally I've never seen this program run for longer than 30 minutes without provoking a failure. I added debugging to crypto/x509 to print the failed certs and as expected they are the right ones, with valid NotBefore/NotAfter ranges nowhere near the current time. I am going to let it run for the rest of the day to establish a reliable baseline failure rate, and then I will try setting the time unconditionally when invoking the macOS API. |
This program calls x509.Verify directly, from a bunch of goroutines, over and over, on the cert chain from proxy.golang.org. It fails pretty often:
The call is
If I remove the CurrentTime line, then the problem goes away. This strongly suggests that the time conversion is the problem. And looking more carefully at crypto/tls, it does set CurrentTime unconditionally in handshake_client.go. The time conversion is done in crypto/x509/internal/macos:
The only problem is that CFDateCreate takes a CFAbsoluteTime and a CFAbsoluteTime is seconds since 2001-01-01, yes, but as a CFTimeInterval, which in turn is a double (float64). We are passing an integer, in an integer register, and leaving the floating-point register where CFDateCreate expects to find its argument completely uninitialized. Honestly it is amazing that it works as often as it does (most of the time!). I wonder if SecTrustSetVerifyDate treats super weird dates as if the call never happened and falls back to the current date. Or maybe X0 is very often zero, and zero is special, or maybe something else... In any event, this hack makes the program run without failure for many minutes, when before it failed every few seconds:
That seems to confirm the diagnosis. I will think more about the right fix. |
It works so well because of the register ABI or the generated code, which caused t.Sub(...).Seconds() inside x509 to leave the number we wanted in the floating-point register. As long as it doesn't get overwritten due to goroutine rescheduling before the libc call, all is well. I have confirmed this by putting a spurious math.Sin call in the middle. |
Will send a fix after the weekend. |
(I'm on vacation, but this is amazing.) |
Change https://go.dev/cl/387255 mentions this issue: |
Reopening to track cherry-pick to Go 1.18 release branch. |
Does this fail open or closed? In other words, could this allow Verify to return nil for an expired cert? |
Merged to Go 1.18 release branch via https://go.dev/cl/389554. |
hey @FiloSottile friendly ping about
|
This could result in Verify returning nil for an expired cert, yes, but it was never shipped in any release, so we are not handling it as a security issue. |
Thanks! |
greplogs --dashboard -md -l -e 'x509: “misc-sni.google.com” certificate is expired'
2022-02-14T20:07:02-8634188-b2cb1bd/darwin-amd64-11_0
2022-02-11T19:36:36-f4118a5-0bde2cf/darwin-amd64-11_0
2022-01-10T21:48:09-03fcf44-1f411e9/darwin-amd64-11_0
2021-12-09T19:01:08-03fcf44-8ff254e/darwin-amd64-11_0
2021-12-08T18:06:06-5770296-7b7efd7/darwin-amd64-11_0
2021-12-08T15:15:40-03fcf44-08025a9/darwin-amd64-11_0
2021-11-12T18:57:22-c33205f-b1b6d92/darwin-amd64-10_15
2021-11-06T19:41:15-ec00351-61d789d/darwin-amd64-11_0
These failures seem to always occur when fetching from
proxy.golang.org
, but not for any specific package. (They occur intermittently across many different repos and tests.) However, I have only observed them ondarwin/amd64
builders. I'm not sure where in the stack these failures are introduced — whether they're due to a bug in the darwin kernel, a bug (perhaps a race?) incrypto/x509
, some misconfiguration on the builders, or something else entirely.These failures also seem to have started recently (during or shortly before the Go 1.18 code freeze).
@golang/security: since
darwin/amd64
is a first class port, could someone assess whether this is a Go 1.18 regression, and whether and how we should mitigate these errors?The text was updated successfully, but these errors were encountered: