-
Notifications
You must be signed in to change notification settings - Fork 219
Description
See
Lines 33 to 42 in 30308ae
if rdrand_step(&mut el) == 1 { | |
// AMD CPUs from families 14h to 16h (pre Ryzen) sometimes fail to | |
// set CF on bogus random data, so we check these values explicitly. | |
// See https://github.com/systemd/systemd/issues/11810#issuecomment-489727505 | |
// We perform this check regardless of target to guard against | |
// any implementation that incorrectly fails to set CF. | |
if el != 0 && el != !0 { | |
return Ok(el.to_ne_bytes()); | |
} | |
// Keep looping in case this was a false positive. |
if rdrand_step(&mut el) == 1 {
// AMD CPUs from families 14h to 16h (pre Ryzen) sometimes fail to
// set CF on bogus random data, so we check these values explicitly.
// See https://github.com/systemd/systemd/issues/11810#issuecomment-489727505
// We perform this check regardless of target to guard against
// any implementation that incorrectly fails to set CF.
if el != 0 && el != !0 {
return Ok(el.to_ne_bytes());
}
// Keep looping in case this was a false positive.
}
The condition if el != 0 && el != !0
is testing that the value returned by RDRAND, after it reports success, is not zero or all-one bits, i.e. never usize::MIN
or usize::MAX
. Consequently, getrandom::getrandom
will never return a result where there a single word is zero or all-one bits, where as word is a 4-byte chunk on 32-bit x86 or a 8-byte chunk on 64-bit x86, when the RDRAND implementation is being used. Such values are expected to occur every 2/2^N words on average. As a result, any use of getrandom::getrandom
returns results that are wrongly biased; 2/2^N values are rejected on an N-bit platform.
32-bit x86 support for the RDRAND feature was added in PR #134, after PR #48. The analysis on the bias when the code was implemented was based on the probability 2/2^64 which is correct when this code runs on a 64-bit CPU, but not for when it runs on a 32-bit CPU. I suspect that when PR #134 was under consideration, the difference in the bias was perhaps overlooked. The bias is notably worse on 32-bit platforms since 2/2^32 is much more likely than 2/2^64.
Outside of cryptography, I find the present solution particularly unfortunate because if getrandom's output is used for randomized testing of a 64-bit/32-bit function on x86_64/x86 on a target for which the RDRAND implementation is used, the important boundary conditions of the values usize::MIN
and usize::MAX
will never be reached, ever!
In the context of implementing cryptographic functions, and especially keygen and nonce generation, I suspect any user of this crate is likely to get pestered about this bias and will have to address it in some way.
In any case, I think the present solution is weird enough that it is worth having more eyes on it and more discussion, and also I hope we could find a better solution.
Incidentally, BoringSSL's code has this to say:
// Also disable for family 0x17, models 0x70–0x7f, due to possible RDRAND
// failures there too.
It seems like Google has reason to believe that some family 0x17 models may also be bad. So the comment indicating the problem is only with families 14h-16h may be worth updating, at least.
Activity
briansmith commentedon Oct 1, 2021
In the
rdrand
crate, I saw very similar logic:Interestingly, the rdrand crate source code says that the bit pattern is all-ones when an AMD CPU fails, but it also checks all-zeroes, but doesn't explain why. And getrandom does the same, without any explanation.
I dug into the systemd patch at poettering/systemd@c6372be and it also checks both values. It has a clearer explanation of the reasoning:
So the value 0 is being screened out "just in case." Doesn't sound good to me.
I think that what might be not too terrible for systemd in that particular circumstance may be not quite good enough for a general purpose
getrandom
library.dhardy commentedon Oct 1, 2021
I try not to comment on this repo too much (since I passed off maintainership), but I do remember a little about this issue.
Biased, yes, and you are right that this was initially only considered in the case of
u64
values. But is 1-in-2^31 bias too much? Depends on the application I guess; certainly this case is more arguable than for 1-in-2^63. We could perhaps set the minimum output size at 64-bits?I wish we could just go to the AMD website and find a listing of all possible failure cases. Alas, we lack comprehensive sources of information, and also the possibility that this might later be discovered in future CPUs, so is there much more we can do?
dhardy commentedon Oct 1, 2021
Apparently this also affects some of the latest AMD CPUs, but in a harder-to-detect way: collisions on Ryzen 5900X.
(Perhaps the better answer is never to trust RDRAND on an AMD CPU? Not to say there aren't potential issues with other vendors.)
briansmith commentedon Oct 4, 2021
Thanks @dhardy. I remember that second systemd issue now. Here's an important comment from that issue:
I don't know if that claim is accurate, but if it is accurate then it explains why systemd is comfortable with using a non-uniform RNG.
The
rdrand
crate was also doing the same thing as this; over the weekend they changed their implementation to avoid filtering out usize::MIN and usize::MAX; see nagisa/rust_rdrand#16 (comment).newpavlov commentedon Oct 15, 2021
I think we should introduce separate code paths for affected families and everyone else. I have proposed it previously, but we chose the current approach for simplicity sake. It would be nice to have more info on the 0x17 family.
This hardware bug is really scary, so I think "just in case" is perfectly fine here. I think it's better to have small bias, than potentially zeroed nonces/keys. Personally, I think that on affected families it may be worth to go even further and check that we do not encounter collisions by calling the step function twice and checking that it returns different results each time.
briansmith commentedon Oct 20, 2021
Why only all-zeroes and all-ones are scary, but no other values? My understanding is that AMD documented and/or somebody observed that all-ones happen in the case of the specific AMD failure; did AMD document or did somebody observe that the all-zero value occurs for these failures?
False positives would occur. (This type of test is documented in NIST/FIPS standards and implementation guidance, FYI.)
As I mentioned in #230, I don't think it is a good idea to default to using RDRAND for any target. What I would prefer is that there is no purely-RDRAND-based implementation in
getrandom
; instead, when targeting the OS-less targets, users can register their own custom RNG using the existing mechanism, and if they want their custom RNG to use (just) RDRAND then they could use the rdrand crate for that. That would remove the use of RDRAND completely fromgetrandom
. Then people could build "real" CSPRNGs that use RDRAND only as an entropy source instead of as the entire CSPRNG.josephlr commentedon Oct 20, 2021
Some context on why that check is the way it is. The check is basically "did the implementation forget to set the CF flag". Both AMD and Intel document what value is put in the destination register on failure. For AMD it is all 1s (as discussed above), for Intel it is all 0s (see section 5.2 of Intel's DRNG whitepaper). So we just check for both.
This is obviously not an ideal state, as ideally the
RDRAND
instruction would never have bugs on any architecture, but it is what it is.I agree on many of the things in #230, but removing RDRAND completely is unlikely. Many users of this crate (myself included) rely on this functionality and its simple ergonomics, and I wouldn't want to break them.
josephlr commentedon Jan 28, 2023
I was looking into this issue today, and I think we should just mimic what the Linux Kernel does here:
cpuid
claims to support RDRANDusize::MAX
. This is the bug the current code tries to avoid.init_amd_bd
andinit_amd_jg
is_amd && family < 0x17
.x86_init_rdrand
getrandom_inner
. This should avoid any issues around bias (either on 32-bit or 64-bit platforms). It also happens to make the generated code smaller and easier to inline.This would also make our code almost match that in BoringSSL as well, except for them disabling RDRAND on
family 0x17, models 0x70–0x7f
, as @briansmith noted in #228 (comment). This change was added in BoringSSL change 37604, and I did some internal and external digging to figure out why ths change was added.It turns out it was added in response to the Zen 2 RDRAND issues (Phoronix, Reddit post). As far as I can tell, a bad BIOS caused RDRAND to always fail, but still signal failure correctly via
CF
. On BoringSSL, this would be an issue, so the feature is disabled. However, for this library, we will just (always) return an error, so no blacklisting for that family/model is needed.no_std
-only platform briansmith/ring#744