I know this is a contentious issue, sorry for bringing it up again... Like others, I feel a bit uneasy with the choice to promote a `Standard` distribution that samples within the open (0, 1) interval. As a matter of fact, I am a bit wary of promoting any "one true way" to produce FP values.

Now, I do agree that a fully open interval is adequate in most applications. OTOH, I have never seen a practical situation where a half-open interval would not be adequate too: it is often important to avoid a singularity at one of the bounds, typically with `log(u)`-type transforms, but I suspect that the need to avoid both 0 and 1 is very rare.
The main reason I dislike the (0, 1) open interval as the default choice, though, is that it implicitly bakes into the API a deficiency of the current `transmute`-based FP generation algorithm, namely the truncation to 23 bits (resp. 52 bits for f64) of precision, instead of the 24/53 bits that one may expect based on the FP significand precision.

The problem is that, unlike with the half-open [0, 1) or (0, 1] intervals, the generation of FP values in the open (0, 1) interval becomes, AFAIK, prohibitively expensive with 24/53 bits due to the need for a rejection step. So in a way, the current `Standard` distribution would become a commitment to an implementation detail of the current FP generation method.
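For concreteness, here is a sketch (my own assumption of one possible approach, not rand's actual code) of what full 53-bit generation in the fully open interval would look like: the single all-zero draw must be rejected, which puts an extra test and branch in the hot path of every sample.

```rust
// Sketch (assumption, not rand's implementation): full 53-bit sampling
// in the open interval (0, 1) requires rejecting the single zero draw.
fn open01_53<R: FnMut() -> u64>(mut next_u64: R) -> f64 {
    loop {
        let bits = next_u64() >> 11; // keep the top 53 bits
        if bits != 0 {
            // multiples of 2^-53 in (0, 1), both endpoints excluded
            return bits as f64 * (1.0 / (1u64 << 53) as f64);
        }
    }
}

fn main() {
    // Deterministic stand-in for a real RNG (assumption).
    let x = open01_53(|| 1u64 << 63);
    assert!(x > 0.0 && x < 1.0);
    println!("{x}");
}
```

The rejection fires with probability 2^-53 per draw, so the loop body almost never repeats; the cost is the extra branch itself.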
For this reason, if it is really deemed important to promote a default FP generation distribution rather than just define e.g. `Open01`, `OpenClosed01` and `ClosedOpen01`, I would then favor the widely adopted convention of sampling within [0, 1) because (i) it is also adequate for most situations and (ii) it leaves open the possibility (today, or in the future) of efficiently generating FP values with 24/53 bits of precision.
Regarding point (ii) above, I made some preliminary investigations to assess the computational advantage of the current 52-bit `transmute`-based method over two 53-bit FP generation methods. The following u64 -> f64 conversion methods were benchmarked:
```rust
use std::mem;

// The 52-bit transmute-based method we now use.
fn transmute52(r: u64) -> f64 {
    const FRACTION_WIDTH: i32 = 52;
    const ONE: u64 = 1023 << FRACTION_WIDTH; // = transmute::<f64, u64>(1.0)
    let fraction: u64 = r >> (64 - FRACTION_WIDTH);
    let u = unsafe { mem::transmute::<u64, f64>(fraction | ONE) };
    u - 1.0
}

// The conventional 53-bit conversion method.
fn direct53(r: u64) -> f64 {
    const PRECISION: i32 = 53;
    const SCALE: f64 = 1.0 / ((1u64 << PRECISION) as f64);
    SCALE * (r >> (64 - PRECISION)) as f64
}

// Home-grown, branch-less 53-bit transmute-based method -- faster methods may exist.
fn transmute53(r: u64) -> f64 {
    const FRACTION_WIDTH: i32 = 52;
    const ONE_HALF: u64 = 1022 << FRACTION_WIDTH; // = transmute::<f64, u64>(0.5)
    const RIGHT_MASK: u64 = (1u64 << 63) - 1u64; // all bits set except the MSB
    // Create a 000...00 or a 111...11 mask depending on the MSB.
    // The commented alternative is slightly slower but functionally equivalent to
    // direct53(), whereas the uncommented version inverts the role of the MSB.
    let compl_mask: u64 = (r as i64 >> 63) as u64;
    //let compl_mask: u64 = (!r as i64 >> 63) as u64;
    let fraction: u64 = (r & RIGHT_MASK) >> (63 - FRACTION_WIDTH);
    let u = unsafe { mem::transmute::<u64, f64>(fraction | ONE_HALF) };
    // `c` is 0.0 or 0.5 depending on the MSB.
    let c = unsafe { mem::transmute::<u64, f64>(compl_mask & ONE_HALF) };
    u - c
}
```
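As a sanity check of the "inverts the role of the MSB" comment, here is a small self-contained test of my own (using the safe `f64::from_bits`, which is equivalent to the `transmute` above): `transmute53(r)` should equal `direct53` applied to `r` with its MSB flipped, and both should yield multiples of 2^-53 in [0, 1).

```rust
// Self-contained sanity check (my addition, not part of the benchmark).
// f64::from_bits is the safe equivalent of transmute::<u64, f64>.
fn direct53(r: u64) -> f64 {
    const PRECISION: i32 = 53;
    const SCALE: f64 = 1.0 / ((1u64 << PRECISION) as f64);
    SCALE * (r >> (64 - PRECISION)) as f64
}

fn transmute53(r: u64) -> f64 {
    const FRACTION_WIDTH: i32 = 52;
    const ONE_HALF: u64 = 1022 << FRACTION_WIDTH; // bit pattern of 0.5f64
    const RIGHT_MASK: u64 = (1u64 << 63) - 1;
    let compl_mask: u64 = (r as i64 >> 63) as u64;
    let fraction: u64 = (r & RIGHT_MASK) >> (63 - FRACTION_WIDTH);
    let u = f64::from_bits(fraction | ONE_HALF);
    let c = f64::from_bits(compl_mask & ONE_HALF);
    u - c
}

fn main() {
    for r in [0u64, 1, 1 << 63, u64::MAX, 0x0123_4567_89ab_cdef] {
        let t = transmute53(r);
        // transmute53 behaves like direct53 with the MSB inverted.
        assert_eq!(t, direct53(r ^ (1 << 63)));
        // Both methods produce exact multiples of 2^-53 in [0, 1).
        assert!(t >= 0.0 && t < 1.0);
        assert_eq!((t * (1u64 << 53) as f64).fract(), 0.0);
    }
    println!("ok");
}
```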
The benchmark computes a simple sum of a large number of FP values produced by one of the above conversion functions, fed by the 64-bit output of `XorShiftRng`.
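For reference, the harness looks roughly like the following sketch (my own reconstruction, not the actual benchmark code; a toy xorshift64 stands in for `XorShiftRng`). Summing the converted values keeps the compiler from discarding the conversions, so the loop cost is dominated by RNG plus conversion.

```rust
// Hypothetical benchmark harness (assumption: any fast 64-bit generator
// will do for timing the conversion step).
fn direct53(r: u64) -> f64 {
    (r >> 11) as f64 * (1.0 / (1u64 << 53) as f64)
}

fn bench_sum(convert: fn(u64) -> f64, n: u64) -> f64 {
    // Toy xorshift64 state in place of XorShiftRng (assumption).
    let mut s: u64 = 0x9e37_79b9_7f4a_7c15;
    let mut sum = 0.0;
    for _ in 0..n {
        s ^= s << 13;
        s ^= s >> 7;
        s ^= s << 17;
        sum += convert(s);
    }
    sum
}

fn main() {
    let n: u64 = 1_000_000;
    let mean = bench_sum(direct53, n) / n as f64;
    // Uniform [0, 1) samples should average close to 0.5.
    assert!((mean - 0.5).abs() < 0.01);
    println!("mean = {mean}");
}
```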
As always, the benchmark needs to be taken with a big grain of salt, especially since these methods are normally inlined, so a lot depends on the actual code context. To assess robustness with respect to inlining, the benchmark was run a second time with the above functions marked `#[inline(never)]`. Also, I did not try any CPU other than mine (i5-7200). With these caveats, here are the computation times:
| Method | `#[inline(always)]` | `#[inline(never)]` |
|---|---|---|
| `XorShiftRng` + `transmute52` | t0 | t1 [1.77 × t0] |
| `XorShiftRng` + `direct53` | 1.00 × t0 | 1.19 × t1 |
| `XorShiftRng` + `transmute53` | 1.14 × t0 | 0.99 × t1 |
Two surprises:
- the inlined direct 53-bit method was exactly as fast as the transmute method;
- the non-inlined version of `transmute53` was very slightly but consistently faster, which I surmise is because `c` has a 50% chance of being 0.0, so `u - c` can evaluate fast.
In any case, these results seem to strongly question the purported advantage of the 52-bit `transmute`-based method, at least on modern CPUs. And even on older CPUs, I would expect the `transmute53` version to be reasonably close to `transmute52`.