
Future-proofing 24/53bit precision for f32/f64 generation #416

@sbarral

Description


I know this is a contentious issue, sorry for bringing it up again... Like others, I feel a bit uneasy with the choice to promote a Standard distribution that samples within the open (0, 1) interval. As a matter of fact, I am a bit wary of promoting any "one true way" to produce FP values.

Now I do agree that a fully open interval is adequate in most applications. OTOH, I have never seen a practical situation where a half-open interval would not be adequate too: it is often important to avoid a singularity at one of the bounds, typically with log(u)-type transforms, but I suspect that the need to avoid both 0 and 1 is very rare.
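To make this concrete, here is a sketch (mine, not rand's API) of inverse-transform exponential sampling, the classic log(u)-type transform. With u drawn from the half-open [0, 1), the argument 1.0 - u lies in (0, 1], so the logarithm is always finite; excluding the second bound as well buys nothing here:

```rust
// Illustrative only: exponential deviate by inverse-transform sampling.
// For u in [0, 1), 1.0 - u is in (0, 1], so ln(1.0 - u) is always finite
// and the dangerous bound (u == 1) never occurs.
fn exp_deviate(u: f64) -> f64 {
    debug_assert!(u >= 0.0 && u < 1.0);
    -(1.0 - u).ln()
}
```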

The main reason I dislike the (0, 1) open interval as the default choice, though, is that it implicitly bakes into the API a deficiency of the current transmute-based FP generation algorithm, namely the truncation to 23-bit precision (resp. 52-bit for f64), instead of the 24/53 bits that one may expect based on the FP significand precision.
The problem is that, unlike with the half-open [0, 1) or (0, 1] intervals, generating FP values in the fully open (0, 1) interval at 24/53-bit precision becomes, AFAIK, prohibitively expensive due to the need for a rejection step. So in a way, the current Standard distribution would amount to a commitment to an implementation detail of the current FP generation method.
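For illustration, a sketch (my own, not a proposed implementation) of what fully open (0, 1) generation at full 53-bit precision requires, namely a rejection loop to exclude zero:

```rust
// Sketch: fully open (0, 1) at 53-bit precision needs a rejection step.
fn open53(mut next_u64: impl FnMut() -> u64) -> f64 {
    const SCALE: f64 = 1.0 / (1u64 << 53) as f64;
    loop {
        let r = next_u64() >> 11; // keep the 53 high bits
        if r != 0 {
            // r is in 1..2^53, so the result lies strictly within (0, 1).
            return r as f64 * SCALE;
        }
        // r == 0 occurs with probability 2^-53, so rejection is rare,
        // but the extra branch is paid on every sample.
    }
}
```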

For this reason, if it is really deemed important to promote a default FP generation distribution rather than just define e.g. Open01, OpenClosed01 and ClosedOpen01, I would then favor the widely-adopted convention of sampling within [0, 1) because (i) it is also adequate for most situations and (ii) it leaves open the possibility (today, or in the future) to efficiently generate FP values with a 24/53-bit precision.

Regarding point (ii) above, I made some preliminary investigations to assess the computational advantage of the current 52-bit transmute-based method over two 53-bit FP generation methods. The following u64->f64 conversion methods were benchmarked:

use std::mem;

// The 52-bit transmute-based method we now use.
fn transmute52(r: u64) -> f64 {
    const FRACTION_WIDTH: i32 = 52;
    const ONE: u64 = 1023 << FRACTION_WIDTH; // = transmute::<f64, u64>(1.0)
    
    let fraction: u64 = r >> (64 - FRACTION_WIDTH);
    let u = unsafe { mem::transmute::<u64, f64>(fraction | ONE) };
    
    u - 1.0
}

// The conventional 53-bit conversion method.
fn direct53(r: u64) -> f64 {
    const PRECISION: i32 = 53;
    const SCALE: f64 = 1.0/((1u64 << PRECISION) as f64);
    
    SCALE * (r >> (64 - PRECISION)) as f64
}

// Home-grown, branch-less 53-bit transmute-based method -- faster methods may exist.
fn transmute53(r: u64) -> f64 {
    const FRACTION_WIDTH: i32 = 52;
    const ONE_HALF: u64 = 1022 << FRACTION_WIDTH; // = transmute::<f64, u64>(0.5)
    const RIGHT_MASK: u64 = (1u64 << 63) - 1u64; // all bits set except MSB
    
    // Create a 000...00 or a 111...11 mask depending on the MSB.
    // The commented alternative is slightly slower but functionally equivalent to
    // direct53(), whereas the uncommented version inverts the role of the MSB.  
    let compl_mask: u64 = (r as i64 >> 63) as u64;
    //let compl_mask: u64 = (!r as i64 >> 63) as u64; 
    let fraction: u64 = (r & RIGHT_MASK) >> (63 - FRACTION_WIDTH);
    let u = unsafe { mem::transmute::<u64, f64>(fraction | ONE_HALF) };
    // `c` is 0.0 or 0.5 depending on the MSB.
    let c = unsafe { mem::transmute::<u64, f64>(compl_mask & ONE_HALF) };
    
    u - c
}
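To see the lost bit concretely, here is a standalone check (mine, using the arithmetic value-equivalents of the conversions above rather than the transmute forms): set only the bit that a 53-bit conversion keeps but a 52-bit conversion drops.

```rust
// Value-equivalent forms of the two conversions, for comparing granularity.
fn lowest_outputs(r: u64) -> (f64, f64) {
    // Same value as transmute52: fraction * 2^-52.
    let f52 = (r >> 12) as f64 / (1u64 << 52) as f64;
    // Same as direct53: (r >> 11) * 2^-53.
    let f53 = (r >> 11) as f64 / (1u64 << 53) as f64;
    (f52, f53)
}
```

For r = 1 << 11, the 52-bit conversion returns 0.0 while the 53-bit one returns 2^-53: exactly the extra bit of precision at stake.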

The benchmark performs a simple sum of a large number of FP values produced by one of the above conversion functions, fed with the 64-bit output of XorShiftRng.
As always, the benchmark needs to be taken with a big grain of salt, especially since these methods are normally inlined, so a lot depends on the actual code context. To assess robustness towards inlining, the benchmark was run a second time with the above functions marked #[inline(never)]. Also, I did not try any CPU other than mine (an i5-7200). With these caveats, here are the computation times:

Method                      #[inline(always)]   #[inline(never)]
XorShiftRng + transmute52   t0                  t1 [1.77 × t0]
XorShiftRng + direct53      1.00 × t0           1.19 × t1
XorShiftRng + transmute53   1.14 × t0           0.99 × t1
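The measured loop was roughly of the following shape (a reconstruction, not the exact harness; a minimal Marsaglia xorshift64 stands in for XorShiftRng, and the seed and iteration count are illustrative):

```rust
// Stand-in generator: Marsaglia's xorshift64.
fn xorshift64(state: &mut u64) -> u64 {
    let mut x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    x
}

// Sum n converted values, here using the direct53-style conversion inline.
fn bench_sum(n: u64) -> f64 {
    let mut state = 0x9E37_79B9_7F4A_7C15u64; // any nonzero seed
    let mut sum = 0.0;
    for _ in 0..n {
        let r = xorshift64(&mut state);
        sum += (r >> 11) as f64 / (1u64 << 53) as f64;
    }
    sum
}
```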

Two surprises:

  • the inlined direct 53-bit method was exactly as fast as the transmute-based one,
  • the non-inlined version of transmute53 was very slightly but consistently faster than transmute52, which I surmise is because c has a 50% chance of being 0.0, so u - c can evaluate quickly.

In any case, these results seem to seriously call into question the purported advantage of the 52-bit transmute-based method, at least on modern CPUs. And even on older CPUs, I would expect the transmute53 version to stay reasonably close to transmute52.
