
Future-proofing 24/53bit precision for f32/f64 generation #416

@sbarral

Description


I know this is a contentious issue, sorry for bringing it up again... Like others, I feel a bit uneasy with the choice to promote a Standard distribution that samples within the open (0, 1) interval. As a matter of fact, I am a bit wary of promoting any "one true way" to produce FP values.

Now I do agree that a fully open interval is adequate in most applications. OTOH, I have never seen a practical situation where a half-open interval would not be adequate too: it is often important to avoid a singularity at one of the bounds, typically with log(u)-type transforms, but I suspect that the need to avoid both 0 and 1 is very rare.
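To make this concrete, here is a sketch (mine, not rand's API) of inverse-transform exponential sampling, the classic log(u)-type transform. With u drawn from the half-open [0, 1), the argument 1.0 - u lies in (0, 1], so the logarithm is always finite; excluding the second bound as well buys nothing here:

```rust
// Illustrative only: exponential deviate by inverse-transform sampling.
// For u in [0, 1), 1.0 - u is in (0, 1], so ln(1.0 - u) is always finite
// and the dangerous bound (u == 1) never occurs.
fn exp_deviate(u: f64) -> f64 {
    debug_assert!(u >= 0.0 && u < 1.0);
    -(1.0 - u).ln()
}
```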

The main reason I dislike the (0, 1) open interval as the default choice, though, is that it implicitly bakes into the API a deficiency of the current transmute-based FP generation algorithm, namely the truncation to 23-bit precision (resp. 52-bit for f64), instead of the 24/53 bits that one may expect based on the FP significand precision.
The problem is that, unlike with the half-open [0, 1) or (0, 1] intervals, generating FP values in the fully open (0, 1) interval at 24/53-bit precision becomes, AFAIK, prohibitively expensive due to the need for a rejection step. So in a way, the current Standard distribution would amount to a commitment to an implementation detail of the current FP generation method.
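For illustration, a sketch (my own, not a proposed implementation) of what fully open (0, 1) generation at full 53-bit precision requires, namely a rejection loop to exclude zero:

```rust
// Sketch: fully open (0, 1) at 53-bit precision needs a rejection step.
fn open53(mut next_u64: impl FnMut() -> u64) -> f64 {
    const SCALE: f64 = 1.0 / (1u64 << 53) as f64;
    loop {
        let r = next_u64() >> 11; // keep the 53 high bits
        if r != 0 {
            // r is in 1..2^53, so the result lies strictly within (0, 1).
            return r as f64 * SCALE;
        }
        // r == 0 occurs with probability 2^-53, so rejection is rare,
        // but the extra branch is paid on every sample.
    }
}
```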

For this reason, if it is really deemed important to promote a default FP generation distribution rather than just define e.g. Open01, OpenClosed01 and ClosedOpen01, I would then favor the widely-adopted convention of sampling within [0, 1) because (i) it is also adequate for most situations and (ii) it leaves open the possibility (today, or in the future) to efficiently generate FP values with a 24/53-bit precision.

Regarding point (ii) above, I made some preliminary investigations to assess the computational advantage of the current 52-bit transmute-based method over two 53-bit FP generation methods. The following u64->f64 conversion methods were benchmarked:

use std::mem;

// The 52-bit transmute-based method we now use.
fn transmute52(r: u64) -> f64 {
    const FRACTION_WIDTH: i32 = 52;
    const ONE: u64 = 1023 << FRACTION_WIDTH; // = transmute::<f64, u64>(1.0)
    
    let fraction: u64 = r >> (64 - FRACTION_WIDTH);
    let u = unsafe { mem::transmute::<u64, f64>(fraction | ONE) };
    
    u - 1.0
}

// The conventional 53-bit conversion method.
fn direct53(r: u64) -> f64 {
    const PRECISION: i32 = 53;
    const SCALE: f64 = 1.0/((1u64 << PRECISION) as f64);
    
    SCALE * (r >> (64 - PRECISION)) as f64
}

// Home-grown, branch-less 53-bit transmute-based method -- faster methods may exist.
fn transmute53(r: u64) -> f64 {
    const FRACTION_WIDTH: i32 = 52;
    const ONE_HALF: u64 = 1022 << FRACTION_WIDTH; // = transmute::<f64, u64>(0.5)
    const RIGHT_MASK: u64 = (1u64 << 63) - 1u64; // all bits set except MSB
    
    // Create a 000...00 or a 111...11 mask depending on the MSB.
    // The commented alternative is slightly slower but functionally equivalent to
    // direct53(), whereas the uncommented version inverts the role of the MSB.  
    let compl_mask: u64 = (r as i64 >> 63) as u64;
    //let compl_mask: u64 = (!r as i64 >> 63) as u64; 
    let fraction: u64 = (r & RIGHT_MASK) >> (63 - FRACTION_WIDTH);
    let u = unsafe { mem::transmute::<u64, f64>(fraction | ONE_HALF) };
    // `c` is 0.0 or 0.5 depending on the MSB.
    let c = unsafe { mem::transmute::<u64, f64>(compl_mask & ONE_HALF) };
    
    u - c
}
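To see the lost bit concretely, here is a standalone check (mine, using the arithmetic value-equivalents of the conversions above rather than the transmute forms): set only the bit that a 53-bit conversion keeps but a 52-bit conversion drops.

```rust
// Value-equivalent forms of the two conversions, for comparing granularity.
fn lowest_outputs(r: u64) -> (f64, f64) {
    // Same value as transmute52: fraction * 2^-52.
    let f52 = (r >> 12) as f64 / (1u64 << 52) as f64;
    // Same as direct53: (r >> 11) * 2^-53.
    let f53 = (r >> 11) as f64 / (1u64 << 53) as f64;
    (f52, f53)
}
```

For r = 1 << 11, the 52-bit conversion returns 0.0 while the 53-bit one returns 2^-53: exactly the extra bit of precision at stake.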

The benchmark performs a simple sum of a large number of FP values produced by one of the above conversion functions, fed with the 64-bit output of XorShiftRng.
As always, the benchmark needs to be taken with a big grain of salt, especially since these methods are normally inlined, so a lot depends on the actual code context. To assess robustness towards inlining, the benchmark was run a second time with the above functions marked #[inline(never)]. Also, I did not try any CPU other than mine (an i5-7200). With these caveats, here are the computation times:

Method                      #[inline(always)]   #[inline(never)]
XorShiftRng + transmute52   t0                  t1 [1.77 × t0]
XorShiftRng + direct53      1.00 × t0           1.19 × t1
XorShiftRng + transmute53   1.14 × t0           0.99 × t1
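The measured loop was roughly of the following shape (a reconstruction, not the exact harness; a minimal Marsaglia xorshift64 stands in for XorShiftRng, and the seed and iteration count are illustrative):

```rust
// Stand-in generator: Marsaglia's xorshift64.
fn xorshift64(state: &mut u64) -> u64 {
    let mut x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    x
}

// Sum n converted values, here using the direct53-style conversion inline.
fn bench_sum(n: u64) -> f64 {
    let mut state = 0x9E37_79B9_7F4A_7C15u64; // any nonzero seed
    let mut sum = 0.0;
    for _ in 0..n {
        let r = xorshift64(&mut state);
        sum += (r >> 11) as f64 / (1u64 << 53) as f64;
    }
    sum
}
```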

Two surprises:

  • the inlined direct 53-bit method was exactly as fast as the transmute-based one,
  • the non-inlined version of transmute53 was very slightly but consistently faster than transmute52, which I surmise is because c has a 50% chance of being 0.0, so u - c can evaluate quickly.

In any case, these results seem to seriously call into question the purported advantage of the 52-bit transmute-based method, at least on modern CPUs. And even on older CPUs, I would expect the transmute53 version to stay reasonably close to transmute52.
