Improve shootout-reverse-complement #18143

Closed

mahkoh wants to merge 2 commits

Conversation

mahkoh (Contributor) commented Oct 18, 2014

Lots of unsafe code and lots of branches removed. Also multithreaded.

Rust old: 1.208 seconds
Rust new: 0.761 seconds
C: 0.632 seconds

if seq.len() % 2 == 1 {
let middle = &mut seq[seq.len() / 2];
*middle = complements[*middle as uint];
/// Returns a mutable slice into the contained vector.
A Member commented:

This is extremely unsafe.

mahkoh (Contributor, author) commented Oct 19, 2014

For some reason LLVM produces very bad code when doing pointer arithmetic. The commented-out variant in the following snippet is much faster than the normal variant:

    unsafe {
        let mut left = seq.as_mut_ptr();
        let mut right = left.offset(len as int);
        while left < right {
            //asm!("dec $0" : "+r"(right));
            right = right.offset(-1);
            let tmp = COMPLEMENTS[*left as uint];
            *left = COMPLEMENTS[*right as uint];
            *right = tmp;
            //asm!("inc $0" : "+r"(left));
            left = left.offset(1);
        }
    }

cc @thestinger @eddyb
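
For comparison, the same swap can be written with safe slice indexing; the following is only a sketch in modern Rust syntax with a made-up function name, not code from the PR:

fn reverse_complement_in_place(seq: &mut [u8], complements: &[u8; 256]) {
    let mut left = 0;
    let mut right = seq.len();
    while left < right {
        right -= 1;
        // Swap and complement the two ends; for odd lengths the middle
        // element meets itself and is complemented exactly once.
        let tmp = complements[seq[left] as usize];
        seq[left] = complements[seq[right] as usize];
        seq[right] = tmp;
        left += 1;
    }
}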

mahkoh (Contributor, author) commented Oct 19, 2014

Lo and behold:

~/rust/reverse-complement$ time ./rust < /tmp/input.txt > /dev/null
./rust < /tmp/input.txt > /dev/null 0.34s user 0.19s system 120% cpu 0.437 total

~/rust/reverse-complement$ time ./c < /tmp/input.txt > /dev/null
./c < /tmp/input.txt > /dev/null 0.40s user 0.18s system 119% cpu 0.485 total

~/rust/reverse-complement$ time ./cxx < /tmp/input.txt > /dev/null
./cxx < /tmp/input.txt > /dev/null 0.43s user 0.17s system 120% cpu 0.498 total

use std::sync::{Arc, Future};
use std::mem::{transmute};
use std::raw::{Slice};
use std::num::{div_rem};
A Member commented:

The convention is to import the module and call it as num::div_rem, especially if the functions are only called once, like these are.

(The old code wasn't so idiomatic.)
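
A minimal sketch of the suggested style, using the 2014-era std::num API shown in the diff above (the function and variable names here are hypothetical):

use std::num;

// Call the free function through its module path instead of importing it.
fn lines_and_remainder(seq_len: uint, line_len: uint) -> (uint, uint) {
    num::div_rem(seq_len, line_len)
}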

huonw (Member) commented Oct 19, 2014

What is the granularity with which tasks are spawned? (It's either per-line, or per >, I can't quite tell.)

killercup (Member):

This should probably be added to the list in #18085.

kud1ing commented Oct 19, 2014

@killercup Done


/// Lookup tables.
static mut CPL16: [u16, ..1 << 16] = [0, ..1 << 16];
static mut CPL8: [u8, ..1 << 8] = [0, ..1 << 8];
A Member commented:

It would be nice for the benchmarks to show off idiomatic Rust code rather than copying C into rust, and usage of static mut globals isn't necessarily idiomatic. Does this take a significant perf hit compared to calculating the tables and putting them into an Arc?

mahkoh (author) replied:

There is no performance hit but only because the tables are so small.

fn gen_tables() -> Arc<Tables> {
    let mut t: Tables = unsafe { std::mem::uninitialized() };

    // ...

    Arc::new(t)
}

This first puts the table on the stack, then copies it to another place on the stack, and then copies it to the heap. I don't really want this kind of wasteful code in the benchmark.

alexcrichton (Member):

The vast quantities of unsafe code are a little unsettling in this benchmark. In general I would expect idiomatic Rust code to be shown off in the benchmarks game, and it's more idiomatic to create a safe abstraction which is implemented unsafely if necessary. For example the old TwoSideIterator in this example and the parallel function of the spectral-norm benchmark are both implemented unsafely but expose a safe interface.

Would it be possible to elide the same amount of bounds checks and such, but by writing this benchmark more safely? One could almost always just translate straight C to Rust, unsafety and all, but that seems somewhat counter to the spirit of the benchmarking game between languages.
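
As one illustration of that direction, a safe loop can already avoid per-element bounds checks; this is only a sketch in modern Rust syntax, not code from the PR:

// Iterate instead of indexing into the sequence, and index only into a
// fixed-size 256-entry table, where a u8-derived index can never be out of
// range, so the compiler can drop the check.
fn complement_in_place(seq: &mut [u8], cpl: &[u8; 256]) {
    for b in seq.iter_mut() {
        *b = cpl[*b as usize];
    }
}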

mahkoh (Contributor, author) commented Oct 19, 2014

Unfortunately I found a case (which does not appear in the benchmark) in which the faster algorithm won't work. You can still see that algorithm in the previous commit.

One big problem with the current code is that the assembly generated by Rust and LLVM is not very good, e.g., replacing

while *cur != b'>' {
    cur = cur.offset(-1);
}

by

asm!("
    .align 16, 0x90
XOAEUSTNH:
    dec $0
    cmpb $$62, ($0)
    jne XOAEUSTNH" : "+r"(cur));

gives you a massive speedup. Clang does generate this kind of code but rustc does not.

mahkoh (Contributor, author) commented Oct 19, 2014

I'm pretty annoyed by the compiler actively deoptimizing my code and won't try to improve this particular benchmark anymore (turns out that I did). Even if you copy C code and put everything in a big unsafe block the generated code is still significantly slower. The current version is not as fast as the version announced in #18143 (comment) and I don't think you can write a version as fast as the C version in Rust right now, unsafe code or not.

mahkoh (Contributor, author) commented Oct 19, 2014

Here is another case where rustc goes crazy:

    let mut end = data.as_ptr().offset(data.len() as int - 1);
    let mut cur = end;
    loop {
        if *cur != b'>' {
            // asm!("");
            cur = std::intrinsics::offset(cur, -1);
            continue;
        }

Uncommenting the empty asm! will give you a nice speedup in the generated code.

eddyb (Member) commented Oct 19, 2014

cc @zwarich @dotdash - any ideas what is going on here?
One guess I do have is that indexing (v[i++]) is some sort of canonical form and LLVM can convert to it from pointer bumping (*p++), but for some reason it can't get to the more efficient (pointer bumping) form, at least not from rustc IR.
We could test this hypothesis by isolating the loop in both Rust and C, and getting the IR from rustc and clang, respectively.


/// Generates the tables.
fn gen_tables() {
unsafe {
A Contributor commented:

That's a lot of unsafe code for a function that will be used only once for a very small amount of time. I personally find make_complements() in the old version much simpler, but I may be biased as I wrote it ;-).

TeXitoi (Contributor) commented Oct 19, 2014

What is the impact of using a 64k lookup table? Is it your own idea or did any other shootout benchmark use it?

And thanks for working on this benchmark!

mahkoh (Contributor, author) commented Oct 19, 2014

The C++ code does the same thing. Our reverse code is significantly faster than the C code.
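
For readers unfamiliar with the trick, the following sketch shows how a 16-bit table can be built from the 8-bit complement table so that one lookup handles two characters at once (modern Rust syntax, hypothetical names; the exact byte order depends on how the buffer is read, so this is not necessarily the PR's encoding):

fn build_cpl16(cpl8: &[u8; 256]) -> Vec<u16> {
    let mut cpl16 = vec![0u16; 1 << 16];
    for i in 0..(1u32 << 16) {
        let lo = cpl8[(i & 0xff) as usize] as u16;
        let hi = cpl8[(i >> 8) as usize] as u16;
        // Complement both bytes and swap their order, so one table lookup
        // produces two characters of the reverse complement.
        cpl16[i as usize] = (lo << 8) | hi;
    }
    cpl16
}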

mahkoh (Contributor, author) commented Oct 19, 2014

~/rust/reverse-complement$ perf stat ./rust < ./input.txt > /dev/null

 Performance counter stats for './rust':

        758.568211      task-clock (msec)         #    1.140 CPUs utilized          
                80      context-switches          #    0.105 K/sec                  
                32      cpu-migrations            #    0.042 K/sec                  
           123,117      page-faults               #    0.162 M/sec                  
     1,580,569,070      cycles                    #    2.084 GHz                    
       849,347,730      stalled-cycles-frontend   #   53.74% frontend cycles idle   
       587,892,723      stalled-cycles-backend    #   37.20% backend  cycles idle   
     2,107,021,931      instructions              #    1.33  insns per cycle        
                                                  #    0.40  stalled cycles per insn
       330,187,928      branches                  #  435.278 M/sec                  
           264,699      branch-misses             #    0.08% of all branches        

       0.665397382 seconds time elapsed

~/rust/reverse-complement$ perf stat ./c < ./input.txt > /dev/null   

 Performance counter stats for './c':

        659.239763      task-clock (msec)         #    1.265 CPUs utilized          
               103      context-switches          #    0.156 K/sec                  
                25      cpu-migrations            #    0.038 K/sec                  
            60,101      page-faults               #    0.091 M/sec                  
     1,367,562,395      cycles                    #    2.074 GHz                    
       491,611,151      stalled-cycles-frontend   #   35.95% frontend cycles idle   
       288,809,513      stalled-cycles-backend    #   21.12% backend  cycles idle   
     2,555,179,319      instructions              #    1.87  insns per cycle        
                                                  #    0.19  stalled cycles per insn
       480,482,167      branches                  #  728.843 M/sec                  
           134,052      branch-misses             #    0.03% of all branches        

       0.521284864 seconds time elapsed

We use fewer instructions than the C code and have fewer branches, but more branch misses and more page faults.

TeXitoi (Contributor) commented Oct 20, 2014

Maybe that's because of checked_add in range_step?

mahkoh (Contributor, author) commented Oct 20, 2014

This code is actually faster than the C program when compiled with Clang. But the C++ program is still faster.

mahkoh (Contributor, author) commented Oct 20, 2014

I've made a small change in the last version where I preallocate 300MB for the input instead of gradually increasing the capacity. The performance impact is massive. Not only is this version much faster than both the C and the C++ versions, even reallocating once will make it slower than the C version again. Is this a problem with jemalloc? cc @thestinger
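
A minimal sketch of that change (modern Rust syntax and a hypothetical helper name, not the PR's code; the 300MB figure is the one mentioned above):

use std::io::Read;

// Size the buffer up front so read_to_end never reallocates while stdin is
// consumed; if the input is larger than the guess, Vec falls back to growing.
fn read_stdin_preallocated(capacity: usize) -> std::io::Result<Vec<u8>> {
    let mut buf = Vec::with_capacity(capacity); // e.g. 300 * 1024 * 1024
    std::io::stdin().lock().read_to_end(&mut buf)?;
    Ok(buf)
}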

mahkoh (Contributor, author) commented Oct 20, 2014

~/rust/reverse-complement$ time ./rust < ./input.txt > /dev/null
./rust < ./input.txt > /dev/null 0.28s user 0.22s system 124% cpu 0.397 total
~/rust/reverse-complement$ time ./c < ./input.txt > /dev/null   # gcc
./c < ./input.txt > /dev/null 0.40s user 0.24s system 128% cpu 0.497 total

dotdash (Contributor) commented Oct 20, 2014

One big problem with the current code is that the assembly generated by Rust and LLVM is not very good, e.g., replacing

while *cur != b'>' {
    cur = cur.offset(-1);
}

by

asm!("
    .align 16, 0x90
XOAEUSTNH:
    dec $0
    cmpb $$62, ($0)
    jne XOAEUSTNH" : "+r"(cur));

gives you a massive speedup. Clang does generate this kind of code but rustc does not.

I guess that is with clang 3.4? LLVM stopped emitting cmpb by default in 3.5; LLVM 3.5 only uses cmpb when the function has the minsize attribute. In a small test case, I get similar asm from rustc and clang-3.5:

Rust code:

#![crate_type="lib"]

extern {
    fn black(_: *const u8);
}

pub unsafe fn test() {
    let data = std::io::stdin().read_to_end().unwrap();
    let mut end = data.as_ptr().offset(data.len() as int - 1);
    let mut cur = end;
    loop {
        while *cur != b'>' {
            cur = cur.offset(-1);
        }

        black(cur);
    }
}

Resulting loop asm:

.LBB0_64:
    movq    %rbx, %rdi
    callq   black@PLT
.LBB0_65:
    movzbl  (%rbx), %eax
    cmpl    $62, %eax
    je  .LBB0_64
    decq    %rbx
    jmp .LBB0_65

C code:

void black(char *);

long len;
char *data;

void test(void) {
    char *end = data + len;
    char *cur = end;
    for (;;) {
        while (*cur != '>')
            cur = cur - 1;

        black(cur);
    }

}

With clang-3.5:

.LBB0_2:                                # %._crit_edge
                                        #   in Loop: Header=BB0_1 Depth=1
    decq    %rbx
.LBB0_1:                                # %.outer
                                        # =>This Inner Loop Header: Depth=1
    movzbl  (%rbx), %eax
    cmpl    $62, %eax
    jne .LBB0_2
# BB#3:                                 # %.lr.ph
                                        #   in Loop: Header=BB0_1 Depth=1
    movq    %rbx, %rdi
    callq   black
    jmp .LBB0_1

With clang-3.4:

.LBB0_2:                                # %._crit_edge
                                        #   in Loop: Header=BB0_1 Depth=1
    decq    %rbx
.LBB0_1:                                # %.outer
                                        # =>This Inner Loop Header: Depth=1
    cmpb    $62, (%rbx)
    jne .LBB0_2
# BB#3:                                 # %.lr.ph
                                        #   in Loop: Header=BB0_1 Depth=1
    movq    %rbx, %rdi
    callq   black
    jmp .LBB0_1

arthurprs (Contributor):

I think the last change is a little too hackish. But it's a matter of taste.

mahkoh (Contributor, author) commented Oct 22, 2014

Which part?

arthurprs (Contributor):

The shared memory code.
Don't get me wrong, it's some awesome code. But I feel it kind of defeats the purpose of using Rust.

mahkoh (Contributor, author) commented Oct 22, 2014

This structure provides a safe way to share disjoint parts of a Vec mutably across tasks. I would actually expect something like this in the stdlib.
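
For what it's worth, a rough sketch of the same idea in terms of the current standard library (chunks_mut plus scoped threads); this is a hypothetical helper in modern Rust syntax, not the PR's shared_memory module:

use std::thread;

// Hand disjoint mutable chunks of one buffer to separate threads. Safety
// comes from chunks_mut guaranteeing the slices never overlap and from the
// scope joining every thread before the borrow of `buf` ends.
fn for_each_chunk_in_parallel(buf: &mut [u8], chunk_len: usize) {
    thread::scope(|s| {
        for chunk in buf.chunks_mut(chunk_len) {
            s.spawn(move || {
                // Stand-in for the per-chunk work (e.g. reverse-complementing).
                for b in chunk.iter_mut() {
                    *b = b.wrapping_add(1);
                }
            });
        }
    });
}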

thestinger (Contributor):

I don't think it makes sense to call the libc allocator directly. It's not going to accomplish anything once the prefix issue is fixed anyway... You're really just falling through to mremap with a large amount of allocator overhead.

mahkoh (Contributor, author) commented Oct 22, 2014

@thestinger: Allocating the memory with a 2x growth factor with jemalloc already takes 0.45 seconds on my machine. With glibc's allocator the whole program only takes 0.39 seconds (0.25 seconds for the allocation.)

Is there a way to replace the default allocator by the libc allocator so that I can use Vec?

thestinger (Contributor):

@mahkoh: As I've explained, the issue is specific to Linux where there is an mremap function available to allocators. You will be making it slower on all other platforms by doing it this way. There are good reasons for jemalloc avoiding mremap as it exists today. It would be faster to bypass glibc and use mremap directly and that is going to be the only option once the jemalloc prefix issue is worked around and calling malloc uses the same underlying allocator as heap::allocate.

thestinger (Contributor):

Any code calling malloc / realloc / free directly for a reason other than compatibility with C doesn't make sense. I don't see a reason to work around a performance issue that's fixable for all Rust code via a workaround in jemalloc's chunk allocator until the Linux kernel API for mremap is improved.

mahkoh (Contributor, author) commented Oct 22, 2014

@thestinger: The benchmarks game runs on Linux and I think it makes sense to optimize for one platform in this case. This is the reason we are faster than C and C++ right now. Once the performance issue has been fixed on the distros on which the benchmarks run, I'll gladly replace the manual allocations with a Vec.

thestinger (Contributor):

You don't seem to be reading my comments. To spell it out clearly:

  • using the C standard library allocation API is not required to work around the performance issue today
  • using the C standard library allocation API is going to use jemalloc in the future

As long as it contains this workaround, it's going to get an r- from me.

mahkoh (Contributor, author) commented Oct 22, 2014

calling the C standard library allocation API is not required to work around the performance issue

If you can fix this for all Rust programs, then there is no reason for me to use a workaround. I already said this above.

This is just a benchmark and I think there is no reason not to use a workaround for the moment. I already asked for a way to replace the default allocator so that we can quickly switch this program back to jemalloc when it's ready.

Anyway, a growth factor of 16 makes jemalloc faster than glibc and doesn't use more memory for the largest benchmark. I'll use this for now.
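
A sketch of what that growth-factor workaround could look like (hypothetical helper in modern Rust syntax, not the PR's code):

// When the buffer is full, grow it 16x instead of 2x so the allocator has to
// reallocate (and potentially copy) far fewer times while stdin is read.
fn ensure_room(buf: &mut Vec<u8>, incoming: usize) {
    let needed = buf.len() + incoming;
    if needed > buf.capacity() {
        let target = needed.max(buf.capacity().saturating_mul(16));
        buf.reserve_exact(target - buf.len());
    }
}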

thestinger (Contributor):

This is just a benchmark and I think there is no reason not to use a workaround for the moment. I already asked for a way to replace the default allocator so that we can quickly switch this program back to jemalloc when it's ready.

By default, jemalloc replaces the libc allocator but Rust adds a je_ prefix which disables the hooks. Using two general purpose allocators in the same process causes significant external fragmentation and performance issues so fixing this is a priority. It only involves changing a few lines in the jemalloc configure script / code. I expect it to be fixed when jemalloc is updated again within a few weeks.

mahkoh (Contributor, author) commented Oct 22, 2014

Updated with LibcBox replaced by Vec. The performance might be a little bit worse than before but we're still good.

mahkoh (Contributor, author) commented Oct 22, 2014

I don't think this will get much faster at this point. There are two low-hanging fruits left:

  • Use libc's memchr: this would give us another 10-20 ms while the program currently runs at 370 ms.
  • Waste more memory.

$ grep unsafe rust.rs | wc -l
~16
  • Three of them are in the tables module. I don't think recreating lookup tables every time you need them is good code. Putting them on the heap makes even less sense. Also note that putting them in an Arc will make many unnecessary copies. The tables module provides a safe way to access the tables.
  • Two are in the shared_memory module and both are necessary. You can't share mutable memory across tasks without unsafe right now. The shared_memory module provides a safe way to do this by guaranteeing that there are never two overlapping slices.
  • One is in read_to_end and necessary. The Reader method with the same name does the same thing but is much slower.
  • One is in memchr and necessary. You cannot get a memchr this fast without either using assembly or reading multiple bytes at once (or using libc's memchr); see the sketch after this list. One could avoid the pointer arithmetic by transmuting to a &[uint], but I'm not convinced that this would improve the code.
  • Two are in reverse_complement. The first one is necessary because LLVM won't replace a simple loop with a call to memmove. The second one is necessary because otherwise you get at least one additional branch per 4 bytes, which will be slower. That is also something LLVM could, but does not, optimize.
  • The one in main is there because Arc doesn't have a method necessary to make it safe.

Suggestions on how to fix these issues without giving up speed are welcome (except for the tables module).
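
For the memchr point above, a word-at-a-time version can be written without assembly; the following is only a sketch in modern Rust syntax with a made-up name, not the PR's memchr:

// Scan eight bytes per iteration with the classic "contains a zero byte"
// bit trick, then finish with a byte loop that also covers the word in
// which a match was detected.
fn memchr_swar(needle: u8, haystack: &[u8]) -> Option<usize> {
    const LO: u64 = 0x0101_0101_0101_0101;
    const HI: u64 = 0x8080_8080_8080_8080;
    let pat = LO * (needle as u64);

    let mut i = 0;
    while i + 8 <= haystack.len() {
        let mut bytes = [0u8; 8];
        bytes.copy_from_slice(&haystack[i..i + 8]);
        let x = u64::from_ne_bytes(bytes) ^ pat; // matching bytes become zero
        if (x.wrapping_sub(LO) & !x & HI) != 0 {
            break; // a matching byte is somewhere in this word
        }
        i += 8;
    }
    haystack[i..].iter().position(|&b| b == needle).map(|p| i + p)
}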

TeXitoi (Contributor) commented Oct 22, 2014

I propose an alternative to your shared_memory module.

Using the parallel function adapted from the one in shootout-spectralnorm by @alexcrichton:

  • write a DnaSeqIterMut iterator that iterates on the sequence chunks (implemented using your memchr)
  • change reverse_complement to take a mutable slice
  • combine that with the parallel function below

What do you think?

#![feature(unboxed_closures, overloaded_calls)]

use std::mem;
use std::raw::Repr;

// Executes a closure in parallel over the given iterator over mutable slice.
// The closure `f` is run in parallel with an element of `iter`.
fn parallel<'a, I, T, F>(mut iter: I, f: F)
        where T: Send + Sync,
              I: Iterator<&'a mut [T]>,
              F: Fn(&'a mut [T]) + Sync {
    let (tx, rx) = channel();
    for chunk in iter {
        let tx = tx.clone();

        // Need to convert `f` and `chunk` to something that can cross the task
        // boundary.
        let f = &f as *const _ as *const uint;
        let raw = chunk.repr();
        spawn(proc() {
            let f = f as *const F;
            unsafe { (*f)(mem::transmute(raw)) }
            drop(tx)
        });
    }
    drop(tx);
    for () in rx.iter() {}
}

fn main() {
    let mut v = Vec::from_fn(50, |i| i);
    parallel(v.as_mut_slice().chunks_mut(10), |&: s: &mut [uint]| {
        let i = s[0];
        for j in s.iter_mut() { *j = i; }
    });
    println!("{}", v);
}

TeXitoi (Contributor) commented Oct 22, 2014

And using libc's memchr seems simpler and faster, so I'd personally prefer using it.

arthurprs (Contributor):

Can't we [add, fix, optimize] some of those back into the Rust stdlib? Similar to what was done for [].reverse()?

mahkoh (Contributor, author) commented Oct 23, 2014

I've replaced the memchr.

I'm a bit worried about the rules of the benchmark.

read line-by-line a redirected FASTA format file from stdin

We're definitely not reading line by line. (Although we could do that by hiding everything behind a large BufferedReader, but this would involve another copy.) Here is the currently fastest C code:

   size_t buflen = 1024, len, end = 0;
   char *buf = malloc(1024);

   int in = fileno(stdin);
   while ((len = read(in, buf + end, buflen - 256 - end))) {
      end += len;
      if (end < buflen - 256) break;
      buf = realloc(buf, buflen *= 2);
   }

That's not line by line either.

TeXitoi (Contributor) commented Oct 24, 2014

@mahkoh no problem according to the rules; almost everyone, the current Rust version included, reads stdin into a buffer until EOF.

mahkoh (Contributor, author) commented Oct 24, 2014

TeXitoi (Contributor) commented Oct 24, 2014

I think that's because they were stricter about this rule some time ago (and this version was proposed in 2010, as you can see in the comment).

alexcrichton (Member):

@mahkoh could you squash the commits down into one? I think that the failure on android may be spurious otherwise.

mahkoh (Contributor, author) commented Oct 24, 2014

The old file contained a line about ignoring the android tests. Rebased and restored.

bors added a commit that referenced this pull request Oct 25, 2014
Lots of unsafe code and lots of branches removed. Also multithreaded.

Rust old: 1.208 seconds
Rust new: 0.761 seconds
C: 0.632 seconds
bors added a commit that referenced this pull request Oct 26, 2014
Lots of unsafe code and lots of branches removed. Also multithreaded.

Rust old: 1.208 seconds
Rust new: 0.761 seconds
C: 0.632 seconds
@bors bors closed this Oct 26, 2014