Improve shootout-reverse-complement #18143

Closed

mahkoh wants to merge 2 commits

Conversation

mahkoh (Contributor) commented Oct 18, 2014

Lots of unsafe code and lots of branches removed. Also multithreaded.

Rust old: 1.208 seconds
Rust new: 0.761 seconds
C: 0.632 seconds

if seq.len() % 2 == 1 {
let middle = &mut seq[seq.len() / 2];
*middle = complements[*middle as uint];
/// Returns a mutable slice into the contained vector.
A Member commented:

This is extremely unsafe.

mahkoh (Contributor, author) commented Oct 19, 2014

For some reason LLVM produces very bad code when doing pointer arithmetic. The commented-out variant in the following snippet is much faster than the normal variant:

    unsafe {
        let mut left = seq.as_mut_ptr();
        let mut right = left.offset(len as int);
        while left < right {
            //asm!("dec $0" : "+r"(right));
            right = right.offset(-1);
            let tmp = COMPLEMENTS[*left as uint];
            *left = COMPLEMENTS[*right as uint];
            *right = tmp;
            //asm!("inc $0" : "+r"(left));
            left = left.offset(1);
        }
    }

cc @thestinger @eddyb
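
For comparison, the same swap can be written with safe slice indexing; the following is only a sketch in modern Rust syntax with a made-up function name, not code from the PR:

fn reverse_complement_in_place(seq: &mut [u8], complements: &[u8; 256]) {
    let mut left = 0;
    let mut right = seq.len();
    while left < right {
        right -= 1;
        // Swap and complement the two ends; for odd lengths the middle
        // element meets itself and is complemented exactly once.
        let tmp = complements[seq[left] as usize];
        seq[left] = complements[seq[right] as usize];
        seq[right] = tmp;
        left += 1;
    }
}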

mahkoh (Contributor, author) commented Oct 19, 2014

Lo and behold:

~/rust/reverse-complement$ time ./rust < /tmp/input.txt > /dev/null
./rust < /tmp/input.txt > /dev/null 0.34s user 0.19s system 120% cpu 0.437 total

~/rust/reverse-complement$ time ./c < /tmp/input.txt > /dev/null
./c < /tmp/input.txt > /dev/null 0.40s user 0.18s system 119% cpu 0.485 total

~/rust/reverse-complement$ time ./cxx < /tmp/input.txt > /dev/null
./cxx < /tmp/input.txt > /dev/null 0.43s user 0.17s system 120% cpu 0.498 total

use std::sync::{Arc, Future};
use std::mem::{transmute};
use std::raw::{Slice};
use std::num::{div_rem};
A Member commented:

The convention is to import the module and call it as num::div_rem, especially if the functions are only called once, like these are.

(The old code wasn't so idiomatic.)
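
A minimal sketch of the suggested style, using the 2014-era std::num API shown in the diff above (the function and variable names here are hypothetical):

use std::num;

// Call the free function through its module path instead of importing it.
fn lines_and_remainder(seq_len: uint, line_len: uint) -> (uint, uint) {
    num::div_rem(seq_len, line_len)
}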

huonw (Member) commented Oct 19, 2014

What is the granularity with which tasks are spawned? (It's either per-line, or per >, I can't quite tell.)

killercup (Member):

This should probably be added to the list in #18085.

kud1ing commented Oct 19, 2014

@killercup Done


/// Lookup tables.
static mut CPL16: [u16, ..1 << 16] = [0, ..1 << 16];
static mut CPL8: [u8, ..1 << 8] = [0, ..1 << 8];
A Member commented:

It would be nice for the benchmarks to show off idiomatic Rust code rather than copying C into rust, and usage of static mut globals isn't necessarily idiomatic. Does this take a significant perf hit compared to calculating the tables and putting them into an Arc?

mahkoh (author) replied:

There is no performance hit but only because the tables are so small.

fn gen_tables() -> Arc<Tables> {
    let mut t: Tables = unsafe { std::mem::uninitialized() };

    // ...

    Arc::new(t)
}

This first puts the table on the stack, then copies it to another place on the stack, and then copies it to the heap. I don't really want this kind of wasteful code in the benchmark.

alexcrichton (Member):

The vast quantities of unsafe code are a little unsettling in this benchmark. In general I would expect idiomatic Rust code to be shown off in the benchmarks game, and it's more idiomatic to create a safe abstraction which is implemented unsafely if necessary. For example the old TwoSideIterator in this example and the parallel function of the spectral-norm benchmark are both implemented unsafely but expose a safe interface.

Would it be possible to elide the same amount of bounds checks and such, but by writing this benchmark more safely? One could almost always just translate straight C to Rust, unsafety and all, but that seems somewhat counter to the spirit of the benchmarking game between languages.
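
As one illustration of that direction, a safe loop can already avoid per-element bounds checks; this is only a sketch in modern Rust syntax, not code from the PR:

// Iterate instead of indexing into the sequence, and index only into a
// fixed-size 256-entry table, where a u8-derived index can never be out of
// range, so the compiler can drop the check.
fn complement_in_place(seq: &mut [u8], cpl: &[u8; 256]) {
    for b in seq.iter_mut() {
        *b = cpl[*b as usize];
    }
}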

mahkoh (Contributor, author) commented Oct 19, 2014

Unfortunately I found a case (which does not appear in the benchmark) in which the faster algorithm won't work. You can still see that algorithm in the previous commit.

One big problem with the current code is that the assembly generated by Rust and LLVM is not very good, e.g., replacing

while *cur != b'>' {
    cur = cur.offset(-1);
}

by

asm!("
    .align 16, 0x90
XOAEUSTNH:
    dec $0
    cmpb $$62, ($0)
    jne XOAEUSTNH" : "+r"(cur));

gives you a massive speedup. Clang does generate this kind of code but rustc does not.

mahkoh (Contributor, author) commented Oct 19, 2014

I'm pretty annoyed by the compiler actively deoptimizing my code and won't try to improve this particular benchmark anymore (turns out that I did). Even if you copy C code and put everything in a big unsafe block the generated code is still significantly slower. The current version is not as fast as the version announced in #18143 (comment) and I don't think you can write a version as fast as the C version in Rust right now, unsafe code or not.

mahkoh (Contributor, author) commented Oct 19, 2014

Here is another case where rustc goes crazy:

    let mut end = data.as_ptr().offset(data.len() as int - 1);
    let mut cur = end;
    loop {
        if *cur != b'>' {
            // asm!("");
            cur = std::intrinsics::offset(cur, -1);
            continue;
        }

Uncommenting the empty asm! will give you a nice speedup in the generated code.

eddyb (Member) commented Oct 19, 2014

cc @zwarich @dotdash - any ideas what is going on here?
One guess I do have is that indexing (v[i++]) is some sort of canonical form and LLVM can convert to it from pointer bumping (*p++), but for some reason it can't get to the more efficient (pointer bumping) form, at least not from rustc IR.
We could test this hypothesis by isolating the loop in both Rust and C, and getting the IR from rustc and clang, respectively.


/// Generates the tables.
fn gen_tables() {
unsafe {
A Contributor commented:

That's a lot of unsafe code for a function that will be used only once for a very small amount of time. I personally find make_complements() in the old version much simpler, but I may be biased as I wrote it ;-).

TeXitoi (Contributor) commented Oct 19, 2014

What is the impact of using a 64k lookup table? Is it your own idea or did any other shootout benchmark use it?

And thanks for working on this benchmark!

mahkoh (Contributor, author) commented Oct 19, 2014

The C++ code does the same thing. Our reverse code is significantly faster than the C code.
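
For readers unfamiliar with the trick, the following sketch shows how a 16-bit table can be built from the 8-bit complement table so that one lookup handles two characters at once (modern Rust syntax, hypothetical names; the exact byte order depends on how the buffer is read, so this is not necessarily the PR's encoding):

fn build_cpl16(cpl8: &[u8; 256]) -> Vec<u16> {
    let mut cpl16 = vec![0u16; 1 << 16];
    for i in 0..(1u32 << 16) {
        let lo = cpl8[(i & 0xff) as usize] as u16;
        let hi = cpl8[(i >> 8) as usize] as u16;
        // Complement both bytes and swap their order, so one table lookup
        // produces two characters of the reverse complement.
        cpl16[i as usize] = (lo << 8) | hi;
    }
    cpl16
}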

mahkoh (Contributor, author) commented Oct 19, 2014

~/rust/reverse-complement$ perf stat ./rust < ./input.txt > /dev/null

 Performance counter stats for './rust':

        758.568211      task-clock (msec)         #    1.140 CPUs utilized          
                80      context-switches          #    0.105 K/sec                  
                32      cpu-migrations            #    0.042 K/sec                  
           123,117      page-faults               #    0.162 M/sec                  
     1,580,569,070      cycles                    #    2.084 GHz                    
       849,347,730      stalled-cycles-frontend   #   53.74% frontend cycles idle   
       587,892,723      stalled-cycles-backend    #   37.20% backend  cycles idle   
     2,107,021,931      instructions              #    1.33  insns per cycle        
                                                  #    0.40  stalled cycles per insn
       330,187,928      branches                  #  435.278 M/sec                  
           264,699      branch-misses             #    0.08% of all branches        

       0.665397382 seconds time elapsed

~/rust/reverse-complement$ perf stat ./c < ./input.txt > /dev/null   

 Performance counter stats for './c':

        659.239763      task-clock (msec)         #    1.265 CPUs utilized          
               103      context-switches          #    0.156 K/sec                  
                25      cpu-migrations            #    0.038 K/sec                  
            60,101      page-faults               #    0.091 M/sec                  
     1,367,562,395      cycles                    #    2.074 GHz                    
       491,611,151      stalled-cycles-frontend   #   35.95% frontend cycles idle   
       288,809,513      stalled-cycles-backend    #   21.12% backend  cycles idle   
     2,555,179,319      instructions              #    1.87  insns per cycle        
                                                  #    0.19  stalled cycles per insn
       480,482,167      branches                  #  728.843 M/sec                  
           134,052      branch-misses             #    0.03% of all branches        

       0.521284864 seconds time elapsed

We use fewer instructions than the C code and have fewer branches, but more branch misses and more page faults.

TeXitoi (Contributor) commented Oct 20, 2014

Maybe that's because of checked_add in range_step?

mahkoh (Contributor, author) commented Oct 20, 2014

This code is actually faster than the C program when compiled with Clang. But the C++ program is still faster.

mahkoh (Contributor, author) commented Oct 20, 2014

I've made a small change in the last version where I preallocate 300MB for the input instead of gradually increasing the capacity. The performance impact is massive. Not only is this version much faster than both the C and the C++ versions, even reallocating once will make it slower than the C version again. Is this a problem with jemalloc? cc @thestinger
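
A minimal sketch of that change (modern Rust syntax and a hypothetical helper name, not the PR's code; the 300MB figure is the one mentioned above):

use std::io::Read;

// Size the buffer up front so read_to_end never reallocates while stdin is
// consumed; if the input is larger than the guess, Vec falls back to growing.
fn read_stdin_preallocated(capacity: usize) -> std::io::Result<Vec<u8>> {
    let mut buf = Vec::with_capacity(capacity); // e.g. 300 * 1024 * 1024
    std::io::stdin().lock().read_to_end(&mut buf)?;
    Ok(buf)
}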

mahkoh (Contributor, author) commented Oct 20, 2014

~/rust/reverse-complement$ time ./rust < ./input.txt > /dev/null
./rust < ./input.txt > /dev/null 0.28s user 0.22s system 124% cpu 0.397 total
~/rust/reverse-complement$ time ./c < ./input.txt > /dev/null   # gcc
./c < ./input.txt > /dev/null 0.40s user 0.24s system 128% cpu 0.497 total

dotdash (Contributor) commented Oct 20, 2014

One big problem with the current code is that the assembly generated by Rust and LLVM is not very good, e.g., replacing

while *cur != b'>' {
    cur = cur.offset(-1);
}

by

asm!("
    .align 16, 0x90
XOAEUSTNH:
    dec $0
    cmpb $$62, ($0)
    jne XOAEUSTNH" : "+r"(cur));

gives you a massive speedup. Clang does generate this kind of code but rustc does not.

I guess that is with clang 3.4? LLVM stopped emitting cmpb by default in 3.5; LLVM 3.5 only uses cmpb when the function has the minsize attribute. In a small test case, I get similar asm from rustc and clang-3.5:

Rust code:

#![crate_type="lib"]

extern {
    fn black(_: *const u8);
}

pub unsafe fn test() {
    let data = std::io::stdin().read_to_end().unwrap();
    let mut end = data.as_ptr().offset(data.len() as int - 1);
    let mut cur = end;
    loop {
        while *cur != b'>' {
            cur = cur.offset(-1);
        }

        black(cur);
    }
}

Resulting loop asm:

.LBB0_64:
    movq    %rbx, %rdi
    callq   black@PLT
.LBB0_65:
    movzbl  (%rbx), %eax
    cmpl    $62, %eax
    je  .LBB0_64
    decq    %rbx
    jmp .LBB0_65

C code:

void black(char *);

long len;
char *data;

void test(void) {
    char *end = data + len;
    char *cur = end;
    for (;;) {
        while (*cur != '>')
            cur = cur - 1;

        black(cur);
    }

}

With clang-3.5:

.LBB0_2:                                # %._crit_edge
                                        #   in Loop: Header=BB0_1 Depth=1
    decq    %rbx
.LBB0_1:                                # %.outer
                                        # =>This Inner Loop Header: Depth=1
    movzbl  (%rbx), %eax
    cmpl    $62, %eax
    jne .LBB0_2
# BB#3:                                 # %.lr.ph
                                        #   in Loop: Header=BB0_1 Depth=1
    movq    %rbx, %rdi
    callq   black
    jmp .LBB0_1

With clang-3.4:

.LBB0_2:                                # %._crit_edge
                                        #   in Loop: Header=BB0_1 Depth=1
    decq    %rbx
.LBB0_1:                                # %.outer
                                        # =>This Inner Loop Header: Depth=1
    cmpb    $62, (%rbx)
    jne .LBB0_2
# BB#3:                                 # %.lr.ph
                                        #   in Loop: Header=BB0_1 Depth=1
    movq    %rbx, %rdi
    callq   black
    jmp .LBB0_1

arthurprs (Contributor):

I think the last change is a little too hackish. But it's a matter of taste.

mahkoh (Contributor, author) commented Oct 22, 2014

Which part?

arthurprs (Contributor):

The shared memory code.
Don't get me wrong, it's some awesome code. But I feel it kind of defeats the purpose of using Rust.

mahkoh (Contributor, author) commented Oct 22, 2014

This structure provides a safe way to share disjoint parts of a Vec mutably across tasks. I would actually expect something like this in the stdlib.
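
For what it's worth, a rough sketch of the same idea in terms of the current standard library (chunks_mut plus scoped threads); this is a hypothetical helper in modern Rust syntax, not the PR's shared_memory module:

use std::thread;

// Hand disjoint mutable chunks of one buffer to separate threads. Safety
// comes from chunks_mut guaranteeing the slices never overlap and from the
// scope joining every thread before the borrow of `buf` ends.
fn for_each_chunk_in_parallel(buf: &mut [u8], chunk_len: usize) {
    thread::scope(|s| {
        for chunk in buf.chunks_mut(chunk_len) {
            s.spawn(move || {
                // Stand-in for the per-chunk work (e.g. reverse-complementing).
                for b in chunk.iter_mut() {
                    *b = b.wrapping_add(1);
                }
            });
        }
    });
}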

thestinger (Contributor):

I don't think it makes sense to call the libc allocator directly. It's not going to accomplish anything once the prefix issue is fixed anyway... You're really just falling through to mremap with a large amount of allocator overhead.

mahkoh (Contributor, author) commented Oct 22, 2014

@thestinger: Allocating the memory with a 2x growth factor with jemalloc already takes 0.45 seconds on my machine. With glibc's allocator the whole program only takes 0.39 seconds (0.25 seconds for the allocation.)

Is there a way to replace the default allocator by the libc allocator so that I can use Vec?

thestinger (Contributor):

@mahkoh: As I've explained, the issue is specific to Linux where there is an mremap function available to allocators. You will be making it slower on all other platforms by doing it this way. There are good reasons for jemalloc avoiding mremap as it exists today. It would be faster to bypass glibc and use mremap directly and that is going to be the only option once the jemalloc prefix issue is worked around and calling malloc uses the same underlying allocator as heap::allocate.

thestinger (Contributor):

Any code calling malloc / realloc / free directly for a reason other than compatibility with C doesn't make sense. I don't see a reason to work around a performance issue that's fixable for all Rust code via a workaround in jemalloc's chunk allocator until the Linux kernel API for mremap is improved.

mahkoh (Contributor, author) commented Oct 22, 2014

@thestinger: The benchmarks game runs on Linux and I think it makes sense to optimize for one platform in this case. This is the reason we are faster than C and C++ right now. Once the performance issue has been fixed on the distros on which the benchmarks run, I'll gladly replace the manual allocations with a Vec.

thestinger (Contributor):

You don't seem to be reading my comments. To spell it out clearly:

  • using the C standard library allocation API is not required to work around the performance issue today
  • using the C standard library allocation API is going to use jemalloc in the future

As long as it contains this workaround, it's going to get an r- from me.

mahkoh (Contributor, author) commented Oct 22, 2014

calling the C standard library allocation API is not required to work around the performance issue

If you can fix this for all Rust programs, then there is no reason for me to use a workaround. I already said this above.

This is just a benchmark and I think there is no reason not to use a workaround for the moment. I already asked for a way to replace the default allocator so that we can quickly switch this program back to jemalloc when it's ready.

Anyway, a growth factor of 16 makes jemalloc faster than glibc and doesn't use more memory for the largest benchmark. I'll use this for now.
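
A sketch of what that growth-factor workaround could look like (hypothetical helper in modern Rust syntax, not the PR's code):

// When the buffer is full, grow it 16x instead of 2x so the allocator has to
// reallocate (and potentially copy) far fewer times while stdin is read.
fn ensure_room(buf: &mut Vec<u8>, incoming: usize) {
    let needed = buf.len() + incoming;
    if needed > buf.capacity() {
        let target = needed.max(buf.capacity().saturating_mul(16));
        buf.reserve_exact(target - buf.len());
    }
}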

thestinger (Contributor):

This is just a benchmark and I think there is no reason not to use a workaround for the moment. I already asked for a way to replace the default allocator so that we can quickly switch this program back to jemalloc when it's ready.

By default, jemalloc replaces the libc allocator but Rust adds a je_ prefix which disables the hooks. Using two general purpose allocators in the same process causes significant external fragmentation and performance issues so fixing this is a priority. It only involves changing a few lines in the jemalloc configure script / code. I expect it to be fixed when jemalloc is updated again within a few weeks.

mahkoh (Contributor, author) commented Oct 22, 2014

Updated with LibcBox replaced by Vec. The performance might be a little bit worse than before but we're still good.

mahkoh (Contributor, author) commented Oct 22, 2014

I don't think this will get much faster at this point. There are two low-hanging fruits left:

  • Use libc's memchr: this would give us another 10-20 ms while the program currently runs at 370 ms.
  • Waste more memory.

$ grep unsafe rust.rs | wc -l
~16
  • Three of them are in the tables module. I don't think recreating lookup tables every time you need them is good code. Putting them on the heap makes even less sense. Also note that putting them in an Arc will make many unnecessary copies. The tables module provides a safe way to access the tables.
  • Two are in the shared_memory module and both are necessary. You can't share mutable memory across tasks without unsafe right now. The shared_memory module provides a safe way to do this by guaranteeing that there are never two overlapping slices.
  • One is in read_to_end and necessary. The Reader method with the same name does the same thing but is much slower.
  • One is in memchr and necessary. You cannot get a memchr this fast without either using assembly or reading multiple bytes at once (or using libc's memchr); see the sketch after this list. One could avoid the pointer arithmetic by transmuting to a &[uint], but I'm not convinced that this would improve the code.
  • Two are in reverse_complement. The first one is necessary because LLVM won't replace a simple loop with a call to memmove. The second one is necessary because otherwise you get at least one additional branch per 4 bytes, which will be slower. That is also something LLVM could, but does not, optimize.
  • The one in main is there because Arc doesn't have a method necessary to make it safe.

Suggestions on how to fix these issues without giving up speed are welcome (except for the tables module).
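
For the memchr point above, a word-at-a-time version can be written without assembly; the following is only a sketch in modern Rust syntax with a made-up name, not the PR's memchr:

// Scan eight bytes per iteration with the classic "contains a zero byte"
// bit trick, then finish with a byte loop that also covers the word in
// which a match was detected.
fn memchr_swar(needle: u8, haystack: &[u8]) -> Option<usize> {
    const LO: u64 = 0x0101_0101_0101_0101;
    const HI: u64 = 0x8080_8080_8080_8080;
    let pat = LO * (needle as u64);

    let mut i = 0;
    while i + 8 <= haystack.len() {
        let mut bytes = [0u8; 8];
        bytes.copy_from_slice(&haystack[i..i + 8]);
        let x = u64::from_ne_bytes(bytes) ^ pat; // matching bytes become zero
        if (x.wrapping_sub(LO) & !x & HI) != 0 {
            break; // a matching byte is somewhere in this word
        }
        i += 8;
    }
    haystack[i..].iter().position(|&b| b == needle).map(|p| i + p)
}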

TeXitoi (Contributor) commented Oct 22, 2014

I propose an alternative to your shared_memory module.

Using the parallel function adapted from the one in shootout-spectralnorm by @alexcrichton:

  • write a DnaSeqIterMut iterator that iterates on the sequence chunks (implemented using your memchr)
  • change reverse_complement to take a mutable slice
  • combine that with the parallel function below

What do you think?

#![feature(unboxed_closures, overloaded_calls)]

use std::mem;
use std::raw::Repr;

// Executes a closure in parallel over the given iterator over mutable slice.
// The closure `f` is run in parallel with an element of `iter`.
fn parallel<'a, I, T, F>(mut iter: I, f: F)
        where T: Send + Sync,
              I: Iterator<&'a mut [T]>,
              F: Fn(&'a mut [T]) + Sync {
    let (tx, rx) = channel();
    for chunk in iter {
        let tx = tx.clone();

        // Need to convert `f` and `chunk` to something that can cross the task
        // boundary.
        let f = &f as *const _ as *const uint;
        let raw = chunk.repr();
        spawn(proc() {
            let f = f as *const F;
            unsafe { (*f)(mem::transmute(raw)) }
            drop(tx)
        });
    }
    drop(tx);
    for () in rx.iter() {}
}

fn main() {
    let mut v = Vec::from_fn(50, |i| i);
    parallel(v.as_mut_slice().chunks_mut(10), |&: s: &mut [uint]| {
        let i = s[0];
        for j in s.iter_mut() { *j = i; }
    });
    println!("{}", v);
}

TeXitoi (Contributor) commented Oct 22, 2014

And using libc's memchr seems simpler and faster, so I'd personally prefer using it.

arthurprs (Contributor):

Can't we [add, fix, optimize] some of those back into the Rust stdlib? Similar to what was done for [].reverse()?

mahkoh (Contributor, author) commented Oct 23, 2014

I've replaced the memchr.

I'm a bit worried about the rules of the benchmark.

read line-by-line a redirected FASTA format file from stdin

We're definitely not reading line by line. (Although we could do that by hiding everything behind a large BufferedReader, but this would involve another copy.) Here is the currently fastest C code:

   size_t buflen = 1024, len, end = 0;
   char *buf = malloc(1024);

   int in = fileno(stdin);
   while ((len = read(in, buf + end, buflen - 256 - end))) {
      end += len;
      if (end < buflen - 256) break;
      buf = realloc(buf, buflen *= 2);
   }

That's not line by line either.

TeXitoi (Contributor) commented Oct 24, 2014

@mahkoh no problem according to the rules; almost everyone, the current Rust version included, reads stdin into a buffer until EOF.

mahkoh (Contributor, author) commented Oct 24, 2014

TeXitoi (Contributor) commented Oct 24, 2014

I think that's because they were stricter about this rule some time ago (and this version was proposed in 2010, as you can see in the comment).

alexcrichton (Member):

@mahkoh could you squash the commits down into one? I think that the failure on android may be spurious otherwise.

mahkoh (Contributor, author) commented Oct 24, 2014

The old file contained a line about ignoring the android tests. Rebased and restored.

bors added a commit that referenced this pull request Oct 25, 2014
Lots of unsafe code and lots of branches removed. Also multithreaded.

Rust old: 1.208 seconds
Rust new: 0.761 seconds
C: 0.632 seconds
bors added a commit that referenced this pull request Oct 26, 2014
Lots of unsafe code and lots of branches removed. Also multithreaded.

Rust old: 1.208 seconds
Rust new: 0.761 seconds
C: 0.632 seconds
@bors bors closed this Oct 26, 2014