Improve shootout-reverse-complement #18143

Conversation
```rust
if seq.len() % 2 == 1 {
    let middle = &mut seq[seq.len() / 2];
    *middle = complements[*middle as uint];
```

```rust
/// Returns a mutable slice into the contained vector.
```
This is extremely unsafe.
For some reason LLVM produces very bad code when doing pointer arithmetic. The commented-out variant in the following snippet is much faster than the normal variant:

```rust
unsafe {
    let mut left = seq.as_mut_ptr();
    let mut right = left.offset(len as int);
    while left < right {
        //asm!("dec $0" : "+r"(right));
        right = right.offset(-1);
        let tmp = COMPLEMENTS[*left as uint];
        *left = COMPLEMENTS[*right as uint];
        *right = tmp;
        //asm!("inc $0" : "+r"(left));
        left = left.offset(1);
    }
}
```
Lo and behold:

```
~/rust/reverse-complement$ time ./rust < /tmp/input.txt > /dev/null
~/rust/reverse-complement$ time ./c < /tmp/input.txt > /dev/null
~/rust/reverse-complement$ time ./cxx < /tmp/input.txt > /dev/null
```
```rust
use std::sync::{Arc, Future};
use std::mem::{transmute};
use std::raw::{Slice};
use std::num::{div_rem};
```
The convention is to import the module and call like `num::div_rem`, especially if they're only called once like these are. (The old code wasn't so idiomatic.)
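To make the suggested style concrete, here is a minimal sketch; `std::num::div_rem` is gone from modern Rust, so `std::cmp::min` stands in to show the same shape:

```rust
// Import the module, not the item, so each call site reads
// module::function and stays self-documenting.
use std::cmp;

fn main() {
    let smaller = cmp::min(3, 7);
    println!("{}", smaller);
}
```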
What is the granularity with which tasks are spawned? (It's either per-line, or per
This should probably be added to the list in #18085.

@killercup Done
```rust
/// Lookup tables.
static mut CPL16: [u16, ..1 << 16] = [0, ..1 << 16];
static mut CPL8: [u8, ..1 << 8] = [0, ..1 << 8];
```
It would be nice for the benchmarks to show off idiomatic Rust code rather than copying C into Rust, and usage of `static mut` globals isn't necessarily idiomatic. Does this take a significant perf hit compared to calculating the tables and putting them into an `Arc`?
There is no performance hit, but only because the tables are so small.

```rust
fn gen_tables() -> Arc<Tables> {
    let mut t: Tables = unsafe { std::mem::uninitialized() };
    // ...
    Arc::new(t)
}
```

This first puts the table on the stack, then copies it to another place on the stack, and then copies it to the heap. I don't really want this kind of wasteful code in the benchmark.
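For contrast, a minimal sketch in modern Rust (the function name and the reduced base set are my own, not the PR's) of filling the table in heap memory first, so nothing is staged on the stack:

```rust
use std::sync::Arc;

fn gen_table() -> Arc<[u8]> {
    // The Vec's buffer lives on the heap from the start, so the
    // 256-byte table is never built on the stack.
    let mut t = vec![0u8; 256];
    for (i, slot) in t.iter_mut().enumerate() {
        *slot = match i as u8 {
            b'A' | b'a' => b'T',
            b'T' | b't' => b'A',
            b'C' | b'c' => b'G',
            b'G' | b'g' => b'C',
            c => c, // pass every other byte through unchanged
        };
    }
    // Arc::from copies the bytes once into the shared allocation:
    // still one copy, but no stack staging and no uninitialized memory.
    Arc::from(t)
}

fn main() {
    let table = gen_table();
    assert_eq!(table[b'a' as usize], b'T');
}
```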
The vast quantities of unsafe code here stand out. Would it be possible to elide the same number of bounds checks and such, but by writing this benchmark more safely? One could almost always just translate straight C to Rust, unsafety and all, but that seems somewhat counter to the spirit of the benchmarking game between languages.
Unfortunately I found a case (which does not appear in the benchmark) in which the faster algorithm won't work. You can still see that algorithm in the previous commit. One big problem with the current code is that the assembly generated by Rust and LLVM is not very good. For example, replacing

```rust
while *cur != b'>' {
    cur = cur.offset(-1);
}
```

by

```rust
asm!("
    .align 16, 0x90
XOAEUSTNH:
    dec $0
    cmpb $$62, ($0)
    jne XOAEUSTNH" : "+r"(cur));
```

gives you a massive speedup. Clang does generate this kind of code but rustc does not.
I'm pretty annoyed by the compiler actively deoptimizing my code.
Here is another case where rustc goes crazy:

```rust
let mut end = data.as_ptr().offset(data.len() as int - 1);
let mut cur = end;
loop {
    if *cur != b'>' {
        // asm!("");
        cur = std::intrinsics::offset(cur, -1);
        continue;
    }
    // ...
}
```

Uncommenting the empty asm! will give you a nice speedup in the generated code.
cc @zwarich @dotdash - any ideas what is going on here?
```rust
/// Generates the tables.
fn gen_tables() {
    unsafe {
```
That's a lot of unsafe code for a function that will be used only once, for a very small amount of time. I personally find that `make_complements()` in the old version is much simpler, but I may be biased as I wrote it ;-).

What is the impact of using a 64k lookup table? Is it your own idea, or did any other shootout benchmark use it? And thanks for working on this benchmark!
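For readers unfamiliar with the trick: a 64k table maps each byte pair to the pair's complements in swapped order, so the reverse-complement loop handles two bytes per lookup. A hedged sketch in modern Rust (names are hypothetical and the base set is simplified; the PR's CPL16 differs in detail):

```rust
/// Complement one nucleotide code (simplified to four bases here).
fn cpl8(c: u8) -> u8 {
    match c {
        b'A' => b'T',
        b'T' => b'A',
        b'C' => b'G',
        b'G' => b'C',
        other => other,
    }
}

/// Entry i, read little-endian as two bytes [lo, hi], holds
/// [cpl(hi), cpl(lo)]: one 16-bit load, one lookup, and one store
/// both complement and reverse a byte pair.
fn build_cpl16() -> Vec<u16> {
    (0..=u16::MAX)
        .map(|pair| {
            let [lo, hi] = pair.to_le_bytes();
            u16::from_le_bytes([cpl8(hi), cpl8(lo)])
        })
        .collect()
}

fn main() {
    let t = build_cpl16();
    let pair = u16::from_le_bytes([b'A', b'C']);
    assert_eq!(t[pair as usize].to_le_bytes(), [b'G', b'T']);
}
```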
The C++ code does the same thing. Our reverse code is significantly faster than the C code.

We use fewer instructions than the C code and have fewer branches, but more branch misses and more page faults.

Maybe that's because of

This code is actually faster than the C program when compiled with Clang. But the C++ program is still faster.
I've made a small change in the last version where I preallocate 300MB for the input instead of gradually increasing the capacity. The performance impact is massive. Not only is this version much faster than both the C and C++ versions, even reallocating once will make it slower than the C version again. Is this a problem with jemalloc? cc @thestinger
```
~/rust/reverse-complement$ time ./rust < ./input.txt > /dev/null
```
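The preallocation itself is one line; a minimal modern-Rust sketch of the idea (the 300MB figure comes from the comment above, everything else is assumption):

```rust
use std::io::Read;

fn main() -> std::io::Result<()> {
    // Reserve the whole input size up front: read_to_end never has to
    // grow the Vec, so no reallocation (and no copy) ever happens.
    let mut input = Vec::with_capacity(300 * 1024 * 1024);
    std::io::stdin().lock().read_to_end(&mut input)?;
    eprintln!("read {} bytes", input.len());
    Ok(())
}
```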
I guess that is with clang 3.4? Rust code:

```rust
#![crate_type="lib"]

extern {
    fn black(_: *const u8);
}

pub unsafe fn test() {
    let data = std::io::stdin().read_to_end().unwrap();
    let mut end = data.as_ptr().offset(data.len() as int - 1);
    let mut cur = end;
    loop {
        while *cur != b'>' {
            cur = cur.offset(-1);
        }
        black(cur);
    }
}
```

Resulting loop asm:

```asm
.LBB0_64:
    movq    %rbx, %rdi
    callq   black@PLT
.LBB0_65:
    movzbl  (%rbx), %eax
    cmpl    $62, %eax
    je      .LBB0_64
    decq    %rbx
    jmp     .LBB0_65
```

C code:

```c
void black(char *);
long len;
char *data;

void test(void) {
    char *end = data + len;
    char *cur = end;
    for (;;) {
        while (*cur != '>')
            cur = cur - 1;
        black(cur);
    }
}
```

With clang-3.5:

```asm
.LBB0_2:                # %._crit_edge
                        #   in Loop: Header=BB0_1 Depth=1
    decq    %rbx
.LBB0_1:                # %.outer
                        # =>This Inner Loop Header: Depth=1
    movzbl  (%rbx), %eax
    cmpl    $62, %eax
    jne     .LBB0_2
# BB#3:                 # %.lr.ph
                        #   in Loop: Header=BB0_1 Depth=1
    movq    %rbx, %rdi
    callq   black
    jmp     .LBB0_1
```

With clang-3.4:

```asm
.LBB0_2:                # %._crit_edge
                        #   in Loop: Header=BB0_1 Depth=1
    decq    %rbx
.LBB0_1:                # %.outer
                        # =>This Inner Loop Header: Depth=1
    cmpb    $62, (%rbx)
    jne     .LBB0_2
# BB#3:                 # %.lr.ph
                        #   in Loop: Header=BB0_1 Depth=1
    movq    %rbx, %rdi
    callq   black
    jmp     .LBB0_1
```
I think the last change is a little too hackish. But it's a matter of taste.

Which part?

The shared memory code.
This structure provides a safe way to share disjoint parts of a vector mutably across tasks. I would actually expect something like this in the stdlib.
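The stdlib did eventually grow the safe version of this pattern: `chunks_mut` yields disjoint mutable slices, and scoped threads (stable since Rust 1.63, long after this PR) let them cross thread boundaries. A sketch:

```rust
fn main() {
    let mut v: Vec<u32> = (0..16).collect();
    // Each chunk is a disjoint &mut [u32]; the borrow checker knows
    // the chunks cannot alias, so handing them to threads is safe.
    std::thread::scope(|s| {
        for chunk in v.chunks_mut(4) {
            s.spawn(move || {
                for x in chunk.iter_mut() {
                    *x *= 2;
                }
            });
        }
    });
    println!("{:?}", v);
}
```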
I don't think it makes sense to call the libc allocator directly. It's not going to accomplish anything once the prefix issue is fixed anyway... You're really just falling through to mremap with a large amount of allocator overhead.
@thestinger: Allocating the memory with a 2x growth factor with jemalloc already takes 0.45 seconds on my machine. With glibc's allocator the whole program only takes 0.39 seconds (0.25 seconds for the allocation). Is there a way to replace the default allocator with the libc allocator so that I can use Vec?
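Modern Rust answers this question directly with `#[global_allocator]` (an API that postdates this PR by years; shown as a hedged aside, not as what was available here):

```rust
use std::alloc::System;

// Route every allocation, Vec's included, through the system (libc)
// allocator instead of the default global allocator.
#[global_allocator]
static GLOBAL: System = System;

fn main() {
    let v: Vec<u8> = Vec::with_capacity(1024);
    println!("capacity: {}", v.capacity());
}
```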
@mahkoh: As I've explained, the issue is specific to Linux where there is an

Any code calling
@thestinger: The benchmarks game runs on Linux, and I think it makes sense to optimize for one platform in this case. This is the reason we are faster than C and C++ right now. Once the performance issue has been fixed on the distros on which the benchmarks run, I'll gladly replace the manual allocations with a Vec.
You don't seem to be reading my comments. To spell it out clearly:

As long as it contains this workaround, it's going to get an r- from me.
If you can fix this for all Rust programs, then there is no reason for me to use a workaround. I already said this above. This is just a benchmark, and I think there is no reason not to use a workaround for the moment. I already asked for a way to replace the default allocator so that we can quickly switch this program back to jemalloc when it's ready. Anyway, a growth factor of 16 makes jemalloc faster than glibc and doesn't use more memory for the largest benchmark. I'll use this for now.
By default, jemalloc replaces the libc allocator, but Rust adds a
Updated with LibcBox replaced by Vec. The performance might be a little bit worse than before, but we're still good.
I don't think this will get much faster at this point. There are two low-hanging fruits left:

Suggestions on how to fix these issues without giving up speed are welcome (except for the tables module).
I propose an alternative to your shared memory code, using the parallel function adapted from the parallel one of another benchmark. What do you think?

```rust
#![feature(unboxed_closures, overloaded_calls)]

use std::mem;
use std::raw::Repr;

// Executes a closure in parallel over the given iterator over mutable
// slices. The closure `f` is run in parallel with an element of `iter`.
fn parallel<'a, I, T, F>(mut iter: I, f: F)
        where T: Send + Sync,
              I: Iterator<&'a mut [T]>,
              F: Fn(&'a mut [T]) + Sync {
    let (tx, rx) = channel();
    for chunk in iter {
        let tx = tx.clone();
        // Need to convert `f` and `chunk` to something that can cross
        // the task boundary.
        let f = &f as *const _ as *const uint;
        let raw = chunk.repr();
        spawn(proc() {
            let f = f as *const F;
            unsafe { (*f)(mem::transmute(raw)) }
            drop(tx)
        });
    }
    drop(tx);
    for () in rx.iter() {}
}

fn main() {
    let mut v = Vec::from_fn(50, |i| i);
    parallel(v.as_mut_slice().chunks_mut(10), |&: s: &mut [uint]| {
        let i = s[0];
        for j in s.iter_mut() { *j = i; }
    });
    println!("{}", v);
}
```
And using libc memchr seems simpler and faster, so I'd personally prefer using it.
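For reference, a hedged sketch of calling libc's memchr from Rust (modern syntax via the `libc` crate; not the PR's actual wrapper):

```rust
// Cargo.toml needs: libc = "0.2"
use libc::{c_int, c_void};

/// Index of the first `needle` in `haystack`, via libc's optimized memchr.
fn memchr(haystack: &[u8], needle: u8) -> Option<usize> {
    let p = unsafe {
        libc::memchr(
            haystack.as_ptr() as *const c_void,
            needle as c_int,
            haystack.len(),
        )
    };
    if p.is_null() {
        None
    } else {
        // Pointer difference from the start of the slice is the index.
        Some(p as usize - haystack.as_ptr() as usize)
    }
}

fn main() {
    assert_eq!(memchr(b">seq\nACGT\n", b'\n'), Some(4));
}
```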
Can't we [add, fix, optimize] some of those back into the Rust stdlib? Similar to what was done for `[].reverse()`?
I've replaced the memchr. I'm a bit worried about the rules of the benchmark.

We're definitely not reading line by line. (Although we could do that by hiding everything behind a large BufferedReader, but this would involve another copy.) Here is the currently fastest C code:

```c
size_t buflen = 1024, len, end = 0;
char *buf = malloc(1024);
int in = fileno(stdin);

while ((len = read(in, buf + end, buflen - 256 - end))) {
    end += len;
    if (end < buflen - 256) break;
    buf = realloc(buf, buflen *= 2);
}
```

That's not line by line either.
@mahkoh no problem according to the rules; almost everyone, current Rust version included, reads stdin into a buffer until EOF.
@TeXitoi: Do you know what the problem is with http://benchmarksgame.alioth.debian.org/u64q/program.php?test=revcomp&lang=java&id=1 ?
I think that's because they were stricter about this rule some time ago (and this version was proposed in 2010, as you can see in the comment).
@mahkoh could you squash the commits down into one? I think that the failure on Android may be spurious otherwise.
The old file contained a line about ignoring the Android tests. Rebased and restored.
Lots of unsafe code and lots of branches removed. Also multithreaded.
Rust old: 1.208 seconds
Rust new: 0.761 seconds
C: 0.632 seconds