Dramatic slowdown in rust performance from the serialization benchmarks #19281

Closed
erickt opened this issue Nov 24, 2014 · 2 comments · Fixed by #19574
Labels
I-slow: Problems and improvements with respect to performance of generated code.
regression-from-stable-to-nightly: Performance or correctness regression from stable to nightly.

Comments

erickt (Contributor) commented Nov 24, 2014

I've been doing a lot of benchmarking recently (1, 2, 3), and I've seen a pretty dramatic drop in performance over the past couple of weeks. Some of it might be explained by upgrading from OS X Mavericks to Yosemite, but I still saw a 40% drop in performance between 2014-11-13 and 2014-11-23. I haven't been able to dig into what's going on yet, but I did notice that our current implementation of Writer for Vec<u8>:

impl Writer for Vec<u8> {
    #[inline]
    fn write(&mut self, buf: &[u8]) -> IoResult<()> {
        self.push_all(buf);
        Ok(())
    }
}

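That write path boils down to push_all, which has since been removed from Vec. Roughly speaking (this is a sketch from memory, not the 2014 source), it reserved room once and then appended each element, much like today's extend_from_slice:

// Rough sketch of what `push_all` did for a byte vector: grow the capacity
// once, then append the new bytes. (An approximation, not the 2014 source;
// `Vec::extend_from_slice` is the closest equivalent in today's std.)
fn push_all_sketch(dst: &mut Vec<u8>, src: &[u8]) {
    dst.reserve(src.len());
    for &byte in src {
        dst.push(byte);
    }
}

fn main() {
    let mut v = vec![0u8];
    push_all_sketch(&mut v, &[1, 2, 3]);
    assert_eq!(v, [0, 1, 2, 3]);
}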

#[bench]
fn bench_std_vec_writer(b: &mut test::Bencher) {
    let mut dst = Vec::with_capacity(BATCHES * SRC_LEN);
    let src = &[1, .. SRC_LEN];

    b.iter(|| {
        dst.clear();

        do_std_writes(&mut dst, src, BATCHES);
    })
}

The Vec<u8> write impl does not appear to be inlining well for some reason; the benchmark above gives:

test writer::bench_std_vec_writer                           ... bench: 1000 | [----*****#*****--------]             | 2000:      1248 ns/iter (+/- 588)
test writer::bench_std_vec_writer_inline_always             ... bench: 900 |   [----*#***--]                        | 2000:      1125 ns/iter (+/- 282)
test writer::bench_std_vec_writer_inline_never              ... bench: 1000 |  [----***#*****--------]              | 2000:      1227 ns/iter (+/- 516)
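
The do_std_writes driver isn't shown here; as a rough sketch of what that kind of loop looks like in current Rust (using std::io::Write rather than the 2014 Writer trait, with BATCHES and SRC_LEN values chosen arbitrarily), it is just the same small slice written into the destination over and over:

use std::io::Write;

const BATCHES: usize = 128; // assumed value, not taken from the issue
const SRC_LEN: usize = 8;   // assumed value, not taken from the issue

// Hypothetical reconstruction of the benchmark driver: push the same small
// slice into the writer BATCHES times, so the per-write overhead (and whether
// `write` gets inlined) dominates the measurement.
fn do_std_writes<W: Write>(dst: &mut W, src: &[u8], batches: usize) {
    for _ in 0..batches {
        dst.write_all(src).unwrap();
    }
}

fn main() {
    let mut dst: Vec<u8> = Vec::with_capacity(BATCHES * SRC_LEN);
    let src = [1u8; SRC_LEN];
    do_std_writes(&mut dst, &src, BATCHES);
    assert_eq!(dst.len(), BATCHES * SRC_LEN);
}

Presumably the _inline_always and _inline_never variants differ only in the #[inline] attribute applied to write; the loop itself stays the same.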

Rewriting to this makes it 10 times faster (and yes, I realize I'm not updating the length of the Vec<u8>. Could that be a problem?):

struct VecWriter1<'a> {
    dst: &'a mut Vec<u8>,
}

impl<'a> MyWriter for VecWriter1<'a> {
    #[inline]
    fn my_write(&mut self, src: &[u8]) -> IoResult<()> {
        let src_len = src.len();

        self.dst.reserve(src_len);

        let dst = self.dst.as_mut_slice();

        unsafe {
            // we reserved enough room in `dst` to store `src`.
            ptr::copy_nonoverlapping_memory(
                dst.as_mut_ptr(),
                src.as_ptr(),
                src_len);
        }

        Ok(())
    }
}

with this performance:

test writer::bench_vec_writer_1                             ... bench: 100 |         [------*********#*****--------] | 200:       160 ns/iter (+/- 68)
test writer::bench_vec_writer_1_inline_always               ... bench: 100 |     [--------****#**--]                 | 300:       182 ns/iter (+/- 79)
test writer::bench_vec_writer_1_inline_never                ... bench: 600 |   [---****#**--]                       | 2000:       952 ns/iter (+/- 399)

Furthermore, commenting out the self.dst.reserve(src_len) call made it just as fast as BufWriter and as using the unsafe ptr::copy_nonoverlapping_memory directly.
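
For reference, ptr::copy_nonoverlapping_memory is the pre-1.0 name of today's std::ptr::copy_nonoverlapping. A version of the same fast path in current Rust, filling in the two details the snippet above deliberately skips (copying to the end of the existing contents and updating the length), might look roughly like this:

use std::ptr;

// Sketch of the reserve-then-raw-copy fast path in current Rust. Unlike the
// snippet above, it copies to the end of the existing contents and then
// updates the length.
fn my_write(dst: &mut Vec<u8>, src: &[u8]) {
    let src_len = src.len();
    dst.reserve(src_len);
    let old_len = dst.len();
    unsafe {
        // Sound because `reserve` guarantees room for `src_len` more bytes
        // and `src` cannot overlap the freshly reserved spare capacity.
        ptr::copy_nonoverlapping(src.as_ptr(), dst.as_mut_ptr().add(old_len), src_len);
        dst.set_len(old_len + src_len);
    }
}

fn main() {
    let mut v = Vec::new();
    my_write(&mut v, b"hello world");
    assert_eq!(v, b"hello world".to_vec());
}

In today's standard library, Vec::extend_from_slice does exactly this, reserve included.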

@emberian added the I-slow and regression-from-stable-to-nightly labels on Nov 24, 2014
erickt (Author) commented Dec 2, 2014

It looks like this is coming from cf3b2e4, where Vec::reserve* was rewritten. It appears that the old Vec::reserve_additional behavior was about 2-10 times faster according to this micro-benchmark.
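
The linked micro-benchmark isn't reproduced in the comment, but the hot loop it exercises is easy to sketch: call reserve before every small append so that the capacity check dominates. A rough stand-in in current Rust (using std::time::Instant instead of the old unstable test::Bencher, with an arbitrary iteration count) would be:

use std::time::Instant;

fn main() {
    // Arbitrary sizes, not taken from the linked benchmark.
    const ITERS: usize = 1_000_000;
    let src = [1u8; 8];

    let start = Instant::now();
    let mut dst: Vec<u8> = Vec::new();
    for _ in 0..ITERS {
        // The call under suspicion: `reserve` runs before every small append,
        // just as the Vec<u8> Writer impl does through push_all.
        dst.reserve(src.len());
        dst.extend_from_slice(&src);
    }
    println!("wrote {} bytes in {:?}", dst.len(), start.elapsed());
}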

erickt (Author) commented Dec 2, 2014

This is definitely getting out of my area of expertise, but apparently switching Vec::reserve from this:

    pub fn reserve(&mut self, additional: uint) {
        if self.cap - self.len < additional {
            match self.len.checked_add(additional) {
                None => panic!("Vec::reserve: `uint` overflow"),
                Some(new_cap) => {
                    let amort_cap = new_cap.next_power_of_two();
                    // next_power_of_two will overflow to exactly 0 for really big capacities
                    if amort_cap == 0 {
                        self.grow_capacity(new_cap);
                    } else {
                        self.grow_capacity(amort_cap);
                    }
                }
            }
        }
    }

To this:

    pub fn reserve(&mut self, additional: uint) {
        if self.cap - self.len < additional {
            match self.len.checked_add(additional) {
                None => panic!("Vec::reserve: `uint` overflow"),
                Some(new_cap) => {
                    let amort_cap = new_cap.next_power_of_two();
                    // next_power_of_two will overflow to exactly 0 for really big capacities
                    let capacity = if amort_cap == 0 { new_cap } else { amort_cap };
                    self.grow_capacity(capacity)
                }
            }
        }
    }

is sufficient to get my micro-benchmark's Vec::write to perform like it did a week ago. I don't know enough about low-level optimization to know why this seemingly inconsequential change would have such an impact on performance. @pcwalton or @thestinger, you're pretty familiar with low-level stuff; do you know what could be going on here?
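
For what it's worth, the only structural difference between the two is that the slower version calls the growth routine from two separate branches, while the faster one computes the capacity first and has a single call site. A side-by-side sketch of the two shapes in current Rust, with the private grow_capacity replaced by a hypothetical grow function and uint written as usize:

// Hypothetical stand-in for Vec's private `grow_capacity`.
fn grow(_new_cap: usize) {
    // reallocate the buffer so it can hold at least `_new_cap` elements
}

// Shape of the slower version: the growth routine is called from two
// separate branches.
fn reserve_two_calls(len: usize, cap: usize, additional: usize) {
    if cap - len < additional {
        let new_cap = len.checked_add(additional).expect("`usize` overflow");
        // The 2014 code relied on next_power_of_two wrapping to 0 on overflow
        // (release-mode behavior in current Rust).
        let amort_cap = new_cap.next_power_of_two();
        if amort_cap == 0 {
            grow(new_cap);
        } else {
            grow(amort_cap);
        }
    }
}

// Shape of the faster version: pick the capacity first, then a single call.
fn reserve_one_call(len: usize, cap: usize, additional: usize) {
    if cap - len < additional {
        let new_cap = len.checked_add(additional).expect("`usize` overflow");
        let amort_cap = new_cap.next_power_of_two();
        let capacity = if amort_cap == 0 { new_cap } else { amort_cap };
        grow(capacity);
    }
}

fn main() {
    reserve_two_calls(0, 0, 16);
    reserve_one_call(0, 0, 16);
}

Whether that single call site is actually what LLVM keys on is left open in this thread.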

erickt added a commit to erickt/rust that referenced this issue Dec 5, 2014
Somehow LLVM is able to optimize this version of Vec::reserve
to be dramatically faster than the old version. In micro-benchmarks
this was 2-10 times faster. It also shaved 14 minutes off of
Rust's compile times.

Closes rust-lang#19281.
bors added a commit that referenced this issue Dec 8, 2014
(I don't understand why this works, and so I don't quite trust it yet. I'm pushing it up to see if anyone else can replicate this performance increase.)

Somehow LLVM is able to optimize this version of Vec::reserve to be dramatically faster than the old version. In micro-benchmarks this was 2-10 times faster. It also reduced my Rust compile time from 41 minutes to 27 minutes.

Closes #19281.