Optimize Vec when cloning from a slice #15471
Conversation
LLVM is currently not able to convert `Vec::extend` into a memcpy for `Copy` types, which results in methods like `Vec::push_all` running twice as slow as they should. This patch applies the unsafe `Vec::clone` optimization to speed up all the operations that clone a slice into a `Vec`.

before:

```
test vec::tests::bench_clone_from_0000_0000 ... bench: 12 ns/iter (+/- 2)
test vec::tests::bench_clone_from_0000_0010 ... bench: 125 ns/iter (+/- 4) = 80 MB/s
test vec::tests::bench_clone_from_0000_0100 ... bench: 360 ns/iter (+/- 33) = 277 MB/s
test vec::tests::bench_clone_from_0000_1000 ... bench: 2601 ns/iter (+/- 175) = 384 MB/s
test vec::tests::bench_clone_from_0010_0000 ... bench: 12 ns/iter (+/- 2)
test vec::tests::bench_clone_from_0010_0010 ... bench: 125 ns/iter (+/- 10) = 80 MB/s
test vec::tests::bench_clone_from_0010_0100 ... bench: 361 ns/iter (+/- 28) = 277 MB/s
test vec::tests::bench_clone_from_0100_0010 ... bench: 131 ns/iter (+/- 13) = 76 MB/s
test vec::tests::bench_clone_from_0100_0100 ... bench: 360 ns/iter (+/- 9) = 277 MB/s
test vec::tests::bench_clone_from_0100_1000 ... bench: 2575 ns/iter (+/- 168) = 388 MB/s
test vec::tests::bench_clone_from_1000_0100 ... bench: 356 ns/iter (+/- 20) = 280 MB/s
test vec::tests::bench_clone_from_1000_1000 ... bench: 2605 ns/iter (+/- 167) = 383 MB/s
test vec::tests::bench_from_slice_0000     ... bench: 11 ns/iter (+/- 0)
test vec::tests::bench_from_slice_0010     ... bench: 115 ns/iter (+/- 5) = 86 MB/s
test vec::tests::bench_from_slice_0100     ... bench: 309 ns/iter (+/- 170) = 323 MB/s
test vec::tests::bench_from_slice_1000     ... bench: 2065 ns/iter (+/- 198) = 484 MB/s
test vec::tests::bench_push_all_0000_0000  ... bench: 7 ns/iter (+/- 0)
test vec::tests::bench_push_all_0000_0010  ... bench: 79 ns/iter (+/- 7) = 126 MB/s
test vec::tests::bench_push_all_0000_0100  ... bench: 342 ns/iter (+/- 18) = 292 MB/s
test vec::tests::bench_push_all_0000_1000  ... bench: 2873 ns/iter (+/- 75) = 348 MB/s
test vec::tests::bench_push_all_0010_0010  ... bench: 154 ns/iter (+/- 8) = 64 MB/s
test vec::tests::bench_push_all_0100_0100  ... bench: 518 ns/iter (+/- 18) = 193 MB/s
test vec::tests::bench_push_all_1000_1000  ... bench: 4490 ns/iter (+/- 223) = 222 MB/s
```

after:

```
test vec::tests::bench_clone_from_0000_0000 ... bench: 12 ns/iter (+/- 1)
test vec::tests::bench_clone_from_0000_0010 ... bench: 123 ns/iter (+/- 5) = 81 MB/s
test vec::tests::bench_clone_from_0000_0100 ... bench: 367 ns/iter (+/- 23) = 272 MB/s
test vec::tests::bench_clone_from_0000_1000 ... bench: 2618 ns/iter (+/- 252) = 381 MB/s
test vec::tests::bench_clone_from_0010_0000 ... bench: 12 ns/iter (+/- 1)
test vec::tests::bench_clone_from_0010_0010 ... bench: 124 ns/iter (+/- 7) = 80 MB/s
test vec::tests::bench_clone_from_0010_0100 ... bench: 369 ns/iter (+/- 34) = 271 MB/s
test vec::tests::bench_clone_from_0100_0010 ... bench: 123 ns/iter (+/- 6) = 81 MB/s
test vec::tests::bench_clone_from_0100_0100 ... bench: 371 ns/iter (+/- 25) = 269 MB/s
test vec::tests::bench_clone_from_0100_1000 ... bench: 2713 ns/iter (+/- 532) = 368 MB/s
test vec::tests::bench_clone_from_1000_0100 ... bench: 369 ns/iter (+/- 14) = 271 MB/s
test vec::tests::bench_clone_from_1000_1000 ... bench: 2611 ns/iter (+/- 194) = 382 MB/s
test vec::tests::bench_from_slice_0000     ... bench: 7 ns/iter (+/- 0)
test vec::tests::bench_from_slice_0010     ... bench: 108 ns/iter (+/- 4) = 92 MB/s
test vec::tests::bench_from_slice_0100     ... bench: 235 ns/iter (+/- 24) = 425 MB/s
test vec::tests::bench_from_slice_1000     ... bench: 1318 ns/iter (+/- 96) = 758 MB/s
test vec::tests::bench_push_all_0000_0000  ... bench: 7 ns/iter (+/- 0)
test vec::tests::bench_push_all_0000_0010  ... bench: 70 ns/iter (+/- 4) = 142 MB/s
test vec::tests::bench_push_all_0000_0100  ... bench: 176 ns/iter (+/- 16) = 568 MB/s
test vec::tests::bench_push_all_0000_1000  ... bench: 1125 ns/iter (+/- 94) = 888 MB/s
test vec::tests::bench_push_all_0010_0010  ... bench: 159 ns/iter (+/- 15) = 62 MB/s
test vec::tests::bench_push_all_0100_0100  ... bench: 363 ns/iter (+/- 12) = 275 MB/s
test vec::tests::bench_push_all_1000_1000  ... bench: 2860 ns/iter (+/- 415) = 349 MB/s
```

This also includes extra benchmarks for `Vec` and `MemWriter`.
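For reference, a minimal sketch of the approach described above, written in present-day Rust rather than against the 2014-era `Vec` internals the patch actually touches (`reserve_additional`, `unsafe_push_all_clone`); the function name here is illustrative only. The idea is to clone each element straight into the reserved spare capacity with `ptr::write` and advance the length as you go, rather than letting `extend` push element by element.

```rust
use std::ptr;

// Hypothetical sketch, not the patch itself: clone a slice into a Vec by
// writing into the spare capacity directly, the pattern the PR describes.
fn push_all_clone<T: Clone>(dst: &mut Vec<T>, src: &[T]) {
    dst.reserve(src.len());
    let mut len = dst.len();
    for elem in src {
        unsafe {
            // Write the clone directly into the uninitialized tail; for
            // Copy element types the goal is for LLVM to collapse this
            // loop into a memcpy.
            ptr::write(dst.as_mut_ptr().add(len), elem.clone());
            len += 1;
            // Commit the new length after every element so the Vec stays
            // consistent if a later clone() panics.
            dst.set_len(len);
        }
    }
}
```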
```rust
dst_len += 1;
dst.set_len(dst_len);
}
}
```
Instead of taking `src: &[T]`, does this optimize as well with `src: Iterator<T>`?
I just prototyped this out, and while most functions performed the same, for some reason this version of `Vec::push_all` is halfway between the original implementation and my initial `&[T]` implementation. Sometimes I just don't understand microbenchmarks. I'll investigate it more tomorrow.
```
test vec::tests::bench_clone_from_0000_0000 ... bench: 9 ns/iter (+/- 0)
test vec::tests::bench_clone_from_0000_0010 ... bench: 128 ns/iter (+/- 12) = 78 MB/s
test vec::tests::bench_clone_from_0000_0100 ... bench: 379 ns/iter (+/- 14) = 263 MB/s
test vec::tests::bench_clone_from_0000_1000 ... bench: 2624 ns/iter (+/- 187) = 381 MB/s
test vec::tests::bench_clone_from_0010_0000 ... bench: 9 ns/iter (+/- 1)
test vec::tests::bench_clone_from_0010_0010 ... bench: 127 ns/iter (+/- 9) = 78 MB/s
test vec::tests::bench_clone_from_0010_0100 ... bench: 386 ns/iter (+/- 26) = 259 MB/s
test vec::tests::bench_clone_from_0100_0010 ... bench: 129 ns/iter (+/- 6) = 77 MB/s
test vec::tests::bench_clone_from_0100_0100 ... bench: 374 ns/iter (+/- 19) = 267 MB/s
test vec::tests::bench_clone_from_0100_1000 ... bench: 2696 ns/iter (+/- 164) = 370 MB/s
test vec::tests::bench_clone_from_1000_0100 ... bench: 384 ns/iter (+/- 27) = 260 MB/s
test vec::tests::bench_clone_from_1000_1000 ... bench: 2638 ns/iter (+/- 75) = 379 MB/s
test vec::tests::bench_extend_0000_0000 ... bench: 9 ns/iter (+/- 0)
test vec::tests::bench_extend_0000_0010 ... bench: 129 ns/iter (+/- 6) = 77 MB/s
test vec::tests::bench_extend_0000_0100 ... bench: 418 ns/iter (+/- 23) = 239 MB/s
test vec::tests::bench_extend_0000_1000 ... bench: 3108 ns/iter (+/- 296) = 321 MB/s
test vec::tests::bench_extend_0010_0010 ... bench: 209 ns/iter (+/- 9) = 47 MB/s
test vec::tests::bench_extend_0100_0100 ... bench: 616 ns/iter (+/- 69) = 162 MB/s
test vec::tests::bench_extend_1000_1000 ... bench: 4891 ns/iter (+/- 243) = 204 MB/s
test vec::tests::bench_from_slice_0000 ... bench: 4 ns/iter (+/- 0)
test vec::tests::bench_from_slice_0010 ... bench: 109 ns/iter (+/- 9) = 91 MB/s
test vec::tests::bench_from_slice_0100 ... bench: 233 ns/iter (+/- 11) = 429 MB/s
test vec::tests::bench_from_slice_1000 ... bench: 1331 ns/iter (+/- 53) = 751 MB/s
test vec::tests::bench_push_all_0000_0000 ... bench: 4 ns/iter (+/- 1)
test vec::tests::bench_push_all_0000_0010 ... bench: 64 ns/iter (+/- 4) = 156 MB/s
test vec::tests::bench_push_all_0000_0100 ... bench: 230 ns/iter (+/- 21) = 434 MB/s
test vec::tests::bench_push_all_0000_1000 ... bench: 1777 ns/iter (+/- 248) = 562 MB/s
test vec::tests::bench_push_all_0010_0010 ... bench: 140 ns/iter (+/- 3) = 71 MB/s
test vec::tests::bench_push_all_0100_0100 ... bench: 407 ns/iter (+/- 18) = 245 MB/s
test vec::tests::bench_push_all_1000_1000 ... bench: 3773 ns/iter (+/- 661) = 265 MB/s
```
So I figured out an internal loop that lets LLVM optimize an iterator down into my fast version, but unfortunately it's not drop safe:
```rust
fn do_clone_hack_external_iter2<'a, Iter: Iterator<&'a u8>>(dst: &mut Vec<u8>, mut src: Iter) {
    let mut dst_len = dst.len();
    for x in src {
        let x = x.clone();
        unsafe {
            // write each clone directly into the already-reserved capacity
            ptr::write(dst.as_mut_slice().unsafe_mut_ref(dst_len), x);
            dst_len += 1;
        }
    }
    // the length is committed only once, after the loop; this is what lets
    // LLVM produce the fast version, but if a clone() panics mid-loop the
    // elements already written are never dropped (the drop-safety problem)
    unsafe {
        dst.set_len(dst_len);
    }
}
```
Setting the vec length from inside that hot loop prevents this optimization from happening. I've put my experiments in this gist if anyone else wants to play with it: https://gist.github.com/erickt/7a3fd8689159e69ed74c
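For contrast, a hypothetical drop-safe shape of that iterator loop (in present-day syntax, not code from the gist) commits the length after every element; per the comment above, doing the length update inside the hot loop is exactly what blocks the memcpy-style codegen for the iterator case.

```rust
use std::ptr;

// Hypothetical drop-safe variant: the length is bumped after every write, so
// a panicking clone() can never leave written elements unaccounted for, at
// the cost of a per-iteration set_len inside the hot loop.
fn do_clone_drop_safe<'a, I: Iterator<Item = &'a u8>>(dst: &mut Vec<u8>, src: I) {
    for x in src {
        dst.reserve(1); // re-check capacity each round (the gist reserves up front instead)
        unsafe {
            let len = dst.len();
            ptr::write(dst.as_mut_ptr().add(len), *x);
            dst.set_len(len + 1);
        }
    }
}
```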
That is unfortunate! It's a shame that this optimization can't apply to patterns such as `extend` or `collect`, but I suppose we'll just have to wait to teach LLVM a little more about aliasing!
This changes `Vec::from_slice` to call `unsafe_push_all_clone` directly to avoid an unnecessary `reserve_additional` call.
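Roughly, and assuming `from_slice` allocates with `Vec::with_capacity` (an assumption; the name below is illustrative, not the patched code), that fast path amounts to: allocate the exact capacity up front, then run the same clone-into-place loop, so no separate capacity check is needed.

```rust
use std::ptr;

// Hypothetical sketch of a from_slice-style constructor: exact allocation
// once, then clone straight into the uninitialized buffer.
fn from_slice_sketch<T: Clone>(src: &[T]) -> Vec<T> {
    let mut dst = Vec::with_capacity(src.len());
    for (i, elem) in src.iter().enumerate() {
        unsafe {
            ptr::write(dst.as_mut_ptr().add(i), elem.clone());
            dst.set_len(i + 1); // keep the Vec consistent if a later clone() panics
        }
    }
    dst
}
```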