Optimization: Defer recursion into pointees
#43658
Conversation
@nanosoldier
1 similar comment
@nanosoldier
Good idea, I like it! This sort of feels like the right way to do it to me, even if it's a tad slower in some cases. I wonder if it would help to ignore pointer fields completely in the first pass, instead of doing the NULL checks there.
Something went wrong when running your job:
Unfortunately, the logs could not be uploaded.
Yay, thanks! 🎉
Yeah, I had that idea too, and I could go either way. I chose to do it this way because it's possible the NULL/NULL check could also allow us to return early: if a struct has 2 pointer fields and the second one is NULL in one object but not the other, we can return `false` without recursing into the first pointer at all. So it's another instance of front-loading checking the bits in the struct before recursing to pointees. So I think it's probably better to do it this way, since it's making the same tradeoff as the rest of the PR. Does that make sense to you, too?
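To make that tradeoff concrete, here is a minimal Julia sketch of the variant described in this comment. This is a hypothetical model, not the runtime's C implementation: `Ref`s stand in for pointer fields, `nothing` stands in for NULL, pointees are assumed to be field tuples, and the names `isptr` and `egal_null_first` are made up for illustration.

```julia
# Hypothetical model: Refs stand in for pointer fields, `nothing` for NULL,
# and pointees are assumed to be field tuples.
isptr(x) = x isa Union{Base.RefValue, Nothing}

function egal_null_first(a::Tuple, b::Tuple)
    for (fa, fb) in zip(a, b)          # pass 1: scan all fields in order
        if isptr(fa)
            fa === fb && continue      # same pointer, or both NULL
            (fa === nothing || fb === nothing) && return false  # NULL mismatch: early exit
        elseif fa !== fb               # non-pointer field: bitwise comparison
            return false
        end
    end
    for (fa, fb) in zip(a, b)          # pass 2: only now recurse into pointees
        if isptr(fa) && fa !== fb
            egal_null_first(fa[], fb[]) || return false
        end
    end
    return true
end
```

With two pointer fields where the second is NULL in only one object, pass 1 returns `false` without ever following the first pointer.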
Anyone know why Nanosoldier failed to upload the benchmark results? Is it because the PR is a draft or something? I'll mark it ready for review now.
The log size has just gotten too big and no one has fixed it.
NULL fields are rare, so I think not worth special-casing. Actual timing is what matters though.
I get these results for your last benchmark (where the PR is slower):
before PR:
PR:
PR + ignore pointer fields on the first pass:
So there is a slight improvement (and looks like I need a new laptop again 😂)
Okay cool, makes sense! I think the "NULL fields are rare, so I think not worth special-casing" argument is strong here, so that makes sense. Thanks! 👍 👍 I can make that change. But first: is it possible to see the results of the nanosoldier run? Or are they lost to the mysteries of the universe? It would be nice to see if this proves to be good or bad, and I'd like to have these numbers before making any changes so we can compare. Thanks!
@vtjnash has to manually retrieve them.
Oops, sorry, I am behind on email. Here you go:
Thanks @vtjnash! What do the percentages mean? It looks like some got quite a bit better and some quite a bit worse, but in ways that don't exactly make sense to me... So I feel like this is probably mostly just noise? Can any of you with more experience reading these weigh in? I'll give your suggestion a shot now, @JeffBezanson.
@nanosoldier runbenchmarks(ALL, vs="@5449d1bfabdaeeb321c179a8344dc2852a989764") |
Nanosoldier (especially recently) is pretty noisy. |
@NHDaly, you need to code quote the part after the nanosoldier invocation.
@nanosoldier
Something went wrong when running your job:
Unfortunately, the logs could not be uploaded.
Oops, thanks @KristofferC. I did that right the first time (but also it didn't work then either... 🤔 maybe I don't have permissions or something). Thanks @oscardssmith and @KristofferC
It did work, it's just that there's a nanosoldier bug that means @vtjnash needs to post the log manually.
I don't know if this is the comparison you wanted, since there are a lot of unrelated commits in that command, but here you go: https://github.com/JuliaCI/NanosoldierReports/blob/master/benchmark/by_hash/18eef47_vs_5449d1b/report.md
@nanosoldier
For more impact, we may want to update codegen.cpp to do the same ordering change as here also for
Something went wrong when running your job:
Unfortunately, the logs could not be uploaded.
Still not what I meant to run though: @nanosoldier
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.
Oh, huh, maybe I don't exactly understand how Nanosoldier works. I think I was trying to compare the latest commit on this branch against the previous commit. Did that not do that?
Yeah, makes sense. Good idea! I don't really feel like I have the chops to do that... should it be done in a separate PR? Or could someone help with that here? Also, I really don't know what to make of these Nanosoldier results... Some things appear to get quite a bit better, and some quite a bit worse, but it's hard to tell what's just noise? :(
@NHDaly Can you rebase on the latest master? That should fix the
jl_egal Optimization: Defer recursion into pointees

Immutable struct comparisons with `===` can be arbitrarily expensive for deeply recursive but (almost) equal objects. Whenever possible, it's valuable to defer the potentially expensive recursion by first comparing the struct fields for bitwise equality.

Before this commit, two structs are compared elementwise, in the order of the struct definition, recursing when pointer fields are encountered. This commit defers the recursion into pointed-to fields until after all other non-pointer fields of the struct are compared.

This has two advantages:
1. It defers the expensive part of `===` comparison as long as possible, in the hopes that we can exit early from dissimilarities discovered elsewhere in the struct instances.
2. It improves cache locality by scanning the whole struct before jumping into any new places in memory (and reduces the comparisons needed on the current cache line after returning from the recursive call).

The drawback is that you'll have to scan the pointer fields again, which means potentially more cache misses if the struct is very large. The best way to tell if this is helpful or harmful is benchmarking.

Here is the motivating benchmark, which indeed improves by 10x with this commit, compared to master:

```julia
julia> using BenchmarkTools

julia> struct VN
           val::Float32
           next::Union{VN, Array}  # put a mutable thing at the end, to prevent runtime from sharing instances
       end

julia> struct NV
           next::Union{NV, Array}  # put a mutable thing at the end, to prevent runtime from sharing instances
           val::Float32
       end

julia> function make_chain_vn(n, sentinal)
           head = VN(1, sentinal)
           for i in 2:n
               head = VN(rand(Int), head)
           end
           return head
       end
make_chain_vn (generic function with 1 method)

julia> function make_chain_nv(n, sentinal)
           head = NV(sentinal, 1)
           for i in 2:n
               head = NV(head, rand(Int))
           end
           return head
       end
make_chain_nv (generic function with 1 method)

julia> vn1, vn2 = make_chain_vn(10000, []), make_chain_vn(10000, []);

julia> nv1, nv2 = make_chain_nv(10000, []), make_chain_nv(10000, []);
```

Master:
```
julia> @btime $vn1 === $vn2
  7.562 ns (0 allocations: 0 bytes)
false

julia> @btime $nv1 === $nv2  # slower, since it recurses pointers unnecessarily
  76.952 μs (0 allocations: 0 bytes)
false
```

After this commit:
```
julia> @btime $vn1 === $vn2
  8.597 ns (0 allocations: 0 bytes)
false

julia> @btime $nv1 === $nv2  # We get to skip the recursion and exit early. :)
  10.280 ns (0 allocations: 0 bytes)
false
```

However, I think that there are probably other benchmarks where it harms performance, so we'll have to see... For example, here's one: in the exact opposite case as above, if the two objects _are_ (almost) equal, necessitating checking every object, the `NV` comparisons could have exited after all the recursive pointer checks, and never compared the fields, whereas now the fields are checked first, so this gets slower. I'm not exactly sure why the `VN` comparisons get somewhat slower too, but it's maybe because of the second scan mentioned above.

```julia
julia> function make_chain_nv(n, sentinal)
           head = NV(sentinal, 1)
           for i in 2:n
               head = NV(head, i)
           end
           return head
       end
make_chain_nv (generic function with 1 method)

julia> function make_chain_vn(n, sentinal)
           head = VN(1, sentinal)
           for i in 2:n
               head = VN(i, head)
           end
           return head
       end
make_chain_vn (generic function with 1 method)

julia> vn1, vn2 = make_chain_vn(10000, []), make_chain_vn(10000, []);

julia> nv1, nv2 = make_chain_nv(10000, []), make_chain_nv(10000, []);
```

Master:
```
julia> @btime $vn1 === $vn2
  95.996 μs (0 allocations: 0 bytes)
false

julia> @btime $nv1 === $nv2
  82.192 μs (0 allocations: 0 bytes)
false
```

This commit:
```
julia> @btime $vn1 === $vn2
  127.512 μs (0 allocations: 0 bytes)
false

julia> @btime $nv1 === $nv2
  126.837 μs (0 allocations: 0 bytes)
```
Ignore pointer fields completely in the first pass of ===

Delay the nullptr checks until all non-ptr fields have been compared, since we have to go back to those anyways to follow the pointers.

Co-authored-by: Jeff Bezanson <[email protected]>
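For contrast, here is the same hypothetical `Ref`/`nothing` model reworked to match this commit: pass 1 now ignores pointer fields entirely, and the NULL checks move into pass 2, which has to revisit those fields anyway to follow the pointers. The name `egal_ptr_last` is again made up for illustration, and pointees are still assumed to be field tuples.

```julia
function egal_ptr_last(a::Tuple, b::Tuple)
    for (fa, fb) in zip(a, b)          # pass 1: non-pointer fields only
        fa isa Union{Base.RefValue, Nothing} && continue
        fa !== fb && return false      # bitwise mismatch: exit early
    end
    for (fa, fb) in zip(a, b)          # pass 2: pointer fields, incl. NULL checks
        fa isa Union{Base.RefValue, Nothing} || continue
        fa === fb && continue          # same pointer, or both NULL
        (fa === nothing || fb === nothing) && return false
        egal_ptr_last(fa[], fb[]) || return false  # pointees assumed to be tuples
    end
    return true
end
```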
Force-pushed from 7d8c933 to ba718e8
Done, thanks
Explicitly referencing #44712 from here.
We stumbled across this potential optimization while reading through the code for `compare_fields()`. Hopefully it's beneficial, but we'll see! :)

Co-Authored-By: @nystrom