Conversation

@NHDaly (Member) commented Jan 4, 2022

Immutable struct comparisons with `===` can be arbitrarily expensive for
deeply recursive but (almost) equal objects. Whenever possible, it's
valuable to defer the potentially expensive recursion by first comparing
the struct fields for bitwise equality.

Before this commit, two structs are compared elementwise, in the order of
the struct definition, recursing when pointer fields are encountered.

This commit defers the recursion into pointed-to fields until after all
other non-pointer fields of the struct are compared.

This has two advantages:

  1. It defers the expensive part of `===` comparison as long as possible,
    in the hopes that we can exit early from dissimilarities discovered
    elsewhere in the struct instances.
  2. It improves cache-locality by scanning the whole struct before
    jumping into any new places in memory (and reducing comparisons
    needed on the current cache line after returning from the recursive
    call).

The drawback is that you'll have to scan the pointer fields again, which
means potentially more cache misses if the struct is very large.

The best way to tell if this is helpful or harmful is benchmarking.
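
To make the reordering concrete, here is a minimal C sketch of the two-pass idea. It is not the actual `compare_fields()` code; the `FieldDesc` table and the field walking are invented for illustration (the real runtime works from the datatype's layout metadata), but the control flow is the point:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical field descriptor, invented for this sketch. */
typedef struct {
    size_t offset;  /* byte offset of the field within the object */
    size_t size;    /* byte size of an inline (non-pointer) field */
    int    isptr;   /* nonzero if the field is a pointer to another object */
} FieldDesc;

static int egal_sketch(const char *a, const char *b,
                       const FieldDesc *f, size_t nf)
{
    /* Pass 1: a cheap, cache-friendly scan over the structs themselves. */
    for (size_t i = 0; i < nf; i++) {
        if (f[i].isptr) {
            const char *af = *(const char *const *)(a + f[i].offset);
            const char *bf = *(const char *const *)(b + f[i].offset);
            /* A NULL-vs-non-NULL mismatch can never be egal, so exit
             * here, before doing any recursion at all. */
            if (af != bf && (af == NULL || bf == NULL))
                return 0;
        }
        else if (memcmp(a + f[i].offset, b + f[i].offset, f[i].size) != 0) {
            return 0;  /* inline bits differ: early exit, no recursion */
        }
    }
    /* Pass 2: only now follow pointers (rescanning the pointer fields). */
    for (size_t i = 0; i < nf; i++) {
        if (!f[i].isptr)
            continue;
        const char *af = *(const char *const *)(a + f[i].offset);
        const char *bf = *(const char *const *)(b + f[i].offset);
        if (af == bf)
            continue;
        /* Simplification: reuse the same descriptor table for the pointee;
         * the real runtime looks up the pointee's own type layout. */
        if (!egal_sketch(af, bf, f, nf))
            return 0;
    }
    return 1;
}
```

Note that pass 2 re-reads the pointer fields, which is exactly the extra-scan drawback mentioned above.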

Here is the motivating benchmark, which indeed improves by 10x with this
commit, compared to master:

```julia
julia> using BenchmarkTools

julia> struct VN
           val::Float32
           next::Union{VN, Array}  # put a mutable thing at the end, to prevent runtime from sharing instances
       end

julia> struct NV
           next::Union{NV, Array}  # put a mutable thing at the end, to prevent runtime from sharing instances
           val::Float32
       end

julia> function make_chain_vn(n, sentinal)
           head = VN(1, sentinal)
           for i in 2:n
               head = VN(rand(Int), head)
           end
           return head
       end
make_chain_vn (generic function with 1 method)

julia> function make_chain_nv(n, sentinal)
           head = NV(sentinal, 1)
           for i in 2:n
               head = NV(head, rand(Int))
           end
           return head
       end
make_chain_nv (generic function with 1 method)

julia> vn1, vn2 = make_chain_vn(10000, []), make_chain_vn(10000, []);

julia> nv1, nv2 = make_chain_nv(10000, []), make_chain_nv(10000, []);
```

Master:

```
julia> @btime $vn1 === $vn2
  7.562 ns (0 allocations: 0 bytes)
false

julia> @btime $nv1 === $nv2  # slower, since it recurses pointers unnecessarily
  76.952 μs (0 allocations: 0 bytes)
false
```

After this commit:

```
julia> @btime $vn1 === $vn2
  8.597 ns (0 allocations: 0 bytes)
false

julia> @btime $nv1 === $nv2  # We get to skip the recursion and exit early. :)
  10.280 ns (0 allocations: 0 bytes)
false
```

However, I think that there are probably other benchmarks where it
harms performance, so we'll have to see...

For example, here's one: In the exact opposite case as above, if the two
objects _are_ (almost) equal, necessitating checking every object, the
`NV` comparisons could have exited after all the recursive pointer
checks, and never compared the fields, whereas now the fields are checked
first, so this gets slower.

I'm not exactly sure why the `VN` comparisons get somewhat slower too,
but it's maybe because of the second scan mentioned above.

```julia
julia> function make_chain_nv(n, sentinal)
           head = NV(sentinal, 1)
           for i in 2:n
               head = NV(head, i)
           end
           return head
       end
make_chain_nv (generic function with 1 method)

julia> function make_chain_vn(n, sentinal)
           head = VN(1, sentinal)
           for i in 2:n
               head = VN(i, head)
           end
           return head
       end
make_chain_vn (generic function with 1 method)

julia> vn1, vn2 = make_chain_vn(10000, []), make_chain_vn(10000, []);

julia> nv1, nv2 = make_chain_nv(10000, []), make_chain_nv(10000, []);
```

Master:

```
julia> @btime $vn1 === $vn2
  95.996 μs (0 allocations: 0 bytes)
false

julia> @btime $nv1 === $nv2
  82.192 μs (0 allocations: 0 bytes)
false
```

This commit:

```
julia> @btime $vn1 === $vn2
  127.512 μs (0 allocations: 0 bytes)
false

julia> @btime $nv1 === $nv2
  126.837 μs (0 allocations: 0 bytes)
```

We stumbled across this potential optimization while reading through the code for `compare_fields()`. Hopefully it's beneficial, but we'll see! :)

Co-Authored-By: @nystrom

@NHDaly added the "performance" (Must go faster) label Jan 4, 2022
@NHDaly changed the title from "jl_egal Optimization: Defer recursion into pointees" to "=== Optimization: Defer recursion into pointees" Jan 4, 2022
@NHDaly (Member, Author) commented Jan 4, 2022

@nanosoldier runbenchmarks(ALL, vs="@f8f42ec0370b72072498c1b79eb5cf4e6c6864e6")

1 similar comment
@KristofferC (Member)

@nanosoldier runbenchmarks(ALL, vs="@f8f42ec0370b72072498c1b79eb5cf4e6c6864e6")

@JeffBezanson (Member)

Good idea, I like it! This sort of feels like the right way to do it to me, even if it's a tad slower in some cases.

I wonder if it would help to ignore pointer fields completely in the first pass, instead of doing the `if (af != bf && (af == NULL || bf == NULL))` check?

@nanosoldier (Collaborator)

Something went wrong when running your job:

NanosoldierError: error when preparing/pushing to report repo: failed process: Process(setenv(`git push`; dir="/data/nanosoldier/workdir/NanosoldierReports"), ProcessExited(1)) [1]

Unfortunately, the logs could not be uploaded.

@NHDaly (Member, Author) commented Jan 5, 2022

Yay, thanks! 🎉

> I wonder if it would help to ignore pointer fields completely in the first pass, instead of doing the `if (af != bf && (af == NULL || bf == NULL))` check?

Yeah, I had that idea too, and I could go either way. I chose to do it this way, because it's possible the NULL/NULL check could also allow us to return early: If a struct has 2 pointer fields and the second one is NULL in `a` but not `b`, you could avoid the recursive call to compare the first pointee if you had seen the NULL pointer first.

So it's another instance of front-loading the checks on the struct's bits before recursing to pointees. So I think it's probably better to do it this way, since it's making the same tradeoff as the rest of the PR. Does that make sense to you, too?
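
To make that scenario concrete, here's a hypothetical two-pointer-field example (invented type and field names, not code from this PR). With the NULL checks kept in the first pass, a NULL mismatch in the second field is caught before we ever recurse into the first:

```c
#include <stddef.h>

/* Hypothetical node with two pointer fields, invented for illustration. */
typedef struct Node Node;
struct Node {
    Node *first;
    Node *second;
};

static int node_egal(const Node *a, const Node *b)
{
    /* Pass 1: NULL-mismatch checks only, no recursion yet. If a->second
     * is NULL but b->second isn't, we return here without ever following
     * a->first. */
    if (a->first != b->first && (a->first == NULL || b->first == NULL))
        return 0;
    if (a->second != b->second && (a->second == NULL || b->second == NULL))
        return 0;
    /* Pass 2: recurse only into pointees that still might differ; pass 1
     * guarantees both sides are non-NULL here. */
    if (a->first != b->first && !node_egal(a->first, b->first))
        return 0;
    if (a->second != b->second && !node_egal(a->second, b->second))
        return 0;
    return 1;
}
```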

@NHDaly marked this pull request as ready for review January 5, 2022 16:57
@NHDaly (Member, Author) commented Jan 5, 2022

Anyone know why Nanosoldier failed to upload the benchmark results? Is it because the PR is a draft or something? I'll mark it ready for review now.

@KristofferC (Member)

> Anyone know why Nanosoldier failed to upload the benchmark results? Is it because the PR is a draft or something?

The size has just gotten too big and no one has fixed it.

@JeffBezanson (Member)

NULL fields are rare, so I think not worth special-casing. Actual timing is what matters though.

@JeffBezanson (Member)

I get these results for your last benchmark (where the PR is slower):

before PR:
132.834 μs (0 allocations: 0 bytes)
124.298 μs (0 allocations: 0 bytes)

PR:
192.262 μs (0 allocations: 0 bytes)
193.349 μs (0 allocations: 0 bytes)

PR + ignore pointer fields on the first pass:
180.625 μs (0 allocations: 0 bytes)
183.335 μs (0 allocations: 0 bytes)

So there is a slight improvement (and looks like I need a new laptop again 😂 )
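
For reference, in the same invented `FieldDesc` terms as the sketch in the PR description (still an illustration, not the runtime's code), the "ignore pointer fields on the first pass" variant just moves the NULL checks down into the pointer pass:

```c
#include <stddef.h>
#include <string.h>

/* Same hypothetical FieldDesc as in the earlier sketch. */
typedef struct {
    size_t offset;
    size_t size;
    int    isptr;
} FieldDesc;

static int egal_sketch2(const char *a, const char *b,
                        const FieldDesc *f, size_t nf)
{
    /* Pass 1: inline bits only; pointer fields are skipped entirely. */
    for (size_t i = 0; i < nf; i++) {
        if (!f[i].isptr &&
            memcmp(a + f[i].offset, b + f[i].offset, f[i].size) != 0)
            return 0;
    }
    /* Pass 2: the NULL checks are deferred to here, next to the recursion. */
    for (size_t i = 0; i < nf; i++) {
        if (!f[i].isptr)
            continue;
        const char *af = *(const char *const *)(a + f[i].offset);
        const char *bf = *(const char *const *)(b + f[i].offset);
        if (af == bf)
            continue;
        if (af == NULL || bf == NULL)
            return 0;
        /* Same descriptor-table simplification as before. */
        if (!egal_sketch2(af, bf, f, nf))
            return 0;
    }
    return 1;
}
```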

@NHDaly (Member, Author) commented Jan 6, 2022

Okay cool, makes sense!

I think the "NULL fields are rare, so I think not worth special-casing." argument is strong here, so that makes sense. Thanks! 👍 👍

I can make that change.

But first: is it possible to see the results of the Nanosoldier run, or are they lost to the mysteries of the universe? It would be nice to see whether this proves to be good or bad, and I'd like to have those numbers before making any changes so we can compare.

Thanks!

@KristofferC (Member)

@vtjnash has to manually retrieve them.

@vtjnash (Member) commented Jan 10, 2022

Oops, sorry, I am behind on email. Here you go:
https://github.com/JuliaCI/NanosoldierReports/blob/master/benchmark/by_hash/cffd656_vs_f8f42ec/report.md

@NHDaly (Member, Author) commented Jan 10, 2022

Thanks @vtjnash!

What do the percentages mean?

It looks like some got quite a bit better and some quite a bit worse, but in ways that don't exactly make sense to me... So I feel like this is probably mostly just noise? Can any of you with more experience reading these weigh in?
[screenshot: table of Nanosoldier benchmark results, 2022-01-10]

I'll give your suggestion a shot now, @JeffBezanson.

@NHDaly (Member, Author) commented Jan 10, 2022

@nanosoldier runbenchmarks(ALL, vs="@5449d1bfabdaeeb321c179a8344dc2852a989764")

@oscardssmith (Member)

Nanosoldier (especially recently) is pretty noisy.

@KristofferC (Member)

@NHDaly, you need to code quote the part after the nanosoldier invocation.

@oscardssmith (Member)

@nanosoldier runbenchmarks(ALL, vs="@5449d1bfabdaeeb321c179a8344dc2852a989764")

@nanosoldier (Collaborator)

Something went wrong when running your job:

NanosoldierError: error when preparing/pushing to report repo: failed process: Process(setenv(`git push`; dir="/data/nanosoldier/workdir/NanosoldierReports"), ProcessExited(1)) [1]

Unfortunately, the logs could not be uploaded.

@NHDaly (Member, Author) commented Jan 11, 2022

Oops, thanks @KristofferC. I did that right the first time (but it didn't work then either.. 🤔 maybe I don't have permissions or something).

Thanks @oscardssmith and @KristofferC

@oscardssmith (Member)

It did work; it's just that there's a Nanosoldier bug that means @vtjnash needs to post the log manually.

@vtjnash (Member) commented Jan 11, 2022

I don't know if this is the comparison you wanted, since there are a lot of unrelated commits in that command, but here you go: https://github.com/JuliaCI/NanosoldierReports/blob/master/benchmark/by_hash/18eef47_vs_5449d1b/report.md

@vtjnash (Member) commented Jan 24, 2022

@nanosoldier runbenchmarks(ALL, vs="@master")

@vtjnash (Member) commented Jan 24, 2022

For more impact, we may want to update codegen.cpp to do the same ordering change as here also for emit_bits_compare

@nanosoldier (Collaborator)

Something went wrong when running your job:

NanosoldierError: error when preparing/pushing to report repo: failed process: Process(setenv(`git push`; dir="/nanosoldier/workdir/NanosoldierReports"), ProcessExited(1)) [1]

Unfortunately, the logs could not be uploaded.

@vtjnash (Member) commented Jan 26, 2022

https://github.com/JuliaCI/NanosoldierReports/blob/master/benchmark/by_hash/3eac22a_vs_master/report.md

Still not what I meant to run, though: @nanosoldier runbenchmarks(!"scalar", vs=":master")

@vtjnash added the "merge me" (PR is reviewed. Merge when all tests are passing) label Jan 26, 2022
@nanosoldier (Collaborator)

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here.

@NHDaly (Member, Author) commented Jan 27, 2022

> I don't know if this is the comparison you wanted, since there are a lot of unrelated commits in that command, but here you go: https://github.com/JuliaCI/NanosoldierReports/blob/master/benchmark/by_hash/18eef47_vs_5449d1b/report.md

Oh, huh, maybe I don't exactly understand how Nanosoldier works. I think I was trying to compare the latest commit on this branch against the previous commit. Did that not do that?

@NHDaly (Member, Author) commented Jan 27, 2022

> For more impact, we may want to update codegen.cpp to do the same ordering change as here also for emit_bits_compare

Yeah, makes sense. Good idea! I don't really feel like I have the chops to do that.. Should it be done in a separate PR? Or could someone help with that here?

Also, I really don't know what to make of these Nanosoldier results.. Some things appear to get quite a bit better, and some quite a bit worse, but it's hard to tell what's just noise? :(

@DilumAluthge removed the "merge me" label Jan 30, 2022
@DilumAluthge (Member) commented Jan 30, 2022

@NHDaly Can you rebase on the latest master? That should fix the llvmpasses failure.

* jl_egal Optimization: Defer recursion into pointees

* Ignore pointer fields completely in the first pass of ===

Delay the nullptr checks until all non-ptr fields have been compared,
since we have to go back to those anyways to follow the pointers.

Co-authored-by: Jeff Bezanson <[email protected]>
@NHDaly force-pushed the nhd-jl_egal-defer-ptr-recursion branch from 7d8c933 to ba718e8 on January 31, 2022 19:51
@NHDaly (Member, Author) commented Jan 31, 2022

Done, thanks

@vtjnash added the "merge me" label Jan 31, 2022
@DilumAluthge merged commit d7fd3b7 into master Feb 8, 2022
@DilumAluthge deleted the nhd-jl_egal-defer-ptr-recursion branch February 8, 2022 21:18
@DilumAluthge removed the "merge me" label Feb 8, 2022
antoine-levitt pushed a commit to antoine-levitt/julia that referenced this pull request Feb 17, 2022
* jl_egal Optimization: Defer recursion into pointees

* Ignore pointer fields completely in the first pass of ===
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Feb 22, 2022
* jl_egal Optimization: Defer recursion into pointees

* Ignore pointer fields completely in the first pass of ===
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Mar 8, 2022
* jl_egal Optimization: Defer recursion into pointees

* Ignore pointer fields completely in the first pass of ===
@KristofferC (Member)

Explicitly referencing #44712 from here.
