Skip to content

Investigate FxHash low-order bit quality when hashing aligned addresses. #58249

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
eddyb opened this issue Feb 6, 2019 · 4 comments
Closed
Labels
C-enhancement Category: An issue proposing an enhancement or a PR with one. I-compiletime Issue: Problems and improvements with respect to compile times. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. WG-compiler-performance Working group: Compiler Performance

Comments

@eddyb
Copy link
Member

eddyb commented Feb 6, 2019

We might be producing hashes that always have lowest N bits zeroed.
cc @Amanieu @Zoxc

@Zoxc Zoxc added the WG-compiler-performance Working group: Compiler Performance label Feb 7, 2019
@Mark-Simulacrum
Copy link
Member

Cc @nnethercote as well, who I believe pulled FxHash from Firefox originally...

@Amanieu
Copy link
Member

Amanieu commented Feb 25, 2019

@nnethercote
Copy link
Contributor

See also #69153

@jonas-schievink jonas-schievink added C-enhancement Category: An issue proposing an enhancement or a PR with one. I-compiletime Issue: Problems and improvements with respect to compile times. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Feb 14, 2020
@Mark-Simulacrum
Copy link
Member

I instrumented hashbrown's ProbeSeq with a manual Drop impl that logged the number of times we moved to a new group. If I'm understanding the code right that should correspond to the number of times we had a collision bad enough to have to go to the next group (16 buckets/group with sse2 hashbrown). When compiling libcore, that gives us these numbers:

79075634 counts
(  1) 77018621 (97.4%, 97.4%): groups_visited=0
(  2)  1291659 ( 1.6%, 99.0%): groups_visited=1
(  3)   507299 ( 0.6%, 99.7%): groups_visited=2
(  4)   159950 ( 0.2%, 99.9%): groups_visited=3
(  5)    29613 ( 0.0%, 99.9%): groups_visited=4
(  6)    20021 ( 0.0%, 99.9%): groups_visited=5
(  7)    10334 ( 0.0%,100.0%): groups_visited=6
(  8)     5744 ( 0.0%,100.0%): groups_visited=7
(  9)     4269 ( 0.0%,100.0%): groups_visited=8
( 10)     3754 ( 0.0%,100.0%): groups_visited=9
[...] up to 54 groups visited in worst case, which happened 7 times

I think it's fair to say based on this data that even if we do have low quality in the bottom bits, it's not a huge problem - we probe only once in the table (i.e., moved=0) a significant majority of the time.

We also relatively rarely call eq() lots of times; I think these are the right counts. Note that some probes look for an empty bucket without checking for equality, which I think is why the total count here doesn't line up with the count above. eq_calls=0 and =1 mean we don't have any collisions in the lower + upper (h2) bits on lookup, which is the good case:

61370715 counts
(  1) 38594719 (62.9%, 62.9%): eq_calls=1
(  2) 22474004 (36.6%, 99.5%): eq_calls=0
(  3)   271631 ( 0.4%,100.0%): eq_calls=2
(  4)    20983 ( 0.0%,100.0%): eq_calls=3
(  5)     6604 ( 0.0%,100.0%): eq_calls=4
(  6)     1794 ( 0.0%,100.0%): eq_calls=5
(  7)      667 ( 0.0%,100.0%): eq_calls=6
(  8)      230 ( 0.0%,100.0%): eq_calls=7
(  9)       65 ( 0.0%,100.0%): eq_calls=8
( 10)       12 ( 0.0%,100.0%): eq_calls=9
( 11)        5 ( 0.0%,100.0%): eq_calls=10
( 12)        1 ( 0.0%,100.0%): eq_calls=12

I spent some time trying to figure out if there's anything interesting about where we have longer probe lengths (e.g., correlation with bad hashes or w/e), but at least naively I'm not seeing anything all that interesting -- the hashes aren't necessarily amazing but there's no consistent patterns or other flaws that I can quickly spot. I think it's not fully implausible that we're just really unlucky when we need to probe lots of times?

My sense is that there's no good case for investigating changes further based on this data collection, so I'm going to go ahead and close.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Category: An issue proposing an enhancement or a PR with one. I-compiletime Issue: Problems and improvements with respect to compile times. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. WG-compiler-performance Working group: Compiler Performance
Projects
None yet
Development

No branches or pull requests

6 participants