
Look into using a better hash function #48


Closed
Amanieu opened this issue Feb 25, 2019 · 23 comments · Fixed by #97

Comments

@Amanieu
Member

Amanieu commented Feb 25, 2019

We currently use FxHash as the default hash function, but this function handles aligned values poorly: when hashing an integer, if the low X bits of the input value are 0 then the low X bits of the hash value will also be 0.
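For illustration, here is a minimal sketch of that effect (assuming the 64-bit FxHash constant 0x517cc1b727220a95 and a fresh hasher state of zero, so hashing a single word reduces to a multiply by that constant; the function name is made up for this example):

fn fx_like_hash(i: u64) -> u64 {
    // With a zero state, FxHash's per-word step (rotate, xor, multiply) reduces to i * K,
    // and multiplying by any constant preserves the key's trailing zero bits.
    const K: u64 = 0x517cc1b727220a95;
    (0u64.rotate_left(5) ^ i).wrapping_mul(K)
}

fn main() {
    // Keys that are multiples of 1024 hash to multiples of 1024, so a
    // power-of-two-sized table would map them all to the same few buckets.
    for key in (1..5u64).map(|n| n * 1024) {
        println!("key = {:5}, hash % 1024 = {}", key, fx_like_hash(key) % 1024);
    }
}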

One option would be to copy Google's CityHash (used by SwissTable). This does a widening multiply and XORs the top and bottom words of the result together:

impl FxHasher {
    #[inline]
    #[cfg(target_pointer_width = "32")]
    fn add_to_hash(&mut self, i: usize) {
        // Widen to 64 bits, multiply, then fold the high half back into the low half.
        let tmp = self.hash.wrapping_add(i) as u64 * 0xcc9e2d51;
        self.hash = (tmp >> 32 ^ tmp) as usize;
    }
    #[inline]
    #[cfg(target_pointer_width = "64")]
    fn add_to_hash(&mut self, i: usize) {
        // Same idea with a 64x64 -> 128-bit multiply; the constant is CityHash's 128-to-64 mixer.
        let tmp = self.hash.wrapping_add(i) as u128 * 0x9ddfea08eb382d69;
        self.hash = (tmp >> 64 ^ tmp) as usize;
    }
}

cc rust-lang/rust#58249

@tkaitchuck
Contributor

I would humbly submit aHash as a possible alternative: https://github.com/tkaitchuck/aHash
It is obviously architecture-specific, but it would be possible to implement a fallback without any runtime overhead.
Either way, it may be worth taking a look at how aHash handles strings. Aside from the actual scrambling function, I think the way aHash extracts the data is faster. When implementing it I became convinced that FxHash could do a lot better with poorly aligned data.
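For context, here is a minimal hedged sketch (not aHash's actual code, just an illustration of the kind of extraction being referred to): reading the input in word-sized chunks rather than byte by byte.

use std::convert::TryInto;

// Read a byte slice as little-endian u64 words plus one zero-padded tail word.
fn read_words(data: &[u8]) -> (Vec<u64>, u64) {
    let mut chunks = data.chunks_exact(8);
    let words: Vec<u64> = (&mut chunks)
        .map(|c| u64::from_le_bytes(c.try_into().unwrap()))
        .collect();
    // Any remaining bytes are packed into a single zero-padded word.
    let rem = chunks.remainder();
    let mut tail = [0u8; 8];
    tail[..rem.len()].copy_from_slice(rem);
    (words, u64::from_le_bytes(tail))
}

fn main() {
    let (words, tail) = read_words(b"hello, hashing world");
    println!("{} full words, tail = {:#018x}", words.len(), tail);
}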

@Amanieu
Member Author

Amanieu commented Mar 1, 2019

@tkaitchuck aHash looks great, except for the part where it is not always supported by the CPU. Unfortunately the hash function is called often enough that doing dynamic dispatch based on the CPU features would cost too much performance.

@tkaitchuck
Contributor

tkaitchuck commented Mar 1, 2019

It ought to be possible to do this statically with no runtime overhead. For example, it is possible to define something like:

pub struct HasherState { buffer: [u64; 2] }
#[cfg(all(any(target_arch = "x86", target_arch = "x86_64"), target_feature = "aes"))]
impl Hasher for HasherState {
    // Ahash's implementation
}
#[cfg(not(all(any(target_arch = "x86", target_arch = "x86_64"), target_feature = "aes")))]
impl Hasher for HasherState {
    // Fallback hash implementation
}

Then to the calling code there is always exactly one implementation of the trait, statically dispatched. Which one gets compiled just depends on which target you are compiling for.
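For example (a hypothetical usage sketch, assuming HasherState from the snippet above also derives Default so it can be used with BuildHasherDefault), the map type names the concrete hasher and the compiler uses whichever cfg-selected impl was built:

use std::collections::HashMap;
use std::hash::BuildHasherDefault;

// Whichever cfg-selected Hasher impl was compiled in is dispatched statically here;
// there is no runtime branch on CPU features.
type FastMap<K, V> = HashMap<K, V, BuildHasherDefault<HasherState>>;

fn main() {
    let mut map: FastMap<u64, &str> = FastMap::default();
    map.insert(42, "answer");
}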

If that sounds interesting I could take a stab at implementing a fallback inside of the aHash repo.

@tkaitchuck
Contributor

tkaitchuck commented Mar 4, 2019

I've added a fallback algorithm to aHash; the details are in the README. It was inspired by FxHash, but it is also keyed and should be considerably harder to attack than MurmurHash2 was (or Fx, which is trivial to attack). I'm still not totally happy with its security or performance, but it's still better than anything else I'm aware of at that level of performance. Take a look; I'll continue to tweak it.

@tkaitchuck
Contributor

tkaitchuck commented Mar 13, 2019

@Amanieu Do you have any thoughts? The aHash fallback algorithm now beats FxHash on any string longer than 4 bytes. It is also within a nanosecond for all primitives, which I think is as good as can be done. FxHash really plays fast and loose when hashing primitives: I can easily imagine a HashMap<u64, _> where the key is a 'size' or 'location' in bytes, which would turn the map into a list if the key values happened to be a whole number of KB. If AES-NI is available, aHash beats Fx's speed at everything except primitives (only off by a third of a nanosecond) and strings of exactly 4 bytes.

The benchmarks included in the README are against the generic code path, so as you can see the dispatch is done completely at compile time, and it adds no overhead.

PS: Do you know how the snippet you have at the top performs on non-x86? I considered such an approach in the fallback, knowing it would get translated to a single widening 32/64-bit multiply on Intel, but I was afraid that on other architectures it might turn into a full 128-bit multiply, which could really suck. -- UPDATE: It appears ARM and WebAssembly don't have such an instruction.

@Amanieu
Member Author

Amanieu commented Mar 14, 2019

@tkaitchuck One of the major use cases for hash tables is integer keys. Can you provide some benchmarks when hashing simple u32 values?

@Amanieu
Member Author

Amanieu commented Mar 14, 2019

Ah nevermind, I see the results for u32 in the link you provided. Can you try running the hashbrown benchmarks with the hasher replaced with aHash (with and without AES)? I am particularly interested in the effect this has on lookup performance (if any).

@tkaitchuck
Contributor

tkaitchuck commented Mar 14, 2019

(Updated)

With FX:
running 7 tests
test find_existing     ... bench:       4,797 ns/iter (+/- 181)
test find_nonexisting  ... bench:       4,060 ns/iter (+/- 291)
test get_remove_insert ... bench:          29 ns/iter (+/- 0)
test grow_by_insertion ... bench:         166 ns/iter (+/- 22)
test hashmap_as_queue  ... bench:          21 ns/iter (+/- 0)
test new_drop          ... bench:           0 ns/iter (+/- 0)
test new_insert_drop   ... bench:          49 ns/iter (+/- 1)

aHash without AES
running 7 tests
test find_existing     ... bench:       5,713 ns/iter (+/- 180)
test find_nonexisting  ... bench:       5,473 ns/iter (+/- 95)
test get_remove_insert ... bench:          65 ns/iter (+/- 1)
test grow_by_insertion ... bench:         131 ns/iter (+/- 20)
test hashmap_as_queue  ... bench:          42 ns/iter (+/- 0)
test new_drop          ... bench:           0 ns/iter (+/- 0)
test new_insert_drop   ... bench:          52 ns/iter (+/- 0)

aHash with AES:
running 7 tests
test find_existing     ... bench:       5,401 ns/iter (+/- 109)
test find_nonexisting  ... bench:       5,025 ns/iter (+/- 87)
test get_remove_insert ... bench:          36 ns/iter (+/- 1)
test grow_by_insertion ... bench:         140 ns/iter (+/- 23)
test hashmap_as_queue  ... bench:          23 ns/iter (+/- 0)
test new_drop          ... bench:           0 ns/iter (+/- 0)
test new_insert_drop   ... bench:          43 ns/iter (+/- 0)

There is a bit of a performance hit because all of those run with i32 keys, which are just not as fast in aHash as in Fx. A few things seem notable:

  • There is a bit of a reduction in variance in find_nonexisting.
  • The biggest hit is in get_remove_insert in the fallback. (I don't fully understand this, but it does not appear to be strongly influenced by tweaks to the algorithm.)
  • grow_by_insertion sees a performance boost with aHash with or without AES despite the lower i32 perf. I don't know what that could be if not collisions.

I could improve the performance of the fallback hasher on primitive types like i32 if it could be guaranteed that the hasher is only ever passed a single value. The only way I can think of to make such a thing work is to have the user use a macro to instantiate the map, where the macro looks at the key type and creates one hasher builder for known primitives and a different one for anything else (see the rough sketch below). This strikes me as too invasive, as it requires changing users' code.
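A very rough, purely hypothetical sketch of that macro idea, using two standard-library hashers as stand-ins for a "primitive-specialized" hasher and the general-purpose one:

use std::collections::hash_map::{DefaultHasher, RandomState};
use std::collections::HashMap;
use std::hash::BuildHasherDefault;

// Dispatch on the literal key type token: a known primitive gets one hasher builder,
// everything else falls through to another. The hashers here are only placeholders.
macro_rules! keyed_map {
    (u64, $v:ty) => {
        HashMap::<u64, $v, BuildHasherDefault<DefaultHasher>>::default()
    };
    ($k:ty, $v:ty) => {
        HashMap::<$k, $v, RandomState>::default()
    };
}

fn main() {
    let mut by_offset = keyed_map!(u64, &'static str); // matches the primitive arm
    by_offset.insert(4096, "one page in");

    let mut by_name = keyed_map!(String, u32); // falls through to the general arm
    by_name.insert("len".to_string(), 7);
}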

@tkaitchuck
Contributor

tkaitchuck commented Mar 15, 2019

@Amanieu I figured out a way to speed up primitives (without changing the actual algorithm!) and released 0.1.13; benchmarks are in the README, and I updated my comment above to reflect the impact on hashbrown when using it.

@tkaitchuck
Contributor

@Amanieu what is the next step? Should I create a pull request? Do you want to see some string benchmarks? A comparison of a number of different hashes?

@Amanieu
Member Author

Amanieu commented Mar 23, 2019

Sorry for the late response, I've been busy with some things lately.

I am hesitant to use aHash directly as the default hasher because it is slower than the current hasher. However if it is as secure as SipHash it might be a good idea to use it as a replacement for the default hasher in libstd.

@tkaitchuck
Contributor

tkaitchuck commented Mar 31, 2019

Update: I was able to improve the performance of the fallback algorithm a bit. I've also submitted a PR for the additional benchmarks I've been comparing with. Here is what I observe locally (FxHash on the left, aHash on the right):

benchmark                   FxHash                     aHash with fallback       ratio
find_existing                 4,764 (+/- 270) ns         5,050 (+/- 122) ns    106.00%
find_existing_high_bits     145,878 (+/- 3,905) ns       5,451 (+/- 210) ns      3.74%
find_nonexisting              4,066 (+/- 154) ns         4,416 (+/- 73) ns     108.61%
get_remove_insert                30 (+/- 1) ns              32 (+/- 1) ns      106.67%
grow_by_insertion               152 (+/- 18) ns            134 (+/- 13) ns      88.16%
grow_by_insertion_kb            248 (+/- 361) ns           178 (+/- 5) ns       71.77%
hashmap_as_queue                 21 (+/- 0) ns              22 (+/- 0) ns      104.76%
insert_8_char_string         12,704 (+/- 429) ns         9,692 (+/- 174) ns     76.29%
new_insert_drop                  73 (+/- 2) ns              73 (+/- 1) ns      100.00%

benchmark                   FxHash                     aHash with AES            ratio
find_existing                 4,764 (+/- 270) ns         5,253 (+/- 106) ns    110.26%
find_existing_high_bits     145,878 (+/- 3,905) ns       6,106 (+/- 212) ns      4.19%
find_nonexisting              4,066 (+/- 154) ns         5,150 (+/- 88) ns     126.66%
get_remove_insert                30 (+/- 1) ns              37 (+/- 0) ns      123.33%
grow_by_insertion               152 (+/- 18) ns            137 (+/- 25) ns      90.13%
grow_by_insertion_kb            248 (+/- 361) ns           145 (+/- 3) ns       58.47%
hashmap_as_queue                 21 (+/- 0) ns              23 (+/- 0) ns      109.52%
insert_8_char_string         12,704 (+/- 429) ns         8,129 (+/- 196) ns     63.99%
new_insert_drop                  73 (+/- 2) ns              64 (+/- 2) ns       87.67%

With this new version, the performance gap is a lot narrower, and with strings > 4 characters aHash is faster. Also FxHash really falls apart on the find_existing_high_bits test. I realize that isn't "normal" data but as a user I would never expect that level of performance drop (vs find_existing).

@Amanieu Is there a good way to actually get a representative sample of what users actually USE as their keys? That would allow us to make a benchmark that is a lot less abstract.

bors bot referenced this issue Mar 31, 2019
56: Add additional benchmarks. r=Amanieu a=tkaitchuck

This covers performance of three cases I wanted to study when looking into https://github.com/Amanieu/hashbrown/issues/48

They are:
`grow_by_insertion_kb`, which is similar to grow_by_insertion, but instead of every key differing by 1, they differ by 1024. This makes an important performance difference to the hasher.

`find_existing_high_bits`, which is similar to find_existing but uses 64-bit keys instead of 32-bit keys, with the lower 32 bits all zero. This is a pathologically bad case for FxHash.

`insert_8_char_string`, which tests a case where the key is a string (as opposed to all the existing tests, which operate on u32 values). This is important to cover because string keys are very common.

Co-authored-by: Tom Kaitchuck <[email protected]>
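For illustration, a hedged sketch of roughly the workload a find_existing_high_bits-style benchmark exercises (the real benchmark lives in hashbrown's benches and uses the bench harness; this is only the shape of the key pattern):

use std::collections::HashMap;

fn main() {
    // 64-bit keys whose low 32 bits are all zero: exactly the pattern that leaves
    // the low bits of an FxHash-style hash all zero as well.
    let map: HashMap<u64, u64> = (1u64..=1000).map(|i| (i << 32, i)).collect();

    let found = (1u64..=1000).filter(|&i| map.contains_key(&(i << 32))).count();
    assert_eq!(found, 1000);
}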
@Zoxc
Contributor

Zoxc commented Apr 1, 2019

@tkaitchuck Your benchmarks (in the aHash repo) seem to use the fxhash crate, which is not the same as the FxHash in hashbrown or rustc_hash. Are you using the hashbrown version here? Testing with this PR would be interesting too.

@tkaitchuck
Contributor

tkaitchuck commented Apr 1, 2019

@Zoxc aHash's README refers to the public crate. The numbers above were measured against the version in this repo.
Your comment did get me wondering whether there was a performance difference: there is.
In my test they are identical for u8-u128 and for strings of 68 bytes and up. However, for small strings this version is much better (though it doesn't have that one really odd good spot at exactly 4 bytes):

u8                   time:    1.1536 ns                    
u16                  time:    1.1517 ns                     
u32                  time:    1.1483 ns                     
u64                  time:    1.1584 ns                     
u128                 time:    1.4320 ns                      
  1 bytes            time:    1.8159 ns                            
  3 bytes            time:    2.5453 ns                              
  4 bytes            time:    2.3543 ns                               
  7 bytes            time:    2.9000 ns                                  
  8 bytes            time:    2.6467 ns                                                                         
  15 bytes           time:    3.8354 ns 
  16 bytes           time:    2.9812 ns                                  
  24 bytes           time:    3.9831 ns 
  68 bytes           time:    6.8920 ns 
 132 bytes           time:    14.674 ns 
1024 bytes           time:    205.61 ns 

(Still not as good as aHash with or without fallback for strings, but much better overall)

@Amanieu
Member Author

Amanieu commented Jun 13, 2019

After thinking about this a bit and looking at alternatives, I think that aHash would be great as the default hasher for hashbrown. However, there are a few issues that currently prevent me from doing so:

  • The ahash crate needs to be #![no_std].
  • The ahash crate needs to compile on stable Rust, and on Rust 1.31, which is the minimum Rust version for hashbrown.

@Amanieu
Member Author

Amanieu commented Jun 19, 2019

ping @tkaitchuck

@alkis
Contributor

alkis commented Jun 26, 2019

You probably want to look at XXH3. The author really gets it. The benchmarks cover the space well: variable sizes, measuring latency instead of throughput.

How does aHash fare on SMHasher, and particularly this fork?

@tkaitchuck
Contributor

tkaitchuck commented Jul 3, 2019

@alkis there is an open issue to do that here: tkaitchuck/aHash#6
If you are interested in contributing the needed glue code, it would be very much appreciated.

@tkaitchuck
Contributor

@Amanieu aHash now builds on stable and is no-std.

bors added a commit that referenced this issue Aug 4, 2019
Replace FxHash with AHash as the default hasher

Fixes #48

cc @tkaitchuck
@bors bors closed this as completed in #97 Aug 4, 2019
@LifeIsStrange

Cyan4973/xxHash#155
Meow seems to be the fastest non-cryptographic hash algorithm in existence, at least when hardware SIMD is available.

@tkaitchuck
Contributor

> Cyan4973/xxHash#155
> Meow seems to be the fastest non-cryptographic hash algorithm in existence, at least when hardware SIMD is available.

I haven't compared it in a benchmark with aHash, but looking at the code, it's very clearly designed for throughput, not latency. It might even beat aHash when dealing with 1 MB of data (though I would not be too confident), but that's rarely the case with a hash map in memory. A hash map needs to be able to get a value in just a couple of CPU cycles when dealing with a u64 or a 5-byte string for a key. That case is far more common.

@alkis
Contributor

alkis commented Oct 16, 2019

> Cyan4973/xxHash#155
> Meow seems to be the fastest non-cryptographic hash algorithm in existence, at least when hardware SIMD is available.

> I haven't compared it in a benchmark with aHash, but looking at the code, it's very clearly designed for throughput, not latency. It might even beat aHash when dealing with 1 MB of data (though I would not be too confident), but that's rarely the case with a hash map in memory. A hash map needs to be able to get a value in just a couple of CPU cycles when dealing with a u64 or a 5-byte string for a key. That case is far more common.

Are you talking about meow or XXH3? If the latter, I find this surprising, because the blog post I linked above specifically discusses latency for small inputs as a design goal.

@tkaitchuck
Contributor

> Are you talking about meow or XXH3?

No, I wasn't. I see that XXH3 now adds a fast path for short inputs, and it looks eerily familiar. It uses many of the same tricks I do in aHash. It says a lot about the constraints of the problem that independent solutions can end up so similar.
