Block structured Bloom filter #690

dcoutts · 2025-04-22T00:57:55Z

Long patch series to:

- refactor the bloomfilter sub-lib and its uses
- introduce a new block-structured implementation
- switch to the new one

The result is better performance.
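For readers unfamiliar with the idea: a block-structured Bloom filter confines all the bits for one key to a single small block, typically the size of a cache line, so an insert or lookup touches one cache line instead of several. The following is a minimal sketch of that idea only, not the implementation in this PR. The 512-bit block size, the `Integer` used as a bit array, the way the hash is split, and every name (`Blocked`, `positions`, `insertHash`, `memberHash`) are assumptions made for illustration.

```haskell
import Data.Bits (setBit, shiftR, testBit, (.&.), (.|.))
import Data.Word (Word64)

-- | Bits per block. 512 bits = one 64-byte cache line (an assumption of this sketch).
blockBits :: Int
blockBits = 512

-- | A toy blocked filter. The bit array is modelled as an Integer for simplicity.
data Blocked = Blocked
  { numBlocks :: Int      -- ^ number of blocks, must be > 0
  , numHashes :: Int      -- ^ bits set per key, must be > 0
  , bitArray  :: Integer  -- ^ all blocks laid out back to back
  }

emptyBlocked :: Int -> Int -> Blocked
emptyBlocked nb k = Blocked nb k 0

-- | Bit positions probed for a 64-bit key hash: the high bits pick one block,
-- then double hashing on the low bits picks k offsets inside that block.
positions :: Blocked -> Word64 -> [Int]
positions f h =
  [ block * blockBits
      + fromIntegral ((h1 + fromIntegral i * h2) `mod` fromIntegral blockBits)
  | i <- [0 .. numHashes f - 1]
  ]
  where
    block = fromIntegral ((h `shiftR` 32) `mod` fromIntegral (numBlocks f)) :: Int
    h1    = h .&. 0xFFFF
    -- Odd step, so the k offsets are distinct whenever k <= blockBits.
    h2    = ((h `shiftR` 16) .&. 0xFFFF) .|. 1

-- | Set every probed bit for the given key hash.
insertHash :: Word64 -> Blocked -> Blocked
insertHash h f = f { bitArray = foldl setBit (bitArray f) (positions f h) }

-- | True if every probed bit is set (may be a false positive).
memberHash :: Word64 -> Blocked -> Bool
memberHash h f = all (testBit (bitArray f)) (positions f h)
```

The price of the locality is a somewhat higher false positive rate than a classic filter of the same size, because keys are not spread perfectly evenly across blocks.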

jorisdral · 2025-04-24T09:16:01Z

A test failure on 1ff45af:

cabal run bloomfilter-tests -- -p prop_calc_size_fpr_bits --quickcheck-replay="(SMGen 18024522305972736904 8705882024453529731,95)"
Data.BloomFilter
  Classic
    calculations
      prop_calc_size_fpr_bits: FAIL
        *** Failed! Falsified (after 1 test and 11 shrinks):
        BitsPerEntry 2.336869198112799
        NumEntries 1000
        0.3306894755756382 /= 0.3307128865771576 and not within (abs) tolerance of 1.0e-6
        Use --quickcheck-replay="(SMGen 18024522305972736904 8705882024453529731,95)" to reproduce.

1 out of 1 tests failed (0.00s)

One solution would be to increase the tolerance.
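For context, the second value in the failure looks consistent with the textbook approximation for a classic Bloom filter, FPR ≈ (1 − e^(−k/c))^k, with c = 2.3369 bits per entry and k = 2 hash functions, which evaluates to about 0.3307. The sketch below reproduces that arithmetic. It is an assumption about where the number comes from, not the library's code; `classicFpr` and `hashesFor` are hypothetical names.

```haskell
-- | Textbook false positive rate approximation for a classic Bloom filter
-- with c bits per entry and k hash functions.
classicFpr :: Double -> Int -> Double
classicFpr c k = (1 - exp (negate (fromIntegral k) / c)) ^ k

-- | Number of hash functions: the integer nearest to c * ln 2, at least 1.
-- (Assumed rounding rule; the library may round differently.)
hashesFor :: Double -> Int
hashesFor c = max 1 (round (c * log 2))

-- | The counterexample from the failing test run.
counterexampleFpr :: Double
counterexampleFpr = classicFpr c (hashesFor c)
  where
    c = 2.336869198112799
```

At so few bits per entry, k is rounded from about 1.62 up to 2, so a calculation that goes from bits to FPR and back through a continuous approximation can plausibly drift by a few parts in 10^5.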

jorisdral

I haven't finished my review yet, but I'm posting these PR comments here so that I don't lose them between now and when I continue the review. The comments are not polished; they're mostly draft notes, so no need to look at or resolve them, @dcoutts. I'll curate the comments as part of my next leg of reviewing.

EDIT: I've deleted most comments in favour of re-adding them in the next review

bloomfilter/src/Data/BloomFilter.hs

bloomfilter/src/Data/BloomFilter/Classic/Calc.hs

jorisdral

Looking very good! I have a bunch of comments / requests for clarification, but nothing that should hold back the PR, I think.

There are also a number of comments that I left on only one of the bloom filter types, but some of them probably also apply to the other type of bloom filter.

test/Test/Database/LSMTree/Internal/BloomFilter.hs

test/Test/Database/LSMTree/Internal/RunBloomFilterAlloc.hs

bloomfilter/src/Data/BloomFilter/Classic/Mutable.hs

bloomfilter/src/Data/BloomFilter/Blocked/Internal.hs

bloomfilter/src/Data/BloomFilter/Classic/Internal.hs

bloomfilter/tests/bloomfilter-tests.hs

test/Test/Database/LSMTree/Internal/RunBloomFilterAlloc.hs

bloomfilter/src/Data/BloomFilter/Blocked/Calc.hs

dcoutts · 2025-04-28T09:07:50Z

> 0.3306894755756382 /= 0.3307128865771576 and not within (abs) tolerance of 1.0e-6
>
> One solution would be to increase the tolerance.

It'd need 3e-5 here for this one. The calculations get particularly approximate around 2 bits or less, so another approach would be to adjust the tolerance so it's greater at the low end only.
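The two values in the failure differ by about 2.34e-5, which is why 3e-5 would be enough here. A tolerance that is looser only at the low end could be sketched as below. The threshold of 3 bits per entry and the names `fprTolerance` and `approxEq` are assumptions for illustration, not what the test suite uses.

```haskell
-- | Absolute tolerance as a function of bits per entry.
-- The cut-off (3 bits) and both tolerance values are assumptions of this sketch.
fprTolerance :: Double -> Double
fprTolerance bitsPerEntry
  | bitsPerEntry <= 3 = 3.0e-5  -- calculations are coarser with very few bits
  | otherwise         = 1.0e-6  -- the tolerance currently used by the test

-- | Approximate equality within an absolute tolerance.
approxEq :: Double -> Double -> Double -> Bool
approxEq tol x y = abs (x - y) <= tol
```

With this, `approxEq (fprTolerance 2.336869198112799) 0.3306894755756382 0.3307128865771576` holds, while the same comparison with a fixed tolerance of 1.0e-6 does not.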

dcoutts requested review from jorisdral, mheinzel, recursion-ninja and wenkokke as code owners April 22, 2025 00:57

dcoutts force-pushed the dcoutts/bloomfilter-blocked branch 2 times, most recently from c59568d to 23971a2, April 22, 2025 16:48

jorisdral force-pushed the dcoutts/bloomfilter-blocked branch 2 times, most recently from e232688 to 1ff45af, April 24, 2025 08:57

jorisdral reviewed Apr 24, 2025

jorisdral mentioned this pull request Apr 25, 2025

Compute requested FPR using equations instead of an algorithm #644

Closed

jorisdral reviewed Apr 28, 2025

dcoutts force-pushed the dcoutts/bloomfilter-blocked branch 4 times, most recently from 9714096 to 9ba3bee, May 2, 2025 14:51

dcoutts and others added 12 commits May 9, 2025 14:25

bloomfilter: Add a simple construction benchmark

800bc1f

For testing the bloomfilter lib in isolation, rather than the use in the lsm-tree lib.

bloomfilter: removes Hashes, specialise to CheapHashes scheme

e5f7690

We don't need multiple schemes.

bloomfilter: use ByteArray type from primitive package

321016b

rather than Data.Array.Byte. It seems to be compatible with a wider range of ghc & lib versions this way.

bloomfilter: combine a couple modules into one

91cd410

merge Data.BloomFilter.Mutable.Internal into Data.BloomFilter.Mutable

bloomfilter: Remove pointless exported functions

ef7370b

and export a helpful construction function.

bloomfilter: misc minor cleanups of the tests

054c36c

bloomfilter: change the example spell program into an executable

b822af0

The spell example was a test suite but does not really test anything. It really is an example, not a test. Remove the Words example since it was being used as a benchmark, but we now have better benchmarks.

bloomfilter: Add new size calculation code

de48f42

Calculate the optimal number of bits and hashes directly rather than via an optimisation algorithm.
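To illustrate what "directly" can mean here: for a classic Bloom filter the textbook closed forms are c = −ln(p) / (ln 2)² bits per entry for a target false positive rate p, and k = c · ln 2 hash functions. The sketch below uses only those textbook formulas. The functions added by this commit may differ in naming, rounding and limits; `bitsPerEntryForFpr` and `numHashesForBits` are hypothetical names.

```haskell
-- | Bits per entry needed for a target false positive rate p (0 < p < 1),
-- using the textbook formula c = -ln p / (ln 2)^2.
bitsPerEntryForFpr :: Double -> Double
bitsPerEntryForFpr p = negate (log p) / (log 2 * log 2)

-- | Optimal number of hash functions for c bits per entry: k = c * ln 2,
-- rounded to the nearest integer and kept at least 1.
numHashesForBits :: Double -> Int
numHashesForBits c = max 1 (round (c * log 2))

-- | Total number of bits for n entries at target rate p, rounded up.
totalBits :: Int -> Double -> Int
totalBits n p = ceiling (fromIntegral n * bitsPerEntryForFpr p)
```

For a 1% target rate this gives roughly 9.6 bits per entry and 7 hash functions, with no search or table lookup involved.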

bloomfilter: add tests for new size calculation functions

22e2b10

bloomfilter: change Easy module to use new size calculations

47201b4

Remove the prime-based approach.

bloomfilter: remove primes helper program

b574549

It was used to calculate the table of primes that we no longer use.

bloomfilter: remove old calc functions

bc2f4d1

dcoutts and others added 25 commits May 9, 2025 14:29

Update bloomfilter/src/Data/BloomFilter/Blocked/BitArray.hs

543566a

Co-authored-by: Joris Dral <[email protected]>

Apply suggestions from code review

16da5b2

Co-authored-by: Joris Dral <[email protected]>

Apply suggestions from code review

8dbb1d4

Co-authored-by: Joris Dral <[email protected]>

Apply suggestions from code review

5ed7127

Co-authored-by: Joris Dral <[email protected]>

Apply suggestions from code review

29590dd

Co-authored-by: Joris Dral <[email protected]>

Apply suggestions from code review

cdb947a

Co-authored-by: Joris Dral <[email protected]>

FIXUP: bloomfilter: Add new Data.BloomFilter.Blocked implementation

86ca5a5

use a newtype for NumBlocks

bloomfilter: use INLINE pragma on deserialise functions

0dca2b7

They're the only ones with class constraints and callbacks. Since they're trivial we just use INLINE rather than SPECIALISE.
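As a rough illustration of that choice: a small function that is polymorphic in a monad and takes a callback can be marked `INLINE`, so that GHC unfolds it at each call site where the concrete monad and callback are known, without listing the types in `SPECIALISE` pragmas. `readBytesWith` below is a hypothetical stand-in, not one of the library's deserialise functions.

```haskell
import Data.Word (Word8)

-- | Hypothetical reader: fetches n bytes through a caller-supplied callback.
-- INLINE lets GHC specialise it at each call site, where the monad's
-- dictionary and the callback are both known.
{-# INLINE readBytesWith #-}
readBytesWith :: Monad m => Int -> (Int -> m Word8) -> m [Word8]
readBytesWith n fill = mapM fill [0 .. n - 1]

-- | Example call site: here m is Maybe and the callback is a known lambda.
firstThree :: Maybe [Word8]
firstThree = readBytesWith 3 (\i -> Just (fromIntegral i))
```

`SPECIALISE` would need one pragma per concrete type and does not help with the callback argument, which is why inlining can be the simpler option for trivial wrappers like this.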

FIXUP: bloomfilter: Add new Data.BloomFilter.Blocked implementation

a64afb1

range assertions

FIXUP: bloomfilter: Add new Data.BloomFilter.Blocked implementation

326d307

misc minor review fixes

bloomfilter: minor updates to module docs

9f74483

FIXUP: add comment for test_calculations_classic

3e48d6b

FIXUP: comment for test_calculations proxyBlocked

1a4fb5f

bloomfilter: remove arbitrary bit & FPR limits and test wider range

1ad4c6f

For the classic impl, test up to 75 bits per entry and similar FPRs. There's no artificial limit on bits per entry or FPRs. There's just a limit on the overall filter size.

FIXUP: comment on formatVersion

be88672

FIXUP: typo in deserialise Blocked impl

2bbe995

FIXUP: typo in deserialise Classic impl

1226886

bloomfilter: update docs section on differences vs original package

e0760e5

FIXUP: fromList INLINEABLE

98f5f54

FIXUP: classic overview doc section

21544d9

FIXUP: bloomFilterFromFile args

438e1d0

FIXUP: invariant blocked

3bbf20a

FIXUP: invariant classic

222f852

bloomfilter: add source and derivation for classic calculations

0ab69e3

To make clear where the formulae come from.

Add a TODO for a golden test for BloomFilter formatVersion

1071dd1

To avoid accidental breakage. To ensure we bump the formatVersion if we change the format.

dcoutts force-pushed the dcoutts/bloomfilter-blocked branch from 9ba3bee to 1071dd1, May 9, 2025 13:31

dcoutts added 4 commits May 9, 2025 14:41

FIXUP: better comment on how we calculated the regression params

13fe847

Document in the API the bits/FPR trade-off

fe4238d

And give some sensible guidance.

FIXUP: use counterexample in roundtrip_prop

3c1ba7e

FIXUP: use splitGen, split is deprecated

d99ebc3
