Block structured Bloom filter #690

Open: dcoutts wants to merge 77 commits into main from dcoutts/bloomfilter-blocked

dcoutts
Collaborator

@dcoutts dcoutts commented Apr 22, 2025

Long patch series to:

  • refactor the bloomfilter sub-lib and its uses
  • introduce a new block-structured implementation
  • switch to the new one

The result is better performance.

@dcoutts dcoutts force-pushed the dcoutts/bloomfilter-blocked branch 2 times, most recently from c59568d to 23971a2 Compare April 22, 2025 16:48
@jorisdral jorisdral force-pushed the dcoutts/bloomfilter-blocked branch 2 times, most recently from e232688 to 1ff45af Compare April 24, 2025 08:57
@jorisdral
Collaborator

A test failure on 1ff45af:

cabal run bloomfilter-tests -- -p prop_calc_size_fpr_bits --quickcheck-replay="(SMGen 18024522305972736904 8705882024453529731,95)"
Data.BloomFilter
  Classic
    calculations
      prop_calc_size_fpr_bits: FAIL
        *** Failed! Falsified (after 1 test and 11 shrinks):
        BitsPerEntry 2.336869198112799
        NumEntries 1000
        0.3306894755756382 /= 0.3307128865771576 and not within (abs) tolerance of 1.0e-6
        Use --quickcheck-replay="(SMGen 18024522305972736904 8705882024453529731,95)" to reproduce.

1 out of 1 tests failed (0.00s)

One solution would be to increase the tolerance.

Collaborator

@jorisdral jorisdral left a comment

I haven't finished my review yet, but I'm posting these PR comments here so that I don't lose them between now and when I continue the review. The comments are not polished either; they're mostly draft notes, so there's no need to look at or resolve them, @dcoutts. I'll curate the comments as part of my next leg of reviewing.

EDIT: I've deleted most comments in favour of re-adding them in the next review.

Collaborator

@jorisdral jorisdral left a comment

Looking very good! I have a bunch of comments / requests for clarification, but nothing that should hold back the PR, I think.

There are also a number of comments that I left on only one of the Bloom filter types, but some of them probably apply to the other type of Bloom filter as well.

@dcoutts
Collaborator Author

dcoutts commented Apr 28, 2025

        0.3306894755756382 /= 0.3307128865771576 and not within (abs) tolerance of 1.0e-6

One solution would be to increase the tolerance.

It'd need 3e-5 here for this one. The calculations get particularly approximate around 2 bits or less, so another approach would be to adjust the tolerance so it's greater at the low end only.
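
A minimal sketch of the second approach, for illustration only. The names `toleranceFor` and `withinTolerance` and the 3-bit cut-off are assumptions and do not come from the PR; only the two tolerance values (1e-6 and 3e-5) are taken from this thread.

```haskell
-- Hypothetical helpers, not taken from the PR.

-- | Absolute tolerance for comparing FPRs, as a function of bits per entry.
-- The 3-bit cut-off is an assumed threshold for "the low end".
toleranceFor :: Double -> Double
toleranceFor bitsPerEntry
  | bitsPerEntry < 3 = 3.0e-5   -- looser: the calculations are more approximate here
  | otherwise        = 1.0e-6   -- the tolerance reported in the failing test

-- | Absolute-tolerance comparison of two values.
withinTolerance :: Double -> Double -> Double -> Bool
withinTolerance tol x y = abs (x - y) <= tol
```

With these definitions, the two values from the failing test (0.3306894755756382 and 0.3307128865771576, at about 2.34 bits per entry) compare as equal, while filters with more bits per entry keep the stricter 1e-6 check.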

@dcoutts dcoutts force-pushed the dcoutts/bloomfilter-blocked branch 4 times, most recently from 9714096 to 9ba3bee Compare May 2, 2025 14:51
dcoutts and others added 12 commits May 9, 2025 14:25
For testing the bloomfilter lib in isolation, rather than the use in the
lsm-tree lib.
rather than Data.Array.Byte.

It seems to be compatible with a wider range of ghc & lib versions this
way.
merge Data.BloomFilter.Mutable.Internal into Data.BloomFilter.Mutable
and export a helpful construction function.
The spell example was a test suite but does not really test anything.
It really is an example, not a test.

Remove the Words example since it was being used as a benchmark, but we
now have better benchmarks.
Calculate the optimal number of bits and hashes directly rather than via
an optimisation algorithm.
It was used to calculate the table of primes that we no longer use.
dcoutts and others added 25 commits May 9, 2025 14:29
They're the only ones with class constraints and callbacks. Since
they're trivial we just use inline rather than SPECIALISE.
For the classic impl, test up to 75 bits per entry and similar FPRs.

There's no artificial limit on bits per entry or FPRs. There's just a
limit on the overall filter size.
To make clear where the formulae come from.
To avoid accidental breakage. To ensure we bump the formatVersion if we
change the format.
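
One of the commit notes above mentions calculating the optimal number of bits and hashes directly. As a rough illustration, the standard textbook formulae for a classic Bloom filter can be written as closed-form functions. This is a sketch under that assumption: the function names are hypothetical, and the formulae actually used in the PR (particularly for the block-structured filter) may differ.

```haskell
-- Textbook formulae for a classic Bloom filter; a sketch, not the PR's code.
-- All names here are hypothetical.

-- | Number of hash functions that minimises the FPR for c bits per entry:
-- k = c * ln 2, rounded to an integer and never less than 1.
numHashes :: Double -> Int
numHashes c = max 1 (round (c * log 2))

-- | Approximate false positive rate for c bits per entry and k hashes:
-- (1 - e^(-k/c))^k
fprFor :: Double -> Int -> Double
fprFor c k = (1 - exp (negate k' / c)) ** k'
  where
    k' = fromIntegral k

-- | Bits per entry needed for a target FPR p, assuming the optimal k:
-- c = -ln p / (ln 2)^2
bitsPerEntryFor :: Double -> Double
bitsPerEntryFor p = negate (log p) / (log 2 * log 2)
```

For example, `numHashes 10` gives 7, and `fprFor 2.336869198112799 2` comes out at roughly 0.3307, in line with the values reported in the failing test earlier in the thread.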
@dcoutts dcoutts force-pushed the dcoutts/bloomfilter-blocked branch from 9ba3bee to 1071dd1 Compare May 9, 2025 13:31