@mr-brobot mr-brobot commented Oct 4, 2025

Which issue does this PR close?

Rationale for this change

Parquet types are a subset of Arrow types, so the Arrow writer must coerce Arrow values to Parquet types when writing. In some cases, this changes the physical representation of the value. Because the bloom filter is built from the coerced values, passing Arrow data directly to Sbbf::check will produce false negatives in those cases. Correctness is only guaranteed when checking with the coerced Parquet value.
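To make the mismatch concrete, here is a minimal std-only sketch (not code from this PR). It assumes the filter hashes a value's little-endian plain-encoded bytes and that an Arrow Int8 column is stored as Parquet INT32; the helper names `arrow_i8_bytes` and `parquet_int32_bytes` are hypothetical.

```rust
/// Bytes of the value as the Arrow type holds it (a single byte for i8).
fn arrow_i8_bytes(v: i8) -> [u8; 1] {
    v.to_le_bytes()
}

/// Bytes after widening to i32, i.e. the assumed Parquet INT32 representation.
fn parquet_int32_bytes(v: i8) -> [u8; 4] {
    i32::from(v).to_le_bytes()
}

fn main() {
    let v: i8 = -3;
    let arrow = arrow_i8_bytes(v);
    let parquet = parquet_int32_bytes(v);
    println!("arrow bytes:   {:02X?}", arrow);
    println!("parquet bytes: {:02X?}", parquet);
    // Different byte strings hash differently, so a filter built from the
    // widened bytes will generally not match a probe made with the i8 bytes.
    assert_ne!(&arrow[..], &parquet[..]);
}
```

Probing with the widened value instead of the original i8 is the kind of coercion the description refers to.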

This issue affects some integer and decimal types. It can also affect Date64.
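For decimals, the following sketch shows how the physical type could depend on precision. It assumes the thresholds from the Parquet decimal specification (precision up to 9 fits INT32, up to 18 fits INT64) and that larger precisions fall back to a fixed-length byte array; whether the Arrow writer picks exactly these tiers is an assumption here, and `PhysicalType`, `decimal128_physical_type`, and `coerce_to_int32_bytes` are hypothetical names.

```rust
/// Hypothetical stand-in for the Parquet physical types relevant to decimals.
#[derive(Debug, PartialEq)]
enum PhysicalType {
    Int32,
    Int64,
    FixedLenByteArray,
}

/// Assumed mapping from Decimal128 precision to Parquet physical type.
fn decimal128_physical_type(precision: u8) -> PhysicalType {
    match precision {
        0..=9 => PhysicalType::Int32,
        10..=18 => PhysicalType::Int64,
        _ => PhysicalType::FixedLenByteArray,
    }
}

/// Narrow an unscaled decimal value to INT32 little-endian bytes, if it fits.
fn coerce_to_int32_bytes(unscaled: i128) -> Option<[u8; 4]> {
    i32::try_from(unscaled).ok().map(|v| v.to_le_bytes())
}

fn main() {
    // 123.45 stored as Decimal128(5, 2) has the unscaled value 12345.
    let unscaled: i128 = 12_345;
    println!("{:?}", decimal128_physical_type(5));
    println!("i128 bytes:  {:02X?}", unscaled.to_le_bytes());
    println!("int32 bytes: {:02X?}", coerce_to_int32_bytes(unscaled));
}
```

The 16-byte i128 form and the 4-byte INT32 form are different byte strings, which is why a probe has to use the narrowed value when the column was written as INT32.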

What changes are included in this PR?

Introduces ArrowSbbf as an Arrow-aware interface to the Parquet Sbbf. This coerces incoming data if necessary and calls Sbbf::check.

Currently, Date64 types can be written as either INT32 (days since epoch) or INT64 (milliseconds since epoch), depending on Arrow writer properties (coerce_types). Instead of requiring additional information to handle this special (non-default) case, this implementation instructs users to coerce Date64 to Date32 if the Parquet column type is INT32. I'm open to feedback on this decision.
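As an illustration of the suggested workaround, a Date64 value (milliseconds since epoch) can be reduced to a Date32 value (days since epoch) before probing a filter on an INT32 column. This is a sketch, not the PR's code: `date64_to_date32` is a hypothetical helper, and the use of plain integer division (including its behavior for pre-epoch values) is an assumption rather than a statement about what the Arrow cast does.

```rust
/// Milliseconds in one day.
const MS_PER_DAY: i64 = 86_400_000;

/// Convert milliseconds since epoch to whole days since epoch.
/// Assumption: plain integer division, with the result fitting in i32.
fn date64_to_date32(ms_since_epoch: i64) -> i32 {
    (ms_since_epoch / MS_PER_DAY) as i32
}

fn main() {
    let days: i64 = 20_365;
    let midnight_ms = days * MS_PER_DAY;
    let later_same_day_ms = midnight_ms + 3_600_000; // one hour past midnight

    // Both timestamps fall on the same day, so both map to the same Date32.
    println!("{}", date64_to_date32(midnight_ms));
    println!("{}", date64_to_date32(later_same_day_ms));
}
```

The resulting i32 is the value that would be checked against a filter built from an INT32 date column.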

Are these changes tested?

There are tests for integer, float, decimal, and date types. They are not exhaustive, but they cover all cases where coercion is necessary.

Are there any user-facing changes?

There is a new ArrowSbbf struct that most Arrow users should prefer over using Sbbf directly. Also, the Sized bound on Sbbf::check was relaxed so that it accepts slices, which makes it consistent with Sbbf::insert.
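The effect of relaxing a Sized bound can be sketched with a toy filter. Everything below is hypothetical and only mirrors the shape of the change: `ToyFilter` is backed by a `HashSet` rather than a real bloom filter, and the local `AsBytes` trait is a stand-in defined for this example.

```rust
use std::collections::HashSet;

/// Local stand-in trait: anything that can expose its bytes.
trait AsBytes {
    fn as_bytes(&self) -> &[u8];
}

impl AsBytes for [u8] {
    fn as_bytes(&self) -> &[u8] {
        self
    }
}

impl AsBytes for str {
    fn as_bytes(&self) -> &[u8] {
        str::as_bytes(self)
    }
}

/// Toy exact-membership "filter" used only to show the generic bounds.
#[derive(Default)]
struct ToyFilter {
    seen: HashSet<Vec<u8>>,
}

impl ToyFilter {
    /// `?Sized` lets T be an unsized type such as `[u8]` or `str`.
    fn insert<T: AsBytes + ?Sized>(&mut self, value: &T) {
        self.seen.insert(value.as_bytes().to_vec());
    }

    /// Without `?Sized` here, `check(&bytes[..])` would not compile.
    fn check<T: AsBytes + ?Sized>(&self, value: &T) -> bool {
        self.seen.contains(value.as_bytes())
    }
}

fn main() {
    let mut filter = ToyFilter::default();
    filter.insert("parquet");
    let probe: &[u8] = b"parquet";
    println!("{}", filter.check(probe)); // slice probe, T = [u8]
    println!("{}", filter.check("arrow")); // str probe, T = str
}
```

With the bound relaxed on both methods, a value inserted as a slice can also be probed as a slice, without first copying it into an owned, sized type.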

@github-actions github-actions bot added the parquet Changes to the parquet crate label Oct 4, 2025
@mr-brobot mr-brobot Oct 4, 2025
| Benchmark | Sbbf | ArrowSbbf | Delta |
| --- | --- | --- | --- |
| i8 | 1.51 ns | 7.38 ns | +5.87 ns |
| i32 | 3.86 ns | 7.15 ns | +3.29 ns |
| Decimal128(5,2) | 1.73 ns | 7.69 ns | +5.96 ns |
| Decimal128(15,2) | 1.73 ns | 8.20 ns | +6.48 ns |
| Decimal128(30,2) | 1.73 ns | 5.85 ns | +4.12 ns |

Development

Successfully merging this pull request may close these issues.

Bloom filters for i8 and i16 always return false negatives