Description
Is your feature request related to a problem or challenge?
Currently stats filter pruning (both at the row group and page level) has one of two outcomes per container:
- This container cannot possibly match the filter (discard it).
- This container may match the filter, but which rows to include or exclude needs to be confirmed by evaluating each row of the data.
There is a big optimization opportunity here: if we know that every row in the container matches the filter, we don't need to evaluate the filter at all.
Consider a column `name` with values `["Adrian", "Adrian", "Adrian"]`. The min/max stats are `"Adrian"`/`"Adrian"`. A query with the filter `name = "Adrian"` should not need to ever read the column to know that all rows match the filter.
Another relevant case is a `ts` column with values `["2025-01-01T00:00:00Z", ..., "2025-01-01T00:01:32Z"]`. The values need not be sorted or ordered, but let's say that the min/max stats are `"2025-01-01T00:00:00Z"`/`"2025-01-01T00:01:32Z"`. For a filter `ts > '2024-12-31T00:00:00Z'` there should be no need to evaluate the filter on every row: we know just from stats that every row matches.
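To make the idea concrete, here is a minimal sketch of what a three-way pruning result could look like for the two examples above. The names `PruneOutcome`, `classify_eq`, and `classify_gt` are hypothetical and are not part of the existing pruning API. The sketch also assumes the min/max statistics are exact (not truncated) and that the container has no nulls.

```rust
/// Hypothetical three-way result of checking a filter against the
/// min/max statistics of one container (row group or page).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PruneOutcome {
    /// No row can match: discard the container.
    NoneMatch,
    /// Every row matches: keep the container and skip the filter.
    AllMatch,
    /// Some rows may match: the filter must be evaluated per row.
    Unknown,
}

/// Classify `col = literal`.
/// Assumes exact min/max statistics and no nulls in the container.
fn classify_eq<T: Ord>(min: &T, max: &T, literal: &T) -> PruneOutcome {
    if literal < min || literal > max {
        // The literal lies outside the value range.
        PruneOutcome::NoneMatch
    } else if min == literal && max == literal {
        // Every value in the container equals the literal.
        PruneOutcome::AllMatch
    } else {
        PruneOutcome::Unknown
    }
}

/// Classify `col > literal` under the same assumptions.
fn classify_gt<T: Ord>(min: &T, max: &T, literal: &T) -> PruneOutcome {
    if max <= literal {
        // Even the largest value is not greater than the literal.
        PruneOutcome::NoneMatch
    } else if min > literal {
        // Even the smallest value is greater than the literal.
        PruneOutcome::AllMatch
    } else {
        PruneOutcome::Unknown
    }
}
```

With these helpers, `classify_eq(&"Adrian", &"Adrian", &"Adrian")` and `classify_gt(&"2025-01-01T00:00:00Z", &"2025-01-01T00:01:32Z", &"2024-12-31T00:00:00Z")` both yield `PruneOutcome::AllMatch`. The timestamps are compared as plain strings here only to keep the sketch self-contained.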
We could incorporate this change, but it would require some refactoring of https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/pruning.rs and consumers.