Avoid evaluating filters when they can be discarded purely from statistics #15425

Open
@adriangb

Is your feature request related to a problem or challenge?

Currently, stats-based filter pruning (at both the row group and page level) has one of two outcomes per container:

  1. This container cannot possibly match the filter (discard it).
  2. This container may match the filter, but which rows to include or exclude needs to be confirmed by evaluating each row of the data.

There is a significant optimization available here: if we know from statistics alone that every row in the container matches the filter, we don't need to evaluate the filter on that container at all.

Consider a column called name with values ["Adrian", "Adrian", "Adrian"]. The min/max stats are "Adrian"/"Adrian". A query with the filter name = "Adrian" should never need to read the column to know that all rows match the filter.
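To make the equality case concrete, here is a minimal sketch of a three-way decision made from statistics alone. PruneOutcome, StrStats and prune_eq are hypothetical names for illustration, not existing DataFusion types. The sketch also assumes the min/max values are exact (not truncated) and that a null count is available, since a null row would not satisfy the filter even when min and max both equal the literal.

```rust
/// Hypothetical three-way pruning outcome (not an existing DataFusion type).
#[derive(Debug, PartialEq)]
enum PruneOutcome {
    /// No row in the container can match: discard the container.
    NoneMatch,
    /// Every row in the container matches: the filter need not be evaluated.
    AllMatch,
    /// Statistics are inconclusive: evaluate the filter row by row.
    Unknown,
}

/// Assumed shape of the statistics for one container (row group or page).
/// Assumes min/max are exact, i.e. not truncated by the writer.
struct StrStats<'a> {
    min: &'a str,
    max: &'a str,
    null_count: u64,
}

/// Decide the outcome of `column = literal` using only the statistics.
fn prune_eq(stats: &StrStats, literal: &str) -> PruneOutcome {
    if literal < stats.min || literal > stats.max {
        // The literal lies outside [min, max], so no row can equal it.
        PruneOutcome::NoneMatch
    } else if stats.min == literal && stats.max == literal && stats.null_count == 0 {
        // Every non-null value equals the literal and there are no nulls.
        PruneOutcome::AllMatch
    } else {
        PruneOutcome::Unknown
    }
}
```

With min = max = "Adrian" and no nulls, prune_eq reports AllMatch for the literal "Adrian", so the column would not need to be read to apply the filter.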

Another relevant case is a ts column with values ["2025-01-01T00:00:00Z", ..., "2025-01-01T00:01:32Z"]. The values need not be sorted or ordered, but let's say that the min/max stats are "2025-01-01T00:00:00Z"/"2025-01-01T00:01:32Z". For a filter ts > '2024-12-31T00:00:00Z' there should be no need to evaluate the filter on every row: we know just from stats that every row matches.
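The same reasoning for a greater-than filter could be sketched as follows. Stats, PruneOutcome and prune_gt are again hypothetical names, and the sketch assumes exact min/max values plus a known null count. Timestamps are compared here as strings, which only matches chronological order because the example values share one fixed-width UTC format; a real implementation would compare typed timestamp values.

```rust
/// Hypothetical three-way pruning outcome (not an existing DataFusion type).
#[derive(Debug, PartialEq)]
enum PruneOutcome {
    NoneMatch,
    AllMatch,
    Unknown,
}

/// Assumed per-container statistics, generic over any ordered value type.
struct Stats<T> {
    min: T,
    max: T,
    null_count: u64,
}

/// Decide the outcome of `column > literal` using only the statistics.
fn prune_gt<T: PartialOrd>(stats: &Stats<T>, literal: &T) -> PruneOutcome {
    if stats.max <= *literal {
        // Even the largest value is not greater than the literal.
        PruneOutcome::NoneMatch
    } else if stats.min > *literal && stats.null_count == 0 {
        // Even the smallest value is greater, and no null row can fail the filter.
        PruneOutcome::AllMatch
    } else {
        PruneOutcome::Unknown
    }
}
```

For the ts example, min = "2025-01-01T00:00:00Z" is already greater than '2024-12-31T00:00:00Z', so the outcome is AllMatch without looking at any row, and the values inside the container do not need to be sorted for this to hold.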

We could incorporate this change, but it would require some refactoring of https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/pruning.rs and its consumers.
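On the consumer side, the refactoring might amount to mapping a three-way result onto a per-container read plan instead of a keep/discard decision. The sketch below is only an illustration of that idea: PruneOutcome, ReadPlan, plan_container and plan_all are hypothetical names and do not reflect the current pruning.rs API.

```rust
/// Hypothetical three-way pruning outcome (not an existing DataFusion type).
#[derive(Debug, PartialEq, Clone, Copy)]
enum PruneOutcome {
    NoneMatch,
    AllMatch,
    Unknown,
}

/// Hypothetical action a consumer (e.g. a file reader) takes for one container.
#[derive(Debug, PartialEq)]
enum ReadPlan {
    /// Do not read the container at all.
    Skip,
    /// Read the projected columns, but skip filter evaluation entirely.
    ReadWithoutFilter,
    /// Read the container and evaluate the filter on every row.
    ReadAndFilter,
}

/// Map a single pruning outcome to the action for that container.
fn plan_container(outcome: PruneOutcome) -> ReadPlan {
    match outcome {
        PruneOutcome::NoneMatch => ReadPlan::Skip,
        PruneOutcome::AllMatch => ReadPlan::ReadWithoutFilter,
        PruneOutcome::Unknown => ReadPlan::ReadAndFilter,
    }
}

/// Plan every container (row group or page) of a file, in order.
fn plan_all(outcomes: &[PruneOutcome]) -> Vec<ReadPlan> {
    outcomes.iter().map(|o| plan_container(*o)).collect()
}
```

Only the ReadWithoutFilter branch is new compared with the two outcomes described above; the other two branches correspond to the existing discard and evaluate-per-row behavior.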

Labels: enhancement (New feature or request)
