Systematic fuzz testing for parquet predicate pushdown #12115

@alamb

Is your feature request related to a problem or challenge?

We have several forms of predicate pushdown in DataFusion's Parquet reader. The code path taken depends on the exact data layout and the predicates defined.

@itsjunetime is working on #4028 to improve performance by being more clever about some of these predicates.

The code path currently taken depends on:

  1. Row group size
  2. Sort order of the data within the file
  3. File repartitioning size (how many partitions are read)
  4. Number of row groups
  5. Data page size
  6. Whether predicate pushdown is enabled
  7. Whether predicate reordering is enabled

Describe the solution you'd like

I would like additional correctness test coverage for reading Parquet files with the various forms of pushdown enabled, since each combination of layout and settings can exercise a different code path.

Describe alternatives you've considered

I would like to have a test that

  1. Creates multiple parquet files with different orderings, row group distributions, etc.
  2. Runs the same query against each file (every file holds the same input data)
  3. Compares the results of the different runs and ensures they are the same
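To make the three steps above concrete, here is a minimal sketch of the differential-testing idea using only the Rust standard library. It is a toy model and not DataFusion's reader: `ToyFile`, `write_toy_file`, `scan`, and `check_all_layouts` are hypothetical names, the "file" is an in-memory list of row groups, and "pushdown" is modeled as skipping a row group whose maximum value cannot satisfy `value > threshold`. A real test would presumably write actual parquet files and run the query through DataFusion.

```rust
/// A toy "file": values split into row groups. This stands in for a parquet
/// file and is an assumption made for illustration only.
struct ToyFile {
    row_groups: Vec<Vec<i64>>,
}

/// Step 1: lay the same input out with a given row group size and ordering.
fn write_toy_file(values: &[i64], row_group_size: usize, sorted: bool) -> ToyFile {
    let mut data = values.to_vec();
    if sorted {
        data.sort();
    }
    // `chunks` panics on a size of zero, so clamp to at least one row.
    let row_groups = data
        .chunks(row_group_size.max(1))
        .map(|chunk| chunk.to_vec())
        .collect();
    ToyFile { row_groups }
}

/// Step 2: run the "query" `value > threshold`. With `pushdown` set, a row
/// group is skipped when its maximum shows that no row can match.
fn scan(file: &ToyFile, threshold: i64, pushdown: bool) -> Vec<i64> {
    let mut out = Vec::new();
    for row_group in &file.row_groups {
        if pushdown {
            let max = row_group.iter().copied().max();
            if max.map_or(true, |m| m <= threshold) {
                continue; // pruned by the max statistic
            }
        }
        out.extend(row_group.iter().copied().filter(|v| *v > threshold));
    }
    // Sort so that the comparison ignores row order.
    out.sort();
    out
}

/// Step 3: compare every layout / setting combination against a baseline
/// (a single row group, unsorted, no pushdown).
fn check_all_layouts(values: &[i64], threshold: i64) -> Result<(), String> {
    let baseline = scan(&write_toy_file(values, values.len(), false), threshold, false);
    for row_group_size in [1usize, 2, 3, 7] {
        for sorted in [false, true] {
            for pushdown in [false, true] {
                let file = write_toy_file(values, row_group_size, sorted);
                if scan(&file, threshold, pushdown) != baseline {
                    return Err(format!(
                        "mismatch: row_group_size={} sorted={} pushdown={}",
                        row_group_size, sorted, pushdown
                    ));
                }
            }
        }
    }
    Ok(())
}
```

Returning the failing combination in the error message is a deliberate choice: when a fuzz run fails, the layout and settings that triggered the mismatch are what is needed to reproduce it.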

Parameters to check

  1. Row group size
  2. Sort order
  3. Number of row groups
  4. Data page size
  5. Use predicate pushdown
  6. Use predicate reordering
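One way to cover the parameters listed above is to enumerate their cartesian product and run the test once per combination. The sketch below assumes a hypothetical `FuzzConfig` struct and `all_configs` helper; the field names are illustrative and are not DataFusion or parquet configuration options, and the candidate values are supplied by the caller.

```rust
/// One combination of the parameters to check. Hypothetical type; the field
/// names are illustrative, not real DataFusion or parquet option names.
#[derive(Debug, Clone, PartialEq)]
struct FuzzConfig {
    row_group_size: usize,
    sorted: bool,
    num_row_groups: usize,
    data_page_size: usize,
    pushdown: bool,
    reorder: bool,
}

/// Build every combination of the candidate values. The three boolean
/// parameters always take both values.
fn all_configs(
    rg_sizes: &[usize],
    rg_counts: &[usize],
    page_sizes: &[usize],
) -> Vec<FuzzConfig> {
    let mut out = Vec::new();
    for &row_group_size in rg_sizes {
        for &num_row_groups in rg_counts {
            for &data_page_size in page_sizes {
                for sorted in [false, true] {
                    for pushdown in [false, true] {
                        for reorder in [false, true] {
                            out.push(FuzzConfig {
                                row_group_size,
                                sorted,
                                num_row_groups,
                                data_page_size,
                                pushdown,
                                reorder,
                            });
                        }
                    }
                }
            }
        }
    }
    out
}
```

The number of combinations grows multiplicatively (the sizes of the three slices times eight for the booleans), so a fuzz test might sample randomly from this list rather than run all of it.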

Additional context

No response
