Open
Labels
enhancement (New feature or request)
Description
Is your feature request related to a problem or challenge?
We have several forms of predicate pushdown in DataFusion's Parquet reader. The code path taken depends on the exact data layout and the predicates defined.
@itsjunetime is working on #4028 to improve performance by being more clever about some of these predicates.
The code path currently taken depends on:
- Row group size
- Sort order of the data within the file
- File repartitioning size (how many partitions are read)
- Number of row groups
- Datapage size
- Whether predicate pushdown is enabled
- Whether predicate reordering is enabled
Describe the solution you'd like
I would like additional correctness test coverage for reading Parquet files with the various forms of pushdown enabled, since each combination of layout and settings can exercise a different code path.
Describe alternatives you've considered
I would like to have a test that
- Creates multiple Parquet files with different sort orders, row group distributions, etc.
- Runs the same query over the same logical data for each file and each combination of settings
- Compares the results of the different runs and ensures they are the same
Parameters to check
- Row group size
- Sort order
- Number of row groups
- Datapage size
- Whether predicate pushdown is enabled
- Whether predicate reordering is enabled
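To make the shape of such a test concrete, here is a minimal sketch of the parameter matrix and the comparison step, using only the Rust standard library. `Scenario`, `all_scenarios`, and `find_mismatches` are hypothetical names, not existing DataFusion APIs, and the `run_query` closure stands in for "write a Parquet file with this layout, run the query with these settings, and return the rows as strings". The number of row groups is not a separate field here because it follows from the row count and the row group size.

```rust
/// One combination of the parameters listed above.
/// Hypothetical type for illustration; not part of DataFusion.
#[derive(Debug, Clone, PartialEq)]
struct Scenario {
    row_group_size: usize,
    data_page_size: usize,
    sorted: bool,
    pushdown: bool,
    reorder: bool,
}

/// Builds the cartesian product of all parameter values under test.
fn all_scenarios(row_group_sizes: &[usize], data_page_sizes: &[usize]) -> Vec<Scenario> {
    let mut out = Vec::new();
    for &row_group_size in row_group_sizes {
        for &data_page_size in data_page_sizes {
            for sorted in [false, true] {
                for pushdown in [false, true] {
                    for reorder in [false, true] {
                        out.push(Scenario {
                            row_group_size,
                            data_page_size,
                            sorted,
                            pushdown,
                            reorder,
                        });
                    }
                }
            }
        }
    }
    out
}

/// Runs `run_query` once per scenario and returns every scenario whose
/// result differs from the first scenario's result (used as the baseline).
/// Rows are sorted before comparing because row order may legitimately
/// differ between file layouts.
fn find_mismatches<F>(scenarios: &[Scenario], mut run_query: F) -> Vec<Scenario>
where
    F: FnMut(&Scenario) -> Vec<String>,
{
    let mut results: Vec<Vec<String>> = Vec::new();
    for s in scenarios {
        let mut rows = run_query(s);
        rows.sort();
        results.push(rows);
    }

    let mut mismatches = Vec::new();
    for (s, rows) in scenarios.iter().zip(&results) {
        if rows != &results[0] {
            mismatches.push(s.clone());
        }
    }
    mismatches
}
```

In a real test, the closure would presumably map `pushdown` and `reorder` onto the session options `datafusion.execution.parquet.pushdown_filters` and `datafusion.execution.parquet.reorder_filters`, and the test would assert that `find_mismatches` returns an empty list. Returning the offending scenarios rather than a boolean makes a failure report show which layout and settings diverged.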
Additional context
No response