Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Currently, projection indices are pushed down to scans as Vec<usize>. This creates some ambiguities:
- How to handle out-of-order or repeated indices - Reading parquet with (pre-release) arrow fails with "out of order projection is not supported" #2543
- How to handle nested types - Incorrect Parquet Projection For Nested Types #2453
To demonstrate how these problems intertwine, consider the following schema:
Struct {
first: Struct {
a: Integer,
b: Integer,
},
second: Struct {
c: Integer
}
}
If I project ["first.a", "second.c", "first.b"], what is the resulting schema?
Describe the solution you'd like
I would like to propose that we instead push down a leaf column mask, where leaf columns are fields with no children, enumerated by a depth-first scan of the schema tree. A mask only records which leaves are selected, not their order, so it avoids the ordering and repetition ambiguities above, whilst also being relatively straightforward to implement and interpret.
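To make the idea concrete, here is a minimal sketch of a leaf column mask applied to the example schema above. It is not the arrow-rs or DataFusion API: the `Field` enum and the `leaf_paths`, `project` and `example_schema` functions are hypothetical names invented for this illustration, and the mask is modelled as a plain slice of booleans indexed by depth-first leaf position.

```rust
/// Hypothetical toy schema node, standing in for a real Arrow field.
#[derive(Debug, Clone, PartialEq)]
enum Field {
    /// A field with no children, i.e. a leaf column.
    Leaf(String),
    /// A struct field with named children.
    Struct(String, Vec<Field>),
}

/// Build a dotted path such as "first.a" from a parent prefix and a name.
fn join(prefix: &str, name: &str) -> String {
    if prefix.is_empty() {
        name.to_string()
    } else {
        format!("{}.{}", prefix, name)
    }
}

fn collect_leaves(fields: &[Field], prefix: &str, out: &mut Vec<String>) {
    for field in fields {
        match field {
            Field::Leaf(name) => out.push(join(prefix, name)),
            Field::Struct(name, children) => {
                collect_leaves(children, &join(prefix, name), out)
            }
        }
    }
}

/// Enumerate leaf columns in depth-first order; a leaf's position in
/// the returned vector is its index in the mask.
fn leaf_paths(schema: &[Field]) -> Vec<String> {
    let mut out = Vec::new();
    collect_leaves(schema, "", &mut out);
    out
}

fn prune(fields: &[Field], mask: &[bool], next: &mut usize) -> Vec<Field> {
    let mut kept = Vec::new();
    for field in fields {
        match field {
            Field::Leaf(_) => {
                // Leaves beyond the end of the mask are treated as unselected.
                if mask.get(*next).copied().unwrap_or(false) {
                    kept.push(field.clone());
                }
                *next += 1;
            }
            Field::Struct(name, children) => {
                let children = prune(children, mask, next);
                // Drop structs whose leaves were all masked out.
                if !children.is_empty() {
                    kept.push(Field::Struct(name.clone(), children));
                }
            }
        }
    }
    kept
}

/// Apply a leaf mask; the result always keeps the original field order.
fn project(schema: &[Field], mask: &[bool]) -> Vec<Field> {
    let mut next = 0;
    prune(schema, mask, &mut next)
}

/// The example schema from this issue.
fn example_schema() -> Vec<Field> {
    vec![
        Field::Struct(
            "first".into(),
            vec![Field::Leaf("a".into()), Field::Leaf("b".into())],
        ),
        Field::Struct("second".into(), vec![Field::Leaf("c".into())]),
    ]
}
```

Under this sketch the leaves are numbered first.a = 0, first.b = 1, second.c = 2. The projection ["first.a", "second.c", "first.b"] can only be expressed as the mask [true, true, true], which yields the original schema in its original order, so the question of how to reorder nested children never arises. Any reordering would have to be done by a separate step above the scan.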
I recently introduced a similar concept to the parquet reader apache/arrow-rs#1716. We could theoretically lift this into arrow-rs, potentially adding support to RecordBatch for it, and then use this in DataFusion.
Describe alternatives you've considered
We could choose not to support nested pushdown at all.
Additional context
Currently pushdown for nested types in ParquetExec is broken - #2453
Thoughts @andygrove @alamb