Skip to content

Support Decimal and Decimal256 Parquet Data Page Statistics #11111

Closed
@alamb

Description

@alamb

Is your feature request related to a problem or challenge?

Part of #10922

We are adding APIs to efficiently convert the data stored in Parquet's "PageIndex" into ArrayRefs -- which will make it significantly easier to use this information for pruning and other tasks.

Describe the solution you'd like

Add support to StatisticsConverter::min_page_statistics and StatisticsConverter::max_page_statistics for the types above

/// of parquet page [`Index`]'es to an [`ArrayRef`]
pub(crate) fn min_page_statistics<'a, I>(
data_type: Option<&DataType>,
iterator: I,
) -> Result<ArrayRef>
where
I: Iterator<Item = (usize, &'a Index)>,
{
get_data_page_statistics!(Min, data_type, iterator)
}
/// Extracts the max statistics from an iterator
/// of parquet page [`Index`]'es to an [`ArrayRef`]
pub(crate) fn max_page_statistics<'a, I>(
data_type: Option<&DataType>,
iterator: I,
) -> Result<ArrayRef>
where
I: Iterator<Item = (usize, &'a Index)>,
{

Describe alternatives you've considered

You can follow the model from @Weijun-H in #10931

  1. Update the test for the listed data types to be Check::Both, following the model of test_int64
    async fn test_int_64() {
    // This creates a parquet files of 4 columns named "i8", "i16", "i32", "i64"
    let reader = TestReader {
    scenario: Scenario::Int,
    row_per_group: 5,
    }
    .build()
    .await;
    // since each row has only one data page, the statistics are the same
    Test {
    reader: &reader,
    // mins are [-5, -4, 0, 5]
    expected_min: Arc::new(Int64Array::from(vec![-5, -4, 0, 5])),
    // maxes are [-1, 0, 4, 9]
    expected_max: Arc::new(Int64Array::from(vec![-1, 0, 4, 9])),
    // nulls are [0, 0, 0, 0]
    expected_null_counts: UInt64Array::from(vec![0, 0, 0, 0]),
    // row counts are [5, 5, 5, 5]
    expected_row_counts: UInt64Array::from(vec![5, 5, 5, 5]),
    column_name: "i64",
    check: Check::Both,
    }
    .run();
  2. Add any required implementation in get_datapage_statistics: (follow the model of the row counts, )

Typically the change to the test looks like

- check: Check::RowGroup, 
+ check: Check::Both, 

Additional context

No response

Metadata

Metadata

Assignees

Labels

enhancementNew feature or request

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions