
Conversation

alamb
Contributor

@alamb alamb commented Aug 7, 2025

Which issue does this PR close?

Rationale for this change

Metadata is needed when implementing a push decoder for Parquet:

If we want to truly separate IO and CPU, we also need a way to decode the metadata without explicit IO. Hence this PR provides a way to decode metadata "push style", where the decoder tells you what byte ranges it needs. It follows the same API as the parquet push decoder.

This PR also introduces some of the common infrastructure needed by the parquet push decoder.

What changes are included in this PR?

  1. Add PushBuffers to hold byte ranges
  2. Add DecodeResult to communicate back to the caller
  3. Add ParquetMetaDataPushDecoder for decoding metadata

Are these changes tested?

Yes, there are several fully working doc tests that show how to use this API

Are there any user-facing changes?

There is a new API

@github-actions github-actions bot added parquet Changes to the parquet crate arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate labels Aug 7, 2025
@alamb alamb changed the title Alamb/push metadata decoder Add ParquetMetadataPushDecoder Aug 7, 2025
@alamb alamb force-pushed the alamb/push_metadata_decoder branch from ff4a158 to 8ae1311 Compare August 7, 2025 14:02
/// Returned when a function needs more data to complete properly. The `usize` field indicates
/// the total number of bytes required, not the number of additional bytes.
NeedMoreData(usize),
/// Returned when a function needs more data to complete properly.
Contributor Author

This is the one part of this PR I am not sure about

Since ParquetError is (now) marked as #[non_exhaustive], I don't think this is technically a breaking change. However, it would be really nice to avoid adding a new enum variant -- I will comment about this later in the PR

Contributor

I think I had this error in my original draft of ParquetMetaDataReader (3c340b7) but got talked out of it 😅 (#6392 (comment))

Contributor Author

I should have known you were right and gone with your instinct!

Contributor Author

FWIW I could remove ParquetError::NeedMoreDataRange if I reworked the decoder to have a proper state machine internally (rather than calling into the existing ParquetMetaDataDecoder).

That is certainly my long term plan. If we like this API I will invest some time into seeing if I can sketch out what it would look like


pub mod thrift;

/// What data is needed to read the next item from a decoder.
Contributor Author

This is a new public API for returning results or requesting more data. It is also used in the parquet push decoder -- you can see it all wired up here: #7997
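For reference, here is a minimal sketch of the shape of this enum, reconstructed from how it is used in the try_decode experiment later in this thread (NeedsData, Data, Finished); the exact definition in the crate may differ:

    use std::ops::Range;

    /// Result of one push-decode step (a sketch; see the actual crate docs)
    #[derive(Debug)]
    pub enum DecodeResult<T> {
        /// The decoder needs the given byte ranges before it can make progress
        NeedsData(Vec<Range<u64>>),
        /// The decoder produced an item
        Data(T),
        /// The decoder is finished and will produce no more items
        Finished,
    }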

use std::fmt::Display;
use std::ops::Range;

/// Holds multiple buffers of data that have been requested by the ParquetDecoder
Contributor Author

This is the in memory buffer that holds the data needed for decoding. It is also used by the parquet push decoder in #7997
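To make the idea concrete, here is a simplified sketch (not the actual crate code) of how such a buffer could store non-contiguous ranges. The field layout and the push_range name are assumptions; file_len, has_range, and get_bytes mirror the methods used elsewhere in this PR:

    use std::ops::Range;
    use bytes::Bytes;

    #[derive(Debug, Default)]
    struct PushBuffersSketch {
        /// Total length of the underlying file (needed to locate the footer)
        file_len: u64,
        /// Start offset and bytes for each range that has been pushed
        buffers: Vec<(u64, Bytes)>,
    }

    impl PushBuffersSketch {
        fn push_range(&mut self, start: u64, data: Bytes) {
            self.buffers.push((start, data));
        }

        fn file_len(&self) -> u64 {
            self.file_len
        }

        /// Is the requested range fully contained in one of the pushed buffers?
        fn has_range(&self, range: &Range<u64>) -> bool {
            self.buffers
                .iter()
                .any(|(start, data)| *start <= range.start && range.end <= start + data.len() as u64)
        }

        /// Return `len` bytes starting at `offset`, zero copy via `Bytes::slice`
        fn get_bytes(&self, offset: u64, len: usize) -> Option<Bytes> {
            self.buffers.iter().find_map(|(start, data)| {
                let end = offset + len as u64;
                if *start <= offset && end <= start + data.len() as u64 {
                    let begin = (offset - start) as usize;
                    Some(data.slice(begin..begin + len))
                } else {
                    None
                }
            })
        }
    }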

}

/// Less efficient implementation of `Read` for `PushBuffers`
impl std::io::Read for PushBuffers {
Contributor Author

This std::io::Read impl is needed so the current thrift decoder can read bytes

I suspect when @etseidl is done with his custom thrift decoder we can remove this impl

use crate::file::metadata::{ParquetMetaData, ParquetMetaDataReader};
use crate::DecodeResult;

/// A push decoder for [`ParquetMetaData`].
Contributor Author

I am quite pleased with this API and I think the examples show the main IO patterns that we would want:

  1. Feed exactly the byte ranges needed
  2. "prefetch" a bunch of the data to avoid multiple IOs
  3. Read via the standard AsyncRead trait, which has been asked for several times.

/// // The `ParquetMetaDataPushDecoder` needs to know the file length.
/// let mut decoder = ParquetMetaDataPushDecoder::try_new(file_len).unwrap();
/// // try to decode the metadata. If more data is needed, the decoder will tell you what byte ranges it needs
/// loop {
Contributor Author

I am probably too exuberant, but I am really pleased with this API and how it works
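For readers following along, here is a minimal sketch of the "feed exactly the requested ranges" pattern (pattern 1 above). The import paths, the push_ranges method name, and the read_range helper are assumptions for illustration; the doc tests in this PR are the authoritative examples:

    use std::ops::Range;
    use bytes::Bytes;
    use parquet::errors::ParquetError;
    use parquet::file::metadata::{ParquetMetaData, ParquetMetaDataPushDecoder};
    use parquet::DecodeResult;

    /// Hypothetical helper that performs the actual IO for one byte range
    fn read_range(_range: &Range<u64>) -> Bytes {
        unimplemented!("read the requested range from a file or object store")
    }

    fn decode_metadata(file_len: u64) -> Result<ParquetMetaData, ParquetError> {
        // The decoder only needs to know the total file length up front
        let mut decoder = ParquetMetaDataPushDecoder::try_new(file_len)?;
        loop {
            match decoder.try_decode()? {
                // The decoder produced the metadata: we are done
                DecodeResult::Data(metadata) => return Ok(metadata),
                // The decoder needs more bytes: do the IO and push the results in
                DecodeResult::NeedsData(ranges) => {
                    let buffers: Vec<Bytes> = ranges.iter().map(read_range).collect();
                    decoder.push_ranges(ranges, buffers)?;
                }
                DecodeResult::Finished => {
                    return Err(ParquetError::General("metadata already decoded".into()))
                }
            }
        }
    }

The "prefetch" pattern (2) is the same loop, except the caller pushes a larger suffix of the file before the first try_decode call, so the loop typically completes without any further data requests.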

#[derive(Debug)]
pub struct ParquetMetaDataPushDecoder {
done: bool,
metadata_reader: ParquetMetaDataReader,
Contributor Author

Internally the current implementation just calls out to the existing ParquetMetaDataReader

However, I think long term it would make more sense to reverse the logic: have the decoding machinery live in this decoder, and refactor ParquetMetaDataReader to use the push decoder.

Given this PR is already quite large, I figured this would be a good follow-on.

Contributor

I wholeheartedly endorse this idea. 😄

use std::ops::Range;
use std::sync::{Arc, LazyLock};

/// It is possible to decode the metadata from the entire file at once before being asked
Contributor Author

I am also quite pleased with these tests, as I think they make it clear what "IO" is happening for the different operations

@alamb alamb changed the title Add ParquetMetadataPushDecoder [Parquet] Add ParquetMetadataPushDecoder Aug 7, 2025
@alamb alamb force-pushed the alamb/push_metadata_decoder branch from 8ae1311 to dfd6435 Compare August 7, 2025 15:29
@alamb alamb marked this pull request as ready for review August 7, 2025 15:29
@github-actions github-actions bot removed arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate labels Aug 7, 2025
@alamb alamb requested a review from etseidl August 7, 2025 15:34
@alamb
Contributor Author

alamb commented Aug 7, 2025

@etseidl I would especially be interested in your feedback on this API / idea, given your plans for the parquet metadata decoding revamp

@alamb alamb self-assigned this Aug 7, 2025
@etseidl
Contributor

etseidl commented Aug 8, 2025

Thanks @alamb, this seems exciting. I want to try playing with it in one of my branches to see if there are any ergonomic issues.

@etseidl
Contributor

etseidl commented Aug 14, 2025

Ok, I'm starting to grok this. I merged this branch into my current thrift branch, and changed try_decode to

  pub fn try_decode(
      &mut self,
  ) -> std::result::Result<DecodeResult<ParquetMetaData>, ParquetError> {
      if self.done {
          return Ok(DecodeResult::Finished);
      }

      // need to have the last 8 bytes of the file to decode the metadata
      let file_len = self.buffers.file_len();
      if !self.buffers.has_range(&(file_len - 8..file_len)) {
          #[expect(clippy::single_range_in_vec_init)]
          return Ok(DecodeResult::NeedsData(vec![file_len - 8..file_len]));
      }

      // Try to parse the metadata from the buffers we have.
      // If we don't have enough data, it will return a `ParquetError::NeedMoreData`
      // with the number of bytes needed to complete the metadata parsing.
      // If we have enough data, it will return `Ok(())` and we can decode the metadata.
      let footer_bytes = self
          .buffers
          .get_bytes(file_len - FOOTER_SIZE as u64, FOOTER_SIZE)?;
      let mut footer = [0_u8; FOOTER_SIZE];
      footer_bytes.as_ref().copy_to_slice(&mut footer);
      let footer = ParquetMetaDataReader::decode_footer_tail(&footer)?;

      let metadata_len = footer.metadata_length();
      let footer_metadata_len = FOOTER_SIZE + metadata_len;
      let footer_start = file_len - footer_metadata_len as u64;
      let footer_end = file_len - FOOTER_SIZE as u64;
      if !self.buffers.has_range(&(footer_start..footer_end)) {
          #[expect(clippy::single_range_in_vec_init)]
          return Ok(DecodeResult::NeedsData(vec![footer_start..file_len]));
      }

      let metadata_bytes = self.buffers.get_bytes(footer_start, metadata_len)?;
      let metadata = ParquetMetaDataReader::decode_file_metadata(&metadata_bytes)?;
      self.done = true;
      Ok(DecodeResult::Data(metadata))
  }

No page indexes yet, but this seems pretty nice 👍 Once I have the page indexes converted the parser should get pretty simple.

@alamb
Contributor Author

alamb commented Aug 15, 2025

No page indexes yet, but this seems pretty nice 👍 Once I have the page indexes converted the parser should get pretty simple.

I am also thinking of going one step further and rewriting the entire metadata decoder as an explicit state machine, roughly in the pattern you describe, which would avoid having to use DecodeResult::NeedsData(vec![footer_start..file_len]) (we would return Ok(DecodeResult::Ranges) instead).
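To illustrate the direction, here is a rough sketch of what an explicit internal state machine might look like; the state names and transitions are assumptions, not the eventual implementation:

    use std::ops::Range;

    const FOOTER_SIZE: u64 = 8;

    /// Possible states while decoding Parquet metadata "push style"
    enum DecodeState {
        /// Waiting for the last 8 bytes of the file (metadata length + magic)
        NeedsFooter,
        /// Footer decoded; waiting for the `metadata_len` bytes that precede it
        NeedsMetadata { metadata_len: u64 },
        /// Metadata fully decoded
        Finished,
    }

    impl DecodeState {
        /// Byte ranges required to advance out of the current state
        fn needed_ranges(&self, file_len: u64) -> Vec<Range<u64>> {
            match self {
                DecodeState::NeedsFooter => vec![file_len - FOOTER_SIZE..file_len],
                DecodeState::NeedsMetadata { metadata_len } => {
                    let start = file_len - FOOTER_SIZE - metadata_len;
                    vec![start..file_len - FOOTER_SIZE]
                }
                DecodeState::Finished => vec![],
            }
        }
    }

Each call to try_decode would then match on the current state, return the ranges above as DecodeResult::NeedsData if the buffers do not yet contain them, and otherwise decode and advance to the next state.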

etseidl added a commit that referenced this pull request Aug 15, 2025
…aData` (#8111)

# Which issue does this PR close?
**Note: this targets a feature branch, not main**

- Part of #5854.

# Rationale for this change

# What changes are included in this PR?
This PR completes reading of the `FileMetaData` and `RowGroupMetaData`
pieces of the `ParquetMetaData`. Column indexes and encryption will be
follow-on work.

This replaces the macro for generating structs with a more general one
that can take visibility and lifetime specifiers. Also (temporarily)
adds a new function `ParquetMetaDataReader::decode_file_metadata` which
should be a drop-in replacement for
`ParquetMetaDataReader::decode_metadata`.

Still todo:

1. Add some tests that verify this produces the same output as
`ParquetMetaDataReader::decode_metadata`
2. Read column indexes with new decoder
3. Read page headers with new decoder
4. Integrate with @alamb's push decoder work #8080

# Are these changes tested?

Not yet

# Are there any user-facing changes?

Yes
@alamb
Contributor Author

alamb commented Aug 20, 2025

I believe @mbutrovich said he was planning to review this PR

Contributor

@etseidl etseidl left a comment

Just some nits I found when reading the rendered documentation. Looking good so far 👍

/// 1. Zero copy
/// 2. Non-contiguous ranges of bytes
///
/// # Non Coalescing
Contributor

❤️

Contributor

@mbutrovich mbutrovich left a comment

Minor nits, this is looking great @alamb!

@alamb alamb dismissed mbutrovich’s stale review August 21, 2025 19:32

Requested changes made

@alamb alamb requested a review from mbutrovich August 21, 2025 19:32
Contributor

@mbutrovich mbutrovich left a comment

Thanks for the quick revision, @alamb. This is looking great!

@alamb
Contributor Author

alamb commented Aug 23, 2025

Thanks for the reviews @mbutrovich and @etseidl -- I plan to merge this PR sometime this week, and then I'll try and make a POC to rework the current metadata decoder to use this one instead of the other way around

if found {
    // If we found the range, we can return the number of bytes read
    // advance our offset
    self.offset += buf.len() as u64;
Contributor

Would we be able to avoid re-iterating ranges coming before the virtual offset on subsequent calls to read if we kept the index of the range/buffers Vecs as part of the offset?

I guess this would only be impactful if there were many ranges, so feel free to ignore this suggestion if the complexity outweighs the benefit

Contributor Author

That is an interesting idea -- I am not sure the current API requires any particular order of the ranges (so I don't think it is guaranteed that the bytes for the next range come right after the currently found index).

However, the linear scan is likely to be inefficient if we are storing many ranges -- maybe we should keep them in some sort of sorted structure (like a BTreeSet, for example) 🤔

Contributor

Oh I didn't realize the ranges weren't in a sorted order. I think BTreeSet could work. We could probably start the iter from the last range we saw by calling ranges.range((Included(&last_range), Unbounded)).

That said, I don't want to hold up this PR. I'm fine with the current solution for now & we could always optimize the linear scan down the line if we find we're storing a lot of ranges.
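For what it's worth, here is a sketch of the sorted-structure idea: keying buffers by their start offset in a BTreeMap lets the buffer covering a given offset be found without a linear scan. This is illustrative only, not the crate's code:

    use std::collections::BTreeMap;
    use bytes::Bytes;

    struct SortedBuffers {
        /// start offset -> bytes for that range
        buffers: BTreeMap<u64, Bytes>,
    }

    impl SortedBuffers {
        /// Find the buffer containing `offset`, if any, by looking at the last
        /// range that starts at or before `offset`
        fn find(&self, offset: u64) -> Option<(&u64, &Bytes)> {
            self.buffers
                .range(..=offset)
                .next_back()
                .filter(|(start, data)| offset < **start + data.len() as u64)
        }
    }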

@alamb
Contributor Author

alamb commented Sep 4, 2025

I was out last week so I am delayed

@alamb
Contributor Author

alamb commented Sep 10, 2025

I'll plan to merge this PR tomorrow unless anyone else would like time to comment

@alamb alamb merged commit aa626e1 into apache:main Sep 11, 2025
16 checks passed
@alamb
Contributor Author

alamb commented Sep 11, 2025

Thanks everyone. I am hoping to try and prototype moving the decoder logic shortly as well

@alamb
Contributor Author

alamb commented Sep 12, 2025

FWIW I am starting to refactor the actual decoding state machine here:

@alamb alamb deleted the alamb/push_metadata_decoder branch September 12, 2025 19:51
@alamb
Contributor Author

alamb commented Sep 26, 2025

The PR to move the actual decoding into its own state machine is now ready for review:

alamb added a commit that referenced this pull request Sep 30, 2025
…coder (#8340)

# Which issue does this PR close?

- part of #8000 
- Follow on to #8080
- Closes #8439

# Rationale for this change

The current ParquetMetadataDecoder intermixes three things:
1. The state machine for decoding parquet metadata (footer, then
metadata, then (optional) indexes)
2. Orchestrating IO (i.e. calling read, etc.)
3. Decoding thrift-encoded bytes into objects

This makes it almost impossible to add features like "only decode a
subset of the columns in the ColumnIndex" and other potentially advanced
use cases.

Now that we have a "push" style API for metadata decoding that avoids
IO, the next step is to extract out the actual work into this API so
that the existing ParquetMetadataDecoder just calls into the PushDecoder

# What changes are included in this PR?

1. Extract the decoding state machine into PushMetadataDecoder
2. Extract thrift parsing into its own `parser` module
3. Update ParquetMetadataDecoder to use the PushMetadataDecoder
4. Extract the bytes --> object code into its own module

This almost certainly will conflict with @etseidl's plans in thrift-remodel.

# Are these changes tested?
by existing tests

# Are there any user-facing changes?

Not really -- this is an internal change that will make it easier to add
features like "only decode a subset of the columns in the ColumnIndex",
for example.

Labels

parquet Changes to the parquet crate

Development

Successfully merging this pull request may close these issues.

[Parquet] Implement a "push style" API for decoding Parquet Metadata

5 participants