Foundation of API for reading Variant data and metadata #7535
Conversation
@alamb @PinkCrow007 FYI. This first draft only contains a small subset of the primitive types, so we can get feedback on that before implementing decoding and conversion for all primitive types. I tried to incorporate and address as many comments from alamb and scovich in #7452 as possible, but I can't promise that all have been addressed; please let me know.
This is great, thank you so much @mkarbo -- I left some comments but I think this is a super strong start.
Let's see if we get any more API suggestions.
I also made a PR for your consideration:
I'll also go and try to fix some of the examples upstream in parquet-testing.
```rust
#[test]
fn test_string() -> Result<(), ArrowError> {
```
FWIW I don't think these unit tests are necessary given we have code that is reading the values from parquet-testing.
Yeah, let me write these out a little more. I think having some hand-crafted tests (e.g., also for objects) where we see how the bytes look will be helpful for readers as documentation, not just for validation. We can add test variants using both the builder and the manual process.

I personally learnt a lot by writing out the metadata and value bytes with pen and paper for various variant data such as `42`, `<short string>`, `<longer string>`, `{'a': 1, 'b': 2, 'c': <uuid>}` and `{'a': 1, 'b': 'hello 💯', 'c': {'x': 99, 'y': true, 'z': <uuid>}}`, and I think that could help future maintainers get onboarded (I would update the tests with more in-depth docs). A worked example of this byte layout is sketched below.

However, we can also skip them if you guys prefer, let me know.
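For illustration, a minimal hand-worked case for the integer `42` might look like the following test. The byte values are derived from my reading of the Variant spec (a version-1 metadata header with 1-byte offsets, and int8 primitive type id 3), so treat this as a sketch to double-check against the spec rather than as authoritative:

```rust
#[test]
fn test_int8_byte_layout() {
    // Metadata with an empty dictionary:
    //   0x01 = header: version=1, sorted_strings=0, offset_size_minus_one=0
    //   0x00 = dictionary_size (one byte, since offset_size is 1)
    //   0x00 = the single offset entry terminating the (empty) string section
    let metadata: [u8; 3] = [0x01, 0x00, 0x00];

    // Value for 42 as an int8 primitive: the header byte packs the primitive
    // type id (3 for int8) shifted left by 2, with the "primitive" basic
    // type (0b00) in the low two bits, followed by the little-endian payload.
    let value_header = (3u8 << 2) | 0b00;
    assert_eq!(value_header, 0x0C);
    let value: [u8; 2] = [value_header, 42];

    assert_eq!(metadata, [0x01, 0x00, 0x00]);
    assert_eq!(value, [0x0C, 0x2A]);
}
```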
I think they are ok, maybe we just shouldn't go too crazy.
The other thing we could do is to make them examples in the documentation -- which would help both users and maintainers.
```rust
use strum_macros::EnumDiscriminants;

#[derive(Clone, Copy, Debug, PartialEq)]
/// Encodes the Variant Metadata, see the Variant spec file for more information
```
I think this API looks pretty good to me.
One thing that we might want to do is to make a `VariantMetadata::try_new` function that checks and parses out the sorted_strings, dict_len, etc. fields. To do anything with the metadata those fields will need to be decoded, so it seems like it would be a nicer API to compute and validate them once on construction rather than on each access.
+1
Also, pay the cost to construct it once and pass references to everyone else that needs it.
Something like this:
```rust
struct VariantMetadataHeader {
    version: u8,
    sorted: bool,
    offset_size_bytes: OffsetSizeBytes,
}

impl<'m> VariantMetadata<'m> {
    pub fn try_new(bytes: &'m [u8]) -> Result<Self, ArrowError> {
        if bytes.is_empty() {
            return Err(...);
        }
        let header = VariantMetadataHeader::try_new(bytes)?;
        let dictionary_size = header.offset_size_bytes.unpack_usize(bytes, 1, 0)?;
        // verify the offsets are monotonically increasing; offset i lives at
        // offset index i + 1 (index 0 is the dictionary_size field itself)
        let mut prev = 0;
        for i in 0..=dictionary_size {
            let offset = header.offset_size_bytes.unpack_usize(bytes, 1, i + 1)?;
            if offset < prev {
                return Err(...);
            }
            prev = offset;
        }
        Ok(Self { ... })
    }

    pub fn get(&self, i: usize) -> Result<&'m str, ArrowError> {
        // TODO: Should we memoize the offsets? There could be thousands of them...
        let start = self.header.offset_size_bytes.unpack_usize(self.bytes, self.first_offset, i)?;
        let end = self.header.offset_size_bytes.unpack_usize(self.bytes, self.first_offset, i + 1)?;
        let bytes = slice_from_slice(self.value_bytes, self.first_value + start..self.first_value + end)?;
        str::from_utf8(bytes).map_err(|_| invalid_utf8_err())
    }
}
```
The important thing, I suspect, is to verify that the offsets are monotonically increasing, so we don't risk panics due to a case where end < begin. Tho I suppose we could also check that in the individual accessors instead.
Actually, there are four different `XX_size_minus_one` situations in the variant spec (metadata dictionary offsets, array value offsets, object field ids, and object value offsets). We should probably factor out a helper class for working with them, since it's ~always the same:
```rust
#[derive(Clone, Copy)]
enum PackedUsizeArrayElementSize {
    One = 1,
    Two = 2,
    Three = 3,
    Four = 4,
}

impl PackedUsizeArrayElementSize {
    fn try_new(offset_size_minus_one: u8) -> Result<Self, ArrowError> {
        use PackedUsizeArrayElementSize::*;
        let result = match offset_size_minus_one {
            0 => One,
            1 => Two,
            2 => Three,
            3 => Four,
            _ => return Err(...),
        };
        Ok(result)
    }
}

struct PackedUsizeArray<'a> {
    element_size: PackedUsizeArrayElementSize,
    elements: &'a [u8],
    num_elements: usize,
}

impl<'a> PackedUsizeArray<'a> {
    /// NOTE: Only looks at the lower 2 bits of the element size!
    fn try_new(
        bytes: &'a [u8],
        byte_offset: usize,
        element_size: PackedUsizeArrayElementSize,
        num_elements: usize,
    ) -> Result<Self, ArrowError> {
        let len = (element_size as usize) * num_elements;
        Ok(Self {
            elements: slice_from_slice(bytes, byte_offset..byte_offset + len)?,
            element_size,
            num_elements,
        })
    }

    fn get(&self, i: usize) -> Result<usize, ArrowError> {
        self.element_size.unpack_usize(self.elements, 0, i)
    }
}
```
> Actually, there are four different `XX_size_minus_one` situations in the variant spec (metadata dictionary offsets, array value offsets, object field ids, and object value offsets). We should probably factor out a helper class for working with them, since it's ~always the same:
If we really want to obsess about performance, we could potentially make `VariantMetadata` templated on the offset size. I am not sure how much that will matter, so we probably shouldn't do it at this point.
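For concreteness, a rough sketch of what such templating could look like using const generics; all names here are hypothetical, and the layout is simplified to the metadata case of `header | dictionary_size | offsets...`:

```rust
struct TypedMetadata<'m, const OFFSET_SIZE: usize> {
    bytes: &'m [u8],
}

impl<'m, const OFFSET_SIZE: usize> TypedMetadata<'m, OFFSET_SIZE> {
    /// Reads offset `i`; the width is known at compile time, so the compiler
    /// can specialize this method instead of matching on an enum at each call.
    fn offset(&self, i: usize) -> Option<usize> {
        // Skip the 1-byte header plus the dictionary_size field, then index.
        let start = 1 + OFFSET_SIZE * (i + 1);
        let bytes = self.bytes.get(start..start.checked_add(OFFSET_SIZE)?)?;
        let mut buf = [0u8; 8]; // zero-extend up to 8 bytes (OFFSET_SIZE <= 4)
        buf[..OFFSET_SIZE].copy_from_slice(bytes);
        Some(u64::from_le_bytes(buf) as usize)
    }
}
```

The dispatch on offset size would then happen once, when choosing which instantiation to construct, rather than on every access.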
Yeah... and the use sites in object and array variant would anyway be the more likely performance bottleneck. Unless we want to define a different enum variant for each of those as well... 4*4 (object) + 4 (array) = 20 total combinations... seems excessive. A middle ground would be to special-case just the common ones (e.g. object where both sizes are 1 or both are 4; array where the size is 1 or 4). But it's still extra complexity for likely a very small gain. We'd be trading off more match arms in one place, in order to eliminate match arms somewhere nearby.
Should we open a ticket to explore this after this one? I would like to get the API ready so my team can work in parallel on the rest of the primitives and the optimisations needed.
parquet-variant/src/variant.rs (outdated):
```rust
self.buffer[(self.seen) * self.offset_size + 1 // +1 to skip header
    ..(self.seen) * self.offset_size + 1]
    .try_into()
    .ok()?,
```
This is interesting -- what should we do when we hit an offset that is an invalid usize? I think this code will basically return `None` (stop iterating) rather than erroring in this situation.

Maybe when someone calls `iter()` it would make sense to validate all the offsets then 🤔
I don't think the offsets are ever valid usizes? They occupy 1, 2, 3, or 4 bytes, depending on the metadata header's (2-bit) `offset_size_minus_one` field? We'll need to extract the correct number of bytes, pad it out to u32 (or usize), and then `try_into` + `from_le_bytes`?

Also, I think this code misses length checks needed for those subslice ranges to not panic? Or did I miss some earlier global check that covers the whole iteration?
We'd want some helper like this:
```rust
#[derive(Clone, Copy)]
enum OffsetSizeBytes {
    One = 1,
    Two = 2,
    Three = 3,
    Four = 4,
}

impl OffsetSizeBytes {
    fn try_new(offset_size_minus_one: u8) -> Result<Self, ArrowError> {
        use OffsetSizeBytes::*;
        let result = match offset_size_minus_one {
            0 => One,
            1 => Two,
            2 => Three,
            3 => Four,
            _ => return Err(...),
        };
        Ok(result)
    }

    fn unpack_usize(
        &self,
        bytes: &[u8],
        byte_offset: usize,  // how many bytes to skip
        offset_index: usize, // which offset in an array of offsets
    ) -> Result<usize, ArrowError> {
        use OffsetSizeBytes::*;
        let offset = byte_offset + (*self as usize) * offset_index;
        let result = match self {
            One => u8::from_le_bytes(array_from_slice(bytes, offset)?).into(),
            Two => u16::from_le_bytes(array_from_slice(bytes, offset)?).into(),
            Three => todo!(), // ugh, endianness
            Four => u32::from_le_bytes(array_from_slice(bytes, offset)?).try_into()?,
        };
        Ok(result)
    }
}
```
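One way the `Three` arm could be filled in later (my assumption, not part of the suggestion above) is to zero-extend three little-endian bytes into a `u32` before widening:

```rust
Three => {
    let b: [u8; 3] = array_from_slice(bytes, offset)?;
    u32::from_le_bytes([b[0], b[1], b[2], 0]) as usize
}
```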
For simplicity (and IIRC also speed), I believe this kind of enum-based dispatch is generally preferred in Rust, instead of function pointers?
Flushing comments at EOD...
parquet-variant/src/variant.rs (outdated):
```rust
Variant::ShortString(Cow::Borrowed(decoder::decode_short_string(value)?))
}

VariantType::Object => Variant::Object(VariantObject {
```
Probably? `VariantObject::try_new` and `VariantArray::try_new` should probably do basic header and offset monotonicity validation, similar to what I suggested for `VariantMetadata::try_new` (see comment above)?
parquet-variant/src/variant.rs (outdated):
```rust
impl<'m, 'v> Index<usize> for VariantArray<'m, 'v> {
    type Output = Variant<'m, 'v>;

    fn index(&self, _index: usize) -> &Self::Output {
```
I don't think this trait works, if it forces us to return a reference? We need to return a `Variant` by value, which wraps the referenced bytes.
I agree -- that will likely become obvious if we try to implement it.
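A simplified, self-contained illustration of the problem (the types below are stand-ins, not the PR's): `Index::index` must return a reference, but a decoded element only exists as a temporary.

```rust
struct PackedU16s<'a> {
    bytes: &'a [u8],
}

impl<'a> PackedU16s<'a> {
    // Returning by value works fine: decode a fresh u16 from the bytes.
    fn get(&self, i: usize) -> Option<u16> {
        let chunk = self.bytes.get(i * 2..i * 2 + 2)?;
        Some(u16::from_le_bytes([chunk[0], chunk[1]]))
    }
}

// impl<'a> std::ops::Index<usize> for PackedU16s<'a> {
//     type Output = u16;
//     fn index(&self, i: usize) -> &u16 {
//         &self.get(i).unwrap() // ERROR: returns a reference to a temporary
//     }
// }
```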
Test out avoiding strum and Cow
parquet-variant/src/decoder.rs (outdated):
```rust
/// Extracts the primitive type from a header byte
pub(crate) fn get_primitive_type(header: u8) -> Result<VariantPrimitiveType, ArrowError> {
```
@scovich's comments got me thinking we could implement this as `impl TryFrom<u8> for VariantPrimitiveType`, which I think would be more idiomatic Rust and would read better. Similarly for other APIs in this module.
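A sketch of what that could look like; the variant names and type-id values are assumptions based on the spec, not the PR's exact definitions:

```rust
impl TryFrom<u8> for VariantPrimitiveType {
    type Error = ArrowError;

    fn try_from(type_id: u8) -> Result<Self, Self::Error> {
        match type_id {
            0 => Ok(VariantPrimitiveType::Null),
            1 => Ok(VariantPrimitiveType::BooleanTrue),
            2 => Ok(VariantPrimitiveType::BooleanFalse),
            3 => Ok(VariantPrimitiveType::Int8),
            // ... remaining primitive type ids elided ...
            other => Err(ArrowError::InvalidArgumentError(format!(
                "unknown variant primitive type id: {other}"
            ))),
        }
    }
}
```

Call sites would then read as `VariantPrimitiveType::try_from(header >> 2)?` or similar.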
("primitive_string", Variant::from("This string is longer than 64 bytes and therefore does not fit in a short_string and it also includes several non ascii characters such as 🐢, 💖, ♥\u{fe0f}, 🎣 and 🤦!!")), | ||
// Using the From<String> trait | ||
("short_string", Variant::from("Less than 64 bytes (❤\u{fe0f} with utf8)")), | ||
// TODO Reenable when https://github.com/apache/parquet-testing/issues/81 is fixed |
BTW I made a PR to fix this
parquet-variant/src/variant.rs
Outdated
impl<'m, 'v> Index<usize> for VariantArray<'m, 'v> { | ||
type Output = Variant<'m, 'v>; | ||
|
||
fn index(&self, _index: usize) -> &Self::Output { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree --that will likely become obvious if we try to implement it
```rust
/// To be used in `map_err` when unpacking an integer from a slice of bytes.
fn map_try_from_slice_error(e: TryFromSliceError) -> ArrowError {
    ArrowError::InvalidArgumentError(e.to_string())
}
```
I think it would actually be better to improve this conversion a bit to explain that the error comes from trying to parse variants. If we just add a blanket `From` impl the code will just work, but users will see messages like `Invalid slice`, which will be pretty unspecific.
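A sketch of the more specific mapping being suggested (the exact wording is my assumption):

```rust
fn map_try_from_slice_error(e: TryFromSliceError) -> ArrowError {
    ArrowError::InvalidArgumentError(format!(
        "error decoding Variant data: failed to read integer from byte slice: {e}"
    ))
}
```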
I am refactoring the Array to have
I took the liberty of merging up from main and adding headers to the files to get CI to pass
TLDR is I think this PR is good enough to merge so we can begin making improvements (like running checks in the CI, adding docs, tests, other support, etc)
Thank you so much for spearheading it @mkarbo and for all the detailed feedback @scovich and @mapleFU .
In my opinion the only outstanding questions are some of the details of handling Objects and arrays but I think we can iterate on that in subsequent PRs as we haven't really committed to an API
What do you think @mkarbo and @scovich ? Is this good enough to merge and make follow on PRs?
FYI @PinkCrow007 -- if you have time to review this it would be most appreciated
parquet-variant/src/variant.rs (outdated):
"First offset is non-zero".to_string(), | ||
)); | ||
} | ||
if prev.is_some_and(|prev| prev >= offset) { |
I just found an interesting question: is an empty key valid here?
This code can probably be simplified, but meanwhile -- an empty `prev` means this is the first entry, and there is no previous entry it could disagree with.
I mean, what if `prev == offset`, which means the key is empty; could this scenario exist?
Yipes, good catch! The metadata dictionary encodes field names, and JSON definitely allows empty field names. Plus, the spec allows duplicates in an unordered dictionary (**), so we could even encounter more than one empty string.

(**) The spec says the following:

> If `sorted_strings` is set to 1, strings in the dictionary must be unique and sorted in lexicographic order. If the value is set to 0, readers may not make any assumptions about string order or uniqueness.

The spec also says:

> An object may not contain duplicate keys

... which would be "not fun" to enforce if the dictionary contained duplicate entries -- you could end up with two different field ids whose values compare equal.

I guess this opens yet another question about how much validation we should be performing? If it's sorted, do we need to verify that each string is lexicographically greater-than its predecessor?
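If empty keys (and, in the unsorted case, duplicate strings) are legal, the check in the snippet under review would presumably need to reject only strictly decreasing offsets. A sketch of the relaxed condition, based on that snippet:

```rust
// Equal adjacent offsets encode an empty dictionary string, which the spec
// permits, so only a strictly decreasing pair is invalid.
if prev.is_some_and(|prev| prev > offset) {
    return Err(ArrowError::InvalidArgumentError(
        "offsets are not monotonically increasing".to_string(),
    ));
}
```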
Currently in the C++ implementation, I just do some O(1) checks to prevent memory issues and trust the user to pass valid data. Validating whether a string dictionary is sorted is really tricky (maybe it's also tricky for field_ids in objects). I think a `Validate` function can be added but not run by default.

For empty keys, I don't know how shredding handles this. I've sent a mail to the mailing list, maybe we can clarify this in the coming days: https://lists.apache.org/thread/1shsf89w9dxgjp0d8kgr7dom6rh5sk9t
> I think a Validate function can be added but not run by default

I agree with this design idea -- as long as invalid data supplied by the user can not cause undefined behavior (crashes or memory unsafety), I think it is fine to elide expensive checks by default and provide a more expensive validation routine.

If you agree @scovich I will file a ticket to track the idea.
Agree -- a separate in-depth `validate` method sounds like a great way to balance efficiency with safety.

So for this PR, we just add basic O(1) validation and sanity checks (**), and leave the in-depth validation as a tracking ticket?

(**) I guess such checks must include the header, and probably also would check that the data section of bytes is not obviously too small? As in, verify that the first offset is 0 and that the last offset makes sense vs. the byte slice we have to work with?
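One possible shape for the opt-in deep validation discussed here, written as a standalone helper (all names are hypothetical):

```rust
use arrow_schema::ArrowError;

/// O(dictionary_size) deep validation: offsets in bounds and monotonic,
/// strings valid UTF-8, and (when the sorted bit is set) unique and ordered.
fn validate_dictionary(offsets: &[usize], strings: &[u8], sorted: bool) -> Result<(), ArrowError> {
    let mut prev_str: Option<&str> = None;
    for pair in offsets.windows(2) {
        let (start, end) = (pair[0], pair[1]);
        if start > end {
            return Err(ArrowError::InvalidArgumentError(
                "dictionary offsets are not monotonically increasing".to_string(),
            ));
        }
        let bytes = strings.get(start..end).ok_or_else(|| {
            ArrowError::InvalidArgumentError("dictionary offset out of bounds".to_string())
        })?;
        let s = std::str::from_utf8(bytes)
            .map_err(|e| ArrowError::InvalidArgumentError(e.to_string()))?;
        if sorted && prev_str.is_some_and(|p| p >= s) {
            return Err(ArrowError::InvalidArgumentError(
                "sorted dictionary strings must be unique and in lexicographic order".to_string(),
            ));
        }
        prev_str = Some(s);
    }
    Ok(())
}
```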
Did we file a ticket already?
Finished a review pass. I think more changes landed while I was reviewing tho.
```rust
impl<'m, 'v> From<&'v str> for Variant<'m, 'v> {
    fn from(value: &'v str) -> Self {
```
Hmm. I'm not sure these `From` conversions are actually helpful in the decoder path? Strings, in particular, can't just be any `&'v str` -- they have to reside in the correct location inside the value byte slice that comprises the overall variant. Even in the builder path, turning a string into a variant would be a bytewise copy.
We're getting very close. I agree the specifics of objects and arrays can be follow-on work. A few remaining items on my mind (some of which are comments that github oh-so-helpfully hides):

I suspect one more pass will address the code simplifications. Not sure how to resolve the other two most efficiently? Maybe others could chime in so we have more voices to consider?

+100 -- to both you and PinkCrow007 (who github won't let me mention for some reason)

Thanks all, I can give it another go tomorrow I think with all the comments.
LGTM, except a few older review comments might have been missed? Also a bunch of nits to consider either before merge or as follow-up work.

- Potential to have a `VariantMetadata::unpack` method: https://github.com/apache/arrow-rs/pull/7535/files#r2105478634
- Questionable utility of `impl<'v> From<&'v str> for Variant`: https://github.com/apache/arrow-rs/pull/7535/files#r2105513563
- Better for `VariantMetadata` to present a (narrow) `Vec<&str>`-like interface?
```rust
/// Extracts the basic type from a header byte
pub(crate) fn get_basic_type(header: u8) -> Result<VariantBasicType, ArrowError> {
```
nit: Do we ever expect to return `Err`? Seems infallible?

Also, out of curiosity, why is one a function and the other an `impl [Try]From` with a function as well?
```rust
/// Decodes a long string from the value section of a variant.
pub(crate) fn decode_long_string(value: &[u8]) -> Result<&str, ArrowError> {
    let len = u32::from_le_bytes(array_from_slice(value, 1)?) as usize;
```
nit:

```diff
-let len = u32::from_le_bytes(array_from_slice(value, 1)?) as usize;
+let len = OffsetSizeBytes::Four.unpack_usize(value, 1, 0)?;
```
```rust
let len = (first_byte_from_slice(value)? >> 2) as usize;

let string = string_from_slice(value, 1..1 + len)?;
```
nit (the suggestion removes the blank line between the two statements):

```diff
-let len = (first_byte_from_slice(value)? >> 2) as usize;
-
-let string = string_from_slice(value, 1..1 + len)?;
+let len = (first_byte_from_slice(value)? >> 2) as usize;
+let string = string_from_slice(value, 1..1 + len)?;
```
```rust
pub(crate) fn slice_from_slice<I: SliceIndex<[u8]> + Clone + Debug>(
```
nit: doc comment?

```diff
+/// Extracts a subslice from a slice without risk of panic
 pub(crate) fn slice_from_slice<I: SliceIndex<[u8]> + Clone + Debug>(
```

(more below)
```rust
use crate::utils::{array_from_slice, first_byte_from_slice, string_from_slice};

#[derive(Debug, Clone, Copy)]
pub enum VariantBasicType {
```
These seem pretty basic, should they be in the top-level lib.rs instead of here in decoder.rs?
```rust
/// Get a single offset by index
pub fn get_offset_by(&self, index: usize) -> Result<usize, ArrowError> {
```
Based on previous discussions, should this be `pub(crate)`?
```rust
// 1. num_elements + 1
let n_offsets = num_elements.checked_add(1).ok_or_else(overflow)?;

// 2. (num_elements + 1) * offset_size
let value_bytes = n_offsets
    .checked_mul(offset_size as usize)
    .ok_or_else(overflow)?;

// 3. first_offset_byte + ...
let first_value_byte = first_offset_byte
    .checked_add(value_bytes)
    .ok_or_else(overflow)?;
```
nit: Any particular reason this is spread out, instead of one chain like `Metadata::try_new` did?

```diff
-// 1. num_elements + 1
-let n_offsets = num_elements.checked_add(1).ok_or_else(overflow)?;
-
-// 2. (num_elements + 1) * offset_size
-let value_bytes = n_offsets
-    .checked_mul(offset_size as usize)
-    .ok_or_else(overflow)?;
-
-// 3. first_offset_byte + ...
-let first_value_byte = first_offset_byte
-    .checked_add(value_bytes)
-    .ok_or_else(overflow)?;
+// (num_elements + 1) * offset_size + first_offset_byte
+let first_value_byte = num_elements
+    .checked_add(1)
+    .and_then(|n| n.checked_mul(offset_size as usize))
+    .and_then(|n| n.checked_add(first_offset_byte))
+    .ok_or_else(overflow)?;
```
```rust
first_value_byte + start_field_offset_from_first_value_byte
    ..first_value_byte + end_field_offset_from_first_value_byte,
```
Is there any risk a malicious input could cause either of these to overflow on a 32-bit arch?
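If so, a defensive form of the range computation could use checked arithmetic, mirroring the `overflow` helper used earlier in this file (a sketch based on the names in the snippet above):

```rust
let start = first_value_byte
    .checked_add(start_field_offset_from_first_value_byte)
    .ok_or_else(overflow)?;
let end = first_value_byte
    .checked_add(end_field_offset_from_first_value_byte)
    .ok_or_else(overflow)?;
// then slice with `start..end` instead of computing the bounds inline
```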
```rust
pub fn metadata(&self) -> Option<&'m VariantMetadata> {
```
I can't find my previous github comment now... but it's still not clear to me when/how this method could be useful in practice?
```rust
impl<'m, 'v> From<&'v str> for Variant<'m, 'v> {
```
nit: I'm surprised clippy didn't warn about unnecessary named lifetimes:

```diff
-impl<'m, 'v> From<&'v str> for Variant<'m, 'v> {
+impl<'v> From<&'v str> for Variant<'_, 'v> {
```
You're too fast @scovich! I am not done yet 😉
Hehe sorry... this is just too exciting. |
Which issue does this PR close?
Closes #7423
Rationale for this change
We need to agree on an API for reading Variant metadata. Based on the work and discussions in #7452, in this PR we propose an API plus an implementation (WIP while draft) for reading variant metadata in the parquet-variant crate.
A lot of the work is based on the work in #7452 by @PinkCrow007 and feedback from @alamb, @scovich, and @Weijun-H.
What changes are included in this PR?
We attempt to be result- and validation-driven while ensuring zero allocations, and we do so while avoiding `serde_json`. We tried to keep the Variant API similar to the `Json::Value` API.

Are there any user-facing changes?

The new APIs added in parquet-variant will be user-facing.