Foundation of API for reading Variant data and metadata #7535
Conversation
@alamb @PinkCrow007 FYI. This first draft only contains a small subset of the primitive types, so we can get feedback on that before implementing decoding and conversion for all primitive types. I tried to incorporate and address as many comments from alamb and scovich in #7452 as possible, but I can't promise that all have been addressed; please let me know.
This is great, thank you so much @mkarbo -- I left some comments but I think this is a super strong start.
Let's see if we get any more API suggestions.
I also made a PR for your consideration:
I'll also go and try to fix some of the examples upstream in parquet-testing.
```rust
#[test]
fn test_string() -> Result<(), ArrowError> {
```
FWIW I don't think these unit tests are necessary given we have code that is reading the values from parquet-testing.
Yeah, let me write these out a little more. I think having some hand-crafted tests (e.g., also for objects) where we see how the bytes look will be helpful for readers as documentation, not just for validation. We can add test variants using both the builder and the manual process.

I personally learnt a lot by writing out the metadata and value bytes with pen and paper for various variant data such as `42`, `<short string>`, `<longer string>`, `{'a': 1, 'b': 2, 'c': <uuid>}` and `{'a': 1, 'b': 'hello 💯', 'c': {'x': 99, 'y': true, 'z': <uuid>}}`, and I think that could help future maintainers get onboarded (I would update the tests with more in-depth docs). A worked example of this byte layout is sketched below.

However, we can also skip them if you guys prefer, let me know.
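For illustration, a minimal hand-worked case for the integer `42` might look like the following test. The byte values are derived from my reading of the Variant spec (a version-1 metadata header with 1-byte offsets, and int8 primitive type id 3), so treat this as a sketch to double-check against the spec rather than as authoritative:

```rust
#[test]
fn test_int8_byte_layout() {
    // Metadata with an empty dictionary:
    //   0x01 = header: version=1, sorted_strings=0, offset_size_minus_one=0
    //   0x00 = dictionary_size (one byte, since offset_size is 1)
    //   0x00 = the single offset entry terminating the (empty) string section
    let metadata: [u8; 3] = [0x01, 0x00, 0x00];

    // Value for 42 as an int8 primitive: the header byte packs the primitive
    // type id (3 for int8) shifted left by 2, with the "primitive" basic
    // type (0b00) in the low two bits, followed by the little-endian payload.
    let value_header = (3u8 << 2) | 0b00;
    assert_eq!(value_header, 0x0C);
    let value: [u8; 2] = [value_header, 42];

    assert_eq!(metadata, [0x01, 0x00, 0x00]);
    assert_eq!(value, [0x0C, 0x2A]);
}
```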
I think they are ok, maybe we just shouldn't go too crazy.
The other thing we could do is to make them examples in the documentation -- which would help both users and maintainers.
```rust
use strum_macros::EnumDiscriminants;

#[derive(Clone, Copy, Debug, PartialEq)]
/// Encodes the Variant Metadata, see the Variant spec file for more information
```
I think this API looks pretty good to me.
One thing that we might want to do is to make a `VariantMetadata::try_new` function that checks and parses out the sorted_strings, dict_len, etc. fields. To do anything with the metadata those fields will need to be decoded, so it seems like it would be a nicer API to compute and validate them once on construction rather than on each access.
+1
Also, pay the cost to construct it once and pass references to everyone else that needs it.
Something like this:
```rust
struct VariantMetadataHeader {
    version: u8,
    sorted: bool,
    offset_size_bytes: OffsetSizeBytes,
}

impl<'m> VariantMetadata<'m> {
    pub fn try_new(bytes: &'m [u8]) -> Result<Self, ArrowError> {
        if bytes.is_empty() {
            return Err(...);
        }
        let header = VariantMetadataHeader::try_new(bytes)?;
        let dictionary_size = header.offset_size_bytes.unpack_usize(bytes, 1, 0)?;
        // verify the offsets are monotonically increasing; offset i lives at
        // offset index i + 1 (index 0 is the dictionary_size field itself)
        let mut prev = 0;
        for i in 0..=dictionary_size {
            let offset = header.offset_size_bytes.unpack_usize(bytes, 1, i + 1)?;
            if offset < prev {
                return Err(...);
            }
            prev = offset;
        }
        Ok(Self { ... })
    }

    pub fn get(&self, i: usize) -> Result<&'m str, ArrowError> {
        // TODO: Should we memoize the offsets? There could be thousands of them...
        let start = self.header.offset_size_bytes.unpack_usize(self.bytes, self.first_offset, i)?;
        let end = self.header.offset_size_bytes.unpack_usize(self.bytes, self.first_offset, i + 1)?;
        let bytes = slice_from_slice(self.value_bytes, self.first_value + start..self.first_value + end)?;
        str::from_utf8(bytes).map_err(|_| invalid_utf8_err())
    }
}
```
The important thing, I suspect, is to verify that the offsets are monotonically increasing, so we don't risk panics due to a case where end < begin. Tho I suppose we could also check that in the individual accessors instead.
Actually, there are four different `XX_size_minus_one` situations in the variant spec (metadata dictionary offsets, array value offsets, object field ids, and object value offsets). We should probably factor out a helper class for working with them, since it's ~always the same:
```rust
#[derive(Clone, Copy)]
enum PackedUsizeArrayElementSize {
    One = 1,
    Two = 2,
    Three = 3,
    Four = 4,
}

impl PackedUsizeArrayElementSize {
    fn try_new(offset_size_minus_one: u8) -> Result<Self, ArrowError> {
        use PackedUsizeArrayElementSize::*;
        let result = match offset_size_minus_one {
            0 => One,
            1 => Two,
            2 => Three,
            3 => Four,
            _ => return Err(...),
        };
        Ok(result)
    }
}

struct PackedUsizeArray<'a> {
    element_size: PackedUsizeArrayElementSize,
    elements: &'a [u8],
    num_elements: usize,
}

impl<'a> PackedUsizeArray<'a> {
    /// NOTE: Only looks at the lower 2 bits of the element size!
    fn try_new(
        bytes: &'a [u8],
        byte_offset: usize,
        element_size: PackedUsizeArrayElementSize,
        num_elements: usize,
    ) -> Result<Self, ArrowError> {
        let len = (element_size as usize) * num_elements;
        Ok(Self {
            elements: slice_from_slice(bytes, byte_offset..byte_offset + len)?,
            element_size,
            num_elements,
        })
    }

    fn get(&self, i: usize) -> Result<usize, ArrowError> {
        self.element_size.unpack_usize(self.elements, 0, i)
    }
}
```
> Actually, there are four different `XX_size_minus_one` situations in the variant spec (metadata dictionary offsets, array value offsets, object field ids, and object value offsets). We should probably factor out a helper class for working with them, since it's ~always the same:
If we really want to obsess about performance, we could potentially make `VariantMetadata` templated on the offset size. I am not sure how much that will matter, so we probably shouldn't do it at this point.
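For concreteness, a rough sketch of what such templating could look like using const generics; all names here are hypothetical, and the layout is simplified to the metadata case of `header | dictionary_size | offsets...`:

```rust
struct TypedMetadata<'m, const OFFSET_SIZE: usize> {
    bytes: &'m [u8],
}

impl<'m, const OFFSET_SIZE: usize> TypedMetadata<'m, OFFSET_SIZE> {
    /// Reads offset `i`; the width is known at compile time, so the compiler
    /// can specialize this method instead of matching on an enum at each call.
    fn offset(&self, i: usize) -> Option<usize> {
        // Skip the 1-byte header plus the dictionary_size field, then index.
        let start = 1 + OFFSET_SIZE * (i + 1);
        let bytes = self.bytes.get(start..start.checked_add(OFFSET_SIZE)?)?;
        let mut buf = [0u8; 8]; // zero-extend up to 8 bytes (OFFSET_SIZE <= 4)
        buf[..OFFSET_SIZE].copy_from_slice(bytes);
        Some(u64::from_le_bytes(buf) as usize)
    }
}
```

The dispatch on offset size would then happen once, when choosing which instantiation to construct, rather than on every access.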
Yeah... and the use sites in object and array variant would anyway be the more likely performance bottleneck. Unless we want to define a different enum variant for each of those as well... 4*4 (object) + 4 (array) = 20 total combinations... seems excessive. A middle ground would be to special-case just the common ones (e.g. object where both sizes are 1 or both are 4; array where the size is 1 or 4). But it's still extra complexity for likely a very small gain. We'd be trading off more match arms in one place, in order to eliminate match arms somewhere nearby.
Should we open a ticket to explore this after this one? I would like to get the API ready so my team can work in parallel on the rest of the primitives and the optimisations needed.
parquet-variant/src/variant.rs (outdated):
```rust
self.buffer[(self.seen) * self.offset_size + 1 // +1 to skip header
    ..(self.seen) * self.offset_size + 1]
    .try_into()
    .ok()?,
```
This is interesting -- what should we do when we hit an offset that is an invalid usize? I think this code will basically return `None` (stop iterating) rather than erroring in this situation.

Maybe when someone calls `iter()` it would make sense to validate all the offsets then 🤔
I don't think the offsets are ever valid usizes? They occupy 1, 2, 3, or 4 bytes, depending on the metadata header's (2-bit) `offset_size_minus_one` field? We'll need to extract the correct number of bytes, pad it out to u32 (or usize), and then `try_into` + `from_le_bytes`?

Also, I think this code misses length checks needed for those subslice ranges to not panic? Or did I miss some earlier global check that covers the whole iteration?
We'd want some helper like this:
```rust
#[derive(Clone, Copy)]
enum OffsetSizeBytes {
    One = 1,
    Two = 2,
    Three = 3,
    Four = 4,
}

impl OffsetSizeBytes {
    fn try_new(offset_size_minus_one: u8) -> Result<Self, ArrowError> {
        use OffsetSizeBytes::*;
        let result = match offset_size_minus_one {
            0 => One,
            1 => Two,
            2 => Three,
            3 => Four,
            _ => return Err(...),
        };
        Ok(result)
    }

    fn unpack_usize(
        &self,
        bytes: &[u8],
        byte_offset: usize,  // how many bytes to skip
        offset_index: usize, // which offset in an array of offsets
    ) -> Result<usize, ArrowError> {
        use OffsetSizeBytes::*;
        let offset = byte_offset + (*self as usize) * offset_index;
        let result = match self {
            One => u8::from_le_bytes(array_from_slice(bytes, offset)?).into(),
            Two => u16::from_le_bytes(array_from_slice(bytes, offset)?).into(),
            Three => todo!(), // ugh, endianness
            Four => u32::from_le_bytes(array_from_slice(bytes, offset)?).try_into()?,
        };
        Ok(result)
    }
}
```
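One way the `Three` arm could be filled in later (my assumption, not part of the suggestion above) is to zero-extend three little-endian bytes into a `u32` before widening:

```rust
Three => {
    let b: [u8; 3] = array_from_slice(bytes, offset)?;
    u32::from_le_bytes([b[0], b[1], b[2], 0]) as usize
}
```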
For simplicity (and IIRC also speed), I believe this kind of enum-based dispatch is generally preferred in Rust, instead of function pointers?
Flushing comments at EOD...
parquet-variant/src/variant.rs (outdated):
```rust
Variant::ShortString(Cow::Borrowed(decoder::decode_short_string(value)?))
}

VariantType::Object => Variant::Object(VariantObject {
```
Probably? `VariantObject::try_new` and `VariantArray::try_new` should probably do basic header and offset monotonicity validation, similar to what I suggested for `VariantMetadata::try_new` (see comment above)?
parquet-variant/src/variant.rs (outdated):
```rust
impl<'m, 'v> Index<usize> for VariantArray<'m, 'v> {
    type Output = Variant<'m, 'v>;

    fn index(&self, _index: usize) -> &Self::Output {
```
I don't think this trait works, if it forces us to return a reference? We need to return a `Variant` by value, which wraps the referenced bytes.
I agree -- that will likely become obvious if we try to implement it.
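A simplified, self-contained illustration of the problem (the types below are stand-ins, not the PR's): `Index::index` must return a reference, but a decoded element only exists as a temporary.

```rust
struct PackedU16s<'a> {
    bytes: &'a [u8],
}

impl<'a> PackedU16s<'a> {
    // Returning by value works fine: decode a fresh u16 from the bytes.
    fn get(&self, i: usize) -> Option<u16> {
        let chunk = self.bytes.get(i * 2..i * 2 + 2)?;
        Some(u16::from_le_bytes([chunk[0], chunk[1]]))
    }
}

// impl<'a> std::ops::Index<usize> for PackedU16s<'a> {
//     type Output = u16;
//     fn index(&self, i: usize) -> &u16 {
//         &self.get(i).unwrap() // ERROR: returns a reference to a temporary
//     }
// }
```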
Test out avoiding strum and Cow
parquet-variant/src/decoder.rs (outdated):
```rust
/// Extracts the primitive type from a header byte
pub(crate) fn get_primitive_type(header: u8) -> Result<VariantPrimitiveType, ArrowError> {
```
@scovich's comments got me thinking we could implement this as `impl TryFrom<u8> for VariantPrimitiveType`, which I think would be more idiomatic Rust and would read better. Similarly for other APIs in this module.
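A sketch of what that could look like; the variant names and type-id values are assumptions based on the spec, not the PR's exact definitions:

```rust
impl TryFrom<u8> for VariantPrimitiveType {
    type Error = ArrowError;

    fn try_from(type_id: u8) -> Result<Self, Self::Error> {
        match type_id {
            0 => Ok(VariantPrimitiveType::Null),
            1 => Ok(VariantPrimitiveType::BooleanTrue),
            2 => Ok(VariantPrimitiveType::BooleanFalse),
            3 => Ok(VariantPrimitiveType::Int8),
            // ... remaining primitive type ids elided ...
            other => Err(ArrowError::InvalidArgumentError(format!(
                "unknown variant primitive type id: {other}"
            ))),
        }
    }
}
```

Call sites would then read as `VariantPrimitiveType::try_from(header >> 2)?` or similar.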
("primitive_string", Variant::from("This string is longer than 64 bytes and therefore does not fit in a short_string and it also includes several non ascii characters such as 🐢, 💖, ♥\u{fe0f}, 🎣 and 🤦!!")), | ||
// Using the From<String> trait | ||
("short_string", Variant::from("Less than 64 bytes (❤\u{fe0f} with utf8)")), | ||
// TODO Reenable when https://github.com/apache/parquet-testing/issues/81 is fixed |
BTW I made a PR to fix this
parquet-variant/src/variant.rs
Outdated
impl<'m, 'v> Index<usize> for VariantArray<'m, 'v> { | ||
type Output = Variant<'m, 'v>; | ||
|
||
fn index(&self, _index: usize) -> &Self::Output { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree --that will likely become obvious if we try to implement it
```rust
/// To be used in `map_err` when unpacking an integer from a slice of bytes.
fn map_try_from_slice_error(e: TryFromSliceError) -> ArrowError {
    ArrowError::InvalidArgumentError(e.to_string())
}
```
I think it would actually be better to improve this conversion a bit to explain that the error comes from trying to parse variants. If we just add a blanket `From` impl the code will just work, but users will see messages like `Invalid slice`, which will be pretty unspecific.
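A sketch of the more specific mapping being suggested (the exact wording is my assumption):

```rust
fn map_try_from_slice_error(e: TryFromSliceError) -> ArrowError {
    ArrowError::InvalidArgumentError(format!(
        "error decoding Variant data: failed to read integer from byte slice: {e}"
    ))
}
```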
I am refactoring the Array to have
I took the liberty of merging up from main and adding headers to the files to get CI to pass
TLDR is I think this PR is good enough to merge so we can begin making improvements (like running checks in the CI, adding docs, tests, other support, etc)
Thank you so much for spearheading it @mkarbo and for all the detailed feedback @scovich and @mapleFU .
In my opinion the only outstanding questions are some of the details of handling Objects and arrays but I think we can iterate on that in subsequent PRs as we haven't really committed to an API
What do you think @mkarbo and @scovich ? Is this good enough to merge and make follow on PRs?
FYI @PinkCrow007 -- if you have time to review this it would be most appreciated
parquet-variant/src/variant.rs (outdated):
"First offset is non-zero".to_string(), | ||
)); | ||
} | ||
if prev.is_some_and(|prev| prev >= offset) { |
I just found an interesting question: is an empty key valid here?
This code can probably be simplified, but meanwhile -- an empty `prev` means this is the first entry, and there is no previous entry it could disagree with.
I mean, what if `prev == offset`, which means the key is empty; could this scenario exist?
Yipes, good catch! The metadata dictionary encodes field names, and JSON definitely allows empty field names. Plus, the spec allows duplicates in an unordered dictionary (**), so we could even encounter more than one empty string.

(**) The spec says the following:

> If `sorted_strings` is set to 1, strings in the dictionary must be unique and sorted in lexicographic order. If the value is set to 0, readers may not make any assumptions about string order or uniqueness.

The spec also says:

> An object may not contain duplicate keys

... which would be "not fun" to enforce if the dictionary contained duplicate entries -- you could end up with two different field ids whose values compare equal.

I guess this opens yet another question about how much validation we should be performing? If it's sorted, do we need to verify that each string is lexicographically greater-than its predecessor?
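If empty keys (and, in the unsorted case, duplicate strings) are legal, the check in the snippet under review would presumably need to reject only strictly decreasing offsets. A sketch of the relaxed condition, based on that snippet:

```rust
// Equal adjacent offsets encode an empty dictionary string, which the spec
// permits, so only a strictly decreasing pair is invalid.
if prev.is_some_and(|prev| prev > offset) {
    return Err(ArrowError::InvalidArgumentError(
        "offsets are not monotonically increasing".to_string(),
    ));
}
```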
Currently in the C++ implementation, I just do some O(1) checks to prevent memory issues and trust the user to pass valid data. Validating whether a string dictionary is sorted is really tricky (maybe it's also tricky for field_ids in objects). I think a `Validate` function can be added but not run by default.

For empty keys, I don't know how shredding handles this. I've sent a mail to the mailing list, maybe we can clarify this in the coming days: https://lists.apache.org/thread/1shsf89w9dxgjp0d8kgr7dom6rh5sk9t
> I think a Validate function can be added but not run by default

I agree with this design idea -- as long as invalid data supplied by the user can not cause undefined behavior (crashes or memory unsafety), I think it is fine to elide expensive checks by default and provide a more expensive validation routine.

If you agree @scovich I will file a ticket to track the idea.
Agree -- a separate in-depth `validate` method sounds like a great way to balance efficiency with safety.

So for this PR, we just add basic O(1) validation and sanity checks (**), and leave the in-depth validation as a tracking ticket?

(**) I guess such checks must include the header, and probably also would check that the data section of bytes is not obviously too small? As in, verify that the first offset is 0 and that the last offset makes sense vs. the byte slice we have to work with?
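One possible shape for the opt-in deep validation discussed here, written as a standalone helper (all names are hypothetical):

```rust
use arrow_schema::ArrowError;

/// O(dictionary_size) deep validation: offsets in bounds and monotonic,
/// strings valid UTF-8, and (when the sorted bit is set) unique and ordered.
fn validate_dictionary(offsets: &[usize], strings: &[u8], sorted: bool) -> Result<(), ArrowError> {
    let mut prev_str: Option<&str> = None;
    for pair in offsets.windows(2) {
        let (start, end) = (pair[0], pair[1]);
        if start > end {
            return Err(ArrowError::InvalidArgumentError(
                "dictionary offsets are not monotonically increasing".to_string(),
            ));
        }
        let bytes = strings.get(start..end).ok_or_else(|| {
            ArrowError::InvalidArgumentError("dictionary offset out of bounds".to_string())
        })?;
        let s = std::str::from_utf8(bytes)
            .map_err(|e| ArrowError::InvalidArgumentError(e.to_string()))?;
        if sorted && prev_str.is_some_and(|p| p >= s) {
            return Err(ArrowError::InvalidArgumentError(
                "sorted dictionary strings must be unique and in lexicographic order".to_string(),
            ));
        }
        prev_str = Some(s);
    }
    Ok(())
}
```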
Did we file a ticket already?
Finished a review pass. I think more changes landed while I was reviewing tho.
```rust
impl<'m, 'v> From<&'v str> for Variant<'m, 'v> {
    fn from(value: &'v str) -> Self {
```
Hmm. I'm not sure these `From` conversions are actually helpful in the decoder path? Strings, in particular, can't just be any `&'v str` -- they have to reside in the correct location inside the value byte slice that comprises the overall variant. Even in the builder path, turning a string into a variant would be a bytewise copy.
We're getting very close. I agree the specifics of objects and arrays can be follow-on work. A few remaining items on my mind (some of which are comments that github oh-so-helpfully hides):

I suspect one more pass will address the code simplifications. Not sure how to resolve the other two most efficiently? Maybe others could chime in so we have more voices to consider?

+100 -- to both you and PinkCrow007 (who github won't let me mention for some reason)

Thanks all, I can give it another go tomorrow I think with all the comments.
LGTM, except a few older review comments might have been missed? Also a bunch of nits to consider either before merge or as follow-up work.

- Potential to have a `VariantMetadata::unpack` method: https://github.com/apache/arrow-rs/pull/7535/files#r2105478634
- Questionable utility of `impl<'v> From<&'v str> for Variant`: https://github.com/apache/arrow-rs/pull/7535/files#r2105513563
- Better for `VariantMetadata` to present a (narrow) `Vec<&str>`-like interface?
```rust
/// Extracts the basic type from a header byte
pub(crate) fn get_basic_type(header: u8) -> Result<VariantBasicType, ArrowError> {
```
nit: Do we ever expect to return `Err`? Seems infallible?

Also, out of curiosity, why is one a function and the other an `impl [Try]From` with a function as well?
```rust
/// Decodes a long string from the value section of a variant.
pub(crate) fn decode_long_string(value: &[u8]) -> Result<&str, ArrowError> {
    let len = u32::from_le_bytes(array_from_slice(value, 1)?) as usize;
```
nit:

```diff
-let len = u32::from_le_bytes(array_from_slice(value, 1)?) as usize;
+let len = OffsetSizeBytes::Four.unpack_usize(value, 1, 0)?;
```
```rust
let len = (first_byte_from_slice(value)? >> 2) as usize;

let string = string_from_slice(value, 1..1 + len)?;
```
nit (the suggestion removes the blank line between the two statements):

```diff
-let len = (first_byte_from_slice(value)? >> 2) as usize;
-
-let string = string_from_slice(value, 1..1 + len)?;
+let len = (first_byte_from_slice(value)? >> 2) as usize;
+let string = string_from_slice(value, 1..1 + len)?;
```
```rust
pub(crate) fn slice_from_slice<I: SliceIndex<[u8]> + Clone + Debug>(
```
nit: doc comment?

```diff
+/// Extracts a subslice from a slice without risk of panic
 pub(crate) fn slice_from_slice<I: SliceIndex<[u8]> + Clone + Debug>(
```

(more below)
```rust
use crate::utils::{array_from_slice, first_byte_from_slice, string_from_slice};

#[derive(Debug, Clone, Copy)]
pub enum VariantBasicType {
```
These seem pretty basic, should they be in the top-level lib.rs instead of here in decoder.rs?
```rust
/// Get a single offset by index
pub fn get_offset_by(&self, index: usize) -> Result<usize, ArrowError> {
```
Based on previous discussions, should this be `pub(crate)`?
```rust
// 1. num_elements + 1
let n_offsets = num_elements.checked_add(1).ok_or_else(overflow)?;

// 2. (num_elements + 1) * offset_size
let value_bytes = n_offsets
    .checked_mul(offset_size as usize)
    .ok_or_else(overflow)?;

// 3. first_offset_byte + ...
let first_value_byte = first_offset_byte
    .checked_add(value_bytes)
    .ok_or_else(overflow)?;
```
nit: Any particular reason this is spread out, instead of one chain like `Metadata::try_new` did?

```diff
-// 1. num_elements + 1
-let n_offsets = num_elements.checked_add(1).ok_or_else(overflow)?;
-
-// 2. (num_elements + 1) * offset_size
-let value_bytes = n_offsets
-    .checked_mul(offset_size as usize)
-    .ok_or_else(overflow)?;
-
-// 3. first_offset_byte + ...
-let first_value_byte = first_offset_byte
-    .checked_add(value_bytes)
-    .ok_or_else(overflow)?;
+// (num_elements + 1) * offset_size + first_offset_byte
+let first_value_byte = num_elements
+    .checked_add(1)
+    .and_then(|n| n.checked_mul(offset_size as usize))
+    .and_then(|n| n.checked_add(first_offset_byte))
+    .ok_or_else(overflow)?;
```
```rust
first_value_byte + start_field_offset_from_first_value_byte
    ..first_value_byte + end_field_offset_from_first_value_byte,
```
Is there any risk a malicious input could cause either of these to overflow on a 32-bit arch?
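If so, a defensive form of the range computation could use checked arithmetic, mirroring the `overflow` helper used earlier in this file (a sketch based on the names in the snippet above):

```rust
let start = first_value_byte
    .checked_add(start_field_offset_from_first_value_byte)
    .ok_or_else(overflow)?;
let end = first_value_byte
    .checked_add(end_field_offset_from_first_value_byte)
    .ok_or_else(overflow)?;
// then slice with `start..end` instead of computing the bounds inline
```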
```rust
pub fn metadata(&self) -> Option<&'m VariantMetadata> {
```
I can't find my previous github comment now... but it's still not clear to me when/how this method could be useful in practice?
```rust
impl<'m, 'v> From<&'v str> for Variant<'m, 'v> {
```
nit: I'm surprised clippy didn't warn about unnecessary named lifetimes:

```diff
-impl<'m, 'v> From<&'v str> for Variant<'m, 'v> {
+impl<'v> From<&'v str> for Variant<'_, 'v> {
```
You're too fast @scovich! I am not done yet 😉
Hehe sorry... this is just too exciting. |
Which issue does this PR close?
Closes #7423
Rationale for this change
We need to agree on an API for reading Variant metadata. Based on the work and discussions in #7452, in this PR we propose an API plus an implementation (WIP while draft) for reading variant metadata in the parquet-variant crate.
A lot of the work is based on the work in #7452 by @PinkCrow007 and feedback from @alamb, @scovich, and @Weijun-H.
What changes are included in this PR?
We attempt to be result- and validation-driven while ensuring zero allocations, and we do so while avoiding `serde_json`. We tried to keep the Variant API similar to the `Json::Value` API.

Are there any user-facing changes?

The new APIs added in parquet-variant will be user-facing.