-
Notifications
You must be signed in to change notification settings - Fork 1k
[Varint] Implement ShreddingState::AllNull variant #8093
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Varint] Implement ShreddingState::AllNull variant #8093
Conversation
…t handling Signed-off-by: codephage2020 <[email protected]>
Signed-off-by: codephage2020 <[email protected]>
#[test] | ||
fn all_null_variant_array_construction() { | ||
let metadata = BinaryViewArray::from(vec![b"test" as &[u8]; 3]); | ||
let nulls = NullBuffer::from(vec![false, false, false]); // all null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we sure that this case (where there is a value
field present, but it is all null) should be treated as though there was no value
field?
I haven't double checked the spec (probably the Arrow one) but this feels like it may be out of compliance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I know, the spec's requirement that "both value and typed_value are null" meaning "the value is missing."
We can see from the spec's example:
refer to here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I got it. I will add a test for the case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a comprehensive test that demonstrates when a value field exists in the schema but contains all null values, it correctly remains in the Unshredded state rather than AllNull.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I understand, the treatment of NULL/NULL depends on context:
- For a top-level variant value, it's interpreted as
Variant::Null
- For a shredded variant object field, it's interpreted as missing (SQL NULL)
So I guess there are two ways to get SQL NULL -- null mask on the struct(value, typed_value), or if both value and typed_value are themselves NULL?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah -- the spec requires that the struct ("group") containing value
and typed_value
columns must be non-nullable:
The group for each named field must use repetition level
required
. A field'svalue
andtyped_value
are set to null (missing) to indicate that the field does not exist in the variant. To encode a field that is present with a [variant/JSON, not SQL]null
value, thevalue
must contain a Variant null: basic type 0 (primitive) and physical type 0 (null).
So effectively, the NULL/NULL combo becomes the null mask for that nested field. Which is why a top-level NULL/NULL combo is incorrect -- the top-level field already has its own nullability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @codephage2020 -- this looks great -- I had one corner case question
…t handling Signed-off-by: codephage2020 <[email protected]>
Signed-off-by: codephage2020 <[email protected]>
bd43fbf
to
b5d4c5d
Compare
Signed-off-by: codephage2020 <[email protected]>
Signed-off-by: codephage2020 <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @codephage2020 -- I think this looks good to me
FYI @scovich in case you would also like to review
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lots of messiness in the shredding spec here... but I think the current code is probably ok-ish, because we only support top-level variant so far?
Main question is whether we should forbid the AllNull
case at top level, since it's technically valid only for nested fields?
typed_value_to_variant(typed_value, index) | ||
} | ||
} | ||
ShreddingState::AllNull { .. } => Variant::Null, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is tricky... see #8122 (comment)
- For a top-level variant, null/null is illegal (tho returning
Variant::Null
is arguably a correct way to compensate) - For a shredded variant field, null/null means SQL NULL, and returning
Variant::Null
is arguably incorrect (causes SQLIS NULL
operator to returnFALSE
). But we don't even have a way to return SQL NULL here (it would probably correspond toOption::None
?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are absolutely correct. Subsequently, we might need a value_opt() method that returns Option to properly handle SQL NULL semantics for shredded fields.
// Should succeed and create an AllNull variant when neither value nor typed_value are present | ||
let variant_array = VariantArray::try_new(Arc::new(array)).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By a strict reading of the spec, this should actually fail, because this ShreddingState
does not represent a shredded object field, but rather represents a top-level variant value:
value typed_value Meaning null null The value is missing; only valid for shredded object fields
But maybe that's a validation VariantArray::try_new
should perform, not ShreddingState::try_new
?
Also, it would quickly become annoying if variant_get
has to replace a missing or all-null value
column with an all-Variant::Null
column just to comply with the spec? Maybe that's why there's this additional tidbit?
If a Variant is missing in a context where a value is required, readers must return a Variant null (
00
): basic type 0 (primitive) and physical type 0 (null). For example, if a Variant is required (likemeasurement
above) and bothvalue
andtyped_value
are null, the returned value must be00
(Variant null).
#[test] | ||
fn all_null_variant_array_construction() { | ||
let metadata = BinaryViewArray::from(vec![b"test" as &[u8]; 3]); | ||
let nulls = NullBuffer::from(vec![false, false, false]); // all null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I understand, the treatment of NULL/NULL depends on context:
- For a top-level variant value, it's interpreted as
Variant::Null
- For a shredded variant object field, it's interpreted as missing (SQL NULL)
So I guess there are two ways to get SQL NULL -- null mask on the struct(value, typed_value), or if both value and typed_value are themselves NULL?
#[test] | ||
fn all_null_variant_array_construction() { | ||
let metadata = BinaryViewArray::from(vec![b"test" as &[u8]; 3]); | ||
let nulls = NullBuffer::from(vec![false, false, false]); // all null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah -- the spec requires that the struct ("group") containing value
and typed_value
columns must be non-nullable:
The group for each named field must use repetition level
required
. A field'svalue
andtyped_value
are set to null (missing) to indicate that the field does not exist in the variant. To encode a field that is present with a [variant/JSON, not SQL]null
value, thevalue
must contain a Variant null: basic type 0 (primitive) and physical type 0 (null).
So effectively, the NULL/NULL combo becomes the null mask for that nested field. Which is why a top-level NULL/NULL combo is incorrect -- the top-level field already has its own nullability.
Co-authored-by: Ryan Johnson <[email protected]> Co-authored-by: Ryan Johnson <[email protected]>
Signed-off-by: codephage2020 <[email protected]>
Thanks for the thorough review! I’ll go through all the comments carefully and respond later today. |
@codephage2020 what is the status of this PR from your perpsective? Your last comment is
But I don't see any changes since then This PR appears to have accumulated some conflicts; I am happy to help resolve them / get this PR merged but I wanted to check with you first |
Signed-off-by: codephage2020 <[email protected]>
First, I sincerely apologize for the delayed response. After carefully reviewing all the feedback, I initially struggled with determining the necessary changes and how to respond—I wasn’t entirely confident in my understanding, which led to the unintended hold-up. My apologies again for the wait. I’ve now added some comments. Additionally, I list some comments outlining potential next steps or action items in the following, FYI.
Please feel free to correct or refine them—I’d greatly appreciate your guidance to ensure we move forward effectively. |
My understanding is that the array's null buffer and When building a top-level variant, we could intersect the null masks of the When using e.g. Kind of complicated/subtle... took a long time to wrap my head around even the |
Actually... will we need to make use of Array::logical_nulls to handle these weird semantics, instead of simple null masks? I don't love the fact that the |
Based on other conversations, we probably need different Array implementations to represent shredded struct fields and array elements, which are not |
I agree with @scovich that VariantArray::value() should continue to return
👍
I think this makes sense to me |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @codephage2020 and @scovich
In my opinion this PR is a step forward as it adds a needed VariantArray case.
I think we have identified several potential follow ons, but at this time I don't see anything that would prevent this PR from merging. I think it is a really nice step forward
}, | ||
/// All values are null, only metadata is present. | ||
/// | ||
/// This state occurs when neither `value` nor `typed_value` fields exist in the schema. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a reasonable approach for now. If we find a reason to restrict AllNull to only sub fields at some later point, then we can add additional code / restrictions
@codephage2020 please let me know if there are additional issues you would like me to file |
Which issue does this PR close?
ShreddingState::AllNull
variant #8088 .What changes are included in this PR?
handles the "all null" case
Are these changes tested?
Yes.
Are there any user-facing changes?
no.