codephage2020
Contributor

@codephage2020 codephage2020 commented Aug 8, 2025

Which issue does this PR close?

What changes are included in this PR?

Handles the "all null" case: adds a ShreddingState::AllNull variant for when neither value nor typed_value is present.

Are these changes tested?

Yes.

Are there any user-facing changes?

no.

Signed-off-by: codephage2020 <[email protected]>
@codephage2020 codephage2020 marked this pull request as ready for review August 9, 2025 03:10
#[test]
fn all_null_variant_array_construction() {
let metadata = BinaryViewArray::from(vec![b"test" as &[u8]; 3]);
let nulls = NullBuffer::from(vec![false, false, false]); // all null
Contributor

Are we sure that this case (where there is a value field present, but it is all null) should be treated as though there was no value field?

I haven't double checked the spec (probably the Arrow one) but this feels like it may be out of compliance

Contributor Author

As far as I know, the spec says that when both value and typed_value are null, "the value is missing."
We can see from the spec's example:
refer to here.

Contributor Author

Oh, I got it. I will add a test for the case.

Contributor Author

I added a comprehensive test that demonstrates when a value field exists in the schema but contains all null values, it correctly remains in the Unshredded state rather than AllNull.

Contributor

From what I understand, the treatment of NULL/NULL depends on context:

  • For a top-level variant value, it's interpreted as Variant::Null
  • For a shredded variant object field, it's interpreted as missing (SQL NULL)

So I guess there are two ways to get SQL NULL -- null mask on the struct(value, typed_value), or if both value and typed_value are themselves NULL?

Contributor

Ah -- the spec requires that the struct ("group") containing value and typed_value columns must be non-nullable:

The group for each named field must use repetition level required. A field's value and typed_value are set to null (missing) to indicate that the field does not exist in the variant. To encode a field that is present with a [variant/JSON, not SQL] null value, the value must contain a Variant null: basic type 0 (primitive) and physical type 0 (null).

So effectively, the NULL/NULL combo becomes the null mask for that nested field. Which is why a top-level NULL/NULL combo is incorrect -- the top-level field already has its own nullability.
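
To make the two readings of NULL/NULL concrete, here is a minimal sketch in plain Rust with no Arrow types. `Value`, `Context`, and `resolve` are hypothetical names invented for illustration and are not part of parquet-variant. The sketch also assumes a single integer `typed_value` and ignores partially shredded objects, so it prefers `typed_value` whenever it is set.

```rust
/// Simplified stand-in for a decoded variant value (hypothetical, not the crate's `Variant`).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    /// Variant null: basic type 0 (primitive), physical type 0 (null).
    VariantNull,
    Int(i64),
}

/// Where a (value, typed_value) pair lives (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Context {
    /// A value is required here, e.g. a top-level variant.
    Required,
    /// A named field of a shredded object.
    ObjectField,
}

/// Resolve one row. `None` means "the field does not exist in the variant".
fn resolve(ctx: Context, value: Option<Value>, typed_value: Option<i64>) -> Option<Value> {
    match (value, typed_value) {
        // Assumption: no partial shredding, so a typed value always wins.
        (_, Some(t)) => Some(Value::Int(t)),
        // Only the untyped column is set; this may itself be a Variant null.
        (Some(v), None) => Some(v),
        // NULL/NULL: the meaning depends on the context.
        (None, None) => match ctx {
            Context::Required => Some(Value::VariantNull),
            Context::ObjectField => None,
        },
    }
}
```

Under these assumptions, `resolve(Context::ObjectField, None, None)` yields `None` (missing field), while a field that is present with a JSON null has to be written as `Some(Value::VariantNull)` in `value`.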

Contributor

@alamb alamb left a comment

Thank you @codephage2020 -- this looks great -- I had one corner case question

@codephage2020 codephage2020 force-pushed the feat/shreddingstate-allnull-variant branch from bd43fbf to b5d4c5d Compare August 12, 2025 08:15
Contributor

@alamb alamb left a comment

Thank you @codephage2020 -- I think this looks good to me

FYI @scovich in case you would also like to review

Contributor

@scovich scovich left a comment

Lots of messiness in the shredding spec here... but I think the current code is probably ok-ish, because we only support top-level variant so far?

Main question is whether we should forbid the AllNull case at top level, since it's technically valid only for nested fields?

typed_value_to_variant(typed_value, index)
}
}
ShreddingState::AllNull { .. } => Variant::Null,
Contributor

This is tricky... see #8122 (comment)

  • For a top-level variant, null/null is illegal (tho returning Variant::Null is arguably a correct way to compensate)
  • For a shredded variant field, null/null means SQL NULL, and returning Variant::Null is arguably incorrect (causes SQL IS NULL operator to return FALSE). But we don't even have a way to return SQL NULL here (it would probably correspond to Option::None?)

Contributor Author

You are absolutely correct. Subsequently, we might need a value_opt() method that returns Option to properly handle SQL NULL semantics for shredded fields.

Comment on lines 445 to 446
// Should succeed and create an AllNull variant when neither value nor typed_value are present
let variant_array = VariantArray::try_new(Arc::new(array)).unwrap();
Contributor

By a strict reading of the spec, this should actually fail, because this ShreddingState does not represent a shredded object field, but rather represents a top-level variant value:

| value | typed_value | Meaning |
| --- | --- | --- |
| null | null | The value is missing; only valid for shredded object fields |

But maybe that's a validation VariantArray::try_new should perform, not ShreddingState::try_new?

Also, it would quickly become annoying if variant_get has to replace a missing or all-null value column with an all-Variant::Null column just to comply with the spec? Maybe that's why there's this additional tidbit?

If a Variant is missing in a context where a value is required, readers must return a Variant null (00): basic type 0 (primitive) and physical type 0 (null). For example, if a Variant is required (like measurement above) and both value and typed_value are null, the returned value must be 00 (Variant null).

@github-actions github-actions bot added the parquet-variant parquet-variant* crates label Aug 13, 2025
Signed-off-by: codephage2020 <[email protected]>
@codephage2020
Contributor Author

Thanks for the thorough review! I’ll go through all the comments carefully and respond later today.

@alamb
Contributor

alamb commented Aug 18, 2025

Thanks for the thorough review! I’ll go through all the comments carefully and respond later today.

@codephage2020 what is the status of this PR from your perspective? Your last comment is

Thanks for the thorough review! I’ll go through all the comments carefully and respond later today.

But I don't see any changes since then

This PR appears to have accumulated some conflicts; I am happy to help resolve them / get this PR merged but I wanted to check with you first

Signed-off-by: codephage2020 <[email protected]>
@codephage2020
Contributor Author

This PR appears to have accumulated some conflicts; I am happy to help resolve them / get this PR merged but I wanted to check with you first

First, I sincerely apologize for the delayed response. After carefully reviewing all the feedback, I initially struggled with determining the necessary changes and how to respond—I wasn’t entirely confident in my understanding, which led to the unintended hold-up. My apologies again for the wait.

I’ve now added some comments.

Additionally, I list some comments outlining potential next steps or action items in the following, FYI.

  1. VariantArray::value() to return Option to properly distinguish SQL NULL vs JSON null
  2. Add validation to reject top-level AllNull variants entirely
  3. Pass schema information to ShreddingState::try_new() to detect context
  4. Split AllNull handling into separate variants for top-level vs shredded contexts

Please feel free to correct or refine them—I’d greatly appreciate your guidance to ensure we move forward effectively.

@scovich
Contributor

scovich commented Aug 19, 2025

  1. VariantArray::value() to return Option to properly distinguish SQL NULL vs JSON null

4. Split AllNull handling into separate variants for top-level vs shredded contexts

My understanding is that the array's null buffer and is_null method should always say whether the value is SQL null, and the value method should always return something if invoked. So returning Variant::Null for null/null combo should always be correct, in that sense.
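
A minimal sketch of that contract, assuming a toy array type rather than the real `VariantArray`: `ToyVariantArray`, its fields, and the two-variant `Variant` enum below are hypothetical. SQL nullness is answered only by the validity mask, and `value` always returns something.

```rust
/// Toy stand-in for the crate's `Variant` (hypothetical, two cases only).
#[derive(Debug, Clone, PartialEq)]
enum Variant {
    Null,
    Int(i64),
}

/// Toy array: one validity flag and one optional payload per row (hypothetical).
struct ToyVariantArray {
    /// `true` means the row is valid, i.e. not SQL NULL.
    validity: Vec<bool>,
    /// `None` models the null/null combination of value and typed_value.
    rows: Vec<Option<i64>>,
}

impl ToyVariantArray {
    /// SQL nullness comes from the validity mask alone.
    fn is_null(&self, i: usize) -> bool {
        !self.validity[i]
    }

    /// Always returns a value; null/null is reported as `Variant::Null`.
    fn value(&self, i: usize) -> Variant {
        match self.rows[i] {
            Some(v) => Variant::Int(v),
            None => Variant::Null,
        }
    }
}
```

With this split, a caller that cares about SQL NULL checks `is_null` first, and `value` never needs to return an `Option`.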

When building a top-level variant, we could intersect the null masks of the value and typed_value fields, and verify that the result is strictly a subset of the variant's own null mask. Similar to the validation performed by StructArray for non-nullable fields inside a nullable struct.
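
The check described above could look roughly like the following sketch. It uses plain `bool` slices where `true` means "null" instead of Arrow null buffers, `check_top_level_nulls` is a hypothetical name, and it reads "subset" as an ordinary (not proper) subset, which is an assumption.

```rust
/// Verify that every row where both `value` and `typed_value` are null is also
/// null in the top-level variant's own mask. In all slices `true` means "null".
/// Returns the first offending row index on failure.
fn check_top_level_nulls(
    variant_null: &[bool],
    value_null: &[bool],
    typed_value_null: &[bool],
) -> Result<(), usize> {
    assert_eq!(variant_null.len(), value_null.len());
    assert_eq!(variant_null.len(), typed_value_null.len());
    for (row, ((&top, &v), &t)) in variant_null
        .iter()
        .zip(value_null)
        .zip(typed_value_null)
        .enumerate()
    {
        // Intersection of the two child masks: both columns are null here.
        let both_null = v && t;
        if both_null && !top {
            return Err(row);
        }
    }
    Ok(())
}
```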

When using e.g. variant_get to project a variant object field as a new VariantArray, we again need to intersect the null masks of the value and typed_value fields, to form the basis of the null mask for the resulting VariantArray. But we also need to union in all ancestors' null masks along the path variant_get followed to reach the field -- same as we'd have to do when projecting a field from a StructArray.
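
As a rough illustration of the projection case, the sketch below computes the resulting null mask from plain `bool` vectors where `true` means "null". `projected_nulls` is a hypothetical helper, not something `variant_get` exposes. With Arrow validity bitmaps (1 = valid) the same logic would be expressed with the opposite polarity.

```rust
/// Null mask for a projected field: a row is null if the field is missing
/// (value and typed_value both null) or if any ancestor on the path is null.
/// In all masks `true` means "null".
fn projected_nulls(
    ancestors: &[Vec<bool>],
    value_null: &[bool],
    typed_value_null: &[bool],
) -> Vec<bool> {
    assert_eq!(value_null.len(), typed_value_null.len());
    (0..value_null.len())
        .map(|row| {
            // Intersection of the two child masks.
            let missing = value_null[row] && typed_value_null[row];
            // Union over every ancestor's mask along the path.
            let ancestor_null = ancestors.iter().any(|mask| mask[row]);
            missing || ancestor_null
        })
        .collect()
}
```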

Kind of complicated/subtle... took a long time to wrap my head around even the StructArray version, let alone the VariantArray extensions.

@scovich
Contributor

scovich commented Aug 19, 2025

Actually... will we need to make use of Array::logical_nulls to handle these weird semantics, instead of simple null masks?

I don't love the fact that the logical_nulls method even needs to exist, but looking at the other cases it covers, it might be the correct home for this logical null variant logic as well?

@scovich
Contributor

scovich commented Aug 19, 2025

2. Add validation to reject top-level AllNull variants entirely

3. Pass schema information to ShreddingState::try_new() to detect context

Based on other conversations, we probably need different Array implementations to represent shredded struct fields and array elements, which are not VariantArray, with all three relying on an underlying ShreddingState to do the heavy lifting. If so, the constructor for each array implementation would do the validation for the null/null case, and ShreddingState can just accept all combos?
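
One way that layering might look, as a sketch only: all types below are hypothetical toys, the variant names other than `Unshredded` and `AllNull` are made up for illustration, and rejecting `AllNull` at top level reflects the proposed follow-up, not the behavior merged in this PR.

```rust
/// Toy shredding state keyed only on which columns exist (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq)]
enum ToyShreddingState {
    Unshredded,
    Typed,
    PartiallyShredded,
    AllNull,
}

impl ToyShreddingState {
    /// Accepts every combination; no context-specific validation here.
    fn new(has_value: bool, has_typed_value: bool) -> Self {
        match (has_value, has_typed_value) {
            (true, false) => Self::Unshredded,
            (false, true) => Self::Typed,
            (true, true) => Self::PartiallyShredded,
            (false, false) => Self::AllNull,
        }
    }
}

/// Top-level wrapper: its constructor rejects the null/null layout.
struct TopLevelArray(ToyShreddingState);

/// Shredded-field wrapper: null/null simply means "field missing".
struct ShreddedFieldArray(ToyShreddingState);

impl TopLevelArray {
    fn try_new(has_value: bool, has_typed_value: bool) -> Result<Self, String> {
        let state = ToyShreddingState::new(has_value, has_typed_value);
        if state == ToyShreddingState::AllNull {
            return Err("null/null is only valid for shredded object fields".to_string());
        }
        Ok(Self(state))
    }

    fn state(&self) -> ToyShreddingState {
        self.0
    }
}

impl ShreddedFieldArray {
    fn new(has_value: bool, has_typed_value: bool) -> Self {
        Self(ToyShreddingState::new(has_value, has_typed_value))
    }

    fn state(&self) -> ToyShreddingState {
        self.0
    }
}
```

The design choice here is that the shared state type stays permissive, so each wrapper can apply the rule that fits its own context.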

@alamb
Contributor

alamb commented Aug 19, 2025

Additionally, I list some comments outlining potential next steps or action items in the following, FYI.

  1. VariantArray::value() to return Option to properly distinguish SQL NULL vs JSON null

I agree with @scovich that VariantArray::value() should continue to return Variant (not Option<Variant>). For elements that were null (the validity mask was 0), VariantArray::value() should return the item in the array at that location (which should be Variant::Null in this case, though that may not happen today).

  2. Add validation to reject top-level AllNull variants entirely

👍

  3. Pass schema information to ShreddingState::try_new() to detect context
  4. Split AllNull handling into separate variants for top-level vs shredded contexts

I think this makes sense to me

Contributor

@alamb alamb left a comment

Thank you @codephage2020 and @scovich

In my opinion this PR is a step forward as it adds a needed VariantArray case.

I think we have identified several potential follow ons, but at this time I don't see anything that would prevent this PR from merging. I think it is a really nice step forward

},
/// All values are null, only metadata is present.
///
/// This state occurs when neither `value` nor `typed_value` fields exist in the schema.
Contributor

I think this is a reasonable approach for now. If we find a reason to restrict AllNull to only sub fields at some later point, then we can add additional code / restrictions

@alamb alamb merged commit 0d25340 into apache:main Aug 20, 2025
12 checks passed
@alamb
Contributor

alamb commented Aug 20, 2025

@codephage2020 please let me know if there are additional issues you would like me to file

Labels
parquet-variant parquet-variant* crates
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Variant] Implement ShreddingState::AllNull variant
3 participants