-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Saner handling of nulls inside arrays #15149
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
5fe5bf3
to
57d105b
Compare
Thank you @joroKr21 ! Perhaps @Dandandan or @thinkharderdev or @avantgardnerio could help review this PR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems reasonable to me. I'm unclear on the coercion to variable length arrays though
ArrayFunctionArgument::Element, | ||
ArrayFunctionArgument::Array, | ||
], | ||
array_coercion: Some(ListCoercion::FixedSizedListToList), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tbh I'm not entirely clear on how the coercion comes into play but an append to a fixed size list would still be a fixed size list right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From a type signature PoV yes, it's possible to infer that the return type should also be a fixed-size list with size n + 1
but the implementation itself is not able to handle it:
pub(crate) fn array_append_inner(args: &[ArrayRef]) -> Result<ArrayRef> {
let [array, _] = take_function_args("array_append", args)?;
match array.data_type() {
DataType::LargeList(_) => general_append_and_prepend::<i64>(args, true),
_ => general_append_and_prepend::<i32>(args, true),
}
}
It's unclear to me how an implementation for FixedSizeList
would look like and I would say probably it's out of scope for this PR, it should be separate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally if the list size is changed like (append operation) we will convert it to List, so append with fixed-size-list ends up with List
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What we need is to provide customizable signature, so if they want List for append, they can add array_coercion: Some(ListCoercion::FixedSizedListToList)
, if they want to keep fixed-size-list they can add array_coercion: None
. Both cases should be possible.
For datafusion implementation, we choose to convert to List by default
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, that's why I'm setting array_coercion: Some(ListCoercion::FixedSizedListToList)
by default 👍
57d105b
to
0d49f0d
Compare
Also added a commit to fix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
@@ -1204,7 +1204,7 @@ select array_element([1, 2], NULL); | |||
---- | |||
NULL | |||
|
|||
query I | |||
query ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI @jayzhan211 and @goldmedal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @joroKr21 - I had a question about some test changes but I don't know a huge amount about this code so it may be ok too
|
||
query ? | ||
query error DataFusion error: Error during planning: Failed to unify argument types of array_union: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a regression? It looks like the query used to run but now doesn't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems so @joroKr21 can we keep null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that the argument types don't make sense. How are we supposed to union LargeList(List(Int64))
and LargeList(Int64)
?
match array_coercion { | ||
Some(ListCoercion::FixedSizedListToList) => (), | ||
None if fixed_size.is_none() => fixed_size = Some(*size), | ||
None if fixed_size == Some(*size) => (), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if there are more than one fixed size list types like fixed(4), fixed(3).
The valid types should be vec![fixed(4), fixed(3)]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we found more than 1 fixed size list, return list might be the easiest solution 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But return the current_type
makes more sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see what you mean. I will fix that. current_type
is not unified so we can't return that.
for arr in array.iter() { | ||
if arr.is_some() { | ||
data.push(Some(ndims)) | ||
builder.append_value(ndims) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember that builder is slower than UInt64Array::from(vec![])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, that sounds pretty bad
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I optimized it to use vec![ndims; array.len()]
and just clone the null buffer from the array
@@ -6232,12 +6244,12 @@ select array_intersect(arrow_cast([1, 1, 2, 2, 3, 3], 'LargeList(Int64)'), null) | |||
query ? | |||
select array_intersect(null, [1, 1, 2, 2, 3, 3]); | |||
---- | |||
NULL | |||
[] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it possible to keep null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Above we have
query ?
select array_intersect([1, 1, 2, 2, 3, 3], null);
----
[]
So I think it's better that array_intersect
is commutative. I can make both null
or both empty list but I think it should be the same if we flip the arguments. Which one should it be?
select array_concat( | ||
[arrow_cast('1', 'Utf8'), arrow_cast('2', 'Utf8')], | ||
[arrow_cast('3', 'Utf8View')] | ||
); | ||
---- | ||
[1, 2, 3] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be nice to show the type
select arrow_typeof(array_concat(
[arrow_cast('1', 'Utf8'), arrow_cast('2', 'Utf8')],
[arrow_cast('3', 'Utf8View')]
));
@@ -200,6 +199,7 @@ fn array_element_inner(args: &[ArrayRef]) -> Result<ArrayRef> { | |||
let [array, indexes] = take_function_args("array_element", args)?; | |||
|
|||
match &array.data_type() { | |||
Null => Ok(Arc::new(NullArray::new(array.len()))), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we return the null with List i64 type (a default one) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will write a test that shows the issue. But basically something like array_union(make_array(), ['x', 'y', 'z'])
will be an issue because we can't union a list of ints and a list of strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or in this case similar nested expression with array element
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this query as a test:
select coalesce(array_element([], 1), array_element(NULL, 1), 'ok');
When we return If this is problematic, can you explain the issue? |
1bb8aff
to
c0a769a
Compare
Is this one ready for merge? |
I see there's a conflict, need to resolve it |
One more thing, we need a decision if intersection with null should return null or empty array. I can see arguments for both, but the important thing for me is that it should be commutative - flipping the arguments shouldn't change the result. |
c0a769a
to
81dd050
Compare
I recommend following whatever DuckDB (or postgres do) -- there is not muchv alue in DataFusion having different semantics from other systems |
|
I agree that it should be commutative. In PG at least So I would voter for |
81dd050
to
44916a1
Compare
@thinkharderdev shall we merge this PR? |
I think it's good to go. There are still some open comments so I was waiting for another approval, but if we're all agreed then we can merge it |
44916a1
to
3e8f005
Compare
This missed the v47 train. Anything else needed to merge? |
@alamb @jayzhan211 Are you guys good with this? |
It appears the tests started failing on main after this PR was merged: |
Looks like a semantic merge conflict with #15160 |
The issue has been fixed. Thank you for the quick response |
* Saner handling of nulls inside arrays * Fix array_sort for empty record batch * Fix get_valid_types for FixedSizeLists * Optimize array_ndims * Add a test for result type of Concatenating Mixed types * Fix array_element of empty array * Handle more FixedSizeLists
Which issue does this PR close?
NULL
as an element #7142Rationale for this change
#10790 changed empty arrays to default to
Int64
but that is just a band-aid which creates other problems downstream, e.g. type errors when nesting expressions. It also introduced a bug (and possible panic) inarray_union
andarray_intersect
which are fixed here.Apart from that, from a type-theoretical PoV empty arrays should indeed be typed as
Array<Null>
becauseNull
is coercible to any other type and by extension (immutable arrays are covariant)Array<Null>
is coercible to any other array type.The crux of the problem is that Arrow is incapable of dealing with the mismatching array types that arise from this encoding. To workaround this limitation of Arrow, we unify array function argument types in advance and coerce them to the common type. In addition, we handle
DataType::Null
explicitly in Array functions for the case when all arguments are empty arrays or nulls.What changes are included in this PR?
We use the existing
type_union_resolution
type unification algorithm to handle array function arguments. This is a more robust and consistent approach as opposed to implementing custom coercion rules. Depending on the function we unify either the base types (because some functions work on arrays of arbitrary dimensions) or the array / element types. Then we coerce all arguments to the common type. We still need to handle the edge case when all arguments are nulls or empty arrays explicitly in the array function implementations.Are these changes tested?
Yes, I have amended the
array.slt
tests accordingly.Are there any user-facing changes?
Yes, there are some changes in array function semantics that can be seen in the updated tests:
array_union
andarray_intersect
(which are now also commutative)array_position(Array<Int64>, Utf8)
)I also extended the public API of
Signature
with more helper methods and by default it coerces fixed-size lists to non-fixed-size lists because most array functions don't preserve the arity of lists (e.g. append, prepend, concat, union, intersection, etc.)