Skip to content

Saner handling of nulls inside arrays #15149

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Apr 29, 2025

Conversation

joroKr21
Copy link
Contributor

@joroKr21 joroKr21 commented Mar 11, 2025

Which issue does this PR close?

Rationale for this change

#10790 changed empty arrays to default to Int64 but that is just a band-aid which creates other problems downstream, e.g. type errors when nesting expressions. It also introduced a bug (and possible panic) in array_union and array_intersect which are fixed here.

Apart from that, from a type-theoretical PoV empty arrays should indeed be typed as Array<Null> because Null is coercible to any other type and by extension (immutable arrays are covariant) Array<Null> is coercible to any other array type.

The crux of the problem is that Arrow is incapable of dealing with the mismatching array types that arise from this encoding. To workaround this limitation of Arrow, we unify array function argument types in advance and coerce them to the common type. In addition, we handle DataType::Null explicitly in Array functions for the case when all arguments are empty arrays or nulls.

What changes are included in this PR?

We use the existing type_union_resolution type unification algorithm to handle array function arguments. This is a more robust and consistent approach as opposed to implementing custom coercion rules. Depending on the function we unify either the base types (because some functions work on arrays of arbitrary dimensions) or the array / element types. Then we coerce all arguments to the common type. We still need to handle the edge case when all arguments are nulls or empty arrays explicitly in the array function implementations.

Are these changes tested?

Yes, I have amended the array.slt tests accordingly.

Are there any user-facing changes?

Yes, there are some changes in array function semantics that can be seen in the updated tests:

  • More expressions involving empty arrays and null arrays now work out of the box
  • Fixed bugs in array_union and array_intersect (which are now also commutative)
  • Some type-incorrect expressions no longer compile (e.g. array_position(Array<Int64>, Utf8))

I also extended the public API of Signature with more helper methods and by default it coerces fixed-size lists to non-fixed-size lists because most array functions don't preserve the arity of lists (e.g. append, prepend, concat, union, intersection, etc.)

@github-actions github-actions bot added logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt) labels Mar 11, 2025
@joroKr21 joroKr21 force-pushed the bugfix/array-nulls branch 6 times, most recently from 5fe5bf3 to 57d105b Compare March 11, 2025 10:25
@joroKr21 joroKr21 marked this pull request as ready for review March 11, 2025 11:08
@alamb
Copy link
Contributor

alamb commented Mar 13, 2025

Thank you @joroKr21 !

Perhaps @Dandandan or @thinkharderdev or @avantgardnerio could help review this PR

Copy link
Contributor

@thinkharderdev thinkharderdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable to me. I'm unclear on the coercion to variable length arrays though

ArrayFunctionArgument::Element,
ArrayFunctionArgument::Array,
],
array_coercion: Some(ListCoercion::FixedSizedListToList),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tbh I'm not entirely clear on how the coercion comes into play but an append to a fixed size list would still be a fixed size list right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a type signature PoV yes, it's possible to infer that the return type should also be a fixed-size list with size n + 1 but the implementation itself is not able to handle it:

pub(crate) fn array_append_inner(args: &[ArrayRef]) -> Result<ArrayRef> {
    let [array, _] = take_function_args("array_append", args)?;

    match array.data_type() {
        DataType::LargeList(_) => general_append_and_prepend::<i64>(args, true),
        _ => general_append_and_prepend::<i32>(args, true),
    }
}

It's unclear to me how an implementation for FixedSizeList would look like and I would say probably it's out of scope for this PR, it should be separate.

Copy link
Contributor

@jayzhan211 jayzhan211 Mar 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally if the list size is changed like (append operation) we will convert it to List, so append with fixed-size-list ends up with List

Copy link
Contributor

@jayzhan211 jayzhan211 Mar 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What we need is to provide customizable signature, so if they want List for append, they can add array_coercion: Some(ListCoercion::FixedSizedListToList), if they want to keep fixed-size-list they can add array_coercion: None. Both cases should be possible.

For datafusion implementation, we choose to convert to List by default

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that's why I'm setting array_coercion: Some(ListCoercion::FixedSizedListToList) by default 👍

@joroKr21 joroKr21 force-pushed the bugfix/array-nulls branch from 57d105b to 0d49f0d Compare March 13, 2025 16:02
@joroKr21
Copy link
Contributor Author

Also added a commit to fix array_sort for empty record batches

Copy link
Contributor

@thinkharderdev thinkharderdev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@@ -1204,7 +1204,7 @@ select array_element([1, 2], NULL);
----
NULL

query I
query ?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @joroKr21 - I had a question about some test changes but I don't know a huge amount about this code so it may be ok too


query ?
query error DataFusion error: Error during planning: Failed to unify argument types of array_union:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a regression? It looks like the query used to run but now doesn't.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems so @joroKr21 can we keep null?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that the argument types don't make sense. How are we supposed to union LargeList(List(Int64)) and LargeList(Int64)?

match array_coercion {
Some(ListCoercion::FixedSizedListToList) => (),
None if fixed_size.is_none() => fixed_size = Some(*size),
None if fixed_size == Some(*size) => (),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if there are more than one fixed size list types like fixed(4), fixed(3).

The valid types should be vec![fixed(4), fixed(3)]

Copy link
Contributor

@jayzhan211 jayzhan211 Mar 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we found more than 1 fixed size list, return list might be the easiest solution 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But return the current_type makes more sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see what you mean. I will fix that. current_type is not unified so we can't return that.

for arr in array.iter() {
if arr.is_some() {
data.push(Some(ndims))
builder.append_value(ndims)
Copy link
Contributor

@jayzhan211 jayzhan211 Mar 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember that builder is slower than UInt64Array::from(vec![])

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, that sounds pretty bad

Copy link
Contributor Author

@joroKr21 joroKr21 Mar 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I optimized it to use vec![ndims; array.len()] and just clone the null buffer from the array

@@ -6232,12 +6244,12 @@ select array_intersect(arrow_cast([1, 1, 2, 2, 3, 3], 'LargeList(Int64)'), null)
query ?
select array_intersect(null, [1, 1, 2, 2, 3, 3]);
----
NULL
[]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to keep null?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Above we have

query ?
select array_intersect([1, 1, 2, 2, 3, 3], null);
----
[]

So I think it's better that array_intersect is commutative. I can make both null or both empty list but I think it should be the same if we flip the arguments. Which one should it be?

select array_concat(
[arrow_cast('1', 'Utf8'), arrow_cast('2', 'Utf8')],
[arrow_cast('3', 'Utf8View')]
);
----
[1, 2, 3]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be nice to show the type

select arrow_typeof(array_concat(
  [arrow_cast('1', 'Utf8'), arrow_cast('2', 'Utf8')],
  [arrow_cast('3', 'Utf8View')]
));

@@ -200,6 +199,7 @@ fn array_element_inner(args: &[ArrayRef]) -> Result<ArrayRef> {
let [array, indexes] = take_function_args("array_element", args)?;

match &array.data_type() {
Null => Ok(Arc::new(NullArray::new(array.len()))),
Copy link
Contributor

@jayzhan211 jayzhan211 Mar 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we return the null with List i64 type (a default one) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will write a test that shows the issue. But basically something like array_union(make_array(), ['x', 'y', 'z']) will be an issue because we can't union a list of ints and a list of strings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or in this case similar nested expression with array element

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this query as a test:
select coalesce(array_element([], 1), array_element(NULL, 1), 'ok');

@jayzhan211
Copy link
Contributor

jayzhan211 commented Mar 15, 2025

When we return Null, can we return with type other than null, for example List::i64 for list type, i64 for non-list type. I guess this reduces the null case we need to handle.

If this is problematic, can you explain the issue?

@joroKr21 joroKr21 force-pushed the bugfix/array-nulls branch 3 times, most recently from 1bb8aff to c0a769a Compare March 16, 2025 09:22
@alamb
Copy link
Contributor

alamb commented Mar 19, 2025

Is this one ready for merge?

@joroKr21
Copy link
Contributor Author

I see there's a conflict, need to resolve it

@joroKr21
Copy link
Contributor Author

One more thing, we need a decision if intersection with null should return null or empty array. I can see arguments for both, but the important thing for me is that it should be commutative - flipping the arguments shouldn't change the result.

@joroKr21 joroKr21 force-pushed the bugfix/array-nulls branch from c0a769a to 81dd050 Compare March 19, 2025 20:51
@alamb
Copy link
Contributor

alamb commented Mar 19, 2025

One more thing, we need a decision if intersection with null should return null or empty array. I can see arguments for both, but the important thing for me is that it should be commutative - flipping the arguments shouldn't change the result.

I recommend following whatever DuckDB (or postgres do) -- there is not muchv alue in DataFusion having different semantics from other systems

@joroKr21
Copy link
Contributor Author

joroKr21 commented Mar 20, 2025

I recommend following whatever DuckDB (or postgres do) -- there is not muchv alue in DataFusion having different semantics from other systems

@thinkharderdev
Copy link
Contributor

I recommend following whatever DuckDB (or postgres do) -- there is not muchv alue in DataFusion having different semantics from other systems

I agree that it should be commutative. In PG at least array_cat is commutative with array_cat(ARRAY[1,2,3],NULL) <=> array_cat(NULL, ARRAY[1,2,3]) <=> {1,2,3}

So I would voter for list_intersect with NULL arg to return empty list.

@joroKr21 joroKr21 force-pushed the bugfix/array-nulls branch from 81dd050 to 44916a1 Compare March 24, 2025 07:55
@alamb
Copy link
Contributor

alamb commented Mar 30, 2025

@thinkharderdev shall we merge this PR?

@thinkharderdev
Copy link
Contributor

@thinkharderdev shall we merge this PR?

I think it's good to go. There are still some open comments so I was waiting for another approval, but if we're all agreed then we can merge it

@joroKr21 joroKr21 force-pushed the bugfix/array-nulls branch from 44916a1 to 3e8f005 Compare April 6, 2025 07:23
@joroKr21
Copy link
Contributor Author

This missed the v47 train. Anything else needed to merge?

@thinkharderdev
Copy link
Contributor

This missed the v47 train. Anything else needed to merge?

@alamb @jayzhan211 Are you guys good with this?

@thinkharderdev thinkharderdev merged commit e44ae0a into apache:main Apr 29, 2025
27 checks passed
@thinkharderdev thinkharderdev deleted the bugfix/array-nulls branch April 29, 2025 20:29
@alamb
Copy link
Contributor

alamb commented Apr 29, 2025

It appears the tests started failing on main after this PR was merged:

@joroKr21
Copy link
Contributor Author

Looks like a semantic merge conflict with #15160

@joroKr21
Copy link
Contributor Author

#15899

@alamb
Copy link
Contributor

alamb commented Apr 30, 2025

The issue has been fixed. Thank you for the quick response

nirnayroy pushed a commit to nirnayroy/datafusion that referenced this pull request May 2, 2025
* Saner handling of nulls inside arrays

* Fix array_sort for empty record batch

* Fix get_valid_types for FixedSizeLists

* Optimize array_ndims

* Add a test for result type of Concatenating Mixed types

* Fix array_element of empty array

* Handle more FixedSizeLists
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants