Contributor

@shehabgamin shehabgamin commented Mar 11, 2025

Which issue does this PR close?

Rationale for this change

See discussion in #5600

TL;DR Many projects want Spark-compatible expressions for use with DataFusion. There are some in Comet and there are some in the Sail project.

What changes are included in this PR?

Adding Spark crate.

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the functions Changes to functions implementation label Mar 11, 2025
@shehabgamin shehabgamin marked this pull request as ready for review March 13, 2025 02:15
@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Mar 13, 2025
Contributor Author

Testing instructions here

# datafusion-spark: Spark-compatible Expressions

This crate provides Apache Spark-compatible expressions for use with DataFusion.


Contributor Author

Example test here

Member

The expm1 probably works identically across all Spark versions and is not affected by different configuration settings, but many expressions are affected by settings such as ANSI mode and different date/time formats and timezones.

How do you think we should handle these different cases with this test approach?

Member

A related question - There will be some functions that we can support with 100% compatibility and some that we cannot. It would be good to think about how we express that.

Member

Sorry for all the questions, but I am really excited about this 😄 ... what would be involved in being able to run these same sqllogic test files in Spark (either in CI or manually locally) to confirm same/similar results

Contributor Author

These are great questions and really good things to discuss. I'm about to go into a meeting but I have a bunch of thoughts that I'll share afterwards in a couple of hours.

Contributor Author

The expm1 probably works identically across all Spark versions and is not affected by different configuration settings, but many expressions are affected by settings such as ANSI mode and different date/time formats and timezones.

How do you think we should handle these different cases with this test approach?

In the Sail code base, auxiliary information is passed into new() and stored within the struct. For example:
https://github.com/lakehq/sail/blob/be54362ce8bd79bfb3e55214c5a1b80c2c3d2492/crates/sail-plan/src/extension/function/datetime/timestamp_now.rs#L9-L32

#[derive(Debug)]
pub struct TimestampNow {
    signature: Signature,
    timezone: Arc<str>,
    time_unit: TimeUnit,
}

impl TimestampNow {
    pub fn new(timezone: Arc<str>, time_unit: TimeUnit) -> Self {
        Self {
            signature: Signature::nullary(Volatility::Stable),
            timezone,
            time_unit,
        }
    }

    pub fn timezone(&self) -> &str {
        &self.timezone
    }

    pub fn time_unit(&self) -> &TimeUnit {
        &self.time_unit
    }
}

And then in our PhysicalExtensionCodec we can do the following:
https://github.com/lakehq/sail/blob/be54362ce8bd79bfb3e55214c5a1b80c2c3d2492/crates/sail-execution/src/codec.rs#L946-L953

if let Some(func) = node.inner().as_any().downcast_ref::<TimestampNow>() {
            let timezone = func.timezone().to_string();
            let time_unit: gen_datafusion_common::TimeUnit = func.time_unit().into();
            let time_unit = time_unit.as_str_name().to_string();
            UdfKind::TimestampNow(gen::TimestampNowUdf {
                timezone,
                time_unit,
            })

If we decide to not use sqllogictest (per #15168 (comment)) then we will have no problem testing UDFs with auxiliary information. There are already tests in DataFusion core for this type of pattern as well:

async fn test_parameterized_scalar_udf() -> Result<()> {
    let batch = RecordBatch::try_from_iter([(
        "text",
        Arc::new(StringArray::from(vec!["foo", "bar", "foobar", "barfoo"])) as ArrayRef,
    )])?;
    let ctx = SessionContext::new();
    ctx.register_batch("t", batch)?;
    let t = ctx.table("t").await?;
    let foo_udf = ScalarUDF::from(MyRegexUdf::new("fo{2}"));
    let bar_udf = ScalarUDF::from(MyRegexUdf::new("[Bb]ar"));
    let plan = LogicalPlanBuilder::from(t.into_optimized_plan()?)
        .filter(
            foo_udf
                .call(vec![col("text")])
                .and(bar_udf.call(vec![col("text")])),
        )?
        .filter(col("text").is_not_null())?
        .build()?;
    assert_eq!(
        format!("{plan}"),
        "Filter: t.text IS NOT NULL\n Filter: regex_udf(t.text) AND regex_udf(t.text)\n TableScan: t projection=[text]"
    );
    let actual = DataFrame::new(ctx.state(), plan).collect().await?;
    let expected = [
        "+--------+",
        "| text   |",
        "+--------+",
        "| foobar |",
        "| barfoo |",
        "+--------+",
    ];
    assert_batches_eq!(expected, &actual);
    ctx.deregister_table("t")?;
    Ok(())
}

Contributor Author

A related question - There will be some functions that we can support with 100% compatibility and some that we cannot. It would be good to think about how we express that.

Throw errors when possible, and provide documentation. Depending on how shaky the compatibility for the function is, we may want to avoid implementing it altogether.

Contributor Author

... what would be involved in being able to run these same sqllogic test files in Spark (either in CI or manually locally) to confirm same/similar results

One option:

  1. A Python script to automatically generate Spark SQL function test cases and their results using PySpark.
  2. A README for developers explaining how to run the script and commit the test cases.
  3. An optional CI workflow to verify the correctness of the test cases' ground truth on demand.

Separate topic... Do you have ideas about fuzzy testing and its suitability in DataFusion?

Member

Here is my suggestion.

  1. Write a Python script that generates interesting test data (probably in Parquet format) with edge cases using PySpark
  2. Create files containing SQL queries that operate on these test files
  3. Write a Python script to run those queries via PySpark and write results out to file
  4. Write a Rust script to run those queries using datafusion-spark and write results out to file
  5. Write a script that can compare the Spark and datafusion-spark output and report on any differences
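
A minimal sketch of the comparison in step 5, assuming both runners write one result row per line in the same order. The names `Mismatch` and `diff_results` are hypothetical and are not part of this PR or of DataFusion.

```rust
/// One difference between the two result dumps (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct Mismatch {
    /// 1-based line number where the outputs diverge.
    pub line: usize,
    /// Row produced by Spark, or `None` if Spark produced fewer rows.
    pub spark: Option<String>,
    /// Row produced by datafusion-spark, or `None` if it produced fewer rows.
    pub datafusion: Option<String>,
}

/// Compare two result dumps line by line and collect every difference.
/// Assumption: rows are emitted in the same order by both engines.
pub fn diff_results(spark: &str, datafusion: &str) -> Vec<Mismatch> {
    let mut spark_lines = spark.lines();
    let mut df_lines = datafusion.lines();
    let mut mismatches = Vec::new();
    let mut line = 0;
    loop {
        line += 1;
        match (spark_lines.next(), df_lines.next()) {
            // Both inputs are exhausted: done.
            (None, None) => break,
            // Identical rows: nothing to report.
            (s, d) if s == d => {}
            // Different rows, or one side ran out of rows early.
            (s, d) => mismatches.push(Mismatch {
                line,
                spark: s.map(str::to_string),
                datafusion: d.map(str::to_string),
            }),
        }
    }
    mismatches
}
```

An exact string comparison like this would still flag the last-digit floating point differences discussed elsewhere in this thread, so float columns would likely need a tolerance-based check instead.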

Contributor

@alamb alamb Mar 16, 2025

The expm1 probably works identically across all Spark versions and is not affected by different configuration settings, but many expressions are affected by settings such as ANSI mode and different date/time formats and timezones.

I believe @Omega359 has a proposal here of how to thread the config options into the arguments of the functions (for the same reason)

Contributor Author

Example test here

};
}

#[test]
Contributor Author

@shehabgamin shehabgamin Mar 13, 2025

Example invoke test here. Direct invocation tests should only be used to verify that the function is correctly implemented. Further tests are required in the sqllogictest crate (examples for ascii can be found in this PR).

Contributor

My personal preference is to test them all from .slt rather than have any rust based tests unless there is something that can not be tested from .slt

For the different string types, we could perhaps cover the different string types using the same pattern as normal string tests -- see https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/string/README.md

However, I don't think this is required

};
}

#[test]
Contributor Author

Example invoke test here. Direct invocation tests should only be used to verify that the function is correctly implemented. Further tests are required in the sqllogictest crate (examples for expm1 can be found in this PR).

Contributor

Because sqllogictests are so much faster to write and update, I suggest we point people towards using sqllogictests to test the functions unless there is something that can not be tested using .slt files

Contributor Author

@andygrove and I found some correctness issues with sqllogictests. Specifically, we found issues with testing the correctness of floating point results.

The idea was to do something like this: #15168 (comment)

@alamb What are your thoughts on this? Should we perhaps use sqllogictest as long as we're not testing floating point results and as long as the function being tested is not configurable?

Contributor

I think the underlying sqllogictest library has the notion of "engines"

The one we use on main is here:
https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest/src/engines/datafusion_engine and

part of that is how to normalize the results into the strings printed in .slt tests

pub fn cell_to_string(col: &ArrayRef, row: usize) -> Result<String> {
    if !col.is_valid(row) {
        // represent any null value with the string "NULL"
        Ok(NULL_STR.to_string())
    } else {
        match col.data_type() {
            DataType::Null => Ok(NULL_STR.to_string()),
            DataType::Boolean => {
                Ok(bool_to_str(get_row_value!(array::BooleanArray, col, row)))
            }
            DataType::Float16 => {
                Ok(f16_to_str(get_row_value!(array::Float16Array, col, row)))
            }
            DataType::Float32 => {
                Ok(f32_to_str(get_row_value!(array::Float32Array, col, row)))
            }
            DataType::Float64 => {
                Ok(f64_to_str(get_row_value!(array::Float64Array, col, row)))
            }
            DataType::Decimal128(_, scale) => {
                let value = get_row_value!(array::Decimal128Array, col, row);
                Ok(decimal_128_to_str(value, *scale))
            }
            DataType::Decimal256(_, scale) => {
                let value = get_row_value!(array::Decimal256Array, col, row);
                Ok(decimal_256_to_str(value, *scale))
            }
            DataType::LargeUtf8 => Ok(varchar_to_str(get_row_value!(
                array::LargeStringArray,
                col,
                row
            ))),
            DataType::Utf8 => {
                Ok(varchar_to_str(get_row_value!(array::StringArray, col, row)))
            }
            DataType::Utf8View => Ok(varchar_to_str(get_row_value!(
                array::StringViewArray,
                col,
                row
            ))),
            DataType::Dictionary(_, _) => {
                let dict = col.as_any_dictionary();
                let key = dict.normalized_keys()[row];
                Ok(cell_to_string(dict.values(), key)?)
            }
            _ => {
                let f =
                    ArrayFormatter::try_new(col.as_ref(), &DEFAULT_CLI_FORMAT_OPTIONS);
                Ok(f.unwrap().value(row).to_string())
            }
        }
        .map_err(DFSqlLogicTestError::Arrow)
    }
}

If there is some different way to convert floating point values for spark maybe we could make a spark functions specific driver

The idea was to do something like this: #15168 (comment)

Ideally we could use one of the many existing tools in datafusion rather than write new scripts

Another potential possibility might be to use insta.rs perhaps (which was added to the repo recently) which automates result update 🤔

Contributor Author

@alamb Thanks for the pointers, this seems reasonable to me.

@andygrove Does this sound good to you as well?

If we're all aligned, I think we've gathered enough input for me to push up some more code 🤠

Contributor

FYI I think @andygrove might be out for a week so he may not respond quickly

Contributor Author

@alamb Thanks for the pointers, this seems reasonable to me.

@andygrove Does this sound good to you as well?

If we're all aligned, I think we've gathered enough input for me to push up some more code 🤠

@andygrove @alamb Just checking in here

Member

Sorry for the late response here. I had a vacation and have been busy with Comet priorities since getting back. I would like to help with the review here. I do wonder if we could start with a smaller scope PR to get the initial crate in place.

I would also like to contribute some Spark-compatible shuffle implementation from Comet so that we can re-use it in Ballista.

}

fn name(&self) -> &str {
"spark_ascii"
Contributor Author

@shehabgamin shehabgamin Mar 13, 2025

Prefix with spark_ because sqllogictest evaluates both implementations of ascii.

Contributor

see above

}

fn name(&self) -> &str {
"spark_expm1"
Contributor Author

Prefix with spark_ because sqllogictest may evaluate more than one implementation of expm1.

Contributor

What I would recommend is

  1. keep the original expm1 name (that seems the most useful to people who are trying to get spark compatible behavior)
  2. Change to use a function to register spark compatible functions (see above)
  3. Change our sqllogictest driver so it registers spark functions for any test that starts with spark_*.slt (similar to pg_...)

That way most sqllogictest stuff stays the same, and we can write spark tests in spark/spark_math.slt, spark/spark_string.slt etc type tests

Here is the code that customizes the context for the individual test files

match file_name {
"information_schema_table_types.slt" => {
info!("Registering local temporary table");
register_temp_table(test_ctx.session_ctx()).await;
}
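
The file-name rule from item 3 can be sketched on its own, separately from the driver code quoted above. This is only an illustration under the assumption that the decision is made from the test file's name. `FunctionSet` and `function_set_for` are hypothetical names, and the actual registration call is left out because its signature is not shown in this thread.

```rust
use std::path::Path;

/// Which functions a test file should see (hypothetical enum for illustration).
#[derive(Debug, PartialEq)]
pub enum FunctionSet {
    /// The regular DataFusion functions only.
    Default,
    /// Spark-compatible functions registered over the defaults.
    Spark,
}

/// Decide from the file name alone: `spark_*.slt` gets the Spark functions.
pub fn function_set_for(path: &Path) -> FunctionSet {
    let is_spark = path
        .file_name()
        .and_then(|name| name.to_str())
        .map(|name| name.starts_with("spark_") && name.ends_with(".slt"))
        .unwrap_or(false);
    if is_spark {
        FunctionSet::Spark
    } else {
        FunctionSet::Default
    }
}
```

With a rule like this, a file such as spark/spark_math.slt could call `expm1` under its original name without colliding with the default implementation in other test files.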

Comment on lines 24 to 26
SELECT spark_expm1(1::INT);
----
1.718281828459
Member

I tried running this in Spark 3.5.3 and did not get the same answer.

scala> spark.sql("select expm1(1)").show()
+-----------------+
|         EXPM1(1)|
+-----------------+
|1.718281828459045|
+-----------------+

Contributor Author

@andygrove I believe sqllogic is truncating the answer.

I initially had this test, which tested for the value 1.7182818284590453 (slightly more precise than your result) but removed it because cargo test (amd64) was giving the value 1.718281828459045 (https://github.com/apache/datafusion/actions/runs/13825914914/job/38680868216) while the rest of the cargo tests on different architectures were passing.

Member

Does sqllogictest have a way to test floating point results within some tolerance?

Contributor Author

Doesn't look like it:

Remember, the purpose of sqllogictest is to validate the logic behind the evaluation of SQL statements, not the ability to handle extreme values. So keep content in a reasonable range: small integers, short strings, and floating point numbers that use only the most significant bits of an a 32-bit IEEE float.

https://www.sqlite.org/sqllogictest/doc/trunk/about.wiki
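
For reference, a tolerance check of the kind asked about could look like the sketch below if it lived in a Rust test helper rather than in sqllogictest itself. The function name `approx_eq` and the choice of a relative tolerance are assumptions made for illustration.

```rust
/// Return true when `actual` is within a relative tolerance of `expected`.
/// Hypothetical helper, not part of sqllogictest or DataFusion.
pub fn approx_eq(expected: f64, actual: f64, rel_tol: f64) -> bool {
    // NaN never compares equal to itself, so treat "both NaN" as a match.
    if expected.is_nan() || actual.is_nan() {
        return expected.is_nan() && actual.is_nan();
    }
    // Exact matches, including two infinities of the same sign.
    if expected == actual {
        return true;
    }
    // An infinity only matches the same infinity, which was handled above.
    if expected.is_infinite() || actual.is_infinite() {
        return false;
    }
    // Scale the tolerance by the larger magnitude of the two values.
    let scale = expected.abs().max(actual.abs());
    (expected - actual).abs() <= rel_tol * scale
}
```

With a relative tolerance of 1e-15, the two values reported in this thread (1.718281828459045 and 1.7182818284590453) compare as equal even though they are distinct f64 values.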

Contributor Author

We don't have to stick with sqllogictest.

We can create test helpers similar to:
https://github.com/lakehq/datafusion/blob/d78877a55c5e835a07a7ebf23a7bd515faf7d827/datafusion/optimizer/src/analyzer/type_coercion.rs#L2137-L2208

The above link is from an old PR that didn't end up getting merged in, but the general idea seems useful here.

Contributor Author

lol

query R
SELECT spark_expm1(1::INT);
----
1.718281828459

query T
SELECT spark_expm1(1::INT)::STRING;
----
1.7182818284590453

1.718281828459

query R
SELECT spark_expm1(a) FROM (VALUES (0::INT), (1::INT)) AS t(a);
Member

I'd suggest adding tests for a wider range of values and edge cases, such as negative numbers, large positive and negative numbers, NaN, null, and so on.
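
As a starting point for such cases, the sketch below runs the Rust standard library's `f64::exp_m1` over a list of nullable inputs. It shows only what the standard library returns. Whether Spark agrees on every one of these inputs is an assumption that would need to be checked against a real Spark session, and `expm1_all` is a hypothetical helper name.

```rust
/// Apply `exp_m1` to each input, passing NULL (`None`) through unchanged.
/// Hypothetical helper for exploring edge cases, not the PR's implementation.
pub fn expm1_all(inputs: &[Option<f64>]) -> Vec<Option<f64>> {
    inputs.iter().map(|x| x.map(f64::exp_m1)).collect()
}

/// Inputs that exercise the edge cases suggested in the review.
pub fn edge_case_inputs() -> Vec<Option<f64>> {
    vec![
        None,                    // NULL
        Some(0.0),               // zero
        Some(-1.0),              // small negative
        Some(1000.0),            // large positive, exp overflows f64
        Some(-1000.0),           // large negative, exp underflows to 0
        Some(f64::NAN),          // NaN
        Some(f64::INFINITY),     // +infinity
        Some(f64::NEG_INFINITY), // -infinity
    ]
}
```

The large positive input overflows to infinity and the large negative input saturates at -1, so both ends of the range are worth pinning down in the .slt file.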

Contributor Author

Good call, will do!

50

query I
SELECT spark_ascii(a) FROM (VALUES ('Spark'), ('PySpark'), ('Pandas API')) AS t(a);
Member

@andygrove andygrove Mar 14, 2025

Could you add tests for edge cases?

Some ideas from ChatGPT (the results are from me actually running them in Spark):

scala> spark.sql("select ascii('😀')").show()
+---------+                                                                     
|ascii(😀)|
+---------+
|   128512|
+---------+

scala> spark.sql("select ascii('\n')").show()
+---------+
|ascii(\n)|
+---------+
|       10|
+---------+


scala> spark.sql("select ascii('\t')").show()
+---------+
|ascii(\t)|
+---------+
|        9|
+---------+
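
The three Spark results above are consistent with returning the Unicode code point of the first character. A minimal sketch of that behavior follows, with the caveat that `first_code_point` is a hypothetical helper and not the implementation in this PR. It returns `None` for an empty string rather than guessing what Spark does in that case.

```rust
/// Code point of the first character, or `None` for an empty string.
/// Hypothetical helper for illustration only.
pub fn first_code_point(s: &str) -> Option<u32> {
    // `chars()` iterates over Unicode scalar values, so a multi-byte
    // character such as an emoji is read as one value, not as raw bytes.
    s.chars().next().map(|c| c as u32)
}
```

Reading by character rather than by byte is what makes the emoji case come out as 128512 instead of the first UTF-8 byte.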

Contributor Author

Good call, will do!

@github-actions github-actions bot removed the core Core DataFusion crate label Apr 25, 2025
@shehabgamin shehabgamin requested a review from alamb April 26, 2025 08:58
@@ -193,7 +192,7 @@ macro_rules! get_row_value {
 ///
 /// Floating numbers are rounded to have a consistent representation with the Postgres runner.
 ///
-pub fn cell_to_string(col: &ArrayRef, row: usize) -> Result<String> {
+pub fn cell_to_string(col: &ArrayRef, row: usize, is_spark_path: bool) -> Result<String> {
Contributor Author

@alamb While digging into your suggestion (#15168 (comment)), I realized that we don't need to write an entire engine for Spark. All we care about is the logic in cell_to_string. For now, I haven't created a Spark-specific spark_cell_to_string, since the issues we originally encountered with sqllogictest were related only to Float64 precision. We can always create a Spark-specific spark_cell_to_string later if we find that other changes are needed.
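
To illustrate the idea of switching only the Float64 formatting on a flag, here is a standalone sketch. It is not the code in this PR: `float_cell` is a hypothetical function, and both formatting choices are assumptions (12 decimal places on the default path, the shortest round-trip representation on the Spark path).

```rust
/// Format one Float64 cell for a test result file.
/// Hypothetical sketch; the real `cell_to_string` handles many more types.
pub fn float_cell(value: f64, is_spark_path: bool) -> String {
    if is_spark_path {
        // Assumption: Spark tests keep every significant digit, so use
        // Rust's shortest representation that round-trips the value.
        value.to_string()
    } else {
        // Assumption: the default path rounds to 12 decimal places.
        format!("{:.12}", value)
    }
}
```

Under these assumptions the value 1.718281828459045 prints as 1.718281828459 on the default path and in full on the Spark path, which matches the truncation observed earlier in the thread.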

Contributor

nice -- this makes sense to me -- I agree what you have here looks good

Contributor

@alamb alamb left a comment

Thank you @shehabgamin -- I think this looks great to me. I left a few comments but nothing that would block merging this PR and I think we can do them as follow on items.

Things that I think we should do next:

  1. Add an example somewhere (perhaps in the examples_directory) showing how to configure and use the spark functions in a SessionContext. I can help with this
  2. Automatically generate documentation for these functions, the way we do for other functions -- https://datafusion.apache.org/user-guide/sql/scalar_functions.html
  3. Test integrating this code into comet (with a draft PR or something) to make sure it works.

After this PR is merged, I suggest we implement one or two more small functions to give some example patterns to follow, and then I think we'll be ready to write a bunch of tickets to port all the functions

Cargo.lock Outdated
Comment on lines 2572 to 2577
"datafusion-functions-aggregate",
"datafusion-functions-aggregate-common",
"datafusion-functions-nested",
"datafusion-functions-table",
"datafusion-functions-window",
"datafusion-functions-window-common",
Contributor

I don't think these dependencies are used

Suggested change
"datafusion-functions-aggregate",
"datafusion-functions-aggregate-common",
"datafusion-functions-nested",
"datafusion-functions-table",
"datafusion-functions-window",
"datafusion-functions-window-common",

Contributor Author

Nice catch, done!

use std::sync::Arc;

/// Fluent-style API for creating `Expr`s
#[allow(unused)]
Contributor

why does it need to be "allow unused"? I don't think this should be necessary for pub APIs

Contributor Author

Linter yells at me otherwise

/// Fluent-style API for creating `Expr`s
#[allow(unused)]
pub mod expr_fn {
pub use super::function::aggregate::expr_fn::*;
Contributor

is this list of modules ones that spark offers? I am not familiar with spark so I don't know off the top of my head

Contributor Author

Yes, exactly!

@@ -1,5 +1,5 @@
 Apache DataFusion
-Copyright 2019-2024 The Apache Software Foundation
+Copyright 2019-2025 The Apache Software Foundation
Contributor

👍

use std::sync::Arc;

#[user_doc(
doc_section(label = "Spark Math Functions"),
Contributor

Nice

Contributor Author

Removed per your suggestion here:
#15168 (comment)

}
}

pub(crate) fn spark_f64_to_str(value: f64) -> String {
Contributor

this looks like a copy/paste of f64_to_str -- maybe we could just thread the spark flag down and avoid some duplication. Not necessary, just a suggestion

Contributor Author

f64_to_str is used in more than 1 place, so I figured it made sense to create a new function.

Contributor

alamb commented Apr 30, 2025

This looks great to me -- I plan to merge it tomorrow and start collecting next steps in a new EPIC ticket unless someone beats me to it

Contributor

alamb commented May 1, 2025

I have filed an epic to track filling out the datafusion-spark crate:

I will file some subtickets for follow on work as well (e.g. what is in #15168 (review))

Contributor

alamb commented May 1, 2025

Onward!

@alamb alamb merged commit 6bda479 into apache:main May 1, 2025
29 checks passed
@xudong963
Member

Fyi, the main CI has failed since the PR

@blaginin blaginin mentioned this pull request May 2, 2025
Contributor

alamb commented May 2, 2025

Fyi, the main CI has failed since the PR

@blaginin has fixed it -- it appears to have been a logical conflict

@linhr linhr deleted the add-spark-crate branch May 14, 2025 03:10
mach-kernel added a commit to spiceai/datafusion that referenced this pull request Aug 1, 2025
* Fix: fetch is lost in replace_order_preserving_variants method during EnforceDistribution (#15808)

* Speed up `optimize_projection` (#15787)

* save

* fmt

* Support WITHIN GROUP syntax to standardize certain existing aggregate functions  (#13511)

* Add within group variable to aggregate function and arguments

* Support within group and disable null handling for ordered set aggregate functions (#13511)

* Refactored function to match updated signature

* Modify proto to support within group clause

* Modify physical planner and accumulator to support ordered set aggregate function

* Support session management for ordered set aggregate functions

* Align code, tests, and examples with changes to aggregate function logic

* Ensure compatibility with new `within_group` and `order_by` handling.

* Adjust tests and examples to align with the new logic.

* Fix typo in existing comments

* Enhance test

* Add test cases for changed signature

* Update signature in docs

* Fix bug : handle missing within_group when applying children tree node

* Change the signature of approx_percentile_cont for consistency

* Add missing within_group for expr display

* Handle edge case when over and within group clause are used together

* Apply clippy advice: avoids too many arguments

* Add new test cases using descending order

* Apply cargo fmt

* Revert unintended submodule changes

* Apply prettier guidance

* Apply doc guidance by update_function_doc.sh

* Rollback WITHIN GROUP and related logic after converting it into expr

* Make it not to handle redundant logic

* Rollback ordered set aggregate functions from session to save same info in udf itself

* Convert within group to order by when converting sql to expr

* Add function to determine it is ordered-set aggregate function

* Rollback within group from proto

* Utilize within group as order by in functions-aggregate

* Apply clippy

* Convert order by to within group

* Apply cargo fmt

* Remove plain line breaks

* Remove duplicated column arg in schema name

* Refactor boolean functions to just return primitive type

* Make within group necessary in the signature of existing ordered set aggr funcs

* Apply cargo fmt

* Support a single ordering expression in the signature

* Apply cargo fmt

* Add dataframe function test cases to verify descending ordering

* Apply cargo fmt

* Apply code reviews

* Uses order by consistently after done with sql

* Remove redundant comment

* Serve more clear error msg

* Handle error cases in the same code block

* Update error msg in test as corresponding code changed

* fix

---------

Co-authored-by: Jay Zhan <[email protected]>

* docs: add ArkFlow (#15826)

* chore(deps): bump env_logger from 0.11.7 to 0.11.8 (#15823)

Bumps [env_logger](https://github.com/rust-cli/env_logger) from 0.11.7 to 0.11.8.
- [Release notes](https://github.com/rust-cli/env_logger/releases)
- [Changelog](https://github.com/rust-cli/env_logger/blob/main/CHANGELOG.md)
- [Commits](https://github.com/rust-cli/env_logger/compare/v0.11.7...v0.11.8)

---
updated-dependencies:
- dependency-name: env_logger
  dependency-version: 0.11.8
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Support unparsing `UNION` for distinct results (#15814)

* Add `MemoryPool::memory_limit`  to expose setting memory usage limit (#15828)

* add `memory_limit` to `MemoryPool`, and impl it for the pools in datafusion.

* Update datafusion/execution/src/memory_pool/mod.rs

Co-authored-by: Ruihang Xia <[email protected]>

---------

Co-authored-by: Ruihang Xia <[email protected]>

* Preserve projection for inline scan (#15825)

* Preserve projection for inline scan

* fix

---------

Co-authored-by: Vadim Piven <[email protected]>

* cleanup after emit (#15834)

* chore(deps): bump pyo3 from 0.24.1 to 0.24.2 (#15838)

Bumps [pyo3](https://github.com/pyo3/pyo3) from 0.24.1 to 0.24.2.
- [Release notes](https://github.com/pyo3/pyo3/releases)
- [Changelog](https://github.com/PyO3/pyo3/blob/main/CHANGELOG.md)
- [Commits](https://github.com/pyo3/pyo3/compare/v0.24.1...v0.24.2)

---
updated-dependencies:
- dependency-name: pyo3
  dependency-version: 0.24.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Fix: fetch is missing in `EnforceSorting` optimizer (two places) (#15822)

* Fix: fetch is missing in EnforceSort

* add ut test_parallelize_sort_preserves_fetch

* add ut: test_plan_with_order_preserving_variants_preserves_fetch

* update

* address comments

* Minor: fix potential flaky test in aggregate.slt (#15829)

* Fix `ILIKE` expression support in SQL unparser (#15820)

* Fix ILIKE expression support in SQL unparser (#76)

* update tests

* Make `Diagnostic` easy/convenient to attach by using macro and avoiding `map_err` (#15796)

* First Step

* Final Step?

* Homogenisation

* Feature/benchmark config from env (#15782)

* Read benchmark SessionConfig from env

* Set target partitions from env by default

fix

* Set batch size from env by default

* Fix batch size option for tpch ci

* Log environment variable configuration

* Document benchmarking env variable config

* Add DATAFUSION_* env config to the `bench.sh` help output

Orchestrates running benchmarks against DataFusion checkouts

Usage:
./bench.sh data [benchmark] [query]
./bench.sh run [benchmark]
./bench.sh compare <branch1> <branch2>
./bench.sh venv

**********
Examples:
**********
# Create the datasets for all benchmarks in /Users/christian/MA/datafusion/benchmarks/data
./bench.sh data

# Run the 'tpch' benchmark on the datafusion checkout in /source/datafusion
DATAFUSION_DIR=/source/datafusion ./bench.sh run tpch

**********
* Commands
**********
data:         Generates or downloads data needed for benchmarking
run:          Runs the named benchmark
compare:      Compares results from benchmark runs
venv:         Creates new venv (unless already exists) and installs compare's requirements into it

**********
* Benchmarks
**********
all(default): Data/Run/Compare for all benchmarks
tpch:                   TPCH inspired benchmark on Scale Factor (SF) 1 (~1GB), single parquet file per table, hash join
tpch_mem:               TPCH inspired benchmark on Scale Factor (SF) 1 (~1GB), query from memory
tpch10:                 TPCH inspired benchmark on Scale Factor (SF) 10 (~10GB), single parquet file per table, hash join
tpch_mem10:             TPCH inspired benchmark on Scale Factor (SF) 10 (~10GB), query from memory
cancellation:           How long cancelling a query takes
parquet:                Benchmark of parquet reader's filtering speed
sort:                   Benchmark of sorting speed
sort_tpch:              Benchmark of sorting speed for end-to-end sort queries on TPCH dataset
clickbench_1:           ClickBench queries against a single parquet file
clickbench_partitioned: ClickBench queries against a partitioned (100 files) parquet
clickbench_extended:    ClickBench "inspired" queries against a single parquet (DataFusion specific)
external_aggr:          External aggregation benchmark
h2o_small:              h2oai benchmark with small dataset (1e7 rows) for groupby,  default file format is csv
h2o_medium:             h2oai benchmark with medium dataset (1e8 rows) for groupby, default file format is csv
h2o_big:                h2oai benchmark with large dataset (1e9 rows) for groupby,  default file format is csv
h2o_small_join:         h2oai benchmark with small dataset (1e7 rows) for join,  default file format is csv
h2o_medium_join:        h2oai benchmark with medium dataset (1e8 rows) for join, default file format is csv
h2o_big_join:           h2oai benchmark with large dataset (1e9 rows) for join,  default file format is csv
imdb:                   Join Order Benchmark (JOB) using the IMDB dataset converted to parquet

**********
* Supported Configuration (Environment Variables)
**********
DATA_DIR            directory to store datasets
CARGO_COMMAND       command that runs the benchmark binary
DATAFUSION_DIR      directory to use (default /Users/christian/MA/datafusion/benchmarks/..)
RESULTS_NAME        folder where the benchmark files are stored
PREFER_HASH_JOIN    Prefer hash join algorithm (default true)
VENV_PATH           Python venv to use for compare and venv commands (default ./venv, override by <your-venv>/bin/activate)
DATAFUSION_*        Set the given datafusion configuration
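The `DATAFUSION_*` entry above can be illustrated with a minimal sketch. It assumes the naming convention used by DataFusion's `ConfigOptions::from_env`, where a config key is upper-cased and its dots are replaced by underscores; the helper names `env_var_name` and `lookup_override` are hypothetical and are not part of `bench.sh` or DataFusion.

```rust
/// Map a DataFusion config key to an environment variable name.
/// Assumed convention: upper-case the key and replace dots with underscores.
fn env_var_name(config_key: &str) -> String {
    config_key.to_uppercase().replace('.', "_")
}

/// Look up an override for `config_key` in a list of (name, value) pairs,
/// for example the pairs collected from `std::env::vars()`.
/// Hypothetical helper for illustration only.
fn lookup_override<'a>(config_key: &str, vars: &'a [(String, String)]) -> Option<&'a str> {
    let name = env_var_name(config_key);
    vars.iter()
        .find(|(k, _)| *k == name)
        .map(|(_, v)| v.as_str())
}
```

Under this assumption, `datafusion.execution.batch_size` would be set through `DATAFUSION_EXECUTION_BATCH_SIZE`. In a real run the pairs would come from `std::env::vars().collect::<Vec<_>>()`.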

* fmt

* predicate pruning: support cast and try_cast for more types (#15764)

* predicate pruning: support dictionaries

* more types

* clippy

* add tests

* add tests

* simplify to dicts

* revert most changes

* just check for strings, more tests

* more tests

* remove unnecessary, now confusing clause

* Fix: fetch is missing in plan_with_order_breaking_variants method (#15842)

* Fix `CoalescePartitionsExec` proto serialization (#15824)

* add fetch to CoalescePartitionsExecNode

* gen proto code

* Add test

* fix

* fix build

* Fix test build

* remove comments

* Fix build (#15849)

* Fix scalar list comparison when the compared lists have different lengths (#15856)

* chore: More details to `No UDF registered` error (#15843)

* chore(deps): bump clap from 4.5.36 to 4.5.37 (#15853)

Bumps [clap](https://github.com/clap-rs/clap) from 4.5.36 to 4.5.37.
- [Release notes](https://github.com/clap-rs/clap/releases)
- [Changelog](https://github.com/clap-rs/clap/blob/master/CHANGELOG.md)
- [Commits](https://github.com/clap-rs/clap/compare/clap_complete-v4.5.36...clap_complete-v4.5.37)

---
updated-dependencies:
- dependency-name: clap
  dependency-version: 4.5.37
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Remove usage of `dbg!` (#15858)

* Fix `from_unixtime` function documentation (#15844)

* Fix `from_unixtime` function documentation

* Update scalar_functions.md

* Minor: Interval singleton (#15859)

* interval singleton

* fmt

* impl from

* Make aggr fuzzer query builder more configurable (#15851)

* refactor and make `QueryBuilder` more configurable.

* fix tests.

* fix clippy.

* extract `QueryBuilder` to a dedicated module.

* add `min_group_by_columns`, and fix some bugs.

* chore(deps): bump aws-config from 1.6.1 to 1.6.2 (#15874)

Bumps [aws-config](https://github.com/smithy-lang/smithy-rs) from 1.6.1 to 1.6.2.
- [Release notes](https://github.com/smithy-lang/smithy-rs/releases)
- [Changelog](https://github.com/smithy-lang/smithy-rs/blob/main/CHANGELOG.md)
- [Commits](https://github.com/smithy-lang/smithy-rs/commits)

---
updated-dependencies:
- dependency-name: aws-config
  dependency-version: 1.6.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Add slt tests for `datafusion.execution.parquet.coerce_int96` setting (#15723)

* Add slt tests for datafusion.execution.parquet.coerce_int96 setting

* tweak

* Improve `ListingTable` / `ListingTableOptions` docs (#15767)

* Improve `ListingTable` / `ListingTableOptions` docs

* Update datafusion/core/src/datasource/listing/table.rs

Co-authored-by: Alex Huang <[email protected]>

---------

Co-authored-by: Alex Huang <[email protected]>

* Upgrade-guide: Downgrade "FileScanConfig –> FileScanConfigBuilder" headline (#15883)

I noticed that https://datafusion.apache.org/library-user-guide/upgrading.html#filescanconfig-filescanconfigbuilder had "FileScanConfig –> FileScanConfigBuilder" as a top-level headline. It should probably be under the 47 release

* Migrate Optimizer tests to insta, part2 (#15884)

* migrate tests in `replace_distinct_aggregate.rs`

* migrate tests in `replace_distinct_aggregate.rs`

* migrate tests in `push_down_limit.rs`

* migrate tests in `eliminate_duplicated_expr.rs`

* migrate tests in `eliminate_filter.rs`

* migrate tests in `eliminate_group_by_constant.rs` to insta

* migrate tests in `eliminate_join.rs` to use snapshot assertions

* migrate tests in `eliminate_nested_union.rs` to use snapshot assertions

* migrate tests in `eliminate_outer_join.rs` to use snapshot assertions

* migrate tests in `filter_null_join_keys.rs` to use snapshot assertions

* fix type inference

* fix macro to use crate path for OptimizerContext and Optimizer

* clean up

* fix: Avoid mistaken ILike to string equality optimization (#15836)

* fix: Avoid mistaken ILike to string equality optimization

* test: ILIKE without wildcards

* Improve documentation for `FileSource`, `DataSource` and `DataSourceExec` (#15766)

* Improve documentation for FileSource

* more

* Update datafusion/datasource/src/file.rs

Co-authored-by: Adrian Garcia Badaracco <[email protected]>

* Clippy

* fmt

---------

Co-authored-by: Adrian Garcia Badaracco <[email protected]>

* allow min max dictionary (#15827)

* Map file-level column statistics to the table-level (#15865)

* init

* fix clippy

* add test

* chore(deps): bump blake3 from 1.8.1 to 1.8.2 (#15890)

Bumps [blake3](https://github.com/BLAKE3-team/BLAKE3) from 1.8.1 to 1.8.2.
- [Release notes](https://github.com/BLAKE3-team/BLAKE3/releases)
- [Commits](https://github.com/BLAKE3-team/BLAKE3/compare/1.8.1...1.8.2)

---
updated-dependencies:
- dependency-name: blake3
  dependency-version: 1.8.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Respect ignore_nulls in array_agg (#15544)

* Respect ignore_nulls in array_agg

* Reduce code duplication

* Add another test

* Set HashJoin seed (#15783)

* Set HashJoin seed

* fmt

* whitespace grr

* Document hash seed

Co-authored-by: Alex Huang <[email protected]>

---------

Co-authored-by: Alex Huang <[email protected]>

* Add Extension Type / Metadata support for Scalar UDFs (#15646)

* Add in plumbing to pass around metadata for physical expressions

* Adding argument metadata to scalar argument struct

* Since everywhere we use this we immediately clone, go ahead and return an owned version of the metadata for simplicity

* Cargo fmt

* Benchmarks required args_metadata in tests

* Clippy warnings

* Switching over to passing Field around instead of metadata so we can handle extension types directly

* Switching return_type_from_args to return_field_from_args

* Updates to unit tests for switching to field instead of data_type

* Resolve unit test issues

* Update after rebase on main

* GetFieldFunc should return the field it finds instead of creating a new one

* Get metadata from scalar functions

* Change expr_schema to use to_field primarily instead of individual calls for getting data type, nullability, and schema

* Scalar function arguments should take return field instead of return data type now

* subquery should just get the field from below and not lose potential metadata

* Update comment

* Remove output_field now that we've determined it using return_field_from_args

* Change name to_field to field_from_column to be more consistent with the usage and prevent misconception about if we are doing some conversion

* Minor moving around of the explicit lifetimes in the struct definition

* Change physical expression to require to output a field which requires a lot of unit test updates, especially because the scalar arguments pass around borrowed values

* Change name from output_field to return_field to be more consistent

* Update migration guide for DF48 with user defined functions

* Whitespace

* Docstring correction

* chore: fix clippy::large_enum_variant for DataFusionError (#15861)

* Saner handling of nulls inside arrays (#15149)

* Saner handling of nulls inside arrays

* Fix array_sort for empty record batch

* Fix get_valid_types for FixedSizeLists

* Optimize array_ndims

* Add a test for result type of Concatenating Mixed types

* Fix array_element of empty array

* Handle more FixedSizeLists

* Feat: introduce `ExecutionPlan::partition_statistics` API (#15852)

* save

* save

* save

* functional way

* fix sort

* adding test

* add tests

* save

* update

* add PartitionedStatistics structure

* use Arc

* refine tests

* save

* resolve conflicts

* use PartitionedStatistics

* impl index and len for PartitionedStatistics

* add test for cross join

* fix clippy

* Check the statistics_by_partition with real results

* rebase main and fix cross join test

* resolve conflicts

* Feat: introduce partition statistics API

* address comments

* deprecated statistics API

* rebase main and fix tests

* fix

* Keeping pull request in sync with the base branch (#15894)

* fix: cast inner fsl to list in flatten (#15898)

* support OR operator in binary `evaluate_bounds` (#15716)

* support OR operator in binary `evaluate_bounds`

* fixup tests

* feat: Add option to adjust writer buffer size for query output (#15747)

* Add execution config option to set buffer size

* Document new configuration option (#15656)

* Minor documentation correction (#15656)

* Add default to documentation (#15656)

* Minor doc. fix and correct failing tests (#15656)

* Fix test (#15656)

* Updated with Builder API

---------

Co-authored-by: m09526 <[email protected]>

* infer placeholder datatype for IN lists (#15864)

* infer placeholder datatype for IN lists

* infer placeholder datatype for Expr::Like

* add tests for Expr::SimilarTo

---------

Co-authored-by: Kevin <[email protected].>

* Update known users (#15895)

* fix(avro): Respect projection order in Avro reader (#15840)

Fixed issue in the Avro reader that caused queries to fail when columns
were reordered in the SELECT statement. The reader now correctly:

1. Builds arrays in the order specified in the projection
2. Creates a properly ordered schema matching the projection

Previously when selecting columns in a different order than the original
schema (e.g., `SELECT timestamp, username FROM avro_table`), the reader
would produce an error due to type mismatches between the data arrays and
the expected schema.

Fixes #15839
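The two steps described in this fix can be sketched in isolation. This is a minimal illustration, not the Avro reader's code: `Field`, `project_schema`, and `project_columns` are hypothetical stand-ins, and plain vectors replace Arrow arrays. The point is that both the schema and the columns are built by walking the projection in its own order, so position `k` in one always matches position `k` in the other.

```rust
/// Hypothetical stand-in for a schema field; not Arrow's or DataFusion's type.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: String,
}

/// Build the projected schema by walking `projection` in the order given,
/// instead of filtering the original schema (which would keep file order).
/// Returns `None` if an index is out of range.
fn project_schema(schema: &[Field], projection: &[usize]) -> Option<Vec<Field>> {
    projection.iter().map(|&i| schema.get(i).cloned()).collect()
}

/// Reorder the column values the same way, so that column `k` of the output
/// lines up with field `k` of the projected schema.
fn project_columns<T: Clone>(columns: &[T], projection: &[usize]) -> Option<Vec<T>> {
    projection.iter().map(|&i| columns.get(i).cloned()).collect()
}
```

With a file schema of `username, timestamp` and a projection of `[1, 0]`, both helpers return their results in `timestamp, username` order, which is the order a query such as `SELECT timestamp, username FROM avro_table` expects.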

* Fix allow_update_branch (#15904)

* fix: correctly specify the nullability of `map_values` return type (#15901)

Co-authored-by: Andrew Lamb <[email protected]>

* Add `union_tag` scalar function (#14687)

* feat: add union_tag scalar function

* update for new api

* Add test for second field type

---------

Co-authored-by: Andrew Lamb <[email protected]>

* chore(deps): bump tokio from 1.44.1 to 1.44.2 (#15900)

Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.44.1 to 1.44.2.
- [Release notes](https://github.com/tokio-rs/tokio/releases)
- [Commits](https://github.com/tokio-rs/tokio/compare/tokio-1.44.1...tokio-1.44.2)

---
updated-dependencies:
- dependency-name: tokio
  dependency-version: 1.44.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: xudong.w <[email protected]>

* chore(deps): bump assert_cmd from 2.0.16 to 2.0.17 (#15909)

Bumps [assert_cmd](https://github.com/assert-rs/assert_cmd) from 2.0.16 to 2.0.17.
- [Changelog](https://github.com/assert-rs/assert_cmd/blob/master/CHANGELOG.md)
- [Commits](https://github.com/assert-rs/assert_cmd/compare/v2.0.16...v2.0.17)

---
updated-dependencies:
- dependency-name: assert_cmd
  dependency-version: 2.0.17
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Factor out Substrait consumers into separate files (#15794)

* Factor out Substrait consumers into separate files

* Move relations and expressions into their own modules

* Refactor: rename rex to expr

* Refactor: move from_substrait_extended_expr to mod.rs

---------

Co-authored-by: Andrew Lamb <[email protected]>

* Unparse `UNNEST` projection with the table column alias (#15879)

* add table column alias for unnest projection

* fix clippy

* fix columns check

* feat: Add `datafusion-spark` crate (#15168)

* feat: Add datafusion-spark crate

* spark crate setup

* clean up 2 example functions

* cleanup crate

* Spark crate setup

* fix lint issue

* cargo cleanup

* fix collision in sqllogic

* remove redundant test

* test float precision when casting to string

* reorder

* undo

* save

* save

* save

* add spark crate

* remove spark from core

* add comment to import tests

* Fix: reset submodule to main pointer and clean state

* Save

* fix registration

* modify float64 precision for spark

* Update datafusion/spark/src/lib.rs

Co-authored-by: Andrew Lamb <[email protected]>

* clean up code

* code cleanup

---------

Co-authored-by: Andrew Lamb <[email protected]>

* Fix typo in introduction.md (#15910)

- Fix typo in introduction.md
- Remove period from end of bullet point to maintain consistency with other bullet points

* Fix CI in main (#15917)

* Migrate Optimizer tests to insta, part3 (#15893)

* migrate tests in `push_down_filters.rs` to use snapshot assertions

* remove unused format checks

* Revert "remove unused format checks"

This reverts commit dc4f137c7fc8cf642c8dbb158fbbb5526c69e051.

* migrate `assert_eq!` in `push_down_filters.rs` to use snapshot assertions

* migrate `assert_eq!` in `push_down_filters.rs` to use snapshot assertions

---------

Co-authored-by: Dmitrii Blaginin <[email protected]>

* Add `FormatOptions` to Config (#15793)

* Add `FormatOptions` to Config

* Fix `output_with_header`

* Add cli test

* Add `to_string`

* Prettify

* Prettify

* Preserve the initial `NULL` logic

* Cleanup

* Remove `lt` as no longer needed

* Format assert

* Fix sqllogictest

* Fix tests

* Set formatting params for dates / times

* Lowercase `duration_format`

---------

Co-authored-by: Andrew Lamb <[email protected]>

* Minor: cleanup datafusion-spark scalar functions (#15921)

* Fix ClickBench extended queries after update to APPROX_PERCENTILE_CONT (#15929)

* fix: SqlLogicTest on Windows (#15932)

* docs: Label `bloom_filter_on_read` as a reading config (#15933)

* docs: Label `bloom_filter_on_read` as a reading config

* fix: Update configs.md

* Add extended query for checking improvement for blocked groups optimization (#15936)

* add query to show improvement for 15591.

* document the new added query.

* Character length (#15931)

* chore(deps): bump tokio-util from 0.7.14 to 0.7.15 (#15918)

Bumps [tokio-util](https://github.com/tokio-rs/tokio) from 0.7.14 to 0.7.15.
- [Release notes](https://github.com/tokio-rs/tokio/releases)
- [Commits](https://github.com/tokio-rs/tokio/compare/tokio-util-0.7.14...tokio-util-0.7.15)

---
updated-dependencies:
- dependency-name: tokio-util
  dependency-version: 0.7.15
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: xudong.w <[email protected]>

* Migrate Optimizer tests to insta, part4 (#15937)

* migrate `assert_eq!` in `optimize_projection/mod.rs` to use snapshot assertions

* migrate `assert_optimized_plan_equal!` in `propagate_empty_relations.rs` to use snapshot assertions

* remove all `assert_optimized_plan_eq`

* migrate `assert_optimized_plan_equal!` in `decorrelate_predicate_subquery.rs` to use snapshot assertions

* Add snapshot assertion macro for optimized plan equality checks

---------

Co-authored-by: Dmitrii Blaginin <[email protected]>

* fix query results for predicates referencing partition columns and data columns (#15935)

* fix query results for predicates referencing partition columns and data columns

* fmt

* add e2e test

* newline

* chore(deps): bump substrait from 0.55.0 to 0.55.1 (#15941)

Bumps [substrait](https://github.com/substrait-io/substrait-rs) from 0.55.0 to 0.55.1.
- [Release notes](https://github.com/substrait-io/substrait-rs/releases)
- [Changelog](https://github.com/substrait-io/substrait-rs/blob/main/CHANGELOG.md)
- [Commits](https://github.com/substrait-io/substrait-rs/compare/v0.55.0...v0.55.1)

---
updated-dependencies:
- dependency-name: substrait
  dependency-version: 0.55.1
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* feat: create helpers to set the max_temp_directory_size (#15919)

* feat: create helpers to set the max_temp_directory_size

Signed-off-by: Jérémie Drouet <[email protected]>

* refactor: use helper in cli

Signed-off-by: Jérémie Drouet <[email protected]>

* refactor: update error message

Signed-off-by: Jérémie Drouet <[email protected]>

* refactor: use setter in tests

Signed-off-by: Jérémie Drouet <[email protected]>

---------

Signed-off-by: Jérémie Drouet <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>

* Fix main CI (#15942)

* Improve sqllogictest error reporting (#15905)

* refactor filter pushdown apis (#15801)

* refactor filter pushdown apis

* remove commented out code

* fix tests

* fail to fix bug

* fix

* add/fix docs

* lint

* add some docstrings, some minimal cleanup

* review suggestions

* add more comments

* fix doc links

* fmt

* add comments

* make test deterministic

* add bench

* fix bench

* register bench

* fix bench

* cargo fmt

---------

Co-authored-by: berkaysynnada <[email protected]>
Co-authored-by: Berkay Şahin <[email protected]>

* fix: fold cast null to substrait typed null (#15854)

* fix: fold cast null to typed null

* test: unit test

* chore: clippy

* fix: only handle ScalarValue::Null instead of all null-ed value

* Add additional tests for filter pushdown apis (#15955)

* Add additional tests for filter pushdown apis

* rename the testing module

* move TestNode to util

* fmt

---------

Co-authored-by: berkaysynnada <[email protected]>

* Improve filter pushdown optimizer rule performance (#15959)

* Improve filter pushdown optimizer rule performance

* fmt

* fix lint

* feat: ORDER BY ALL (#15772)

* feat: ORDER BY ALL

* refactor: order by all

* refactor: order_by_to_sort_expr

* refactor: TODO comment

---------

Signed-off-by: dependabot[bot] <[email protected]>
Signed-off-by: Jérémie Drouet <[email protected]>
Co-authored-by: silezhou <[email protected]>
Co-authored-by: Adrian Garcia Badaracco <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Jérémie Drouet <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: xudong.w <[email protected]>
Co-authored-by: Gabriel <[email protected]>
Co-authored-by: berkaysynnada <[email protected]>
Co-authored-by: Berkay Şahin <[email protected]>

* Implement Parquet filter pushdown via new filter pushdown APIs (#15769)

* Implement Parquet filter pushdown via new filter pushdown APIs

* Update filter_pushdown.rs

---------

Co-authored-by: berkaysynnada <[email protected]>

* Reduce rehashing cost for primitive grouping by also reusing hash value (#15962)

* also save hash in hashtable in primitive single group by.

* address cr.

* chore(deps): bump chrono from 0.4.40 to 0.4.41 (#15956)

Bumps [chrono](https://github.com/chronotope/chrono) from 0.4.40 to 0.4.41.
- [Release notes](https://github.com/chronotope/chrono/releases)
- [Changelog](https://github.com/chronotope/chrono/blob/main/CHANGELOG.md)
- [Commits](https://github.com/chronotope/chrono/compare/v0.4.40...v0.4.41)

---
updated-dependencies:
- dependency-name: chrono
  dependency-version: 0.4.41
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* feat: support min/max for struct (#15667)

* feat: support min/max for struct

* groups aggregator

* update based on lamb's suggestion

* refactor: replace `unwrap_or` with `unwrap_or_else` for improved lazy… (#15841)

* refactor: replace `unwrap_or` with `unwrap_or_else` for improved lazy evaluation

* refactor: improve code readability by adjusting formatting and using `unwrap_or_else` for better lazy evaluation

* [FIX] added imports

* [FIX] formatting and restored original config logic

* config restored

* optimized the use of .clone()

* removed the use of clone

* cleanup the clone usecase
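The lazy-evaluation point behind this refactor can be shown with a small sketch. The functions below are hypothetical and are not taken from the PR; a counter records how often the default value is computed, which makes the difference between `unwrap_or` and `unwrap_or_else` observable.

```rust
use std::cell::Cell;

/// Hypothetical "expensive" default that records how often it is computed.
fn expensive_default(calls: &Cell<u32>) -> String {
    calls.set(calls.get() + 1);
    "default".to_string()
}

/// Eager: the argument to `unwrap_or` is evaluated before the call,
/// even when `value` is `Some` and the default is thrown away.
fn eager(value: Option<String>, calls: &Cell<u32>) -> String {
    value.unwrap_or(expensive_default(calls))
}

/// Lazy: the closure passed to `unwrap_or_else` runs only when `value` is `None`.
fn lazy(value: Option<String>, calls: &Cell<u32>) -> String {
    value.unwrap_or_else(|| expensive_default(calls))
}
```

Calling `eager(Some(..), &calls)` increments the counter even though the default is discarded, while `lazy(Some(..), &calls)` leaves it unchanged. The difference only matters when the default allocates or does real work; for cheap constants `unwrap_or` is fine.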

* add benchmark code for `Reuse rows in row cursor stream` (#15913)

* add benchmark for SortPreservingMergeExec

* add comments

* add comments

* Cover more test scenarios

* Update-docs_pr.yaml (#15966)

* Segfault in ByteGroupValueBuilder (#15968)

* test to demonstrate segfault in ByteGroupValueBuilder

* check for offset overflow

* clippy
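A rough sketch of what an offset overflow check can look like follows. It assumes the problem is that the running byte length of a variable-length buffer no longer fits in a 32-bit offset, which the commit titles do not spell out; `next_offset` is a hypothetical helper and not the actual `ByteGroupValueBuilder` code.

```rust
use std::convert::TryFrom;

/// Compute the end offset after appending `value_len` bytes to a buffer that
/// currently holds `current_len` bytes, using `i32` offsets.
/// Returns an error instead of silently wrapping when the offset does not fit.
/// Hypothetical helper for illustration only.
fn next_offset(current_len: usize, value_len: usize) -> Result<i32, String> {
    // Guard the addition itself first, then the narrowing conversion.
    let new_len = current_len
        .checked_add(value_len)
        .ok_or_else(|| "buffer length overflows usize".to_string())?;
    i32::try_from(new_len)
        .map_err(|_| format!("offset {new_len} does not fit in i32"))
}
```

Returning an error at the point of append lets the caller fail the query cleanly rather than writing an offset that has wrapped around.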

* make can_expr_be_pushed_down_with_schemas public again (#15971)

* re-export can_expr_be_pushed_down_with_schemas to be public (#15974)

* Migrate Optimizer tests to insta, part5 (#15945)

* migrate `assert_optimized_plan_equal` in `extract_equijoin_predicate.rs` to use snapshot assertions

* format

* migrate `assert_optimized_plan_equal` in `single_distinct_to_groupby.rs` to use snapshot assertions

* remove all `assert_optimized_plan_eq_display_indent`

* remove unused test helper functions

* migrate `assert_optimized_plan_equal` in `scalar_subquery_to_join.rs` to use snapshot assertions

* remove unused test helper functions

* Show LogicalType name for `INFORMATION_SCHEMA` (#15965)

* show logical type instead of arrow type for parameters table

* use debug fmt directly

* chore(deps): bump sha2 from 0.10.8 to 0.10.9 (#15970)

Bumps [sha2](https://github.com/RustCrypto/hashes) from 0.10.8 to 0.10.9.
- [Commits](https://github.com/RustCrypto/hashes/compare/sha2-v0.10.8...sha2-v0.10.9)

---
updated-dependencies:
- dependency-name: sha2
  dependency-version: 0.10.9
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* refactor: remove deprecated `ParquetExec` (#15973)

* refactor: remove deprecated ParquetExec

* fix doc

* fix: remove allow deprecated attribute

---------

Co-authored-by: Andrew Lamb <[email protected]>

* chore(deps): bump insta from 1.42.2 to 1.43.1 (#15988)

Bumps [insta](https://github.com/mitsuhiko/insta) from 1.42.2 to 1.43.1.
- [Release notes](https://github.com/mitsuhiko/insta/releases)
- [Changelog](https://github.com/mitsuhiko/insta/blob/master/CHANGELOG.md)
- [Commits](https://github.com/mitsuhiko/insta/compare/1.42.2...1.43.1)

---
updated-dependencies:
- dependency-name: insta
  dependency-version: 1.43.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* [datafusion-spark] Add Spark-compatible hex function (#15947)

* refactor: remove deprecated AvroExec (#15987)

* Substrait: Handle inner map fields in schema renaming (#15869)

* add tests

* fix tests

* fix

* fix

---------

Co-authored-by: Andrew Lamb <[email protected]>

* refactor: remove deprecated CsvExec (#15991)

* Migrate Optimizer tests to insta, part6 (#15984)

* migrate tests in `type_coercion.rs` to use snapshot assertions

* remove `assert_analyzed_plan_eq` and `assert_analyzed_plan_with_config_eq`

* remove unnecessary `pub`

* refactor: replace custom assertion functions with snapshot assertions in EliminateLimit tests

* format

* rename

* rename

* refactor: replace custom assertion function with macro for optimized plan equality in tests

* format macro

* chore(deps): bump nix from 0.29.0 to 0.30.1 (#16002)

Bumps [nix](https://github.com/nix-rust/nix) from 0.29.0 to 0.30.1.
- [Changelog](https://github.com/nix-rust/nix/blob/master/CHANGELOG.md)
- [Commits](https://github.com/nix-rust/nix/compare/v0.29.0...v0.30.1)

---
updated-dependencies:
- dependency-name: nix
  dependency-version: 0.30.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Implement RightSemi join for SortMergeJoin (#15972)

* Enable repartitioning on MemTable. (#15409)

* test(15088): reproducer of missing sort parallelization

* feat(15088): repartitioning for MemorySourceConfig

* test(15088): update test outcome due to fix

* test: update sort spill test to not parallelize sorts (due to scan repartitioning)

* fix: out of bounds

* test: during fuzz testing, we are hitting limits of FDs open due to mem table repartitioned scan

* refactor: improve performance

* chore: change variable naming, and update documentation to make clear how the datasource repartitioning is configured and performed

* test: update test snapshots for updated config description

* refactor: update algo used for even splitting, to proper binpack

* refactor: change config name back to original for backwards compatibility

* fix: maintain ordering within partition, when a single partition

* chore: add more doc comments

* Migrate Optimizer tests to insta, part7 (#16010)

* generalize `assert_optimized_plan_eq_snapshot` interface

* fix clippy

* refactor: simplify assertion for optimized plan equality in tests

* migrate tests in `eliminate_cross_join.rs` to use snapshot assertions

* chore(deps): bump sysinfo from 0.34.2 to 0.35.1 (#16027)

Bumps [sysinfo](https://github.com/GuillaumeGomez/sysinfo) from 0.34.2 to 0.35.1.
- [Changelog](https://github.com/GuillaumeGomez/sysinfo/blob/master/CHANGELOG.md)
- [Commits](https://github.com/GuillaumeGomez/sysinfo/commits/v0.35.1)

---
updated-dependencies:
- dependency-name: sysinfo
  dependency-version: 0.35.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Fix: `build_predicate_expression` method doesn't process `false` expr correctly (#15995)

* Fix: build_predicate_expression method doesn't process false correctly

* fix test

* refactor:  move should_enable_page_index from mod.rs to opener.rs (#16026)

* fix: add an "expr_planners" method to SessionState (#15119)

* add expr_planners to SessionState

* minor

* fix ci

* add test

* flatten imports

---------

Co-authored-by: Andrew Lamb <[email protected]>

* Updated extending operators documentation (#15612)

* Updated extending operators documentation

* commented out Rust code to pass doc test

---------

Co-authored-by: Andrew Lamb <[email protected]>

* feat(proto): udf decoding fallback (#15997)

* feat(proto): udf decoding fallback

* add test case for proto udf decode fallback

* chore: Replace MSRV link on main page with Github badge (#16020)

* Replace MSRV link on main page with Github badge

* Add note to upgrade guide for removal of `ParquetExec`, `AvroExec`, `CsvExec`, `JsonExec` (#16034)

* refactor: remove deprecated ArrowExec (#16006)

* refactor: remove deprecated MemoryExec (#16007)

* refactor: remove deprecated JsonExec (#16005)

Co-authored-by: Andrew Lamb <[email protected]>

* chore(deps): bump sqllogictest from 0.28.1 to 0.28.2 (#16037)

Bumps [sqllogictest](https://github.com/risinglightdb/sqllogictest-rs) from 0.28.1 to 0.28.2.
- [Release notes](https://github.com/risinglightdb/sqllogictest-rs/releases)
- [Changelog](https://github.com/risinglightdb/sqllogictest-rs/blob/main/CHANGELOG.md)
- [Commits](https://github.com/risinglightdb/sqllogictest-rs/compare/v0.28.1...v0.28.2)

---
updated-dependencies:
- dependency-name: sqllogictest
  dependency-version: 0.28.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* chores: Add lint rule to enforce string formatting style (#16024)

* Add lint rule to enforce string formatting style

* format

* extra

* Update datafusion/ffi/src/tests/async_provider.rs

Co-authored-by: kosiew <[email protected]>

* Update datafusion/functions/src/datetime/to_date.rs

Co-authored-by: kosiew <[email protected]>

---------

Co-authored-by: kosiew <[email protected]>

* Use human-readable byte sizes in EXPLAIN (#16043)

* Docs: Add example of creating a field in `return_field_from_args` (#16039)

* Docs: Add example of creating a field in `return_field_from_args`

* fmt

* Update datafusion/expr/src/udf.rs

Co-authored-by: Oleks V <[email protected]>

* fmt

---------

Co-authored-by: Oleks V <[email protected]>

* Support `MIN` and `MAX` for `DataType::List` (#16025)

* Fix comparisons between lists that contain nulls

* Add support for lists in min/max agg functions

* Add sqllogictests

* Support lists in window frame target type

* fix: overcounting of memory in first/last. (#15924)

When aggregating first/last list over a column of lists, the first/last
accumulators hold the necessary scalar value as is, which points to the
list in the original input buffer.

This results in two issues:

1) We prevent the deallocation of the input arrays which might be
significantly larger than the single value we want to hold.

2) During aggregation with groups, many accumulators receive slices of the
same input buffer, resulting in all held values pointing to this buffer.
Then, when calculating the size of all accumulators we count the buffer
multiple times, since each accumulator considers it to be part of its own
allocation.
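
The double counting described above can be modelled with a small sketch. It uses plain `Arc<Vec<u8>>` buffers rather than Arrow arrays, and every name in it (`SharedSlice`, `reported_size`, `compact`, `make_accumulators`) is hypothetical and not part of DataFusion. Copying the held value into its own allocation is shown only as one plausible remedy, assumed here for illustration.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a scalar value that points into a shared input buffer.
struct SharedSlice {
    buffer: Arc<Vec<u8>>,
    start: usize,
    len: usize,
}

impl SharedSlice {
    /// Naive accounting: charges the whole backing buffer to this one value.
    fn reported_size(&self) -> usize {
        self.buffer.len()
    }

    /// Copies only the referenced bytes into a fresh, exactly sized allocation
    /// (an assumed remedy for this sketch, not necessarily what the PR does).
    fn compact(&self) -> Vec<u8> {
        self.buffer[self.start..self.start + self.len].to_vec()
    }
}

/// One accumulator per group, each holding a slice of the same buffer.
fn make_accumulators(buffer: &Arc<Vec<u8>>, groups: usize, len: usize) -> Vec<SharedSlice> {
    (0..groups)
        .map(|g| SharedSlice {
            buffer: Arc::clone(buffer),
            start: g * len,
            len,
        })
        .collect()
}

/// Sums the naive per-accumulator sizes; the shared buffer is counted once per group.
fn naive_total(accs: &[SharedSlice]) -> usize {
    accs.iter().map(SharedSlice::reported_size).sum()
}

/// Sums the sizes after each accumulator keeps only its own copy of the value.
fn compacted_total(accs: &[SharedSlice]) -> usize {
    accs.iter().map(|a| a.compact().len()).sum()
}
```

With a 1000-byte buffer and 10 groups holding 4 bytes each, the naive total charges the buffer ten times, while the compacted total reflects only the bytes actually held.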

* Improve docs for Exprs and scalar functions (#16036)

* Improve docs for Exprs and scalar functions

* fix links

* Add h2o window benchmark (#16003)

* h2o-window benchmark

* Review: clarify h2o-window is an extended benchmark

* fix: track coalescer's consumption (#16048)

Signed-off-by: Ruihang Xia <[email protected]>

* Fix Infer prepare statement type tests  (#15743)

* draft commit to roll back changes on function naming and include prepare clause in the infer types tests

* include data types in plan when it is not included in the prepare statement

* fix: prepare statement error

* Update datafusion/sql/src/statement.rs

Co-authored-by: Andrew Lamb <[email protected]>

* remove infer types from prepare statement

the infer data type changes in statement will be introduced in a new PR

* fix to show correct output message

* remove white space

* Restore the original tests too

---------

Co-authored-by: Andrew Lamb <[email protected]>

* fix: Clarify that it is only the name of the field that is ignored (#16052)

* style: simplify some strings for readability (#15999)

* style: simplify some strings for readability

* fix: formatting in `datafusion/` directory

* refactor: replace long `format!` string

* refactor: replace `format!` with `assert_eq!`

---------

Co-authored-by: Andrew Lamb <[email protected]>

* support simple/cross lateral joins (#16015)

* support simple lateral joins

Signed-off-by: Alex Chi Z <[email protected]>

* fix explain test

Signed-off-by: Alex Chi Z <[email protected]>

* plan scalar agg correctly

Signed-off-by: Alex Chi Z <[email protected]>

* add uncorrelated query tests

Signed-off-by: Alex Chi Z <[email protected]>

* fix clippy + fmt

Signed-off-by: Alex Chi Z <[email protected]>

* make rule matching faster

Signed-off-by: Alex Chi Z <[email protected]>

* revert build_join visibility

Signed-off-by: Alex Chi Z <[email protected]>

* revert find plan outer column changes

Signed-off-by: Alex Chi Z <[email protected]>

* remove clone

* address comment

---------

Signed-off-by: Alex Chi Z <[email protected]>
Co-authored-by: Alex Chi Z <[email protected]>

* Make error msg for oom human readable (#16050)

* chore(deps): bump the arrow-parquet group with 7 updates (#16047)

* chore(deps): bump the arrow-parquet group with 7 updates

Bumps the arrow-parquet group with 7 updates:

| Package | From | To |
| --- | --- | --- |
| [arrow](https://github.com/apache/arrow-rs) | `55.0.0` | `55.1.0` |
| [arrow-buffer](https://github.com/apache/arrow-rs) | `55.0.0` | `55.1.0` |
| [arrow-flight](https://github.com/apache/arrow-rs) | `55.0.0` | `55.1.0` |
| [arrow-ipc](https://github.com/apache/arrow-rs) | `55.0.0` | `55.1.0` |
| [arrow-ord](https://github.com/apache/arrow-rs) | `55.0.0` | `55.1.0` |
| [arrow-schema](https://github.com/apache/arrow-rs) | `55.0.0` | `55.1.0` |
| [parquet](https://github.com/apache/arrow-rs) | `55.0.0` | `55.1.0` |


Updates `arrow` from 55.0.0 to 55.1.0
- [Release notes](https://github.com/apache/arrow-rs/releases)
- [Changelog](https://github.com/apache/arrow-rs/blob/main/CHANGELOG-old.md)
- [Commits](https://github.com/apache/arrow-rs/compare/55.0.0...55.1.0)

Updates `arrow-buffer` from 55.0.0 to 55.1.0
- [Release notes](https://github.com/apache/arrow-rs/releases)
- [Changelog](https://github.com/apache/arrow-rs/blob/main/CHANGELOG-old.md)
- [Commits](https://github.com/apache/arrow-rs/compare/55.0.0...55.1.0)

Updates `arrow-flight` from 55.0.0 to 55.1.0
- [Release notes](https://github.com/apache/arrow-rs/releases)
- [Changelog](https://github.com/apache/arrow-rs/blob/main/CHANGELOG-old.md)
- [Commits](https://github.com/apache/arrow-rs/compare/55.0.0...55.1.0)

Updates `arrow-ipc` from 55.0.0 to 55.1.0
- [Release notes](https://github.com/apache/arrow-rs/releases)
- [Changelog](https://github.com/apache/arrow-rs/blob/main/CHANGELOG-old.md)
- [Commits](https://github.com/apache/arrow-rs/compare/55.0.0...55.1.0)

Updates `arrow-ord` from 55.0.0 to 55.1.0
- [Release notes](https://github.com/apache/arrow-rs/releases)
- [Changelog](https://github.com/apache/arrow-rs/blob/main/CHANGELOG-old.md)
- [Commits](https://github.com/apache/arrow-rs/compare/55.0.0...55.1.0)

Updates `arrow-schema` from 55.0.0 to 55.1.0
- [Release notes](https://github.com/apache/arrow-rs/releases)
- [Changelog](https://github.com/apache/arrow-rs/blob/main/CHANGELOG-old.md)
- [Commits](https://github.com/apache/arrow-rs/compare/55.0.0...55.1.0)

Updates `parquet` from 55.0.0 to 55.1.0
- [Release notes](https://github.com/apache/arrow-rs/releases)
- [Changelog](https://github.com/apache/arrow-rs/blob/main/CHANGELOG-old.md)
- [Commits](https://github.com/apache/arrow-rs/compare/55.0.0...55.1.0)

---
updated-dependencies:
- dependency-name: arrow
  dependency-version: 55.1.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: arrow-parquet
- dependency-name: arrow-buffer
  dependency-version: 55.1.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: arrow-parquet
- dependency-name: arrow-flight
  dependency-version: 55.1.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: arrow-parquet
- dependency-name: arrow-ipc
  dependency-version: 55.1.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: arrow-parquet
- dependency-name: arrow-ord
  dependency-version: 55.1.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: arrow-parquet
- dependency-name: arrow-schema
  dependency-version: 55.1.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: arrow-parquet
- dependency-name: parquet
  dependency-version: 55.1.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: arrow-parquet
...

Signed-off-by: dependabot[bot] <[email protected]>

* Update sqllogictest results

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Andrew Lamb <[email protected]>

* chore(deps): bump petgraph from 0.7.1 to 0.8.1 (#15669)

Bumps [petgraph](https://github.com/petgraph/petgraph) from 0.7.1 to 0.8.1.
- [Release notes](https://github.com/petgraph/petgraph/releases)
- [Changelog](https://github.com/petgraph/petgraph/blob/master/CHANGELOG.md)
- [Commits](https://github.com/petgraph/petgraph/compare/[email protected]@v0.8.1)

---
updated-dependencies:
- dependency-name: petgraph
  dependency-version: 0.8.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* [datafusion-spark] Add Spark-compatible `char` expression (#15994)

* Add Spark-compatible char expression

* Add slt test

* [Docs]: Added SQL example for all window functions (#16074)

* update window function

* pretier fix

* Update window_functions.md

* chore(deps): bump substrait from 0.55.1 to 0.56.0 (#16091)

Bumps [substrait](https://github.com/substrait-io/substrait-rs) from 0.55.1 to 0.56.0.
- [Release notes](https://github.com/substrait-io/substrait-rs/releases)
- [Changelog](https://github.com/substrait-io/substrait-rs/blob/main/CHANGELOG.md)
- [Commits](https://github.com/substrait-io/substrait-rs/compare/v0.55.1...v0.56.0)

---
updated-dependencies:
- dependency-name: substrait
  dependency-version: 0.56.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Add test for collect_statistics (#16098)

* Add window function examples in code (#16102)

* Refactor substrait producer into multiple files (#16089)

* Fix temp dir leak in tests (#16094)

`TempDir::into_path` "leaks" the temp dir. This updates the `tempfile`
crate to a version where this method is deprecated and fixes all usages.

* Label Spark functions PRs with spark label (#16095)

* Rename Labeler workflow file name

Sync workflow name and its file name.

* Fix typo in Labeler config

* Label Spark functions PRs with `spark` label

* feat: add slt tests for imdb data (#16067)

* fix: stack overflow for substrait functions with large argument lists that translate to DataFusion binary operators   (#16031)

* Add substrait consumer test causing a stack overflow

* Mitigate stack overflow for substrait binary op with large arg list

When transforming a substrait function call to DataFusion logical plan,
if the substrait function maps to a DataFusion binary operator, but has
more than 2 arguments, it is mapped to a tree of BinaryExpr. This
BinaryExpr tree is not balanced, and its depth is the number of
arguments:

       Op
      /  \
    arg1  Op
         /  \
       arg2  ...
             /  \
           argN  Op

Since many functions manipulating the logical plan are recursive, it
means that N arguments result in an O(N) recursion, leading to stack
overflows for large N (1000 for example).

Transforming these function calls into a balanced tree mitigates the
issue:

             .__ Op __.
            /          \
          Op            Op
         /  \          /  \
       ...  ...      ...  ...
      /  \  /  \    /  \  /  \
    arg1        ...          argN

The recursion depth is now O(log2(N)), meaning that 1000 arguments
result in a depth of ~10, and it would take 2^1000 arguments to reach a
depth of 1000, which is a vastly unreasonable amount of data.

Therefore, it's not possible to use this flaw anymore to trigger stack
overflows in processes running DataFusion.
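
As a rough illustration of the depth argument above, the following sketch compares a right-leaning chain with a balanced tree. It uses a toy `Expr` enum as a stand-in for DataFusion's expression type; `chain`, `balanced`, and `depth` are hypothetical helpers written for this example and are not the code from the PR (the PR's function is `arg_list_to_binary_op_tree`, whose exact signature is not shown here).

```rust
/// Toy expression type, a hypothetical stand-in for DataFusion's `Expr`.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Arg(usize),
    Binary(Box<Expr>, Box<Expr>),
}

/// Builds the unbalanced, right-leaning chain: depth grows linearly with
/// the number of arguments.
fn chain(mut args: Vec<Expr>) -> Option<Expr> {
    let mut acc = args.pop()?;
    while let Some(arg) = args.pop() {
        acc = Expr::Binary(Box::new(arg), Box::new(acc));
    }
    Some(acc)
}

/// Builds a balanced tree by splitting the argument list in half at each
/// level, preserving argument order; depth grows logarithmically.
fn balanced(mut args: Vec<Expr>) -> Option<Expr> {
    match args.len() {
        0 => None,
        1 => args.pop(),
        n => {
            let right = args.split_off(n / 2);
            let left = balanced(args)?;
            let right = balanced(right)?;
            Some(Expr::Binary(Box::new(left), Box::new(right)))
        }
    }
}

/// Measures depth (counting nodes on the longest root-to-leaf path) with an
/// explicit stack, so the measurement itself does not recurse.
fn depth(expr: &Expr) -> usize {
    let mut max = 0;
    let mut stack = vec![(expr, 1usize)];
    while let Some((e, d)) = stack.pop() {
        max = max.max(d);
        if let Expr::Binary(l, r) = e {
            stack.push((l, d + 1));
            stack.push((r, d + 1));
        }
    }
    max
}
```

For 1000 arguments, the chain has 1000 levels while the balanced tree has 11 (ten operator levels plus the leaves), which matches the "~10" estimate in the text.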

* arg_list_to_binary_op_tree: avoid cloning Expr

* cargo fmt

* from_scalar_function: improve error handling

* Move test_binary_op_large_argument_list test to scalar_function module

* arg_list_to_binary_op_tree: add more unit tests

Courtesy of @gabotechs

* substrait consumer scalar_function tests: more explicit function name

---------

Co-authored-by: Andrew Lamb <[email protected]>

* chore: Remove SMJ experimental status (#16072)

* chore(CI) Update workspace / CI to Rust 1.87 (#16068)

Co-authored-by: Andrew Lamb <[email protected]>

* minor: Add benchmark query and corresponding documentation for Average Duration (#16105)

* ADD query and documentation

* Prettier

---------

Co-authored-by: Andrew Lamb <[email protected]>

* feat: metadata handling for aggregates and window functions (#15911)

* Move expr_schema to use return_field instead of return_type

* More work on moving to Field from DataType for aggregates

* Update field output name for aggregates

* Improve unit test for aggregate udf with metadata

* Move window functions over to use Field instead of DataType

* Correct nullability flag

* Add import after rebase

* Add unit test for using udaf as window function with metadata processing

* Update documentation for migration guide

* Update naming from data type to field to match the actual parameters passed

* Avoid some allocations

* Update docs to use aggregate example

---------

Co-authored-by: Andrew Lamb <[email protected]>

* doc: fix indent format explain (#16085)

* doc: fix indent format explain

* update

* fix: coerce int96 resolution inside of list, struct, and map types (#16058)

* Add test generated from schema in Comet.

* Checkpoint DFS.

* Checkpoint with working transformation.

* fmt, clippy fixes.

* Remove maximum stack depth.

* More testing.

* Improve tests.

* Improve docs.

* Use a smaller HashSet instead of HashMap with every field in it. More docs.

* Use a smaller HashSet instead of HashMap with every field in it. More docs.

* More docs.

* More docs.

* Fix typo.

* Refactor match with nested if lets to make it more readable.

* Address some PR feedback.

* Rename variables in struct processing to address PR feedback. Do List next.

* Rename variables in list processing to address PR feedback.

* Update docs.

* Simplify list parquet path generation.

* Map support.

* Remove old TODO.

* Reduce redundant docs by referring to docs above.

* Reduce redundant docs by referring to docs above.

* Add parquet file generated from CometFuzzTestSuite ParquetGenerator (similar to schema in file_format tests) to exercise end-to-end support.

* Fix clippy.

* Update documentation for `datafusion.execution.collect_statistics` (#16100)

* Update documentation for `datafusion.execution.collect_statistics` setting

* Update test

* Update datafusion/common/src/config.rs

Co-authored-by: Leonardo Yvens <[email protected]>

* update docs

* Update doc

---------

Co-authored-by: Leonardo Yvens <[email protected]>

* fix: Add coercion rules for Float16 types (#15816)

* handle coercion for Float16 types

* Add some basic slt tests

---------

Co-authored-by: Andrew Lamb <[email protected]>

* Use qualified names on DELETE selections (#16033)

Co-authored-by: Andrew Lamb <[email protected]>

* chore(deps): bump testcontainers from 0.23.3 to 0.24.0 (#15989)

* chore(deps): bump testcontainers from 0.23.3 to 0.24.0

Bumps [testcontainers](https://github.com/testcontainers/testcontainers-rs) from 0.23.3 to 0.24.0.
- [Release notes](https://github.com/testcontainers/testcontainers-rs/releases)
- [Changelog](https://github.com/testcontainers/testcontainers-rs/blob/main/CHANGELOG.md)
- [Commits](https://github.com/testcontainers/testcontainers-rs/compare/0.23.3...0.24.0)

---
updated-dependencies:
- dependency-name: testcontainers
  dependency-version: 0.24.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>

* Update test_containers_modules too

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Andrew Lamb <[email protected]>

* feat: make error handling in indent explain consistent with that in tree (#16097)

* feat: make error handling in indent consistent with that in tree

* update test

* return all plans instead of throwing err

* update test

* Clean up ExternalSorter and use upstream converter (#16109)

* Support `GroupsAccumulator` for Avg duration (#15748)

* Support GroupsAccumulator for avg duration

* update test

---------

Co-authored-by: Andrew Lamb <[email protected]>

* Test Duration in `fuzz` tests (#16111)

* Move PruningStatistics into datafusion::common (#16069)

* Move PruningStatistics into datafusion::common

* fix doc

* remove new code

* fmt

* Revert use file schema in parquet pruning (#16086)

* wip

* comment

* Update datafusion/core/src/datasource/physical_plan/parquet.rs

* remove prints

* better test

* fmt

* Make `SessionContext::register_parquet` obey `collect_statistics` config (#16080)

* fix

* add a test

* fmt

* add to upgrade guide

* fix tests

* fix test

* fix test

* fix ci

* Fix example in upgrade guide (#29)

---------

Co-authored-by: Andrew Lamb <[email protected]>

* fix: describe escaped quoted identifiers (#16082)

* feat: escape quote wrap identifiers in describe

rm: dev files

fmt: final formatting

sed: s/<comment>//

* fix: use ident instead of col + format

* Minor: Add `ScalarFunctionArgs::return_type` method (#16113)

* feat: coerce from fixed size binary to binary view (#16110)

* Improve the DML / DDL Documentation (#16115)

* Update documentation about DDL and DML

* Improve the DML Documentation

* Apply suggestions from code review

Co-authored-by: Oleks V <[email protected]>

* Fix docs

* Fix docs

---------

Co-authored-by: Oleks V <[email protected]>

* Fix `contains` function expression (#16046)

* Optimize performance of `string::ascii` function (#16087)

* Optimize performance of string::ascii function

d

* Add benchmark with NULL_DENSITY=0

d

---------

Co-authored-by: Tai Le Manh <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>

* chore: Use materialized data for filter pushdown tests (#16123)

* chore: Use pre created data for filter pushdown tests

* chore: Use pre created data for filter pushdown tests

* chore: Upgrade rand crate and some other minor crates (#16062)

* chore: Upgrade `rand` crate and some other minor crates

---------

Co-authored-by: Andrew Lamb <[email protected]>

* Include data types in logical plans of inferred prepare statements (#16019)

* draft commit to roll back changes on function naming and include prepare clause in the infer types tests

* include data types in plan when it is not included in the prepare statement

* fix: prepare statement error

* Update datafusion/sql/src/statement.rs

Co-authored-by: Andrew Lamb <[email protected]>

* remove infer types from prepare statement

the infer data type changes in statement will be introduced in a new PR

* fix to show correct output message

* include data types on logical plans of prepare statements without explicit type declaration

* fix using clippy suggestions

* explicitly get the data types using the placeholder id to avoid sorting

* Restore the original tests too

* update set data type routine to be more rust idiomatic

Co-authored-by: Tommy shu <[email protected]>

* update set datatype routine

* fix formatting in sql_integration

---------

Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Tommy shu <[email protected]>

* docs: Fix typos and minor grammatical issues in Architecture docs (#16119)

* minor fixes to arch docs


Co-authored-by: Oleks V <[email protected]>

---------

Co-authored-by: Oleks V <[email protected]>

* add top-memory-consumers option in cli (#16081)

add snapshot tests for memory exhaustion

* fix ci extended test (#16144)

* Fix: handle column name collisions when combining UNION logical inputs & nested Column expressions in maybe_fix_physical_column_name (#16064)

* Fix union schema name coercion

* Address renaming for columns that are not in the top level as well

* Add unit test

* Format

* Use insta tests properly

* Address review - comment + minor simplification change

---------

Co-authored-by: Berkay Şahin <[email protected]>

* adding support for Min/Max over LargeList and FixedSizeList (#16071)

* initial Iteration

* add Sql Logic tests

* tweak comments

* unify data, structure tests

* Deleted by mistake

* Move prepare/parameter handling tests into `params.rs` (#16141)

* Move prepare/parameter handling tests into `params.rs`

* Resolve conflicts

* Add `StateFieldsArgs::return_field` (#16112)

* Support filtering specific sqllogictests identified by line number (#16029)

* Support filtering specific sqllogictests identified by line number

* Add license header

* Try parsing in different dialects

* Add test filtering example to README.md

* Improve Filter doc comment

* Factor out statement_is_skippable into its own function

* Add example about how filters work in the doc comments

* Enrich GroupedHashAggregateStream name to ease debugging Resources exhausted errors (#16152)

* Enrich GroupedHashAggregateStream name to ease debugging Resources exhausted errors

* Use human_display

* clippy

* chore(deps): bump uuid from 1.16.0 to 1.17.0 (#16162)

Bumps [uuid](https://github.com/uuid-rs/uuid) from 1.16.0 to 1.17.0.
- [Release notes](https://github.com/uuid-rs/uuid/releases)
- [Commits](https://github.com/uuid-rs/uuid/compare/v1.16.0...v1.17.0)

---
updated-dependencies:
- dependency-name: uuid
  dependency-version: 1.17.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Minor: Fix links in substrait readme (#16156)

* Remove Filter::having field (#16154)

Both WHERE clause and HAVING clause translate to a Filter plan node.
They differ in how the references and aggregates are handled.
HAVING goes after aggregation and may reference aggregate expressions
and therefore HAVING's filter will be placed after Aggregation plan
node.

Once a plan has been built, however, there are no special additional
semantics to filters created from HAVING. Remove the unnecessary field.

For reference, the field was added along with usage in
a50aeefcbfc84d491495887d57fa8ebc0db57ff2 commit and the usage was later
removed in eb62e2871e49c402ec7b0d25658faa6dc5219969 commit.

* Clarify docs and names in parquet predicate pushdown tests (#16155)

* Clarify docs and names in parquet predicate pushdown tests

* Update datafusion/datasource/src/file_scan_config.rs

Co-authored-by: Adrian Garcia Badaracco <[email protected]>

* clippy

---------

Co-authored-by: Adrian Garcia Badaracco <[email protected]>

* Minor: Fix name() for FilterPushdown physical optimizer rule (#16175)

* Fix name() for FilterPushdown physical optimizer rule

Typo that wasn't caught during review...

* fix

* migrate tests in `pool.rs` to use insta (#16145)

fix according to review

fix to_string error

fix test by stripping backtrace

* refactor(optimizer): add `.with_schema` for defining test tables (#16138)

Added `tables: HashMap<String, Arc<dyn TableSource>>` and `MyContextProvider::with_schema` method for dynamically defining tables for optimizer integration tests.

* [Minor] Speedup TPC-H benchmark run with memtable option (#16159)

* Speedup tpch run with memtable

* Clippy

* Clippy

* Fast path for joins with distinct values in build side (#16153)

* Specialize unique join

* handle splitting

* rename a bit

* fix

* fix

* fix

* fix

* Fix the test, add explanation

* Simplify

* Update datafusion/physical-plan/src/joins/join_hash_map.rs

Co-authored-by: Christian <[email protected]>

* Update datafusion/physical-plan/src/joins/join_hash_map.rs

Co-authored-by: Christian <[email protected]>

* Simplify

* Simplify

* Simplify

---------

Co-authored-by: Christian <[email protected]>

* chore: Reduce repetition in the parameter type inference tests (#16079)

* added test

* added parameterTest

* cargo fmt

* Update sql_integration.rs

* allow needless_lifetimes

* remove needless lifetime

* update some tests

* move to params.rs

* feat: array_length for fixed size list (#16167)

* feat: array_length for fixed size list

* remove list view

* fix: remove trailing whitespace in `Display` for `LogicalPlan::Projection` (#16164)

* chore(deps): bump tokio from 1.45.0 to 1.45.1 (#16190)

Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.45.0 to 1.45.1.
- [Release notes](https://github.com/tokio-rs/tokio/releases)
- [Commits](https://github.com/tokio-rs/tokio/compare/tokio-1.45.0...tokio-1.45.1)

---
updated-dependencies:
- dependency-name: tokio
  dependency-version: 1.45.1
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Improve `unproject_sort_expr` to handle arbitrary expressions (#16127)

* Add failing test to demonstrate problem

* Improve `unproject_sort_expr` to handle arbitrary expressions (#83)

* Remove redundant return

* chore(deps): bump rustyline from 15.0.0 to 16.0.0 (#16194)

Bumps [rustyline](https://github.com/kkawakam/rustyline) from 15.0.0 to 16.0.0.
- [Release notes](https://github.com/kkawakam/rustyline/releases)
- [Changelog](https://github.com/kkawakam/rustyline/blob/master/History.md)
- [Commits](https://github.com/kkawakam/rustyline/compare/v15.0.0...v16.0.0)

---
updated-dependencies:
- dependency-name: rustyline
  dependency-version: 16.0.0
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* feat: ADD sha2 spark function (#16168)

ADD sha2 spark function

* Add macro for creating DataFrame (#16090) (#16104)

* Add macro for creating DataFr…
Labels
functions Changes to functions implementation sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[DISCUSSION] Add separate crate to cover spark builtin functions
4 participants