-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Support WITHIN GROUP syntax to standardize certain existing aggregate functions #13511
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
a9b901a
0918000
070a96b
3fd92fd
4082a78
c3be3c6
9fd05a3
8518a59
79669d9
597f4d7
d3b483c
a827c9d
23bdf70
97d96ca
1b61b5b
d0fdde3
3c8bce3
be99a35
f9aa1fc
d5f0b62
d7f2f59
7ef2139
91565b3
d482bff
005a27c
1179bc4
e5fc1a4
cf4faad
fc7d2bc
5469e39
d96b667
293d33e
ecdb21b
d65420e
b6d426a
4b0c52f
36a732d
8d6db85
db0355a
37b783e
124d8c5
3259c95
57c4281
40d7055
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -34,6 +34,7 @@ use datafusion_common::{ | |
downcast_value, internal_err, not_impl_datafusion_err, not_impl_err, plan_err, | ||
Result, ScalarValue, | ||
}; | ||
use datafusion_expr::expr::{AggregateFunction, Sort}; | ||
use datafusion_expr::function::{AccumulatorArgs, StateFieldsArgs}; | ||
use datafusion_expr::type_coercion::aggregates::{INTEGERS, NUMERICS}; | ||
use datafusion_expr::utils::format_state_name; | ||
|
@@ -51,29 +52,39 @@ create_func!(ApproxPercentileCont, approx_percentile_cont_udaf); | |
|
||
/// Computes the approximate percentile continuous of a set of numbers | ||
pub fn approx_percentile_cont( | ||
expression: Expr, | ||
order_by: Sort, | ||
percentile: Expr, | ||
centroids: Option<Expr>, | ||
) -> Expr { | ||
let expr = order_by.expr.clone(); | ||
|
||
let args = if let Some(centroids) = centroids { | ||
vec![expression, percentile, centroids] | ||
vec![expr, percentile, centroids] | ||
} else { | ||
vec![expression, percentile] | ||
vec![expr, percentile] | ||
}; | ||
approx_percentile_cont_udaf().call(args) | ||
|
||
Expr::AggregateFunction(AggregateFunction::new_udf( | ||
approx_percentile_cont_udaf(), | ||
args, | ||
false, | ||
None, | ||
Some(vec![order_by]), | ||
None, | ||
)) | ||
} | ||
|
||
#[user_doc( | ||
doc_section(label = "Approximate Functions"), | ||
description = "Returns the approximate percentile of input values using the t-digest algorithm.", | ||
syntax_example = "approx_percentile_cont(expression, percentile, centroids)", | ||
syntax_example = "approx_percentile_cont(percentile, centroids) WITHIN GROUP (ORDER BY expression)", | ||
sql_example = r#"```sql | ||
> SELECT approx_percentile_cont(column_name, 0.75, 100) FROM table_name; | ||
+-------------------------------------------------+ | ||
| approx_percentile_cont(column_name, 0.75, 100) | | ||
+-------------------------------------------------+ | ||
| 65.0 | | ||
+-------------------------------------------------+ | ||
> SELECT approx_percentile_cont(0.75, 100) WITHIN GROUP (ORDER BY column_name) FROM table_name; | ||
+-----------------------------------------------------------------------+ | ||
| approx_percentile_cont(0.75, 100) WITHIN GROUP (ORDER BY column_name) | | ||
+-----------------------------------------------------------------------+ | ||
| 65.0 | | ||
+-----------------------------------------------------------------------+ | ||
```"#, | ||
standard_argument(name = "expression",), | ||
argument( | ||
|
@@ -130,6 +141,19 @@ impl ApproxPercentileCont { | |
args: AccumulatorArgs, | ||
) -> Result<ApproxPercentileAccumulator> { | ||
let percentile = validate_input_percentile_expr(&args.exprs[1])?; | ||
|
||
let is_descending = args | ||
.ordering_req | ||
.first() | ||
.map(|sort_expr| sort_expr.options.descending) | ||
.unwrap_or(false); | ||
|
||
let percentile = if is_descending { | ||
1.0 - percentile | ||
} else { | ||
percentile | ||
}; | ||
Comment on lines
+151
to
+155
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I used floating point subtraction instead of actual sorting in reverse order, for conciseness. If any slight floating point difference is not permitted (even if this branch passed the tests), please let me know. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This seems reasonable to me, but I don't have that much experience on the execution side of things. |
||
|
||
let tdigest_max_size = if args.exprs.len() == 3 { | ||
Some(validate_input_max_size_expr(&args.exprs[2])?) | ||
} else { | ||
|
@@ -292,6 +316,14 @@ impl AggregateUDFImpl for ApproxPercentileCont { | |
Ok(arg_types[0].clone()) | ||
} | ||
|
||
fn supports_null_handling_clause(&self) -> bool { | ||
false | ||
} | ||
|
||
fn is_ordered_set_aggregate(&self) -> bool { | ||
true | ||
} | ||
|
||
fn documentation(&self) -> Option<&Documentation> { | ||
self.doc() | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this something we need? From what I know, there aren't any aggregate functions that have options for null handling. At the moment, the 2 overrides you have of this both return
Some(false)
, which is what I would consider the default value anyways.Speaking of which, if we do need this, do we need to return an
Optional<bool>
or could we just returnbool
directly?Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some aggregate functions using null handling in current datafusion.
(cf. If this is something we need to discuss/fix, then I can make another git issue. Or, I can refactor it too in this PR. I left this comment because I am not 100% sure about the SQL standard.)
And I refactored the function to just return
bool
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was smelling odd, so I dug a bit deeper. I think you've inadvertantly stumbled into something even weirder than you anticipated
The example you've linked is
which I don't think is a valid query because
first_value
should not be an aggregate function, or at the very least the above query is not valid in most SQL dialects.first_value
is actually a window function in other engines (eg. Trino, Postgres, MySQL).If you try running something like
against Postgres you get an error like
dbfiddle
The
RESPECT NULLS | IGNORE NULLS
options is only a property of certain window functions, hence we shouldn't need to track it for aggregate functions.I'm going to file a ticket for the above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed #15006