-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Eliminate all redundant aggregations #17139
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Eliminate all redundant aggregations #17139
Conversation
f3601c2
to
cb49bfc
Compare
cb49bfc
to
4c92ae8
Compare
Before the change, it was disallowed to have an aggregation without GROUP BY and without any aggregate functions. This prevented the optimizer from removing each redundant aggregation if all were redundant. The first one would always be retained. This commit removes the limitation, allowing for queries to be further optimized.
4c92ae8
to
9e02646
Compare
Projection: | ||
Aggregate: groupBy=[[]], aggr=[[count(Int32(1))]] | ||
TableScan: ?table? projection=[] | ||
EmptyRelation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
table scan gets replaced with 1-row VALUES (that would be more visible if we merge #17145)
we could further eliminate count(*)
on top of a relation with known cardinality. follow-up potential
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @findepi -- this seems like an improvement to me
I have some suggestions but nothing I think is required before merging
let new_aggr_expr = aggregate_reqs.get_at_indices(&aggregate.aggr_expr); | ||
|
||
if new_group_bys.is_empty() && new_aggr_expr.is_empty() { | ||
// Global aggregation with no aggregate functions always produces 1 row and no columns. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems like a good rule to have, though perhaps it would be eaiser to find if it were with related rules for aggregates, for example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO here is a natural place to place this logic. We're pruning and we need to somehow handle the case where we prune the last aggregate function. The code should do something reasonable and now it does.
alternative is to let this place create Agg with empty group by and no aggregate functions (I had so initially), and then have a separate rule that finds such trivial Aggs and replaces with VALUES.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keeping it here just makes sense
@@ -0,0 +1,36 @@ | |||
statement ok |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend we name this file more specifically, or put it in an existing function, perhaps aggregate.slt ?
Not a blocker, just a suggestion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This showed up as a regression so i named this file as a regression test.
We have regression tests stucked in other files, but this creates risk of inadvertently changing the scenario and so thus breaking the regression properties of a test. They are better of isolated.
Would it help if this file is moved under a directory for just regression cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, maybe that would be better. But if there is already precedent for regression cases perhaps we can leave this PR as is and move it in a subsequent PR
04)----Projection: | ||
05)------Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]] | ||
06)--------TableScan: t2 projection=[] | ||
04)----EmptyRelation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems like an improvement to me
return Ok(Transformed::yes(LogicalPlan::EmptyRelation( | ||
EmptyRelation { | ||
produce_one_row: true, | ||
schema: Arc::new(DFSchema::empty()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not related to this PR, but why we need schema for EmptyRelation
, the structure description presumes the schema always empty?
/// Produces no rows: An empty relation with an empty schema
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct EmptyRelation
We probably can simplify using this relationship
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UPD: actually the description is not accurate, EmptyRelation used with non empty schema in joins and some other places, we need to update the description
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we don't produce any symbols, but maybe we use EmptyRelation
also for other cases?
For exmple SELECT a, b, c FROM t WHERE false
could be replaced with EmptyRelation
, but something needs to project a, b, c
symbols. It can be a project above EmptyRelation
or part of EmptyRelation
itself.
A different question -- why do we have both, EmptyRelation
and Values
?
The latter is strictly more generic, without any visible downsides. It even has a better name (EmptyRelation
is only conditionally an empty relation).
Can we deprecate EmptyRelation
and replace with Values
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both structs have much in common, but EmptyRelation
also serves for producing 0 rows, not sure if that can work for Values?
Although EmptyRelation
with produce_one_row == true that probably feasible to replace with Values 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both structs have much in common, but
EmptyRelation
also serves for producing 0 rows, not sure if that can work for Values?
Why not?
SQL doesn't allow empty values, but I see no reason for LP not to allow them.
Merging to unblock |
Before the change, it was disallowed to have an aggregation without
GROUP BY and without any aggregate functions. This prevented the
optimizer from removing each redundant aggregation if all were
redundant. The first one would always be retained.
This PR removes this optimizer limitation, if a global aggregation has
all aggregate functions redundant, it's replaced with 1-row VALUES.