Skip to content

Conversation

findepi
Copy link
Member

@findepi findepi commented Aug 12, 2025

Before the change, it was disallowed to have an aggregation without
GROUP BY and without any aggregate functions. This prevented the
optimizer from removing each redundant aggregation if all were
redundant. The first one would always be retained.

This PR removes this optimizer limitation, if a global aggregation has
all aggregate functions redundant, it's replaced with 1-row VALUES.

@findepi findepi added the performance Make DataFusion faster label Aug 12, 2025
@github-actions github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Aug 12, 2025
@findepi findepi force-pushed the findepi/eliminate-all-redundant-aggs branch from f3601c2 to cb49bfc Compare August 12, 2025 09:33
@findepi findepi force-pushed the findepi/eliminate-all-redundant-aggs branch from cb49bfc to 4c92ae8 Compare August 12, 2025 09:56
@github-actions github-actions bot added the core Core DataFusion crate label Aug 12, 2025
Before the change, it was disallowed to have an aggregation without
GROUP BY and without any aggregate functions. This prevented the
optimizer from removing each redundant aggregation if all were
redundant. The first one would always be retained. This commit removes
the limitation, allowing for queries to be further optimized.
Projection:
Aggregate: groupBy=[[]], aggr=[[count(Int32(1))]]
TableScan: ?table? projection=[]
EmptyRelation
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

table scan gets replaced with 1-row VALUES (that would be more visible if we merge #17145)

we could further eliminate count(*) on top of a relation with known cardinality. follow-up potential

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @findepi -- this seems like an improvement to me

I have some suggestions but nothing I think is required before merging

let new_aggr_expr = aggregate_reqs.get_at_indices(&aggregate.aggr_expr);

if new_group_bys.is_empty() && new_aggr_expr.is_empty() {
// Global aggregation with no aggregate functions always produces 1 row and no columns.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems like a good rule to have, though perhaps it would be eaiser to find if it were with related rules for aggregates, for example

https://github.com/apache/datafusion/blob/main/datafusion/optimizer/src/eliminate_group_by_constant.rs

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO here is a natural place to place this logic. We're pruning and we need to somehow handle the case where we prune the last aggregate function. The code should do something reasonable and now it does.
alternative is to let this place create Agg with empty group by and no aggregate functions (I had so initially), and then have a separate rule that finds such trivial Aggs and replaces with VALUES.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping it here just makes sense

@@ -0,0 +1,36 @@
statement ok
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend we name this file more specifically, or put it in an existing function, perhaps aggregate.slt ?

Not a blocker, just a suggestion

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This showed up as a regression so i named this file as a regression test.
We have regression tests stucked in other files, but this creates risk of inadvertently changing the scenario and so thus breaking the regression properties of a test. They are better of isolated.
Would it help if this file is moved under a directory for just regression cases?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, maybe that would be better. But if there is already precedent for regression cases perhaps we can leave this PR as is and move it in a subsequent PR

04)----Projection:
05)------Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]]
06)--------TableScan: t2 projection=[]
04)----EmptyRelation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like an improvement to me

return Ok(Transformed::yes(LogicalPlan::EmptyRelation(
EmptyRelation {
produce_one_row: true,
schema: Arc::new(DFSchema::empty()),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not related to this PR, but why we need schema for EmptyRelation, the structure description presumes the schema always empty?

/// Produces no rows: An empty relation with an empty schema
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct EmptyRelation

We probably can simplify using this relationship

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UPD: actually the description is not accurate, EmptyRelation used with non empty schema in joins and some other places, we need to update the description

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we don't produce any symbols, but maybe we use EmptyRelation also for other cases?
For exmple SELECT a, b, c FROM t WHERE false could be replaced with EmptyRelation, but something needs to project a, b, c symbols. It can be a project above EmptyRelation or part of EmptyRelation itself.

A different question -- why do we have both, EmptyRelation and Values?
The latter is strictly more generic, without any visible downsides. It even has a better name (EmptyRelation is only conditionally an empty relation).
Can we deprecate EmptyRelation and replace with Values?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both structs have much in common, but EmptyRelation also serves for producing 0 rows, not sure if that can work for Values?

Although EmptyRelation with produce_one_row == true that probably feasible to replace with Values 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both structs have much in common, but EmptyRelation also serves for producing 0 rows, not sure if that can work for Values?

Why not?
SQL doesn't allow empty values, but I see no reason for LP not to allow them.

@findepi
Copy link
Member Author

findepi commented Aug 12, 2025

@findepi findepi merged commit b786b9a into apache:main Aug 12, 2025
27 checks passed
@findepi findepi deleted the findepi/eliminate-all-redundant-aggs branch August 12, 2025 20:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules performance Make DataFusion faster sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Redundant aggregation elimination regression
3 participants