[SPARK-53094][SQL] Fix CUBE with aggregate containing HAVING clauses #51820

peter-toth · 2025-08-04T20:40:38Z

What changes were proposed in this pull request?

This is an alternative PR to #51810 to fix a regression introduced in Spark 3.2 with #32470.
This PR defers the resolution of not fully resolved UnresolvedHaving nodes from ResolveGroupingAnalytics:

=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGroupingAnalytics ===
 'Sort ['s DESC NULLS LAST], true                                                                                               'Sort ['s DESC NULLS LAST], true
!+- 'UnresolvedHaving ('count('product) > 2)                                                                                    +- 'UnresolvedHaving ('count(tempresolvedcolumn(product#261, product, false)) > 2)
!   +- 'Aggregate [cube(Vector(0), Vector(1), product#261, region#262)], [product#261, region#262, sum(amount#263) AS s#264L]      +- Aggregate [product#269, region#270, spark_grouping_id#268L], [product#269, region#270, sum(amount#263) AS s#264L]
!      +- SubqueryAlias t                                                                                                             +- Expand [[product#261, region#262, amount#263, product#266, region#267, 0], [product#261, region#262, amount#263, product#266, null, 1], [product#261, region#262, amount#263, null, region#267, 2], [product#261, region#262, amount#263, null, null, 3]], [product#261, region#262, amount#263, product#269, region#270, spark_grouping_id#268L]
!         +- LocalRelation [product#261, region#262, amount#263]                                                                         +- Project [product#261, region#262, amount#263, product#261 AS product#266, region#262 AS region#267]
!                                                                                                                                           +- SubqueryAlias t
!                                                                                                                                              +- LocalRelation [product#261, region#262, amount#263]

to ResolveAggregateFunctions to add the correct aggregate expressions (count(product#261)):

=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions ===
 'Sort ['s DESC NULLS LAST], true                                                                                                                                                                                                                                                                                                                             'Sort ['s DESC NULLS LAST], true
!+- 'UnresolvedHaving (count(tempresolvedcolumn(product#261, product, false)) > cast(2 as bigint))                                                                                                                                                                                                                                                            +- Project [product#269, region#270, s#264L]
!   +- Aggregate [product#269, region#270, spark_grouping_id#268L], [product#269, region#270, sum(amount#263) AS s#264L]                                                                                                                                                                                                                                         +- Filter (count(product)#272L > cast(2 as bigint))
!      +- Expand [[product#261, region#262, amount#263, product#266, region#267, 0], [product#261, region#262, amount#263, product#266, null, 1], [product#261, region#262, amount#263, null, region#267, 2], [product#261, region#262, amount#263, null, null, 3]], [product#261, region#262, amount#263, product#269, region#270, spark_grouping_id#268L]         +- Aggregate [product#269, region#270, spark_grouping_id#268L], [product#269, region#270, sum(amount#263) AS s#264L, count(product#261) AS count(product)#272L]
!         +- Project [product#261, region#262, amount#263, product#261 AS product#266, region#262 AS region#267]                                                                                                                                                                                                                                                       +- Expand [[product#261, region#262, amount#263, product#266, region#267, 0], [product#261, region#262, amount#263, product#266, null, 1], [product#261, region#262, amount#263, null, region#267, 2], [product#261, region#262, amount#263, null, null, 3]], [product#261, region#262, amount#263, product#269, region#270, spark_grouping_id#268L]
!            +- SubqueryAlias t                                                                                                                                                                                                                                                                                                                                           +- Project [product#261, region#262, amount#263, product#261 AS product#266, region#262 AS region#267]
!               +- LocalRelation [product#261, region#262, amount#263]                                                                                                                                                                                                                                                                                                       +- SubqueryAlias t
!                                                                                                                                                                                                                                                                                                                                                                               +- LocalRelation [product#261, region#262, amount#263]
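
To illustrate why the aggregate in the HAVING condition has to read the original column (product#261) and not the post-Expand grouping column (product#269), here is a minimal pure-Python sketch of the CUBE semantics. It is not Spark code. The sample rows, the `cube_having` helper, and its `count_original_column` flag are hypothetical, and the sketch assumes the incorrect behaviour amounts to counting the grouping column that Expand sets to null in subtotal rows. The query shape (`GROUP BY CUBE(product, region) HAVING count(product) > 2 ORDER BY s DESC`) is reconstructed from the plans above.

```python
# Hypothetical sample rows (product, region, amount); not taken from the PR's test.
rows = [
    ("a", "east", 10),
    ("a", "west", 20),
    ("a", "east", 30),
    ("b", "east", 5),
]


def cube_having(rows, count_original_column):
    """Emulate GROUP BY CUBE(product, region) HAVING count(product) > 2."""
    results = []
    # CUBE yields one grouping set per subset of the grouping columns.
    for keep_product in (True, False):
        for keep_region in (True, False):
            groups = {}
            for product, region, amount in rows:
                # Like Expand: grouping columns outside the set become NULL (None),
                # while the original input columns are passed through unchanged.
                g_product = product if keep_product else None
                g_region = region if keep_region else None
                groups.setdefault((g_product, g_region), []).append(
                    (product, g_product, amount)
                )
            for (g_product, g_region), members in groups.items():
                s = sum(amount for _, _, amount in members)
                if count_original_column:
                    # count(product#261): counts the original, non-null values.
                    cnt = sum(1 for p, _, _ in members if p is not None)
                else:
                    # Assumed faulty variant: counts the nulled grouping column.
                    cnt = sum(1 for _, gp, _ in members if gp is not None)
                if cnt > 2:
                    results.append((g_product, g_region, s))
    # ORDER BY s DESC
    return sorted(results, key=lambda r: r[2], reverse=True)


print(cube_having(rows, count_original_column=True))
print(cube_having(rows, count_original_column=False))
```

With these rows, counting the original column keeps the grand total and the per-region subtotal for "east", whereas counting the nulled grouping column silently drops both, because count ignores nulls.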

Why are the changes needed?

Fix a correctness issue described in #51810.

Does this PR introduce any user-facing change?

Yes, it fixes a correctness issue.

How was this patch tested?

Added new UT from #51810.

Was this patch authored or co-authored using generative AI tooling?

No.

peter-toth · 2025-08-04T21:01:45Z

cc @cloud-fan

dongjoon-hyun · 2025-08-04T21:17:59Z

sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

    }
  }
+
+  test("SPARK-53094: Fix cube-related data quality problem") {


Is this the original author's contribution?

Yes, this testcase is from @harris233's #51810, I just cleaned it up a bit.

violet-nspct · 2025-08-04T22:44:13Z

Maybe add tests for the following edge case?

Partially resolved conditions. Critical for ensuring correct resolution order.

dongjoon-hyun

+1, LGTM.

peter-toth · 2025-08-05T18:38:26Z

Thanks @dongjoon-hyun, @cloud-fan for the review!

Merged to master (4.1.0).

As this is a correctness issue I will open backport PRs to 4.0.x and 3.5.x.

Closes apache#51820 from peter-toth/SPARK-53094-fix-cube-having.

Lead-authored-by: Peter Toth
Co-authored-by: harris233
Signed-off-by: Peter Toth

peter-toth · 2025-08-05T18:55:13Z

Backport PRs: #51854 and #51855.

github-actions bot added the SQL label Aug 4, 2025

peter-toth mentioned this pull request Aug 4, 2025

[SPARK-53094][SQL] Fix cube-related data quality problem #51810

Closed

dongjoon-hyun reviewed Aug 4, 2025

cloud-fan reviewed Aug 5, 2025

cloud-fan approved these changes Aug 5, 2025

harris233 and others added 4 commits August 5, 2025 16:27

Fix cube-related data quality problem

b53f426

defer UnresolvedHaving resolution if it has unresolved aggregate functions

4b4fc2b

simplify test

4b956be

review fix

54ccd0f

peter-toth force-pushed the SPARK-53094-fix-cube-having branch from 4a68e79 to 54ccd0f August 5, 2025 14:28

dongjoon-hyun approved these changes Aug 5, 2025

peter-toth closed this in 3f0c450 Aug 5, 2025
