[SPARK-53074][SQL] Avoid partial clustering in SPJ to meet a child's required distribution #51818

chirag-s-db · 2025-08-04T18:40:15Z

What changes were proposed in this pull request?

Currently, SPJ logic can apply partial clustering (when enabled) to either side of an inner JOIN as long as the nodes between the scan and JOIN preserve partitioning. This doesn't work if one of these nodes is using the scan's key-grouped partitioning to satisfy its required distribution (for example, a grouping agg or window function).

This PR fixes the issue by not applying a partially clustered distribution to a JOIN's child if any node in that child relies on the KeyGroupedPartitioning to satisfy its required distribution, since a partially clustered distribution cannot safely meet that requirement.
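To illustrate the failure mode, here is a minimal sketch in plain Scala with no Spark dependency. It is a toy model, not Spark's implementation: `partiallyCluster`, `firstRowPerKey`, and the sample data are hypothetical names introduced for illustration, and the splitting rule (chunks of at most `maxRows` rows) only approximates how a large key-grouped partition might be spread over several tasks. The per-task computation mimics `ROW_NUMBER() OVER (PARTITION BY key ORDER BY value)` filtered to `RN = 1`.

```scala
object PartialClusteringWindowSketch {
  // A row is (partition key, value); a partition is the set of rows one task sees.
  type Row = (Int, String)
  type Partition = Seq[Row]

  // Key-grouped layout: all rows for a given key live in exactly one partition.
  val keyGrouped: Seq[Partition] = Seq(
    Seq((1, "a"), (1, "b"), (1, "c"), (1, "d")),
    Seq((2, "e"))
  )

  // Hypothetical stand-in for partial clustering: a partition with more than
  // maxRows rows is split, so a single key can span several partitions.
  def partiallyCluster(parts: Seq[Partition], maxRows: Int): Seq[Partition] =
    parts.flatMap(_.grouped(maxRows))

  // Per-task equivalent of keeping RN = 1 for each key: each task only sees
  // its own partition, so it picks the smallest value per key locally.
  def firstRowPerKey(parts: Seq[Partition]): Seq[Row] =
    parts.flatMap(p => p.groupBy(_._1).values.map(_.minBy(_._2)))

  def main(args: Array[String]): Unit = {
    // One row per key, as the window function requires.
    println(firstRowPerKey(keyGrouped).sorted)
    // Key 1 is split across two tasks, so it yields two "first" rows.
    println(firstRowPerKey(partiallyCluster(keyGrouped, maxRows = 2)).sorted)
  }
}
```

In this model the key-grouped input produces one row per key, while the split input produces an extra row for key 1, because the node's required distribution (all rows for a key in one partition) no longer holds.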

Why are the changes needed?

Without this fix, using a partially-clustered distribution with SPJ may cause correctness issues.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

See test changes.

Was this patch authored or co-authored using generative AI tooling?

No.

chirag-s-db · 2025-08-04T18:41:26Z

@szehon-ho @sunchao Could you take a look at this PR? Thanks!

violetnspct · 2025-08-05T05:00:51Z

@chirag-s-db Should tests be added to cover the following scenarios?

Aggregate operations requiring key-grouped partitioning: a test that verifies behavior with a GROUP BY, since aggregates also require specific partitioning that must be preserved.
Edge cases:
- Multiple unary nodes in the plan chain, to verify behavior when multiple nodes have distribution requirements.
- Mixed nodes with and without distribution requirements, to verify that the partial clustering decision is correct when only some nodes require a specific distribution.

szehon-ho · 2025-08-05T00:08:09Z

sql/core/src/test/scala/org/apache/spark/sql/connector/KeyGroupedPartitioningSuite.scala

+             |WHERE p.RN = 1
+             |""".stripMargin)
+        val shuffles = collectShuffles(df.queryExecution.executedPlan)
+        assert(shuffles.isEmpty, "should not contain any shuffle")


Should we check the number of partitions to make sure that partially clustered distribution replication did not kick in?

Good point, done.

szehon-ho · 2025-08-05T00:08:31Z

sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala

@@ -490,9 +502,15 @@ case class EnsureRequirements(
          // whether partially clustered distribution can be applied. For instance, the
          // optimization cannot be applied to a left outer join, where the left hand
          // side is chosen as the side to replicate partitions according to stats.
+          // Similarly, the partially clustered distribution cannot be applied if the


nit: can we add a comment about side-effects, like select row_number()

Added expansion to this comment.

szehon-ho · 2025-08-05T00:10:45Z

sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala

+   */
+  private def canApplyPartialClusteredDistribution(plan: SparkPlan): Boolean = {
+    !plan.exists {
+      case u: UnaryExecNode => u.requiredChildDistribution.head != UnspecifiedDistribution


sorry just some question here:

unary node is just to make sure we exclude window functions, and not joins? Should we check window function node specifically, if that's the goal? Just trying to understand the reason why unary node in particular here.

Is it a better check to do something like:
u.child.outputPartitioning.satisfies(u.requiredChildDistribution.head)?

> unary node is just to make sure we exclude window functions, and not joins? Should we check window function node specifically, if that's the goal? Just trying to understand the reason why unary node in particular here.

We don't want to restrict this only to window functions - any exec node that has a specified required distribution applies (including windows, grouping aggregates, etc.). Checking for unary nodes just generally makes this easier to reason about (since it guarantees that there's only a single required child distribution to check).

Note that we also exclude any non-unary nodes (see the line below) like JOINs. It's not safe to apply this to a JOIN because, when applying the partially clustered distribution, all scans on the partially clustered side are marked as partially clustered, which would produce incorrect results from the lower JOIN. For example, suppose we had three partitioned tables A, B, and C joined as A JOIN (B JOIN C) (assuming all keys line up with the partitioning, etc.). It would be safe to apply a partially clustered distribution to A (since the tasks on B JOIN C can simply be replicated), but it would not be safe to apply a partially clustered distribution to the right side (since the results of B JOIN C would be incorrect).
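A small plain-Scala sketch of one way to read the nested JOIN example: if both scans under the lower JOIN are split as partially clustered, matching rows for the same key can land in different tasks and never meet. This is a toy model under that assumption, not Spark code; `joinPairwise`, `split`, and the data are hypothetical.

```scala
object NestedJoinSketch {
  type Row = (Int, String)
  type Partition = Seq[Row]

  // Join partition i of the left side with partition i of the right side,
  // the way a shuffle-free join pairs up co-located partitions.
  def joinPairwise(left: Seq[Partition], right: Seq[Partition]): Seq[(Int, String, String)] =
    left.zip(right).flatMap { case (l, r) =>
      for {
        (lk, lv) <- l
        (rk, rv) <- r
        if lk == rk
      } yield (lk, lv, rv)
    }

  // Hypothetical partial clustering: one row per task.
  def split(parts: Seq[Partition]): Seq[Partition] = parts.flatMap(_.grouped(1))

  // B and C each hold two rows for key 1, in a single key-grouped partition.
  val b: Seq[Partition] = Seq(Seq((1, "b1"), (1, "b2")))
  val c: Seq[Partition] = Seq(Seq((1, "c1"), (1, "c2")))

  def main(args: Array[String]): Unit = {
    // Key-grouped on both sides: all four (b, c) combinations for key 1.
    println(joinPairwise(b, c))
    // Both sides split: only b1-c1 and b2-c2 are produced, two rows are lost.
    println(joinPairwise(split(b), split(c)))
  }
}
```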

> Is it a better check to do something like:
> u.child.outputPartitioning.satisfies(u.requiredChildDistribution.head)?

I don't think this would work - the KeyGroupedPartitioning would still (on paper) return that it satisfies the required distribution, even if the distribution actually will be partially clustered. In addition, we haven't yet actually applied the partially clustered distribution here, so the output partitioning is still the original KeyGroupedPartitioning (without any of the pushed down spjParams applied).
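The rule discussed in this thread can be sketched on a toy plan tree in plain Scala. This is not the code in `EnsureRequirements`: the `Plan`, `Scan`, `Unary`, and `Join` types and the `requiresDistribution` flag are hypothetical stand-ins for `SparkPlan`, leaf scans, `UnaryExecNode`, and a `requiredChildDistribution` other than `UnspecifiedDistribution`. Treating leaf scans as safe is an assumption; the thread only states the unary case and that non-unary nodes such as JOINs are excluded.

```scala
object PartialClusteringCheckSketch {
  sealed trait Plan { def children: Seq[Plan] }
  case class Scan(table: String) extends Plan {
    val children: Seq[Plan] = Nil
  }
  // requiresDistribution = true models a node such as a window or grouping
  // aggregate that needs a specific child distribution.
  case class Unary(name: String, requiresDistribution: Boolean, child: Plan) extends Plan {
    val children: Seq[Plan] = Seq(child)
  }
  case class Join(left: Plan, right: Plan) extends Plan {
    val children: Seq[Plan] = Seq(left, right)
  }

  // True if the predicate holds for this node or any descendant.
  def exists(plan: Plan)(p: Plan => Boolean): Boolean =
    p(plan) || plan.children.exists(child => exists(child)(p))

  // Partial clustering is allowed only if no node in the subtree blocks it.
  def canApplyPartialClustering(plan: Plan): Boolean =
    !exists(plan) {
      case u: Unary => u.requiresDistribution // relies on key-grouped input
      case _: Scan  => false                  // assumption: scans never block
      case _        => true                   // e.g. a nested JOIN blocks it
    }

  def main(args: Array[String]): Unit = {
    val scan = Scan("a")
    println(canApplyPartialClustering(Unary("Project", requiresDistribution = false, scan)))
    println(canApplyPartialClustering(Unary("Window", requiresDistribution = true, scan)))
    println(canApplyPartialClustering(Join(Scan("b"), Scan("c"))))
  }
}
```

Because the check walks the whole subtree, a single node with a distribution requirement anywhere between the scan and the JOIN is enough to disable partial clustering for that side.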

fix

557efea

github-actions bot added the SQL label Aug 4, 2025

remove irrelevant test

0f9f0cb

HyukjinKwon changed the title ~~[SPARK-53074] Avoid partial clustering in SPJ to meet a child's required distribution~~ [SPARK-53074][SQL] Avoid partial clustering in SPJ to meet a child's required distribution Aug 4, 2025

szehon-ho reviewed Aug 5, 2025

chirag-s-db added 2 commits August 5, 2025 12:45

test fix

6bb9ab7

comment

0797d99

chirag-s-db requested a review from szehon-ho August 5, 2025 19:46
