[SPARK-53272][SQL] Refactor SPJ pushdown logic out of BatchScanExec #51979

Open: wants to merge 2 commits into base: master


chirag-s-db
Contributor

What changes were proposed in this pull request?

SPJ (storage-partitioned join) logic is currently tightly coupled to the DSV2-specific BatchScanExec physical node, which makes it difficult for connectors to take advantage of SPJ for other types of scans. This PR refactors the SPJ-specific logic out of BatchScanExec and exposes a parameterized base class for connectors to use. The base class requires a partition value accessor: a function that maps the parameterized type to an InternalRow.
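
To make the shape of the proposal concrete, here is a minimal, Spark-free sketch of a base class parameterized on a connector's partition type, where the connector supplies only a partition value accessor. Every name in it (`KeyGroupedScanSketch`, `FileSplit`, `FileScanSketch`, `groupedPartitions`) is hypothetical and not taken from this PR, and a plain `Seq[Any]` stands in for `InternalRow`.

```scala
object SpjAccessorSketch {
  // Stand-in for Spark's InternalRow: a plain sequence of partition values.
  type Row = Seq[Any]

  // Hypothetical base class, parameterized on the connector's partition type T.
  // The connector only says how to read a partition value out of a T.
  abstract class KeyGroupedScanSketch[T](partitionValueAccessor: T => Row) {
    def inputPartitions: Seq[T]

    // Group partitions that share a partition value, keeping first-seen key order.
    def groupedPartitions: Seq[(Row, Seq[T])] = {
      val keysInOrder = inputPartitions.map(partitionValueAccessor).distinct
      val byKey = inputPartitions.groupBy(partitionValueAccessor)
      keysInOrder.map(key => key -> byKey(key))
    }
  }

  // A made-up connector partition type, for illustration only.
  final case class FileSplit(path: String, day: String)

  // A made-up scan that reuses the grouping logic by passing an accessor.
  final class FileScanSketch(val inputPartitions: Seq[FileSplit])
    extends KeyGroupedScanSketch[FileSplit](split => Seq(split.day))
}
```

The point of the design, as described above, is that the grouping logic never inspects `T` directly; it only goes through the accessor, so any scan type that can produce a partition value per partition could reuse it.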

Why are the changes needed?

Allow connectors to take advantage of SPJ on existing scans.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pure refactor - existing tests should suffice.

Was this patch authored or co-authored using generative AI tooling?

No.

github-actions bot added the SQL label on Aug 11, 2025
Contributor

rahulsmahadev left a comment

LGTM! Looks clean; the only material change is that the logic is now encapsulated in partitionValueAccessor.

    }
  case None =>
    spjParams.joinKeyPositions match {
      case Some(projectionPositions) => basePartitioning.partitionValues.map{r =>
Contributor

Suggested change
- case Some(projectionPositions) => basePartitioning.partitionValues.map{r =>
+ case Some(projectionPositions) => basePartitioning.partitionValues.map { r =>

Contributor Author

Fixed.

 */
def getInputPartitionGrouping(
    p: KeyGroupedPartitioning,
    spjParams: StoragePartitionJoinParams,
Contributor

Just out of curiosity: what's the relationship between p.expressions and spjParams.keyGroupedPartitioning?

Contributor Author

p.expressions already reflects the join-key reordering of the expressions (ref), while spjParams.keyGroupedPartitioning holds the partitioning expressions in their original order. That is why they must be reordered here when join key positions are present.
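
As a rough illustration of the reordering described in this reply, the sketch below projects a sequence onto a list of join key positions and falls back to the original order when no positions are given. It is an assumption-laden stand-in, not the code in this PR: `project` and `reorder` are hypothetical helpers, and strings take the place of Catalyst expressions.

```scala
object JoinKeyReorderSketch {
  // Pick the entries at the given positions, in the order the positions are listed.
  def project[A](values: Seq[A], positions: Seq[Int]): Seq[A] =
    positions.map(i => values(i))

  // With join key positions present, apply the projection;
  // otherwise keep the original ordering untouched.
  def reorder[A](original: Seq[A], joinKeyPositions: Option[Seq[Int]]): Seq[A] =
    joinKeyPositions match {
      case Some(positions) => project(original, positions)
      case None => original
    }
}
```

For example, `JoinKeyReorderSketch.reorder(Seq("years(ts)", "bucket(8, id)", "region"), Some(Seq(2, 0)))` yields `Seq("region", "years(ts)")`: the input plays the role of the original-order expressions, and the result plays the role of the reordered ones.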

 */
def getOutputKeyGroupedPartitioning(
    basePartitioning: KeyGroupedPartitioning,
    spjParams: StoragePartitionJoinParams): KeyGroupedPartitioning = {
Contributor

Can we move StoragePartitionJoinParams to its own file instead of keeping it in BatchScanExec.scala?

Contributor Author

Sure, done.

@chirag-s-db
Contributor Author

chirag-s-db commented Aug 12, 2025

Also cc: @sunchao and @szehon-ho for visibility

chirag-s-db requested a review from cloud-fan on August 12, 2025, 15:01
@HyukjinKwon
Member

Let's file a JIRA and add it to the PR title.

chirag-s-db changed the title from "[SQL] Refactor SPJ pushdown logic out of BatchScanExec" to "[SPARK-53272][SQL] Refactor SPJ pushdown logic out of BatchScanExec" on Aug 13, 2025
@cloud-fan
Contributor

@chirag-s-db can you follow the instructions and set up GitHub Actions for your fork? https://github.com/apache/spark/pull/51979/checks?check_run_id=47918325385

@chirag-s-db
Contributor Author

@cloud-fan Checks should be running now; I had to rebase onto the latest master.

Member

sunchao left a comment

LGTM, thanks @chirag-s-db!
