Spark: Faster net changelogs using identifier columns #14293

1raghavmahajan · 2025-10-10T14:12:20Z

Implementation

The implementation is similar to what was initially proposed:

Repartition by identifier_columns and sort within partition by identifier_columns + change_ordinal
Apply RemoveCarryoverIterator.

Note: The steps above match the existing net_changes path (without identifier columns), except that the repartition spec is simpler.
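
As a rough illustration of these two steps, here is a minimal pure-Python sketch (not the Spark code in this PR). It assumes a single identifier column `id` and treats a carryover as a DELETE/INSERT pair with identical data in the same change ordinal. The `Change` tuple and `remove_carryovers` helper are hypothetical names; the real RemoveCarryoverIterator works on Spark rows inside sorted partitions.

```python
from typing import Iterable, List, NamedTuple


class Change(NamedTuple):
    # Hypothetical row shape: one identifier column plus one data column.
    id: int
    data: str
    change_type: str      # "INSERT" or "DELETE"
    change_ordinal: int


def remove_carryovers(changes: Iterable[Change]) -> List[Change]:
    # Stand-in for "repartition by identifier, sort within partition":
    # order by identifier, then change ordinal; "DELETE" sorts before "INSERT".
    ordered = sorted(changes, key=lambda c: (c.id, c.change_ordinal, c.change_type))
    out: List[Change] = []
    i = 0
    while i < len(ordered):
        cur = ordered[i]
        nxt = ordered[i + 1] if i + 1 < len(ordered) else None
        is_carryover = (
            nxt is not None
            and cur.change_type == "DELETE"
            and nxt.change_type == "INSERT"
            and cur.id == nxt.id
            and cur.data == nxt.data
            and cur.change_ordinal == nxt.change_ordinal
        )
        if is_carryover:
            i += 2  # drop both halves of the unchanged row
            continue
        out.append(cur)
        i += 1
    return out
```

In this sketch a row that was rewritten unchanged disappears, while a genuine update (same identifier, different data) keeps both its DELETE and its INSERT.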

Use window functions to identify first and last changes for each logical row
Filter to keep only first and last changes (as per change_ordinal) for each logical row

Note: The steps above perform the netting of the changes. We discard every change except those at the first and last change ordinal, which is cheaper than iterating through them all. The existing net_changes cannot leverage this because it does not have a consistent set of identifier columns across the entire snapshot range, so it needs to iterate through all changes to build the lineage.
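
The first/last filter can be pictured with a small pure-Python sketch. It is only an approximation of what the window functions do: it assumes a single identifier column `id`, and `Change` and `keep_first_and_last` are hypothetical names, not code from this PR.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Change(NamedTuple):
    # Hypothetical row shape: one identifier column plus one data column.
    id: int
    data: str
    change_type: str      # "INSERT" or "DELETE"
    change_ordinal: int


def keep_first_and_last(changes: Iterable[Change]) -> List[Change]:
    rows = list(changes)
    # Equivalent of min/max(change_ordinal) over a window partitioned by id.
    bounds: Dict[int, Tuple[int, int]] = {}
    for c in rows:
        lo, hi = bounds.get(c.id, (c.change_ordinal, c.change_ordinal))
        bounds[c.id] = (min(lo, c.change_ordinal), max(hi, c.change_ordinal))
    # Keep only rows sitting at the first or last ordinal for their identifier.
    return [c for c in rows if c.change_ordinal in bounds[c.id]]
```

Intermediate ordinals are dropped without being inspected, which is where the saving over a full lineage walk comes from.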

Remove INSERT-DELETE (no-op) pairs using an iterator.
Calculate pre/post images using first DELETE - last INSERT pairs.

Note: The steps above are similar to the existing ComputeUpdateIterator. Here we need to handle multiple INSERT/DELETE entries (as the intermediate changes aren't present).
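
One possible reading of these last two steps, as a pure-Python sketch. The rules below are an interpretation of the description, not the PR's ComputeNetUpdateIterator or RemoveNoopPairIterator: a leading INSERT followed by a trailing DELETE is treated as a no-op, and a leading DELETE followed by a trailing INSERT becomes an UPDATE_BEFORE/UPDATE_AFTER pair. `Change` and `net_changes` are hypothetical names, and a single identifier column `id` is assumed.

```python
from itertools import groupby
from typing import Iterable, List, NamedTuple


class Change(NamedTuple):
    # Hypothetical row shape: one identifier column plus one data column.
    id: int
    data: str
    change_type: str      # "INSERT" or "DELETE" on input
    change_ordinal: int


def net_changes(changes: Iterable[Change]) -> List[Change]:
    # Within one ordinal, "DELETE" sorts before "INSERT", so the first row of a
    # group is the earliest DELETE if one exists, and the last row is the
    # latest INSERT if one exists.
    ordered = sorted(changes, key=lambda c: (c.id, c.change_ordinal, c.change_type))
    out: List[Change] = []
    for _, group in groupby(ordered, key=lambda c: c.id):
        rows = list(group)
        first, last = rows[0], rows[-1]
        if first.change_type == "INSERT" and last.change_type == "DELETE":
            continue  # row absent before and after the range: no-op
        if first.change_type == "DELETE" and last.change_type == "INSERT":
            # Assumption: identical pre/post images are still emitted here.
            out.append(first._replace(change_type="UPDATE_BEFORE"))
            out.append(last._replace(change_type="UPDATE_AFTER"))
        elif first.change_type == "DELETE":
            out.append(first)  # net delete: report the original pre-image
        else:
            out.append(last)   # net insert: report the final post-image
    return out
```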

Testing

Added iterator tests for ComputeNetUpdateIterator and RemoveNoopPairIterator
Updated integration tests for CreateChangelogViewProcedure

1raghavmahajan · 2025-10-10T14:40:04Z

cc @flyrain @szehon-ho @aokolnychyi @RussellSpitzer

github-actions bot added the spark label Oct 10, 2025

1raghavmahajan force-pushed the feature/changelog-compute-net-update branch 2 times, most recently from 785704e to f89a10c Compare October 13, 2025 13:45

1raghavmahajan mentioned this pull request Oct 14, 2025

Spark: Refactor to use ArrayUtils #14291

Merged

Net changelog view using identifier fields

916cca5

1raghavmahajan force-pushed the feature/changelog-compute-net-update branch from f89a10c to 916cca5 Compare October 14, 2025 15:32
