1raghavmahajan commented on Oct 10, 2025

Closes #14249

Implementation

The implementation is similar to what was initially proposed:

  1. Repartition by identifier_columns and sort within partition by identifier_columns + change_ordinal
  2. Apply RemoveCarryoverIterator.

Note: The steps above are the same as in net_changes without identifier columns, but with a simpler repartition spec.
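To make the first two steps concrete, here is a minimal plain-Python sketch, not the Spark code in this PR. It assumes each change is a dict with `change_type` and `change_ordinal` keys plus data columns, and that a carry-over is a DELETE/INSERT pair with identical values in the same change ordinal. The function name `remove_carryovers` and the helper `_signature` are hypothetical and only loosely mirror what RemoveCarryoverIterator does.

```python
from collections import Counter, defaultdict


def _signature(row):
    # Everything except change_type; change_ordinal stays in, so only
    # DELETE/INSERT pairs from the same ordinal can cancel each other.
    return tuple(sorted((k, v) for k, v in row.items() if k != "change_type"))


def remove_carryovers(rows, identifier_columns):
    """Group by identifier columns, sort by change_ordinal, drop carry-over pairs."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[tuple(row[c] for c in identifier_columns)].append(row)

    kept = []
    for key in sorted(partitions):
        # Sort within the partition; Python's sort is stable, so rows with
        # the same ordinal keep their input order.
        ordered = sorted(partitions[key], key=lambda r: r["change_ordinal"])
        counts = Counter((_signature(r), r["change_type"]) for r in ordered)

        # For each signature, cancel as many DELETE/INSERT pairs as possible.
        to_drop = {}
        for sig in {sig for sig, _ in counts}:
            pairs = min(counts[(sig, "DELETE")], counts[(sig, "INSERT")])
            to_drop[(sig, "DELETE")] = pairs
            to_drop[(sig, "INSERT")] = pairs

        for row in ordered:
            slot = (_signature(row), row["change_type"])
            if to_drop.get(slot, 0) > 0:
                to_drop[slot] -= 1  # this row is half of a carry-over pair
            else:
                kept.append(row)
    return kept
```

In this sketch the identifier columns alone decide the partition, which is the "simpler repartition spec" the note refers to.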

  1. Use window functions to identify first and last changes for each logical row
  2. Filter to keep only first and last changes (as per change_ordinal) for each logical row

Note: The steps above perform the netting of the changes. For each logical row we discard every change except those at the first and last change ordinal, which is cheaper than iterating through all of them. The existing net_changes cannot leverage this: without a consistent set of identifier columns across the entire snapshot range, it has to iterate through all changes to build the lineage.
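The first/last filter can be illustrated with a small plain-Python sketch. It stands in for the window functions: in Spark this would roughly correspond to a min and max of `change_ordinal` over a window partitioned by the identifier columns, but the code below is only an illustration under that assumption, and the name `keep_first_and_last` is hypothetical.

```python
def keep_first_and_last(rows, identifier_columns):
    """Keep only the changes at the first and last change_ordinal of each logical row."""

    def key_of(row):
        return tuple(row[c] for c in identifier_columns)

    # First pass: min and max change_ordinal per logical row
    # (the role played by the window functions).
    bounds = {}
    for row in rows:
        ordinal = row["change_ordinal"]
        low, high = bounds.get(key_of(row), (ordinal, ordinal))
        bounds[key_of(row)] = (min(low, ordinal), max(high, ordinal))

    # Second pass: drop every intermediate change.
    return [row for row in rows if row["change_ordinal"] in bounds[key_of(row)]]
```

A logical row with changes at ordinals 0, 1, 2 and 3 keeps only those at 0 and 3, regardless of how many intermediate changes exist.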

  1. Remove INSERT-DELETE (no-op) pairs using an iterator.
  2. Calculate pre/post images using first DELETE - last INSERT pairs.

Note: The steps above are similar to the existing ComputeUpdateIterator. Here we need to handle multiple INSERT/DELETE entries (as the intermediate changes aren't present).
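One way to read the last two steps, sketched in plain Python for a single logical row. This is an interpretation, not the iterator in this PR. It assumes a DELETE at the first change ordinal means the row existed before the range, and an INSERT at the last change ordinal means it exists after the range. The function name `net_change` is hypothetical; the output change types follow the changelog's INSERT, DELETE, UPDATE_BEFORE and UPDATE_AFTER.

```python
def net_change(rows):
    """Net the changes of one logical row, already reduced to its first and last ordinals."""
    if not rows:
        return []
    first = min(r["change_ordinal"] for r in rows)
    last = max(r["change_ordinal"] for r in rows)

    # Pre-image: a DELETE at the first ordinal (the row existed before the range).
    pre = next((r for r in rows
                if r["change_ordinal"] == first and r["change_type"] == "DELETE"), None)
    # Post-image: an INSERT at the last ordinal (the row exists after the range).
    post = next((r for r in rows
                 if r["change_ordinal"] == last and r["change_type"] == "INSERT"), None)

    if pre is None and post is None:
        return []  # INSERT followed by DELETE: a no-op pair
    if pre is None:
        return [{**post, "change_type": "INSERT"}]
    if post is None:
        return [{**pre, "change_type": "DELETE"}]
    return [{**pre, "change_type": "UPDATE_BEFORE"},
            {**post, "change_type": "UPDATE_AFTER"}]
```

Because the first and last ordinals can each carry both a DELETE and an INSERT, the sketch picks the DELETE from the first and the INSERT from the last rather than assuming one entry per ordinal.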

Testing

github-actions bot added the spark label on Oct 10, 2025
1raghavmahajan force-pushed the feature/changelog-compute-net-update branch 2 times, most recently from 785704e to f89a10c, on October 13, 2025 13:45
1raghavmahajan force-pushed the feature/changelog-compute-net-update branch from f89a10c to 916cca5 on October 14, 2025 15:32
Development

Successfully merging this pull request may close these issues.

Feature: Optimize net_changes changelog view by leveraging Identifier columns