Spark: enable stream-results option for remove orphan files #14278
Conversation
Hi @RussellSpitzer @pvary @liziyan-lzy @huaxingao, if any of you have time to help review this, that would be greatly appreciated!

@arifazmidd: Could you fix the test please?

Thanks for running the CI @pvary; I have fixed the formatting issues.
+ "See that IO's documentation to learn how to adjust parallelism for that particular " | ||
+ "IO's bulk delete.", | ||
"max_concurrent_deletes only works with FileIOs that do not support bulk deletes." | ||
+ " Thistable is currently using {} which supports bulk deletes so the" |
nit space
Why did we change this?
The space wasn't there before, but I can add it. When I ran spotlessApply locally, it updated this string concatenation.
Could you please share the command you used? When I run:
./gradlew spotlessApply -DallModules=true
I don't see any changes.
+ "at the same time. If you are absolutely confident that no concurrent operations will be " | ||
+ "affected by removing orphan files with such a short interval, you can use the Action API " | ||
+ "to remove orphan files with an arbitrary interval."); | ||
"Cannot remove orphan files with an interval less than 24 hours. Executing this procedure" |
Why did we change this?
Same as above: the content of the string is exactly the same, but running spotlessApply locally updated the string concatenation.
LOG.warn(
    "Deleted only {} of {} files using bulk deletes", deletedFilesCount, paths.size(), e);
}
private boolean streamResults() { |
Is this worth its own method? It's only used once.
I was keeping it consistent with how it's already set up for ExpireSnapshotsSparkAction:
Lines 223 to 233 in e34ec24
private ExpireSnapshots.Result doExecute() {
  if (streamResults()) {
    return deleteFiles(expireFiles().toLocalIterator());
  } else {
    return deleteFiles(expireFiles().collectAsList().iterator());
  }
}

private boolean streamResults() {
  return PropertyUtil.propertyAsBoolean(options(), STREAM_RESULTS, STREAM_RESULTS_DEFAULT);
}
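For context, a minimal sketch of how the same pattern could look in DeleteOrphanFilesSparkAction; findOrphanFiles() and the result wiring are assumed names for illustration, not the actual code in this PR:

// Hypothetical sketch mirroring the ExpireSnapshotsSparkAction pattern above.
// findOrphanFiles() is an assumed helper producing the orphan file paths.
private DeleteOrphanFiles.Result doExecute() {
  Dataset<String> orphanFiles = findOrphanFiles();
  if (streamResults()) {
    // Stream partitions to the driver one at a time instead of collecting all paths
    deleteFilesStreaming(orphanFiles.toLocalIterator());
  } else {
    deleteFilesCollected(orphanFiles.collectAsList());
  }
  return buildResult(); // assumed result assembly
}

private boolean streamResults() {
  return PropertyUtil.propertyAsBoolean(options(), STREAM_RESULTS, STREAM_RESULTS_DEFAULT);
}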
There are some places where we handle this differently, like:
iceberg/core/src/main/java/org/apache/iceberg/RemoveSnapshots.java
Lines 90 to 92 in f7e6a27
long defaultMaxSnapshotAgeMs =
    PropertyUtil.propertyAsLong(
        base.properties(), MAX_SNAPSHOT_AGE_MS, MAX_SNAPSHOT_AGE_MS_DEFAULT);
I see your point. It still seems a bit strange to me, but I don't have a strong opinion on this.
 * @param orphanFiles list of file paths to delete (already in driver memory)
 */
private void deleteFilesCollected(List<String> orphanFiles) {
  if (deleteFunc == null && table.io() instanceof SupportsBulkOperations) {
Is this change for another feature? I think this should be an independent PR/change where we start using bulk delete when the IO supports bulk operations. This has nothing to do with streaming the results.
Bulk delete support already existed in the original implementation. This is not new functionality; rather, it's a refactoring into separate functions for streaming vs. non-streaming deletion.
Current implementation supporting bulk delete:
Line 237 in e34ec24
private void deleteFiles(SupportsBulkOperations io, List<String> paths) {
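For reference, a minimal sketch of that bulk-delete branch, consistent with the LOG.warn line quoted earlier; variable names are illustrative:

// Sketch of the pre-existing bulk-delete path. SupportsBulkOperations.deleteFiles
// and BulkDeletionFailureException are existing Iceberg io APIs.
private void deleteFiles(SupportsBulkOperations io, List<String> paths) {
  try {
    io.deleteFiles(paths); // one bulk call instead of one delete per path
    LOG.info("Deleted {} files using bulk deletes", paths.size());
  } catch (BulkDeletionFailureException e) {
    int deletedFilesCount = paths.size() - e.numberFailedObjects();
    LOG.warn(
        "Deleted only {} of {} files using bulk deletes", deletedFilesCount, paths.size(), e);
  }
}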
Oh, you inlined the method.
I see that the logic is now duplicated. Is there a way to reuse the code? Maybe by reusing deleteFilesCollected inside deleteFilesStreaming?
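One possible shape for that reuse, assuming a batch-size constant (the name and value here are hypothetical):

// Hypothetical reuse: batch the streamed iterator and delegate each batch to
// deleteFilesCollected, so both paths share a single delete implementation.
private static final int STREAM_DELETE_BATCH_SIZE = 100_000; // assumed value

private void deleteFilesStreaming(Iterator<String> orphanFiles) {
  List<String> batch = Lists.newArrayList();
  while (orphanFiles.hasNext()) {
    batch.add(orphanFiles.next());
    if (batch.size() >= STREAM_DELETE_BATCH_SIZE) {
      deleteFilesCollected(batch); // reuse the non-streaming delete logic per batch
      batch.clear();
    }
  }
  if (!batch.isEmpty()) {
    deleteFilesCollected(batch);
  }
}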
assumeThat(usePrefixListing)
    .as(
        "This test verifies default listing behavior and does not require prefix listing to be enabled.")
"This test verifies default listing behavior and does not require prefix listing to be"
Why this change?
Same as above: the content of the string is exactly the same, but running spotlessApply locally updated the string concatenation.
// Collect sample paths before deleting
for (String path : fileGroup) {
  if (samplePaths.size() < MAX_ORPHAN_FILE_PATHS_TO_RETURN_WHEN_STREAMING) {
    samplePaths.add(path);
  }
}
Maybe do it after deleting, so we only put successfully deleted files into the sample?
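A sketch of that suggestion; deleteGroup() is a hypothetical stand-in for the actual delete call on a file group:

// Hypothetical reordering: delete first, then sample, so samplePaths only
// contains files that were actually removed. A failed delete throws before sampling.
deleteGroup(fileGroup);
for (String path : fileGroup) {
  if (samplePaths.size() < MAX_ORPHAN_FILE_PATHS_TO_RETURN_WHEN_STREAMING) {
    samplePaths.add(path);
  } else {
    break;
  }
}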
Closes #3703

Description

This PR adds streaming support to the remove_orphan_files Spark procedure to prevent driver OOM issues when dealing with tables that have many orphan files. This mimics the existing behavior for expire_snapshots with the stream_results parameter that was added in #4152.

Changes

- Added a stream-results option to DeleteOrphanFilesSparkAction
- Added a deleteFilesStreaming() method that processes files using toLocalIterator()
- Added STREAM_RESULTS_PARAM to RemoveOrphanFilesProcedure
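As a usage sketch via the Action API (the option key follows the change list above; spark and table are assumed to be in scope):

// Illustrative, not from the PR: enabling the new option on the Spark action.
DeleteOrphanFiles.Result result =
    SparkActions.get(spark)
        .deleteOrphanFiles(table)
        .option("stream-results", "true")
        .execute();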
Testing

- testStreamResults() - verifies streaming functionality with multiple orphan files
- testStreamResultsBackwardsCompatibility() - ensures non-streaming mode still works
- testStreamResultsWithDryRun() - tests streaming with dry run mode

Real-World Testing Results

Tested on AWS EMR with a production table containing ~3PB of orphaned data:

Key Findings:

Backwards Compatibility

Fully backwards compatible. Default behavior unchanged. Streaming is opt-in via the stream_results parameter.