Conversation

arifazmidd

Closes #3703

Description

This PR adds streaming support to the remove_orphan_files Spark procedure to prevent driver OOM issues when dealing with tables that have many orphan files.

This mirrors the existing behavior of expire_snapshots, whose stream_results parameter was added in #4152.

Changes

  • Added stream-results option to DeleteOrphanFilesSparkAction
  • Added deleteFilesStreaming() method that processes files using toLocalIterator()
  • Returns a sample of up to 20,000 file paths, plus a summary row showing the total count
  • Added STREAM_RESULTS_PARAM to RemoveOrphanFilesProcedure
  • Added test coverage (3 new tests)
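
The core streaming idea in the changes above can be sketched as follows. This is a minimal, self-contained illustration with hypothetical names, not the actual `DeleteOrphanFilesSparkAction` code: file paths are consumed from an iterator one at a time (as `toLocalIterator()` would provide them), only a bounded sample is retained in driver memory, and the total is counted separately.

```java
import java.util.Iterator;
import java.util.List;

// Sketch of bounded-memory streaming deletion (hypothetical class/method names).
public class StreamingDeleteSketch {
  // Sample cap taken from the PR description.
  static final int MAX_SAMPLE_SIZE = 20_000;

  // Returns the number of files processed; fills `sample` with at most
  // MAX_SAMPLE_SIZE paths. Driver memory stays bounded regardless of the
  // total number of orphan files.
  static long deleteFilesStreaming(Iterator<String> orphanFiles, List<String> sample) {
    long deletedCount = 0;
    while (orphanFiles.hasNext()) {
      String path = orphanFiles.next();
      if (sample.size() < MAX_SAMPLE_SIZE) {
        sample.add(path); // only a bounded sample is retained
      }
      deletedCount++; // a real implementation would delete `path` here
    }
    return deletedCount;
  }
}
```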

Testing

  • Added testStreamResults() - verifies streaming functionality with multiple orphan files
  • Added testStreamResultsBackwardsCompatibility() - ensures non-streaming mode still works
  • Added testStreamResultsWithDryRun() - tests streaming with dry run mode

Real-World Testing Results

Tested on AWS EMR with a production table containing ~3PB of orphaned data:

| Run | Learnings |
| --- | --- |
| Dry run using original implementation | Driver crashes due to OOM. |
| Dry run using stream results | Completed successfully. Returned a sample of 20k file paths to be deleted and a total count of ~45M files to be deleted. |
| Full run using original implementation | Driver crashes due to OOM. |
| Full run using stream results | Terminated after 4 hours because that was the configured timeout; however, it successfully iterated through and deleted ~31M files. |
| Full run using original implementation (after streaming run) | Completed successfully. Deleted the remaining ~14M orphaned files. |
| Full run using stream results on sps-eta (validation run) | Completed successfully. Deleted ~1200 new orphan files. |

Key Findings:

  • Original implementation consistently crashes with OOM on large-scale orphan file cleanup (~45M files)
  • Streaming implementation successfully handles massive workloads without memory issues
  • Successfully deleted ~31M files in a single streaming run (terminated by timeout, not failure)
  • Streaming approach enables incremental cleanup of large orphan file sets

Backwards Compatibility

Fully backwards compatible. Default behavior unchanged. Streaming is opt-in via stream_results parameter.
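
The opt-in behavior can be illustrated with a small self-contained sketch. This simplified parsing helper is not Iceberg's `PropertyUtil`; the option name mirrors the `stream-results` action option from this PR, and the point is that an absent option falls back to `false`, so existing callers see no change.

```java
import java.util.Map;

// Sketch of opt-in option parsing: streaming is off unless explicitly enabled.
public class StreamResultsOption {
  static final String STREAM_RESULTS = "stream-results";
  static final boolean STREAM_RESULTS_DEFAULT = false; // default behavior unchanged

  static boolean streamResults(Map<String, String> options) {
    String value = options.get(STREAM_RESULTS);
    return value == null ? STREAM_RESULTS_DEFAULT : Boolean.parseBoolean(value);
  }
}
```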

@github-actions github-actions bot added the spark label Oct 8, 2025
@arifazmidd
Author

Hi @RussellSpitzer @pvary @liziyan-lzy @huaxingao, if any of you have time to help review this that would be greatly appreciated!

@pvary
Contributor

pvary commented Oct 14, 2025

@arifazmidd: Could you fix the test please?

@arifazmidd
Author

arifazmidd commented Oct 14, 2025

Thanks for running the CI @pvary; I have fixed the formatting issues.

```java
"max_concurrent_deletes only works with FileIOs that do not support bulk deletes."
    + " Thistable is currently using {} which supports bulk deletes so the"
    + "See that IO's documentation to learn how to adjust parallelism for that particular "
    + "IO's bulk delete.",
```
Contributor

nit space

Contributor

Why do we change this?

Author

The space wasn't there before, but I can add it. When I ran spotlessApply locally, it updated this string concatenation.

Contributor

@pvary pvary Oct 17, 2025

Could you please share the command you used?

When I run:

```shell
./gradlew spotlessApply -DallModules=true
```

I don't see any changes.

```java
"Cannot remove orphan files with an interval less than 24 hours. Executing this procedure"
    + "at the same time. If you are absolutely confident that no concurrent operations will be "
    + "affected by removing orphan files with such a short interval, you can use the Action API "
    + "to remove orphan files with an arbitrary interval.");
```
Contributor

Why do we change this?

Author

Same as above: the content of the string is exactly the same, but running spotlessApply locally updated the string concatenation.

```java
  LOG.warn(
      "Deleted only {} of {} files using bulk deletes", deletedFilesCount, paths.size(), e);
}

private boolean streamResults() {
```
Contributor

Is this worth its own method? It's only used once.

Author

I was keeping it consistent with how it's already set up in ExpireSnapshotsSparkAction:

```java
private ExpireSnapshots.Result doExecute() {
  if (streamResults()) {
    return deleteFiles(expireFiles().toLocalIterator());
  } else {
    return deleteFiles(expireFiles().collectAsList().iterator());
  }
}

private boolean streamResults() {
  return PropertyUtil.propertyAsBoolean(options(), STREAM_RESULTS, STREAM_RESULTS_DEFAULT);
}
```

Contributor

@pvary pvary Oct 17, 2025

There are some places where we handle this differently, for example:

```java
long defaultMaxSnapshotAgeMs =
    PropertyUtil.propertyAsLong(
        base.properties(), MAX_SNAPSHOT_AGE_MS, MAX_SNAPSHOT_AGE_MS_DEFAULT);
```

I see your point. Still a bit strange to me, but I don't have a strong opinion on this.

```java
 * @param orphanFiles list of file paths to delete (already in driver memory)
 */
private void deleteFilesCollected(List<String> orphanFiles) {
  if (deleteFunc == null && table.io() instanceof SupportsBulkOperations) {
```
Contributor

Is this change for another feature?

I think this should be an independent PR/change where we start using bulk delete when the io supports bulk operation. This has nothing to do with streaming the results

Author

@arifazmidd arifazmidd Oct 16, 2025

Bulk delete support already existed in the original implementation. This isn't new functionality; it's a refactoring into separate functions for streaming vs. non-streaming deletion.

The current implementation already supports bulk delete:

```java
private void deleteFiles(SupportsBulkOperations io, List<String> paths) {
```

Contributor

Oh.. you inlined the method

Contributor

I see that the logic is now duplicated.

Is there a way to reuse the code?
Maybe reusing the deleteFilesCollected inside the deleteFilesStreaming?
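
One way the reuse suggestion above could look, as a self-contained sketch with hypothetical names (not the PR's actual code): stream the iterator into fixed-size batches and hand each batch to the same collected-delete routine, so the bulk-delete logic lives in one place while driver memory stays bounded by the batch size.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Sketch: reuse a collected-delete routine from the streaming path by batching.
public class BatchedStreamingDelete {
  // `deleteBatch` stands in for something like the existing deleteFilesCollected(...).
  static long deleteInBatches(
      Iterator<String> files, int batchSize, Consumer<List<String>> deleteBatch) {
    long total = 0;
    List<String> batch = new ArrayList<>(batchSize);
    while (files.hasNext()) {
      batch.add(files.next());
      if (batch.size() == batchSize) {
        deleteBatch.accept(batch); // shared deletion logic, bounded memory
        total += batch.size();
        batch = new ArrayList<>(batchSize);
      }
    }
    if (!batch.isEmpty()) { // flush the final partial batch
      deleteBatch.accept(batch);
      total += batch.size();
    }
    return total;
  }
}
```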

```java
assumeThat(usePrefixListing)
    .as(
        "This test verifies default listing behavior and does not require prefix listing to be"
            + " enabled.")
```
Contributor

Why this change?

Author

Same as above, the content of the string is the exact same but running spotlessApply locally updated the string concatenation.

Comment on lines +291 to +296
```java
// Collect sample paths before deleting
for (String path : fileGroup) {
  if (samplePaths.size() < MAX_ORPHAN_FILE_PATHS_TO_RETURN_WHEN_STREAMING) {
    samplePaths.add(path);
  }
}
```
Contributor

Maybe do it after deleting, so that we only put successfully deleted files into the sample?
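
The suggestion above could look roughly like this self-contained sketch (hypothetical helper names, not the PR's code): attempt the delete first, and record a path in the sample only when the delete reports success, so the returned sample never lists files that failed to delete.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Sketch: sample only successfully deleted paths.
public class SampleAfterDelete {
  // `tryDelete` stands in for the real per-file delete and returns true on success.
  static List<String> deleteAndSample(
      List<String> fileGroup, Predicate<String> tryDelete, int maxSample) {
    List<String> samplePaths = new ArrayList<>();
    for (String path : fileGroup) {
      boolean deleted = tryDelete.test(path);
      if (deleted && samplePaths.size() < maxSample) {
        samplePaths.add(path); // only record paths that were actually deleted
      }
    }
    return samplePaths;
  }
}
```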


Development

Successfully merging this pull request may close these issues.

DeleteOrphanFiles or ExpireSnapshots outofmemory
