
Conversation

@dyno commented Oct 8, 2025

Iceberg 1.8+ introduces support in `rewrite_data_files` for writing to an output spec that differs from the table’s current partition spec.

In our use case, the current table is partitioned by event/date/hour/batchId, where `batchId` serves as a work unit identifier. To address the small files problem, we want to roll up mature data to a coarser partition layout (event/date). This feature enables an efficient in-table rollup.
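For context, a rollup like this presumes the coarser event/date layout is registered as the table's current spec. A minimal sketch using Iceberg's spec evolution API, assuming the partition fields are literally named `hour` and `batchId`:

```java
// Sketch (not from the PR): evolve the current spec from
// event/date/hour/batchId down to event/date, given a loaded Table `table`.
// Files already written keep their original, finer spec id, which is what
// the rollup rewrite described below has to handle.
table.updateSpec()
    .removeField("hour")
    .removeField("batchId")
    .commit();
```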

We observed two issues:

  1. Files are grouped by the current table’s partition spec, which places each small file into its own group — effectively preventing any bin-packing.
  2. When running `rewrite_data_files` again, because the files no longer conform to the table’s current output partition spec, Spark places all files into a single group, causing a large and unnecessary shuffle.

This patch resolves the issue by grouping files according to the *output* partition spec instead. As an additional benefit, when the output spec is only partially compatible with the current one, the grouping logic will still align on shared partitions — for example, when rolling up from event/date/hour to event/server/date, it will group data primarily by event/date.
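A sketch of triggering the rollup from Spark, assuming the `output-spec-id` option introduced for `rewrite_data_files` in 1.8; the spec id and surrounding variables are illustrative:

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.spark.actions.SparkActions;

// Given a SparkSession `spark` and a loaded Table `table`:
int rollupSpecId = 1;  // illustrative: id of the coarser event/date spec in table.specs()

SparkActions.get(spark)
    .rewriteDataFiles(table)  // binpack is the default strategy
    .option("output-spec-id", String.valueOf(rollupSpecId))  // write with the coarser spec
    .execute();
```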

@dyno marked this pull request as draft October 8, 2025 22:29
@dyno marked this pull request as ready for review October 9, 2025 01:43
```java
});
}

private static Object convertPartitionValue(
```
@dyno (Author):

It's not exhaustive; it only covers the common year/month/day/hour cases.
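For illustration only (helper names hypothetical): Iceberg's time transforms store epoch ordinals — years/months/days/hours since 1970 — so these common conversions reduce to date arithmetic like this:

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// hour ordinals are hours since epoch; day ordinals are days since epoch
static int hourToDay(int hoursSinceEpoch) {
  return Math.floorDiv(hoursSinceEpoch, 24);
}

// month ordinals are months since 1970-01
static int dayToMonth(int daysSinceEpoch) {
  LocalDate date = LocalDate.ofEpochDay(daysSinceEpoch);
  return (int) ChronoUnit.MONTHS.between(LocalDate.ofEpochDay(0), date.withDayOfMonth(1));
}
```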

Contributor:

Can we have an exhaustive implementation?
Adding something half-finished seems problematic to me.

```java
groups,
group ->
enoughInputFiles(group)
(group.size() >= 1
```
@dyno (Author):

Mostly for the single-file case.

Contributor:

Are we rewriting a single file just so we can change the specId?

@dyno commented Oct 9, 2025

@pvary @Parth-Brahmbhatt could you help review this? Thanks.

@pvary commented Oct 10, 2025

> This patch resolves the issue by grouping files according to the *output* partition spec instead.

What happens if the file contains data for multiple new partitions? Say the new output spec is more detailed than the original one, or maybe even contains entirely different partitions?

I didn't have time to go through your code, but there are existing tools to get the partition value for a row with a given spec. You might want to use them.

Also please fix the test issues.

@dyno commented Oct 10, 2025

> > This patch resolves the issue by grouping files according to the *output* partition spec instead.
>
> What happens if the file contains data for multiple new partitions? Say the new output spec is more detailed than the original one, or maybe even contains entirely different partitions?

Let's say the file's existing partition is event/date. Then (sketched just below):

- If the output spec is more detailed, e.g. event/date/hour, the function will group by event/date/null. That is still more performant than grouping everything together and then repartitioning. An output spec of event/team/date/hour yields event/null/date/null, which is still better.
- If the output spec is totally different, e.g. team, the extracted partition key will be team=null, effectively one emptyPartition file group, which is the current behavior.
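A minimal sketch of that key derivation (hypothetical helper; the PR's actual code differs): one slot per output spec field, taken from the file's partition tuple when the current spec has a field over the same source column, and null otherwise:

```java
import java.util.List;
import org.apache.iceberg.PartitionField;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.StructLike;

static Object[] groupingKey(
    PartitionSpec outputSpec, PartitionSpec currentSpec, StructLike filePartition) {
  List<PartitionField> outFields = outputSpec.fields();
  List<PartitionField> curFields = currentSpec.fields();
  Object[] key = new Object[outFields.size()];
  for (int i = 0; i < outFields.size(); i++) {
    for (int j = 0; j < curFields.size(); j++) {
      // a real implementation must also check transform compatibility and
      // convert ordinals (e.g. hour -> day), as discussed further down
      if (outFields.get(i).sourceId() == curFields.get(j).sourceId()) {
        key[i] = filePartition.get(j, Object.class);
        break;
      }
    }
    // unmatched slots stay null, which is what produces event/date/null etc.
  }
  return key;
}
```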

> I didn't have time to go through your code, but there are existing tools to get the partition value for a row with a given spec. You might want to use them.

I think you mean coercePartition(). It does not handle hidden partitioning correctly, e.g. extracting date from hour, and it needs to recompute the extraction from the spec every time. Let me see if I can just use coercePartition().

> Also please fix the test issues.

Will do.

```java
tasks,
task ->
outsideDesiredFileSizeRange(task) || tooManyDeletes(task) || tooHighDeleteRatio(task));
(task.file() != null && task.file().specId() != outputSpecId())
```
Contributor:

Are we rewriting a single file just so we can change the specId?

@dyno (Author):

It's more about making the storage layout match the spec. E.g. in our case we want to rewrite the file from
`s3://bucket/datalake/event=xxx/team=xxx/batchid=xxx/date=xxx/hour=xxx/xxx.parquet` (current spec)
to
`s3://bucket/datalake/event=xxx/team=xxx/date=xxx/xxx.parquet` (rollup spec).

Given that a lot of policy is based on the S3 prefix, it's better to rewrite even if there is only one file.

Contributor:

Could we move this to a different PR?
I think this one is a bit questionable, and other community members might have different opinions about it.

@pvary commented Oct 13, 2025

> > I didn't have time to go through your code, but there are existing tools to get the partition value for a row with a given spec. You might want to use them.
>
> I think you mean coercePartition(). It does not handle hidden partitioning correctly, e.g. extracting date from hour, and it needs to recompute the extraction from the spec every time. Let me see if I can just use coercePartition().

What does "does not handle hidden partitioning correctly" mean?

@dyno commented Oct 13, 2025

> > > I didn't have time to go through your code, but there are existing tools to get the partition value for a row with a given spec. You might want to use them.
> >
> > I think you mean coercePartition(). It does not handle hidden partitioning correctly, e.g. extracting date from hour, and it needs to recompute the extraction from the spec every time. Let me see if I can just use coercePartition().
>
> What does "does not handle hidden partitioning correctly" mean?

```java
private StructProjection(StructType structType, StructType projection, boolean allowMissing) {
...
if (projectedField.fieldId() == dataField.fieldId()) {
```

In the hidden partitioning case, date(ts) and hour(ts) are two different fields with different fieldIds, so the projection comes back null. It's better to compare the sourceId() and check whether the transforms are "compatible" (satisfiesOrderOf?), but without the spec, that information is lost.
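A sketch of the sourceId-based check (the helper is hypothetical; `Transform#satisfiesOrderOf` is existing Iceberg API):

```java
import org.apache.iceberg.PartitionField;

// True when the output field can be derived from the current field:
// same source column, and the current transform orders at least as
// finely, e.g. hour(ts) satisfies the order of day(ts).
static boolean canDeriveFrom(PartitionField outField, PartitionField curField) {
  return outField.sourceId() == curField.sourceId()
      && curField.transform().satisfiesOrderOf(outField.transform());
}
```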

@pvary commented Oct 14, 2025

> In the hidden partitioning case, date(ts) and hour(ts) are two different fields with different fieldIds, so the projection comes back null. It's better to compare the sourceId() and check whether the transforms are "compatible" (satisfiesOrderOf?), but without the spec, that information is lost.

Thanks for the explanation. This makes sense.

@dyno changed the title from "Core: Group binpack fileGroup by output partitonSpec" to "Core: Group binpack fileGroup by output partitionSpec" on Oct 15, 2025