[feat] add_files: support parquet files with field ids #1227
Is this the error message you see? (iceberg-python/pyiceberg/io/pyarrow.py, lines 2515 to 2518 in 7cf0c22)

Yes, this is correct.
Ty, updated the title. LMK if this is something you would like to work on 😄
Thanks for raising this @MrDerecho - in the initial version of add_files, we wanted to limit it to just parquet files that were created in an external system. The assumption is that unless the files are created by an Iceberg client that is cognizant of the Iceberg schema, there would be no way for the parquet writing process to use the correct field IDs in the produced parquet schema.
This sounds like a really cool use case, but I'd like to understand it better - why isn't the application (Trino/Spark) that is doing the compaction committing the compacted files into Iceberg itself?
For context: right now I manage a very large data lake of time-partitioned data. The use case has to do with the archival process put into place wherein, after a rolling period of time, these files are “deleted” from the table and copied (before snapshot expiry) into an archive prefix for later use (if needed), subject to a lifecycle policy, i.e. Glacier to physical deletion after some time.

Because I use Trino to optimize, I have a mix of pyiceberg batch files and Trino/Spark-generated files that may require being “re-added” at a later date. Let me know if you have any other questions. Thanks for the help.
I think we can relax this constraint. A use case I can think of is moving an Iceberg table from one system and recreating it in another, similar to what's described above.
@MrDerecho would you like to contribute this feature?
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.
Feature Request / Improvement
Currently, if I am using pyiceberg to create/maintain my iceberg tables and I use Trino (AWS Athena) to do compaction on the same tables (using Spark), the files created via compaction cannot be "re-added" using the add_files method at a later time. If we can make this configurable, that would be great. Thanks.