[feat] add_files support parquet files with field ids #1227


Closed
MrDerecho opened this issue Oct 11, 2024 · 9 comments

@MrDerecho

Feature Request / Improvement

Currently, if I am using pyiceberg to create/maintain my iceberg tables and I use Trino (AWS Athena) to do compaction on the same (using Spark)- the files created via compaction are unable to be "re-added" using the add_files method at a later time. If we can make this configurable that would be great. Thanks.

@kevinjqliu
Contributor

Is this the error message you see?

```python
# Validation in pyiceberg's add_files path: parquet files whose
# schema already carries field IDs are rejected outright.
if visit_pyarrow(parquet_metadata.schema.to_arrow_schema(), _HasIds()):
    raise NotImplementedError(
        f"Cannot add file {file_path} because it has field IDs. `add_files` only supports addition of files without field_ids"
    )
```
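For reference, a minimal way to trigger this check (paths and values are illustrative): pyarrow attaches field IDs through the `PARQUET:field_id` field-metadata key, which is how engines like Spark and Trino annotate Iceberg field IDs in parquet, and what the `_HasIds` visitor detects.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A schema whose fields carry field IDs via the PARQUET:field_id
# metadata key, as written by Iceberg-aware engines.
schema = pa.schema([
    pa.field("id", pa.int64(), metadata={"PARQUET:field_id": "1"}),
    pa.field("name", pa.string(), metadata={"PARQUET:field_id": "2"}),
])
pq.write_table(pa.table({"id": [1], "name": ["a"]}, schema=schema), "/tmp/with_ids.parquet")

# tbl would be a pyiceberg Table loaded from a catalog:
# tbl.add_files(["/tmp/with_ids.parquet"])  # raises NotImplementedError today
```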

@MrDerecho
Author

Yes, this is correct.

@kevinjqliu changed the title from "Iceberg Add Files (allow for parquet files with field id's)" to "[feat] add_files support parquet files with field ids" on Oct 14, 2024
@kevinjqliu
Contributor

Ty, updated the title. LMK if this is something you would like to work on 😄

@sungwy
Collaborator

sungwy commented Oct 15, 2024

Thanks for raising this @MrDerecho - in the initial version of add_files, we wanted to limit it to just parquet files that were created in an external system. The assumption is that unless the files are created by an Iceberg client that is cognizant of the Iceberg schema, there would be no way for the parquet writing process to use the correct field IDs in the produced parquet schema.

> Currently, if I am using pyiceberg to create/maintain my iceberg tables and I use Trino (AWS Athena) to do compaction on the same (using Spark)- the files created via compaction are unable to be "re-added" using the add_files method at a later time.

This sounds like a really cool use case, but I'd like to understand it better - why isn't the application (Trino/Spark) that is doing the compaction committing the compacted files into Iceberg itself?

@MrDerecho
Author

MrDerecho commented Oct 15, 2024 via email

@kevinjqliu
Contributor

I think we can relax the constraint in add_files to allow field IDs that are already aligned with the table schema, such as files written by an external engine like Trino.

A use case I can think of is moving an Iceberg table from one system and recreating it in another, similar to what's described above.
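One way such a relaxed check could work (a rough sketch, not pyiceberg's actual implementation; `field_ids_aligned` is a hypothetical helper and it only compares top-level fields): read the file's field IDs back through `pyarrow_to_schema` and compare them against the table schema.

```python
import pyarrow.parquet as pq
from pyiceberg.io.pyarrow import pyarrow_to_schema
from pyiceberg.schema import Schema

def field_ids_aligned(table_schema: Schema, file_path: str) -> bool:
    """Hypothetical helper: check that the field IDs embedded in a
    parquet file agree with the Iceberg table schema."""
    file_schema = pq.read_metadata(file_path).schema.to_arrow_schema()
    # pyarrow_to_schema picks up PARQUET:field_id metadata when present.
    iceberg_file_schema = pyarrow_to_schema(file_schema)
    for field in iceberg_file_schema.fields:
        try:
            table_field = table_schema.find_field(field.name)
        except ValueError:
            return False  # file has a column the table schema does not
        if table_field.field_id != field.field_id:
            return False
    return True
```

With a check like this, add_files could accept files whose IDs match (e.g. files compacted by Trino against the same table) while still rejecting files with arbitrary, conflicting IDs.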

@kevinjqliu
Contributor

@MrDerecho would you like to contribute this feature?


This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', though commenting on the issue is preferred when possible.

@github-actions github-actions bot added the stale label Apr 14, 2025

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.

@github-actions github-actions bot closed this as not planned Apr 28, 2025