[feat] add_files: support parquet files with field ids #1227
Is this the error message you see? (iceberg-python/pyiceberg/io/pyarrow.py, lines 2515 to 2518 in 7cf0c22)

Yes, this is correct.
Ty, updated the title. LMK if this is something you would like to work on 😄
Thanks for raising this @MrDerecho - in the initial version of add_files, we wanted to limit it to just parquet files that were created in an external system. The assumption is that unless the files are created by an Iceberg client that is cognizant of the Iceberg schema, there would be no way for the parquet writing process to use the correct field IDs in the produced parquet schema.
This sounds like a really cool use case, but I'd like to understand it better - why isn't the application (Trino/Spark) that is doing the compaction committing the compacted files into Iceberg itself?
For context: right now I manage a very large data lake of time-partitioned data. The use case has to do with the archival process put into place wherein, after a rolling period of time, these files are “deleted” from the table and copied (before snapshot expiry) into an archive prefix for later use (if needed), subject to a lifecycle policy, i.e. Glacier to physical deletion after some time.

Because I use Trino to optimize, I have a mix of pyiceberg batch files and Trino/Spark-generated files that may require being “re-added” at a later date. Let me know if you have any other questions. Thanks for the help.
I think we can relax this constraint. A use case I can think of is moving an Iceberg table from one system and recreating it in another, similar to what's described above.
@MrDerecho would you like to contribute this feature?
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.
Feature Request / Improvement
Currently, if I am using pyiceberg to create/maintain my iceberg tables and I use Trino (AWS Athena) to do compaction on the same tables (using Spark), the files created via compaction cannot be "re-added" using the add_files method at a later time. If we can make this configurable, that would be great. Thanks.