Add `write_parquet` API for writing Parquet files without committing #1742
Conversation
This method allows users to write a PyArrow table to the table's storage as Iceberg-compatible Parquet files without committing them to the table. The method returns the list of file paths that were written, enabling workflows that need access to the data files before committing metadata changes. It also adds an `include_field_ids` parameter to the underlying `write_file` and `_dataframe_to_data_files` functions to provide more control over the Parquet writing process.
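For orientation, a rough sketch of how the proposed method would be called. `write_parquet` is the API proposed by this PR (not an existing pyiceberg method), and the catalog configuration and table identifier are hypothetical:

```python
import pyarrow as pa

from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")      # assumed catalog configuration
tbl = catalog.load_table("db.events")  # hypothetical table identifier

df = pa.table({"id": pa.array([1, 2, 3], pa.int64())})

# Proposed behaviour: write Iceberg-compatible Parquet files and return the
# written paths without creating a snapshot or touching table metadata.
# (The PR also threads an include_field_ids flag through the internal
# write_file / _dataframe_to_data_files path.)
file_paths = tbl.write_parquet(df)
print(file_paths)  # e.g. ["s3://warehouse/db/events/data/00000-0-<uuid>.parquet"]
```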
Thanks for the PR @andormarkus
This is useful functionality, but I'm not sure it's something we would want to add to the pyiceberg library. I understand the motivation, but I think it is specific to your use case.
The `write_parquet` API is not related to the Iceberg `Table` class; it deals only with the underlying data files, so adding it to the `Table` class can lead to confusion.
As a user of the pyiceberg library, I can manually flush dataframes to Parquet and then register them to the Iceberg table with `add_files`.
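For illustration, a minimal sketch of that manual flush-and-register flow using only existing pyiceberg APIs (catalog configuration, table name, and file path are hypothetical):

```python
import pyarrow as pa
import pyarrow.parquet as pq

from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")      # assumed catalog configuration
tbl = catalog.load_table("db.events")  # hypothetical table identifier

df = pa.table({"id": pa.array([1, 2, 3], pa.int64())})

# Write the Parquet file somewhere the table's storage can reach;
# the exact path layout here is illustrative.
path = f"{tbl.location()}/data/manual-00001.parquet"
pq.write_table(df, path)

# Register the already-written file with the table in a single commit.
tbl.add_files(file_paths=[path])
```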
Similar to what you describe:

```python
with tbl.update_snapshot().fast_append() as update_snapshot:
    data_files = _dataframe_to_data_files(..., df, ...)
    for data_file in data_files:
        update_snapshot.append_data_file(data_file)
```
Please let me know if that resolves your specific issue.
Thanks for your feedback! I understand your concern about adding this to the `Table` class. The primary issue I'm trying to solve involves distributed environments. While your suggested approach works well in a single process, my use case involves multiple distributed processes: one process writes data files and another commits them to the table, which requires simple communication between these processes. Passing `DataFile` objects between those processes is the painful part. What I need is a simpler workflow where the writer produces Parquet files and hands over plain file paths, and the committer registers those paths against the table. This approach avoids having to pass complex objects like `DataFile` between processes.
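For example, a sketch of the kind of lightweight hand-off this enables, assuming plain file paths are exchanged instead of `DataFile` objects (the transport, catalog configuration, and names below are illustrative):

```python
import json

from pyiceberg.catalog import load_catalog

# Writer side: paths are plain strings, so they serialize trivially.
file_paths = ["s3://warehouse/db/events/data/part-00001.parquet"]  # illustrative
message = json.dumps({"table": "db.events", "paths": file_paths})
# ...send `message` over any transport (SQS, Kafka, Redis, a database row)...

# Committer side: everything needed is reconstructed from the message alone.
payload = json.loads(message)
catalog = load_catalog("default")  # assumed catalog configuration
tbl = catalog.load_table(payload["table"])
tbl.add_files(file_paths=payload["paths"])
```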
Thanks @andormarkus for working on this. I've just commented on #1737; I would love to hear your thoughts there as well.
Hi @Fokko, we ended up building a dynamic custom serialize/deserialize function that supports gzip and zlib compression to deal with the size issue. We needed to make it dynamic so that it copes with changes to the table schema and partition spec. I'm happy to contribute the serialize/deserialize functions if the maintainers agree on this approach.
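As a toy illustration of the compression layer only (the actual `DataFile` serialize/deserialize logic is custom and not shown here):

```python
import gzip
import zlib

def compress(payload: bytes, codec: str = "zlib") -> bytes:
    # Shrink an already-serialized payload before sending it between processes.
    return gzip.compress(payload) if codec == "gzip" else zlib.compress(payload, level=6)

def decompress(blob: bytes, codec: str = "zlib") -> bytes:
    return gzip.decompress(blob) if codec == "gzip" else zlib.decompress(blob)
```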
I understand that it is a streaming workload? In that case, writing the manifest doesn't help a lot. I understand the problem; let me think out loud. The problem with the current approach (in this PR) is that when the schema or partition spec changes, the already-written files may no longer line up with the table metadata.
The most obvious way of serializing it is by using Avro. This is efficient over the wire as well (I expect it to be much smaller than jsonpickle or regular pickle). I would be in favor of having this in combination with `add_files`.
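To make the Avro idea concrete, a rough sketch using fastavro with an illustrative record schema (this is not the Iceberg manifest schema, just a stand-in for the per-file metadata that would travel between processes):

```python
import io

import fastavro

# Illustrative schema: a few fields of per-file metadata, not the real manifest layout.
SCHEMA = {
    "type": "record",
    "name": "DataFileInfo",
    "fields": [
        {"name": "file_path", "type": "string"},
        {"name": "record_count", "type": "long"},
        {"name": "file_size_in_bytes", "type": "long"},
    ],
}

def to_avro_bytes(records: list[dict]) -> bytes:
    buf = io.BytesIO()
    fastavro.writer(buf, SCHEMA, records)  # compact binary; schema travels with the payload
    return buf.getvalue()

def from_avro_bytes(blob: bytes) -> list[dict]:
    return list(fastavro.reader(io.BytesIO(blob)))
```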
We want to avoid jsonpickle / regular pickle. I have no problem with Avro (`ManifestFile`) combined with `add_files`. Yes, `write_parquet` from the current PR is kind of obsolete. I like your #1678, however it would create too many commits; we implemented that suggestion and the resulting number of commits was a real performance killer. I will close this PR and the related issue and reopen everything as a distributed write proposal.
This PR adds a new API method `write_parquet()` to the `Table` class, which allows writing a PyArrow table to Parquet files in Iceberg-compatible format without committing them to the table metadata. This provides a way to decouple the write and commit process, which is particularly useful in high-concurrency scenarios.
Key features
- `write_parquet(df)` writes Parquet files compatible with the Iceberg table format
- Returns a list of file paths that can later be registered with the `add_files()` API
Use case
This is especially useful for high-concurrency ingestion scenarios where multiple writers could be writing data to an Iceberg table simultaneously. By separating the write and commit phases, applications can implement a queue system where the commit process (which requires a lock) is handled separately from the data writing phase:
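A rough sketch of such a flow, assuming the `write_parquet` method proposed here plus the existing `add_files` API; the in-process queue stands in for whatever coordination layer (SQS, Kafka, a database table) a real deployment would use:

```python
from queue import Queue

import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")      # assumed catalog configuration
tbl = catalog.load_table("db.events")  # hypothetical table identifier
df = pa.table({"id": pa.array([1, 2, 3], pa.int64())})

commit_queue = Queue()  # stand-in for a real message queue

# Writer(s): write data files only; no table metadata is modified here.
file_paths = tbl.write_parquet(df)  # proposed API from this PR
commit_queue.put(file_paths)

# Single committer: drains the queue and commits, so only one process ever
# contends for the table's commit path.
while not commit_queue.empty():
    tbl.add_files(file_paths=commit_queue.get())
```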
Documentation
Added comprehensive documentation to the API docs, including explanations and examples of how to use the new method alongside the existing add_files API.
Seeking guidance
I would appreciate guidance from the project maintainers on this approach.
Closes #1737