Feature Request / Improvement
Problem Statement
PR: #1742
PyIceberg currently provides functionality to add existing Parquet files to an Iceberg table using add_files(), which is useful when files already exist in a compatible format. However, the library lacks a convenient way to write new Parquet files that are automatically compatible with the Iceberg table format, specifically:
There's no straightforward API to write Parquet files that match an Iceberg table's schema, partitioning, and other metadata requirements
Users currently need to implement complex logic to ensure schema alignment, partition compatibility, etc.
This creates an unnecessary barrier for users wanting to write files that can later be added to Iceberg tables without rewriting (today's manual workflow is sketched below)
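For context, a minimal sketch of what the current workflow looks like, assuming an unpartitioned table; the catalog name, table identifier, data, and file path are placeholders, and the user is entirely responsible for producing a file that already matches the table:

```python
import pyarrow as pa
import pyarrow.parquet as pq
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")          # placeholder catalog name
tbl = catalog.load_table("db.events")      # placeholder table identifier

# The user must make sure this Arrow table already matches the Iceberg table's
# schema, types, and partition layout -- PyIceberg does not help with that today.
df = pa.table({"id": [1, 2], "payload": ["a", "b"]})

file_path = "s3://bucket/warehouse/db/events/data/part-00001.parquet"  # placeholder path
pq.write_table(df, file_path)

# Registering an already-compatible file is the part PyIceberg covers well today.
tbl.add_files(file_paths=[file_path])
```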
Use Case: High-Throughput Ingestion with AWS Lambda
We are currently using AWS Lambda functions to write to Iceberg tables. When ingesting large volumes of files concurrently, we run into Lambda concurrency limits because:
The Parquet writing process is the most time-consuming part of the operation
The commit phase is relatively fast
By separating these operations (writing compatible Parquet files independently, then committing via add_files() through a queue), we could significantly increase our throughput; a rough sketch of this split follows the list below. This would allow us to:
Use Lambda functions to write compatible Parquet files in parallel
Queue the much faster commit operations separately
Use a second Lambda function to process these queued operations in bulk (e.g., every minute), committing multiple files at once rather than one by one
Avoid concurrency limits that are currently bottlenecking our ingestion pipeline
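A rough sketch of the split pipeline described above, assuming SQS as the queue; the handler names, queue URL, event payload shape, and table identifier are illustrative, not an existing implementation:

```python
import json

import boto3
import pyarrow as pa
import pyarrow.parquet as pq
from pyiceberg.catalog import load_catalog

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/iceberg-commits"  # placeholder


def writer_handler(event, context):
    """Runs at high concurrency: write a compatible Parquet file, defer the commit."""
    df = pa.Table.from_pylist(event["records"])  # assumed payload shape
    file_path = f"s3://bucket/warehouse/db/events/data/{context.aws_request_id}.parquet"  # placeholder
    pq.write_table(df, file_path)
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps({"file_path": file_path}))


def committer_handler(event, context):
    """Triggered by an SQS batch (e.g., every minute): one commit covering many files."""
    catalog = load_catalog("default")        # placeholder catalog name
    tbl = catalog.load_table("db.events")    # placeholder table identifier
    file_paths = [json.loads(record["body"])["file_path"] for record in event["Records"]]
    tbl.add_files(file_paths=file_paths)     # single snapshot instead of one commit per file
```

Batching this way keeps the slow Parquet writes fully parallel while the fast add_files() commits are serialized through the queue, which is exactly the separation the proposed API would make convenient.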
Proposed Solution
Add a new API to PyIceberg that allows writing table-compatible Parquet files. This could look something like:
```python
# Possible API design
writer = tbl.parquet_writer()
writer.write_dataframe(df)  # No destination_path needed as table has location info

# Or alternatively
tbl.write_parquet(df)  # Writes to table's data location with appropriate naming

# These files could then be added without rewriting
tbl.add_files()  # Can discover compatible files in the table's data location
```
The implementation would (a manual equivalent of these steps is sketched after this list):
Automatically handle schema alignment
Apply correct partition transforms
Add appropriate metadata to ensure compatibility with add_files()
Set up Name Mapping appropriately
Generate files without field IDs in the Parquet metadata (as required by add_files())
Use the table's location information to determine write paths automatically
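For comparison, a hedged sketch of what users have to stitch together by hand today to cover the schema-alignment, field-ID, and name-mapping points (partition transforms are omitted); it assumes the existing schema_to_pyarrow and create_mapping_from_schema helpers, whose module paths may differ slightly between PyIceberg releases, plus placeholder catalog and table names:

```python
import pyarrow as pa
import pyarrow.parquet as pq
from pyiceberg.catalog import load_catalog
from pyiceberg.io.pyarrow import schema_to_pyarrow
from pyiceberg.table.name_mapping import create_mapping_from_schema

catalog = load_catalog("default")        # placeholder catalog name
tbl = catalog.load_table("db.events")    # placeholder table identifier

# 1. Schema alignment: cast incoming Arrow data to the table's schema.
target = schema_to_pyarrow(tbl.schema())
df = pa.table({"id": [1, 2], "payload": ["a", "b"]})  # illustrative input matching the table
df = df.cast(target)

# 2. Field IDs: rebuild the schema without field-id metadata so the file
#    satisfies the add_files() requirement mentioned above.
stripped = pa.schema([pa.field(f.name, f.type, f.nullable) for f in target])
df = df.cast(stripped)

# 3. Name mapping: make sure the table can resolve columns by name for files
#    written without field IDs.
if "schema.name-mapping.default" not in tbl.properties:
    mapping = create_mapping_from_schema(tbl.schema())
    with tbl.transaction() as tx:
        tx.set_properties({"schema.name-mapping.default": mapping.model_dump_json()})

# 4. Write path and file naming are still up to the user today.
pq.write_table(df, "s3://bucket/warehouse/db/events/data/part-00000.parquet")  # placeholder
```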
Benefits
This would create a complete workflow for efficiently managing Iceberg tables:
Write compatible files
Add them to tables without rewriting
Perform normal maintenance operations
This new feature would make it far simpler to create files that meet the table's compatibility requirements.
Thanks @andormarkus-alcd for raising this and for the comprehensive writeup. I think a lot of this would be solved if we implement #1678: just before committing, it will refresh the table and check whether everything is still compatible. Thoughts?