
[Feature Request] Add Writer Support for Table-Compatible Parquet Files #1737


Closed
andormarkus-alcd opened this issue Feb 27, 2025 · 2 comments


@andormarkus-alcd

andormarkus-alcd commented Feb 27, 2025

Feature Request / Improvement

Problem Statement

PR: #1742

PyIceberg currently provides functionality to add existing Parquet files to an Iceberg table using add_files(), which is useful when files already exist in a compatible format. However, the library lacks a convenient way to write new Parquet files that are automatically compatible with a given Iceberg table, specifically:

  1. There's no straightforward API to write Parquet files that match an Iceberg table's schema, partitioning, and other metadata requirements
  2. Users currently need to implement complex logic to ensure schema alignment, partition compatibility, and so on (see the sketch after this list)
  3. This creates an unnecessary barrier for users wanting to write files that can later be added to Iceberg tables without rewriting
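
Today that manual logic looks roughly like the following (a hedged sketch with pyarrow; the catalog name, table name, columns, and file path are made up for illustration, and df is the incoming pyarrow.Table as in the API sketch further down):

# Rough sketch of what users have to hand-roll today (all names and paths are illustrative)
import pyarrow as pa
import pyarrow.parquet as pq
from pyiceberg.catalog import load_catalog

tbl = load_catalog("default").load_table("db.events")

# Keep this Arrow schema manually in sync with the Iceberg table schema,
# and leave out Parquet field IDs so add_files() can rely on name mapping
arrow_schema = pa.schema([
    ("event_id", pa.int64()),
    ("event_ts", pa.timestamp("us")),
    ("payload", pa.string()),
])

# Cast the incoming data and pick a file name under the table's data location
batch = df.cast(arrow_schema)
file_path = f"{tbl.location()}/data/part-00000.parquet"
pq.write_table(batch, file_path)

# Finally register the file with the table
tbl.add_files(file_paths=[file_path])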

Use Case: High-Throughput Ingestion with AWS Lambda

We are currently using AWS Lambda functions to write to Iceberg tables. When ingesting large volumes of files concurrently, we run into Lambda concurrency limits because:

  • The Parquet writing process is the most time-consuming part of the operation
  • The commit phase is relatively fast

By separating these operations (writing compatible Parquet files independently, then committing via add_files() through a queue, as sketched after the list below), we could significantly increase our throughput. This would allow us to:

  • Use Lambda functions to write compatible Parquet files in parallel
  • Queue the much faster commit operations separately
  • Use a second Lambda function to process these queued operations in bulk (e.g., every minute), committing multiple files at once rather than one by one
  • Avoid concurrency limits that are currently bottlenecking our ingestion pipeline
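
For context, the split would look roughly like this (a hedged sketch; the queue URL, table identifier, and the write_compatible_parquet helper are assumptions — the helper being exactly the API this issue asks for):

# Illustrative write/commit split across two Lambda functions
import json
import boto3
from pyiceberg.catalog import load_catalog
from my_project.writer import write_compatible_parquet  # hypothetical helper requested in this issue

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/iceberg-commits"  # illustrative

def ingest_handler(event, context):
    # Many concurrent Lambdas: write a compatible Parquet file (slow), then only enqueue its path
    file_path = write_compatible_parquet(event["records"])
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps({"file_path": file_path}))

def commit_handler(event, context):
    # One scheduled Lambda (e.g. every minute): commit all queued files in a single operation (fast)
    tbl = load_catalog("default").load_table("db.events")
    file_paths = [json.loads(record["body"])["file_path"] for record in event["Records"]]
    tbl.add_files(file_paths=file_paths)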

Proposed Solution

Add a new API to PyIceberg that allows writing table-compatible Parquet files. This could look something like:

# Possible API design
writer = tbl.parquet_writer()
writer.write_dataframe(df)  # No destination_path needed as table has location info

# Or alternatively
tbl.write_parquet(df)  # Writes to table's data location with appropriate naming

# These files could then be added without rewriting
tbl.add_files()  # Can discover compatible files in the table's data location

The implementation would (a rough sketch follows the list):

  1. Automatically handle schema alignment
  2. Apply correct partition transforms
  3. Add appropriate metadata to ensure compatibility with add_files()
  4. Set up Name Mapping appropriately
  5. Generate files without field IDs in the Parquet metadata (as required by add_files())
  6. Use the table's location information to determine write paths automatically
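
As a starting point, here is a very rough sketch of such a helper for an unpartitioned table. It assumes pyiceberg's schema_to_pyarrow conversion accepts include_field_ids; partition transforms and name-mapping setup are omitted:

# Very rough sketch, not an actual implementation; unpartitioned tables only
import uuid
import pyarrow as pa
import pyarrow.parquet as pq
from pyiceberg.io.pyarrow import schema_to_pyarrow
from pyiceberg.table import Table

def write_parquet(tbl: Table, df: pa.Table) -> str:
    # Schema alignment: derive the Arrow schema from the Iceberg schema,
    # dropping Parquet field IDs so add_files() can rely on name mapping
    target_schema = schema_to_pyarrow(tbl.schema(), include_field_ids=False)
    aligned = df.cast(target_schema)

    # Use the table's location to choose a write path with a unique file name
    file_path = f"{tbl.location()}/data/{uuid.uuid4()}.parquet"
    pq.write_table(aligned, file_path)
    return file_path

# The returned paths can later be registered in bulk:
# tbl.add_files(file_paths=[write_parquet(tbl, df)])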

Benefits

This would create a complete workflow for efficiently managing Iceberg tables:

  • Write compatible files
  • Add them to tables without rewriting
  • Perform normal maintenance operations

This new feature would remove the need for users to hand-build files that satisfy add_files()'s compatibility requirements.

@Fokko
Contributor

Fokko commented Mar 2, 2025

Thanks @andormarkus-alcd for raising this and for the comprehensive writeup. I think a lot of this would be solved if we implement #1678. Just before committing, it will refresh the table and check whether everything is still compatible. Thoughts?

@andormarkus-alcd
Author

Superseded by #1751
