Discussion: Next steps / requirements to support append files #329

@marvinlanhenke

Out of curiosity, I took a closer look at the pyiceberg implementation and how Table.append() works.

Now I would like to pick your brains to understand and track the next steps we have to take to support append as well (since we should be getting close to having write support). The goal here is to extract and create actionable issues.

Here is what I understand from the python impl so far (high-level):

  1. We call append() on the Table class with our DataFrame: pa.Table and the snapshot_properties: Dict[str, str].
  2. We create a Transaction that basically does two things:
    2.1. It creates a _MergingSnapshotProducer, which is (at a high level) responsible for writing a new ManifestList and creating a new Snapshot (returned as AddSnapshotUpdate).
    2.2. It calls update_table on the respective Catalog, which creates a new metadata.json and returns the new metadata as well as the new metadata_location.

pyiceberg-link
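
To make the flow above concrete, here is a minimal Rust sketch of it. Everything in it is a hypothetical stand-in rather than the pyiceberg or iceberg-rust API: the `InMemoryCatalog`, the `produce_snapshot` helper, the `rows: &[i64]` slice that takes the place of a DataFrame, the caller-supplied `snapshot_id`, and the made-up `metadata/v{n}.metadata.json` naming. No files are written; the sketch only shows who calls whom.

```rust
use std::collections::HashMap;

// All types here are hypothetical stand-ins, not a real Iceberg API.

/// Simplified table metadata: only what this flow needs.
#[derive(Debug, Clone, PartialEq)]
pub struct TableMetadata {
    pub version: u32,
    pub snapshot_ids: Vec<i64>,
}

/// The change a transaction asks the catalog to apply.
#[derive(Debug, Clone, PartialEq)]
pub enum TableUpdate {
    AddSnapshot {
        snapshot_id: i64,
        summary: HashMap<String, String>,
    },
}

/// Step 2.2: the catalog applies the updates and returns the new
/// metadata together with the new metadata location.
pub trait Catalog {
    fn update_table(&mut self, updates: Vec<TableUpdate>) -> (TableMetadata, String);
}

/// Toy catalog that keeps metadata in memory instead of writing metadata.json.
pub struct InMemoryCatalog {
    pub metadata: TableMetadata,
}

impl Catalog for InMemoryCatalog {
    fn update_table(&mut self, updates: Vec<TableUpdate>) -> (TableMetadata, String) {
        for update in updates {
            match update {
                TableUpdate::AddSnapshot { snapshot_id, .. } => {
                    self.metadata.snapshot_ids.push(snapshot_id);
                }
            }
        }
        self.metadata.version += 1;
        // Made-up naming scheme, for illustration only.
        let location = format!("metadata/v{}.metadata.json", self.metadata.version);
        (self.metadata.clone(), location)
    }
}

/// Step 2.1: stand-in for the snapshot producer. A real one would also
/// write manifests and a manifest list before returning the update.
pub fn produce_snapshot(
    row_count: usize,
    mut snapshot_properties: HashMap<String, String>,
    snapshot_id: i64,
) -> TableUpdate {
    snapshot_properties.insert("operation".to_string(), "append".to_string());
    snapshot_properties.insert("added-records".to_string(), row_count.to_string());
    TableUpdate::AddSnapshot {
        snapshot_id,
        summary: snapshot_properties,
    }
}

/// Steps 1 and 2: append builds the update and hands it to the catalog.
pub fn append<C: Catalog>(
    catalog: &mut C,
    rows: &[i64],
    snapshot_properties: HashMap<String, String>,
    snapshot_id: i64,
) -> (TableMetadata, String) {
    let update = produce_snapshot(rows.len(), snapshot_properties, snapshot_id);
    catalog.update_table(vec![update])
}
```

Calling `append` on a catalog at version 1 with snapshot id 42 yields metadata at version 2 that lists snapshot 42, plus the new location string.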

Here is what I think we need to implement (rough sketch):

  1. impl fn append(...) on struct Table:
    This should probably accept a RecordBatch as a param, create a new Transaction, and delegate further action to the transaction.
  2. impl fn append(...) on struct Transaction:
    Receives the RecordBatch and snapshot_properties. Performs validation checks. Converts the RecordBatch to a collection of DataFiles and creates a _MergingSnapshotProducer with the collection.
  3. impl _MergingSnapshotProducer:
    :: write manifests (added, deleted, existing)
    :: get next_sequence_number from TableMetadata
    :: update snapshot summaries
    :: generate manifest_list_path
    :: write manifest_list
    :: create a new Snapshot
    :: return TableUpdate: AddSnapshot
  4. impl update_table on the concrete Catalog implementations
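
As a rough illustration of item 3, the producer steps could look something like the following. This is only a sketch under assumptions: all types are simplified, hypothetical versions (not the existing iceberg-rust structs), the manifest and manifest-list writes are elided because they need Avro writers and file IO, the manifest list path is a simplified naming scheme, and the sequence number logic assumes a format v2 table.

```rust
use std::collections::HashMap;

// Hypothetical, simplified types for illustration only.

#[derive(Debug, Clone, PartialEq)]
pub struct DataFile {
    pub path: String,
    pub record_count: u64,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Snapshot {
    pub snapshot_id: i64,
    pub parent_snapshot_id: Option<i64>,
    pub sequence_number: i64,
    pub manifest_list: String,
    pub summary: HashMap<String, String>,
}

#[derive(Debug, Clone, PartialEq)]
pub enum TableUpdate {
    AddSnapshot(Snapshot),
}

pub struct TableMetadata {
    pub location: String,
    pub last_sequence_number: i64,
    pub current_snapshot_id: Option<i64>,
}

impl TableMetadata {
    /// Assumes a format v2 table, where each commit bumps the sequence number.
    pub fn next_sequence_number(&self) -> i64 {
        self.last_sequence_number + 1
    }
}

pub struct MergingSnapshotProducer {
    snapshot_id: i64,
    added_files: Vec<DataFile>,
    snapshot_properties: HashMap<String, String>,
}

impl MergingSnapshotProducer {
    pub fn new(
        snapshot_id: i64,
        added_files: Vec<DataFile>,
        snapshot_properties: HashMap<String, String>,
    ) -> Self {
        Self { snapshot_id, added_files, snapshot_properties }
    }

    /// Update snapshot summaries: user properties plus counters for the append.
    fn summary(&self) -> HashMap<String, String> {
        let mut summary = self.snapshot_properties.clone();
        let records: u64 = self.added_files.iter().map(|f| f.record_count).sum();
        summary.insert("operation".to_string(), "append".to_string());
        summary.insert("added-data-files".to_string(), self.added_files.len().to_string());
        summary.insert("added-records".to_string(), records.to_string());
        summary
    }

    /// Generate manifest_list_path (simplified naming, an assumption).
    fn manifest_list_path(&self, metadata: &TableMetadata) -> String {
        format!("{}/metadata/snap-{}.avro", metadata.location, self.snapshot_id)
    }

    /// Runs the steps in order and returns the AddSnapshot update.
    pub fn commit(&self, metadata: &TableMetadata) -> TableUpdate {
        // Writing manifests (added, deleted, existing) is elided here.
        let sequence_number = metadata.next_sequence_number();
        let summary = self.summary();
        let manifest_list = self.manifest_list_path(metadata);
        // Writing the manifest list to manifest_list is elided here.
        TableUpdate::AddSnapshot(Snapshot {
            snapshot_id: self.snapshot_id,
            parent_snapshot_id: metadata.current_snapshot_id,
            sequence_number,
            manifest_list,
            summary,
        })
    }
}
```

Keeping `commit` free of catalog access means the producer only needs a read-only view of TableMetadata, so it could be built and tested before update_table exists on any concrete Catalog.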

What could be possible issues here?
I think we need to start with the _MergingSnapshotProducer (possibly split into multiple parts) and work our way up the list.
Once we have the _MergingSnapshotProducer, we can implement the append function on Transaction, which basically orchestrates the other pieces.
