Iceberg-rust Write support
I've noticed a lot of interest in write support in Iceberg-rust. This issue aims to break this down into smaller pieces so they can be picked up in parallel.
Appetizers
If you're not super familiar with the codebase, feel free to pick up one of the appetizers below. They may or may not be related to the write path, but they are good things to get in, and a good way to get to know the codebase:
- Able to parse name-mapping into a recursive structure. #723
- Extend the e2e tests with append a partitioned file #720
- Extend the `DataFileWriterBuilder` tests #726
- Sort Order Replacement API #734
Commit path
The commit path entails writing a new metadata JSON.
- Applying updates to the metadata Updating the metadata is important both for writing a new version of the JSON in the case of a non-REST catalog, and for keeping an up-to-date version in memory. It is recommended to re-use the updates/requirements objects provided by the REST catalog protocol; PyIceberg uses a similar approach.
- REST catalog Serializes the updates and requirements into JSON, which is dispatched to the REST catalog. Done in feat: Implement create table and update table api for rest catalog. #97.
- Other catalogs For the other catalogs, instead of dispatching the updates/requirements to the catalog, there are additional steps:
- Logic to validate the requirements against the metadata, to detect commit conflicts. A lot of this logic is already being implemented by TableMetadataBuilder #587.
- Writing a new version of the metadata.json to the object store, taking into account the naming mentioned in the spec.
- Provide locking mechanisms within the commit (Glue, Hive, SQL, ..) so the atomic swap happens safely.
- SQL Conflict detection looks to be missing. I was expecting logic there that checks whether any rows were affected (if not, another process has altered the table).
- Glue Not yet implemented.
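The conflict-detection and file-naming steps above can be sketched as follows. Note that `TableRequirement`, `TableMetadata`, and the helper names here are simplified, illustrative stand-ins rather than the real iceberg-rust API, and the `<version>-<uuid>.metadata.json` naming is the common metastore-catalog convention:

```rust
// Simplified, illustrative types: the real crate models these differently.
enum TableRequirement {
    // The table's UUID must not have changed.
    AssertTableUuid { uuid: String },
    // The snapshot a named ref points to must match (None = ref must not exist).
    AssertRefSnapshotId { reference: String, snapshot_id: Option<i64> },
}

struct TableMetadata {
    table_uuid: String,
    main_snapshot_id: Option<i64>,
}

// Check every requirement against freshly loaded metadata; any mismatch
// means another writer committed in between, i.e. a commit conflict.
fn check_requirements(meta: &TableMetadata, reqs: &[TableRequirement]) -> Result<(), String> {
    for req in reqs {
        match req {
            TableRequirement::AssertTableUuid { uuid } => {
                if meta.table_uuid != *uuid {
                    return Err(format!("table UUID changed to {}", meta.table_uuid));
                }
            }
            TableRequirement::AssertRefSnapshotId { reference, snapshot_id } => {
                if reference.as_str() == "main" && meta.main_snapshot_id != *snapshot_id {
                    return Err("concurrent commit detected on ref 'main'".to_string());
                }
            }
        }
    }
    Ok(())
}

// New metadata files are commonly named `<version>-<uuid>.metadata.json`
// (zero-padded version), written to the table's metadata directory.
fn new_metadata_path(metadata_dir: &str, version: u64, uuid: &str) -> String {
    format!("{metadata_dir}/{version:05}-{uuid}.metadata.json")
}

fn main() {
    let meta = TableMetadata { table_uuid: "9c12d441".into(), main_snapshot_id: Some(1) };
    let reqs = [TableRequirement::AssertTableUuid { uuid: "9c12d441".into() }];
    assert!(check_requirements(&meta, &reqs).is_ok());
    println!("{}", new_metadata_path("s3://bucket/tbl/metadata", 2, "1a2b3c"));
}
```

If every requirement passes, the catalog still needs the atomic swap (lock or compare-and-swap) mentioned above, since another writer may commit between the check and the write.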
- Commit semantics
- MergeAppend appends new manifest list entries to existing manifest files. Reduces the amount of metadata produced, but takes some more time to commit since existing metadata has to be rewritten, and retries are also more costly: Support for MergeAppend #736
- Initial defaults This makes it easier when merge-appending V1 metadata: Implement `initial-default` #737
- FastAppend Generates a new manifest per commit, which allows fast commits, but generates more metadata in the long run. PR by @ZENOTME in feat: support append data file and add e2e test #349
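The trade-off between the two strategies can be illustrated with a toy model (this is not the real writer API): every commit writes one new manifest, and merge-append additionally compacts once more than a threshold of manifests accumulate:

```rust
// Toy model, not the real writer API: track the number of manifest files
// a table accumulates under each strategy.

// Fast append: always write one new manifest, never touch existing ones.
fn fast_append(manifests: &mut Vec<usize>, new_entries: usize) {
    manifests.push(new_entries);
}

// Merge append: write the new manifest, then compact all small manifests
// into one once more than `merge_min_count` exist. The compaction rewrites
// existing metadata (slower commits, costlier retries), but keeps the
// total amount of metadata bounded.
fn merge_append(manifests: &mut Vec<usize>, new_entries: usize, merge_min_count: usize) {
    manifests.push(new_entries);
    if manifests.len() > merge_min_count {
        let total: usize = manifests.drain(..).sum();
        manifests.push(total);
    }
}

fn main() {
    let (mut fast, mut merged) = (Vec::new(), Vec::new());
    for _ in 0..10 {
        fast_append(&mut fast, 1);
        merge_append(&mut merged, 1, 3);
    }
    // Ten commits: fast append leaves 10 manifests, merge append far fewer.
    println!("fast: {} manifests, merge: {} manifests", fast.len(), merged.len());
}
```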
- Snapshot generation Manipulation of data within a table is done by appending snapshots to the metadata JSON.
- APPEND Only data files were added and no files were removed. Similar to `add_files`.
- REPLACE Data and delete files were added and removed without changing table data; i.e., compaction, changing the data file format, or relocating data files.
- OVERWRITE Data and delete files were added and removed in a logical overwrite operation.
- DELETE Data files were removed and their contents were logically deleted and/or delete files were added to delete rows.
- Add files to add existing Parquet files to a table. Issue in Add files to add existing Parquet files to a table #932
- Name mapping in case the files don't have field-IDs set.
- Summary generations Part of the snapshot that indicates what's in the snapshot: Generation of Snapshot Summaries #724
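A minimal sketch of summary generation for an append snapshot is shown below, using summary property keys from the Iceberg spec ("operation", "added-data-files", and so on); the `DataFile` struct and the function name are illustrative, not the crate's API:

```rust
use std::collections::HashMap;

// Illustrative subset of a data file's stats; names are not the crate's API.
struct DataFile {
    record_count: u64,
    file_size_in_bytes: u64,
}

// Build the snapshot summary for an append, using property keys from the
// Iceberg spec ("operation", "added-data-files", "added-records", ...).
fn append_summary(
    added: &[DataFile],
    prev_total_records: u64,
    prev_total_files: u64,
) -> HashMap<String, String> {
    let added_records: u64 = added.iter().map(|f| f.record_count).sum();
    let added_size: u64 = added.iter().map(|f| f.file_size_in_bytes).sum();
    let mut summary = HashMap::new();
    summary.insert("operation".to_string(), "append".to_string());
    summary.insert("added-data-files".to_string(), added.len().to_string());
    summary.insert("added-records".to_string(), added_records.to_string());
    summary.insert("added-files-size".to_string(), added_size.to_string());
    summary.insert(
        "total-data-files".to_string(),
        (prev_total_files + added.len() as u64).to_string(),
    );
    summary.insert(
        "total-records".to_string(),
        (prev_total_records + added_records).to_string(),
    );
    summary
}

fn main() {
    let files = [
        DataFile { record_count: 100, file_size_in_bytes: 4096 },
        DataFile { record_count: 50, file_size_in_bytes: 2048 },
    ];
    let summary = append_summary(&files, 1000, 10);
    println!("{}", summary["added-records"]); // "150"
}
```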
- Metrics collection There are two situations:
- Collect metrics when writing This is done in the Java API, where during writing the upper and lower bounds are tracked and the numbers of null and NaN records are counted. Most of this is in, except the NaN counts: Implement nan_value_counts && distinct_counts metrics in parquet writer #417
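The per-column bookkeeping for these metrics can be sketched like this; `ColumnMetrics` is a hypothetical type, and a real writer keeps one such accumulator per Parquet column and serializes the bounds as bytes:

```rust
// Hypothetical per-column accumulator; a real writer keeps one of these per
// Parquet column and serializes the bounds as byte arrays.
#[derive(Default)]
struct ColumnMetrics {
    value_count: u64,
    null_count: u64,
    nan_count: u64,
    lower_bound: Option<f64>,
    upper_bound: Option<f64>,
}

impl ColumnMetrics {
    fn update(&mut self, value: Option<f64>) {
        self.value_count += 1;
        match value {
            None => self.null_count += 1,
            // NaN is counted separately and must never become a bound.
            Some(v) if v.is_nan() => self.nan_count += 1,
            Some(v) => {
                self.lower_bound = Some(self.lower_bound.map_or(v, |l| l.min(v)));
                self.upper_bound = Some(self.upper_bound.map_or(v, |u| u.max(v)));
            }
        }
    }
}

fn main() {
    let mut metrics = ColumnMetrics::default();
    for v in [Some(1.5), None, Some(f64::NAN), Some(-3.0)] {
        metrics.update(v);
    }
    println!(
        "nulls={} nans={} bounds={:?}..{:?}",
        metrics.null_count, metrics.nan_count, metrics.lower_bound, metrics.upper_bound
    );
}
```

Keeping NaN out of the bounds matters because query engines use the lower/upper bounds for file pruning, and a NaN bound would make comparisons meaningless.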
Related operations
These are not on the critical path to enable writes, but are related to it:
- Update table properties Sets properties on the table. Probably the best one to start with, since it doesn't require a complicated API.
- Schema evolution API to update the schema, and produce new metadata.
- Having the SchemaUpdate API to evolve the schema without the user having to worry about field-IDs: Add `SchemaUpdate` logic to Iceberg-Rust #697
- Add `unionByName` to easily union two schemas, to provide easy schema evolution: Update a TableSchema from a Schema #698
- Partition spec evolution API to update the partition spec, and produce new metadata: Partition Spec Evolution API #732.
- Sort order evolution API to update the sort order, and produce new metadata: Sort Order Replacement API #734.
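The core of a union-by-name operation can be sketched on a flat schema: fields already present keep their IDs, and fields only in the incoming schema get freshly assigned IDs, so the caller never hands out field-IDs directly. `Field` here is a toy type; real Iceberg schemas are nested:

```rust
// Toy flat schema; real Iceberg schemas are nested structs.
#[derive(Clone)]
struct Field {
    id: i32,
    name: String,
    type_name: String,
}

// Union `other` into `base` by name: fields already in `base` keep their
// IDs, fields only in `other` get freshly assigned IDs. The caller never
// assigns field IDs directly.
fn union_by_name(base: &[Field], other: &[Field]) -> Vec<Field> {
    let mut result = base.to_vec();
    let mut next_id = base.iter().map(|f| f.id).max().unwrap_or(0) + 1;
    for field in other {
        if !result.iter().any(|f| f.name == field.name) {
            result.push(Field { id: next_id, ..field.clone() });
            next_id += 1;
        }
    }
    result
}

fn main() {
    let base = vec![Field { id: 1, name: "id".into(), type_name: "long".into() }];
    let other = vec![
        Field { id: 7, name: "id".into(), type_name: "long".into() },
        Field { id: 8, name: "ts".into(), type_name: "timestamptz".into() },
    ];
    let merged = union_by_name(&base, &other);
    // "ts" is new and gets id 2; "id" keeps id 1.
    println!("{}", merged.len()); // 2
}
```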
Metadata tables
Metadata tables are used to inspect the table. Having these tables also allows easy implementation of the maintenance procedures since you can easily list all the snapshots, and expire the ones that are older than a certain threshold.
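On top of such a snapshot listing, expiration could look like the sketch below; `Snapshot` and `expire_snapshots` are illustrative names, and the sketch assumes the common rule that the current (latest) snapshot is always retained:

```rust
// Illustrative types; assumes the common expire-snapshots rule that the
// current (latest) snapshot is always retained.
struct Snapshot {
    snapshot_id: i64,
    timestamp_ms: i64,
}

// Return the IDs of snapshots older than `older_than_ms`, never including
// the most recent snapshot so the table stays readable.
fn expire_snapshots(snapshots: &[Snapshot], older_than_ms: i64) -> Vec<i64> {
    let latest = snapshots.iter().map(|s| s.timestamp_ms).max();
    snapshots
        .iter()
        .filter(|s| s.timestamp_ms < older_than_ms && Some(s.timestamp_ms) != latest)
        .map(|s| s.snapshot_id)
        .collect()
}

fn main() {
    let snapshots = [
        Snapshot { snapshot_id: 1, timestamp_ms: 100 },
        Snapshot { snapshot_id: 2, timestamp_ms: 200 },
        Snapshot { snapshot_id: 3, timestamp_ms: 300 },
    ];
    println!("{:?}", expire_snapshots(&snapshots, 250)); // [1, 2]
}
```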
Integration Tests
Integration tests with other engines, such as Spark.
Contribute
If you want to contribute to the upcoming milestone, feel free to comment on this issue. If there is anything unclear or missing, feel free to reach out here as well 👍