Skip to content

Update table metadata throughout transaction #471

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Feb 29, 2024

Conversation

Fokko
Copy link
Contributor

@Fokko Fokko commented Feb 25, 2024

This PR adds support for updating the table metadata throughout the transaction.

This way, if a schema is first evolved, and then a snapshot is created based on the latest schema, it will be able to find the schema.

Fixes #454

This PR add support for updating the table metadata throughout
the transaction.

This way, if a schema is first evolved, and then a snapshot is
created based on the latest schema, it will be able to find the
schema.
@HonahX HonahX self-requested a review February 26, 2024 01:20
Copy link
Contributor

@HonahX HonahX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this! @Fokko It looks great. I left some comments but most of them are very small

self._updates = actions or ()
self._requirements = requirements or ()
self._table_metadata = table.metadata
self._autocommit = autocommit
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Minor] May be auto_commit is more consistent with other names in pyiceberg?

Also, shall we add doc to explain the usage of this new option. We could probably update the doc when we finishing all modifications of Transaction related code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was on the brink here :D, mostly since the Wikipedia page is also without space: https://en.wikipedia.org/wiki/Autocommit

The auto-commit is currently not exposed to the user because on the table itself the option is not exposed: def transaction(self) -> Transaction:. This is only used when you do the shorthand:

# This will create an autocommit transaction:
tbl.update_schema()
# This one you need to commit yourself:
txn = tbl.transaction()

...

txn.commit()

I've added Pythondoc for clarification. Let me know if satisfies the need for docs 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation! I think the pythondoc is good enough in this case.


def _append_requirements(self, *new_requirements: TableRequirement) -> Transaction:
"""Append requirements to the set of staged requirements.
self._table_metadata = update_table_metadata(self._table_metadata, updates)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now there will be 2 updated metadata after the transaction commits: _table_metadata in the transaction and the new metadata of the table. I think it may be good to add a test to ensure that the 2 metadata are the same after the transaction. This could help verify the correctness of the internal _table_metadata WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I also looked into this, but this breaks some of the APIs. The commit accepts a CommitTableRequest instead of the transaction. For example, for the REST catalog, we don't apply the changes to the table metadata.

Copy link
Contributor

@HonahX HonahX Feb 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to check the metadata equality like:

with tbl.transaction() as txn:
  ...
assert tbl.metadata == txn.metadata

so that it can help verify that transaction accumulates the metadata change in the same way as catalog does.

Of course, this is totally optional since we use the same method to update the metadata in transaction and catalog._commit_table

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that in the Transaction, we call the _commit method on the table, and we don't carry through the metadata there. For the REST catalog, this is not relevant, for the catalogs where you have to construct the metadata yourself, it is. I like these additional checks, but I'm unsure if we want to change the API to pass this through.

Copy link
Contributor

@HonahX HonahX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for the great work!

@Fokko Fokko merged commit 6708a6e into apache:main Feb 29, 2024
@Fokko Fokko deleted the fd-stack-changes-in-transaction branch February 29, 2024 12:10
@Fokko
Copy link
Contributor Author

Fokko commented Feb 29, 2024

Thanks @HonahX for the review 🙌

himadripal pushed a commit to himadripal/iceberg-python that referenced this pull request Mar 1, 2024
* Update table metadata throughout transaction

This PR add support for updating the table metadata throughout
the transaction.

This way, if a schema is first evolved, and then a snapshot is
created based on the latest schema, it will be able to find the
schema.

* Fix integration tests

* Thanks Honah!

* Include the partition evolution

* Cleanup
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement incremental updating of table metadata
2 participants