-
Notifications
You must be signed in to change notification settings - Fork 291
Update table metadata throughout transaction #471
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This PR add support for updating the table metadata throughout the transaction. This way, if a schema is first evolved, and then a snapshot is created based on the latest schema, it will be able to find the schema.
…-changes-in-transaction
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this! @Fokko It looks great. I left some comments but most of them are very small
self._updates = actions or () | ||
self._requirements = requirements or () | ||
self._table_metadata = table.metadata | ||
self._autocommit = autocommit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Minor] May be auto_commit
is more consistent with other names in pyiceberg?
Also, shall we add doc to explain the usage of this new option. We could probably update the doc when we finishing all modifications of Transaction related code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was on the brink here :D, mostly since the Wikipedia page is also without space: https://en.wikipedia.org/wiki/Autocommit
The auto-commit is currently not exposed to the user because on the table itself the option is not exposed: def transaction(self) -> Transaction:
. This is only used when you do the shorthand:
# This will create an autocommit transaction:
tbl.update_schema()
# This one you need to commit yourself:
txn = tbl.transaction()
...
txn.commit()
I've added Pythondoc for clarification. Let me know if satisfies the need for docs 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation! I think the pythondoc is good enough in this case.
pyiceberg/table/__init__.py
Outdated
|
||
def _append_requirements(self, *new_requirements: TableRequirement) -> Transaction: | ||
"""Append requirements to the set of staged requirements. | ||
self._table_metadata = update_table_metadata(self._table_metadata, updates) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now there will be 2 updated metadata after the transaction commits: _table_metadata
in the transaction and the new metadata of the table. I think it may be good to add a test to ensure that the 2 metadata are the same after the transaction. This could help verify the correctness of the internal _table_metadata
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I also looked into this, but this breaks some of the APIs. The commit accepts a CommitTableRequest
instead of the transaction. For example, for the REST catalog, we don't apply the changes to the table metadata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to check the metadata equality like:
with tbl.transaction() as txn:
...
assert tbl.metadata == txn.metadata
so that it can help verify that transaction accumulates the metadata change in the same way as catalog does.
Of course, this is totally optional since we use the same method to update the metadata in transaction and catalog._commit_table
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that in the Transaction
, we call the _commit
method on the table, and we don't carry through the metadata there. For the REST catalog, this is not relevant, for the catalogs where you have to construct the metadata yourself, it is. I like these additional checks, but I'm unsure if we want to change the API to pass this through.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for the great work!
…-changes-in-transaction
…-changes-in-transaction
Thanks @HonahX for the review 🙌 |
* Update table metadata throughout transaction This PR add support for updating the table metadata throughout the transaction. This way, if a schema is first evolved, and then a snapshot is created based on the latest schema, it will be able to find the schema. * Fix integration tests * Thanks Honah! * Include the partition evolution * Cleanup
This PR adds support for updating the table metadata throughout the transaction.
This way, if a schema is first evolved, and then a snapshot is created based on the latest schema, it will be able to find the schema.
Fixes #454