Skip to content

Support snapshot management operations like creating tags by adding ManageSnapshots API #728

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 30 commits into from
Jun 15, 2024

Conversation

chinmay-bhat
Copy link
Contributor

@chinmay-bhat chinmay-bhat commented May 12, 2024

Creates the public facing ManageSnapshots API that currently includes create_tag and create_branch. More operations to implement can be found in this issue - #737

Closes #573

Copy link
Collaborator

@sungwy sungwy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @chinmay-bhat - thank you very much for working on this PR. I think this will be very useful to have in PyIceberg. I left a few suggestions about your proposed implementation in the Transaction class.

In addition, I think it would be a good idea to add an API in the Table class as well that wraps this transaction API, and add some unit tests for both APIs

@chinmay-bhat
Copy link
Contributor Author

@syun64 thank you for the review!
In regards to testing, I see that the tests are all performed on the classes that are called by the APIs. None of the tests are directly on the APIs.
For example, there's already a test test_update_metadata_set_snapshot_ref that calls SetSnapshotRefUpdate which is also the class that we would call to test set_ref_snapshot.

Do I still need to create unit tests for set_ref_snapshot API in Transaction and Table?

@sungwy
Copy link
Collaborator

sungwy commented May 13, 2024

Yes - even if its small, I think it would still be good to have a unit test that verifies the behavior of the proposed table and transaction API

There are some tests in https://github.com/apache/iceberg-python/blob/main/tests/table/test_init.py#L592 that should serve as good examples of API unit tests

@chinmay-bhat
Copy link
Contributor Author

I've added the tests based on your suggestions, please review again whenever possible :)

Copy link
Collaborator

@sungwy sungwy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm! Thank you for adding the tests @chinmay-bhat 💯

@chinmay-bhat
Copy link
Contributor Author

chinmay-bhat commented May 15, 2024

Can anyone with write access review?

@sungwy
Copy link
Collaborator

sungwy commented May 15, 2024

@HonahX @Fokko

@HonahX
Copy link
Contributor

HonahX commented May 22, 2024

Thanks @chinmay-bhat for taking this and @syun64 for reviewing! Linking my comment to here, I actually prefer to make this an internal method behind other APIs. However, given that set_ref_snapshot and add_snapshot exists in 0.6.x, we may still wants to add them back and add deprecation notice on them:

def deprecated(deprecated_in: str, removed_in: str, help_message: Optional[str] = None) -> Callable: # type: ignore

@chinmay-bhat
Copy link
Contributor Author

Hi @HonahX , I've updated the PR based on your suggestions. I've made set_ref_snapshot() protected and created an inner class that takes care of snapshot management operations. In this PR, the operations include create_tag and create_branch.

While testing, I realised that we(PyIceberg) only support the main branch, ie. while using create_tag or create_branch, if the branch/tag is "main", the snapshot_log and current_snapshot_id are updated, otherwise they don't reflect any changes. I'm assuming this is because we always display the main branch snapshot_log.
I know the create_tag functions as expected because the metadata.refs[<branch_not_main>] show the expected snapshot after running create_tag. The test function is basicaly doing the above check to verify if it works.

Am I missing something here, or do we need to create a PR to get the snapshot_log, current_snapshot and other details for every branch/tag?

That said, since this PR holds the ManageSnapshots grouping which other PRs depend on, I would like to focus on getting this and the other PRs merged before exploring the branch/tag history problem.

Copy link
Contributor

@HonahX HonahX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update!

I'm assuming this is because we always display the main branch snapshot_log.

Yes you're right. According to the spec, the current-snapshot-id must be the same as the snapshot id of "main" branch. The snapshot-log should store changes to the current-snapshot of the table. Therefore, we do not need to handle these if we're not updating the "main" branch.

We will need some additional work to make pyiceberg fully support branch and tag, especially writing to other branches.

I've left some comments. Please let me know what you think. Thanks for the contribution!

@chinmay-bhat
Copy link
Contributor Author

Thank you Honah for the review! :)

I've updated the tests!

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late review, this looks like a great start @chinmay-bhat 🙌

@sungwy
Copy link
Collaborator

sungwy commented Jun 9, 2024

Hi @chinmay-bhat this looks almost ready to merge - I responded to the thread above to continue the discussion on the TableRequirements.

In the mean time, could we add a new section to the mkdocs/api.md file named "Snapshot Management" and port over the examples you already have in your docstrings so we can have it on the official docsite?

@sungwy
Copy link
Collaborator

sungwy commented Jun 9, 2024

And let's update the name of this PR as well to reflect its current state

@chinmay-bhat chinmay-bhat changed the title Support creating tags by adding set_ref_snapshot API Support snapshot management operations like creating tags by adding ManageSnapshots API Jun 9, 2024
@Fokko Fokko self-requested a review June 9, 2024 19:17
Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good @chinmay-bhat 🙌 and I agree with @syun64 that we're super close 👍

@chinmay-bhat chinmay-bhat force-pushed the support-creating-tags branch from 602eef3 to 9f068a6 Compare June 11, 2024 02:37
@chinmay-bhat
Copy link
Contributor Author

Removed AssertTableUUID from PR and rebased onto latest main!

@chinmay-bhat
Copy link
Contributor Author

@Fokko @HonahX I believe all suggestions have now been added to the PR. Can we re-trigger the tests?

Copy link
Contributor

@HonahX HonahX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chinmay-bhat Thanks for working on this and resolving all review comments! LGTM! It is a great start for snapshot management support.

@chinmay-bhat
Copy link
Contributor Author

@Fokko @HonahX Can we merge this in? Other draft PRs depend on this one 🙌

@HonahX HonahX merged commit 1dde51a into apache:main Jun 15, 2024
7 checks passed
@HonahX
Copy link
Contributor

HonahX commented Jun 15, 2024

Merged! Thanks @chinmay-bhat for the great work. Thanks @Fokko @syun64 for the review.

@chinmay-bhat
Copy link
Contributor Author

Thanks @HonahX @Fokko @syun64 for the discussions and reviews! 🫡 🙌

@sungwy
Copy link
Collaborator

sungwy commented Jun 15, 2024

This is awesome! @chinmay-bhat You are on a roll with these PRs right now, thank you for contributing these much needed features to PyIceberg! Looking forward to releasing these with the upcoming 0.7.0 release

@chinmay-bhat chinmay-bhat deleted the support-creating-tags branch June 16, 2024 14:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support creating tags
4 participants