Skip to content

Support loading custom catalog implementation #938

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jackye1995 opened this issue Jul 17, 2024 · 10 comments · Fixed by #947
Closed

Support loading custom catalog implementation #938

jackye1995 opened this issue Jul 17, 2024 · 10 comments · Fixed by #947
Assignees

Comments

@jackye1995
Copy link

Feature Request / Improvement

We have a customer asking about the ability to load a custom catalog implementation, similar to how Java Iceberg works today by loading a custom Java class path. The customer understands that the community would like to go with REST catalog in general, but they are not ready to onboard to that yet for at least a year or two.

I see something similar done in FileIO path. What is our general thoughts around this topic? Are we open to add similar support for catalog when doing load_catalog?

@jackye1995
Copy link
Author

cc @Fokko @HonahX @syun64 @kevinjqliu

@kevinjqliu
Copy link
Contributor

I like the option to load a custom catalog implementation. As long as the implementation subclasses Catalog and have all the necessary functions implemented. We would also add a new (CatalogType)

It'll be great if we have a set of tests to encode specific catalog behaviors, similar to #813

On a meta level, the customer can also "wrap" their custom catalog implementation with the REST protocol. Similar to how the iceberg-rest-image is implemented

@jackye1995
Copy link
Author

the customer can also "wrap" their custom catalog implementation with the REST protocol. Similar to how the iceberg-rest-image is implemented

Yes we brought up that solution, but they do not want to maintain this additional layer of indirection that could cause additional availability, latency, security issues.

@sungwy
Copy link
Collaborator

sungwy commented Jul 18, 2024

Hi @jackye1995 - I'm of the opinion that this is totally fine. If the client wants to have their own catalog implementation for any of many possible reasons (e.g. because it requires a specific auth mechanism) they should be allowed to.

I'm actually wondering what about the state of code right now, would prevent them from having their own implementation. I want to hold off on giving a definite answer before actually trying this out, but I think even now, they should be able to implement the Catalog or MetastoreCatalog classes and use their own Catalog. load_catalog is just a helper method that helps set up a Catalog using the properties, so it should be possible for users to implement their Catalog if they side load the custom catalog specific properties outside of the PyIceberg properties.

But maybe there is value allowing load_catalog to parse properties for custom catalogs for ease of use.

@HonahX
Copy link
Contributor

HonahX commented Jul 18, 2024

But maybe there is value allowing load_catalog to parse properties for custom catalogs for ease of use.

+1. I think it is beneficial for load_catalog to support additional methods for fetching catalog implementations and initializing them with properties parsed from ~/.pyiceberg.yaml, PYICEBERG__* environment variables, and parameters would be highly beneficial.

For example, it would allow users to define custom catalogs in the following way:

custom:
    py-catalog-impl: example.catalog.ExampleCatalog
    custom.property-a: a
    custom.property-b: b

Then, they can load the catalog in their code with:

catalog = load_catalog("custom")

IMO, load_catalog should serve as the primary entry point for users to initialize catalog instances. Extending its capabilities to support loading custom implementations could make non-rest private catalog implementations also benefit from the util.

Inspired by the discussions of custom implementation of Catalog interface, I think we can also consider to make _commit_table public

@abstractmethod
def _commit_table(self, table_request: CommitTableRequest) -> CommitTableResponse:
"""Update one or more tables.
Args:
table_request (CommitTableRequest): The table requests to be carried out.
Returns:
CommitTableResponse: The updated metadata.
Raises:
NoSuchTableError: If a table with the given identifier does not exist.
CommitFailedException: Requirement not met, or a conflict with a concurrent commit.
CommitStateUnknownException: Failed due to an internal exception on the side of the catalog.
"""

_commit_table is crucial if user wants to use the custom catalog implementation to update table. Although in python there is no real "private" method, making this one public could make it clear that custom catalog need to implement this one to enable the usage of transaction and other places where table updates take place. On the doc side, it will be good if _commit_table could show up in the Catalog Code Reference.

We may also add some guide like Java Custom Catalog, especially for _commit_table which is very different from Java Catalog's commit

Would love to hear more thoughts on these! Thanks for all the valuable discussions above.

@sungwy sungwy added this to the PyIceberg 0.8.0 release milestone Jul 19, 2024
@Fokko
Copy link
Contributor

Fokko commented Jul 19, 2024

Yes, I think that would be a nice addition. As Jack already pointed out, we already have dynamic loading of file-io's using py-io-impl, so loading a catalog would be relatively easy to add 👍

You could also inherit from an existing catalog if you want to extend certain parts of it (for example, the authentication mechanism as Sung pointed out). I'm all for it 👍

@jackye1995 jackye1995 self-assigned this Jul 20, 2024
@jackye1995
Copy link
Author

I can give a try for it, assigning to myself.

@jayceslesar
Copy link
Contributor

jayceslesar commented Jul 23, 2024

Would it hurt to add some checks to the custom catalog implementation passed in to ensure that a core suite of methods are provided in the custom implementation? I can imagine a user creating a custom catalog with not everything implemented and really shooting themselves in the foot. I suppose why not enforce the custom catalog to inherit from the abstract Catalog class?

@kevinjqliu
Copy link
Contributor

I think behavior checking for custom catalogs is a nice-to-have feature, but ultimately it is up to the implementation to make sure the catalog behaves correctly.
That said, it would be nice to make it easier to check a set of expected behaviors and to standardize on a set of base behaviors for all catalogs, similar to #813

@jackye1995
Copy link
Author

The type hint def _import_catalog(name: str, catalog_impl: str, properties: Properties) -> Optional[Catalog]: suggests that it should inherit the Catalog class, but it is not enforced at this moment. Are you suggesting to add a check to enforce that?

Personally speaking I prefer to not check it. I see it as an advantage of Python vs strictly typed languages like Java to allow not inheriting a interface during dynamic loading, so it is more flexible for people that could leverage it. But I would imagine most people with custom implementation is inheriting the Catalog class anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants