Support Nessie catalog #19


Closed
Fokko opened this issue Oct 2, 2023 · 28 comments

Comments

@Fokko
Contributor

Fokko commented Oct 2, 2023

Feature Request / Improvement

PyIceberg has added support for the Glue catalog. We need support for the Nessie catalog too, just like the Hive, Glue, and REST catalogs.

Migrated from apache/iceberg#6414

@zeddit

zeddit commented Oct 19, 2023

Looking forward to this feature so that we can conduct testing.

@seunggs

seunggs commented Feb 9, 2024

Any update on supporting the Nessie catalog?

@ajantha-bhat
Member

@jbonofre might take it up after java 1.5.0 release.

@fraibacas

@ajantha-bhat Any rough idea about when this will be available? thanks!

@RobPrat

RobPrat commented Mar 19, 2024

I would also like to know whether this is expected to be worked on soon; I'd find it very useful. Thanks!

@alonahmias

Hi, we would like to contribute to this issue. Is that possible?

@Fokko
Contributor Author

Fokko commented Jun 3, 2024

It looks like Nessie has announced REST catalog support. This would make a native Nessie integration redundant.

@dimas-b

dimas-b commented Jun 3, 2024

At the moment, Nessie has the Iceberg REST API on main, but it's not released yet.

@chayalipy

Is there a release date?

@dimas-b

dimas-b commented Jun 3, 2024

It might be best to talk about Nessie releases in the project's Zulip chat (the join link is on projectnessie.org) :)

@dimas-b

dimas-b commented Jun 21, 2024

Nessie 0.90.2 and later support the Iceberg REST Catalog API.
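Concretely, that means PyIceberg can treat Nessie as an ordinary REST catalog. A minimal configuration sketch, assuming a default local Nessie on port 19120 and the `main` branch (the actual `load_catalog` call is left commented out because it contacts the server):

```python
# Sketch: PyIceberg config for Nessie's Iceberg REST endpoint.
# The host, port 19120, and the "main" branch segment are assumptions
# for a default local Nessie deployment.
nessie_rest = {
    "type": "rest",
    # Nessie serves the Iceberg REST API under /iceberg/<ref>
    "uri": "http://localhost:19120/iceberg/main",
}

# With a running server this would become:
#   from pyiceberg.catalog import load_catalog
#   catalog = load_catalog("nessie", **nessie_rest)
print(nessie_rest["uri"])
```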

@jbonofre
Member

I think this issue can be considered fixed, thanks to the Iceberg REST Catalog API support in Nessie.

@Fokko
Contributor Author

Fokko commented Jun 21, 2024

@dimas-b Thanks for the update here, and I agree with @jbonofre, let's close this issue!

@Fokko Fokko closed this as completed Jun 21, 2024
@cee-shubham

I want to create Iceberg tables using PyIceberg and store them in a MinIO store. For this I have created Docker containers for the following services: nessie, minio, dremio.
Earlier I was using PySpark and was able to create tables with this code:
```python
import os

import pyspark
from pyspark.sql import SparkSession

# DEFINE SENSITIVE VARIABLES
NESSIE_URI = "http://nessie:19120/api/v1"
MINIO_ACCESS_KEY = "my_access_key"
MINIO_SECRET_KEY = "my_secret_access_key"

conf = (
    pyspark.SparkConf()
    .setAppName('app_name')
    # packages
    .set('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.3.1,org.projectnessie.nessie-integrations:nessie-spark-extensions-3.3_2.12:0.67.0,software.amazon.awssdk:bundle:2.17.178,software.amazon.awssdk:url-connection-client:2.17.178')
    # SQL extensions
    .set('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,org.projectnessie.spark.extensions.NessieSparkSessionExtensions')
    # configuring the catalog
    .set('spark.sql.catalog.nessie', 'org.apache.iceberg.spark.SparkCatalog')
    .set('spark.sql.catalog.nessie.uri', NESSIE_URI)
    .set('spark.sql.catalog.nessie.ref', 'main')
    .set('spark.sql.catalog.nessie.authentication.type', 'NONE')
    .set('spark.sql.catalog.nessie.catalog-impl', 'org.apache.iceberg.nessie.NessieCatalog')
    .set('spark.sql.catalog.nessie.warehouse', 's3a://warehouse')
    .set('spark.sql.catalog.nessie.s3.endpoint', 'http://minio:9000')
    .set('spark.sql.catalog.nessie.io-impl', 'org.apache.iceberg.aws.s3.S3FileIO')
    # MinIO credentials
    .set('spark.hadoop.fs.s3a.access.key', MINIO_ACCESS_KEY)
    .set('spark.hadoop.fs.s3a.secret.key', MINIO_SECRET_KEY)
)

# Start the Spark session
spark = SparkSession.builder.config(conf=conf).getOrCreate()
print("Spark Running")

# Load a CSV into a SQL view
csv_df = spark.read.format("csv").option("header", "true").load("../datasets/df_open_2023.csv")
csv_df.createOrReplaceTempView("csv_open_2023")

# Create an Iceberg table from the SQL view
spark.sql("CREATE TABLE IF NOT EXISTS nessie.df_open_2023 USING iceberg AS SELECT * FROM csv_open_2023").show()

# Query the Iceberg table
spark.sql("SELECT * FROM nessie.df_open_2023 LIMIT 10").show()
```

Please tell me how to do this with PyIceberg.

@XN137

XN137 commented Sep 27, 2024

Please tell me how to do it with pyiceberg

Generally speaking, you use the REST catalog. These docs may help:
https://py.iceberg.apache.org/configuration/#rest-catalog
https://kevinjqliu.substack.com/i/147257480/connect-to-the-rest-catalog

Running the Nessie server:
https://projectnessie.org/guides/iceberg-rest/

@XN137

XN137 commented Sep 27, 2024

The RestCatalog class seems to live in pyiceberg.catalog.rest:

class RestCatalog(Catalog):

However, according to https://py.iceberg.apache.org/api/,
one is now supposed to use something like:

from pyiceberg.catalog import load_catalog

catalog = load_catalog("rest", <optional_config_dict>)

@cee-shubham

I encountered an issue while using the load_catalog() method; it raised the following error:
load_catalog() takes from 0 to 1 positional arguments but 2 were given

To address this, I attempted to use load_rest("rest", <config_dict>), but I encountered a validation issue in the ConfigResponse model while working with the RestCatalog from PyIceberg. It seems that the defaults and overrides fields are required in the ConfigResponse model, but the Nessie REST API is not responding with these fields as expected.

Even after passing them explicitly in the response, I am still getting a validation error.

@sean-pasabi

sean-pasabi commented Oct 15, 2024

@cee-shubham I am having a similar issue. If someone has managed to load a Nessie catalog using pyiceberg's RestCatalog, that would be greatly appreciated.

@edgarrmondragon
Contributor

@sean-pasabi I was able to get pyiceberg working with REST catalog exposed by Nessie, at least as a proof of concept: https://github.com/edgarrmondragon/-learn-iceberg-nessie

@sean-pasabi

@edgarrmondragon I have a similar .pyiceberg.yaml, but without the token. I am using MinIO, which would require additional work to add some sort of OAuth2 flow, and I would be surprised if that were the issue. Can your example run without the token, or is it required?

@cee-shubham

cee-shubham commented Oct 16, 2024

@edgarrmondragon I have followed your code, and while the namespace and table were successfully created and are visible in the MinIO bucket, I encountered an error when appending data to the table. The error is related to AWS access permissions, specifically an "ACCESS_DENIED" issue during a HeadObject operation. Below is the relevant error message:
OSError: When getting information for key 'demo2/taxi_dataset_f684e603-b914-4f6b-91db-b9f86a2846b3' in bucket 'demobucket': AWS Error ACCESS_DENIED during HeadObject operation: No response body.

@sean-pasabi

Hey @cee-shubham, did you mean @edgarrmondragon? I haven't given any code.

@alsugiliazova

@cee-shubham Have you fixed it somehow? I've run into the same issue.

@gmweaver

gmweaver commented Feb 7, 2025

I ran into the same issue as @cee-shubham, and my initial guess is that S3 bucket authentication is not using the S3 keys configured on the server, but is instead trying to use local S3 creds/keys.

I confirmed that the S3 keys I have on my Nessie server have access to the bucket by running a similar append operation via Spark.

Example code with output:

>>> catalog = load_catalog(
...     "nessie",
...     **{
...         "uri": "http://nessie:19120/iceberg/main",
...     },
... )
>>> 
>>> print(catalog.list_namespaces())
[('demo',)]
>>> schema = Schema(
...     NestedField(1, "id", IntegerType(), required=True),
...     NestedField(2, "name", StringType(), required=False),
... )
>>> 
>>> catalog.create_table("demo.test_pyiceberg_table", schema)
test_pyiceberg_table(
  1: id: required int,
  2: name: optional string
),
partition by: [],
sort order: [],
snapshot: null
>>> table = catalog.load_table("demo.test_pyiceberg_table")
>>> table.scan().to_pandas()
Empty DataFrame
Columns: [id, name]
Index: []
>>> data = pa.Table.from_pydict(
...     {
...         "id": np.array([1, 2, 3], dtype="int32"),
...         "name": ["Alice", "Bob", "Charlie"],
...     },
...     schema=schema.as_arrow(),
... )
>>> 
>>> table.append(data)
Traceback (most recent call last):
  File "/Users/garrett.weaver/Library/Caches/pypoetry/virtualenvs/testing-py3.12/lib/python3.12/site-packages/s3fs/core.py", line 114, in _error_wrapper
    return await func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/garrett.weaver/Library/Caches/pypoetry/virtualenvs/testing-py3.12/lib/python3.12/site-packages/aiobotocore/client.py", line 412, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the PutObject operation: Forbidden

Similar code on Spark works:

SPARK_PACKAGES = [
    "org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.7.1",
    "org.projectnessie.nessie-integrations:nessie-spark-extensions-3.4_2.12:0.99.0",
    "software.amazon.awssdk:bundle:2.20.126",
    "software.amazon.awssdk:url-connection-client:2.20.126",
]

SPARK_SQL_EXTENSIONS = [
    "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "org.projectnessie.spark.extensions.NessieSparkSessionExtensions",
]

SPARK_CONFIG = {
    "spark.jars.packages": ",".join(SPARK_PACKAGES),
    "spark.sql.extensions": ",".join(SPARK_SQL_EXTENSIONS),
    "spark.sql.catalog.nessie": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.nessie.uri": "http://nessie:19120/iceberg/main",
    "spark.sql.catalog.nessie.type": "rest",
}

spark = SparkSession.builder.config(map=SPARK_CONFIG).getOrCreate()

>>> spark.sql(
...     """
...     CREATE TABLE IF NOT EXISTS nessie.demo.test_spark_table (
...         id INTEGER, 
...         name STRING
...     ) USING iceberg
...     """
... )
DataFrame[]
>>> 
>>> spark.sql(
...     """
...     INSERT INTO nessie.demo.test_spark_table VALUES
...     (1, 'Alice'),
...     (2, 'Bob')
...     """
... ).show()
++                                                                              
||
++
++
>>> spark.read.format("iceberg").load("nessie.demo.test_spark_table").show()
+---+-----+                                                                     
| id| name|
+---+-----+
|  1|Alice|
|  2|  Bob|
+---+-----+

@gmweaver

gmweaver commented Feb 7, 2025

I confirmed that it uses the S3 config set on the catalog rather than the one configured on the Nessie server; the following works:

catalog = load_catalog(
    "nessie",
    **{
        "uri": "http://nessie:19120/iceberg/main",
        "warehouse": "warehouse",
        "s3.endpoint": "https://mys3endpoint.com/",
        "s3.access-key-id": os.environ["AWS_ACCESS_KEY_ID"],
        "s3.secret-access-key": os.environ["AWS_SECRET_ACCESS_KEY"],
    },
)
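For anyone who prefers file-based configuration: as far as I know, the same properties can live in a `~/.pyiceberg.yaml` so that `load_catalog("nessie")` needs no inline dict. A sketch with the same placeholder endpoint (the keys are placeholders, not real values):

```yaml
catalog:
  nessie:
    uri: http://nessie:19120/iceberg/main
    warehouse: warehouse
    s3.endpoint: https://mys3endpoint.com/
    s3.access-key-id: <access key>
    s3.secret-access-key: <secret key>
```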

@adamcodes716

catalog = load_catalog(
    "nessie",
    **{
        "uri": "http://nessie:19120/iceberg/main",
        "warehouse": "warehouse",
        "s3.endpoint": "https://mys3endpoint.com/",
        "s3.access-key-id": os.environ["AWS_ACCESS_KEY_ID"],
        "s3.secret-access-key": os.environ["AWS_SECRET_ACCESS_KEY"],
    },
)
Can you please help put me out of my misery? If I try this setup, I get a 404 because the URL is mangled:

/iceberg/main/v1/config?warehouse=warehouse

What else am I missing?

@FickleLife

FickleLife commented Mar 30, 2025

@adamcodes716 I changed the URI to http://xxx:19120/api and that gives a response, but PyIceberg shows the error below.

I also cannot connect to Nessie via PyIceberg backed by MinIO. Nessie and MinIO run under Docker on a different machine from the development machine. The code:

catalog = load_catalog(
        "nessie",
        **{
            "uri": "http://192.168.1.xxx:19120/api",
            "warehouse": "s3://warehouse/",
            "s3.access-key-id": "xxx",
            "s3.secret-access-key": "xxx",
            "s3.endpoint": "http://192.168.1.xxx:9000",
            "s3.path-style.access": "true",
            
        }
    )

Returns:

Exception has occurred: ValidationError
2 validation errors for ConfigResponse
defaults
  Field required [type=missing, input_value={'defaultBranch': 'main',...SupportedApiVersion': 2}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.9/v/missing
overrides
  Field required [type=missing, input_value={'defaultBranch': 'main',...SupportedApiVersion': 2}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.9/v/missing
  File "/Users/xxx/Development/bp_index/src/iceberg_createtable_category.py", line 20, in create_category_table
    catalog = load_catalog(
              ^^^^^^^^^^^^^
  File "/Users/xxx/Development/bp_index/src/iceberg_createtable_category.py", line 38, in <module>
    create_category_table()
pydantic_core._pydantic_core.ValidationError: 2 validation errors for ConfigResponse
defaults
  Field required [type=missing, input_value={'defaultBranch': 'main',...SupportedApiVersion': 2}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.9/v/missing
overrides
  Field required [type=missing, input_value={'defaultBranch': 'main',...SupportedApiVersion': 2}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.9/v/missing

I see a working config here: #1524 (comment), but it uses "rest" in load_catalog() instead of "nessie". I have tried this but also got errors. Trying to match his Nessie config, it seems from the documentation https://projectnessie.org/nessie-latest/configuration/#advanced-docker-image-tuning-java-images-only that not all of the env variables specified in the above post are available in Docker. Any clues to getting PyIceberg running with Nessie?

@dimas-b

dimas-b commented Mar 31, 2025

@FickleLife : The uri does not look correct to me. For the Iceberg REST API it should probably be something like http://192.168.1.xxx:19120/iceberg/main. Cf. https://projectnessie.org/guides/try-nessie/

Note: with a URI like http://192.168.1.xxx:19120/api you're probably hitting the config endpoint of the Nessie (native) API.
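To make the distinction concrete, here is a small sketch of the URL shapes involved (the host and port mirror the example above; `/api/v2` as the native API path is an assumption about a current Nessie version):

```python
# Two endpoint families a default Nessie deployment exposes
# (host/port are the example values from this thread):
base = "http://192.168.1.xxx:19120"

nessie_native_api = f"{base}/api/v2"       # Nessie's own (native) REST API
iceberg_rest_uri = f"{base}/iceberg/main"  # Iceberg REST API, pinned to branch "main"

# PyIceberg's RestCatalog fetches <uri>/v1/config on startup. Pointing the
# uri at the native API returns Nessie's own config payload, which lacks the
# "defaults"/"overrides" fields and trips the ConfigResponse validation
# error shown above.
print(f"{iceberg_rest_uri}/v1/config")
```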
