Skip to content

Add Hive integration tests #207

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Jan 17, 2024
Merged

Add Hive integration tests #207

merged 11 commits into from
Jan 17, 2024

Conversation

Fokko
Copy link
Contributor

@Fokko Fokko commented Dec 12, 2023

This makes it easier to test round trips of updating tables. Currently, we have integration tests for the REST catalog, but there the REST catalog takes care of updating the metadata.

This includes a refactor of moving all the integration tests into their own module.

By having this setup, it will also easier for others to pick up work, for example: #205 or #206.

cc @HonahX

@Fokko Fokko added this to the PyIceberg 0.6.0 release milestone Dec 13, 2023
Copy link
Contributor

@HonahX HonahX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Fokko! It is great to have a folder for integration tests. I have a few comments below

#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
Copy link
Contributor

@HonahX HonahX Dec 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we do not need __init__.py in test folders.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would agree with you, but it raises an error:

_____________________________________________________________________________ ERROR collecting tests/integration/test_hive.py ______________________________________________________________________________
import file mismatch:
imported module 'test_hive' has this __file__ attribute:
  /Users/fokkodriesprong/Desktop/iceberg-python/tests/catalog/test_hive.py
which is not the same as the test file we want to collect:
  /Users/fokkodriesprong/Desktop/iceberg-python/tests/integration/test_hive.py
HINT: remove __pycache__ / .pyc files and/or use a unique basename for your test file modules
_____________________________________________________________________________ ERROR collecting tests/integration/test_rest.py ______________________________________________________________________________
import file mismatch:
imported module 'test_rest' has this __file__ attribute:
  /Users/fokkodriesprong/Desktop/iceberg-python/tests/catalog/test_rest.py
which is not the same as the test file we want to collect:
  /Users/fokkodriesprong/Desktop/iceberg-python/tests/integration/test_rest.py
HINT: remove __pycache__ / .pyc files and/or use a unique basename for your test file modules

We have the __init__.py files in the other folders as well 🤔

Copy link
Contributor

@HonahX HonahX Dec 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I raised this because we previously removed all __init__.py in tests: apache/iceberg#5919
I think we can rename these two files to avoid name collisions with those in catalogs/. This would enable us to safely delete the __init__.py

The other __init__.py in tests/ was added in #45. I removed this one and all tests still work.

It would be ideal to maintain consistency by not having __init__.py files in test folders. However, I believe this is a minor issue and should not block this PR.



@pytest.fixture()
def table_test_limit(catalog: Catalog) -> Table:
Copy link
Contributor

@HonahX HonahX Dec 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could use pytest_lazy_fixture to avoid duplicate tests in test_rest.py and test_hive.py
like what we did in

@pytest.mark.parametrize(
'catalog',
[
lazy_fixture('catalog_memory'),
lazy_fixture('catalog_sqlite'),
],
)
def test_create_table_default_sort_order(catalog: SqlCatalog, table_schema_nested: Schema, random_identifier: Identifier) -> None:
.

Here it can be something like

@pytest.fixture(params=[
    pytest.lazy_fixture('catalog_hive'),
    pytest.lazy_fixture('catalog_rest')
])
def catalog(request):
    return request.param

@pytest.fixture()
def table_test_limit(catalog: Catalog) -> Table:

and table_test_limit should load table from both catalogs in turn

I modified an example from https://pypi.org/project/pytest-lazy-fixture/

@pytest.fixture(params=[
    pytest.lazy_fixture('one'),
    pytest.lazy_fixture('two')
])
def combined(request):
    return request.param

@pytest.fixture
def one():
    return 1

@pytest.fixture
def two():
    return 2

@pytest.fixture
def combined2(combined):
    return combined

def test_func(combined2):
    assert combined2 in [1, 2]

which seems to work.

I will try to test this in our integration test later. If it works, I think it can be worth to implement. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the lazy fixtures in #178 and I waited for the PR to be merged because I wanted to add them here as well. The reason I didn't do it is because test_upgrade_table_version and test_table_properties are not implemented in Hive (yet). This would mean that we need to keep them in parity which might slow down development.

Copy link
Contributor

@HonahX HonahX Dec 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the context! For test_upgrade_table_version and test_table_properties, I think we might skip them in test_hive by

@pytest.fixture()
def table(catalog: Catalog) -> Table:
  if catalog.name == "hive":
    pytest.skip("Not Implemented: ...")
    ...

@pytest.fixture()
def table_test_table_version(catalog: Catalog) -> Table:
    if catalog.name == "hive":
      pytest.skip("Not Implemented: ...")
     ...

(We might also find a way to put skip into the decorator)
Since the two fixtures are solely used by the two tests respectively.

I think we can discuss later on whether this can work in future tests related to update table. The current PR has already done many things and it is important for update_table tests. This topic should not block this PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually like this suggestion a lot, let me add this to the PR

dev/Dockerfile Outdated
@@ -62,7 +60,7 @@ RUN chmod u+x /opt/spark/sbin/* && \

RUN pip3 install -q ipython

RUN pip3 install "pyiceberg[s3fs]==${PYICEBERG_VERSION}"
RUN pip3 install "pyiceberg[s3fs,pyarrow,hive]==${PYICEBERG_VERSION}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need pyarrow in the docker image? I think in provision.py we only need pyiceberg to create some empty table with UUID column. Since we use minio for storage, s3fs can help us write the metadata

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I got an error before, and therefore I added it, but I've removed it again and it seems to work fine 👍

Copy link
Contributor

@HonahX HonahX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@Fokko
Copy link
Contributor Author

Fokko commented Jan 17, 2024

Thanks @HonahX for the review 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants