Add Hive integration tests #207
Conversation
Thanks @Fokko! It is great to have a folder for integration tests. I have a few comments below.
```python
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
```
I think we do not need `__init__.py` in test folders.
I would agree with you, but it raises an error:

```
______________ ERROR collecting tests/integration/test_hive.py ______________
import file mismatch:
imported module 'test_hive' has this __file__ attribute:
  /Users/fokkodriesprong/Desktop/iceberg-python/tests/catalog/test_hive.py
which is not the same as the test file we want to collect:
  /Users/fokkodriesprong/Desktop/iceberg-python/tests/integration/test_hive.py
HINT: remove __pycache__ / .pyc files and/or use a unique basename for your test file modules
______________ ERROR collecting tests/integration/test_rest.py ______________
import file mismatch:
imported module 'test_rest' has this __file__ attribute:
  /Users/fokkodriesprong/Desktop/iceberg-python/tests/catalog/test_rest.py
which is not the same as the test file we want to collect:
  /Users/fokkodriesprong/Desktop/iceberg-python/tests/integration/test_rest.py
HINT: remove __pycache__ / .pyc files and/or use a unique basename for your test file modules
```
We have the `__init__.py` files in the other folders as well 🤔
I raised this because we previously removed all `__init__.py` in tests: apache/iceberg#5919. I think we can rename these two files to avoid name collisions with those in `catalog/`. That would enable us to safely delete the `__init__.py` files.

The other `__init__.py` in `tests/` was added in #45. I removed this one and all tests still work.

It would be ideal to maintain consistency by not having `__init__.py` files in test folders. However, I believe this is a minor issue and should not block this PR.
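As a side note: newer pytest versions (6.0+) also support `--import-mode=importlib`, which removes the unique-basename requirement altogether, so neither the `__init__.py` files nor the renames would be needed. A sketch of enabling it, assuming the project configures pytest in `pyproject.toml`:

```toml
[tool.pytest.ini_options]
addopts = "--import-mode=importlib"
```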
tests/integration/test_hive.py (Outdated)

```python
@pytest.fixture()
def table_test_limit(catalog: Catalog) -> Table:
```
I wonder if we could use `pytest_lazy_fixture` to avoid duplicate tests in `test_rest.py` and `test_hive.py`, like what we did in `iceberg-python/tests/catalog/test_sql.py` (lines 113 to 120 in 2bd8cf2):

```python
@pytest.mark.parametrize(
    'catalog',
    [
        lazy_fixture('catalog_memory'),
        lazy_fixture('catalog_sqlite'),
    ],
)
def test_create_table_default_sort_order(catalog: SqlCatalog, table_schema_nested: Schema, random_identifier: Identifier) -> None:
```
Here it can be something like:

```python
@pytest.fixture(params=[
    pytest.lazy_fixture('catalog_hive'),
    pytest.lazy_fixture('catalog_rest')
])
def catalog(request):
    return request.param

@pytest.fixture()
def table_test_limit(catalog: Catalog) -> Table:
```

and `table_test_limit` should load the table from each catalog in turn.
I modified an example from https://pypi.org/project/pytest-lazy-fixture/:

```python
@pytest.fixture(params=[
    pytest.lazy_fixture('one'),
    pytest.lazy_fixture('two')
])
def combined(request):
    return request.param

@pytest.fixture
def one():
    return 1

@pytest.fixture
def two():
    return 2

@pytest.fixture
def combined2(combined):
    return combined

def test_func(combined2):
    assert combined2 in [1, 2]
```

which seems to work.
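For what it's worth, the same parametrization can also be expressed without the plugin, using pytest's built-in `request.getfixturevalue`. This is a hedged sketch with placeholder fixture values (not the real catalogs), wrapped in a small harness that runs pytest on a throwaway file:

```python
import pathlib
import tempfile
import textwrap

import pytest

# Sketch: a `catalog` fixture parametrized over two backend fixtures,
# resolved at run time with request.getfixturevalue instead of
# pytest-lazy-fixture. The fixture return values here are placeholders.
SKETCH = textwrap.dedent(
    """
    import pytest

    @pytest.fixture
    def catalog_hive():
        return "hive"

    @pytest.fixture
    def catalog_rest():
        return "rest"

    @pytest.fixture(params=["catalog_hive", "catalog_rest"])
    def catalog(request):
        # Resolve the fixture named by the current param at run time
        return request.getfixturevalue(request.param)

    def test_catalog(catalog):
        assert catalog in ("hive", "rest")
    """
)

def run_sketch() -> int:
    # Write the sketch to a temp file and run pytest on it;
    # an exit code of 0 means both parametrized cases passed.
    with tempfile.TemporaryDirectory() as tmp:
        test_file = pathlib.Path(tmp) / "test_catalog_sketch.py"
        test_file.write_text(SKETCH)
        return int(pytest.main(["-q", str(test_file)]))
```

This would keep the two catalog suites on one test body without adding the `pytest-lazy-fixture` dependency.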
I will try to test this in our integration tests later. If it works, I think it can be worth implementing. What do you think?
I added the lazy fixtures in #178 and I waited for that PR to be merged because I wanted to add them here as well. The reason I didn't do it is that `test_upgrade_table_version` and `test_table_properties` are not implemented in Hive (yet). This would mean that we need to keep them in parity, which might slow down development.
Thanks for the context! For `test_upgrade_table_version` and `test_table_properties`, I think we might skip them in `test_hive` with:

```python
@pytest.fixture()
def table(catalog: Catalog) -> Table:
    if catalog.name == "hive":
        pytest.skip("Not Implemented: ...")
    ...

@pytest.fixture()
def table_test_table_version(catalog: Catalog) -> Table:
    if catalog.name == "hive":
        pytest.skip("Not Implemented: ...")
    ...
```

(We might also find a way to put the `skip` into the decorator.)

This works because the two fixtures are used solely by those two tests, respectively.
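One way to move the `skip` into the decorator is `pytest.param(..., marks=pytest.mark.skip(...))`. A hedged sketch with assumed names (the real tests take a fixture-provided table, not a string), again verified by running pytest on a throwaway file:

```python
import pathlib
import tempfile
import textwrap

import pytest

# Sketch: mark the Hive case as skipped at the parametrization site,
# instead of branching inside each fixture body. Names are illustrative.
SKETCH = textwrap.dedent(
    """
    import pytest

    @pytest.mark.parametrize(
        "catalog_name",
        [
            pytest.param("rest"),
            pytest.param(
                "hive",
                marks=pytest.mark.skip(reason="Not Implemented in Hive yet"),
            ),
        ],
    )
    def test_table_properties(catalog_name):
        # The hive param never runs, so only the rest case is exercised
        assert catalog_name == "rest"
    """
)

def run_sketch() -> int:
    # Exit code 0 means the rest case passed and the hive case was skipped.
    with tempfile.TemporaryDirectory() as tmp:
        test_file = pathlib.Path(tmp) / "test_skip_sketch.py"
        test_file.write_text(SKETCH)
        return int(pytest.main(["-q", str(test_file)]))
```

The skip reason then shows up in the pytest summary, which makes the Hive gaps visible without branching inside fixtures.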
I think we can discuss later whether this can work for future tests related to updating tables. The current PR has already done many things, and it is important for the `update_table` tests. This topic should not block this PR.
I actually like this suggestion a lot; let me add it to the PR.
dev/Dockerfile (Outdated)

```diff
@@ -62,7 +60,7 @@ RUN chmod u+x /opt/spark/sbin/* && \
 RUN pip3 install -q ipython

-RUN pip3 install "pyiceberg[s3fs]==${PYICEBERG_VERSION}"
+RUN pip3 install "pyiceberg[s3fs,pyarrow,hive]==${PYICEBERG_VERSION}"
```
Do we need `pyarrow` in the Docker image? I think in `provision.py` we only need pyiceberg to create some empty tables with a UUID column. Since we use MinIO for storage, `s3fs` can help us write the metadata.
Thanks! I got an error before, and therefore I added it, but I've removed it again and it seems to work fine 👍
LGTM!
Thanks @HonahX for the review 👍
This makes it easier to test round trips of updating tables. Currently, we have integration tests for the REST catalog, but there the REST catalog takes care of updating the metadata.
This includes a refactor of moving all the integration tests into their own module.
By having this setup, it will also be easier for others to pick up work, for example: #205 or #206.
cc @HonahX