Remove extraneous trailing slash in table location #606


Closed
Vitalii0-o opened this issue Apr 15, 2024 · 5 comments · Fixed by #702

Comments

@Vitalii0-o

Apache Iceberg version

0.6.0 (latest release)

Please describe the bug 🐞

Creating an Iceberg table from spark-3.2 with the pyiceberg 0.6.0 runtime and the location s3://bucket/db/tbl/ causes the underlying files to be written like s3://bucket/db/tbl//data/xxx. This leads to file-not-found exceptions in other systems (e.g. Trino), since those systems remove the duplicated slash from the string (s3://bucket/db/tbl/data/xxx).
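
A minimal reproduction sketch, assuming a configured pyiceberg catalog (the catalog name, namespace, and schema below are illustrative):

```python
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType

# Assumes a catalog named "default" is configured (e.g. in ~/.pyiceberg.yaml)
# and that the namespace "db" already exists.
catalog = load_catalog("default")

schema = Schema(
    NestedField(field_id=1, name="id", field_type=StringType(), required=False),
)

# Note the trailing slash in the location: with pyiceberg 0.6.0 the data
# files end up under s3://bucket/db/tbl//data/... (double slash).
table = catalog.create_table(
    identifier="db.tbl",
    schema=schema,
    location="s3://bucket/db/tbl/",
)
```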

I think the right behavior would be for Iceberg to remove the extraneous trailing slash from the table location, i.e. when the table location is set to s3://bucket/db/tbl/, the trailing / should be removed.

This issue has already been solved in Iceberg (Java):
apache/iceberg#4582
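
The same normalization could be applied on the Python side wherever the table location is accepted. A minimal sketch of the idea (the helper name is hypothetical, not an existing pyiceberg API):

```python
def strip_trailing_slash(location: str) -> str:
    """Hypothetical helper mirroring the Java-side fix: trim
    extraneous trailing slash(es) from a table location."""
    # "s3://bucket/db/tbl/" -> "s3://bucket/db/tbl"
    return location.rstrip("/")


assert strip_trailing_slash("s3://bucket/db/tbl/") == "s3://bucket/db/tbl"
assert strip_trailing_slash("s3://bucket/db/tbl") == "s3://bucket/db/tbl"
```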

@Fokko
Contributor

Fokko commented Apr 15, 2024

@Vitalii0-o I think it is a good idea to remove the extraneous slash. Are you interested in contributing a patch?

@Vitalii0-o
Author

Of course. So far I have only managed to localize the problem; I don't have a ready-made solution yet.

@Fokko
Contributor

Fokko commented Apr 15, 2024

If you have a stack trace, it would be possible to find a suitable place to trim the excess slash.

@Vitalii0-o
Author

Traceback (most recent call last):
File "/usr/local/airflow/.local/lib/python3.11/site-packages/dlt/destinations/sql_client.py", line 242, in _wrap_gen
return (yield from f(self, *args, **kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/airflow/.local/lib/python3.11/site-packages/dlt/destinations/impl/athena/athena.py", line 296, in execute_query
cursor.execute(query_line, db_args)
File "/usr/local/airflow/.local/lib/python3.11/site-packages/pyathena/cursor.py", line 108, in execute
raise OperationalError(query_execution.state_change_reason)
pyathena.error.OperationalError: GENERIC_INTERNAL_ERROR: io.trino.hdfs.s3.TrinoS3FileSystem$UnrecoverableS3OperationException: com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: 123; S3 Extended Request ID: 123=; Proxy: null), S3 Extended Request ID: 123= (Bucket: bucket, Key: facebook/123/bronze_facebook_test1/_dlt_pipeline_state/metadata/123.metadata.json)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/airflow/.local/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 699, in sync_destination
remote_state = self._restore_state_from_destination()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/airflow/.local/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 1420, in _restore_state_from_destination
state = load_pipeline_state_from_destination(self.pipeline_name, job_client)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/airflow/.local/lib/python3.11/site-packages/dlt/pipeline/state_sync.py", line 139, in load_pipeline_state_from_destination
state = client.get_stored_state(pipeline_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/airflow/.local/lib/python3.11/site-packages/dlt/destinations/job_client_impl.py", line 368, in get_stored_state
with self.sql_client.execute_query(query, pipeline_name) as cur:
File "/usr/local/lib/python3.11/contextlib.py", line 137, in enter
return next(self.gen)
^^^^^^^^^^^^^^
File "/usr/local/airflow/.local/lib/python3.11/site-packages/dlt/destinations/sql_client.py", line 244, in _wrap_gen
raise self._make_database_exception(ex)
dlt.destinations.exceptions.DatabaseTerminalException: GENERIC_INTERNAL_ERROR: io.trino.hdfs.s3.TrinoS3FileSystem$UnrecoverableS3OperationException: com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: 123; S3 Extended Request ID: 123=; Proxy: null), S3 Extended Request ID: 123= (Bucket: bucket, Key: facebook/123/bronze_facebook_test1/_dlt_pipeline_state/metadata/123.metadata.json)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/airflow/.local/lib/python3.11/site-packages/airflow/models/taskinstance.py", line 433, in _execute_task
result = execute_callable(context=context, **execute_callable_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/airflow/.local/lib/python3.11/site-packages/airflow/operators/python.py", line 199, in execute
return_value = self.execute_callable()
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/airflow/.local/lib/python3.11/site-packages/airflow/operators/python.py", line 216, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/airflow/.local/lib/python3.11/site-packages/dlt/helpers/airflow_helper.py", line 273, in _run
for attempt in self.retry_policy.copy(
File "/usr/local/airflow/.local/lib/python3.11/site-packages/tenacity/init.py", line 347, in iter
do = self.iter(retry_state=retry_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/airflow/.local/lib/python3.11/site-packages/tenacity/init.py", line 314, in iter
return fut.result()
^^^^^^^^^^^^
File "/usr/local/lib/python3.11/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/usr/local/airflow/.local/lib/python3.11/site-packages/dlt/helpers/airflow_helper.py", line 283, in _run
load_info = task_pipeline.run(
^^^^^^^^^^^^^^^^^^
File "/usr/local/airflow/.local/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 219, in _wrap
step_info = f(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/airflow/.local/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 264, in _wrap
return f(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/airflow/.local/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 640, in run
self.sync_destination(destination, staging, dataset_name)
File "/usr/local/airflow/.local/lib/python3.11/site-packages/dlt/pipeline/pipeline.py", line 173, in _wrap
rv = f(self, *args, **kwargs)

The app is trying to find Bucket: bucket, Key: facebook/123/bronze_facebook_test1/_dlt_pipeline_state/metadata/123.metadata.json,
but what I have is Bucket: bucket, Key: facebook/123/bronze_facebook_test1/_dlt_pipeline_state/metadata/123.metadata.json.
I am using dlt, which uses pyAthena, which uses pyIceberg. pyIceberg creates the folder in S3 with an extra /. I found exactly the same error in Iceberg: apache/iceberg#4582
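
To illustrate the mismatch: S3 keys are opaque strings, so a key written with a double slash and its collapsed form are two different objects. A small sketch of the Trino-style normalization (paths are illustrative):

```python
import re

# Key as written by pyiceberg when the table location ends with "/":
written_key = "db/tbl//data/part-00000.parquet"

# Trino-style normalization collapses repeated slashes before the lookup:
requested_key = re.sub(r"/+", "/", written_key)

print(written_key)    # db/tbl//data/part-00000.parquet
print(requested_key)  # db/tbl/data/part-00000.parquet

# The normalized key was never created, hence the 404 NoSuchKey.
assert written_key != requested_key
```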

@Vitalii0-o
Author

This error started occurring for me about a week ago, but there were no changes on my side.
