Support Vended Credentials for Azure Data Lake Store #1146
Hi @c-thiel, thank you for raising this issue. I'm not an expert in Azure Data Lake Store, but I can help look into this together.
It looks like the prefix for the secret name you posted is "adls", whereas the secret name we expect is "adlfs" (see iceberg-python/pyiceberg/io/__init__.py, line 72 at d587e67).
Could you confirm whether this is the case? This could be a typo in your comment above, or on the REST Catalog server side, which should then be fixed.
@sungwy in my comment as well as in the catalog I am using "adls.sas-token", which is exactly what Java and Spark expect. Is there a reason for pyiceberg not using the same prefix as Java?
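For reference, a minimal sketch of the mismatch (the key strings are as quoted above; the token value is a placeholder):

```python
# Key vended by the REST catalog, matching what Java and Spark expect:
vended = {"adls.sas-token": "sv=2024-01-01&sig=placeholder"}

# Prefix pyiceberg looked for before the fix in #961:
legacy_prefix = "adlfs."  # e.g. "adlfs.sas-token"

# The lookup therefore misses the vended credential:
assert "adlfs.sas-token" not in vended
```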
The adlfs prefix is wrong; we already have a PR to fix the prefix in #961.
Thanks for pointing that out @c-thiel! Like I mentioned, I'm not too familiar with the Azure Data Lake integration, but it looks like @ndrluis has the right solution ready for this issue 🙂 The PR has been merged into main - would you be able to help confirm whether the fix in main resolves this issue, @c-thiel?
@sungwy sorry for the late reply, I overlooked this. Looking good now :) - Thanks!
@sungwy and @ndrluis, #961 isn't a fix for this issue. There are two aspects: first the prefix, and second the account host needed for the dfs endpoint. I haven't contributed to this project, but here is a working example that merges in logic from the Java implementation and uses the account-host to support the dfs endpoint:

```python
# Drop-in replacement for _adls in pyiceberg/io/fsspec.py; it relies on that
# module's existing imports (Properties, get_first_property_value,
# deprecation_message, and the ADLS_*/ADLFS_* constants).
def _adls(properties: Properties) -> AbstractFileSystem:
    from adlfs import AzureBlobFileSystem

    for property_name in properties:
        if property_name.startswith(ADLFS_PREFIX):
            deprecation_message(
                deprecated_in="0.8.0",
                removed_in="0.9.0",
                help_message=f"The property {property_name} is deprecated. Please use properties that start with adls.",
            )

    # TODO: should be a function on pyiceberg.utils.properties
    def properties_with_prefix(properties: dict, prefix: str) -> dict:
        return {k[len(prefix):]: v for k, v in properties.items() if k.startswith(prefix)}

    account_name = get_first_property_value(
        properties,
        ADLS_ACCOUNT_NAME,
        ADLFS_ACCOUNT_NAME,
    )
    sas_token = get_first_property_value(
        properties,
        ADLS_SAS_TOKEN,
        ADLFS_SAS_TOKEN,
    )

    # Vended credentials arrive as host-scoped "adls.sas-token.<account-host>" properties.
    adls_sas_tokens = properties_with_prefix(properties, f"{ADLS_SAS_TOKEN}.")
    if len(adls_sas_tokens) == 1:
        key, host_sas_token = next(iter(adls_sas_tokens.items()))  # the single key-value pair
        # TODO: add constant?
        if key.endswith(".windows.net"):
            if account_name is None:
                properties[ADLS_ACCOUNT_NAME] = key.split(".")[0]
            # TODO: add constant
            properties["adls.account-host"] = key
        if sas_token is None:
            properties[ADLS_SAS_TOKEN] = host_sas_token

    return AzureBlobFileSystem(
        # TODO: add constant
        account_host=properties.get("adls.account-host"),
        connection_string=get_first_property_value(
            properties,
            ADLS_CONNECTION_STRING,
            ADLFS_CONNECTION_STRING,
        ),
        account_name=get_first_property_value(
            properties,
            ADLS_ACCOUNT_NAME,
            ADLFS_ACCOUNT_NAME,
        ),
        account_key=get_first_property_value(
            properties,
            ADLS_ACCOUNT_KEY,
            ADLFS_ACCOUNT_KEY,
        ),
        sas_token=get_first_property_value(
            properties,
            ADLS_SAS_TOKEN,
            ADLFS_SAS_TOKEN,
        ),
        tenant_id=get_first_property_value(
            properties,
            ADLS_TENANT_ID,
            ADLFS_TENANT_ID,
        ),
        client_id=get_first_property_value(
            properties,
            ADLS_CLIENT_ID,
            ADLFS_CLIENT_ID,
        ),
        client_secret=get_first_property_value(
            properties,
            ADLS_ClIENT_SECRET,
            ADLFS_ClIENT_SECRET,
        ),
    )
```
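To make the key handling concrete, here is a small, self-contained sketch of how the helper above digests a vended SAS-token property; the account name and token value are made-up placeholders:

```python
def properties_with_prefix(properties: dict, prefix: str) -> dict:
    return {k[len(prefix):]: v for k, v in properties.items() if k.startswith(prefix)}

# Hypothetical vended property; account name and token are placeholders.
props = {"adls.sas-token.myaccount.dfs.core.windows.net": "sv=2024-01-01&sig=placeholder"}

tokens = properties_with_prefix(props, "adls.sas-token.")
host, token = next(iter(tokens.items()))
print(host)                # myaccount.dfs.core.windows.net -> adls.account-host
print(host.split(".")[0])  # myaccount -> used as the account name
```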
Spent too much time on it to find out it is a bug in pyiceberg :(
I created a custom FileIO fix as a temporary workaround, and I've submitted Polaris #418:
```python
from urllib.parse import urlparse

from fsspec import AbstractFileSystem

from pyiceberg.io import (
    ADLS_ACCOUNT_NAME,
    ADLS_SAS_TOKEN,
    ADLFS_ACCOUNT_NAME,
    ADLFS_SAS_TOKEN,
)
from pyiceberg.io.fsspec import FsspecFileIO
from pyiceberg.typedef import Properties
from pyiceberg.utils.properties import get_first_property_value


class CustomFsspecFileIO(FsspecFileIO):
    def __init__(self, properties):
        # Short-term fix for https://github.com/apache/iceberg-python/issues/961
        # and https://github.com/apache/iceberg-python/issues/1146
        base_location = properties.get("default-base-location")
        if base_location and base_location.startswith("abfs"):
            account_name = get_first_property_value(properties, ADLS_ACCOUNT_NAME, ADLFS_ACCOUNT_NAME)
            sas_token = get_first_property_value(properties, ADLS_SAS_TOKEN, ADLFS_SAS_TOKEN)
            if sas_token is None:
                # Iterate over a copy so we can safely add keys while looping.
                for key, value in dict(properties).items():
                    key = key.replace("adlfs.", "adls.")
                    if key.startswith(ADLS_SAS_TOKEN):
                        properties[ADLS_SAS_TOKEN] = value
                        if key.endswith(".windows.net"):
                            if account_name is None:
                                account_host = key.removeprefix(f"{ADLS_SAS_TOKEN}.")
                                account_name = account_host.split(".")[0]
                                properties[ADLS_ACCOUNT_NAME] = account_name
                                properties["adls.account-host"] = account_host
                        break  # Exit the loop after the first match
        super().__init__(properties)

    def _get_fs(self, scheme: str):
        # Route all Azure schemes to the patched _adls below.
        if scheme in ["abfs", "abfss", "wasb", "wasbs"]:
            return _adls(self.properties)
        # If not ADLS, proceed with the original behavior
        return super()._get_fs(scheme)

    def new_input(self, location: str):
        # Replace wasb(s):// with abfs(s):// in the location
        uri = urlparse(location)
        if uri.scheme == "wasb":
            location = location.replace(f"{uri.scheme}://", "abfs://")
        if uri.scheme == "wasbs":
            location = location.replace(f"{uri.scheme}://", "abfss://")
        return super().new_input(location)

    def new_output(self, location: str):
        # Replace wasb(s):// with abfs(s):// in the location
        uri = urlparse(location)
        if uri.scheme == "wasb":
            location = location.replace(f"{uri.scheme}://", "abfs://")
        if uri.scheme == "wasbs":
            location = location.replace(f"{uri.scheme}://", "abfss://")
        return super().new_output(location)


# Replaces pyiceberg.io.fsspec._adls for the workaround above.
def _adls(properties: Properties) -> AbstractFileSystem:
    from adlfs import AzureBlobFileSystem

    return AzureBlobFileSystem(
        account_host=properties["adls.account-host"],
        account_name=properties[ADLS_ACCOUNT_NAME],
        sas_token=properties[ADLS_SAS_TOKEN],
    )
```
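For anyone wanting to try the workaround: a minimal sketch of loading it through the standard `py-io-impl` catalog property, assuming the class above is saved in a module named `custom_io` (the catalog name and URI are placeholders):

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "polaris",  # placeholder catalog name
    **{
        "uri": "https://polaris.example.com/api/catalog",  # placeholder URI
        "py-io-impl": "custom_io.CustomFsspecFileIO",
    },
)
```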
Hey @sfc-gh-tbenroeck, thanks for uncovering this bug. Are you interested in creating a PR to fix this? It looks like you're almost there.
First version with `Fsspec`. Will check with PyArrow tomorrow, but that one seems to be lacking the SAS token: https://arrow.apache.org/docs/cpp/api/filesystem.html#azure-filesystem Closes apache#1146
Feature Request / Improvement
Vended credentials for Azure Data Lake Store are supported by Java. For the `getTable`/`createTable` endpoints, the catalog returns a "config" along the lines of the sketch below. This is currently not respected by PyIceberg; instead, we get the error shown beneath the sketch:
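An illustrative shape of the vended config (the account name and token value are hypothetical placeholders; the key pattern follows the `adls.sas-token.<account-host>` properties discussed in this thread):

```python
# Hypothetical vended "config" returned by the catalog; values are placeholders.
config = {
    "adls.sas-token.myaccount.dfs.core.windows.net": "sv=2024-01-01&sig=placeholder",
}
```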
ValueError: unable to connect to account for Must provide either a connection_string or account_name with credentials!!
Full Traceback: