fix: correct UUIDType partition representation for BucketTransform #2003
Conversation
@DinGo4DEV Thanks for working on this!

It would be good to throw in a test as well. It can be a simple test in test_writes.py where you write to a bucketed UUID table. It also looks like the linter is not happy; could you run `make lint` as well?
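For context, Iceberg's bucket transform is defined as murmur3_x86_32 over the value's serialized bytes (for a UUID, its 16 big-endian bytes), with the sign bit dropped and the result taken modulo the bucket count. Below is a rough, self-contained Python sketch of that calculation (not PyIceberg's actual implementation, and `bucket_uuid` is a hypothetical helper name) that such a test could sanity-check expected bucket values against:

```python
import uuid

def murmur3_x86_32(data: bytes, seed: int = 0) -> int:
    """Pure-Python 32-bit murmur3 (x86 variant)."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed
    n = len(data)
    # Process 4-byte little-endian blocks.
    for i in range(0, n - (n & 3), 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xFFFFFFFF
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    # Tail (empty for 16-byte UUID input).
    k = 0
    tail = data[n - (n & 3):]
    if len(tail) == 3:
        k ^= tail[2] << 16
    if len(tail) >= 2:
        k ^= tail[1] << 8
    if tail:
        k ^= tail[0]
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        h ^= (k * c2) & 0xFFFFFFFF
    # Finalization mix.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h

def bucket_uuid(value: uuid.UUID, num_buckets: int) -> int:
    # Hash the UUID's 16 big-endian bytes, mask to a non-negative
    # int32 (equivalent to Java's `hash & Integer.MAX_VALUE`), mod N.
    return (murmur3_x86_32(value.bytes) & 0x7FFFFFFF) % num_buckets

u = uuid.UUID("ec9b663b-062f-4200-a130-8de19c21b800")
print(bucket_uuid(u, 8))  # deterministic bucket in [0, 8)
```

A test could then write rows with known UUIDs and assert each row landed in the partition this function predicts.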
@DinGo4DEV Again, thanks for working on this. As part of this review, I dug a bit deeper into the issues, and it looks like we're missing the Parquet LogicalTypeAnnotation (apache/arrow#46469), which causes interoperability issues with other readers.
@Fokko Thank you for taking the time to review. I appreciate your thoughtful feedback and the effort you put into this. To fully support the UUID type, it looks like we'll need to wait for a new Arrow release (> 20.0.0). In the meantime, I'll continue working on the test cases for my commits.
@DinGo4DEV Yes, please do. My biggest concern is that we produce Parquet files that will not be supported by other implementations because of the missing logical annotation. Arrow releases pretty often, so it can be resolved within a reasonable timespan.
@DinGo4DEV Good news, it looks like this is fixed in the next release of Arrow: apache/arrow#45866
@Fokko TBR, after running the test case I found that the identity transform of UUID is not supported for writing, because the value is bytes. So I tried rewriting the Avro writer and other related components. The resulting layout looks like this:
```
data/uuid_bucket=0/uuid_identity=ec9b663b-062f-4200-a130-8de19c21b800
|- xxxxx.parquet
data/uuid_bucket=1/uuid_identity=5f473c64-dbeb-449b-bdfa-b6b4185b1bde
|- xxxxx.parquet
```

I'm not sure whether that is correct and compatible with other integrations, as I haven't tried partitioning a UUID with the identity transform in other projects before.
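The readable paths above come from rendering the identity value in its canonical string form rather than as raw bytes. A minimal stdlib sketch of that round-trip (the helper name `identity_partition_segment` is hypothetical, not PyIceberg API):

```python
import uuid

def identity_partition_segment(raw: bytes) -> str:
    # The writer receives the UUID as a 16-byte value; for the
    # partition path it is rendered in canonical hex-and-dashes form.
    return f"uuid_identity={uuid.UUID(bytes=raw)}"

u = uuid.UUID("ec9b663b-062f-4200-a130-8de19c21b800")
assert len(u.bytes) == 16  # 16 big-endian bytes, as stored on disk

print(identity_partition_segment(u.bytes))
# → uuid_identity=ec9b663b-062f-4200-a130-8de19c21b800
```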
The same kind of problem was also noticed in apache/iceberg#13087.
Rationale for this change
Resolves #2002
Are these changes tested?
Tested locally. Test cases should be added for this later.
Are there any user-facing changes?