
fix: correct UUIDType partition representation for BucketTransform #2003


Open: wants to merge 15 commits into main

Conversation

DinGo4DEV

Rationale for this change

Resolves #2002

Are these changes tested?

Tested locally. Test cases should be added for this later.

Are there any user-facing changes?

@DinGo4DEV DinGo4DEV marked this pull request as ready for review May 14, 2025 08:51
Contributor

@Fokko Fokko left a comment


@DinGo4DEV Thanks for working on this!

It would be good to throw in a test as well. It can be a simple test in test_writes.py where you write to a bucketed UUID table. It also looks like the linter is not happy; could you run make lint as well?

DinGo4DEV and others added 3 commits May 15, 2025 20:46
@Fokko
Contributor

Fokko commented May 16, 2025

@DinGo4DEV Again, thanks for working on this. As part of this review, I dug a bit deeper into the issues, and it looks like we're missing the Parquet LogicalTypeAnnotation (apache/arrow#46469) which causes interoperability issues with other readers.

sungwy and others added 2 commits May 16, 2025 13:25
…ntation `LegacyOAuth2AuthManager` (apache#1981)

@DinGo4DEV
Author

@Fokko Thank you for taking the time to review. I appreciate your thoughtful feedback and the effort you put into this. To fully support the UUID type, it looks like we'll need to wait for a new Arrow release (> 20.0.0). In the meantime, I’ll continue working on the test cases for my commits.

@Fokko
Contributor

Fokko commented May 16, 2025

@DinGo4DEV Yes, please do. My biggest concern is that we produce Parquet files that will not be supported by other implementations because of the missing logical annotation. Arrow releases pretty often, so it can be resolved within a reasonable timespan.

@Fokko
Contributor

Fokko commented May 16, 2025

@DinGo4DEV Good news, it looks like this is fixed in the next release of Arrow: apache/arrow#45866

@DinGo4DEV
Author

@Fokko TBR, after running the test case, I found that the identity transform of UUID is not supported for writing, because the value is bytes. So I tried rewriting the Avro writer and other related components.

  • The UUID is still stored as bytes in Parquet
  • The identity partition value is now the hex representation ec9b663b-062f-4200-a130-8de19c21b800 instead of the bytes string value b'\xec\x9bf;\x06/B\x00\xa10\x8d\xe1\x9c!\xb8\x00'
data/uuid_bucket=0/uuid_identity=ec9b663b-062f-4200-a130-8de19c21b800
  |- xxxxx.parquet
data/uuid_bucket=1/uuid_identity=5f473c64-dbeb-449b-bdfa-b6b4185b1bde
  |- xxxxx.parquet
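For reference, the hex form in the partition path is just the standard string form of the same 16 big-endian bytes; Python's stdlib `uuid` module converts between the two. A small illustration (not code from this PR):

```python
import uuid

# The raw 16-byte value from the example above, as stored in Parquet
raw = b"\xec\x9bf;\x06/B\x00\xa10\x8d\xe1\x9c!\xb8\x00"

u = uuid.UUID(bytes=raw)
print(str(u))          # ec9b663b-062f-4200-a130-8de19c21b800
print(u.bytes == raw)  # True: the conversion round-trips
```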

I'm not sure whether that is correct and compatible with other integrations, as I haven't tried partitioning a UUID with the identity transform in other projects before.
However, if this PR is accepted, we will still need to rewrite other test cases related to UUID.
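For context on the `uuid_bucket=` values above: per the Iceberg spec, the bucket transform hashes a UUID's 16 big-endian bytes with 32-bit Murmur3 (x86 variant, seed 0) and takes the positive hash modulo the bucket count. A self-contained sketch of that computation (my own illustration of the spec's rule, not PyIceberg's actual implementation):

```python
import uuid

def murmur3_32(data: bytes, seed: int = 0) -> int:
    """32-bit Murmur3 (x86 variant), the hash the Iceberg spec requires for bucketing."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed
    n = len(data)
    # Body: process 4-byte little-endian blocks
    for i in range(0, n - n % 4, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
        h = ((h << 13) | (h >> 19)) & 0xFFFFFFFF
        h = (h * 5 + 0xE6546B64) & 0xFFFFFFFF
    # Tail: remaining 1-3 bytes
    k = 0
    tail = data[n - n % 4:]
    for i, b in enumerate(tail):
        k |= b << (8 * i)
    if tail:
        k = (k * c1) & 0xFFFFFFFF
        k = ((k << 15) | (k >> 17)) & 0xFFFFFFFF
        k = (k * c2) & 0xFFFFFFFF
        h ^= k
    # Finalization mix
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & 0xFFFFFFFF
    h ^= h >> 16
    return h

def uuid_bucket(value: uuid.UUID, num_buckets: int) -> int:
    """Bucket partition value: positive 32-bit hash of the 16 big-endian bytes, mod N."""
    return (murmur3_32(value.bytes) & 0x7FFFFFFF) % num_buckets

u = uuid.UUID("ec9b663b-062f-4200-a130-8de19c21b800")
print(uuid_bucket(u, 2))  # deterministic for a given UUID and bucket count
```

The key point for this PR is that the hash is computed over the raw 16 bytes, while the *identity* partition value is rendered as the hex string, which is why the two representations diverge.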

@DinGo4DEV
Author

A similar problem was also noticed in apache/iceberg#13087.

Successfully merging this pull request may close these issues.

UUIDType with BucketTransform incorrectly converts int to str in PartitionKey
4 participants