Make ListingTableUrl::try_new public #15250

Merged on Mar 16, 2025 (1 commit)

@linhr linhr commented Mar 15, 2025

Which issue does this PR close?

N/A

Rationale for this change

I spent some time looking into #7393. The simple cases can be supported in a few lines of code (e.g. parsing s3://bucket/key/*.parquet into a base URL s3://bucket/key/ and a glob pattern *.parquet). But I soon realized that it is unclear how the broader cases should be handled.

  1. For http or https schemes, glob patterns in the URL may not make sense, since many HTTP servers do not support listing files under a given directory. Also, the ? character before the query should not be treated as a glob pattern.
  2. Special characters (*, ?, etc.) in the authority (host name, username, or password) probably should not be treated as glob patterns. However, if we want to support glob patterns in URL paths only, we run into the problem that the parsed URL does not provide access to the raw path (see the enhancement request for path_raw(), servo/rust-url#602). So when using the url crate, we cannot recover the glob pattern as it appeared in the original URL string, before percent-encoding.
  3. Some applications use a different glob syntax. For example, Spark uses ^ instead of ! for a negated character class, and it also supports alternation (e.g. {a,b}), which the glob crate does not support. I'm not sure we want these capabilities in DataFusion, since the expected behavior of URL glob patterns may differ across downstream projects.
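
To make the "simple case" and the first two caveats concrete, here is a minimal, dependency-free sketch of how an application might split a URL string before handing the pieces to DataFusion. The function name `split_url_glob` and its exact rules (no globs for http/https, authority skipped, split at the last `/` before the first glob character) are assumptions made for illustration; they are not part of this PR or of DataFusion.

```rust
/// Characters treated as glob metacharacters in this sketch (an assumption;
/// an application may choose a different set).
const GLOB_CHARS: [char; 3] = ['*', '?', '['];

/// Hypothetical helper: split a URL string into a base URL and an optional
/// glob pattern. Only the path is inspected, never the authority.
fn split_url_glob(input: &str) -> (String, Option<String>) {
    // Without a scheme this sketch does not attempt any splitting.
    let Some(scheme_end) = input.find("://") else {
        return (input.to_string(), None);
    };
    let scheme = &input[..scheme_end];
    // Caveat 1: '?' starts the query for http(s), and listing is often unsupported.
    if scheme.eq_ignore_ascii_case("http") || scheme.eq_ignore_ascii_case("https") {
        return (input.to_string(), None);
    }
    // Caveat 2: skip the authority so that '*' or '?' in it is left alone.
    let after_scheme = scheme_end + 3;
    let Some(path_offset) = input[after_scheme..].find('/') else {
        return (input.to_string(), None);
    };
    let path_start = after_scheme + path_offset;
    let path = &input[path_start..];
    // Find the first glob metacharacter in the path, if there is one.
    let Some(glob_pos) = path.find(|c: char| GLOB_CHARS.contains(&c)) else {
        return (input.to_string(), None);
    };
    // Split right after the last '/' that precedes the first metacharacter.
    let split = path[..glob_pos].rfind('/').map_or(0, |i| i + 1);
    let base = format!("{}{}", &input[..path_start], &path[..split]);
    (base, Some(path[split..].to_string()))
}
```

With these rules, `split_url_glob("s3://bucket/key/*.parquet")` yields the base `s3://bucket/key/` and the pattern `*.parquet`, while a URL whose only special characters sit in the authority is returned unchanged with no pattern. Because the split works on the raw string, it sidesteps the percent-encoding problem described in item 2, at the cost of doing no URL validation of its own.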

Given the analysis above, I feel the best workaround for now is to allow applications to construct ListingTableUrl directly from a base URL (with no interpretation of glob) and an optional glob pattern. How the base URL and the glob pattern are parsed is determined by the application before constructing ListingTableUrl.
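
As one illustration of application-defined parsing, a downstream project that accepts Spark-style patterns could normalize them into patterns the glob crate understands before constructing ListingTableUrl. The sketch below is an assumption-laden example, not part of this PR: the function names are hypothetical, only non-nested `{a,b}` groups are handled, and `[^` is rewritten to `[!` with a plain string replacement.

```rust
/// Hypothetical helper: translate a Spark-style glob into one or more
/// patterns using only syntax supported by the glob crate.
fn expand_spark_glob(pattern: &str) -> Vec<String> {
    // Spark negates a character class with '^'; the glob crate uses '!'.
    let normalized = pattern.replace("[^", "[!");
    expand_braces(&normalized)
}

/// Expand non-nested `{a,b}` alternations into separate patterns.
/// A pattern without a complete `{...}` group is returned unchanged.
fn expand_braces(pattern: &str) -> Vec<String> {
    let Some(open) = pattern.find('{') else {
        return vec![pattern.to_string()];
    };
    let Some(close_offset) = pattern[open..].find('}') else {
        return vec![pattern.to_string()];
    };
    let close = open + close_offset;
    let prefix = &pattern[..open];
    let suffix = &pattern[close + 1..];
    // Substitute each alternative, then recurse to expand any later groups.
    pattern[open + 1..close]
        .split(',')
        .flat_map(|alt| expand_braces(&format!("{prefix}{alt}{suffix}")))
        .collect()
}
```

Each resulting string could then be compiled into a glob pattern and paired with the base URL, producing one ListingTableUrl per alternative. Whether that fan-out is acceptable is exactly the kind of decision this PR leaves to the application.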

What changes are included in this PR?

ListingTableUrl::try_new is now public. I also updated its documentation to explain when this method could be useful.

Are these changes tested?

N/A

Are there any user-facing changes?

ListingTableUrl::try_new is now public. This change is backward compatible.

@github-actions github-actions bot added the datasource Changes to the datasource crate label Mar 15, 2025
@comphead comphead left a comment

Thanks @linhr
I feel creational methods should mostly be public, although this method can easily be reproduced on the client side.

@alamb alamb merged commit e4bf951 into apache:main Mar 16, 2025
28 checks passed
alamb commented Mar 16, 2025

Thanks @linhr and @comphead

@linhr linhr deleted the listing-table-url-try-new branch March 17, 2025 07:00