Make `ListingTableUrl::try_new` public
#15250
Merged
Which issue does this PR close?
N/A
Rationale for this change
I spent some time looking into #7393. It seems the simple cases can be supported in a few lines of code (e.g. parsing `s3://bucket/key/*.parquet` into a base URL `s3://bucket/key/` and a glob pattern `*.parquet`). But I soon realized there is no clear idea of how broader cases should be handled.
- For `http` or `https` schemes, glob patterns in the URL may not make sense, since many HTTP servers do not support listing files under a given directory. Also, the `?` character before the query should not be treated as a glob pattern.
- Special characters (`*`, `?`, etc.) in the authority (host name, username, or password) probably should not be treated as glob patterns. However, if we want to support glob patterns in URL paths only, we face the difficulty that the parsed URL does not provide access to the raw path (enhancement request: path_raw() servo/rust-url#602). So we cannot recover the glob pattern (before percent-encoding) from the original URL string when using the `url` crate.
- Glob syntax also varies: some dialects use `^` instead of `!` for a negated character class and also support alternation (e.g. `{a,b}`), which is not supported by the `glob` crate. I'm not sure we want this kind of capability in DataFusion, since the expected behavior for URL glob patterns may differ in downstream projects.
Given the analysis above, I feel the best workaround for now is to allow applications to construct `ListingTableUrl` directly from a base URL (with no interpretation of glob) and an optional glob pattern. How the base URL and the glob pattern are parsed is determined by the application before constructing `ListingTableUrl`.
What changes are included in this PR?
`ListingTableUrl::try_new` is now public. I also updated its documentation to explain when this method could be useful.
Are these changes tested?
N/A
Are there any user-facing changes?
`ListingTableUrl::try_new` is now public. This change is backward compatible.
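To illustrate the application-side parsing the rationale refers to, here is a minimal sketch using only the standard library. The helper `split_base_and_glob` is hypothetical and not part of DataFusion. It assumes that only `*`, `?`, and `[` count as glob metacharacters, and that `http`/`https` URLs are never interpreted as globs. An application would then parse the two parts itself and pass them to `ListingTableUrl::try_new`.

```rust
/// Hypothetical application-side helper (not part of DataFusion): split a URL
/// string into a base URL ending in '/' and an optional glob pattern.
/// Assumption: only `*`, `?` and `[` are treated as glob metacharacters.
fn split_base_and_glob(input: &str) -> (String, Option<String>) {
    // For HTTP(S), `?` introduces the query and listing is often unsupported,
    // so this sketch does not interpret globs for these schemes at all.
    if input.starts_with("http://") || input.starts_with("https://") {
        return (input.to_string(), None);
    }
    // Find the first glob metacharacter, if any.
    let meta = match input.find(|c: char| matches!(c, '*' | '?' | '[')) {
        Some(index) => index,
        None => return (input.to_string(), None),
    };
    // The base URL ends at the last '/' before that metacharacter;
    // everything after it is kept verbatim as the glob pattern.
    match input[..meta].rfind('/') {
        Some(slash) => (
            input[..=slash].to_string(),
            Some(input[slash + 1..].to_string()),
        ),
        // No '/' before the metacharacter: leave the input uninterpreted.
        None => (input.to_string(), None),
    }
}

fn main() {
    let (base, glob) = split_base_and_glob("s3://bucket/key/*.parquet");
    println!("base = {base}, glob = {glob:?}");
}
```

This sketch does not address the harder cases listed in the rationale, such as metacharacters in the authority or percent-encoded paths, which is why the parsing policy is left to each application.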