Support Glob Expressions for S3 #7393

Open
@a-agmon

Describe the bug

I'm trying to register a listing table using a glob string:
s3://somebucket/somepath/*44fad0765ac6-00001.parquet
and I receive the following error:

Error: Custom { kind: Other, error: 
"Object Store error: Object at location
 somepath/%2A44fad0765ac6-00001.parquet not found: 
response error \"No Body\", after 0 retries: HTTP status client error (404 Not Found) 
for url (https://s3.eu-west-1.amazonaws.com/somebucket/somepath/%252A44fad0765ac6-00001.parquet)" }

However, I know the file is there, because the same operation succeeds when I use the full file name:
s3://somebucket/somepath/00000-2367-2918fbc9-ea04-4927-a669-44fad0765ac6-00001.parquet

From the error, I suspect this is related to how the URL and the glob character are encoded. It looks like the * is being encoded twice for some reason: * -> %2A -> %252A
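
To illustrate the suspected double encoding, here is a minimal, standard-library-only sketch. The `percent_encode` helper is hypothetical and written only for this illustration; it is not the encoder that `url`, `object_store`, or DataFusion actually use, and the set of characters it leaves unescaped is an assumption. It only shows that applying percent-encoding twice to the object key reproduces the two forms seen in the error message.

```rust
/// Hypothetical helper for illustration only (not a library API).
/// Percent-encodes every byte except ASCII letters, digits, and `-._~/`.
fn percent_encode(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for &b in input.as_bytes() {
        match b {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' | b'/' => {
                out.push(b as char)
            }
            // Everything else, including `*` and `%`, becomes %XX.
            _ => out.push_str(&format!("%{:02X}", b)),
        }
    }
    out
}

fn main() {
    let key = "somepath/*44fad0765ac6-00001.parquet";
    // First pass: `*` becomes `%2A` (the "location" in the error).
    let once = percent_encode(key);
    // Second pass: the `%` itself becomes `%25`, giving `%252A` (the request URL).
    let twice = percent_encode(&once);
    println!("{once}");
    println!("{twice}");
}
```

If this is what happens, the glob character is being treated as a literal part of the object key rather than as a pattern, and the key is then escaped once when the path is built and again when the request URL is built.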

To Reproduce

    use std::sync::Arc;

    use datafusion::datasource::file_format::parquet::ParquetFormat;
    use datafusion::datasource::listing::{
        ListingOptions, ListingTable, ListingTableConfig, ListingTableUrl,
    };
    use datafusion::sql::TableReference;

    // `ctx` is a SessionContext with the S3 object store registered;
    // `bucket_name` holds the bucket name.
    let glob_str = "*44fad0765ac6-00001.parquet";
    let table_files = String::from("somepath");

    // URL containing the glob: this is the one that fails.
    let table_files_path = format!("s3://{bucket_name}/{table_files}/{glob_str}");
    let table_files_url = ListingTableUrl::parse(&table_files_path)?;

    // URL of a single concrete file, used only to infer the schema.
    let schema_file_path = format!("s3://{bucket_name}/{table_files}/00000-2367-2918fbc9-ea04-4927-a669-44fad0765ac6-00001.parquet");
    let schema_file_url = ListingTableUrl::parse(&schema_file_path)?;
    println!("table_files_url: {}", table_files_path);

    let file_format = ParquetFormat::default()
        .with_enable_pruning(Some(true));
    let listing_options = ListingOptions::new(Arc::new(file_format))
        .with_file_extension(".parquet");

    let schema = listing_options.infer_schema(&ctx.state(), &schema_file_url).await?;

    let table_config =
        ListingTableConfig::new(table_files_url) // the path of the files
            .with_listing_options(listing_options)
            .with_schema(schema);

    let table_provider = Arc::new(ListingTable::try_new(table_config).unwrap());

    ctx.register_table(TableReference::from("users"), table_provider).unwrap();
    let df = ctx.sql("SELECT count(*) FROM users").await?;
    df.show().await?;


Expected behavior

The files that match the glob string should be registered as a table.
In the code above, if I replace the ListingTableUrl built from the glob string with the one used for the schema file, the code executes successfully (so this is not a permissions issue).
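
As a rough sketch of the behavior I would expect, the URL could be split into a literal prefix (to list from the object store) and a glob pattern (to filter the listed keys). The functions `split_glob` and `matches_leading_star` below are hypothetical names introduced only for this illustration; they are not DataFusion or `object_store` APIs, and the matcher handles only a single leading `*`, which is all my pattern needs. The second file name in `main` is made up for the example.

```rust
/// Hypothetical helper: split a path into the literal prefix up to the last `/`
/// before the first glob character, and the remaining glob pattern (if any).
fn split_glob(path: &str) -> (&str, Option<&str>) {
    match path.find(|c: char| matches!(c, '*' | '?' | '[')) {
        None => (path, None),
        Some(idx) => {
            // Cut just after the last '/' that precedes the first glob character.
            let cut = path[..idx].rfind('/').map(|i| i + 1).unwrap_or(0);
            (&path[..cut], Some(&path[cut..]))
        }
    }
}

/// Hypothetical matcher that supports only a single leading `*`.
/// Any other glob syntax in the pattern is compared literally.
fn matches_leading_star(glob: &str, file_name: &str) -> bool {
    match glob.strip_prefix('*') {
        Some(suffix) => file_name.ends_with(suffix),
        None => file_name == glob,
    }
}

fn main() {
    let url = "s3://somebucket/somepath/*44fad0765ac6-00001.parquet";
    let (prefix, glob) = split_glob(url);
    println!("list under: {prefix}");

    // File names as they might come back from listing the prefix.
    let listed = [
        "00000-2367-2918fbc9-ea04-4927-a669-44fad0765ac6-00001.parquet",
        "some-other-file.parquet", // made-up name for the example
    ];
    if let Some(glob) = glob {
        for name in listed.iter().filter(|n| matches_leading_star(glob, n)) {
            println!("matched: {name}");
        }
    }
}
```

With this split, only the literal prefix would ever be sent to S3, so the `*` would never need to be URL-encoded at all.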

Additional context

My Cargo.toml dependencies:

aws-config = "0.55.3"
aws-sdk-s3 = "0.28.0"
object_store = { version = "0.6.1", features = ["aws"] }
url = "2.4.0"  
async-graphql = "6.0.4"
async-graphql-actix-web = { version = "6.0.4"}
actix-web =  { version = "4.3.1" }
datafusion = "28.0.0"
tokio = { version = "1.32.0", features = ["full"] }

Labels: enhancement (New feature or request)