Iceberg add_files procedure with partition_filter scan non needed folders #7027

sweetpythoncode · 2023-03-06T15:54:26Z

Apache Iceberg version

1.1.0 (latest release)

Query engine

Spark

Please describe the bug 🐞

source structure example: s3://bucket/data/id=123/name=test/date=321/result.orc

CALL iceberg_catalog.system.add_files(
    table => 'test.test_name',
    source_table => '`orc`.`s3://bucket/data/`',
    partition_filter => map('id', '3'),
    check_duplicate_files => false
);

The partition_filter option does not take the order of the partitions into account, so nested folders are scanned even when their parent folder does not match the filter. Should we apply the filter partition by partition, in order, before running the nested "Listing leaf files and directories" step?

Example of current flow:

s3://bucket/data/id=1/name=test/date=321/result.orc -> Listing leaf files and directories on each sub folder 
s3://bucket/data/id=2/name=test/date=321/result.orc -> Listing leaf files and directories on each sub folder
s3://bucket/data/id=3/name=test/date=321/result.orc -> Match needed partition_filter
s3://bucket/data/id=4/name=test/date=321/result.orc -> Listing leaf files and directories on each sub folder
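
To make the proposed ordering concrete, here is a minimal sketch of level-by-level pruning. It is not Iceberg's or Spark's code: `list_matching_leaf_dirs`, `build_tree`, and the `list_dir` callback are hypothetical names, and the bucket layout is simulated with an in-memory dict instead of real S3 listing. The idea is only that a folder failing the filter at its own level is never listed further.

```python
from typing import Callable, Dict, List


def list_matching_leaf_dirs(
    root: str,
    partition_columns: List[str],
    partition_filter: Dict[str, str],
    list_dir: Callable[[str], List[str]],
) -> List[str]:
    """Walk the partition levels in order and prune non-matching folders early.

    `list_dir` is a hypothetical callback that returns the child folder names
    of a path; it stands in for a real file system listing call.
    """
    current = [root.rstrip("/")]
    for column in partition_columns:
        wanted = partition_filter.get(column)  # None means "no filter on this level"
        next_level = []
        for directory in current:
            for child in list_dir(directory):
                key, _, value = child.partition("=")
                if key != column:
                    continue  # not a partition folder for this level
                if wanted is not None and value != wanted:
                    continue  # pruned here, so its sub folders are never listed
                next_level.append(f"{directory}/{child}")
        current = next_level
    return current


def build_tree() -> Dict[str, List[str]]:
    """Simulated layout: s3://bucket/data/id=<1..4>/name=test/date=321."""
    tree = {"s3://bucket/data": [f"id={i}" for i in range(1, 5)]}
    for i in range(1, 5):
        tree[f"s3://bucket/data/id={i}"] = ["name=test"]
        tree[f"s3://bucket/data/id={i}/name=test"] = ["date=321"]
    return tree


if __name__ == "__main__":
    tree = build_tree()
    listed: List[str] = []

    def fake_list_dir(path: str) -> List[str]:
        listed.append(path)  # record every listing call
        return tree.get(path, [])

    print(list_matching_leaf_dirs(
        "s3://bucket/data/", ["id", "name", "date"], {"id": "3"}, fake_list_dir))
    print(listed)
```

With the filter on id, only the root, id=3, and id=3/name=test are listed in this simulation; the id=1, id=2, and id=4 sub folders are skipped entirely.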

Also, if the table is partitioned by id, name, date and I specify

CALL iceberg_catalog.system.add_files(
    table => 'test.test_name',
    source_table => '`orc`.`s3://bucket/data/id=1/name=test/`',
    check_duplicate_files => false
);

Iceberg ignores these partitions and sets them to null in the table instead of deriving the values from the path. In Spark this is handled by basePath before the partitions are read, but here InMemoryFileIndex is used. Is there no possibility to do the same?
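
As an illustration of what deriving the values from the path could look like, here is a small sketch that reads Hive-style key=value segments relative to a base path. This is an assumption about the desired behavior, not the logic Iceberg or Spark actually uses; `partition_values_from_path` is a hypothetical helper, and it does not handle URL-encoded values.

```python
from typing import Dict


def partition_values_from_path(path: str, base_path: str) -> Dict[str, str]:
    """Collect key=value folder segments found in `path` below `base_path`."""
    base = base_path.rstrip("/")
    if path != base and not path.startswith(base + "/"):
        raise ValueError(f"{path!r} is not located under {base_path!r}")
    relative = path[len(base):].strip("/")
    values: Dict[str, str] = {}
    for segment in relative.split("/"):
        key, sep, value = segment.partition("=")
        if sep and key:  # skip file names and any segment without key=value form
            values[key] = value
    return values


if __name__ == "__main__":
    # The source path from the second call, resolved against the table root.
    print(partition_values_from_path(
        "s3://bucket/data/id=1/name=test/", "s3://bucket/data/"))
```

Resolving the sub folder against the root `s3://bucket/data/` recovers id and name even though the listing starts below them, which is the role basePath plays in the Spark reader.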

cc @RussellSpitzer @szehon-ho


szehon-ho · 2023-03-14T21:52:16Z

Did you try a full partition filter? Also cc @abmo-x @dramaticlly, who are expert users of add_files, in case they have some experience with this. It's been a while since I looked at it.

abmo-x · 2023-03-15T00:09:52Z

@sweetpythoncode yes, I have come across both of the issues you identified. Partition listing can be improved, and when the provided path is not the root path and carries no partition info, the partition values are not extracted correctly. I am working on both and will let you know once I have a PR out.

sweetpythoncode · 2023-03-16T12:14:44Z

@szehon-ho yes, I tried, but it does not apply the predicate filter while listing the path. It scans the full data anyway before applying the filter, based on the code I found in the getPartitions method used by the add_files procedure (I also debugged it to check that this is true).
@abmo-x Sweet, I'll wait for it. Thank you so much!

sweetpythoncode · 2023-04-06T06:55:41Z

@abmo-x Hello, is there any progress on this one? Thanks.

abmo-x · 2023-04-06T07:37:05Z

Hi @sweetpythoncode I was testing it right now :) hope to have a PR soon. Will keep you posted.

sweetpythoncode · 2023-04-06T08:01:34Z

@abmo-x Wow! Thank you so much!

abmo-x · 2023-04-15T07:33:41Z

@sweetpythoncode
The PR is now available. Can you test it and help review? #7363

siumingdev · 2023-11-28T15:59:50Z

Will this be improved, or could I get suggestions for alternative ways of doing the equivalent, say using the Java API alone without Spark?

github-actions · 2024-08-28T00:13:44Z

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions · 2024-09-12T00:14:53Z

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

sweetpythoncode mentioned this issue Mar 14, 2023

Add ignoreDuplicates option for add_files procedure #6306

Closed

abmo-x mentioned this issue Apr 17, 2023

Spark 3.2: Optimized add_files procedure's listPartitions #7363

Closed

sungwy mentioned this issue Mar 11, 2024

Add Data Files from Parquet Files to UnPartitioned Table apache/iceberg-python#506

Merged

github-actions bot added the stale label Aug 28, 2024

github-actions bot closed this as not planned Sep 12, 2024
