Iceberg add_files procedure with partition_filter scan non needed folders #7027

Closed
sweetpythoncode opened this issue Mar 6, 2023 · 10 comments

sweetpythoncode commented Mar 6, 2023

Apache Iceberg version

1.1.0 (latest release)

Query engine

Spark

Please describe the bug 🐞

source structure example: s3://bucket/data/id=123/name=test/date=321/result.orc

CALL iceberg_catalog.system.add_files(
    table => 'test.test_name',
    source_table => '`orc`.`s3://bucket/data/`',
    partition_filter => map('id', '3'),
    check_duplicate_files => false
)

The partition_filter option does not take the partition order into account, so nested folders are scanned until the first match is found. Should we apply the filter partition by partition, in order, before running the nested "Listing leaf files and directories" step?

Example of current flow:

s3://bucket/data/id=1/name=test/date=321/result.orc -> Listing leaf files and directories on each sub folder 
s3://bucket/data/id=2/name=test/date=321/result.orc -> Listing leaf files and directories on each sub folder
s3://bucket/data/id=3/name=test/date=321/result.orc -> Match needed partition_filter
s3://bucket/data/id=4/name=test/date=321/result.orc -> Listing leaf files and directories on each sub folder
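
The ordered filtering asked about above can be sketched roughly as follows. This is only an illustration of the idea, not Iceberg's or Spark's actual listing code: `prune_partition_dirs`, `build_fake_tree`, and the in-memory directory tree are hypothetical names made up for this example, and the tree stands in for S3 listing calls.

```python
from typing import Callable, Dict, Iterable, List


def prune_partition_dirs(
    root: str,
    partition_columns: List[str],
    partition_filter: Dict[str, str],
    list_dirs: Callable[[str], Iterable[str]],
) -> List[str]:
    """Walk the partition levels in order and descend only into matching dirs."""
    candidates = [root.rstrip("/")]
    for column in partition_columns:
        wanted = partition_filter.get(column)  # None means "no filter on this level"
        next_level = []
        for parent in candidates:
            for child in list_dirs(parent):
                name = child.rstrip("/").rsplit("/", 1)[-1]
                key, sep, value = name.partition("=")
                if sep != "=" or key != column:
                    continue  # not a `column=value` directory for this level
                if wanted is not None and value != wanted:
                    continue  # pruned: its sub folders are never listed
                next_level.append(child.rstrip("/"))
        candidates = next_level
    return candidates


def build_fake_tree() -> Dict[str, List[str]]:
    """Hypothetical stand-in for the bucket layout from the issue (ids 1 to 4)."""
    tree: Dict[str, List[str]] = {}
    for i in ("1", "2", "3", "4"):
        parts = ["s3://bucket/data", f"id={i}", "name=test", "date=321"]
        for depth in range(1, len(parts)):
            parent = "/".join(parts[:depth])
            child = "/".join(parts[: depth + 1])
            children = tree.setdefault(parent, [])
            if child not in children:
                children.append(child)
    return tree


fake_tree = build_fake_tree()
listed_paths: List[str] = []  # records every directory that gets listed


def fake_list_dirs(path: str) -> List[str]:
    listed_paths.append(path)
    return fake_tree.get(path, [])


matched = prune_partition_dirs(
    "s3://bucket/data/", ["id", "name", "date"], {"id": "3"}, fake_list_dirs
)
print(matched)
print(listed_paths)
```

With the filter on `id`, only the root and the folders under `id=3` are listed in this sketch; the `id=1`, `id=2`, and `id=4` subtrees are skipped entirely.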

Also, if the table is partitioned by id, name, date and I specify:

CALL iceberg_catalog.system.add_files(
    table => 'test.test_name',
    source_table => '`orc`.`s3://bucket/data/id=1/name=test/`',
    check_duplicate_files => false
)

Iceberg ignores the partitions that are part of the source path and sets them to null in the table instead of extracting their values from the path. In Spark this is handled by setting basePath before the partitions are read, but here InMemoryFileIndex is used. Is there no way to do the same?
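
To make the second problem concrete, here is a minimal sketch of how partition values depend on the base path they are resolved against. It is an illustration only, not the code used by add_files or by Spark: `partition_values_from_path` is a hypothetical helper, and it ignores details such as escaped characters in Hive-style paths.

```python
from typing import Dict


def partition_values_from_path(path: str, base_path: str) -> Dict[str, str]:
    """Collect `key=value` directory segments found below base_path."""
    base = base_path.rstrip("/") + "/"
    if not path.startswith(base):
        raise ValueError(f"{path!r} is not under {base!r}")
    relative = path[len(base):]
    values: Dict[str, str] = {}
    for segment in relative.split("/"):
        key, sep, value = segment.partition("=")
        if sep and key:  # skips the file name and any non-partition segment
            values[key] = value
    return values


file_path = "s3://bucket/data/id=1/name=test/date=321/result.orc"

# Resolved against the table root, all three partition values are found.
from_root = partition_values_from_path(file_path, "s3://bucket/data/")

# Resolved against the sub folder passed as source_table, id and name are lost.
from_subfolder = partition_values_from_path(
    file_path, "s3://bucket/data/id=1/name=test/"
)

print(from_root)
print(from_subfolder)
```

In this sketch the second call only sees `date`, which matches the behavior described above where `id` and `name` end up null unless the root is used as the base.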

cc @RussellSpitzer @szehon-ho

@szehon-ho
Collaborator

Did you try a full partition filter? Also cc @abmo-x @dramaticlly, who are expert users of add_files, in case they have some experience with this. It's been a while since I looked at this.

@abmo-x
Contributor

abmo-x commented Mar 15, 2023

@sweetpythoncode Yes, I have come across both of the issues you identified. Partition listing can be improved, and when the provided path is not the root path (the path without partition info), partition values are not extracted correctly. I am working on both and will let you know once I have a PR out.

@sweetpythoncode
Author

sweetpythoncode commented Mar 16, 2023

@szehon-ho Yes, I tried, but the filter is not used while the path is being listed. The full data is scanned anyway before the filter is applied, based on the code I found in the getPartitions method used by the add_files procedure (I also debugged it to check that this is true).
@abmo-x Sweet, I'll wait for it. Thank you so much!

@sweetpythoncode
Author

@abmo-x Hello, any progress on this one? Thanks.

@abmo-x
Contributor

abmo-x commented Apr 6, 2023

Hi @sweetpythoncode, I was testing it right now :) I hope to have a PR soon. Will keep you posted.

@sweetpythoncode
Author

@abmo-x Wow! Thank you so much!

@abmo-x
Contributor

abmo-x commented Apr 15, 2023

@sweetpythoncode
The PR is now available: #7363. Can you test it and help review?

@siumingdev

Will this be improved, or are there suggestions for alternative ways of doing the equivalent, say using the Java API alone without Spark?


This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

@github-actions github-actions bot added the stale label Aug 28, 2024

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

@github-actions github-actions bot closed this as not planned Sep 12, 2024