Iceberg add_files procedure with partition_filter scan non needed folders #7027
Comments
Did you try a full partition filter? Also cc @abmo-x @dramaticlly, who are expert users of add_files, in case they have some experience with this. It's been a while since I looked at this |
@sweetpythoncode yes, I have come across both issues identified. Partition listing can be improved and when a path is provided which is not the root path without partition info, then partition values are not extracted correctly. I am working on them and will let you know once I have a PR out |
@szehon-ho yes, I tried, but it does not use the predicate filter in the dest path. It scans the full data anyway before applying that filter, based on the code which I found in |
@abmo-x Hello, any progress on this one, thanks? |
Hi @sweetpythoncode I was testing it right now :) hope to have a PR soon. Will keep you posted. |
@abmo-x Wow! Thank you so much! |
@sweetpythoncode |
Will this be improved? If not, are there any suggestions for alternative ways of doing the equivalent, say using the Java API alone without Spark? |
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible. |
This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale' |
Apache Iceberg version
1.1.0 (latest release)
Query engine
Spark
Please describe the bug 🐞
Source structure example:
s3://bucket/data/id=123/name=test/date=321/result.orc
The partition_filter option does not take the order of the partitions into account, so nested folders are scanned until the first match is found. Should we filter by partition, in order, before running nestedListing of leaf files and directories?
Example of current flow:
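As a rough, hypothetical illustration of the question above (this is not Iceberg's or Spark's actual listing code), the Python sketch below compares two strategies on an in-memory stand-in for the bucket: walking every folder and filtering at the leaves, versus checking the filter at each partition level and skipping subtrees that cannot match. The names FILES, list_children and walk are made up for this example.

```python
from typing import Dict, List

# Hypothetical stand-in for an object store: a flat list of file keys.
FILES = [
    "data/id=123/name=test/date=321/result.orc",
    "data/id=123/name=other/date=321/result.orc",
    "data/id=456/name=test/date=321/result.orc",
    "data/id=456/name=test/date=999/result.orc",
]


def list_children(prefix: str, listed: List[str]) -> List[str]:
    """Return the immediate children of prefix and record the listing call."""
    listed.append(prefix)
    children = set()
    for key in FILES:
        if key.startswith(prefix + "/"):
            first = key[len(prefix) + 1:].split("/")[0]
            children.add(prefix + "/" + first)
    return sorted(children)


def partition_values(path: str) -> Dict[str, str]:
    """Parse col=value folder segments out of a path."""
    values = {}
    for segment in path.split("/"):
        if "=" in segment:
            col, value = segment.split("=", 1)
            values[col] = value
    return values


def walk(path: str, part_filter: Dict[str, str],
         listed: List[str], prune: bool) -> List[str]:
    """Collect matching files; with prune=True, skip non-matching folders early."""
    matches = []
    for child in list_children(path, listed):
        if child in FILES:
            # Leaf file: apply the full filter to the values found in its path.
            values = partition_values(child)
            if all(values.get(c) == v for c, v in part_filter.items()):
                matches.append(child)
            continue
        name = child.rsplit("/", 1)[1]
        if prune and "=" in name:
            col, value = name.split("=", 1)
            if col in part_filter and part_filter[col] != value:
                continue  # this subtree cannot match, so do not list it
        matches.extend(walk(child, part_filter, listed, prune))
    return matches


part_filter = {"id": "123", "name": "test"}
full_listed: List[str] = []
pruned_listed: List[str] = []
full = walk("data", part_filter, full_listed, prune=False)
pruned = walk("data", part_filter, pruned_listed, prune=True)
print(len(full_listed), "listings without pruning;", len(pruned_listed), "with pruning")
```

Both strategies return the same files in this toy layout; the difference is only in how many folders have to be listed to find them.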
Also, if I have partition_by id, name, date in the table and specify
Iceberg will ignore these partitions and set them as null in the table instead of pulling the values from the path. In Spark this is handled by basePath before the partitions are read, but here InMemoryFileIndex is used without the possibility to do that?
cc @RussellSpitzer @szehon-ho
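To make the basePath point concrete, here is a minimal Python sketch of how partition values can be derived from a file path relative to a base path. It is an assumption-laden illustration, not what Spark or Iceberg do internally, and the helper name values_relative_to is hypothetical.

```python
from typing import Dict


def values_relative_to(file_path: str, base_path: str) -> Dict[str, str]:
    """Parse col=value folders that lie between base_path and the file name."""
    base = base_path.rstrip("/") + "/"
    if not file_path.startswith(base):
        raise ValueError(f"{file_path} is not under {base_path}")
    relative = file_path[len(base):]
    values = {}
    # Skip the last segment, which is the file name itself.
    for segment in relative.split("/")[:-1]:
        if "=" in segment:
            col, value = segment.split("=", 1)
            values[col] = value
    return values


path = "s3://bucket/data/id=123/name=test/date=321/result.orc"

# Base path at the data root: all three partition columns are recovered.
from_root = values_relative_to(path, "s3://bucket/data")

# Base path at a nested folder: id and name are no longer visible.
from_nested = values_relative_to(path, "s3://bucket/data/id=123/name=test")

print(from_root)
print(from_nested)
```

When the nested folder is treated as the root, only date can be read from the path, which would be consistent with the other partition columns ending up as null.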