
Filter rows directly from pa.RecordBatch #1621


Merged
merged 15 commits into from
Feb 13, 2025

Conversation

gabeiglio
Contributor

@gabeiglio gabeiglio commented Feb 7, 2025

This PR from Apache Arrow was merged, allowing filtering with a boolean expression directly on a pa.RecordBatch.

I believe PyIceberg is currently using PyArrow version 19.0.0.
Filtering a pa.RecordBatch was introduced in Python in version 17.0.0.

I have not run the integration tests because my Docker setup is broken for some reason. I believe this test should cover this change:

def test_read_multiple_batches_in_task_with_position_deletes(spark: SparkSession, session_catalog: RestCatalog) -> None:

Closes #1050

@Fokko
Contributor

Fokko commented Feb 7, 2025

Thanks for fixing this @gabeiglio

In addition, I think we also need to bump the minimal version of Arrow here:

pyarrow = { version = ">=14.0.0,<20.0.0", optional = true }
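If the minimum is bumped to 17.0.0 as suggested, the pyproject.toml constraint would presumably read something like (upper bound kept as-is):

```toml
pyarrow = { version = ">=17.0.0,<20.0.0", optional = true }
```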

@kevinjqliu
Contributor

thanks for following up on that comment 😄

if we're bumping minimum pyarrow version to 17, we might want to address this comment as well
https://github.com/apache/iceberg-python/pull/1621/files#diff-8d5e63f2a87ead8cebe2fd8ac5dcf2198d229f01e16bb9e06e21f7277c328abdR1335-R1338

@gabeiglio
Contributor Author

@kevinjqliu IIUC, removing the schema casting will allow the PyArrow scanner to infer by itself whether or not it needs large types? So it is basically a matter of changing the test assertions to match the types returned by the scan?

@kevinjqliu
Contributor

I believe so. We can also do this in a follow-up PR! I just saw that comment during code review.

@kevinjqliu
Contributor

Looks like there's an issue in the CI tests

@gabeiglio
Contributor Author

Yes, I think it would be better to split these changes into separate PRs, since there are a lot of changes to be made, especially to tests. (If that's okay, I'll open the other PR for the schema casting @kevinjqliu @Fokko)

Contributor

@Fokko Fokko left a comment


One minor comment, looks great, and so much cleaner :)

@@ -1348,33 +1348,34 @@ def _task_to_record_batches(
next_index = 0
batches = fragment_scanner.to_batches()
for batch in batches:
Contributor


nit, I think we can rename the batch variable:

Suggested change
for batch in batches:
for current_batch in batches:

Contributor

@kevinjqliu kevinjqliu left a comment


looks like CI is broken on Poetry

Comment on lines 1351 to 1352
current_index = next_index
next_index = current_index + len(batch)
Contributor


is this logically equivalent? feels like there was a reason to write it the other way.

cc @sungwy do you have context on this?
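For context, a toy sketch of the bookkeeping being discussed, with plain Python lists standing in for record batches (the variable names follow the diff; the data is illustrative):

```python
# next_index tracks the running global row offset across batches, so
# that file-level positional deletes can be mapped onto each batch's
# local row range [current_index, next_index).
batches = [[10, 11], [12, 13, 14], [15]]  # stand-ins for RecordBatches

next_index = 0
ranges = []
for batch in batches:
    current_index = next_index
    next_index = current_index + len(batch)
    ranges.append((current_index, next_index))

print(ranges)  # [(0, 2), (2, 5), (5, 6)]
```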

Contributor Author


Oh, I wasn't planning on pushing this change 🤦. I'll revert it in the next commit if we want.

@kevinjqliu
Contributor

@gabeiglio when you get a chance, could you rebase and also fix the change with next_index?

@kevinjqliu
Contributor

there's a conflict with poetry.lock; here's something I found that worked:

git fetch origin main
git merge origin/main -X theirs
poetry lock

This will ignore the conflict and regenerate the Poetry lock file.

Contributor

@kevinjqliu kevinjqliu left a comment


LGTM! Do you mind double-checking that we have sufficient test coverage for the changes in this PR? Since it's part of the read path, I want to be extra careful here.

@gabeiglio
Contributor Author

Sounds good, let me check!

@gabeiglio
Contributor Author

gabeiglio commented Feb 12, 2025

  • 32 tests in tests_reads.py test filtering: case sensitive and insensitive, multiple fields, one field, empty fields, nulls, etc.
  • 4 tests in test_partitions.py test filtering partitioned tables and also cover delete files (this makes sure the filter is not skipped when applying deletes)
  • 4 tests in test_pyarrow also test deletes + filtering

There could be more, but I'm thinking this is enough, wdyt @kevinjqliu?

Contributor

@kevinjqliu kevinjqliu left a comment


LGTM! Thanks @gabeiglio for the contribution and for double-checking the tests. The logic here is fairly straightforward after reverting the current_index change.
Thanks @Fokko for the review!

@kevinjqliu kevinjqliu merged commit 6d1c30c into apache:main Feb 13, 2025
7 checks passed
@gabeiglio
Contributor Author

Thanks for the help! @kevinjqliu


Successfully merging this pull request may close these issues.

[feat] push down filters and positional deletes to the record batch level
3 participants