
Filter rows directly from pa.RecordBatch #1621


Merged
merged 15 commits into from
Feb 13, 2025

Conversation

gabeiglio
Contributor

@gabeiglio gabeiglio commented Feb 7, 2025

This PR from Apache Arrow was merged, allowing filtering with a boolean expression directly on a pa.RecordBatch.

I believe PyIceberg is currently using PyArrow version 19.0.0.
Filtering a pa.RecordBatch was introduced in Python in version 17.0.0.

I have not run the integration tests because my Docker setup is broken for some reason. I believe this test should cover this change:

def test_read_multiple_batches_in_task_with_position_deletes(spark: SparkSession, session_catalog: RestCatalog) -> None:

Closes #1050

@Fokko
Contributor

Fokko commented Feb 7, 2025

Thanks for fixing this @gabeiglio

In addition, I think we also need to bump the minimal version of Arrow here:

pyarrow = { version = ">=14.0.0,<20.0.0", optional = true }
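If the minimum is bumped to 17.0.0 as suggested, the pyproject.toml constraint would presumably read something like (upper bound kept as-is):

```toml
pyarrow = { version = ">=17.0.0,<20.0.0", optional = true }
```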

@kevinjqliu
Contributor

thanks for following up on that comment 😄

if we're bumping minimum pyarrow version to 17, we might want to address this comment as well
https://github.com/apache/iceberg-python/pull/1621/files#diff-8d5e63f2a87ead8cebe2fd8ac5dcf2198d229f01e16bb9e06e21f7277c328abdR1335-R1338

@gabeiglio
Contributor Author

@kevinjqliu IIUC, removing the schema casting will allow the PyArrow scanner to infer by itself whether or not it needs large types? So it is basically a matter of changing the test assertions to match the types returned by the scan?

@kevinjqliu
Contributor

I believe so. We can also do this in a follow-up PR! I just saw that comment during code review.

@kevinjqliu
Contributor

Looks like there's an issue in the CI tests

@gabeiglio
Contributor Author

Yes, I think it would be better to split these changes into separate PRs, since there are a lot of changes to be made, especially to tests. (If that's okay, I'll open the other PR for the schema casting @kevinjqliu @Fokko)

Contributor

@Fokko Fokko left a comment


One minor comment, looks great, and so much cleaner :)

@@ -1348,33 +1348,34 @@ def _task_to_record_batches(
next_index = 0
batches = fragment_scanner.to_batches()
for batch in batches:
Contributor


nit, I think we can rename the batch variable:

Suggested change
for batch in batches:
for current_batch in batches:

Contributor

@kevinjqliu kevinjqliu left a comment


looks like CI is broken on Poetry

Comment on lines 1351 to 1352
current_index = next_index
next_index = current_index + len(batch)
Contributor


is this logically equivalent? feels like there was a reason to write it the other way.

cc @sungwy do you have context on this?
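For context, a toy sketch of the bookkeeping being discussed, with plain Python lists standing in for record batches (the variable names follow the diff; the data is illustrative):

```python
# next_index tracks the running global row offset across batches, so
# that file-level positional deletes can be mapped onto each batch's
# local row range [current_index, next_index).
batches = [[10, 11], [12, 13, 14], [15]]  # stand-ins for RecordBatches

next_index = 0
ranges = []
for batch in batches:
    current_index = next_index
    next_index = current_index + len(batch)
    ranges.append((current_index, next_index))

print(ranges)  # [(0, 2), (2, 5), (5, 6)]
```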

Contributor Author


Oh, I wasn't planning on pushing this change 🤦. I'll revert it in the next commit if we want.

@kevinjqliu
Contributor

@gabeiglio when you get a chance, could you rebase and also fix the change with next_index?

@kevinjqliu
Contributor

there's a conflict with poetry.lock; here's something I found that worked:

git fetch origin main
git merge origin/main -X theirs
poetry lock

This will ignore the conflict and regenerate the Poetry lock file.

Contributor

@kevinjqliu kevinjqliu left a comment


LGTM! Do you mind double-checking that we have sufficient test coverage for the changes in this PR? Since it's part of the read path, I want to be extra careful here.

@gabeiglio
Contributor Author

Sounds good, let me check!

@gabeiglio
Contributor Author

gabeiglio commented Feb 12, 2025

  • 32 tests in tests_reads.py test filtering: case sensitive and insensitive, multiple fields, one field, empty fields, nulls, etc.
  • 4 tests in test_partitions.py test filtering partitioned tables and also cover delete files (this makes sure the filter is not skipped when applying deletes)
  • 4 tests in test_pyarrow also test deletes + filtering

There could be more, but I'm thinking this is enough, wdyt @kevinjqliu?

Contributor

@kevinjqliu kevinjqliu left a comment


LGTM! Thanks @gabeiglio for the contribution and for double-checking the tests. The logic here is fairly straightforward after reverting the current_index change.
Thanks @Fokko for the review!

@kevinjqliu kevinjqliu merged commit 6d1c30c into apache:main Feb 13, 2025
7 checks passed
@gabeiglio
Contributor Author

Thanks for the help! @kevinjqliu


Successfully merging this pull request may close these issues.

[feat] push down filters and positional deletes to the record batch level
3 participants