
Use batchreader in upsert #1995


Draft · wants to merge 11 commits into main

Conversation

koenvo (Contributor) commented May 13, 2025

Summary

This PR updates the upsert logic to use batch processing. The main goal is to prevent out-of-memory (OOM) issues when updating large tables by avoiding loading all data at once.

Note: This has only been tested against the unit tests—no real-world datasets have been evaluated yet.

This PR partially depends on functionality introduced in #1817.
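A minimal sketch of the pattern this PR moves upsert towards: streaming record batches from the scan instead of materializing the whole result. The filter string and the per-batch work below are illustrative assumptions, not the exact code in this PR.

```python
import pyarrow as pa
from pyiceberg.table import Table


def count_matching_rows(tbl: Table, row_filter: str) -> int:
    # Before: tbl.scan(row_filter=row_filter).to_arrow() pulls every matching
    # row into a single pyarrow Table, which is what can trigger OOM on large tables.
    # After: to_arrow_batch_reader() yields record batches that can be processed
    # and discarded one at a time.
    reader: pa.RecordBatchReader = tbl.scan(row_filter=row_filter).to_arrow_batch_reader()
    total = 0
    for batch in reader:
        total += batch.num_rows  # stand-in for the real per-batch upsert work
    return total
```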


Notes

  • Duplicate detection across multiple batches is not possible with this approach.
  • All data is read sequentially, which may be slower than the parallel read used by to_arrow.
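A hedged illustration of the first note above: a duplicate check that runs per batch only sees rows inside that batch, so duplicate keys split across batches go unnoticed. The key column name here is an assumption.

```python
import pyarrow as pa


def has_duplicates_within_batch(batch: pa.RecordBatch, key: str = "id") -> bool:
    # Compare distinct values against row count for this batch only.
    column = batch.column(batch.schema.get_field_index(key))
    return len(column.unique()) != batch.num_rows


# Two batches that each look clean on their own, but share the key 1 overall.
b1 = pa.RecordBatch.from_pydict({"id": [1, 2]})
b2 = pa.RecordBatch.from_pydict({"id": [1, 3]})
assert not has_duplicates_within_batch(b1)
assert not has_duplicates_within_batch(b2)  # the cross-batch duplicate is not detected
```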

Performance Comparison

In setups with many small files, network and metadata overhead become the dominant factor. This impacts batch reading performance, as each file contributes relatively more overhead than payload. In the test setup used here, metadata access was the largest cost.

Using to_arrow_batch_reader (sequential):

  • Scan: 9993.50 ms
  • To list: 19811.09 ms

Using to_arrow (parallel):

  • Scan: 10607.88 ms
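For context, a sketch of how timings like these could be taken; this is not the benchmark script behind the numbers above, and the table handle and filter are assumptions.

```python
import time


def time_batch_reader(tbl, row_filter: str) -> None:
    start = time.perf_counter()
    reader = tbl.scan(row_filter=row_filter).to_arrow_batch_reader()  # "Scan"
    print(f"Scan: {(time.perf_counter() - start) * 1000:.2f} ms")

    start = time.perf_counter()
    batches = list(reader)  # "To list": drains the reader sequentially
    print(f"To list: {(time.perf_counter() - start) * 1000:.2f} ms ({len(batches)} batches)")


def time_to_arrow(tbl, row_filter: str) -> None:
    start = time.perf_counter()
    tbl.scan(row_filter=row_filter).to_arrow()  # parallel read of all matching files
    print(f"Scan: {(time.perf_counter() - start) * 1000:.2f} ms")
```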
