
Use batchreader in upsert #1995


Draft · wants to merge 11 commits into main

Conversation

koenvo (Contributor) commented May 13, 2025

Summary

This PR updates the upsert logic to use batch processing. The main goal is to prevent out-of-memory (OOM) issues when updating large tables by avoiding loading all data at once.

Note: This has only been tested against the unit tests—no real-world datasets have been evaluated yet.

This PR partially depends on functionality introduced in #1817.
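A minimal sketch of the pattern this PR moves upsert towards: streaming record batches from the scan instead of materializing the whole result. The filter string and the per-batch work below are illustrative assumptions, not the exact code in this PR.

```python
import pyarrow as pa
from pyiceberg.table import Table


def count_matching_rows(tbl: Table, row_filter: str) -> int:
    # Before: tbl.scan(row_filter=row_filter).to_arrow() pulls every matching
    # row into a single pyarrow Table, which is what can trigger OOM on large tables.
    # After: to_arrow_batch_reader() yields record batches that can be processed
    # and discarded one at a time.
    reader: pa.RecordBatchReader = tbl.scan(row_filter=row_filter).to_arrow_batch_reader()
    total = 0
    for batch in reader:
        total += batch.num_rows  # stand-in for the real per-batch upsert work
    return total
```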


Notes

  • Duplicate detection across multiple batches is not possible with this approach.
  • All data is read sequentially, which may be slower than the parallel read used by to_arrow.
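A hedged illustration of the first note above: a duplicate check that runs per batch only sees rows inside that batch, so duplicate keys split across batches go unnoticed. The key column name here is an assumption.

```python
import pyarrow as pa


def has_duplicates_within_batch(batch: pa.RecordBatch, key: str = "id") -> bool:
    # Compare distinct values against row count for this batch only.
    column = batch.column(batch.schema.get_field_index(key))
    return len(column.unique()) != batch.num_rows


# Two batches that each look clean on their own, but share the key 1 overall.
b1 = pa.RecordBatch.from_pydict({"id": [1, 2]})
b2 = pa.RecordBatch.from_pydict({"id": [1, 3]})
assert not has_duplicates_within_batch(b1)
assert not has_duplicates_within_batch(b2)  # the cross-batch duplicate is not detected
```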

Performance Comparison

In setups with many small files, network and metadata overhead become the dominant factor. This impacts batch reading performance, as each file contributes relatively more overhead than payload. In the test setup used here, metadata access was the largest cost.

Using to_arrow_batch_reader (sequential):

  • Scan: 9993.50 ms
  • To list: 19811.09 ms

Using to_arrow (parallel):

  • Scan: 10607.88 ms
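For context, a sketch of how timings like these could be taken; this is not the benchmark script behind the numbers above, and the table handle and filter are assumptions.

```python
import time


def time_batch_reader(tbl, row_filter: str) -> None:
    start = time.perf_counter()
    reader = tbl.scan(row_filter=row_filter).to_arrow_batch_reader()  # "Scan"
    print(f"Scan: {(time.perf_counter() - start) * 1000:.2f} ms")

    start = time.perf_counter()
    batches = list(reader)  # "To list": drains the reader sequentially
    print(f"To list: {(time.perf_counter() - start) * 1000:.2f} ms ({len(batches)} batches)")


def time_to_arrow(tbl, row_filter: str) -> None:
    start = time.perf_counter()
    tbl.scan(row_filter=row_filter).to_arrow()  # parallel read of all matching files
    print(f"Scan: {(time.perf_counter() - start) * 1000:.2f} ms")
```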
