
feat(5851/iox-11168): memory usage vs encoding size for ArrowWriter #2


Closed
wants to merge 3 commits

Conversation

@wiedld (Collaborator) commented Jun 27, 2024

For profiling only; this branches off an IOx-only base branch, not the apache master branch. A separate upstream PR will be made for the actual change.

Which issue does this PR close?

This branch was made as part of the investigation into DataFusion's tracked memory vs actual memory usage, per https://github.com/influxdata/influxdb_iox/issues/11168. This PR adds 3 commits onto IOx's current arrow-rs base. Based on the outcome of the tests shown below, we propose adding these 3 commits to an upstream PR.

These commits also fulfill the requested API change per apache#5851.

Rationale for this change

We have several profiling test cases that compare DataFusion's tracked MemoryReservations with the actual peak memory usage. The largest single difference was in the tracking of memory used during parquet encoding (via the ArrowWriter). Here is a summary of the discrepancy per test case:

| Use case | Root caller | Profiled heap peak (actual used) | Peak reserved bytes (DataFusion estimate) | Difference (actual - estimate) |
|---|---|---|---|---|
| test case 1, incl 61 columns | TrackedMemoryArrowWriter | 798.9 MB | 49.4 MB | -749.5 MB |
| test case 2, incl 4869 columns | TrackedMemoryArrowWriter | 4.1 GB | 220.1 MB | -3.9 GB |
| test case 3, incl 1042 columns | TrackedMemoryArrowWriter | 3.3 GB | 98.7 MB | -3.2 GB |

These results provided significant motivation to fulfill the existing upstream feature request for an ArrowWriter API that reports memory used during encoding (refer to apache#5851). Until now, we have been reserving memory based on the anticipated encoded (compressed) size, since that was the only size API available on the ArrowWriter.

This PR introduces a new memory_size() API, defined as the already-encoded size plus the uncompressed/unflushed bytes still held in encoder buffers. For now, we limited the accounting of unflushed bytes to the DictEncoder (future PRs may expand this accounting). This change alone had a significant impact on test case 3:

| Use case | Root caller | Profiled heap peak (actual used) | Peak reserved bytes (DataFusion estimate) | Difference (actual - estimate) |
|---|---|---|---|---|
| test case 3, CONTROL | TrackedMemoryArrowWriter | 3.3 GB | 98.7 MB | -3.2 GB |
| test case 3, WITH THESE CHANGES | TrackedMemoryArrowWriter | 3.3 GB | 2.2 GB | -1.1 GB |

Accounting for the DictEncoder's unflushed bytes improved our memory tracking by ~2 GB in this test case. We anticipate followup PRs that expand this memory_size() accounting to cover our other test cases as well.
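
To make the distinction concrete, here is a minimal sketch (illustrative data only, not one of the profiling test cases above) that writes a dictionary-friendly batch and reads back both the existing in_progress_size() estimate of the encoded size and the new memory_size() accounting:

```rust
use std::sync::Arc;

use arrow_array::{ArrayRef, RecordBatch, StringArray};
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Highly repetitive strings dictionary-encode well, so the anticipated
    // encoded size stays small while the DictEncoder's in-memory buffers grow.
    let col: ArrayRef = Arc::new(StringArray::from_iter_values(
        (0..100_000).map(|i| format!("value-{}", i % 16)),
    ));
    let batch = RecordBatch::try_from_iter([("tag", col)])?;

    // Default writer properties (dictionary encoding enabled).
    let buffer = Vec::<u8>::new();
    let mut writer = ArrowWriter::try_new(buffer, batch.schema(), None)?;

    writer.write(&batch)?;

    // Existing API: estimated *encoded* size of the in-progress row group.
    println!("in_progress_size: {}", writer.in_progress_size());
    // New API from this PR: encoded bytes plus unflushed encoder buffers.
    println!("memory_size:      {}", writer.memory_size());

    writer.close()?;
    Ok(())
}
```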

What changes are included in this PR?

  • Delineate two APIs: the existing method that estimates the anticipated encoded size, vs the new API that reports memory size during encoding.
  • Implement the new memory_size() to include accounting for unflushed DictEncoder bytes.

Are there any user-facing changes?

Yes, the new ArrowWriter::memory_size() API.
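
As a rough illustration of the intended usage, the sketch below resizes a DataFusion MemoryReservation from memory_size() after each write. TrackedWriter and its error handling are hypothetical stand-ins for illustration only, not IOx's actual TrackedMemoryArrowWriter:

```rust
use std::sync::Arc;

use arrow_array::RecordBatch;
use datafusion::error::{DataFusionError, Result};
use datafusion::execution::memory_pool::{MemoryConsumer, MemoryPool, MemoryReservation};
use parquet::arrow::ArrowWriter;

// Hypothetical wrapper that keeps a memory reservation in sync with the
// ArrowWriter's reported memory footprint.
struct TrackedWriter<W: std::io::Write + Send> {
    writer: ArrowWriter<W>,
    reservation: MemoryReservation,
}

impl<W: std::io::Write + Send> TrackedWriter<W> {
    fn new(writer: ArrowWriter<W>, pool: &Arc<dyn MemoryPool>) -> Self {
        let reservation = MemoryConsumer::new("ArrowWriter").register(pool);
        Self { writer, reservation }
    }

    /// Write a batch, then resize the reservation to the writer's reported
    /// memory footprint: encoded bytes plus unflushed encoder buffers.
    fn write(&mut self, batch: &RecordBatch) -> Result<()> {
        self.writer
            .write(batch)
            .map_err(|e| DataFusionError::External(Box::new(e)))?;

        let used = self.writer.memory_size();
        let reserved = self.reservation.size();
        if used > reserved {
            // Fails the query if the pool cannot grant the additional bytes.
            self.reservation.try_grow(used - reserved)?;
        } else {
            self.reservation.shrink(reserved - used);
        }
        Ok(())
    }
}
```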

@wiedld (Collaborator, Author) commented Jul 2, 2024

Upstream is merged: apache#5967
Closing.

@wiedld closed this Jul 2, 2024
wiedld pushed a commit that referenced this pull request Dec 27, 2024
* Implement bulk_delete_request for Azure

* Fix lint and add Azurite bug workaround

* Special 404 error case

* Clippy fix

* Make number of expected headers more conservative and better document invariants

* Use multer for multipart parsing

* Fix clippy

* Fix clippy #2

* Reuse part response buffer

* Make multer conditional to azure feature

* One more HeaderValue::from_static

* Add tests for bulk delete request building and response parsing

* Switch back to manual parsing to avoid multer dependency, other PR suggestions

* Fixes lint