feat(5851/iox-11168): memory usage vs encoding size for ArrowWriter #2
For profiling only; this branches off an iox-only base branch, not the apache base `master` branch. A separate upstream PR will be made for the actual change.

Which issue does this PR close?
This branch was made as part of the investigation into DataFusion's memory tracking vs actual usage, per https://github.com/influxdata/influxdb_iox/issues/11168. This PR adds 3 commits onto iox's current arrow-rs base. Based on the outcome of the tests shown below, we propose adding these 3 commits to an upstream PR.
These commits also fulfill the requested API change per apache#5851.
Rationale for this change
We have several profiling test cases which compare DataFusion's tracked MemoryReservations with the actual peak memory usage. The largest single difference was in the tracking of memory used during parquet encoding (via ArrowWriter). Here is a summary of the discrepancy per test case:
These results provided significant motivation to fulfill the existing upstream feature request for an ArrowWriter API that reports the memory used during encoding (refer to apache#5851). Until now we have been reserving memory based on the anticipated encoded (compressed) size, as that was the only API available on the ArrowWriter.
The changes in this PR introduce a new `memory_size()` API, defined as the already encoded size plus the uncompressed/unflushed bytes still held in the writer's buffers. Next, we limited our accounting of unflushed bytes to the DictEncoder (although future PRs may expand this accounting). This change alone had a significant impact on our test case 3: accounting for the DictEncoder's unflushed bytes improved our memory tracking by ~2 GB in that test case. We anticipate follow-up PRs which expand this `memory_size()` accounting to cover our other test cases as well.
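As a rough illustration of how a caller could drive its reservations from the proposed API, here is a minimal sketch. Only `memory_size()` itself comes from this PR; the schema, write loop, and the plain `usize` counter (a stand-in for a DataFusion `MemoryReservation`) are made up for illustration:

```rust
use std::sync::Arc;

use arrow_array::{ArrayRef, Int64Array, RecordBatch};
use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A single int64 column; the same batch is written repeatedly to mimic a stream.
    let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int64, false)]));
    let column: ArrayRef = Arc::new(Int64Array::from_iter_values(0..1024));
    let batch = RecordBatch::try_new(schema.clone(), vec![column])?;

    let mut buffer = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buffer, schema, None)?;

    // Hypothetical stand-in for a DataFusion MemoryReservation.
    let mut reserved: usize = 0;

    for _ in 0..10 {
        writer.write(&batch)?;
        // Proposed API: bytes already encoded plus the uncompressed/unflushed
        // bytes still buffered by encoders such as the DictEncoder.
        let needed = writer.memory_size();
        if needed > reserved {
            // Grow the reservation to the writer's actual footprint.
            reserved = needed;
        }
    }
    writer.close()?;

    println!("peak reserved for ArrowWriter: {reserved} bytes");
    Ok(())
}
```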
What changes are included in this PR?
Are there any user-facing changes?
Yes, the new `ArrowWriter::memory_size()` API.