feat(5851/iox-11168): memory usage vs encoding size for ArrowWriter #2
For profiling only; this branches off an iox-only base branch, not the apache base `master` branch. A separate upstream PR will be made for the actual change.

Which issue does this PR close?
This branch was made as part of the investigation into DataFusion's memory tracking vs actual usage, per https://github.com/influxdata/influxdb_iox/issues/11168. This PR adds 3 commits onto iox's current arrow-rs base. Based on the outcome of the tests shown below, we propose adding these 3 commits to an upstream PR.
These commits also fulfill the requested API change per apache#5851.
Rationale for this change
We have several profiling test cases which compare DataFusion's tracked MemoryReservations with the actual peak memory usage. The largest single difference was in the tracking of memory used during parquet encoding (via ArrowWriter). Here is a summary of the discrepancy per test case:
These results provided significant motivation to fulfill the existing upstream feature request for an ArrowWriter API that reports the memory used during encoding (refer to apache#5851). Until now we have been reserving memory based on the anticipated encoded (compressed) size, as that was the only API available on the ArrowWriter.
The changes in this PR introduce a new `memory_size()` API, defined as the already encoded size plus the uncompressed/unflushed bytes still held in the writer's buffers. Next, we limited our accounting of unflushed bytes to the DictEncoder (although future PRs may expand this accounting). This change alone had a significant impact on our test case 3: accounting for the DictEncoder's unflushed bytes improved our memory tracking by ~2 GB in that test case. We anticipate follow-up PRs which expand this `memory_size()` accounting to cover our other test cases as well.
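As a rough illustration of how a caller could drive its reservations from the proposed API, here is a minimal sketch. Only `memory_size()` itself comes from this PR; the schema, write loop, and the plain `usize` counter (a stand-in for a DataFusion `MemoryReservation`) are made up for illustration:

```rust
use std::sync::Arc;

use arrow_array::{ArrayRef, Int64Array, RecordBatch};
use arrow_schema::{DataType, Field, Schema};
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A single int64 column; the same batch is written repeatedly to mimic a stream.
    let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int64, false)]));
    let column: ArrayRef = Arc::new(Int64Array::from_iter_values(0..1024));
    let batch = RecordBatch::try_new(schema.clone(), vec![column])?;

    let mut buffer = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buffer, schema, None)?;

    // Hypothetical stand-in for a DataFusion MemoryReservation.
    let mut reserved: usize = 0;

    for _ in 0..10 {
        writer.write(&batch)?;
        // Proposed API: bytes already encoded plus the uncompressed/unflushed
        // bytes still buffered by encoders such as the DictEncoder.
        let needed = writer.memory_size();
        if needed > reserved {
            // Grow the reservation to the writer's actual footprint.
            reserved = needed;
        }
    }
    writer.close()?;

    println!("peak reserved for ArrowWriter: {reserved} bytes");
    Ok(())
}
```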
What changes are included in this PR?
Are there any user-facing changes?
Yes, the new `ArrowWriter::memory_size()` API.