Skip to content

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Sep 26, 2025

Which issue does this PR close?

We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax.

Rationale for this change

While working on other PRs I noticed a lot of copying and that ParquetMetaData is significantly larger when the encryption feature is enabled This is unnecessary and could be avoided
by using Arc<FileDecryptionProperties> instead of FileDecryptionProperties directly.

What changes are included in this PR?

  1. Change most uses of FileDecryptionProperties into Arc<FileDecryptionProperties>

If this is an acceptable API change, I will make a follow on PR giving the same treatment to FileEncryptionProperties
(turn it into Arc<FileEncryptionProperties>)

Are these changes tested?

Yes by CI

Are there any user-facing changes?

Yes, some of the APIs now take Arc<FileDecryptionProperties> rather than FileDecryptionProperties

@alamb alamb force-pushed the alamb/file_encryption_properties branch from 5355828 to 2666244 Compare September 26, 2025 14:34
Ok(Self {
footer_decryptor: Arc::new(footer_decryptor),
decryption_properties: decryption_properties.clone(),
decryption_properties: Arc::clone(decryption_properties),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is one example of a clone (deep copy) that is removed due to this change

let base_expected_size = 2312;
#[cfg(feature = "encryption")]
let base_expected_size = 2648;
// Not as accurate as it should be: https://github.com/apache/arrow-rs/issues/8472
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR does actually reduce heap size needed when the encryption feature is enabled, but there are no FileDecryptionProperties -- instead of having space inline for the entire FileDecryptionProperties now there is only an Arc (a pointer + refcount).

However, while reviewing this, I found the heap size accounting is somewhat inaccurate for encryption and filed a second ticket:

file_encryption_properties.validate_encrypted_column_names(schema_descriptor)?;

Ok(Some(Arc::new(FileEncryptor::new(
file_encryption_properties.clone(),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is another clone that is removed by this PR

pub fn with_file_decryption_properties(
self,
file_decryption_properties: FileDecryptionProperties,
file_decryption_properties: Arc<FileDecryptionProperties>,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this and the change to file_decryption_properties is an API change


/// Finalize the builder and return created [`FileDecryptionProperties`]
pub fn build(self) -> Result<FileDecryptionProperties> {
pub fn build(self) -> Result<Arc<FileDecryptionProperties>> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also an API change. It is somewhat strange to return an Arc from a builder but doing so eases the impact of this change on downstream users -- the output of the builder can be used by the other APIs directly

@alamb alamb changed the title Refactor: Use Arc<FileDecryptionProperties> to reduce size of ParquetMetadata and avoid copying Use Arc<FileDecryptionProperties> to reduce size of ParquetMetadata and avoid copying when encryption is enabled Sep 26, 2025
@alamb alamb marked this pull request as ready for review September 26, 2025 14:41
@mbrobbel mbrobbel added this to the 57.0.0 milestone Sep 29, 2025
@mbrobbel mbrobbel added the api-change Changes to the arrow API label Sep 29, 2025
Copy link
Contributor

@adamreeve adamreeve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me thanks Andrew, I think it's fine to change the API to use an Arc (and similarly for the encryption properties), especially as this feature is still experimental.

There are some conflicts with main now though. Let me know if you want me or @rok to help with this and clean up our code!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Parquet] Reduce size of ParquetMetadata when encryption feature is enabled
3 participants