
Conversation

jecsand838
Contributor

Which issue does this PR close?

Rationale for this change

This PR introduces benchmark tests for the AvroWriter in the arrow-avro crate. Adding these benchmarks is essential for tracking the performance of the writer, identifying potential regressions, and guiding future optimizations.

What changes are included in this PR?

A new benchmark file, benches/avro_writer.rs, is added to the project. This file contains a suite of benchmarks that measure the performance of writing RecordBatches to the Avro format.

The benchmarks cover a variety of Arrow data types:

  • Boolean
  • Int32 and Int64
  • Float32 and Float64
  • Binary
  • Timestamp (Microsecond precision)
  • A schema with a mix of the above types

These benchmarks are run with varying numbers of rows (100, 10,000, and 1,000,000) to assess performance across different data scales.

Are these changes tested?

Yes, this pull request consists entirely of new benchmark tests. Therefore, no separate tests are needed.

Are there any user-facing changes?

NA

@github-actions bot added the arrow (Changes to the arrow crate) and arrow-avro (arrow-avro crate) labels Aug 18, 2025
- Introduce `avro_writer` benchmark suite in `arrow-avro/benches/avro_writer.rs`.
- Test writing performance for various data types, including Boolean, Int32, Int64, Float32, Float64, Binary, TimestampMicrosecond, and Mixed schemas.
- Update `Cargo.toml` to include the `avro_writer` benchmark target.
@jecsand838 jecsand838 force-pushed the avro-writer-benchmarks branch from c1dd2a8 to 936d255 Compare August 18, 2025 19:30
Contributor

@alamb alamb left a comment


Thanks @jecsand838 -- I have some suggestions on how to improve these benchmarks, but I also think we could do that as a follow on


const SIZES: [usize; 3] = [100, 10_000, 1_000_000];

fn make_bool_array(n: usize) -> BooleanArray {
Contributor

Most of our other benchmarks use random input to avoid pathological cases caused by patterns in the input.

I recommend doing the same in this PR

Here are some examples

pub fn create_random_batch(
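
As a rough illustration of the suggestion above, here is a minimal std-only sketch contrasting patterned input with seeded pseudo-random input. The `XorShift64` generator and both helper functions are hypothetical stand-ins, not code from this PR or from the arrow crates; a real benchmark would more likely use an RNG crate or the existing random-batch helpers.

```rust
/// Tiny xorshift64* generator; a std-only stand-in for a real RNG crate.
/// (Hypothetical helper for illustration only.)
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // Avoid the all-zero state, which xorshift can never leave.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x >> 12;
        x ^= x << 25;
        x ^= x >> 27;
        self.0 = x;
        x.wrapping_mul(0x2545_F491_4F6C_DD1D)
    }
}

/// Patterned input: strictly alternating values.
fn patterned_bools(n: usize) -> Vec<bool> {
    (0..n).map(|i| i % 2 == 0).collect()
}

/// Seeded pseudo-random input: the same seed always yields the same data,
/// so runs stay comparable, but there is no simple repeating pattern.
fn random_bools(n: usize, seed: u64) -> Vec<bool> {
    let mut rng = XorShift64::new(seed);
    // Use the top bit of each output as the boolean value.
    (0..n).map(|_| rng.next_u64() >> 63 == 1).collect()
}

fn main() {
    let patterned = patterned_bools(8);
    let random = random_bools(8, 42);
    println!("patterned: {patterned:?}");
    println!("random:    {random:?}");
}
```

Fixing the seed keeps the generated data identical across runs, which matters when comparing benchmark results between commits.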

Contributor Author

Ah that makes sense. I can add that in tonight.

Contributor Author

@alamb I just pushed up some changes based on your feedback regarding the random inputs. Let me know what you think!

.collect()
});

fn ocf_size_for_batch(batch: &RecordBatch) -> usize {
Contributor

What does OCF size mean?

Contributor Author

@jecsand838 jecsand838 Aug 21, 2025


It's the size of the Avro Object Container File.

Contributor Author

To be clearer, my intention here was to feed this usize byte count into Throughput::Bytes() so that Criterion reports MB/s based on the actual on-disk OCF bytes written per iteration.
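
For readers unfamiliar with the idea, here is a minimal std-only sketch of deriving a throughput figure from the encoded byte count. It does not use Criterion or arrow-avro: `write_fake_ocf` is a hypothetical stand-in encoder (a 4-byte header plus fixed-width values, not the real Avro encoding), and `megabytes_per_second` assumes decimal megabytes.

```rust
use std::io::Write;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for an OCF writer: a 4-byte header followed by
/// each value as 8 little-endian bytes. NOT the real arrow-avro encoding.
fn write_fake_ocf<W: Write>(out: &mut W, values: &[i64]) -> std::io::Result<()> {
    out.write_all(b"Obj\x01")?;
    for v in values {
        out.write_all(&v.to_le_bytes())?;
    }
    Ok(())
}

/// Encode once into memory and report how many bytes were produced.
fn encoded_size(values: &[i64]) -> usize {
    let mut buf = Vec::new();
    write_fake_ocf(&mut buf, values).expect("writing to a Vec should not fail");
    buf.len()
}

/// Bytes per iteration divided by seconds per iteration, in MB/s (10^6 bytes).
fn megabytes_per_second(bytes_per_iter: usize, elapsed: Duration, iters: u32) -> f64 {
    let secs_per_iter = elapsed.as_secs_f64() / f64::from(iters);
    bytes_per_iter as f64 / secs_per_iter / 1e6
}

fn main() {
    let values: Vec<i64> = (0..10_000).collect();
    // Measure the output size once, outside the timed loop.
    let bytes = encoded_size(&values);

    let iters = 100;
    let start = Instant::now();
    for _ in 0..iters {
        let mut buf = Vec::with_capacity(bytes);
        write_fake_ocf(&mut buf, &values).unwrap();
        // Keep the optimizer from discarding the write.
        std::hint::black_box(&buf);
    }
    let mbps = megabytes_per_second(bytes, start.elapsed(), iters);
    println!("{bytes} bytes per iteration, {mbps:.1} MB/s");
}
```

Measuring the size once up front keeps the sizing pass out of the timed region, so only the write itself is counted.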

let bytes = ocf_size_for_batch(batch);
group.throughput(Throughput::Bytes(bytes as u64));
match rows {
10_000 => {
Contributor

A common batch size is 8k as well -- I might suggest testing 4k or 8k batches along with 100k to better represent any per-batch overhead (as with 100K it will be drowned out)
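
To make the "drowned out" point concrete, here is a small sketch under a simple linear cost model (time = fixed per-batch cost + rows × per-row cost). The model and both cost values are illustrative assumptions, not measurements from this PR.

```rust
/// Fraction of total time spent on fixed per-batch work, assuming
/// time = per_batch_ns + rows * per_row_ns (a simplifying assumption).
fn overhead_fraction(rows: usize, per_batch_ns: f64, per_row_ns: f64) -> f64 {
    per_batch_ns / (per_batch_ns + rows as f64 * per_row_ns)
}

fn main() {
    // Both costs are made-up illustrative numbers, not measurements.
    let per_batch_ns = 2_000.0;
    let per_row_ns = 5.0;
    for rows in [4_096, 8_192, 100_000] {
        let pct = 100.0 * overhead_fraction(rows, per_batch_ns, per_row_ns);
        println!("{rows:>7} rows: {pct:.2}% of time is per-batch overhead");
    }
}
```

Under this model the fixed cost's share shrinks as the row count grows, which is why smaller batches make per-batch overhead easier to see.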

Contributor Author

That's a good call out. I'll push those changes up tonight as well.

Contributor Author

Pushed these changes up as well.

@jecsand838 jecsand838 force-pushed the avro-writer-benchmarks branch from 2f07609 to c6c6aeb Compare August 22, 2025 20:10
Contributor

alamb commented Aug 23, 2025

I took a look at the changes and they look good to me -- thanks @jecsand838

@alamb alamb merged commit 26c9c7a into apache:main Aug 23, 2025
24 checks passed