Add benchmarks for arrow-avro writer #8165

jecsand838 · 2025-08-18T18:47:11Z

Which issue does this PR close?

Part of Add Avro Support #4886

Rationale for this change

This PR introduces benchmark tests for the AvroWriter in the arrow-avro crate. Adding these benchmarks is essential for tracking the performance of the writer, identifying potential regressions, and guiding future optimizations.

What changes are included in this PR?

A new benchmark file, benches/avro_writer.rs, is added to the project. This file contains a suite of benchmarks that measure the performance of writing RecordBatches to the Avro format.

The benchmarks cover a variety of Arrow data types:

- Boolean
- Int32 and Int64
- Float32 and Float64
- Binary
- Timestamp (Microsecond precision)
- A schema with a mix of the above types

These benchmarks are run with varying numbers of rows (100, 10,000, and 1,000,000) to assess performance across different data scales.

Are these changes tested?

Yes, this pull request consists entirely of new benchmark tests. Therefore, no separate tests are needed.

Are there any user-facing changes?

N/A

- Introduce `avro_writer` benchmark suite in `arrow-avro/benches/avro_writer.rs`.
- Test writing performance for various data types, including Boolean, Int32, Int64, Float32, Float64, Binary, TimestampMicrosecond, and Mixed schemas.
- Update `Cargo.toml` to include the `avro_writer` benchmark target.

alamb

Thanks @jecsand838 -- I have some suggestions on how to improve these benchmarks, but I also think we could do that as a follow on

alamb · 2025-08-21T17:48:00Z

arrow-avro/benches/avro_writer.rs

+
+const SIZES: [usize; 3] = [100, 10_000, 1_000_000];
+
+fn make_bool_array(n: usize) -> BooleanArray {


Most of our other benchmarks use random input to avoid pathological cases due to patterns in the input

I recommend doing the same in this PR

Here are some examples

arrow-rs/arrow/src/util/data_gen.rs

Line 37 in a9b4221

pub fn create_random_batch(
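To illustrate the point about patterned versus random input, here is a minimal standard-library sketch. It is not the code from this PR and does not use `create_random_batch`; the `XorShift64` type and both helper functions are hypothetical names introduced only for this example, and a real benchmark would more likely use an RNG crate or the arrow data generation utilities referenced above.

```rust
/// Minimal xorshift64 generator (hypothetical stand-in for a real RNG crate).
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // xorshift state must never be zero, so substitute a fixed non-zero value.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Patterned input: a strictly alternating true/false sequence.
/// Highly regular data like this can flatter (or penalize) an encoder.
fn patterned_bools(n: usize) -> Vec<bool> {
    (0..n).map(|i| i % 2 == 0).collect()
}

/// Seeded pseudo-random input: identical for a given seed, so benchmark
/// runs stay comparable, but without a short repeating pattern.
fn random_bools(n: usize, seed: u64) -> Vec<bool> {
    let mut rng = XorShift64::new(seed);
    // Use the top bit of each output as the boolean value.
    (0..n).map(|_| rng.next_u64() >> 63 == 1).collect()
}
```

Calling `random_bools(n, seed)` twice with the same seed yields the same vector, which keeps results reproducible across runs while still avoiding the regular structure of `patterned_bools(n)`.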

Ah that makes sense. I can add that in tonight.

@alamb I just pushed up some changes based on your feedback regarding the random inputs. Let me know what you think!

alamb · 2025-08-21T17:52:27Z

arrow-avro/benches/avro_writer.rs

+        .collect()
+});
+
+fn ocf_size_for_batch(batch: &RecordBatch) -> usize {


What does OCF size mean?

It's the size of the Avro Object Container File.

To be a bit clearer, my intention here was to feed this usize byte count into Throughput::Bytes() so that Criterion reports MB/s for the actual on-disk OCF bytes written per iteration.
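The idea of measuring the encoded size once and then reporting bytes per second can be sketched with the standard library alone. This is an assumption about the general approach, not the actual body of `ocf_size_for_batch`; `encoded_size` and `bytes_per_second` are hypothetical helpers, and the closure stands in for whatever writes the container file.

```rust
use std::io;
use std::time::Duration;

/// Runs `write` once against an in-memory buffer and returns how many
/// bytes it produced. In a benchmark this would be done outside the
/// timed loop, purely to learn the output size.
fn encoded_size<F>(write: F) -> io::Result<usize>
where
    F: FnOnce(&mut Vec<u8>) -> io::Result<()>,
{
    let mut buf = Vec::new();
    write(&mut buf)?;
    Ok(buf.len())
}

/// Converts bytes written per iteration and mean time per iteration
/// into a bytes-per-second figure.
fn bytes_per_second(bytes_per_iter: u64, time_per_iter: Duration) -> f64 {
    bytes_per_iter as f64 / time_per_iter.as_secs_f64()
}
```

For example, an iteration that writes 1,000,000 bytes in 500 ms corresponds to 2,000,000 bytes per second. Basing the throughput on the encoded size rather than the in-memory size of the batch means the reported rate reflects what the writer actually emits.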

alamb · 2025-08-21T17:53:52Z

arrow-avro/benches/avro_writer.rs

+        let bytes = ocf_size_for_batch(batch);
+        group.throughput(Throughput::Bytes(bytes as u64));
+        match rows {
+            10_000 => {


A common batch size is 8k as well -- I might suggest testing 4k or 8k batches along with 100k to better represent any per-batch overhead (with 100k rows it will be drowned out)
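A rough way to see why large batches hide per-batch overhead is a linear cost model: total time = fixed per-batch cost + rows × per-row cost. The sketch below is illustrative only; the function name and any cost figures passed to it are hypothetical and are not measurements from this PR.

```rust
/// Fraction of total write time attributable to fixed per-batch overhead,
/// under the assumed model: time = per_batch_ns + rows * per_row_ns.
fn overhead_fraction(rows: u64, per_batch_ns: f64, per_row_ns: f64) -> f64 {
    per_batch_ns / (per_batch_ns + rows as f64 * per_row_ns)
}

/// Evaluates the model for several batch sizes, returning (rows, fraction) pairs.
fn overhead_table(sizes: &[u64], per_batch_ns: f64, per_row_ns: f64) -> Vec<(u64, f64)> {
    sizes
        .iter()
        .map(|&rows| (rows, overhead_fraction(rows, per_batch_ns, per_row_ns)))
        .collect()
}
```

With a made-up 1,000 ns per-batch cost and 1 ns per row, a 1,000-row batch spends half its time on overhead, and the fraction shrinks steadily as the row count grows. Under this model, 4k or 8k batches leave the fixed cost far more visible than 100k-row batches do.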

That's a good call out. I'll push those changes up tonight as well.

Pushed these changes up as well.

alamb · 2025-08-23T10:09:01Z

I took a look at the changes and they look good to me -- thanks @jecsand838

github-actions bot added the arrow (Changes to the arrow crate) and arrow-avro (arrow-avro crate) labels Aug 18, 2025

jecsand838 force-pushed the avro-writer-benchmarks branch from c1dd2a8 to 936d255 August 18, 2025 19:30

alamb approved these changes Aug 21, 2025

jecsand838 added 2 commits August 22, 2025 13:46

Address PR Comments

1c74ce1

Merge branch 'main' into avro-writer-benchmarks

c6c6aeb

jecsand838 force-pushed the avro-writer-benchmarks branch from 2f07609 to c6c6aeb August 22, 2025 20:10

alamb merged commit 26c9c7a into apache:main Aug 23, 2025
24 checks passed
