-
Notifications
You must be signed in to change notification settings - Fork 1k
Add benchmarks for arrow-avro writer #8165
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- Introduce `avro_writer` benchmark suite in `arrow-avro/benches/avro_writer.rs`. - Test writing performance for various data types, including Boolean, Int32, Int64, Float32, Float64, Binary, TimestampMicrosecond, and Mixed schemas. - Update `Cargo.toml` to include the `avro_writer` benchmark target.
c1dd2a8
to
936d255
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jecsand838 -- I have some suggestions on how to improve these benchmarks, but I also think we could do that as a follow on
arrow-avro/benches/avro_writer.rs
Outdated
|
||
const SIZES: [usize; 3] = [100, 10_000, 1_000_000]; | ||
|
||
fn make_bool_array(n: usize) -> BooleanArray { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of our other benchmarks use random input to avoid pathalogical cases due to patterns in the input
I recommend doing the same in this PR
Here are some examples
arrow-rs/arrow/src/util/data_gen.rs
Line 37 in a9b4221
pub fn create_random_batch( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah that makes sense. I can add that in tonight.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb I just pushed up some changes based on your feedback regarding the random inputs. Let me know what you think!
.collect() | ||
}); | ||
|
||
fn ocf_size_for_batch(batch: &RecordBatch) -> usize { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does OCF size mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's the size of the Avro Object Container File.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also to be a bit more clear, my intention here was to feed this usize
byte count value into Throughput::Bytes()
so Criterion reports MB/s for actual on‑disk OCF bytes written per iteration.
arrow-avro/benches/avro_writer.rs
Outdated
let bytes = ocf_size_for_batch(batch); | ||
group.throughput(Throughput::Bytes(bytes as u64)); | ||
match rows { | ||
10_000 => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A common batch size is 8k as well -- I might suggest testing 4k or 8k batches along with 100k to better represent any per-batch overhead (as with 100K it will be drowned out)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good call out. I'll push those changes up tonight as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pushed these changes up as well.
2f07609
to
c6c6aeb
Compare
I took a look at the changes and they look good to me -- thanks @jecsand838 |
Which issue does this PR close?
Rationale for this change
This PR introduces benchmark tests for the
AvroWriter
in thearrow-avro
crate. Adding these benchmarks is essential for tracking the performance of the writer, identifying potential regressions, and guiding future optimizations.What changes are included in this PR?
A new benchmark file,
benches/avro_writer.rs
, is added to the project. This file contains a suite of benchmarks that measure the performance of writingRecordBatch
es to the Avro format.The benchmarks cover a variety of Arrow data types:
Boolean
Int32
andInt64
Float32
andFloat64
Binary
Timestamp
(Microsecond precision)These benchmarks are run with varying numbers of rows (100, 10,000, and 1,000,000) to assess performance across different data scales.
Are these changes tested?
Yes, this pull request consists entirely of new benchmark tests. Therefore, no separate tests are needed.
Are there any user-facing changes?
NA