@tobixdev tobixdev commented Oct 7, 2025

Which issue does this PR close?

This is a follow-up to #8543 . There we discussed two outstanding issues that this PR tries to address.

Rationale for this change

Address the two outstanding points raised in the review of #8543.

What changes are included in this PR?

  1. Fix "Incorrect Behavior of Collecting a filtered iterator to a BooleanArray" #8543 (comment) : Improve benchmarks (use a dynamically dispatched iterator to avoid Vec-specific optimizations)
  2. Fix "Incorrect Behavior of Collecting a filtered iterator to a BooleanArray" #8543 (comment) : Also use ExactSizeIterator for the PrimitiveArray
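
The idea behind item 1 can be sketched with the standard library alone. The helpers below (`erased` and `erased_exact`, both hypothetical names that do not come from this PR) box a `Vec`'s iterator behind a trait object, so the consumer only sees `dyn Iterator` and cannot take code paths specialized for `std::vec::IntoIter`. This is an assumption about the general approach, not the benchmark code itself.

```rust
use std::hint::black_box;

/// Hypothetical helper: hide the concrete `vec::IntoIter` type behind a
/// trait object so the consumer has to go through dynamic dispatch.
fn erased(values: Vec<Option<i64>>) -> Box<dyn Iterator<Item = Option<i64>>> {
    Box::new(values.into_iter())
}

/// Same idea, but the trait object still exposes an exact length.
fn erased_exact(values: Vec<Option<i64>>) -> Box<dyn ExactSizeIterator<Item = Option<i64>>> {
    Box::new(values.into_iter())
}

fn main() {
    let data: Vec<Option<i64>> = (0..8)
        .map(|i| if i % 3 == 0 { None } else { Some(i) })
        .collect();

    // `black_box` discourages the optimizer from seeing through the box.
    let collected: Vec<Option<i64>> = black_box(erased(data.clone())).collect();
    assert_eq!(collected, data);

    // `Box<dyn ExactSizeIterator>` is itself an `ExactSizeIterator`.
    assert_eq!(erased_exact(data).len(), 8);
}
```

In a real benchmark the boxed iterator would be handed to the constructor under test instead of `collect`.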

Are these changes tested?

Covered by the existing tests for PrimitiveArray.

Are there any user-facing changes?

Yes, PrimitiveArray::from_trusted_len_iter is more restrictive: it now requires an ExactSizeIterator.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Oct 7, 2025
@tobixdev tobixdev changed the title 8543 follow up Follow-up to 8543 Oct 7, 2025
alamb commented Oct 7, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1016-gcp #17~24.04.1-Ubuntu SMP Wed Sep 3 01:55:36 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing 8543-follow-up (716735f) to ba22a21 diff
BENCH_NAME=array_from
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench array_from
BENCH_FILTER=
BENCH_BRANCH_NAME=8543-follow-up
Results will be posted here when complete

@alamb alamb changed the title Follow-up to 8543 Improve comments and change PrimitiveArray::from_trusted_len_iter to take an ExactSizeIterator Oct 7, 2025
@alamb alamb left a comment

Thanks @tobixdev -- I also kicked off some benchmarks

  where
      P: std::borrow::Borrow<Option<<T as ArrowPrimitiveType>::Native>>,
-     I: IntoIterator<Item = P>,
+     I: ExactSizeIterator<Item = P>,
Contributor

What is the rationale for this change? I am thinking that someone who had an iterator that knew its length but did not implement ExactSizeIterator would have to change their code.

Maybe that is OK, but the implementation doesn't even seem to use any of the methods on ExactSizeIterator.
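
One standard-library case of the concern above is `Chain`: chaining two slices yields an iterator whose `size_hint` is exact, yet `Chain` does not implement `ExactSizeIterator`. The sketch below shows that, plus one possible adaptation (buffering into a `Vec`). The function names are hypothetical and chosen for illustration only.

```rust
/// Size hint of two slices chained together.
fn chained_size_hint(a: &[i64], b: &[i64]) -> (usize, Option<usize>) {
    a.iter().chain(b.iter()).size_hint()
}

/// Hypothetical adaptation: buffer the chained values so that the result
/// is a `vec::IntoIter`, which does implement `ExactSizeIterator`.
fn chained_exact(a: &[i64], b: &[i64]) -> std::vec::IntoIter<i64> {
    let buffered: Vec<i64> = a.iter().chain(b.iter()).copied().collect();
    buffered.into_iter()
}

fn main() {
    let a = [1i64, 2, 3];
    let b = [4i64, 5];

    // The hint is exact: lower and upper bound agree.
    assert_eq!(chained_size_hint(&a, &b), (5, Some(5)));

    // `a.iter().chain(b.iter()).len()` would not compile, because `Chain`
    // has no `ExactSizeIterator` impl. The buffered version does:
    assert_eq!(chained_exact(&a, &b).len(), 5);
}
```

Buffering costs an extra allocation, which is the kind of caller-side change the comment is worried about.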

Contributor Author

Yeah, we just mentioned that as a side note in the last PR (#8543). I think it has two advantages:

  • Users have at least some help from the type system to avoid safety issues.
  • It's consistent with BooleanArray

However, I am also happy with the old API.
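
To illustrate the first advantage: a filtered iterator only reports an upper bound on its length, and it does not implement `ExactSizeIterator`, so a bound on that trait rejects it at compile time. This is a minimal sketch using only the standard library; `exact_len` is a hypothetical stand-in, not an arrow function, and it assumes the hazard in question is a constructor trusting that upper bound.

```rust
/// Size hint of a filtered slice iterator (keeps even values only).
fn filtered_hint(values: &[i64]) -> (usize, Option<usize>) {
    values.iter().filter(|v| **v % 2 == 0).size_hint()
}

/// Hypothetical stand-in for an API bounded on `ExactSizeIterator`.
fn exact_len<I: ExactSizeIterator>(iter: I) -> usize {
    iter.len()
}

fn main() {
    let values = [1i64, 2, 3, 4, 5];

    // `Filter` cannot know how many items pass the predicate, so it reports
    // a lower bound of 0 and inherits the inner iterator's upper bound.
    assert_eq!(filtered_hint(&values), (0, Some(5)));

    // The real number of items is smaller than the upper bound.
    assert_eq!(values.iter().filter(|v| **v % 2 == 0).count(), 2);

    // `exact_len(values.iter().filter(..))` would not compile, because
    // `Filter` has no `ExactSizeIterator` impl. An unfiltered slice iterator works:
    assert_eq!(exact_len(values.iter()), 5);
}
```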

Contributor Author

I also think that some of the functions could use len instead of size_hint. If we choose to keep ExactSizeIterator, I'll go through them.

Another source of incompatibility is that users now need to call .into_iter() manually.
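
Both points can be sketched without arrow. `collect_exact` below is a hypothetical function with the same kind of bound as the new signature. It uses `len()` rather than unpacking `size_hint()`, and its caller has to convert a `Vec` explicitly because the bound is no longer `IntoIterator`.

```rust
/// Hypothetical stand-in for a constructor bounded on `ExactSizeIterator`.
/// This is not the arrow implementation.
fn collect_exact<I>(iter: I) -> Vec<Option<i64>>
where
    I: ExactSizeIterator<Item = Option<i64>>,
{
    // `len()` returns the count directly; with `size_hint()` the caller
    // would have to unwrap the optional upper bound instead.
    let len = iter.len();
    let mut out = Vec::with_capacity(len);
    out.extend(iter);
    debug_assert_eq!(out.len(), len);
    out
}

fn main() {
    let values = vec![Some(1i64), None, Some(3)];

    // `collect_exact(values)` would not compile: `Vec` is `IntoIterator`
    // but not an iterator itself, so the conversion must be spelled out.
    let out = collect_exact(values.into_iter());
    assert_eq!(out, vec![Some(1), None, Some(3)]);
}
```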

alamb commented Oct 7, 2025

🤖: Benchmark completed

group                                  8543-follow-up                         main
-----                                  --------------                         ----
BooleanArray::from_iter                2.51     59.9±0.15µs        ? ?/sec    1.00     23.9±0.08µs        ? ?/sec
BooleanArray::from_trusted_len_iter    1.79     40.0±0.06µs        ? ?/sec    1.00     22.4±0.03µs        ? ?/sec
Int64Array::from_iter                  2.21     92.0±0.19µs        ? ?/sec    1.00     41.6±0.10µs        ? ?/sec
Int64Array::from_trusted_len_iter      1.92     36.4±0.06µs        ? ?/sec    1.00     19.0±0.05µs        ? ?/sec
array_from_vec 128                     1.00    157.6±0.29ns        ? ?/sec    1.04    163.3±0.43ns        ? ?/sec
array_from_vec 256                     1.00    166.5±0.38ns        ? ?/sec    1.03    171.3±0.25ns        ? ?/sec
array_from_vec 512                     1.00    220.4±0.33ns        ? ?/sec    1.05    230.3±0.25ns        ? ?/sec
array_string_from_vec 128              1.00   1100.8±2.55ns        ? ?/sec    1.09   1197.5±1.43ns        ? ?/sec
array_string_from_vec 256              1.00   1890.3±3.01ns        ? ?/sec    1.02   1935.5±3.37ns        ? ?/sec
array_string_from_vec 512              1.05      3.4±0.02µs        ? ?/sec    1.00      3.2±0.01µs        ? ?/sec
decimal128_array_from_vec 32768        1.00     99.4±0.37µs        ? ?/sec    1.00     99.0±0.52µs        ? ?/sec
decimal256_array_from_vec 32768        1.02      3.9±0.03µs        ? ?/sec    1.00      3.9±0.01µs        ? ?/sec
decimal32_array_from_vec 32768         1.00     85.7±0.10µs        ? ?/sec    1.00     85.7±0.15µs        ? ?/sec
decimal64_array_from_vec 32768         1.00     97.3±0.24µs        ? ?/sec    1.00     97.6±0.37µs        ? ?/sec
struct_array_from_vec 1024             1.05      8.8±0.02µs        ? ?/sec    1.00      8.4±0.03µs        ? ?/sec
struct_array_from_vec 128              1.06  1972.7±10.11ns        ? ?/sec    1.00   1854.4±2.60ns        ? ?/sec
struct_array_from_vec 256              1.03      2.9±0.04µs        ? ?/sec    1.00      2.8±0.00µs        ? ?/sec
struct_array_from_vec 512              1.05      4.9±0.08µs        ? ?/sec    1.00      4.6±0.01µs        ? ?/sec
