Add Decimal32 and Decimal64 support to arrow-avro Reader #8255
Conversation
- Expanded Decoder to handle Decimal32 and Decimal64 types alongside the existing Decimal128 and Decimal256 support.
- Added precision and fixed-size validation for Decimal32 and Decimal64 to ensure Arrow compatibility.
- Updated the flush, decode, and append_null methods in the Decoder to process these new decimal types correctly.
- Refined the test suite with new test cases to validate Decimal32/Decimal64 decoding and integration logic.
- Introduced thorough coverage across data-backed, fixed-size, and precision-based decimal paths.
Thanks @jecsand838
Can we please add a description of the new binary files added?
Also, one thing I worry about is that Decimal32 and Decimal64 are relatively new types (thanks to @CurtHagenlocher 🙏 ) but their support across the various arrow kernels now is not very large.
Thus if you have the avro reader return Decimal32/Decimal64 arrays, it might be hard to use them in the rest of the arrow-rs world until we complete the rest of the support
How were these files created? Maybe we could add a README explaining what is in them and how they were made?
Sure, I can do that. I have an arrow-testing PR outstanding for these files and the script used to create them saved as a gist: apache/arrow-testing#112
https://gist.github.com/jecsand838/3890349bdb33082a3e8fdcae3257eef7
> Also, one thing I worry about is that Decimal32 and Decimal64 are relatively new types (thanks to @CurtHagenlocher 🙏 ) but their support across the various arrow kernels now is not very large.

Ah, that makes sense. Do you think it's worth feature flagging the `Decimal32` and `Decimal64` support for the time being?
@alamb I went ahead and got the `README.md` file and new `small_decimals` feature flag pushed up. Let me know what you think!

Also, if we need to hold off until `Decimal32` and `Decimal64` support has matured, then that's totally fine with me as well.
94f473f to e2a153e
Thank you @jecsand838
Which issue does this PR close?
Rationale for this change
Apache Avro's `decimal` logical type annotates either `bytes` or `fixed` and carries `precision` and `scale`. Implementations should reject invalid combinations such as `scale > precision`, and the underlying bytes are the two's-complement big-endian representation of the unscaled integer. On the Arrow side, Rust now exposes first-class `Decimal32`, `Decimal64`, `Decimal128`, and `Decimal256` data types with documented maximum precisions (9, 18, 38, and 76 respectively). Until now, `arrow-avro` decoded all Avro decimals to 128/256-bit Arrow decimals, even when a narrower type would suffice.

What changes are included in this PR?
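The byte-level rule above can be sketched as a standalone helper (hypothetical illustration, not arrow-avro's implementation): the unscaled integer is written as its minimal big-endian two's-complement byte string.

```rust
// Sketch: encode an unscaled decimal value the way Avro's decimal logical
// type stores it — minimal big-endian two's-complement bytes.
// (Hypothetical helper for illustration; not arrow-avro's actual code.)
fn encode_unscaled(value: i64) -> Vec<u8> {
    let full = value.to_be_bytes();
    // Drop leading bytes that are redundant sign extension: 0x00 bytes while
    // the next byte's sign bit is clear, or 0xFF bytes while it is set.
    let mut start = 0;
    while start < 7 {
        let b = full[start];
        let next = full[start + 1];
        let redundant =
            (b == 0x00 && next & 0x80 == 0) || (b == 0xFF && next & 0x80 != 0);
        if !redundant {
            break;
        }
        start += 1;
    }
    full[start..].to_vec()
}

fn main() {
    // 12.34 at scale 2 has unscaled value 1234 → bytes 0x04 0xD2.
    assert_eq!(encode_unscaled(1234), vec![0x04, 0xD2]);
    // -1234 in two's complement is 0xFB 0x2E.
    assert_eq!(encode_unscaled(-1234), vec![0xFB, 0x2E]);
}
```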
`arrow-avro/src/codec.rs`

- Map `Codec::Decimal(precision, scale, _size)` to Arrow's `Decimal32`/`64`/`128`/`256` by precision, preferring the narrowest type (≤9 → 32, ≤18 → 64, ≤38 → 128, otherwise 256).
- Reject invalid combinations where `scale > precision`.
- Error when `precision` exceeds Arrow's maximum (Decimal256).
- For `fixed`, check that the declared `precision` fits the byte width (≤4 → max 9, ≤8 → 18, ≤16 → 38, ≤32 → 76).
- Update the docs on `Codec::Decimal` to mention `Decimal32`/`64`.
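The precision-based selection above can be sketched as a standalone function (names here are hypothetical, for illustration only; the PR's actual mapping lives inside `Codec`):

```rust
// Sketch: pick the narrowest Arrow decimal width for a given Avro decimal
// precision, mirroring the mapping described above
// (≤9 → Decimal32, ≤18 → Decimal64, ≤38 → Decimal128, ≤76 → Decimal256).
#[derive(Debug, PartialEq)]
enum DecimalWidth {
    Decimal32,
    Decimal64,
    Decimal128,
    Decimal256,
}

fn narrowest_decimal(precision: usize) -> Result<DecimalWidth, String> {
    match precision {
        0 => Err("precision must be positive".to_string()),
        1..=9 => Ok(DecimalWidth::Decimal32),
        10..=18 => Ok(DecimalWidth::Decimal64),
        19..=38 => Ok(DecimalWidth::Decimal128),
        39..=76 => Ok(DecimalWidth::Decimal256),
        p => Err(format!("precision {p} exceeds Decimal256 maximum of 76")),
    }
}

fn main() {
    assert_eq!(narrowest_decimal(9).unwrap(), DecimalWidth::Decimal32);
    assert_eq!(narrowest_decimal(10).unwrap(), DecimalWidth::Decimal64);
    assert!(narrowest_decimal(77).is_err());
}
```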
`arrow-avro/src/reader/record.rs`

- Add `Decoder::Decimal32` and `Decoder::Decimal64` variants with corresponding builders (`Decimal32Builder`, `Decimal64Builder`).
- Builder selection:
- Implement decode paths that sign-extend Avro's two's-complement payload to 4/8 bytes and append values to the new builders; update `append_null`/`flush` for 32/64-bit decimals.
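The sign-extension step described above can be sketched as follows (a hypothetical standalone helper mirroring the decode path, not the PR's code), widening a big-endian two's-complement payload to an `i32` for the 32-bit case:

```rust
// Sketch: sign-extend an Avro decimal payload (big-endian two's-complement
// bytes of the unscaled value) into an i32 suitable for a Decimal32 builder.
// (Hypothetical helper for illustration; not arrow-avro's actual code.)
fn sign_extend_to_i32(bytes: &[u8]) -> Result<i32, String> {
    if bytes.len() > 4 {
        return Err(format!("payload of {} bytes overflows i32", bytes.len()));
    }
    // Fill leading bytes with 0xFF when the sign bit of the most significant
    // payload byte is set, otherwise with 0x00.
    let fill = if bytes.first().map_or(false, |b| b & 0x80 != 0) {
        0xFF
    } else {
        0x00
    };
    let mut buf = [fill; 4];
    buf[4 - bytes.len()..].copy_from_slice(bytes);
    Ok(i32::from_be_bytes(buf))
}

fn main() {
    // 0x04 0xD2 is unscaled 1234 (12.34 at scale 2).
    assert_eq!(sign_extend_to_i32(&[0x04, 0xD2]).unwrap(), 1234);
    // 0xFB 0x2E is -1234 in two's complement.
    assert_eq!(sign_extend_to_i32(&[0xFB, 0x2E]).unwrap(), -1234);
}
```

The 64-bit path is the same idea with an 8-byte buffer and `i64::from_be_bytes`.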
`arrow-avro/src/reader/mod.rs` (tests)

- Expand `test_decimal` to assert the new mappings (e.g., precision 10 maps to `Decimal64`), covering `Decimal32`, `Decimal64`, and `Decimal128` outputs.
- Add a nulls path test for bytes-backed `Decimal32`.

Are these changes tested?
Yes. Unit tests under `arrow-avro/src/reader/mod.rs` construct expected `Decimal32Array`/`Decimal64Array`/`Decimal128Array` with `with_precision_and_scale`, and compare against batches decoded from Avro files (including legacy fixed and bytes-backed cases). The tests also exercise small batch sizes to cover buffering paths; a new Avro data file is added for higher-width decimals.

New Avro test file details:
These new Avro test files were created using this script: https://gist.github.com/jecsand838/3890349bdb33082a3e8fdcae3257eef7
There is also an arrow-testing PR for these new files: apache/arrow-testing#112
Are there any user-facing changes?
N/A due to `arrow-avro` not being public.