[Variant] Support variant to Decimal32/64/128/256 #8552
Conversation
Looks good to me -- thank you @liamzwbao 🙏
```rust
let result = variant_get(&variant_array, options).unwrap();
let result = result.as_any().downcast_ref::<Decimal32Array>().unwrap();

assert_eq!(result.value(0), 124);
```
to match the test above, it would probably be good to assert result.precision() and result.scale() as well
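For example (a sketch only; 9 and 2 stand in for whatever precision and scale the test actually requests):

```rust
// Continuing the snippet above; the expected precision/scale values here are
// placeholders, not taken from the actual test.
assert_eq!(result.precision(), 9);
assert_eq!(result.scale(), 2);
```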
Likewise for the Decimal 64/128/256 cases too
Makes sense, added
Made an initial pass. The high-level comments are more important than the low-level syntax nits.
```rust
// scale_down means output has fewer fractional digits than input
// divide by 10^(input_scale - output_scale) with rounding
```
I'm a bit nervous about rounding (the whole point of decimal is to be lossless, unlike floating point). But I guess in this case the user specifically asked for the narrower type, so the usual worries about lossy coercion don't apply?
I think rounding makes sense here, since the Arrow/Variant conversion could also cause precision loss due to rescaling. But we could also introduce a new option to fail on precision loss if needed
```rust
let d = v.checked_div(div)?;
let r = v % div;

// rounding in the same way as convert_to_smaller_scale_decimal in arrow-cast
```
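For context, here is a standalone sketch of the half-away-from-zero rounding the comment refers to; this is my own illustration with a made-up helper name (round_div), not the arrow-cast code:

```rust
/// Divide `v` by `div` (a power of ten), rounding half away from zero,
/// mirroring the `d`/`r` pattern in the snippet above.
fn round_div(v: i128, div: i128) -> Option<i128> {
    let d = v.checked_div(div)?;
    let r = v % div;
    // Round away from zero when the remainder is at least half the divisor.
    if r.unsigned_abs() * 2 >= div.unsigned_abs() {
        d.checked_add(v.signum())
    } else {
        Some(d)
    }
}

fn main() {
    // 1234.567 at scale 3 rescaled to scale 1: divide by 10^(3-1) = 100
    assert_eq!(round_div(1_234_567, 100), Some(12_346)); // 1234.6
    assert_eq!(round_div(-1_234_567, 100), Some(-12_346)); // -1234.6
}
```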
Looking at convert_to_smaller_scale_decimal, virtually all the logic in that function is doing exactly what we want here... and then the last line just applies the calculation as an appropriate unary operation on the input array. Rather than duplicate the logic, is there some way we could factor it out or otherwise reuse it? Problem is, it's in a different crate, so the factored-out logic would have to be pub...
Those are internal helper functions; we could refactor the logic, but I'm not sure it's good to expose them. WDYT @alamb?
```rust
DataType::Decimal32(precision, scale) => Decimal32(
    VariantToDecimalArrowRowBuilder::new(cast_options, capacity, *precision, *scale)?,
```
To make sure I'm understanding correctly --

- Here, the user has requested e.g. Decimal32, so we create a decimal32 row builder
- The row builder invokes the VariantDecimalScaler trait, which eventually calls Variant::as_decimal4
- If the actual variant value was a wider decimal type, the conversion will produce None unless the unscaled value fits in the narrower type and the scale is small enough to fit as well (without rounding)?

But in this case, the user specifically requested rounding, so it seems odd to fail some of the time and not fail other times? In particular, going from Decimal32(9, 4) to Decimal32(9, 2) would succeed with rounding, but going from Decimal64(18, 4) to Decimal32(9, 2) would fail for a value like 1234567.8901, even tho the rescaled result 1234567.89 is a valid Decimal32(9, 2)?
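As a standalone illustration of that example using plain integer arithmetic (not the PR's code):

```rust
fn main() {
    // Decimal64(18, 4) value 1234567.8901, as an unscaled integer
    let v: i64 = 12_345_678_901;
    // Narrow first: the unscaled value does not fit in i32, so the cast fails.
    assert!(i32::try_from(v).is_err());
    // Rescale first (divide by 10^(4-2) with rounding), then narrow: succeeds.
    let (d, r) = (v / 100, v % 100);
    let rescaled = if r.abs() * 2 >= 100 { d + v.signum() } else { d };
    assert_eq!(rescaled, 123_456_789); // 1234567.89 at scale 2
    assert!(i32::try_from(rescaled).is_ok()); // fits Decimal32(9, 2)
}
```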
In order to correctly handle all valid narrowing conversions, we need to rescale+round first, using the original variant type, and then try to narrow the result to the requested type.
The converse hazard exists for widening, where we need to widen first, and then rescale+round:
- Converting the Decimal32(9, 9) value 0.999999999 to Decimal64(*, 0) produces an intermediate value with ten decimal digits.
- Converting the Decimal(9, 0) value 999999999 to Decimal64(18, 9) produces an intermediate (and final) value with 18 digits.
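A standalone illustration of the second widening example, again with plain integer arithmetic rather than the PR's code:

```rust
// Decimal32(9, 0) value 999999999 -> Decimal64(18, 9): multiply by 10^9.
fn widen_then_rescale(v: i32) -> Option<i64> {
    // Widen first, then rescale: the 18-digit intermediate fits in i64.
    (v as i64).checked_mul(10_i64.pow(9))
}

fn rescale_then_widen(v: i32) -> Option<i64> {
    // Rescale in the narrow type first: overflows i32 and is spuriously rejected.
    v.checked_mul(10_i32.pow(9)).map(i64::from)
}

fn main() {
    let v = 999_999_999_i32;
    assert_eq!(widen_then_rescale(v), Some(999_999_999_000_000_000));
    assert_eq!(rescale_then_widen(v), None);
}
```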
Looking at all possible combinations:

- We are converting unscaled value v1 of type Variant::DecimalXX(p1, s1) to datatypes::DecimalYY(p2, s2)
- The variant decimals have implied precision, so p1 is always one of {9, 19, 38} based on decimal type
- let n1 = p1-s1 and n2 = p2-s2 (the max number of integer digits before and after conversion)
- if n2 < n1, there is an inherent risk of overflow regardless of what scales are involved
  - before even looking at scale and scale changes, we should first verify that v1 fits in n2+s1 digits. If not, flag overflow immediately. Otherwise, set n1=n2 and proceed to the next case.
  - NOTE: This check does NOT require changing the type of v1, because total precision decreased.
- else if n2 = n1 and s2 < s1, there is an overflow risk when e.g. 0.999 rounds to 1.00
  - Rescale, and then verify that the rounded result fits in p2 digits.
  - NOTE: This check does NOT require changing the type of v1, because total precision decreased.
- else, there is no risk of overflow
  - Convert v1 to the new native type first
  - Then rescale and round as needed
  - NOTE: Both operations are infallible
That would correspond to something like the following code:
```rust
fn variant_to_unscaled_decimal32(
    variant: Variant<'_, '_>,
    precision: u8,
    scale: u8,
) -> Result<i32> {
    match variant {
        Variant::Decimal4(d) => {
            let s1 = d.scale();
            let mut n1 = VariantDecimal4::MAX_PRECISION - s1;
            let n2 = precision - scale;
            let v1 = d.integer();
            if n2 < n1 {
                // integer digits pose an overflow risk, and n2+s1 could even be out of precision range
                let max_value = MAX_DECIMAL32_FOR_EACH_PRECISION.get(n2 + s1);
                if max_value.is_none_or(|n| v1.unsigned_abs() > n) {
                    return Err(... overflow ...);
                }
                // else the value fits in n2 digits and we can pretend n1=n2
                n1 = n2;
            }
            if n2 == n1 {
                let v2 = ... rescale v1 and round up ...;
                if v2.unsigned_abs() > MAX_DECIMAL32_FOR_EACH_PRECISION[precision] {
                    return Err(... overflow ...);
                }
                // else the value can safely convert to the target type
                return Ok(v2 as _);
            }
            // no overflow possible, but still have to rescale and round
            let v1 = v1 as _;
            let v2 = ... rescale v1 and round up ...;
            Ok(v2)
        }
        Variant::Decimal8(d) => {
            ... almost the same code as for Decimal4 case ...
            ... except we use VariantDecimal8::MAX_PRECISION ...
            ... and we index into MAX_DECIMAL64_FOR_EACH_PRECISION ...
        }
        Variant::Decimal16(d) => {
            ... almost the same code as for Decimal4 case ...
            ... except we use VariantDecimal16::MAX_PRECISION ...
            ... and we index into MAX_DECIMAL128_FOR_EACH_PRECISION ...
        }
        Variant::Int8(i) => { ... treat it like `Variant::Decimal4(i, 0)` ... }
        Variant::Int16(i) => { ... treat it like `Variant::Decimal4(i, 0)` ... }
        Variant::Int32(i) => { ... treat it like `Variant::Decimal8(i, 0)` ... }
        Variant::Int64(i) => { ... treat it like `Variant::Decimal16(i, 0)` ... }
        _ => return Err(... not exact numeric data ...),
    }
}

fn variant_to_unscaled_decimal64(
    variant: Variant<'_, '_>,
    precision: u8,
    scale: u8,
) -> Result<i64> {
    ... exactly the same code as for decimal32 case ...
    ... but the changed return type means the `as _` casts now produce i64 ...
}

fn variant_to_unscaled_decimal128(
    variant: Variant<'_, '_>,
    precision: u8,
    scale: u8,
) -> Result<i128> {
    ... exactly the same code as for decimal32 case ...
    ... but the changed return type means the `as _` casts now produce i128 ...
}

fn variant_to_unscaled_decimal256(
    variant: Variant<'_, '_>,
    precision: u8,
    scale: u8,
) -> Result<i256> {
    ... exactly the same code as for decimal32 case ...
    ... but the changed return type means the `as _` casts now produce i256 ...
}
```
So, we'd want two macros:

- Outer macro that produces the body of the variant_to_unscaled_decimalXX functions
- Inner macro that produces the body of the Variant::DecimalXX match arms

We need macros because integer types lack any helpful trait hierarchy that generics could take advantage of.
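As a minimal, self-contained sketch of that macro-over-integer-widths pattern (the names and the rescale helper are illustrative only, not the PR's actual macros):

```rust
// Generate near-identical helpers for several integer widths with a macro,
// since i32/i64/i128 all expose checked_pow/checked_mul but share no common
// std trait that generic code could bound on.
macro_rules! make_rescale_up {
    ($fn_name:ident, $native:ty) => {
        /// Multiply `v` by 10^delta_scale, returning None on overflow.
        fn $fn_name(v: $native, delta_scale: u32) -> Option<$native> {
            (10 as $native).checked_pow(delta_scale)?.checked_mul(v)
        }
    };
}

make_rescale_up!(rescale_up_i32, i32);
make_rescale_up!(rescale_up_i64, i64);
make_rescale_up!(rescale_up_i128, i128);

fn main() {
    assert_eq!(rescale_up_i32(123, 2), Some(12_300));
    assert_eq!(rescale_up_i32(999_999_999, 9), None); // would overflow i32
    assert_eq!(rescale_up_i64(999_999_999, 9), Some(999_999_999_000_000_000));
    assert_eq!(rescale_up_i128(1, 30), Some(10_i128.pow(30)));
}
```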
Update: Corrected a potential array out of bounds index in the n2 < n1 case.
Oh! Watch out, arrow decimals can have negative scale. My analysis above didn't necessarily account for that; I'm not sure if the original code in this PR does?
In particular, negative scale allows infallible conversions such as VariantDecimal16(38, 0) to Decimal4(9, -30) with rounding, and the n1 vs. n2 checks I proposed above might not accurately capture this nuance.
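A quick numeric check of that case (plain arithmetic, assuming half-away-from-zero rounding):

```rust
fn main() {
    // A 38-digit VariantDecimal16(38, 0) value, rescaled to Decimal4(9, -30):
    // divide by 10^(0 - (-30)) = 10^30, so the result always fits in 9 digits.
    let v: i128 = 99_999_999_999_999_999_999_999_999_999_999_999_999;
    let divisor = 10_i128.pow(30);
    let (d, r) = (v / divisor, v % divisor);
    let rescaled = if r.unsigned_abs() * 2 >= divisor.unsigned_abs() {
        d + v.signum()
    } else {
        d
    };
    assert_eq!(rescaled, 100_000_000); // 9 digits: 1e38 represented at scale -30
}
```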
Thanks for catching this! Let me dig a bit deeper and improve this conversion. Indeed it's possible to get a null for a valid decimal.

For negative scale, I think it's covered in this method; I will add more tests for it. Also, the validate function in the scale_variant_decimal macro will check and make sure the value fits into a decimal with the specified precision.
I reviewed the implementation of cast_decimal_to_decimal<I, O> in arrow-cast, and it seems to already handle our cases quite well. Specifically:

- It checks is_infallible_cast, which covers case 3.
- For scale-up (s1 <= s2), it first converts I::Native to O::Native and then rescales. For scale-down (s1 > s2), it divides and rounds the result (I::Native) before converting to O::Native. This approach gracefully handles native-type overflow. The subsequent DecimalType::is_valid_decimal_precision call ensures precision validation, similar to our current MAX_DECIMAL32_FOR_EACH_PRECISION.get(n2 + s1) check, which effectively covers cases 1 & 2, where n2 < n1 or n2 == n1.
- That said, case 1 (n2 < n1) might present an optimization opportunity since we could skip rescaling. Functionally tho, the results should be the same. This could be explored in a follow-up PR.
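For reference, a small usage sketch of that existing arrow-cast behavior, using Decimal128 on both sides (the values are made up for illustration):

```rust
use arrow_array::{Array, Decimal128Array};
use arrow_cast::cast;
use arrow_schema::{ArrowError, DataType};

fn main() -> Result<(), ArrowError> {
    // 1234567.8901 stored as Decimal128(18, 4)
    let input = Decimal128Array::from(vec![12_345_678_901_i128])
        .with_precision_and_scale(18, 4)?;
    // Cast to Decimal128(9, 2): divides by 10^2 with rounding
    let output = cast(&input, &DataType::Decimal128(9, 2))?;
    let output = output.as_any().downcast_ref::<Decimal128Array>().unwrap();
    assert_eq!(output.value(0), 123_456_789); // 1234567.89
    Ok(())
}
```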
Given this overlap, instead of duplicating logic, I plan to refactor the decimal cast function by extracting the shared core logic into a helper like the one below and exposing it, though that means we'd need a dependency on arrow-cast:
```rust
fn rescale_decimal<I, O>(
    integer: I::Native,
    input_precision: u8,
    input_scale: i8,
    output_precision: u8,
    output_scale: i8,
) -> Option<O::Native>
where
    I: DecimalType,
    O: DecimalType,
    I::Native: DecimalCast,
    O::Native: DecimalCast,
```
Then, in our case, we can simply wire the type conversions through this helper:
```rust
fn variant_to_unscaled_decimal32(
    variant: Variant<'_, '_>,
    precision: u8,
    scale: u8,
) -> Result<i32> {
    match variant {
        Variant::Decimal4(d) => rescale_decimal::<Decimal32, Decimal32>(
            d.integer(), VariantDecimal4::MAX_PRECISION, d.scale(), precision, scale),
        Variant::Decimal8(d) => rescale_decimal::<Decimal64, Decimal32>(
            d.integer(), VariantDecimal8::MAX_PRECISION, d.scale(), precision, scale),
        Variant::Decimal16(d) => rescale_decimal::<Decimal128, Decimal32>(
            d.integer(), VariantDecimal16::MAX_PRECISION, d.scale(), precision, scale),
        Variant::Int8(i) => rescale_decimal::<Decimal32, Decimal32>(
            i, VariantDecimal4::MAX_PRECISION, 0, precision, scale),
        Variant::Int16(i) => rescale_decimal::<Decimal32, Decimal32>(
            i, VariantDecimal4::MAX_PRECISION, 0, precision, scale),
        Variant::Int32(i) => rescale_decimal::<Decimal32, Decimal32>(
            i, VariantDecimal4::MAX_PRECISION, 0, precision, scale),
        Variant::Int64(i) => rescale_decimal::<Decimal64, Decimal32>(
            i, VariantDecimal8::MAX_PRECISION, 0, precision, scale),
        _ => return Err(... not exact numeric data ...),
    }
}
```
Let me know if you see any potential risks or edge cases I might have overlooked.
Thanks for the thorough review, @scovich! Addressed most of the comments; will improve the type cast next.
I think this may be related to #8562 🤔
I spent the last couple days trying to come up with a refactor that really makes sense, and failed. The biggest issue is that several parts of the array casting code are fallible, producing an Err result if e.g. a rescaling operation overflows, or if the precision and scale are invalid. And ArrowError allocates a string.

But the variant casting code really wants Option instead of Result, and ends up calling Result::ok and discarding the allocated string. And then later it might convert that None back to an error, if strict casting was enabled. Very annoying.

Overall, given the small amount of code we're trying to reuse, and the viral nature of Result vs. Option choices (see O::is_valid_decimal_precision vs O::validate_decimal_precision), I start to suspect that we should just duplicate the logic and be done with it. Which is unfortunate because duplication can diverge, and I already found one bug in the arrow cast code while playing around with this refactor.

I left my exploratory comments in place for others to look at and critique, tho.
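For what it's worth, a toy standalone illustration of that friction (stand-in types only, not ArrowError or the PR's builders):

```rust
#[derive(Debug)]
struct ErrorLike(String); // stands in for ArrowError, which carries a String

fn validate_precision(v: i64, precision: u8) -> Result<i64, ErrorLike> {
    let max = 10_i64.pow(precision as u32) - 1;
    (v.abs() <= max)
        .then_some(v)
        .ok_or_else(|| ErrorLike(format!("{v} exceeds precision {precision}"))) // allocates
}

fn cast_row(v: i64, precision: u8, strict: bool) -> Result<Option<i64>, ErrorLike> {
    // The per-row path wants Option, so the error (and its String) is discarded...
    let cast = validate_precision(v, precision).ok();
    match (cast, strict) {
        // ...and strict mode later allocates a brand-new error for the same failure.
        (None, true) => Err(ErrorLike(format!("cast failed for {v}"))),
        (cast, _) => Ok(cast), // non-strict: the failed row becomes null
    }
}

fn main() {
    assert_eq!(cast_row(123, 9, true).unwrap(), Some(123));
    assert_eq!(cast_row(1_000_000_000, 9, false).unwrap(), None);
    assert!(cast_row(1_000_000_000, 9, true).is_err());
}
```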
arrow-cast/src/cast/decimal.rs
```rust
let d = x.div_wrapping(div);
let r = x.mod_wrapping(div);
// make sure we don't perform calculations that don't make sense w/o validation
validate_decimal_precision_and_scale::<O>(output_precision, output_scale)?;
```
Why did this move? It used to get called only for infallible casts, now it gets called for all casts?
I think this could be an improvement since it validates the output precision and scale before performing operations on the array. However, these checks don't affect correctness, because the same validation is performed again when creating the output decimal array here.

The main benefit is that it can fail early on invalid operations and avoid unnecessary work on the array, but it does add some overhead for valid operations since the conditions are checked twice.

So to be consistent, I think we should either add or remove this check across all branches.
arrow-cast/src/cast/decimal.rs
```rust
/// Build a rescale function from (input_precision, input_scale) to (output_precision, output_scale)
/// returning a closure `Fn(I::Native) -> Option<O::Native>` that performs the conversion.
pub fn rescale_decimal<I, O>(
```
This refactor seems a bit "backward" to me, which probably causes the benchmark regressions:

- Original code was dispatching to two methods (convert_to_smaller_scale_decimal and convert_to_bigger_or_equal_scale_decimal) from two locations (cast_decimal_to_decimal and cast_decimal_to_decimal_same_type). This avoided some branching in the inner cast loop, because the branch on direction of scale change is taken outside the loop.
- New code pushes everything down into this new rescale_decimal method, which not only requires the introduction of a new is_infallible_cast helper method, but also leaves the two convert_to_xxx_scale_decimal methods with virtually identical bodies. At that point we may as well eliminate those helpers entirely and avoid the code bloat... but the helpers probably existed for a reason (to hoist at least some branches out of the inner loop).
- The new code also allocates errors that get downgraded to empty options, where the original code upgraded empty options to errors. Arrow errors allocate strings, so that's a meaningful difference.

I wonder if we should instead do:

- rework convert_to_smaller_scale_decimal and convert_to_bigger_or_equal_scale_decimal
  - no longer take array or cast_options as input
  - return Ok((f, is_infallible_cast)), which corresponds to the return type Result<(impl Fn(I::Native) -> Option<O::Native>, bool), ArrowError>
- define a new generic apply_decimal_cast function helper
  - it takes as input array, cast_options and the (impl Fn, bool) pair produced by a convert_to_xxx_scale_decimal helper
  - it handles the three ways of applying f to an array
- rework cast_decimal_to_decimal and cast_decimal_to_decimal_same_type to call those functions (see below)
- rescale_decimal would be the single-row equivalent of cast_decimal_to_decimal, returning Option<O::Native>
- The decimal builder's constructor calls validate_decimal_precision_and_scale and fails on error, so we don't need to validate on a per-row basis.
cast_decimal_to_decimal:

```rust
let array: PrimitiveArray<O> = if input_scale > output_scale {
    let (f, is_infallible_cast) = convert_to_smaller_scale_decimal(...)?;
    apply_decimal_cast(array, cast_options, f, is_infallible_cast)?
} else {
    let (f, is_infallible_cast) = convert_to_bigger_or_equal_scale_decimal(...)?;
    apply_decimal_cast(array, cast_options, f, is_infallible_cast)?
}
```
rescale_decimal:

```rust
if input_scale > output_scale {
    let (f, _) = convert_to_smaller_scale_decimal(...)?;
    f(integer)
} else {
    let (f, _) = convert_to_bigger_or_equal_scale_decimal(...)?;
    f(integer)
}
```
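And for completeness, a rough sketch of what the proposed apply_decimal_cast helper itself might look like; the signature and error message are assumptions, while unary / unary_opt / try_unary are the existing PrimitiveArray kernels:

```rust
use arrow_array::types::ArrowPrimitiveType;
use arrow_array::PrimitiveArray;
use arrow_cast::CastOptions;
use arrow_schema::ArrowError;

/// Apply a per-value rescale closure to an array in one of three ways,
/// depending on fallibility and the `safe` cast option.
fn apply_decimal_cast<I, O, F>(
    array: &PrimitiveArray<I>,
    cast_options: &CastOptions<'_>,
    f: F,
    is_infallible: bool,
) -> Result<PrimitiveArray<O>, ArrowError>
where
    I: ArrowPrimitiveType,
    O: ArrowPrimitiveType,
    F: Fn(I::Native) -> Option<O::Native>,
{
    let result = if is_infallible {
        // 1. Infallible cast: plain unary kernel, no per-row checks
        array.unary(|v| f(v).expect("infallible cast"))
    } else if cast_options.safe {
        // 2. Fallible + safe: rows that fail become nulls
        array.unary_opt(|v| f(v))
    } else {
        // 3. Fallible + strict: the first failure aborts the whole cast
        array.try_unary(|v| {
            f(v).ok_or_else(|| ArrowError::CastError("value out of range".to_string()))
        })?
    };
    Ok(result)
}
```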
Hi @scovich, yeah, the core functionality we need is just the rescaling logic. Once #8580 is merged, I will apply the same fix here. The downside is that if we find a similar bug in the future, we'll need to fix it in both places. But I think the refactor of ...
Which issue does this PR close?
Rationale for this change
What changes are included in this PR?
Variant → Decimal32/64/128/256
Are these changes tested?
Yes
Are there any user-facing changes?
New cast types supported