Closed
Hello everyone! Maybe I'm doing something wrong here, but I'm experiencing a pretty heavy performance hit with double.remainder() on ARM64 (M3) in Dart 3.3.3 on macOS.
As per the documentation, a.remainder(b) is mathematically equivalent to:
double remainder(final double a, final num b) => a - (a ~/ b) * b;
But when I compare the two, the custom implementation is somewhere between one and two orders of magnitude faster:
import 'package:benchmark_harness/benchmark_harness.dart';

// Benchmarks the built-in double.remainder() over a grid of operands.
class RemainderBenchmark extends BenchmarkBase {
  const RemainderBenchmark() : super('Remainder');

  @override
  void run() {
    for (double i = -500; i <= 500; i += 0.75) {
      for (double j = -500; j <= 500; j += 0.75) {
        final _ = i.remainder(j);
      }
    }
  }
}

// Benchmarks the custom remainder() function defined above over the same grid.
class RemainderCustomBenchmark extends BenchmarkBase {
  const RemainderCustomBenchmark() : super('RemainderCustom');

  @override
  void run() {
    for (double i = -500; i <= 500; i += 0.75) {
      for (double j = -500; j <= 500; j += 0.75) {
        final _ = remainder(i, j);
      }
    }
  }
}
AOT (dart compile exe):
RemainderCustom(RunTime): 38108.96153846154 us.
Remainder(RunTime): 478124.0 us. <- About 13x slower!
JIT (dart run):
RemainderCustom(RunTime): 9825.791469194313 us.
Remainder(RunTime): 507507.0 us. <- About 52x slower!
Any ideas why this is the case?
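For anyone who wants to reproduce the gap without pulling in package:benchmark_harness, here is a minimal Stopwatch-based sketch over the same grid. The names remainderCustom and timeSweep are made up for this example. There is no warm-up, and routing both operations through a closure adds call overhead of its own, so the absolute numbers (and possibly the ratio) will not match the figures above.

```dart
// Formula from the issue description, renamed to avoid confusion with
// the built-in method.
double remainderCustom(double a, num b) => a - (a ~/ b) * b;

// Hypothetical helper: runs one sweep over the benchmark grid, prints the
// elapsed time, and returns it in microseconds.
int timeSweep(String label, double Function(double, double) op) {
  // Accumulate the results so the calls are not trivially dead code.
  var checksum = 0.0;
  final stopwatch = Stopwatch()..start();
  for (double i = -500; i <= 500; i += 0.75) {
    for (double j = -500; j <= 500; j += 0.75) {
      checksum += op(i, j);
    }
  }
  stopwatch.stop();
  print('$label: ${stopwatch.elapsedMicroseconds} us (checksum $checksum)');
  return stopwatch.elapsedMicroseconds;
}

void main() {
  timeSweep('Remainder', (a, b) => a.remainder(b));
  timeSweep('RemainderCustom', remainderCustom);
}
```

Running it once under dart run and once as a dart compile exe binary gives a rough JIT versus AOT comparison. For numbers worth quoting, the BenchmarkBase classes above are the better tool.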
lrhn commented on Apr 16, 2024
Similar results on (...checks...) Intel CPU.
An up to 40x slowdown suggests to me that the code is either using a very inefficient assembler instruction, or it's calling into the C runtime for each .remainder call. Even using % is faster, but still much slower than the custom version.
Or maybe remainder on doubles is just amazingly expensive to do correctly, and the approximation won't give the same result in all cases. (And if that's OK, it's a good optimization.)
I do think double.remainder is compiled as a runtime call. The code for it is:
which means calling a function in double.cc. There could be an intrinsified version that replaces this with generated assembler, but I can't find it.
JosefWN commented on Apr 16, 2024
Yes, I was also thinking it could be some form of call overhead; it seems unlikely that disastrous assembler is produced across multiple platforms? I actually re-implemented % as well, but it is quite performant on ARM64. I didn't manage to beat it...
I think it's not a mathematical approximation, but an equivalence. I also considered that the two could be different numerically, but they seem to produce identical results over a pretty large number of inputs, i.e. i.remainder(j) == remainder(i, j) holds, in accordance with the docs (although i.remainder(j) produces a more straightforward exception when j = 0). Of course, there could be non-trivial edge cases I have overlooked.
a-siva commented on Apr 17, 2024
//cc @alexmarkov @aam
Vector2.clone() performance hit for x86-64 and ARM64? google/vector_math.dart#319
Vector2.clone() performance hit for x86-64 and ARM64? [MOVED] #55542
alexmarkov commented on Apr 29, 2024
https://dart-review.googlesource.com/c/sdk/+/364780
[vm] Optimize double.remainder
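On the question raised earlier in the thread of whether the formula and double.remainder agree on every input, here is a small sketch for probing it. It assumes the native Dart VM (64-bit ints). remainderCustom and probe are hypothetical names for this example, and the extra operand pairs are only illustrative picks from outside the benchmark grid: a very large quotient, a zero divisor, and an infinite dividend.

```dart
// Formula from the issue description.
double remainderCustom(double a, num b) => a - (a ~/ b) * b;

// Hypothetical helper: runs f and reports either its value or the type of
// the error it threw.
String probe(double Function() f) {
  try {
    return '${f()}';
  } catch (e) {
    return 'threw ${e.runtimeType}';
  }
}

void main() {
  // Same grid as the benchmark: count operand pairs where the two disagree.
  var mismatches = 0;
  for (double i = -500; i <= 500; i += 0.75) {
    for (double j = -500; j <= 500; j += 0.75) {
      if (i.remainder(j) != remainderCustom(i, j)) mismatches++;
    }
  }
  print('grid mismatches: $mismatches');

  // Operands outside the grid, where a ~/ b may overflow the int range or
  // may not be defined at all.
  final pairs = [
    [1e22, 3.0],
    [5.0, 0.0],
    [double.infinity, 2.0],
  ];
  for (final pair in pairs) {
    final a = pair[0];
    final b = pair[1];
    print('a=$a b=$b '
        'built-in: ${probe(() => a.remainder(b))} '
        'custom: ${probe(() => remainderCustom(a, b))}');
  }
}
```

The reason to look beyond the grid is that a ~/ b has to produce an int, so the formula depends on the quotient fitting the integer range and being finite, while double.remainder has no such intermediate step.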