Skip to content

Vector2.clone() performance hit for x86-64 and ARM64? #319

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
JosefWN opened this issue Apr 21, 2024 · 6 comments
Closed

Vector2.clone() performance hit for x86-64 and ARM64? #319

JosefWN opened this issue Apr 21, 2024 · 6 comments

Comments

@JosefWN
Copy link

JosefWN commented Apr 21, 2024

Hello! I'm experiencing a pretty heavy performance hit with additions using Vector2 on ARM64 (M3) in Dart 3.3.3 on macOS. Haven't tested other arithmetics.

For comparison I have used math.Point and a custom PointDouble to avoid potential issues with generics and monomorphization (as detailed in dart-lang/sdk#53912):

class PointBenchmark extends BenchmarkBase {
  const PointBenchmark() : super('AddPoint');

  @override
  void run() {
    for (double i = -500; i <= 500; i += 0.75) {
      for (double j = -500; j <= 500; j += 0.75) {
        final _ = math.Point<double>(i, j) + math.Point<double>(j, i);
      }
    }
  }
}

class PointDoubleBenchmark extends BenchmarkBase {
  const PointDoubleBenchmark() : super('AddPointDouble');

  @override
  void run() {
    for (double i = -500; i <= 500; i += 0.75) {
      for (double j = -500; j <= 500; j += 0.75) {
        final _ = PointDouble(i, j) + PointDouble(j, i);
      }
    }
  }
}

class Vector2Benchmark extends BenchmarkBase {
  const Vector2Benchmark() : super('AddVector2');

  @override
  void run() {
    for (double i = -500; i <= 500; i += 0.75) {
      for (double j = -500; j <= 500; j += 0.75) {
        final _ = Vector2(i, j) + Vector2(j, i);
      }
    }
  }
}

Where PointDouble is:

class PointDouble {
  final double x;
  final double y;

  PointDouble operator +(final PointDouble other) =>
      PointDouble(x + other.x, y + other.y);

  const PointDouble(this.x, this.y);
}

AOT (dart compile exe):

AddPoint(RunTime): 9002.478813559323 us.
AddPointDouble(RunTime): 9050.669491525423 us.
AddVector2(RunTime): 145700.7142857143 us. <- about 16x slower!

JIT (dart run):

AddPoint(RunTime): 84623.54166666667 us. <- Monomorphization issues?
AddPointDouble(RunTime): 8925.813559322034 us.
AddVector2(RunTime): 8937.521186440677 us.
@JosefWN
Copy link
Author

JosefWN commented Apr 22, 2024

This is using vector_math_64.dart, but it also applies to vector_math.dart. I have confirmed that the code is compiled to native (i.e. not using Rosetta).

The benchmarks seem to get scheduled on the performance cores of the machine judging by the observed load pattern in the CPU history in the activity monitor, which should perform 128-bit NEON/SIMD instructions more efficiently than the efficiency cores.

Just initializing Vector2(i, j) and Vector2(j, i) takes about 9000 us (AOT), but the addition involves a call to clone() which seemingly is extremely expensive:

class Vector2CloneBenchmark extends BenchmarkBase {
  const Vector2CloneBenchmark() : super('CloneVector2');

  @override
  void run() {
    for (double i = -500; i <= 500; i += 0.75) {
      for (double j = -500; j <= 500; j += 0.75) {
        final _ = Vector2(j, i).clone();
      }
    }
  }
}

AOT:

CloneVector2(RunTime): 135596.2 us.

It could be completely unrelated, but the other day I observed another heavy performance hit for the double implementation of i.remainder(j) which seems to occur across different platforms: dart-lang/sdk#55479.

EDIT: Compiled and ran the benchmark (AOT) on x86-64 and it seems to be an issue there too (Vector2 is 22.5x slower than the alternatives from my initial post).

@JosefWN JosefWN changed the title Vector2 performance hit on ARM64? Vector2.clone() performance hit for x86-64 and ARM64? Apr 22, 2024
@johnmccutchan
Copy link
Collaborator

This should probably be a Dart VM bug as there is nothing that we can do here.

@PlugFox
Copy link
Contributor

PlugFox commented Dec 8, 2024

@JosefWN @johnmccutchan

Guys, you are writing benchmarks absolutely incorrectly!

// ignore_for_file: avoid_print

import 'package:benchmark_harness/benchmark_harness.dart';
import 'package:vector_math/vector_math_64.dart';

// dart run benchmarks/iterator_benchmark.dart
void main() {
  (<BenchmarkBase>[
    const Vector2CloneBenchmark$Bad(),
    const Vector2CloneBenchmark$Good(),
  ].map<({String name, double us})>(_measure).toList(growable: false)
        ..sort((a, b) => a.us.compareTo(b.us)))
      .map<String>((e) => 'Benchmark ${e.name}: ${e.us.toStringAsFixed(2)} us')
      .forEach(print);
}

({String name, double us}) _measure(BenchmarkBase benchmark) =>
    (name: benchmark.name, us: benchmark.measure());

class Vector2CloneBenchmark$Bad extends BenchmarkBase {
  const Vector2CloneBenchmark$Bad() : super('CloneVector2$Bad');

  @override
  void run() {
    for (var i = -500.0; i <= 500; i += 0.75) {
      for (var j = -500.0; j <= 500; j += 0.75) {
        final _ = Vector2(j, i).clone();
      }
    }
  }
}

class Vector2CloneBenchmark$Good extends BenchmarkBase {
  const Vector2CloneBenchmark$Good() : super('CloneVector2$Good');

  @override
  void run() {
    Vector2? vec;
    for (var i = -500.0; i <= 500; i += 0.75) {
      for (var j = -500.0; j <= 500; j += 0.75) {
        vec = Vector2(j, i).clone();
      }
    }
    if (vec == null) throw StateError('Vector2 is null');
  }
}

This is why you get strange results.

Benchmark CloneVector2$Bad: 7283.50 us
Benchmark CloneVector2$Good: 74382.63 us

This is not about the processor architecture, this is about compiler optimization.

The way you write the benchmark simply tells the compiler that all the code inside the loop can be simply thrown away, since it does nothing.

At the very least, always check the result, this will disable optimizations.

The same about /benchmark directory

@spydon
Copy link
Collaborator

spydon commented Dec 8, 2024

@PlugFox this doesn't seem related to this issue, I think you should open a separate issue for this.

@PlugFox
Copy link
Contributor

PlugFox commented Dec 8, 2024

@spydon I can fix benchmarks and open new PR, is that okay?

@spydon
Copy link
Collaborator

spydon commented Dec 8, 2024

@spydon I can fix benchmarks and open new PR, is that okay?

Absolutely!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants