[ffi] function call overhead? #52692


Open
modulovalue opened this issue Jun 13, 2023 · 5 comments
Labels

  • area-vm: Use area-vm for VM related issues, including code coverage, and the AOT and JIT backends.
  • library-ffi
  • P3: A lower priority bug or feature request
  • triaged: Issue has been triaged by sub team
  • type-performance: Issue relates to performance or code size

Comments

@modulovalue
Contributor

modulovalue commented Jun 13, 2023

I wanted to see if it would be possible to implement the features proposed in #52673 via FFI.

I've created a demo that compiles pure assembly to a dylib. (The demo can be found here; the file in that repo should run on any ARM/macOS machine and reproduce my observations.)

I was able to make the following measurements on my arm macbook:

 - total: 9884992
   => via lookup table took: 3.816ms
 - total: 499999500000
   => via assembly took: 9.608ms
 - total: 499999500000
   => id control took: 0.473ms
 - total: 499999500000
   => id control via closure call took: 4.765ms
 - total: 499999500000
   => id control via function call took: 3.405ms
  • The "lookup table" part calculates popcounts via a lookup table approach in Dart.
  • The "assembly" part passes a single integer back to Dart from assembly via a dylib.
  • "id control" does what "assembly" does and nothing else.
  • "id control via closure" does what "assembly" does via a closure.
  • "id control via function" does what "assembly" does via a do-not-inline annotated function.

My conclusion is that calling assembly via a dynamically linked library takes ~2x as long as invoking a closure and ~3x as long as invoking a function. Furthermore, it seems like it wouldn't make sense to use this approach for exposing ARM instructions to Dart, as the overhead of calling them is too big.

I did expect the dylib calls to have some overhead, but I don't know how much overhead to expect.

I wanted to ask the following:

  • Is this observation about the overhead of dylib function calls expected?
  • Is the overhead of calling dylib functions expected to change in the near future?
  • Are native assets expected to improve the overhead of native function calls?
  • Would statically linked libraries (It appears like there's some work planned on that) improve the performance of native function calls?
  • Are there any plans to support some form of optimization (LTO?) that could inline functions found in statically linked libraries to remove the overhead of calling functions in native code?

All in all, I'd like to be able to ship custom assembly with Dart, and the above questions are meant to find out if that could become practical (in a way that does not incur any performance penalties) in some form in the near future.

@mit-mit
Member

mit-mit commented Jun 13, 2023

cc @dcharkes

@mit-mit added the area-vm and library-ffi labels on Jun 13, 2023
@mit-mit
Member

mit-mit commented Jun 13, 2023

What Dart execution mode are you running the measurements on? JIT (dart run ...)? Or AOT (dart compile exe ...)?

@modulovalue
Contributor Author

My measurements were taken on JIT (dart run ...). I haven't tested AOT yet, but I should; the demo doesn't currently support it because it uses dart:mirrors.

@modulovalue
Contributor Author

Here are the results for AOT:

 - total: 9884992
   => via lookup table took: 3.013ms
 - total: 499999500000
   => via assembly took: 6.905ms
 - total: 499999500000
   => id control took: 0.483ms
 - total: 499999500000
   => id control via closure call took: 3.105ms
 - total: 499999500000
   => id control via function call took: 0.952ms

Pure function calls have improved significantly; "assembly" has improved only slightly, which does not affect my conclusion.

@dcharkes
Contributor

dcharkes commented Jun 13, 2023

@modulovalue

lookupFunction should be passed isLeaf: true; this will make it faster.

Using the @Native external functions instead of DynamicLibrary.open + lookupFunction will make it faster (also use isLeaf: true). You need to use the --enable-experiment=native-assets flag for this. For more info see #50565. (An alternative to using the experimental flag is to dlopen with global flags first, and then @Native externals will be resolved in the process. See an example in https://github.com/dart-lang/sdk/tree/main/benchmarks/FfiCall/dart.)
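A sketch of both binding styles, assuming a dylib that exports `int64_t identity(int64_t)` (the symbol name and library path are hypothetical):

```dart
import 'dart:ffi';

// Style 1: look the symbol up at runtime. isLeaf: true tells the VM
// the native function is a quick leaf call, skipping the more
// expensive generic call transition.
final dylib = DynamicLibrary.open('libidentity.dylib');
final identity = dylib
    .lookupFunction<Int64 Function(Int64), int Function(int)>(
  'identity',
  isLeaf: true,
);

// Style 2: @Native external function, resolved eagerly instead of
// going through a closure. Requires --enable-experiment=native-assets,
// or that the symbol is already resolvable in the process (see the
// FfiCall benchmark linked above).
@Native<Int64 Function(Int64)>(symbol: 'identity', isLeaf: true)
external int identityNative(int x);
```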

@Native externals will become even faster after landing: https://dart-review.googlesource.com/c/sdk/+/284300

Background info:

  • lookupFunction creates a closure, so it is always slower than a closure call
  • a Dart function call might not be a function call at all; it might be inlined 🚀 If you want to measure the function call itself you can add @pragma('vm:never-inline'). You can also check what inlining is happening with dart --trace-inlining ...
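A minimal sketch of such a never-inlined control function (the name is hypothetical):

```dart
// A trivial identity function. Without the pragma, the VM would
// likely inline it, and the call being measured would disappear
// from the benchmark entirely.
@pragma('vm:never-inline')
int identityDart(int x) => x;
```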

The remaining questions:

Would statically linked libraries (It appears like there's some work planned on that) improve the performance of native function calls?

I have done some exploration #49418 https://dart-review.googlesource.com/c/sdk/+/251263.

I have not done any performance measurements on the exploration. I expect it to be slightly faster than @Native external calls, but not by a whole lot: the only difference would be removing a single load, and the call instruction taking a relative address rather than a register as its argument (which might possibly make the branch predictor happy).
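In (hypothetical AArch64) instruction terms, the difference being described would be roughly:

```asm
// @Native external call today (sketch):
    ldr  x16, [x15, #8]      // load the target address (the extra load)
    blr  x16                 // indirect call through a register

// with static linking (hypothetical):
    bl   _identity           // pc-relative direct call, no load
```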

Yes, I'd like to land that work =)

Are there any plans to support some form of optimization (LTO?) that could inline functions found in statically linked libraries to remove the overhead of calling functions in native code?

@mraleph mentioned that LTO doesn't work unfortunately: #49418 (comment)

@a-siva added the type-performance, triaged, and P3 labels on Nov 30, 2023