Skip to content

VM compiler support for cachable function calls in IL #51618

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
mkustermann opened this issue Mar 3, 2023 · 5 comments
Closed

VM compiler support for cachable function calls in IL #51618

mkustermann opened this issue Mar 3, 2023 · 5 comments
Labels
area-vm Use area-vm for VM related issues, including code coverage, and the AOT and JIT backends. library-ffi

Comments

@mkustermann
Copy link
Member

mkustermann commented Mar 3, 2023

In the implementation of FFI natives like this

@Native<Void Function(Int8)>("foo")
external void foo(int arg);

and a call site like this

main() {
 foo(1);
 foo(2);
}

we have two resolve the native symbol "foo" and then call the C function. We obviously want to avoid to lookup the symbol every time, so we need to either eagerly do it (e.g. at loading time of kernel/aot-snapshot) or we want to cache the result.

Old-school natives solve this problem by having a resolve function pointer in the native, which after resolving, gets patched to the actual native. This can be done for old natives more easily, since the calling convention of the resolve function and the target native function is the same. For FFI this is not the case. There's several compilations (e.g. old natives are unoptimized and cannot be inlined, so stack walker can obtain function object and native name, ffi calls could be inlined, ..., resolver function would need to preserve all args, ...)

There may be an easy way to solve this problem without too much overhead. We compile invocations like foo(1) to something like this:

<nativ-name> <- Constant(String("foo"))

<target_address> <- CachableStaticCall(ResolveNativeSymbol, <native-name>);
result <- FfiCall(<target_address>, args)

the CachableStaticCall() can be implemented as

void CachableStaticCall::MakeLocationSummary(...) {
  // We don't block any registers in common case, slow path code will push live regs in uncommon case.
  return LocationSummary(... kCallOnSlowPath); 
}
void CachableStaticCall::EmitNativeCode(...) {
  Label done;
  const auto idx = __ AllocatePatchablePoolIndex();
  auto slow_path = CachableStaticCallSlowPath(native_symbol_name, idx, &done);

  __ movq(R0, Address(PP, idx));
  __ BranchIfZero(&slow_path);
  __ Bind(&done);
}

class CachableStaticCallSlowPath : public SlowPath {
  void EmitnativeCode() {
    // push alive resg
    // perform call
    __ movq(Address(PP, idx), R0); // <-- patch pool entry
    // pop regs
   __ Branch(&done);
}

That would add a general capability of function calls with caching ability which could be used by any idempotent function call (i.e. arguments are constant and result will be constant).

@mkustermann mkustermann added area-vm Use area-vm for VM related issues, including code coverage, and the AOT and JIT backends. library-ffi labels Mar 3, 2023
@dcharkes
Copy link
Contributor

dcharkes commented Mar 3, 2023

Context https://dart-review.googlesource.com/c/sdk/+/284300/7/runtime/vm/compiler/frontend/kernel_to_il.cc

The prototype does:

  1. IL instruction for pool load/store
  2. loading potentially cached address from pool in IL (intermediate language, in kernel_to_il.cc)
  3. branching in IL
  4. StaticCall to FfiResolverFunction in IL
  5. storing address to cache in IL

This approach would be:

  1. CachableStaticCall in IL
  2. loading branching storing cache in MC (machine code, in il.cc)

This would also be cleaner from a layering perspective. The prototype exposes the 'pool' in IL, but the pool is not accessible in IL it only exists in MC generation.

@dcharkes
Copy link
Contributor

dcharkes commented May 9, 2023

Some observations from trying to implement this in a general way, using this in #47625, and chat discussions. (Posting them here so they don't get lost.)

  • Caching calls in general means using Object::sentinel() as marker that the call hasn't been performed yet. (Similar to final fields. We could think as these as isolate-group fields.)
  • For the use case in Remove static fields in FfiNative's #47625, we don't we want to set the value to Object::sentinel() again in appjit snaphots.
  • For the use case in Remove static fields in FfiNative's #47625, we could want to use unboxed values instead of tagged objects. But that would mean we would also need to specify the uninitialized value. (Both on the instruction as well on the pool entry so that we know what value to set it to for appjit.)

Since we only have a single use-case now, we might avoid over-generalizing the implementation for now.

@dcharkes
Copy link
Contributor

dcharkes commented May 10, 2023

We don't visit the object pool entries. If we were to visit object pool entries we would have to never store any raw values with the least significant bit set to 1 or split it into two groups.

Until then, cacheable static calls will only work for static functions returning an int and we store the value unboxed. And this also means that we have to designate an int value, for example 0 as cache-not-initialized.

@dcharkes
Copy link
Contributor

dcharkes commented Aug 16, 2023

void CachableStaticCall::MakeLocationSummary(...) {
  // We don't block any registers in common case, slow path code will push live regs in uncommon case.
  return LocationSummary(... kCallOnSlowPath); 
}
void CachableStaticCall::EmitNativeCode(...) {
  Label done;
  const auto idx = __ AllocatePatchablePoolIndex();
  auto slow_path = CachableStaticCallSlowPath(native_symbol_name, idx, &done);

  __ movq(R0, Address(PP, idx));
  __ BranchIfZero(&slow_path);
  __ Bind(&done);
}

Modeling these cachable static calls after static calls does not give use the entirely desired behavior. Because a slow path is an assembler concept using the normal IL pushing leads to the arguments being loaded on the fast path.

Kernel building:

  body += Constant(asset_id);
  body += Constant(symbol);
  body += Constant(Smi::ZoneHandle(Smi::New(arg_n)));
  body +=
      MemoizableIdempotentCall(TokenPosition::kNoSource, FfiResolverFunction(),
                               /*argument_count=*/3,
                               /*argument_names=*/Array::null_array(),
                               /*type_args_count=*/0);

Resulting machine code:

        ;; MoveArgument(v9, SP+2) // We'd like this to be on the slow path.
0x2f    4d8b5f0f               movq tmp,[pp+0xf]   file:///usr/local/google/home/dacoharkes/dart-sdk/sdk/tests/ffi/dylib_utils.dart
0x33    4c895c2410             movq [rsp+0x10],tmp
        ;; MoveArgument(v10, SP+1) // We'd like this to be on the slow path.
0x38    4d8b5f17               movq tmp,[pp+0x17]   dlopen
0x3c    4c895c2408             movq [rsp+0x8],tmp
        ;; MoveArgument(v11, SP+0) // We'd like this to be on the slow path.
0x41    48c7042404000000       movq [rsp],4
        ;; v12 <- MemoizableIdempotentCall:12( _ffi_resolver_function@8050071<0> v9, v10, v11) T{_Mint}
0x49    498b471f               movq rax,[pp+0x1f]
0x4d    4883f800               cmpq rax,0
0x51    0f8479000000           jz +127 // Jump to the slow path.

I'll investigate what would be required to achieve that.

Edit: We could possibly refer to the constant args in the instruction itself, rather than push them on the stack when building the IL graph. So the arguments would not be visible in IL graph, the instruction would have no inputs, and we'd the constants from the pool in machine code on the slow path to the right place. "The right place" should then correspond to what a static call expects.

@mkustermann
Copy link
Member Author

We could possibly refer to the constant args in the instruction itself, ...

Exactly. As long as the cacheable mechanism doesn't support dynamic args but only constants, that's the obvious way to do it.

copybara-service bot pushed a commit that referenced this issue Oct 27, 2023
Adds a `CachableIdempotentCallInstr` that can be invoked via
`@pragma('vm:cachable-idempotent')` if the call-sites is force
optimized.

The object pool is not visited by the scavenger. So, we store the
results as unboxed integers. Consequently, only dart functions that
return integers can be cached.

Cachable idempotent calls should never be inlined. After the first
call the function not be called again.

The call itself is on a slow path to avoid register spilling on the
fast path.

TEST=vm/cc/IRTest_CachableIdempotentCall
TEST=runtime/tests/vm/dart/cachable_idempotent_test.dart

Bug: #51618
Change-Id: I612e896f27add76f57796c060157e14cc687a0fd
Cq-Include-Trybots: luci.dart.try:vm-aot-android-release-arm64c-try,vm-aot-android-release-arm_x64-try,vm-aot-asan-linux-release-x64-try,vm-aot-linux-debug-simarm_x64-try,vm-aot-linux-debug-simriscv64-try,vm-aot-mac-release-arm64-try,vm-aot-mac-release-x64-try,vm-aot-msan-linux-release-x64-try,vm-aot-obfuscate-linux-release-x64-try,vm-aot-tsan-linux-release-x64-try,vm-aot-ubsan-linux-release-x64-try,vm-aot-win-debug-arm64-try,vm-aot-win-debug-x64c-try,vm-aot-win-release-x64-try,vm-appjit-linux-debug-x64-try,vm-asan-linux-release-x64-try,vm-checked-mac-release-arm64-try,vm-eager-optimization-linux-release-ia32-try,vm-eager-optimization-linux-release-x64-try,vm-kernel-linux-debug-x64-try,vm-kernel-precomp-linux-release-x64-try,vm-linux-debug-ia32-try,vm-linux-debug-simriscv64-try,vm-linux-debug-x64-try,vm-mac-debug-arm64-try,vm-mac-debug-x64-try,vm-msan-linux-release-x64-try,vm-reload-linux-debug-x64-try,vm-reload-rollback-linux-debug-x64-try,vm-tsan-linux-release-x64-try,vm-ubsan-linux-release-x64-try,vm-win-debug-arm64-try,vm-win-debug-x64-try,vm-win-debug-x64c-try,vm-win-release-ia32-try
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/301601
Reviewed-by: Ryan Macnak <[email protected]>
Reviewed-by: Martin Kustermann <[email protected]>
copybara-service bot pushed a commit that referenced this issue Nov 2, 2023
This CL removes static fields for storing the `@Native`'s function
addresses. Instead, the function addresses are stored in the object
pool for all archs except for ia32. ia32 has no AOT and no AppJit
snapshots, so the addresses are directly embedded in the code.

This CL removes the closure wrapping for `@Native`s. Instead of
`pointer.asFunctionInternal()()` where `asFunction` returns a closure
which contains the trampoline, the function is compiled to a body
which contains the trampoline `Native()`. This is possible for
`@Native`s because the dylib and symbol names are known statically.

Doing the compilation in kernel_to_il instead of a CFE transform
enables supporting static linking later. (The alternative would have
been to transform in the cfe to a `@pragma('vm:cachable-idempotent')`
instead of constructing the IL in kernel_to_il.

To enable running resolution in ia32 in kernel_to_il.cc, the
resolution function has been made available via
`runtime/lib/ffi_dynamic_library.h`.

Because the new calls are simply static calls, the TFA can figure
out const arguments flowing to these calls. This leads to constant
locations in the parameters to FfiCalls. So, this CL also introduces
logic to move constants into `NativeLocation`s.

TEST=runtime/vm/compiler/backend/il_test.cc
TEST=tests/ffi/function_*_native_(leaf_)test.dart
TEST=pkg/vm/testcases/transformations/ffi/ffinative_compound_return.dart

Closes: #47625
Closes: #51618
Change-Id: Ic5d017005dedcedea40c455c4d24dbe774f91603
CoreLibraryReviewExempt: Internal FFI implementation changes
Cq-Include-Trybots: luci.dart.try:vm-aot-android-release-arm64c-try,vm-aot-android-release-arm_x64-try,vm-aot-linux-debug-x64-try,vm-aot-linux-debug-x64c-try,vm-aot-mac-release-arm64-try,vm-aot-mac-release-x64-try,vm-aot-obfuscate-linux-release-x64-try,vm-aot-win-debug-arm64-try,vm-aot-win-debug-x64c-try,vm-aot-win-release-x64-try,vm-appjit-linux-debug-x64-try,vm-asan-linux-release-x64-try,vm-checked-mac-release-arm64-try,vm-eager-optimization-linux-release-ia32-try,vm-eager-optimization-linux-release-x64-try,vm-ffi-android-debug-arm-try,vm-ffi-android-debug-arm64c-try,vm-ffi-qemu-linux-release-arm-try,vm-ffi-qemu-linux-release-riscv64-try,vm-fuchsia-release-x64-try,vm-kernel-linux-debug-x64-try,vm-kernel-precomp-linux-release-x64-try,vm-linux-debug-ia32-try,vm-linux-debug-x64-try,vm-linux-debug-x64c-try,vm-mac-debug-arm64-try,vm-mac-debug-x64-try,vm-msan-linux-release-x64-try,vm-reload-linux-debug-x64-try,vm-reload-rollback-linux-debug-x64-try,vm-ubsan-linux-release-x64-try,vm-win-debug-arm64-try,vm-win-debug-x64-try,vm-win-debug-x64c-try,vm-win-release-ia32-try
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/284300
Commit-Queue: Daco Harkes <[email protected]>
Reviewed-by: Alexander Markov <[email protected]>
Reviewed-by: Martin Kustermann <[email protected]>
copybara-service bot pushed a commit that referenced this issue Nov 3, 2023
This reverts commit e16bb21.

Reason for revert: Indication this caused engine test
failures of the kind:
```
�[0;32m[ RUN      ] �[mEmbedderA11yTest.A11yTreeIsConsistentUsingV1Callbacks
../../flutter/shell/platform/embedder/tests/embedder_a11y_unittests.cc:639: Failure
Expected equality of these values:
  std::strncmp(kTooltip, node->tooltip, sizeof(kTooltip) - 1)
    Which is: 116
  0
```

Original change's description:
> [vm/ffi] Optimize `@Native` calls
>
> This CL removes static fields for storing the `@Native`'s function
> addresses. Instead, the function addresses are stored in the object
> pool for all archs except for ia32. ia32 has no AOT and no AppJit
> snapshots, so the addresses are directly embedded in the code.
>
> This CL removes the closure wrapping for `@Native`s. Instead of
> `pointer.asFunctionInternal()()` where `asFunction` returns a closure
> which contains the trampoline, the function is compiled to a body
> which contains the trampoline `Native()`. This is possible for
> `@Native`s because the dylib and symbol names are known statically.
>
> Doing the compilation in kernel_to_il instead of a CFE transform
> enables supporting static linking later. (The alternative would have
> been to transform in the cfe to a `@pragma('vm:cachable-idempotent')`
> instead of constructing the IL in kernel_to_il.
>
> To enable running resolution in ia32 in kernel_to_il.cc, the
> resolution function has been made available via
> `runtime/lib/ffi_dynamic_library.h`.
>
> Because the new calls are simply static calls, the TFA can figure
> out const arguments flowing to these calls. This leads to constant
> locations in the parameters to FfiCalls. So, this CL also introduces
> logic to move constants into `NativeLocation`s.
>
> TEST=runtime/vm/compiler/backend/il_test.cc
> TEST=tests/ffi/function_*_native_(leaf_)test.dart
> TEST=pkg/vm/testcases/transformations/ffi/ffinative_compound_return.dart
>
> Closes: #47625
> Closes: #51618
> Change-Id: Ic5d017005dedcedea40c455c4d24dbe774f91603
> CoreLibraryReviewExempt: Internal FFI implementation changes
> Cq-Include-Trybots: luci.dart.try:vm-aot-android-release-arm64c-try,vm-aot-android-release-arm_x64-try,vm-aot-linux-debug-x64-try,vm-aot-linux-debug-x64c-try,vm-aot-mac-release-arm64-try,vm-aot-mac-release-x64-try,vm-aot-obfuscate-linux-release-x64-try,vm-aot-win-debug-arm64-try,vm-aot-win-debug-x64c-try,vm-aot-win-release-x64-try,vm-appjit-linux-debug-x64-try,vm-asan-linux-release-x64-try,vm-checked-mac-release-arm64-try,vm-eager-optimization-linux-release-ia32-try,vm-eager-optimization-linux-release-x64-try,vm-ffi-android-debug-arm-try,vm-ffi-android-debug-arm64c-try,vm-ffi-qemu-linux-release-arm-try,vm-ffi-qemu-linux-release-riscv64-try,vm-fuchsia-release-x64-try,vm-kernel-linux-debug-x64-try,vm-kernel-precomp-linux-release-x64-try,vm-linux-debug-ia32-try,vm-linux-debug-x64-try,vm-linux-debug-x64c-try,vm-mac-debug-arm64-try,vm-mac-debug-x64-try,vm-msan-linux-release-x64-try,vm-reload-linux-debug-x64-try,vm-reload-rollback-linux-debug-x64-try,vm-ubsan-linux-release-x64-try,vm-win-debug-arm64-try,vm-win-debug-x64-try,vm-win-debug-x64c-try,vm-win-release-ia32-try
> Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/284300
> Commit-Queue: Daco Harkes <[email protected]>
> Reviewed-by: Alexander Markov <[email protected]>
> Reviewed-by: Martin Kustermann <[email protected]>

Change-Id: Icc87a6ca33bffecabb15c6b168a06ccc38c2fe5b
Cq-Include-Trybots: luci.dart.try:vm-aot-android-release-arm64c-try,vm-aot-android-release-arm_x64-try,vm-aot-linux-debug-x64-try,vm-aot-linux-debug-x64c-try,vm-aot-mac-release-arm64-try,vm-aot-mac-release-x64-try,vm-aot-obfuscate-linux-release-x64-try,vm-aot-win-debug-arm64-try,vm-aot-win-debug-x64c-try,vm-aot-win-release-x64-try,vm-appjit-linux-debug-x64-try,vm-asan-linux-release-x64-try,vm-checked-mac-release-arm64-try,vm-eager-optimization-linux-release-ia32-try,vm-eager-optimization-linux-release-x64-try,vm-ffi-android-debug-arm-try,vm-ffi-android-debug-arm64c-try,vm-ffi-qemu-linux-release-arm-try,vm-ffi-qemu-linux-release-riscv64-try,vm-fuchsia-release-x64-try,vm-kernel-linux-debug-x64-try,vm-kernel-precomp-linux-release-x64-try,vm-linux-debug-ia32-try,vm-linux-debug-x64-try,vm-linux-debug-x64c-try,vm-mac-debug-arm64-try,vm-mac-debug-x64-try,vm-msan-linux-release-x64-try,vm-reload-linux-debug-x64-try,vm-reload-rollback-linux-debug-x64-try,vm-ubsan-linux-release-x64-try,vm-win-debug-arm64-try,vm-win-debug-x64-try,vm-win-debug-x64c-try,vm-win-release-ia32-try
No-Presubmit: true
No-Tree-Checks: true
No-Try: true
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/333840
Bot-Commit: Rubber Stamper <[email protected]>
Reviewed-by: Daco Harkes <[email protected]>
Commit-Queue: Martin Kustermann <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-vm Use area-vm for VM related issues, including code coverage, and the AOT and JIT backends. library-ffi
Projects
None yet
Development

No branches or pull requests

2 participants