Clarify ccall evaluation semantics #57931
Comments
Is option 5 effectively the same as 3 except with caching as an optimization? Or is there some scenario where
value = if lib === old_lib && bar === old_bar
    cache_value
else
    cache_value = dlsym(lib, bar)
end
The tricky part that makes it not implementable in Julia is that you want
Yes, under suitable assumptions on the behavior of
Correct. The objection is not to implementation difficulty, but to IR size or the explicit extra calls.
Yes
Would it be worthwhile to have a language feature for creating and accessing static per-call site boxes? Seems like a generally useful feature. Given that feature, ccall lowering could be fairly straightforward and unmagical.
We have one by accident, haha:
#58279 removes the global handling from
Yeah you can hack it with generated functions too, but I mean something less awful and official.
This aims to summarize a design discussion between @vtjnash, @JeffBezanson, @StefanKarpinski, @gbaraldi, @topolarity, @xal-0, @mlechu, @oscardssmith and myself around ccall. Except where otherwise annotated, I've tried to capture my understanding of the consensus view, but it is of course possible that I have misunderstood or failed to remember an objection. In such cases, the error is mine.
Discussion of the problems and current design
The current evaluation semantics of ccall are quite old and predate us having a particularly good understanding of what the evaluation semantics of the language should be (and in particular, predate any notion of world ages, partitioned bindings, effects, etc.). For this reason, it is somewhat hard to give a coherent description of the current design. However, I will try my best.
Syntax based disambiguation of (non-)library case
There are two basic cases for ccall. The first is a call with a plain symbol or pointer:
The second is a call with a (symbol, library) tuple:
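As an illustrative sketch only (these are not the original snippets), the two forms might look as follows. It assumes that `strlen` is resolvable by plain symbol in the running process and that `libjulia-internal` is loaded and exports `jl_getpid`; the names `add_one`, `call_pointer` and `LIBJL_PATH` are hypothetical helpers introduced for the example.

```julia
using Libdl

# Form 1a: a plain symbol, resolved among the libraries already loaded into the process.
n = ccall(:strlen, Csize_t, (Cstring,), "hello")

# Form 1b: a raw function pointer. @cfunction is used here only to obtain a portable pointer.
add_one(x::Cint) = x + Cint(1)
call_pointer(p::Ptr{Cvoid}) = ccall(p, Cint, (Cint,), 41)
m = call_pointer(@cfunction(add_one, Cint, (Cint,)))

# Form 2: a (symbol, library) tuple. The library is given as a constant path
# (assumption: libjulia-internal appears in the list of loaded libraries).
const LIBJL_PATH = first(filter(p -> occursin("julia-internal", p), Libdl.dllist()))
pid = ccall((:jl_getpid, LIBJL_PATH), Cint, ())
```

Note that the library in the tuple form is a `const` here, which is the pattern that works regardless of how the constantness questions discussed in this issue are resolved.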
These cases are distinguished both syntactically and semantically. In particular,
is always the same as ccall(sym), but the same is not true for x = (sym, lib).
Evaluation semantics of the first argument
We first consider the basic case without libraries. Here we are familiar
with the usual ccall syntax:
What happens if we use an expression instead?
So far so normal. However, this is where it starts to get weird. I think the best way I can describe it is that ccall tries to determine the symbol by, in this order:
1. Statically looking at the expression that is syntactically inside ccall and determining whether or not it can figure out the value of the first argument.
2. Using inference's constant propagation.
3. Using codegen's constant propagation (but not LLVM's).
However, for the non-lib case, the first of these is generally a no-op.
This leads to the following observed behavior:
To me, this is entirely unintuitive and I had to actually go read the source to figure out which cases would work and which wouldn't.
Additional complications from lib lowering
The situation becomes more complicated when using the lib syntax:
Additional complications from bindings partition
Post-bindings partition, there is an additional complication that both cases 1 and 3 depend on inferred world age bounds of the rest of the function, which can lead to completely non-intuitive behavior. This is #57749, although I will mention it here also:
This is arguably a separate inference bug that should be addressed by properly modeling this in inference, but stems from the same underlying confusion around the evaluation semantics of the first argument of ccall.
Hidden generic function call inside :foreigncall
This one might be a bit academic, but as of #50074, there is a hidden call to Libdl.dlopen inside :foreigncall. This dynamic call edge is not modeled and thus susceptible to #265-like issues, invisible to trimming, etc. Now, there is already a non-standard caching here, but I wanted to list it for completeness.
Implicit caching of dlsym
This one isn't so much a problem as it is an aspect of the current design that needs to be preserved for performance. In particular, the codegen for :foreigncall currently looks something like (in pseudo C syntax)
Plus some optimizations to fold away the lookup if it can be resolved statically by the JIT, or to turn the lookup into a PLT-like structure.
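The caching behavior described here can be modeled in Julia roughly as follows. This is a hedged sketch, not the actual codegen output: `SITE_SLOT`, `SLOW_PATH_HITS` and `getpid_via_cached_lookup` are hypothetical names, and the example assumes that `libjulia-internal` is loaded and exports `jl_getpid`.

```julia
using Libdl

# Assumption: libjulia-internal is among the loaded libraries and exports jl_getpid.
const LIBJL_PATH = first(filter(p -> occursin("julia-internal", p), Libdl.dllist()))

# Hypothetical model of one call site's cache slot.
# In the real design this slot belongs to the generated native code, not to a Julia global.
const SITE_SLOT = Ref{Ptr{Cvoid}}(C_NULL)
const SLOW_PATH_HITS = Ref(0)  # counts how often the lookup actually runs

function getpid_via_cached_lookup()
    fptr = SITE_SLOT[]
    if fptr == C_NULL
        # Slow path: open the library, resolve the symbol, remember the address.
        SLOW_PATH_HITS[] += 1
        fptr = Libdl.dlsym(Libdl.dlopen(LIBJL_PATH), :jl_getpid)
        SITE_SLOT[] = fptr
    end
    # Fast path: call through the cached pointer.
    return ccall(fptr, Cint, ())
end

first_pid = getpid_via_cached_lookup()
second_pid = getpid_via_cached_lookup()
```

After the two calls, the slow path has run only once; every later call pays just for a load and a null check before the indirect call.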
Additional desirable features
An additional desirable feature that was discussed was that in the context of --trim, we would like to have the ability to statically link executables without assuming the presence of a dynamic linker at runtime, ideally while preserving the namespace scoping behavior of the current design. This is a little tricky, because the underlying system linker generally does not have namespacing. How exactly to do this is outside the scope of this document, but the key consideration is that this constrains us to a solution in which juliac is able to turn the dynamic references into static ones.
Proposed solutions
The first and most immediate question to answer is what the evaluation scope of the first argument of ccall is. I think there are roughly three reasonable answers:
3. The expression and the dlsym lookup get evaluated every time.
4. The expression gets evaluated every time, but the dlsym lookup is cached the first time it gets evaluated (for a particular native code instance).
5. The expression gets evaluated every time, but the dlsym lookup is cached the first time it gets evaluated (for a particular native code instance) and gets recached when the value of the expression changes.
Discussion
Option 3 is the most straightforward behavior in that it is completely dynamic and does not rely on any compiler information. Because of the lack of caching, though, it is also prohibitively expensive. Still, I think everyone agreed that if it were fast, it would be a good semantic, which is a useful guiding principle for selecting which option to use.
Option 4 is somewhat reminiscent of what we have right now, except that we have extra requirements (i.e. the symbol name needs to be a constant expression). These requirements make the behavior of ccall very confusing, though to be fair, they also by default prohibit some problematic cases.
In particular, there's a question about non-constant cases like
What should this do? Currently this case is disallowed due to the (semantically strange) constantness detection on the first argument. We do not have constantness detection anywhere else in the language, and in general our semantics are entirely value and type based, so we do need to decide some behavior for this case.
The tradeoff here is essentially one of performance vs surprise. Option 4 has decent performance, but a high surprise level. The answer would change whenever f gets re-codegen'ed, which is not something that users are traditionally expected to have a mental model of. Option 2 has the same problem (but differs for ccall((println("Hello"); globalsin))). Option 1 goes in the direction of retaining the performance while solving the surprise problem by never re-evaluating (although Revise might do so explicitly). Option 5 goes in the opposite direction of sacrificing some performance (in the rare, unlikely, fully dynamic case), but reducing surprise by being closer to the naive Option 3.
As a general design principle in Julia, we do tend to choose the most dynamic behavior (as long as it is still possible to optimize the common case), which would weigh in favor of option 5.
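To make the difference between options 3, 4 and 5 concrete, the following sketch models each caching policy as an explicit resolver and counts how many symbol lookups it performs. It is a simplified model under stated assumptions, not compiler behavior: the library is fixed, so the option 5 cache is keyed on the symbol alone (the real scheme would also compare the library), all type and function names are hypothetical, and `libjulia-internal` is assumed to be loaded and to export `jl_getpid` and `jl_errno`.

```julia
using Libdl

# Assumption: libjulia-internal is loaded and exports jl_getpid and jl_errno.
const LIBJL = Libdl.dlopen(first(filter(p -> occursin("julia-internal", p), Libdl.dllist())))
const LOOKUPS = Ref(0)
counted_dlsym(sym::Symbol) = (LOOKUPS[] += 1; Libdl.dlsym(LIBJL, sym))

# Option 3: look the symbol up on every evaluation.
resolve_option3(sym::Symbol) = counted_dlsym(sym)

# Option 4: cache the first lookup and never look again.
mutable struct OnceSlot
    ptr::Ptr{Cvoid}
end
function resolve_option4!(slot::OnceSlot, sym::Symbol)
    if slot.ptr == C_NULL
        slot.ptr = counted_dlsym(sym)
    end
    return slot.ptr
end

# Option 5: cache the lookup, but redo it when the evaluated value changes.
mutable struct KeyedSlot
    key::Union{Nothing,Symbol}
    ptr::Ptr{Cvoid}
end
function resolve_option5!(slot::KeyedSlot, sym::Symbol)
    if slot.key !== sym
        slot.ptr = counted_dlsym(sym)
        slot.key = sym
    end
    return slot.ptr
end

# Run a resolver over a sequence of symbols; return (lookup count, last pointer).
function count_lookups(resolve, syms)
    LOOKUPS[] = 0
    ptr = C_NULL
    for s in syms
        ptr = resolve(s)
    end
    return (LOOKUPS[], ptr)
end

syms = [:jl_getpid, :jl_getpid, :jl_errno]
slot4 = OnceSlot(C_NULL)
slot5 = KeyedSlot(nothing, C_NULL)
n3, p3 = count_lookups(resolve_option3, syms)
n4, p4 = count_lookups(s -> resolve_option4!(slot4, s), syms)
n5, p5 = count_lookups(s -> resolve_option5!(slot5, s), syms)
```

In this model, option 4 performs the fewest lookups but keeps returning the pointer for the first symbol after the value changes, which is the surprise discussed above. Option 5 pays one extra lookup per change and otherwise agrees with option 3.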
Implementation Considerations
Given the above considerations, I think the general consensus was that option 5 is preferred. It is close enough to the current semantics (and identical in the cases that people actually use) that we should be able to do it without a deprecation cycle.
To implement this, I had suggested the following lowering:
where consistent_once_per_process is a new non-IPO effect annotation that specifies that the result of a call may be assumed :consistent within each process (i.e. may be cached for egal arguments), and inline_cache is a new builtin that is semantically a no-op, but signals the optimizer to perform the caching optimization (using :consistent-cy inference to determine the scope of the region to cache). @vtjnash objected to this scheme on complexity grounds, @JeffBezanson objected on IR size grounds. However, I think people generally agreed that the semantics were reasonable.
Thus, the ultimate proposal is to treat the above as the semantic representation of what :foreigncall should do, but keep both the dlsym generic call and the inline_cache inside :foreigncall where they are now. The various places that need to analyze this (inference, trimming, etc.) should then model :foreigncall to include the generic call to dlsym, including providing edges for it.
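Read purely as a semantic model, the proposal says that a library-qualified ccall behaves like a generic dlsym call followed by a call through the resulting pointer, with the caching being an optimization that does not change the result. The sketch below illustrates that reading and is not the suggested lowering itself: `inline_cache` is defined here as a plain identity function standing in for the proposed builtin, the effect annotation has no runtime counterpart in this model, where exactly the cache wraps the lookup is an assumption, and `modeled_foreigncall` is a hypothetical name. It assumes `libjulia-internal` is loaded and exports `jl_getpid`.

```julia
using Libdl

# Hypothetical stand-in for the proposed builtin: semantically a no-op.
# A real optimizer would use it as a marker for where to place the cache.
inline_cache(x) = x

# Assumption: libjulia-internal is among the loaded libraries and exports jl_getpid.
const LIBJL_PATH = first(filter(p -> occursin("julia-internal", p), Libdl.dllist()))

# Model of what a call like ccall((sym, lib), Cint, ()) would mean semantically.
function modeled_foreigncall(lib::AbstractString, sym::Symbol)
    handle = Libdl.dlopen(lib)                     # generic call, visible to analyses
    fptr = inline_cache(Libdl.dlsym(handle, sym))  # generic call; cacheable in principle
    return ccall(fptr, Cint, ())                   # the actual foreign call
end

pid = modeled_foreigncall(LIBJL_PATH, :jl_getpid)
```

Because the generic calls are written out explicitly in this model, tools such as inference or trimming could see them as ordinary call edges, which is the property the final proposal asks analyses to reproduce for :foreigncall.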