-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Python: Basic implementation of variable capture #14944
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
96ac77b
to
76bb328
Compare
c043f0c
to
a6e3746
Compare
This provides variable capture in standard situations: - nested functions - lambdas There are some deficiencies: - we do not yet handle objects capturing variables. - we do not handle variables captured via the `nonlocal` keyword. This should be solved at the AST level, though, and then it should "just work". There are still inconsistencies in the case where a `SynthesizedCaptureNode` has a comprehensions as its enclosing callable. In this case, `TFunction(cn.getEnclosingCallable())` is not defined and so getEnclosingCallable does not exist for the `CaptureNode`.
TODO: The member predicate `LibraryLambdaMethod::getACall` is currently too permissive. Ideally, we would have `libraryCallHasLambdaArg` as in Ruby. But even a more precise `libraryCall` predicate might be fine.
On the small test project, this reduces the number of instances from 285 to 22.
as in Ruby
These are the labmda self references. This is similar to how `BlockParameterArgumentNode` is excluded for Ruby. It is important that we restrict `call` in this logic. Otherwise, we get a cartesian product and the consistency check runs for a very long time...
This depends on the extractor fix
7f30996
to
479d81f
Compare
this must have been lost in my clean-up rebase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've reviewed the first 3 commits so far. Since the second commit is the major part of this PR, I'm submitting my review-in-progress comments so you can start looking into them, and we can have further discussions as needed 😊
Lambda self naming
My immediate thoughts are is "lambda self" a good name for this? I know in some languages you can refer to the function itself with this
/self
inside a function, so it might make sense there... but since self
already has established semantics in Python, I'm not convinced about it.
How about something like "capture context"? or potentially CapturedVariablesArgument
+ CapturedVariablesParameter
... makes it very clear what this is about. (potentially with Synth
in there as suggested in inline comments, so it becomes SynthCapturedVariablesArgument
+ SynthCapturedVariablesParameter
).
Variable capture code
The meaty bits has been added to DataFlowPrivate.qll. How about we move it into it's own file instead? (for a bit more modularity 😊)
python/ql/lib/semmle/python/dataflow/new/internal/DataFlowDispatch.qll
Outdated
Show resolved
Hide resolved
python/ql/lib/semmle/python/dataflow/new/internal/DataFlowDispatch.qll
Outdated
Show resolved
Hide resolved
python/ql/lib/semmle/python/dataflow/new/internal/DataFlowDispatch.qll
Outdated
Show resolved
Hide resolved
override Node getPreUpdateNode() { result = pre } | ||
} | ||
|
||
class CaptureArgumentNode extends CfgNode, ArgumentNode { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm surprised to see there isn't a TSynthCaptureArgumentNode
, like we have for TSynthDictSplatArgumentNode
-- is there some underlying reason for this I'm not aware of?
I'm raising this since I think adding synthetic nodes is a safer approach. If all such synthetic nodes extended CfgNode
, we would get cross-talk due to the shared super-class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I will do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, that did not work out. I ran into problems with non-monotonic recursion; both defining the node in a restricted way (say not a new node for every CallNode
) and also adding flow into it. I will note that other languages do not have a synthetic node here, so we can probably get away with it too for now.
python/ql/lib/semmle/python/dataflow/new/internal/DataFlowPrivate.qll
Outdated
Show resolved
Hide resolved
python/ql/lib/semmle/python/dataflow/new/internal/DataFlowPrivate.qll
Outdated
Show resolved
Hide resolved
// simpleAstFlowStep(e1, e2) | ||
// or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about this commented out code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turned into a proper comment.
python/ql/lib/semmle/python/dataflow/new/internal/DataFlowPrivate.qll
Outdated
Show resolved
Hide resolved
python/ql/lib/semmle/python/dataflow/new/internal/DataFlowImplConsistency.qll
Outdated
Show resolved
Hide resolved
arg = call.getArgument(_) and | ||
arg instanceof CaptureArgumentNode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a comment as to why this is needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I don't quite understand why it isn't covered by
exists(DataFlowCall other, CallNode cfgCall | other != call |
call.getNode() = cfgCall and
other.getNode() = cfgCall and
isArgumentNode(arg, call, _) and
isArgumentNode(arg, other, _)
)
I've tried to come up with an example where the same ("lambda self") argument would be used for multiple calls:
def outer():
var = 42
def foo():
print("foo", var)
def bar():
print("bar", var)
var = 43
f = foo if <cond> else bar
f()
In the f()
call we would either call foo or bar, who would have captured var
, and with the changes in this PR, we should be able to correctly identify that the value printed is 43
.
There will be TWO DataFlowCall
for f()
, one that targets foo
and one that targets bar
.
There will be ONE CaptureArgumentNode
which is based on the cfg-node for .getFunction, meaning f
in f()
.
From the way I see it, that should 100% be covered by the exception above.
Interplay with self argument
In the char-pred of ExtractedArgumentNode
we have:
codeql/python/ql/lib/semmle/python/dataflow/new/internal/DataFlowPublic.qll
Lines 310 to 311 in f96c52e
// and self arguments | |
this.asCfgNode() = any(CallNode c).getFunction().(AttrNode).getObject() |
So RIGHT NOW it's the case that forall(CaptureArgumentNode arg | | arg instanceof ExtractedArgumentNode)
holds.
☝️ certainly wasn't true, had overlooked the AttrNode
part 🙈
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The exclusion case you reference require the two calls to be the same CallNode
, I think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You were right, it does not seem to be needed. It is removed in d6544cc without any consistency failures appearing..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! 👍
Co-authored-by: Rasmus Wriedt Larsen <[email protected]>
also improve comment
I chose `category: majorAnalysis`, the description is "An API has changed in a way that may affect the results produced by a query that consumes the API." The API in question here is `flowPath` which is used by all our data flow queries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I think this looks ok to merge now, given that the performance in the new experiments looks OK.
I just hope we will not need more synthetic parameters related to captured variables in the future 😁
Well if we do, you know who to point your finger at 😂
Anyway, thanks for enduring me in the naming thing 🙏 (and all the other fixes)
python/ql/lib/change-notes/2023-12-18-support-variable-capture.md
Outdated
Show resolved
Hide resolved
arg = call.getArgument(_) and | ||
arg instanceof CaptureArgumentNode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! 👍
python/ql/lib/semmle/python/dataflow/new/internal/VariableCapture.qll
Outdated
Show resolved
Hide resolved
Co-authored-by: Rasmus Wriedt Larsen <[email protected]>
New call graph edges looks good to me (although I've not 100% understood why we gain them). The new |
Indeed, those look wrong. I believe they are surfaced now because of the variable capture parameter, but ultimately caused by this diff from #14777. We now confuse all entry definitions into one entry node. I have pushed a fix which is to jump past the entry node and directly to all uses. I have started a DCA run to check that this does not kill performance. |
That fix caused test failures. It also may not be right for type tracking, where we would ideally just want a single node for each captured variable. I think the right fix is to reintroduce |
69cecdd
to
1417c2c
Compare
Performance evaluations confirmed this. We saw an explosion for some projects (worst was 35x analysis time), so it is indeed important to have a single target. I have rebased this last commit away. A proper fix will be implemented in a separate PR (where it can be tested on its own). Then this PR can merge from main and everything should look reasonable again. |
…aptured-variables-basic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM. I have a few questions.
python/ql/lib/semmle/python/dataflow/new/internal/DataFlowDispatch.qll
Outdated
Show resolved
Hide resolved
python/ql/lib/semmle/python/dataflow/new/internal/DataFlowDispatch.qll
Outdated
Show resolved
Hide resolved
python/ql/lib/semmle/python/dataflow/new/internal/DataFlowDispatch.qll
Outdated
Show resolved
Hide resolved
python/ql/lib/semmle/python/dataflow/new/internal/DataFlowPublic.qll
Outdated
Show resolved
Hide resolved
python/ql/test/experimental/dataflow/variable-capture/dataflow-capture-consistency.ql
Show resolved
Hide resolved
@@ -83,7 +83,7 @@ def m(): | |||
sinkO2 = tainted | |||
m() | |||
captureOut2() | |||
SINK(sinkO2) #$ MISSING:captured | |||
SINK(sinkO2) #$ captured |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless you already have tests for this, I suggest adding tests that mix variable capture flow and field flow, see e.g. https://github.com/github/codeql/pull/11725/files#diff-a1db6555e7c41714eebba0d4e0f95f8cfefbdebcd13edd6135f070b08e53e1a6R56-R81
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have some tests mixing capture flow and content flow (dict and list but not fields).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, out of interest, where are they?
python/ql/lib/semmle/python/dataflow/new/internal/DataFlowDispatch.qll
Outdated
Show resolved
Hide resolved
python/ql/lib/semmle/python/dataflow/new/internal/DataFlowPrivate.qll
Outdated
Show resolved
Hide resolved
@@ -75,6 +75,7 @@ def check_tests_valid_after_version(testFile, version): | |||
check_tests_valid("variable-capture.test_collections") | |||
check_tests_valid("variable-capture.by_value") | |||
check_tests_valid("variable-capture.test_library_calls") | |||
check_tests_valid("variable-capture.test_fields") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does Python have its own test framework? I was expecting a normal QL (inline expectation) test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. This is an extra layer that ensures that our test expectations agree with the Python interpreter :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(The variable capture test is also slightly custom inline dataflow test, only annotating with "captured", but that is for historical reasons. We should probably convert it to a standard inline dataflow test at some point.)
Evaluation looks as expected. New results now look reasonable. |
This reverts commit 45411f4.
class VariableWrite extends ControlFlowNode { | ||
CapturedVariable v; | ||
|
||
VariableWrite() { this = v.getAStore().getAFlowNode().(DefinitionNode).getValue() } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should really be the assignment node, as that is where the assignment takes place, and not the RHS of the assignment. Note that you will then have to change
result.(Flow::VariableWriteSourceNode).getVariableWrite() = n.(CfgNode).getNode()
in asClosureNode
as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed offline, we will change this follow-up, once Python has gained cfg nodes for assignments.
Instantiates the shared library to handle variable capture in basic cases.
The last commit is superseded by #15080. Ideally that should be merged first, and this should be rebased. But I think that this can be meaningfully reviewed already now.