Skip to content

Python: Basic implementation of variable capture #14944

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 61 commits into from
Dec 20, 2023

Conversation

yoff
Copy link
Contributor

@yoff yoff commented Nov 28, 2023

Instantiates the shared library to handle variable capture in basic cases.

The last commit is superseded by #15080. Ideally that should be merged first, and this should be rebased. But I think that this can be meaningfully reviewed already now.

@yoff yoff force-pushed the python/captured-variables-basic branch 2 times, most recently from c043f0c to a6e3746 Compare December 5, 2023 19:03
yoff added 13 commits December 14, 2023 10:20
This provides variable capture in standard situations:
- nested functions
- lambdas
There are some deficiencies:
- we do not yet handle objects capturing variables.
- we do not handle variables captured via the `nonlocal` keyword.
  This should be solved at the AST level, though, and then it
  should "just work".

There are still inconsistencies in the case where
a `SynthesizedCaptureNode` has a comprehensions
as its enclosing callable. In this case,
`TFunction(cn.getEnclosingCallable())` is not
defined and so getEnclosingCallable does not exist
for the `CaptureNode`.
TODO:
The member predicate `LibraryLambdaMethod::getACall` is
currently too permissive.
Ideally, we would have `libraryCallHasLambdaArg`
as in Ruby. But even a more precise
`libraryCall` predicate might be fine.
On the small test project, this reduces the number
of instances from 285 to 22.
These are the labmda self references. This is similar to
how `BlockParameterArgumentNode` is excluded for Ruby.

It is important that we restrict `call` in this logic.
Otherwise, we get a cartesian product and the consistency
check runs for a very long time...
This depends on the extractor fix
@yoff yoff force-pushed the python/captured-variables-basic branch from 7f30996 to 479d81f Compare December 14, 2023 09:37
this must have been lost in my
clean-up rebase.
Copy link
Member

@RasmusWL RasmusWL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reviewed the first 3 commits so far. Since the second commit is the major part of this PR, I'm submitting my review-in-progress comments so you can start looking into them, and we can have further discussions as needed 😊

Lambda self naming

My immediate thoughts are is "lambda self" a good name for this? I know in some languages you can refer to the function itself with this/self inside a function, so it might make sense there... but since self already has established semantics in Python, I'm not convinced about it.

How about something like "capture context"? or potentially CapturedVariablesArgument + CapturedVariablesParameter ... makes it very clear what this is about. (potentially with Synth in there as suggested in inline comments, so it becomes SynthCapturedVariablesArgument + SynthCapturedVariablesParameter).

Variable capture code

The meaty bits has been added to DataFlowPrivate.qll. How about we move it into it's own file instead? (for a bit more modularity 😊)

override Node getPreUpdateNode() { result = pre }
}

class CaptureArgumentNode extends CfgNode, ArgumentNode {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised to see there isn't a TSynthCaptureArgumentNode, like we have for TSynthDictSplatArgumentNode -- is there some underlying reason for this I'm not aware of?

I'm raising this since I think adding synthetic nodes is a safer approach. If all such synthetic nodes extended CfgNode, we would get cross-talk due to the shared super-class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I will do that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, that did not work out. I ran into problems with non-monotonic recursion; both defining the node in a restricted way (say not a new node for every CallNode) and also adding flow into it. I will note that other languages do not have a synthetic node here, so we can probably get away with it too for now.

Comment on lines 396 to 397
// simpleAstFlowStep(e1, e2)
// or
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about this commented out code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turned into a proper comment.

Comment on lines 144 to 145
arg = call.getArgument(_) and
arg instanceof CaptureArgumentNode
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a comment as to why this is needed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Member

@RasmusWL RasmusWL Dec 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I don't quite understand why it isn't covered by

    exists(DataFlowCall other, CallNode cfgCall | other != call |
      call.getNode() = cfgCall and
      other.getNode() = cfgCall and
      isArgumentNode(arg, call, _) and
      isArgumentNode(arg, other, _)
    )

I've tried to come up with an example where the same ("lambda self") argument would be used for multiple calls:

def outer():
    var = 42
    def foo():
        print("foo", var)
    def bar():
        print("bar", var)
    var = 43
    f = foo if <cond> else bar
    f()

In the f() call we would either call foo or bar, who would have captured var, and with the changes in this PR, we should be able to correctly identify that the value printed is 43.

There will be TWO DataFlowCall for f(), one that targets foo and one that targets bar.

There will be ONE CaptureArgumentNode which is based on the cfg-node for .getFunction, meaning f in f().

From the way I see it, that should 100% be covered by the exception above.

Interplay with self argument

In the char-pred of ExtractedArgumentNode we have:

// and self arguments
this.asCfgNode() = any(CallNode c).getFunction().(AttrNode).getObject()

So RIGHT NOW it's the case that forall(CaptureArgumentNode arg | | arg instanceof ExtractedArgumentNode) holds.

☝️ certainly wasn't true, had overlooked the AttrNode part 🙈

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exclusion case you reference require the two calls to be the same CallNode, I think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You were right, it does not seem to be needed. It is removed in d6544cc without any consistency failures appearing..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! 👍

yoff added 2 commits December 18, 2023 23:42
I chose `category: majorAnalysis`, the description is
"An API has changed in a way that may affect the results produced
by a query that consumes the API."
The API in question here is `flowPath` which is used by all our
data flow queries.
Copy link
Member

@RasmusWL RasmusWL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I think this looks ok to merge now, given that the performance in the new experiments looks OK.

I just hope we will not need more synthetic parameters related to captured variables in the future 😁

Well if we do, you know who to point your finger at 😂

Anyway, thanks for enduring me in the naming thing 🙏 (and all the other fixes)

Comment on lines 144 to 145
arg = call.getArgument(_) and
arg instanceof CaptureArgumentNode
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! 👍

@yoff yoff requested a review from RasmusWL December 19, 2023 09:09
@RasmusWL
Copy link
Member

New call graph edges looks good to me (although I've not 100% understood why we gain them).

The new py/meta/summarized-callable-call-sites results looks suspicious to me, so looking forward to a breakdown of those results from you 😊

@yoff
Copy link
Contributor Author

yoff commented Dec 19, 2023

The new py/meta/summarized-callable-call-sites results looks suspicious to me, so looking forward to a breakdown of those results from you 😊

Indeed, those look wrong. I believe they are surfaced now because of the variable capture parameter, but ultimately caused by this diff from #14777. We now confuse all entry definitions into one entry node. I have pushed a fix which is to jump past the entry node and directly to all uses. I have started a DCA run to check that this does not kill performance.

@yoff
Copy link
Contributor Author

yoff commented Dec 19, 2023

I have pushed a fix which is to jump past the entry node and directly to all uses.

That fix caused test failures. It also may not be right for type tracking, where we would ideally just want a single node for each captured variable. I think the right fix is to reintroduce ScopeEntryDefinitions as data flow nodes for this purpose.

@yoff yoff force-pushed the python/captured-variables-basic branch from 69cecdd to 1417c2c Compare December 19, 2023 20:36
@yoff
Copy link
Contributor Author

yoff commented Dec 19, 2023

It also may not be right for type tracking, where we would ideally just want a single node for each captured variable

Performance evaluations confirmed this. We saw an explosion for some projects (worst was 35x analysis time), so it is indeed important to have a single target. I have rebased this last commit away. A proper fix will be implemented in a separate PR (where it can be tested on its own). Then this PR can merge from main and everything should look reasonable again.

Copy link
Contributor

@hvitved hvitved left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM. I have a few questions.

@@ -83,7 +83,7 @@ def m():
sinkO2 = tainted
m()
captureOut2()
SINK(sinkO2) #$ MISSING:captured
SINK(sinkO2) #$ captured
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless you already have tests for this, I suggest adding tests that mix variable capture flow and field flow, see e.g. https://github.com/github/codeql/pull/11725/files#diff-a1db6555e7c41714eebba0d4e0f95f8cfefbdebcd13edd6135f070b08e53e1a6R56-R81

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have some tests mixing capture flow and content flow (dict and list but not fields).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, out of interest, where are they?

@@ -75,6 +75,7 @@ def check_tests_valid_after_version(testFile, version):
check_tests_valid("variable-capture.test_collections")
check_tests_valid("variable-capture.by_value")
check_tests_valid("variable-capture.test_library_calls")
check_tests_valid("variable-capture.test_fields")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does Python have its own test framework? I was expecting a normal QL (inline expectation) test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. This is an extra layer that ensures that our test expectations agree with the Python interpreter :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(The variable capture test is also slightly custom inline dataflow test, only annotating with "captured", but that is for historical reasons. We should probably convert it to a standard inline dataflow test at some point.)

@yoff yoff requested a review from hvitved December 20, 2023 14:47
@yoff
Copy link
Contributor Author

yoff commented Dec 20, 2023

Evaluation looks as expected. New results now look reasonable.

class VariableWrite extends ControlFlowNode {
CapturedVariable v;

VariableWrite() { this = v.getAStore().getAFlowNode().(DefinitionNode).getValue() }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should really be the assignment node, as that is where the assignment takes place, and not the RHS of the assignment. Note that you will then have to change

result.(Flow::VariableWriteSourceNode).getVariableWrite() = n.(CfgNode).getNode()

in asClosureNode as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed offline, we will change this follow-up, once Python has gained cfg nodes for assignments.

@yoff yoff dismissed RasmusWL’s stale review December 20, 2023 15:53

Changes have been addressed.

@yoff yoff merged commit b83c743 into github:main Dec 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants