Not planned
Description
While investigating #89045 @steakhal noticed that the complexity of symbols seen by the ArrayBoundV2 checker increased significantly when that checker switched to using check::PostStmt
callbacks instead of the previously used check::Location
.
[...] prior to this change, the maximum Sym->computeComplexity() within getTaintedSymbolsImpl was significantly lower [4-5] than after the change. After the change this maximal complexity was more around the threshold (35), 30-33.
That slowdown was fixed by #89606 which restored the efficiency of getTaintedSymbolsImpl
and ensured that it runs quickly even when the symbols are complex. However, it would be still good to investigate the cause of this complexity increase.
Metadata
Metadata
Assignees
Type
Projects
Milestone
Relationships
Development
No branches or pull requests
Activity
NagyDonat commentedon Apr 23, 2024
@steakhal By the way, did you happen to have
ArrayBound
(V1) enabled during the bisection?I'm asking this because one very significant effect of my commit #72107
Switch to PostStmt callbacks in ArrayBoundV2
was that it "transferred" the analysis of many out-of-bounds situations from the V1 checker toArrayBoundV2
. (When they were bothLocation
checkers, V1 activated first AFAIK because it appears earlier inCheckers.td
; after the change thePostStmt
callbacks ofArrayBoundV2
happen before theLocation
callback of V1.)If V1 was enabled in the bisection, then perhaps the observed increase in the complexity is just caused by the fact that previously the V2 checker didn't see those complex symbols (because the V1 checker found and handled them first).
(I don't think that this is the most likely explanation of the situation, but it's worth to mention and check.)
llvmbot commentedon Apr 23, 2024
@llvm/issue-subscribers-clang-static-analyzer
Author: None (NagyDonat)
> [...] prior to this change, the maximum Sym->computeComplexity() within getTaintedSymbolsImpl was significantly lower [4-5] than after the change. After the change this maximal complexity was more around the threshold (35), 30-33.
That slowdown was fixed by #89606 which restored the efficiency of
getTaintedSymbolsImpl
and ensured that it runs quickly even when the symbols are complex. However, it would be still good to investigate the cause of this complexity increase.steakhal commentedon Apr 23, 2024
I believe we only use/used OOBv2.
NagyDonat commentedon Apr 23, 2024
Thanks, that's what I guessed.
NagyDonat commentedon May 2, 2024
I spent some time investigating this issue and I would summarize my research as "I still don't know the answer, but I realized that the question is not interesting".
The increased symbol complexity appeared after my old change #72107 "[analyzer] Switch to PostStmt callbacks in ArrayBoundV2" moved the activation of ArrayBoundV2 to a different point within the process of the path-sensitive analysis -- so the natural conclusion is that those more complex symbols were always there at that different step of the process. (If we are measuring river pollution and notice that the pollution increased when we started to take samples from a different location, the natural conclusion is that the new spot is simply more polluted.)
I have several rough conjectures that could lead to this difference between the
Location
and thePostStmt
callback:array[index].field
while the previous implementation ignored this situation (because here theElementRegion
is wrapped in aFieldRegion
object)." Note that sheervideo.c from FFMPEG mixes pointer arithmetic and field access, so it's plausible that the more complex symbols were simply not seen by theLocation
callback.Location
callbacks, but appear in thePostStmt
callback.PostStmt
callback triggers directly after constructing the suitableElementRegion
as the return value of the subscript expression; while theLocation
callback triggers significantly later, when theElementRegion
is accessed for reading or writing. In the meantime the analyzer parses several other statements and could perform some simplification / cleanup step that reduces the complexity of symbols and/or eliminates the cases with the more complex symbols.I wasn't able to understand exactly what happens between the
PostStmt
and theLocation
(the code is very complex, and I didn't set up a debugger), but I strongly suspect that the more complex symbols were always there, but theLocation
callback didn't see them.As the investigation is difficult, the complex symbols don't cause any concrete problems (after #89606) and there is probably an innocent explanation, I don't think that it's worth to spend more time on this issue. @steakhal What do you think about this? Woudl you agree with closing this ticket?
steakhal commentedon May 2, 2024
Thanks for your time. I think what you say makes sense. I didn't see yet glaring drawbacks of having these "complicated symbols" other than sometimes their dump took 26 megabytes. Given that ATM I'm not focusing on this checker, I'm fine with closing this.
I'll reopen it if it turns out relevant and I have evidence backing this up.
FYI right now I'm focusing on bounding Z3 refutation times. I've seen cases when Z3 refutation took 90% or more of the total time of an analysis.
isTainted()
by skipping complicated symbols #105493[analyzer] Limit `isTainted()` by skipping complicated symbols (#105493)
[analyzer] Limit `isTainted()` by skipping complicated symbols (llvm#…
LLVM 19.1.2 bump (llvm#129)