Investigate inference cache improvements for improving the performance #2912


Closed
kodonnell opened this issue May 10, 2019 · 6 comments

Labels
Duplicate 🐫 Duplicate of an already existing issue · Enhancement ✨ Improvement to a component · Maintenance Discussion or action around maintaining pylint or the dev workflow · Needs investigation 🔬 A bug or crash where it's not immediately obvious what is happening

Comments

@kodonnell

kodonnell commented May 10, 2019

Caching (lru_cache) is currently used to speed up inference. I believe there may be a few ways to improve this, and this is a general issue to rethink caching (which is currently very naive, I believe). Some ideas:

  • Persisting the cache to disk (I've had good experiences with DiskCache), the idea being that most of the code base doesn't change between runs (especially on large projects), so the inferences should be the same. (Similar to here?) For big projects (or those that use big libraries, which is very common), this could reduce run time by orders of magnitude after the first run, which would, I think, be pretty significant. I see two options for doing this:
    • We'd need some way to detect that code has changed and which nodes are affected ... which we could possibly do by just hashing the code strings or something? I don't know enough about inference to say if this is possible.
    • We don't try to be clever, and initially just allow the user to specify bits of code, such as external libraries, to cache as 'unchanging'. For example, the user could specify pandas==X.X.X and it would build the cache once for that, then assume it won't change from then on. This is probably an easy first step, still with some pretty big wins.
      • NB: one can easily imagine sharing caches, i.e. once one user has created the cache for pandas==X.X.X, there's no need for others to have to ...
  • Caching more/differently (currently it's just lru_cache(1024)). I'm not sure if there's ever been any investigation into how to optimize the caching.
  • Not duplicating the caching of astroid? (I believe astroid may also implement its own caching.) This may also be more specific to astroid than pylint.
  • As per here: "There's also some caching involved there, but it's not working right all the time." I'm not sure what this means, but @PCManticore could elaborate.
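
The first idea above (detect changed code by hashing it, and persist inference results across runs) could be sketched roughly like this. Note this is a minimal illustration, not pylint's actual API: `DiskInferenceCache` and `source_fingerprint` are hypothetical names, and a real cache would store astroid inference results rather than plain strings.

```python
import hashlib
import json
from pathlib import Path


def source_fingerprint(path: str) -> str:
    """Hash a module's source bytes; any edit changes the key."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


class DiskInferenceCache:
    """Toy persistent cache keyed on (module path, source hash)."""

    def __init__(self, cache_file: str):
        self.cache_file = Path(cache_file)
        self.data = {}
        if self.cache_file.exists():
            self.data = json.loads(self.cache_file.read_text())

    def get(self, path: str):
        entry = self.data.get(path)
        if entry and entry["hash"] == source_fingerprint(path):
            return entry["result"]  # source unchanged: reuse cached result
        return None  # missing or stale entry

    def put(self, path: str, result) -> None:
        self.data[path] = {"hash": source_fingerprint(path), "result": result}
        self.cache_file.write_text(json.dumps(self.data))
```

A second run over an unchanged file would then hit the cache immediately, which is where the "orders of magnitude" savings on large, stable dependencies would come from.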

NB - I'll try to keep this up-to-date based on any replies below (including removing any of the above if it's rubbish).

@PCManticore
Contributor

We could definitely try to see if disk caching helps, but from the looks of it, I would guess it will incur a performance penalty for offloading inferred data to disk, given that we do a lot of inference during the analysis of a single, relatively large file. We could try a daemon approach where the caching is done separately from the main analysis program, but that sounds like a significant challenge to take on.

There hasn't been any investigation into the use of lru_cache() and the proper maxsize, but it sounds like a good idea to go over that and figure out where we have cache misses.
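
Measuring this is cheap, since stdlib `lru_cache` already exposes counters via `cache_info()`. A minimal sketch of how one might instrument a cached function (`expensive_infer` here is a hypothetical stand-in, not a real pylint/astroid function):

```python
from functools import lru_cache


@lru_cache(maxsize=1024)
def expensive_infer(qname: str) -> str:
    # stand-in for an expensive inference call
    return qname.upper()


# simulate a small workload, then inspect the counters
for name in ["a", "b", "a", "a", "c"]:
    expensive_infer(name)

info = expensive_infer.cache_info()  # CacheInfo(hits, misses, maxsize, currsize)
hit_rate = info.hits / (info.hits + info.misses)
```

If `misses` stays high and `currsize` is pinned at `maxsize` on real workloads, the cache is thrashing and a larger `maxsize` (or a different keying strategy) may be warranted.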

@PCManticore PCManticore added Enhancement ✨ Improvement to a component task labels Jul 7, 2019
@PCManticore PCManticore changed the title inference cache improvements [performance] Investigate inference cache improvements for improving the performance Jul 7, 2019
@kodonnell
Author

Re disk penalties - I guess where I'm coming from is that if it takes 30s to infer e.g. pandas, any disk-caching overhead will be negligible. In cases where the user wouldn't want to cache (e.g. they start in a fresh docker container for a new CI/CD run and don't want to persist the cache, for simplicity), we could easily disable it - though, again, I suspect it won't have much impact. Especially if you share the caches (as mentioned above): if it takes 3s to download and parse a shared cache, you're immediately 27s faster, so it actually makes sense for fast builds (with internet connections).

@0xADD1E

0xADD1E commented Apr 18, 2020

For simple repeated runs, would it be enough to use a hash of the file contents as a hash key, and cache the results on that?

I could see that being problematic if a dependency were to break something for a file that hasn't changed, but for something that brings linting from 15 seconds down to 1.5 it seems awfully tempting.

Not being incredibly familiar with the code yet, would there be any fairly cheap way to construct a module dependency tree and invalidate everything that depends on changed files? That would seem to be the safest way of getting that kind of speed improvement.
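
The transitive invalidation described above is a straightforward graph traversal once an import map exists. A sketch under the assumption that a `{module: set_of_imported_modules}` map is available (building that map from real code is the hard part, and is not shown here):

```python
from collections import defaultdict


def invalidate(changed: set, imports: dict) -> set:
    """Given a {module: imported_modules} map, return every module
    that transitively depends on any changed module."""
    # Invert the map: for each module, who imports it?
    dependents = defaultdict(set)
    for mod, deps in imports.items():
        for dep in deps:
            dependents[dep].add(mod)
    # Walk upward from the changed modules, marking everything stale.
    stale = set(changed)
    stack = list(changed)
    while stack:
        mod = stack.pop()
        for parent in dependents[mod]:
            if parent not in stale:
                stale.add(parent)
                stack.append(parent)
    return stale
```

Everything outside the returned set could safely keep its cached results, which is exactly the "invalidate only what depends on changed files" behaviour suggested above.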

If anyone wants to see/mess with this approach, it's at https://github.com/0xADD1E/pylint/tree/caching-tests

@CarliJoy
Contributor

CarliJoy commented Nov 29, 2021

> If anyone wants to see/mess with this approach, it's at https://github.com/0xADD1E/pylint/tree/caching-tests

@0xADD1E I can't access your code.

@0xADD1E

0xADD1E commented Nov 29, 2021

@CarliJoy It looks like I may have gotten rid of it a while ago due to inactivity here. I'll see if I can dig up an old copy from backups later today. If memory serves, it was just using stdlib's shelve module and a sha256 key around https://github.com/PyCQA/pylint/blob/8a17bb557a23708ebbe64174c564f9c5741fb5dd/pylint/lint/pylinter.py#L1059 or so...
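
From that description, the deleted branch likely looked something like the following. This is a reconstruction from memory of the comment above, not the actual code: `cached_check` and `run_checks` are hypothetical names, and the real hook point was inside pylint's per-file check loop.

```python
import hashlib
import shelve


def cached_check(path: str, run_checks, db_path: str):
    """Key a file's lint results on a sha256 of its bytes,
    stored with stdlib shelve, as described in the comment above."""
    with open(path, "rb") as f:
        key = hashlib.sha256(f.read()).hexdigest()
    with shelve.open(db_path) as db:
        if key in db:
            return db[key]  # unchanged file: reuse stored results
        result = run_checks(path)  # the real per-file check would go here
        db[key] = result
        return result
```

The appeal of shelve here is that it is zero-dependency persistence; the caveat, noted earlier in the thread, is that a pure content hash cannot detect breakage introduced by a changed dependency.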

@Pierre-Sassoulas Pierre-Sassoulas added the Maintenance Discussion or action around maintaining pylint or the dev workflow and Needs investigation 🔬 A bug or crash where it's not immediately obvious what is happening labels, and removed the task label Jun 23, 2022
@jacobtylerwalls
Member

Disk caching is worth investigating. Will close as a duplicate of #1416, but will call out the promising experiment described at #2912 (comment).

@jacobtylerwalls jacobtylerwalls closed this as not planned (duplicate) Jan 24, 2024
@jacobtylerwalls jacobtylerwalls added the Duplicate 🐫 Duplicate of an already existing issue label Jan 24, 2024