Investigate inference cache improvements for improving the performance #2912


Closed
kodonnell opened this issue May 10, 2019 · 6 comments

Labels
Duplicate 🐫 Duplicate of an already existing issue · Enhancement ✨ Improvement to a component · Maintenance Discussion or action around maintaining pylint or the dev workflow · Needs investigation 🔬 A bug or crash where it's not immediately obvious what is happening

Comments

@kodonnell

kodonnell commented May 10, 2019

Caching (lru_cache) is currently used to speed up inference. I believe there may be a few ways to improve this, and this is a general issue to rethink caching (which is currently very naive, I believe). Some ideas:

  • Persisting the cache to disk (I've had good experiences with DiskCache), the idea being that most of the code base doesn't change between runs (especially on large projects), so the inferences should be the same. (Similar to here?) For big projects (or those that use big libraries, which is very common), this could reduce run time by orders of magnitude after the first run, which would, I think, be pretty significant. I see two options for doing this:
    • We'd need some way to detect that code has changed and which nodes are affected ... which we could possibly do by just hashing the code strings or something? I don't know enough about inference to say if this is possible.
    • We don't try to be clever, and initially just allow the user to specify bits of code, such as external libraries, to cache as 'unchanging'. For example, the user could specify pandas==X.X.X and it would build the cache once for that, then assume it won't change from then on. This is probably an easy first step, still with some pretty big wins.
      • NB: one can easily imagine sharing caches, i.e. once one user has created the cache for pandas==X.X.X, there's no need for others to have to ...
  • Caching more/differently (currently it's just lru_cache(1024)). I'm not sure if there's ever been any investigation into how to optimize the caching.
  • Not duplicating the caching of astroid? (I believe astroid may also implement its own caching.) This may also be more specific to astroid than pylint.
  • As per here: "There's also some caching involved there, but it's not working right all the time." I'm not sure what this means, but @PCManticore could elaborate.
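
The first idea above (detect changed code by hashing it, and persist inference results across runs) could be sketched roughly like this. Note this is a minimal illustration, not pylint's actual API: `DiskInferenceCache` and `source_fingerprint` are hypothetical names, and a real cache would store astroid inference results rather than plain strings.

```python
import hashlib
import json
from pathlib import Path


def source_fingerprint(path: str) -> str:
    """Hash a module's source bytes; any edit changes the key."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


class DiskInferenceCache:
    """Toy persistent cache keyed on (module path, source hash)."""

    def __init__(self, cache_file: str):
        self.cache_file = Path(cache_file)
        self.data = {}
        if self.cache_file.exists():
            self.data = json.loads(self.cache_file.read_text())

    def get(self, path: str):
        entry = self.data.get(path)
        if entry and entry["hash"] == source_fingerprint(path):
            return entry["result"]  # source unchanged: reuse cached result
        return None  # missing or stale entry

    def put(self, path: str, result) -> None:
        self.data[path] = {"hash": source_fingerprint(path), "result": result}
        self.cache_file.write_text(json.dumps(self.data))
```

A second run over an unchanged file would then hit the cache immediately, which is where the "orders of magnitude" savings on large, stable dependencies would come from.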

NB - I'll try to keep this up-to-date based on any replies below (including removing any of the above if it's rubbish).

@PCManticore
Contributor

We could definitely try to see if disk caching helps, but from the looks of it, I would guess it will incur a performance penalty for offloading inferred data to disk, given that we do a lot of inference during the analysis of a single, relatively large file. We could try a daemon approach where the caching is done separately from the main analysis program, but that sounds like a significant challenge to take on.

There hasn't been any investigation into the use of lru_cache() and the proper maxsize, but it sounds like a good idea to go over that and figure out where we have cache misses.
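
Measuring this is cheap, since stdlib `lru_cache` already exposes counters via `cache_info()`. A minimal sketch of how one might instrument a cached function (`expensive_infer` here is a hypothetical stand-in, not a real pylint/astroid function):

```python
from functools import lru_cache


@lru_cache(maxsize=1024)
def expensive_infer(qname: str) -> str:
    # stand-in for an expensive inference call
    return qname.upper()


# simulate a small workload, then inspect the counters
for name in ["a", "b", "a", "a", "c"]:
    expensive_infer(name)

info = expensive_infer.cache_info()  # CacheInfo(hits, misses, maxsize, currsize)
hit_rate = info.hits / (info.hits + info.misses)
```

If `misses` stays high and `currsize` is pinned at `maxsize` on real workloads, the cache is thrashing and a larger `maxsize` (or a different keying strategy) may be warranted.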

@PCManticore PCManticore added Enhancement ✨ Improvement to a component task labels Jul 7, 2019
@PCManticore PCManticore changed the title inference cache improvements [performance] Investigate inference cache improvements for improving the performance Jul 7, 2019
@kodonnell
Author

Re disk penalties - I guess where I'm coming from is that if it takes 30s to infer e.g. pandas, any disk-caching overhead will be negligible. In cases where the user wouldn't want to cache (e.g. they start in a fresh docker container for a new CI/CD run and don't want to persist the cache, for simplicity), we could easily disable it - though, again, I suspect it won't have much impact. Especially if you share the caches (as mentioned above): if it takes 3s to download and parse a shared cache, you're immediately 27s faster, so it actually makes sense for fast builds (with internet connections).

@0xADD1E

0xADD1E commented Apr 18, 2020

For simple repeated runs, would it be enough to use a hash of the file contents as a hash key, and cache the results on that?

I could see that being problematic if a dependency were to break something for a file that hasn't changed, but for something that brings linting from 15 seconds down to 1.5 it seems awfully tempting.

Not being incredibly familiar with the code yet, would there be any fairly cheap way to construct a module dependency tree and invalidate everything that depends on changed files? That would seem to be the safest way of getting that kind of speed improvement.
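
The transitive invalidation described above is a straightforward graph traversal once an import map exists. A sketch under the assumption that a `{module: set_of_imported_modules}` map is available (building that map from real code is the hard part, and is not shown here):

```python
from collections import defaultdict


def invalidate(changed: set, imports: dict) -> set:
    """Given a {module: imported_modules} map, return every module
    that transitively depends on any changed module."""
    # Invert the map: for each module, who imports it?
    dependents = defaultdict(set)
    for mod, deps in imports.items():
        for dep in deps:
            dependents[dep].add(mod)
    # Walk upward from the changed modules, marking everything stale.
    stale = set(changed)
    stack = list(changed)
    while stack:
        mod = stack.pop()
        for parent in dependents[mod]:
            if parent not in stale:
                stale.add(parent)
                stack.append(parent)
    return stale
```

Everything outside the returned set could safely keep its cached results, which is exactly the "invalidate only what depends on changed files" behaviour suggested above.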

If anyone wants to see/mess with this approach, it's at https://github.com/0xADD1E/pylint/tree/caching-tests

@CarliJoy
Contributor

CarliJoy commented Nov 29, 2021

> If anyone wants to see/mess with this approach, it's at https://github.com/0xADD1E/pylint/tree/caching-tests

@0xADD1E I can't access your code.

@0xADD1E

0xADD1E commented Nov 29, 2021

@CarliJoy It looks like I may have gotten rid of it a while ago due to inactivity here. I'll see if I can dig up an old copy from backups later today. If memory serves, it was just using stdlib's shelve module and a sha256 key around https://github.com/PyCQA/pylint/blob/8a17bb557a23708ebbe64174c564f9c5741fb5dd/pylint/lint/pylinter.py#L1059 or so...
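
From that description, the deleted branch likely looked something like the following. This is a reconstruction from memory of the comment above, not the actual code: `cached_check` and `run_checks` are hypothetical names, and the real hook point was inside pylint's per-file check loop.

```python
import hashlib
import shelve


def cached_check(path: str, run_checks, db_path: str):
    """Key a file's lint results on a sha256 of its bytes,
    stored with stdlib shelve, as described in the comment above."""
    with open(path, "rb") as f:
        key = hashlib.sha256(f.read()).hexdigest()
    with shelve.open(db_path) as db:
        if key in db:
            return db[key]  # unchanged file: reuse stored results
        result = run_checks(path)  # the real per-file check would go here
        db[key] = result
        return result
```

The appeal of shelve here is that it is zero-dependency persistence; the caveat, noted earlier in the thread, is that a pure content hash cannot detect breakage introduced by a changed dependency.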

@Pierre-Sassoulas Pierre-Sassoulas added the Maintenance Discussion or action around maintaining pylint or the dev workflow and Needs investigation 🔬 A bug or crash where it's not immediately obvious what is happening labels, and removed the task label Jun 23, 2022
@jacobtylerwalls
Member

Disk caching is worth investigating. Will close as a duplicate of #1416, but will call out the promising experiment described at #2912 (comment).

@jacobtylerwalls jacobtylerwalls closed this as not planned (duplicate) Jan 24, 2024
@jacobtylerwalls jacobtylerwalls added the Duplicate 🐫 Duplicate of an already existing issue label Jan 24, 2024