-
Notifications
You must be signed in to change notification settings - Fork 1.2k
optimize local status #3867
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
optimize local status #3867
Conversation
Will look if I can add a benchmark like that to dvc-bench before this is merged. |
dvc/ignore.py
Outdated
@@ -134,6 +140,8 @@ def isexec(self, path): | |||
return self.tree.isexec(path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, we've discussed this before, but all of these methods don't take dvcignore
into account, which is wrong.
30a474c
to
d81b17a
Compare
Thank you @efiop ! Also, I noticed that the more I have entries in my |
@courentin Thanks for the feedback! Yeah, we are definitely not fully done optimizing it, but this seems like a good start 😉 |
We've done a lot of optimizations lately, which made unpacked dir trick obsoleted.
It is ~5.5 times slower than joining by hand.
On a repo with a dvcignore with 1 pattern and a directory with 400K files, `dvc status` now takes ~8 sec instead of ~30 sec. To achieve that, we make some assumptions about the paths formats that we are dealing with, so we could use simpler logic instead of using very slow `relpath`, `abspath` etc on every entry in a directory. It is also clear that CleanTree behavior is inconsistent (even tests expect very different outputs from it), so we will need to look into this later.
@@ -69,7 +69,6 @@ def test_ignore_from_file_should_filter_dirs_and_files(): | |||
), | |||
("dont_ignore.txt", ["dont_ignore"], False), | |||
("dont_ignore.txt", ["dont*", "!dont_ignore.txt"], False), | |||
("../../../something.txt", ["**/something.txt"], False), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is too expensive for match
to try to resolve paths like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Last
dvc status
results in ~50sec(way over 1minute if there is no unpacked dir) on current master and ~8sec with this PR.The main issue that was causing this is
relpath
that we were calling for each file that we check for match in dvcignore. Plus some alog-the-way optimizations likeos.path.join
and savingstat
s onis_protected
. Also removingunpacked dir
optimization, as it is no longer needed, mainly because of the recent optimizations that are done.The issue was reported by the user https://discordapp.com/channels/485586884165107732/563406153334128681/713338211128311808 and my reproduction script is based on the dataset structure that he has.
❗ I have followed the Contributing to DVC checklist.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here. If the CLI API is changed, I have updated tab completion scripts.
❌ I will check DeepSource, CodeClimate, and other sanity checks below. (We consider them recommendatory and don't expect everything to be addressed. Please fix things that actually improve code or fix bugs.)
Thank you for the contribution - we'll try to review it as soon as possible. 🙏