Skip to content

optimize local status #3867

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
May 25, 2020
Merged

optimize local status #3867

merged 5 commits into from
May 25, 2020

Conversation

efiop
Copy link
Contributor

@efiop efiop commented May 24, 2020

#!/bin/bash                                   
                                              
set -e                                        
set -x                                        
                                              
rm -rf test_corentin                          
mkdir test_corentin                           
cd test_corentin                              
                                              
if [ ! -d "data" ]; then                      
        mkdir data                            
        for i in $(seq 1 400000); do          
                echo $i > data/$i             
        done                                  
fi                                            
                                              
rm -rf myrepo                                 
mkdir myrepo                                  
cd myrepo                                     
                                              
git init                                      
dvc init                                      
                                              
dvc config cache.type hardlink                
dvc config core.analytics false               

echo '*.py' > .dvcignore

cp ../data -a data                            
dvc add data                                  
dvc status                                    
python -mcProfile -o status.prof -m dvc status

Last dvc status results in ~50sec(way over 1minute if there is no unpacked dir) on current master and ~8sec with this PR.

The main issue that was causing this is relpath that we were calling for each file that we check for match in dvcignore. Plus some alog-the-way optimizations like os.path.join and saving stats on is_protected. Also removing unpacked dir optimization, as it is no longer needed, mainly because of the recent optimizations that are done.

The issue was reported by the user https://discordapp.com/channels/485586884165107732/563406153334128681/713338211128311808 and my reproduction script is based on the dataset structure that he has.

  • ❗ I have followed the Contributing to DVC checklist.

  • 📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here. If the CLI API is changed, I have updated tab completion scripts.

  • ❌ I will check DeepSource, CodeClimate, and other sanity checks below. (We consider them recommendatory and don't expect everything to be addressed. Please fix things that actually improve code or fix bugs.)

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

@efiop
Copy link
Contributor Author

efiop commented May 24, 2020

Will look if I can add a benchmark like that to dvc-bench before this is merged.

dvc/ignore.py Outdated
@@ -134,6 +140,8 @@ def isexec(self, path):
return self.tree.isexec(path)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we've discussed this before, but all of these methods don't take dvcignore into account, which is wrong.

@efiop efiop requested a review from pared May 24, 2020 04:13
@efiop efiop force-pushed the master branch 3 times, most recently from 30a474c to d81b17a Compare May 24, 2020 19:21
@courentin
Copy link
Contributor

Thank you @efiop !
I can confirm that the dvc status is faster now !

Here is the call graph:
speech12

Also, I noticed that the more I have entries in my .dvcignore, the more it takes time to complete (tiny diff though less than 2sec)

@efiop
Copy link
Contributor Author

efiop commented May 24, 2020

@courentin Thanks for the feedback! Yeah, we are definitely not fully done optimizing it, but this seems like a good start 😉

efiop added 5 commits May 25, 2020 01:22
We've done a lot of optimizations lately, which made unpacked dir trick
obsoleted.
It is ~5.5 times slower than joining by hand.
On a repo with a dvcignore with 1 pattern and a directory with 400K
files, `dvc status` now takes ~8 sec instead of ~30 sec.

To achieve that, we make some assumptions about the paths formats that
we are dealing with, so we could use simpler logic instead of using
very slow `relpath`, `abspath` etc on every entry in a directory.

It is also clear that CleanTree behavior is inconsistent (even tests
expect very different outputs from it), so we will need to look into
this later.
@efiop efiop changed the title [WIP] optimize local status optimize local status May 24, 2020
@@ -69,7 +69,6 @@ def test_ignore_from_file_should_filter_dirs_and_files():
),
("dont_ignore.txt", ["dont_ignore"], False),
("dont_ignore.txt", ["dont*", "!dont_ignore.txt"], False),
("../../../something.txt", ["**/something.txt"], False),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is too expensive for match to try to resolve paths like this.

@efiop efiop merged commit d77af81 into iterative:master May 25, 2020
@efiop efiop mentioned this pull request May 25, 2020
1 task
@efiop efiop self-assigned this May 25, 2020
@efiop efiop added the performance improvement over resource / time consuming tasks label May 25, 2020
Copy link
Contributor

@pared pared left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance improvement over resource / time consuming tasks
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants