-
Notifications
You must be signed in to change notification settings - Fork 1.2k
RFC: remote local, repo: use walk_files instead of os.walk #1750
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
dvc/repo/__init__.py
Outdated
|
||
def filter_dirs(dname, root=root): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@efiop it seems obsolete, can you verify if I am right on this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pared It is not obsolete. Here we are excluding directories beforehand, so we don't spend time walking through them. It is very important when you have giant output directories.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be changed in #1709. There will be a filesystem abstraction to unify the file tree access for the files inside the specific SCM branch and the working directory. I think dvc.utils.walk_files
instead could be made obsolete, but not sure as it looks like it is currently used only in places where it is only meaningful to work with the working tree and the mentioned abstraction is not needed. Duplication of os.walk
behaviour in utils and this abstraction is not good thing to have, though, this is the point behind of possibility to obsolete dvc.utils.walk_files
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, there is a problem with this filter_dirs
implementation, it actually could make the things slower if there are many outs, its current complexity is O(num_dirs * num_outs)
and could be reduced by using set
to store outs and checking that it contains path or any of it parents.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added issue for that - #1751
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, this functionality could be moved inside the walk
function, but I'm not sure yet about this.
dvc/repo/__init__.py
Outdated
|
||
def filter_dirs(dname, root=root): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pared It is not obsolete. Here we are excluding directories beforehand, so we don't spend time walking through them. It is very important when you have giant output directories.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest to not touch the dvc.repo.Stage
this time, since there is ongoing refactoring on it in #1709. But it LGTM to keep other calls to os.walk
under the walk_files
to be able to keep the .dvcignore
implementation in the single place.
bb533d3
to
4dcfd9c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Thank you! @ei-grad Mind reviewing again? 🙂
@efiop LGTM too, sorry for being late, I'm not yet accustomed with the notifications flow :( |
Related to #1499