RFC: remote local, repo: use walk_files instead of os.walk #1750

Merged on Mar 20, 2019 (1 commit)

Conversation

pared
Contributor

@pared pared commented Mar 19, 2019

Related to #1499

@pared pared requested review from a user and efiop March 19, 2019 15:35

def filter_dirs(dname, root=root):
Contributor Author

@efiop it seems obsolete, can you verify if I am right on this one?

Contributor

@pared It is not obsolete. Here we are excluding directories beforehand, so we don't spend time walking through them. It is very important when you have giant output directories.
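
For readers unfamiliar with the technique described above, here is a minimal sketch of pruning directories before `os.walk` descends into them. It is not the `filter_dirs` code from this PR; the function name `walk_files_pruned` and the `excluded_dirs` parameter are hypothetical and chosen only for illustration.

```python
import os


def walk_files_pruned(root, excluded_dirs):
    """Yield file paths under root, never descending into excluded_dirs.

    Hypothetical helper for illustration; not part of DVC.
    """
    excluded = {os.path.abspath(d) for d in excluded_dirs}
    for dirpath, dirnames, filenames in os.walk(root, topdown=True):
        # With topdown=True, os.walk consults this same list to decide where
        # to descend, so editing it in place skips the excluded subtrees.
        dirnames[:] = [
            d for d in dirnames
            if os.path.abspath(os.path.join(dirpath, d)) not in excluded
        ]
        for fname in filenames:
            yield os.path.join(dirpath, fname)
```

Because the excluded directories are dropped before the walk reaches them, the cost of a giant output directory is one list filter rather than a full traversal of its contents.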

Contributor

@ei-grad ei-grad Mar 19, 2019

This will change in #1709, which introduces a filesystem abstraction to unify file tree access for files inside a specific SCM branch and in the working directory. I think dvc.utils.walk_files could be made obsolete instead, but I'm not sure: it currently looks to be used only in places where working with the working tree is all that matters and the abstraction is not needed. Still, duplicating os.walk behaviour in both utils and this abstraction is not a good thing to have, which is the reason dvc.utils.walk_files might become obsolete.

Contributor

Btw, there is a problem with this filter_dirs implementation: it could actually make things slower if there are many outs. Its current complexity is O(num_dirs * num_outs), which could be reduced by storing outs in a set and checking whether it contains the path or any of its parents.
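
A rough sketch of the set-based check suggested in this comment, under the assumption that outs can be represented as a set of normalized paths. The name `is_inside_outs` is hypothetical and this is not the change that was eventually made in #1751.

```python
import os


def is_inside_outs(path, outs):
    """Return True if path, or any of its parent directories, is in outs.

    outs is assumed to be a set of normalized paths, so each membership
    test is O(1) on average and the whole check costs O(depth of path)
    rather than O(number of outs).
    """
    current = os.path.normpath(path)
    while True:
        if current in outs:
            return True
        parent = os.path.dirname(current)
        if parent == current:  # no further parent to climb to
            return False
        current = parent
```

Walking up the parents compares whole path components, so a sibling such as `/repo/database` is not mistaken for a child of `/repo/data`, which a plain string-prefix test would get wrong.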

Contributor

Added issue for that - #1751

Contributor

@ei-grad ei-grad Mar 19, 2019

Also, this functionality could be moved inside the walk function, but I'm not sure yet about this.


Contributor

@ei-grad ei-grad left a comment

I'd suggest not touching dvc.repo.Stage this time, since there is ongoing refactoring of it in #1709. But it LGTM to route the other os.walk calls through walk_files, so that the .dvcignore implementation stays in a single place.
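
To illustrate why funnelling every traversal through one helper keeps ignore handling in a single place, here is a simplified sketch. It is not the actual `dvc.utils.walk_files`; the name `walk_files_ignoring` and its `patterns` parameter are hypothetical, and plain `fnmatch` globs on names stand in for real .dvcignore semantics, which this sketch does not attempt to reproduce.

```python
import fnmatch
import os


def walk_files_ignoring(root, patterns=()):
    """Yield file paths under root, skipping names matching any pattern.

    Hypothetical helper: callers that use it instead of os.walk directly
    all pick up the same ignore rules automatically.
    """

    def ignored(name):
        return any(fnmatch.fnmatch(name, pat) for pat in patterns)

    for dirpath, dirnames, filenames in os.walk(root):
        # Prune ignored directories in place so they are never descended into.
        dirnames[:] = [d for d in dirnames if not ignored(d)]
        for fname in filenames:
            if not ignored(fname):
                yield os.path.join(dirpath, fname)
```

With a helper like this, a change to the ignore rules touches one function instead of every call site that walks the tree.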

@pared pared force-pushed the rfc_unify_walk branch 2 times, most recently from bb533d3 to 4dcfd9c Compare March 20, 2019 09:17
@pared pared requested review from ei-grad and efiop March 20, 2019 09:27
Contributor

@efiop efiop left a comment

Nice! Thank you! @ei-grad Mind reviewing again? 🙂

@efiop efiop merged commit 5a736cb into iterative:master Mar 20, 2019
Contributor

ei-grad commented Mar 20, 2019

@efiop LGTM too, sorry for being late, I'm not yet accustomed to the notifications flow :(

@pared pared deleted the rfc_unify_walk branch April 3, 2019 11:02