get/list/import/api: subrepo support #4247
@@ -153,22 +150,29 @@ def tree(self, tree):
    def __repr__(self):
        return f"{self.__class__.__name__}: '{self.root_dir}'"

    @cached_property
    def repo_tree(self):
        return RepoTree(self.tree, [self], stream=True)
We need granular configs in each of the tree operations. I could get around this with `stream_open()` or just this ^ hack. But we need `fetch` and `stream` to be per-operation rather than per-instance.
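One way to read this request is sketched below: the instance keeps a default, and each operation accepts an override. This is only an illustration under stated assumptions. `SketchTree`, its `fetched` list, and the `stream` keyword on `open()` are hypothetical names and do not reflect the actual `RepoTree` API.

```python
import io


class SketchTree:
    """Hypothetical stand-in for a repo tree; not DVC's RepoTree."""

    def __init__(self, files, stream=False):
        self._files = dict(files)  # path -> bytes
        self.stream = stream       # per-instance default
        self.fetched = []          # paths "downloaded" to a local cache

    def open(self, path, stream=None):
        # A per-call argument overrides the per-instance default.
        use_stream = self.stream if stream is None else stream
        if not use_stream:
            self.fetched.append(path)  # pretend to fetch before opening
        return io.BytesIO(self._files[path])


tree = SketchTree({"data.csv": b"a,b\n1,2\n"})

# Same instance, two different behaviours chosen per operation.
with tree.open("data.csv", stream=True) as fobj:
    streamed = fobj.read()
with tree.open("data.csv") as fobj:
    fetched = fobj.read()
```

With this shape, the `stream=True` passed to the constructor in the diff above would become a default rather than a fixed property of the instance.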
def find(tree, top=None):
    top = top or tree.tree_root
    for root, _, _ in tree.walk(top):
        if tree.isdir(os.path.join(root, ".dvc")):
Shouldn't we use `Repo.DVC_DIR`?
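A rough sketch of what the suggestion amounts to: look the directory name up from a single constant instead of repeating the literal. It uses `os.walk` on a local filesystem in place of the PR's `tree.walk`. The module-level `DVC_DIR` and the function name `find_repo_roots` are assumptions made for illustration; the constant's value is assumed to match `Repo.DVC_DIR`.

```python
import os

DVC_DIR = ".dvc"  # assumed to mirror Repo.DVC_DIR


def find_repo_roots(top):
    """Yield every directory under `top` that contains a DVC_DIR directory."""
    for root, dirs, _ in os.walk(top):
        if DVC_DIR in dirs:
            yield root
            # Prune in place so the walk does not descend into the
            # control directory itself.
            dirs.remove(DVC_DIR)
```

Calling `list(find_repo_roots(path))` on a checkout would return the top-level repo (if any) followed by nested subrepo roots.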
One having a top-level dvc and another without it
This allows us to avoid collecting dvcignore for the whole repo if we only care about particular paths. As a result, in a repo with 2 datasets (2M + 0.5M files), creating a defunct stage takes ~4 sec on 1.2.0, but ~1 sec (most of it is actually dvc module initialization) with this PR. This is also a prerequisite for dynamic dvcignore and subrepo collection (iterative#4247) while walking the tree. Also, it is important to clarify that regular `dvc status` (without arguments) has the same performance after this PR, because when we check a dataset for changes, we call things like `tree.exists()`, which call dvcignore and make it collect dvcignore in the dataset itself, so we still end up collecting dvcignore for the whole repo (including walking into the datasets). This should be solved soon by telling dvcignore that it shouldn't walk into the datasets searching for `.dvcignore`s.
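The idea of collecting ignore files only for the paths being queried can be sketched roughly as follows. This is not DVC's dvcignore implementation: `LazyIgnore`, `patterns_for`, and the internal cache are hypothetical, and pattern matching itself is left out. The sketch only gathers raw lines from ignore files found between the repo root and the queried path.

```python
import os


class LazyIgnore:
    """Hypothetical lazy collector; not DVC's dvcignore implementation."""

    def __init__(self, root, filename=".dvcignore"):
        self.root = root
        self.filename = filename
        self._patterns = {}  # directory -> list of lines, filled on demand

    def _load(self, dirname):
        # Read a directory's ignore file at most once.
        if dirname not in self._patterns:
            path = os.path.join(dirname, self.filename)
            lines = []
            if os.path.isfile(path):
                with open(path) as fobj:
                    lines = [ln.strip() for ln in fobj if ln.strip()]
            self._patterns[dirname] = lines
        return self._patterns[dirname]

    def patterns_for(self, path):
        """Collect lines only from directories between root and `path`."""
        rel = os.path.relpath(os.path.dirname(path), self.root)
        parts = [] if rel == os.curdir else rel.split(os.sep)
        current = self.root
        collected = list(self._load(current))
        for part in parts:
            current = os.path.join(current, part)
            collected.extend(self._load(current))
        return collected
```

Directories that are never on the way to a queried path are never read, which is the property the description above relies on to skip large datasets.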
        # git-only erepo's do not need dvctree
        self.dvctree = None

    def __init__(
        self, tree, subrepos=None, **kwargs
It looks a bit strange to pass a list of subrepos into RepoTree like this. It seems like a RepoTree should be able to find nested subrepos by walking itself.
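For illustration, the alternative the reviewer describes might look roughly like this: the tree discovers nested repos on first use instead of receiving them as a constructor argument. `SelfDiscoveringTree`, its `subrepos` property, and `owner()` are hypothetical names, and the sketch walks a local filesystem rather than a DVC tree object. It assumes Python 3.8+ for `functools.cached_property`.

```python
import os
from functools import cached_property


class SelfDiscoveringTree:
    """Hypothetical tree that locates nested repos itself; not RepoTree."""

    def __init__(self, root, marker=".dvc"):
        self.root = root
        self.marker = marker

    @cached_property
    def subrepos(self):
        # Walk once, on first access, and remember the nested repo roots.
        found = []
        for dirpath, dirnames, _ in os.walk(self.root):
            if self.marker in dirnames:
                dirnames.remove(self.marker)  # skip the control directory
                if dirpath != self.root:
                    found.append(dirpath)
        return sorted(found)

    def owner(self, path):
        """Return the deepest repo root that contains `path`."""
        best = self.root
        for sub in self.subrepos:
            inside = os.path.commonpath([sub, path]) == sub
            if inside and len(sub) > len(best):
                best = sub
        return best
```

The trade-off is that discovery costs a full walk the first time `subrepos` is accessed, whereas an explicit list lets the caller control when and whether that walk happens.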
I have followed the Contributing to DVC checklist.
If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
I will check DeepSource, CodeClimate, and other sanity checks below. (We consider them recommendatory and don't expect everything to be addressed. Please fix things that actually improve code or fix bugs.)
This PR currently works only with `ExternalRepo` (should `Repo()` get subrepo support, or should that be via `RepoTree`? Probably both).

TODO:
- subrepo support inside `Repo`
- `dvcx`? It currently seems to be broken, and was well before this PR. Also, `dvcx` depends on the git repo having a top-level `dvc` repo, which this PR does not work with at all.

Possible future work:
- `RepoTree.stat()`?

Thank you for the contribution - we'll try to review it as soon as possible.