Skip to content

get/list/import/api: subrepo support #4247

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 11 commits into from

Conversation

skshetry
Copy link
Collaborator

@skshetry skshetry commented Jul 20, 2020

  • ❗ I have followed the Contributing to DVC checklist.

  • πŸ“– If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.

  • ❌ I will check DeepSource, CodeClimate, and other sanity checks below. (We consider them recommendatory and don't expect everything to be addressed. Please fix things that actually improve code or fix bugs.)

This PR at this time only works with ExternalRepo (should Repo() needs subrepo support, or should that be via RepoTree? probably both).

TODO:

  • More tests
  • subrepo support inside Repo
  • What should be done for dvcx? Currently, it seems it's broken, well before this PR. And, dvcx depends on git repo to have a top-level dvc repo, which this PR does not work with at all.

Possible future works

  • Implement RepoTree.stat()?
  • Granular configs for dvctree and dvcignore
  • Subrepo support in DvcRepo?

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

@skshetry skshetry self-assigned this Jul 20, 2020
@@ -153,22 +150,29 @@ def tree(self, tree):
def __repr__(self):
return f"{self.__class__.__name__}: '{self.root_dir}'"

@cached_property
def repo_tree(self):
return RepoTree(self.tree, [self], stream=True)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need granular configs in each of the tree operations. I could get around with stream_open() or just this ^ hack.
But, we need fetch and stream to be per-ops rather than per instance.

@skshetry skshetry marked this pull request as ready for review July 21, 2020 06:32
@skshetry

This comment has been minimized.

@skshetry skshetry requested review from pmrowla and efiop July 21, 2020 06:36
@skshetry skshetry requested a review from pared July 21, 2020 06:37
@skshetry

This comment has been minimized.

def find(tree, top=None):
top = top or tree.tree_root
for root, _, _ in tree.walk(top):
if tree.isdir(os.path.join(root, ".dvc")):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we use Repo.DVC_DIR?

@efiop efiop mentioned this pull request Jul 25, 2020
2 tasks
efiop added a commit to efiop/dvc that referenced this pull request Jul 25, 2020
This allows us avoid collecting dvcignore for the whole repo if we only
care about particular paths. As a result, in a repo with 2 datasets
(2M + 0.5M files), creating a defunct stage takes ~4sec on 1.2.0, but
~1sec(most of it is actually dvc module initialization) with this PR.

This is also a pre-requisite for dynamic dvcignore and subrepo
collection (iterative#4247) while walking
the tree.

Also, it is important to clarify that regular `dvc status`(without
arguments) has the same performance after this PR, because when we check
dataset for changes, we call things like `tree.exists()`, which call
dvcignore and make it collect dvcignore in the dataset itself, so we
still endup collecting dvcignore for the whole repo (including walking
into the datasets). This should be solved soon by telling dvcignore that
it shouldn't walk into the datasets searching for `.dvcignore`s.
efiop added a commit that referenced this pull request Jul 26, 2020
This allows us avoid collecting dvcignore for the whole repo if we only
care about particular paths. As a result, in a repo with 2 datasets
(2M + 0.5M files), creating a defunct stage takes ~4sec on 1.2.0, but
~1sec(most of it is actually dvc module initialization) with this PR.

This is also a pre-requisite for dynamic dvcignore and subrepo
collection (#4247) while walking
the tree.

Also, it is important to clarify that regular `dvc status`(without
arguments) has the same performance after this PR, because when we check
dataset for changes, we call things like `tree.exists()`, which call
dvcignore and make it collect dvcignore in the dataset itself, so we
still endup collecting dvcignore for the whole repo (including walking
into the datasets). This should be solved soon by telling dvcignore that
it shouldn't walk into the datasets searching for `.dvcignore`s.
@pared pared mentioned this pull request Jul 28, 2020
2 tasks
# git-only erepo's do not need dvctree
self.dvctree = None
def __init__(
self, tree, subrepos=None, **kwargs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks a bit strange to pass a list of subrepos into RepoTree like this. It seems like a RepoTree should be able to find nested subrepos by walking itself.

@skshetry skshetry marked this pull request as draft July 29, 2020 06:39
@skshetry skshetry closed this Aug 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants