Skip to content

LocalRemoteTree: use repo tree as work_tree with local outputs #4125

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jul 16, 2020

Conversation

pared
Copy link
Contributor

@pared pared commented Jun 26, 2020

  • ❗ I have followed the Contributing to DVC checklist.

  • πŸ“– If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.

  • ❌ I will check DeepSource, CodeClimate, and other sanity checks below. (We consider them recommendatory and don't expect everything to be addressed. Please fix things that actually improve code or fix bugs.)

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

EDIT:
Fixes #4110
Fixes #4197

@@ -173,3 +173,27 @@ def test_ignore_blank_line(tmp_dir, dvc):
tmp_dir.gen(DvcIgnore.DVCIGNORE_FILE, "foo\n\ndir/ignored")

assert _files_set("dir", dvc.tree) == {"dir/other"}


def test_ignore_in_added_dir(tmp_dir, dvc):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did consider making test for LocalRemoteTree only, but I guess, in the end, we want to be sure that one can actually ignore something in added directory.

@efiop efiop requested a review from pmrowla June 26, 2020 21:40
@pared pared changed the title 4110 dvcignore issue [WIP] LocalRemoteTree: use CleanTree with work_tree Jun 26, 2020
Copy link
Contributor

@pmrowla pmrowla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the real underlying issue in #4110 is that we still use "remotes" in DVC outputs rather than purely using trees.

This causes problems for LocalOutput, since we treat regular local DVC output paths the same way as a local external dependency (we handle them both using LocalRemoteTree) even though regular output paths should be repo tree paths rather than remote paths.

So for a regular dvc added dir output, right now when we run

    def save_info(self):
        return self.remote.save_info(self.path_info)

we eventually walk the LocalRemoteTree to determine what files should go in dir cache and generate our dir hash. This is what causes the bug w/not using DVC ignore for that directory, and why wrapping the remote work tree w/cleantree fixes the bug.

I think what we should really be doing is treating regular outputs separately from local external dependencies.

We do want to use CleanTree for regular outputs, since we are dealing with actual DVC repo paths (I'm not sure whether that means we use a RepoTree or wrapping LocalRemoteTree work tree w/CleanTree for regular outputs).

But for other types of local "remote" paths (including external dependencies) we should not need to use CleanTree.

dvc/ignore.py Outdated
@@ -216,7 +216,7 @@ def _parents_exist(self, path):
[os.path.abspath(path), self.tree_root]
)
):
return False
return True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works as a temporary solution, but it still seems unintuitive.

It seems to me that dvcignore should only ever apply to a DVC repo, and by definition anything outside the repo root directory should not exist in the clean tree.

@@ -66,7 +67,7 @@ def work_tree(self):
# GitTree arbitarily. When repo.tree is GitTree, local cache needs to
# use its own WorkingTree instance.
if self.repo:
return WorkingTree(self.repo.root_dir)
return CleanTree(WorkingTree(self.repo.root_dir))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as before, this works as part of a temporary fix but I'm not sure it makes sense as a real solution.

For local remotes, and DVC local cache we should need to wrap the work tree with CleanTree. With a local remote, it should only contain files which have been pushed to it, and we should not care about filtering/ignoring anything inside the remote. Likewise, for cache, we should not care about filtering anything inside .dvc/cache.

The main point of this tree refactor was that local remotes and local cache should only be dealing with paths inside the actual remote or cache. We run into issues when we start mixing remote/cache/repo tree paths and treating them all as just "local filesystem paths".

@pmrowla
Copy link
Contributor

pmrowla commented Jun 27, 2020

These changes do fix the user issue, and CleanTree used to allow paths outside the repo root before the remote tree refactoring, so I'd be ok with merging this as a bug fix for now. But long term I think we should keep the other things I mentioned in mind.

@pared pared force-pushed the 4110_dvcignore_issue branch from e311fa1 to 8000d80 Compare July 8, 2020 11:02
@pared pared force-pushed the 4110_dvcignore_issue branch from 8000d80 to ce61e28 Compare July 14, 2020 10:30
@pared pared marked this pull request as ready for review July 14, 2020 10:31
@pared pared changed the title [WIP] LocalRemoteTree: use CleanTree with work_tree LocalRemoteTree: use repo tree as work_tree with local outputs Jul 14, 2020
@pared
Copy link
Contributor Author

pared commented Jul 14, 2020

@pmrowla I tried to make it less hackish behavior, and determine what tree should be used on Output level. The results are in newest commit.

I do believe that current behavior is still hackish.
What I believe we should do after this change:

  1. rename *RemoteTree or (Git/Working/Repo/DVC/Clean)Tree - it gets really confusing which one we are using and where, and we are intermixing them, especially in LocalRemoteTree. Not sure how the renaming should go, so far I came up with (Git/Working/Repo/DVC)FileSystem since those objects provide fs-like operations, thought in that case CleanTree does not make sense, maybe it should be named IgnoreWrapper or something like that.
  2. Related to 1: we are duplicating Local Filesystem functionality in WorkingTree and LocalRemoteTree. I think we should get rid of that duplication. First thing that comes to my mind is creating more BaseTree children, implementing fs operations for all remote types, and using those trees in *RemoteTree's. That does not seem like perfect soultion as we already have this chierarchy *Remote -> *RemoteTree and now we would add another thing: *RemoteTree -> *Tree. Probably that should be reconsidered.

@pared pared requested review from efiop and skshetry July 14, 2020 10:46
@efiop efiop requested a review from pmrowla July 14, 2020 12:30
Comment on lines 29 to 38
@cached_property
def remote(self):
if self._remote:
return self._remote

work_tree = None
if self.repo and self.is_in_repo and is_working_tree(self.repo.tree):
work_tree = self.repo.tree
tree = LocalRemoteTree(self.repo, {}, work_tree)
return LocalRemote(tree)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we do this remote setting in __init__? Just feels a bit weird introducing cached property & _remote mix in both here and base class for this hack.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could, but in that case, we will set remote inside parent constructor and override it here. Thats what I wanted to avoid creating cached_property

Copy link
Contributor

@efiop efiop Jul 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pared but we can set it after super(), no? Or pass the new instance to the constructor.

@efiop
Copy link
Contributor

efiop commented Jul 14, 2020

Had a similar issue with config and it really hurts πŸ™ Thinking that maybe we should tackle #4050 right away and get rid of that is_working_tree mess once and for all. Otherwise, we might be introducing much more hidden issues than we are solving (though in this case this is clearly very serious).

Let's discuss this during planning today, maybe someone will be able to look into this part of trees deeply ASAP. If not, we'll have to merge the hack, indeed. πŸ™

@pared pared force-pushed the 4110_dvcignore_issue branch from ce61e28 to aaea4e2 Compare July 15, 2020 08:22
@pared pared requested a review from efiop July 15, 2020 08:23
@pared pared force-pushed the 4110_dvcignore_issue branch 2 times, most recently from aaea4e2 to 99ad242 Compare July 16, 2020 13:04
@pared pared requested a review from efiop July 16, 2020 13:42
@efiop efiop merged commit 91f3036 into iterative:master Jul 16, 2020
@pared pared deleted the 4110_dvcignore_issue branch January 4, 2022 10:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

dvc add: .dvcignore example not working as expected [qa] PermissionError on file which should be in .dvcignore.
3 participants