dvc 1.x tries to remove files in .dvcignore #4249

dimatura opened this issue Jul 21, 2020 · 10 comments
Labels
awaiting response we are waiting for your reply, please respond! :) bug Did we break something?

@dimatura

dimatura commented Jul 21, 2020

Bug Report

Hi, first of all, let me congratulate you all for the good work on dvc -- it's a great tool. Alas, I'm running into some issues since 1.x.

Please provide information about your setup

Output of dvc version:

The issue occurs with 1.11.1, 1.11.10, 1.0.0b4 (I think), and dvc installed from git master on July 20 (1.1.11+b77ce0).
The issue does not occur with 0.94.1, and other earlier versions I had installed.

Other setup info: Ubuntu 18.04, dvc installed with venv, python 3.6.9. *Edit: the remote is S3.

Additional Information (if any):

My dvc repo has a layout like this:

root
| .gitignore
| .dvcignore
|----subdir1
|          |---- file1.jpg
|          |---- file1.xml 
|          |---- file2.jpg
|          \---- file2.xml
|----subdir1.dvc
|----subdir2
|         similar to subdir1
# etc

In other words, each subdirectory has its own subdir.dvc file. Inside, for each JPG file there is a "sidecar" xml file with annotation metadata. Importantly, I am using DVC only for JPG, and using regular git for the xml. (Before this was a dvc repo, it used to be a git-annex repo, which had explicit support for this with largefiles). I configure this by having *.jpg in .gitignore and having *.xml in .dvcignore. Until 1.x, this worked as expected, with each tool only minding their respective file types.
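For reference, the split described above amounts to two one-line ignore files. A minimal sketch, assuming a throwaway directory stands in for the real repo root (the `repo_root` variable is only for illustration):

```shell
# Throwaway directory standing in for the repo root (an assumption of this sketch).
repo_root=$(mktemp -d)
cd "$repo_root" || exit 1

# git ignores the images, so only DVC is meant to track them.
printf '%s\n' '*.jpg' >> .gitignore
# DVC ignores the sidecar annotations, so only git is meant to track them.
printf '%s\n' '*.xml' >> .dvcignore

# Show the resulting rules.
cat .gitignore .dvcignore
```

In a real repo both files would sit next to the `.dvc` directory, and `dvc add` may later append its own entries to `.gitignore`.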

However, in the various 1.x versions I tried (including yesterday's master, hoping that this was the same as Issue #4197), dvc status reports that the checksum of almost every subdirectory has changed, and whenever I do dvc pull it asks if I want to delete the xml files (it stops after I deny the request). When going back to 0.94.1, everything works like it used to.

I did a quick check with dvc pull --verbose on each version. In 0.94.1, the output never mentions the xml files. On the 1.x versions, it does -- i.e. Path '/mnt/.../file1.xml' inode '624822', then fetched: [], which I guess means it's not really ignoring the xml. Since this does work pre-1.x, it seems like a bug, or at least unexpected behavior, to me.

(Note: while the git/dvc mixed layout worked fine before 1.x, it seems like DVC does not like this setup, considering the way it automatically creates .gitignore files that ignore the whole content of the directory added to DVC -- I wish that behavior could be disabled.)

(Note 2: I also tried simply using a dvc file for each JPG, and while this works, it makes things very slow - dvc status goes from seconds to minutes. So it has its own issues, but I'm glad DVC can optimize for directories ;)

(Note 3: While I would like not having directories with mixed git/dvc content, other tools we use assume the sidecar layout. Adding the xml file into each DVC subdirectory .dvc would have its own issues, since the xmls change frequently. I even tried separate directories and simulating the sidecar layout with symlinks and union mounts, but both have issues on macOS, which we need to work with.)

@ghost ghost added the triage Needs to be triaged label Jul 21, 2020
@dimatura
Author

dimatura commented Jul 21, 2020

Update: Trying to recreate this from scratch, I found that this happens only in subdirectories with more than one level of nesting.

So my original example would actually work, but not if the subdirectories are nested more deeply. For example:

.
├── foo
│   ├── 0982.jpg
│   └── 0982.xml
├── foo2
│   ├── sub1
│   │   ├── 0344.jpg
│   │   ├── 0344.xml
│   │   ├── 0345.jpg
│   │   ├── 0345.xml
│   └── sub1.dvc
└── foo.dvc

When trying the experiment with only foo, things worked fine. But as soon as I added foo2/sub1, the issue appeared: dvc status says foo2/sub1.dvc: changed outs: modified: foo2/sub1 and dvc pull tries to remove the xml files inside sub1 (but not foo).
As a quick experiment I also tried adding **/*.xml to .dvcignore, but it had no visible effect.

Update to the update: ok, this might be a red herring, because I'm having trouble recreating it from scratch again. I think it's something else. Will keep investigating.

@pared
Contributor

pared commented Jul 21, 2020

Hi @dimatura! Your use case depends heavily on .dvcignore. During our optimizations for 1.0 we introduced some bugs that hopefully should be gone by now.
Could you try version 1.1.11? It includes #4125, which should fix your issue.
Related: #4110 #4197

@dimatura
Author

Thanks for the quick response! Actually I did try it and still have the issue, but the odd thing is I'm having difficulty consistently reproducing it from scratch. I suspect it has something to do with the history, in particular what gets placed in the cache. I'm away from keyboard right now, but sometime today I might have the chance to find a consistent way to recreate it.

@pared pared added the bug Did we break something? label Jul 21, 2020
@ghost ghost removed the triage Needs to be triaged label Jul 21, 2020
@pared pared added p0-critical awaiting response we are waiting for your reply, please respond! :) and removed p0-critical labels Jul 21, 2020
@pared
Contributor

pared commented Jul 21, 2020

@dimatura
I was fooling around with the setup you provided and I think I got results similar to yours.
So here is a reproduction script:

#!/bin/bash

rm -rf repo git_repo storage
mkdir repo git_repo

main=$(pwd)

set -ex
pushd git_repo
git init --quiet --bare
popd

pushd repo

git init --quiet
git remote add origin $main/git_repo

dvc init --quiet
dvc remote add -d str $main/storage

git commit -am "init dvc"

mkdir -p data/subdata1
echo "xml1" >> data/subdata1/meta.xml
echo "img1" >> data/subdata1/image.jpg

echo "*.jpg" >> .gitignore
echo "*.xml" >> .dvcignore

pushd data
dvc add subdata1
rm .gitignore
git add subdata1

dvc push
git add -A
git commit -am "add meta"

rm -rf subdata1 .dvc/cache

# see if we are looking for meta.xml
dvc pull -v subdata1.dvc | grep "xml"

For 0.94.1 grep does not show any activity related to xml; the same goes for 1.1.11 and 1.0.0a11 (#4110 was reported after that release, and it is the root cause of #4197). However, for versions between 1.0.0a11 and 1.1.11, the problem exists (I checked 1.0.0 and 1.1.10).

With this reproduction script, dvc status does not report any changes. I wonder whether it is possible that you made a commit to git with a faulty dvc version? That could explain why you get some unexpected behavior.

@dimatura
Author

dimatura commented Jul 22, 2020

Yes, I think you're on the right track. It looks like it's somehow related to bad history on my dvc state, due to mistakes on my part when adding files. I managed to recreate more or less what I believe is going on in my main repo. This is using dvc 1.1.11.

rm -rf dvc_testignore
mkdir dvc_testignore
cd dvc_testignore
git init
dvc init
dvc remote add -d s3 s3://<redacted>
git commit -am 'initial commit'
mkdir -p foo
echo "file1.jpg" > foo/file1.jpg
echo "file1.xml" > foo/file1.xml
echo '*.xml' > .dvcignore
echo '*.jpg' > .gitignore
# setup done, now add files
dvc add -R foo # oops, didn't want to add files individually
# naively undo the dvc add?
rm foo/*.dvc
rm -f foo/.gitignore
dvc add foo # try again with directory, not files
echo '*.jpg' >| .gitignore # overwrite dvc's .gitignore so it doesn't ignore foo/
git add foo
git add foo.dvc .dvcignore .gitignore
git commit -am 'another commit'
dvc push #edit: forgot to add this since I was reusing remote
# dvc status doesn't complain here actually
# dvc pull actually does try to delete file1.xml, but only once
# dvc pull
dvc pull -v # new edit: the -v flag does something strange!

Here, when I do dvc status it doesn't say anything about modified checksums (edit: see below). On the other hand, when I do dvc pull it does prompt me about deleting file1.xml, but only the first time! The next time around it works fine. I believe this may actually also be happening in my other repo, but I didn't notice because the number of files it prompts about is in the hundreds or thousands. It does ask me about a different file each time, so maybe if I just ran and denied the prompt each time it would eventually stop asking.

So, I almost certainly have been guilty in the past of accidentally dvc-adding various files individually, as in this example. I have then done more or less as I've done here: just removed the .dvc files/.gitignore and tried again. I'm not sure whether I have committed this in git; I think I usually catch the error before doing a commit, but I could be wrong. But regardless, it still seems like dvc is in a weird state, as in this example.

Now for me, I guess, the question is whether there's a clean way to fix this. I tried a fresh git+dvc clone, but that still had the issue. I'm hesitant to try dvc gc -w ;). And even if I started from scratch, what is the best way to avoid this issue happening again? Maybe dvc add --no-commit, until I'm sure it's ok, could be an option.

Thanks for the amazing response!

Edit: OK, here's a head scratcher! If at the end of the script above, we do dvc pull -v instead of simply dvc pull, then that really leaves dvc confused for good. After doing dvc pull -v, it no longer stops asking whether I want to delete the file, it asks every time, whether I pass -v or not. Moreover, dvc status does now show foo as being modified (it says: foo.dvc: changed outs: modified: foo). Somehow the act of passing in the verbose flag changes something in dvc, so that I get behavior which is pretty much what I'm seeing in the main repo.
Edit 2: Sorry about all the edits... anyways, I realized I missed the dvc push before pull. After adding the dvc push, again (whether or not I do dvc status -v), it persistently asks about removing file1.xml, and afterwards dvc status again says foo.dvc has changed.

@pared
Contributor

pared commented Jul 22, 2020

@dimatura
Running your script on 1.1.11, I am also asked about removing the .xml file. That happens because dvc, by default, deletes the path that it is trying to check out (to make sure it ends up in the same state as described by its dir md5).

So now git reports that foo/file1.xml has been deleted, but if you run git checkout foo after dvc pull, you will have your xml files back, and both git status and dvc status will say that everything is up to date. Am I missing something?

@dimatura
Author

@pared I see. Yes, I can actually recover the xml from git (and actually did so in my main repo when I let dvc delete the files, just to see what would happen). But dvc will still try to delete it every time I do a dvc pull afterwards :/. That said, the following steps:

dvc pull -f # let dvc delete foo/file1.xml
git checkout -- foo/file1.xml # restore the xml from git
dvc commit

Seem to prevent dvc from asking about deleting file1.xml afterwards. Unfortunately, if I try the same steps in my main repo it tells me that the directory is under git control (which is sort of true, but only partially...). Seems like I have a mess on my hands. I guess there's always the nuclear option of starting from scratch, but if there's some other way I could make my repo sane again I'm open to suggestions :)

@pared
Contributor

pared commented Jul 24, 2020

@dimatura

"But dvc will still try to delete it every time I do a dvc pull afterwards"

I am afraid this is due to the fact that dvc tries to remove the dir content before checkout. This behavior is intentional: it prevents the user from using a directory that combines the results of several other operations (and is not properly committed).

I think the only solution, for now, is to always do git checkout after dvc pull/checkout.
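A rough sketch of that workaround, assuming the xml files are already committed to git. `restore_deleted_from_git` is a hypothetical helper name, not a git or dvc command. It only re-creates tracked files that are missing from the working tree, so local edits to files that still exist are left alone:

```shell
# Hypothetical helper: re-create every git-tracked file that is missing from
# the working tree, without touching files that still exist.
restore_deleted_from_git() {
    # `git ls-files --deleted` prints tracked paths that no longer exist on disk.
    git ls-files --deleted | while IFS= read -r path; do
        git checkout -- "$path"
    done
}

# Intended use, assuming dvc is installed and its deletion prompt was accepted:
#   dvc pull && restore_deleted_from_git
```

Paths containing unusual characters (quotes, newlines) are printed in quoted form by git and would need the `-z` variant of the listing instead.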

As to your main repo: so you filled in .dvcignore and .gitignore and then did git add first, right? I guess that in such a setup you should start with dvc add, edit the auto-generated .gitignore, and only then do git add.

To fix the existing situation, you would need to follow these steps:

  1. git rm -r --cached subdata
  2. git commit -m "stop tracking subdata for a while"
  3. dvc add subdata
  4. edit .gitignore to not ignore subdata
  5. git add subdata

NOTE: if you forget to edit .gitignore, git may tell you that you can use the -f option to override .gitignore. Don't do that, as the other rules (*.jpg) will not be applied either and you will end up adding the jpg files to git too.
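The five steps above can be sketched as a small shell script. This is only a sketch: `unignore_entry` and `retrack_with_dvc` are hypothetical helper names, not part of git or dvc, and it assumes that dvc add wrote the entry as `/subdata` in the .gitignore next to subdata.dvc. Check the file by hand before relying on it.

```shell
# Hypothetical helper for step 4: drop the exact line "/<name>" from an ignore
# file, leaving every other rule (such as *.jpg) in place.
unignore_entry() {
    ignore_file=$1
    name=$2
    [ -f "$ignore_file" ] || return 0
    tmp_file="$ignore_file.tmp"
    # grep -v exits non-zero when no lines remain, so tolerate that case.
    grep -v -x -F "/$name" "$ignore_file" > "$tmp_file" || true
    mv "$tmp_file" "$ignore_file"
}

# Hypothetical wrapper around steps 1-5. Run it from the directory that
# contains the data directory, e.g. `retrack_with_dvc subdata`.
retrack_with_dvc() {
    dir=$1
    git rm -r --cached "$dir" &&
    git commit -m "stop tracking $dir for a while" &&
    dvc add "$dir" &&
    unignore_entry .gitignore "$dir" &&
    git add "$dir"
}
```

Because the `*.jpg` rule stays in .gitignore, the final `git add` should pick up only the xml files, with no need for `-f`.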

@pared pared self-assigned this Jul 24, 2020
@dimatura
Author

Alright, I'll try that. Thanks for the great support!

@pared
Contributor

pared commented Jul 27, 2020

@dimatura Hope it helps.
I don't think I stressed this at the beginning of our conversation: by default, DVC is not supposed to version the same directory as git does, and the solutions provided here are really workarounds.
