dvc 1.x tries to remove files in .dvcignore #4249
Comments
Update: Trying to recreate this from scratch, I found that this happens only in subdirectories with more than one level of nesting. So my original example would actually work, but not if the subdirectories are nested more deeply. For example:
When trying the experiment with only
Update to the update: OK, this might be a red herring, because I'm having trouble recreating it from scratch again. I think it's something else. Will keep investigating.
Thanks for the quick response! Actually I did try it and still have the issue, but the odd thing is I'm having difficulty reproducing it consistently from scratch. I suspect it has something to do with the history, in particular what gets placed in the cache. I'm away from keyboard right now, but sometime today I might have the chance to find a consistent way to recreate it.
@dimatura
For This reproduction script does not report changes in case of |
Yes, I think you're on the right track. It looks like it's somehow related to bad history on my dvc state, due to mistakes on my part when adding files. I managed to recreate more or less what I believe is going on in my main repo. This is using dvc
rm -rf dvc_testignore
mkdir dvc_testignore
cd dvc_testignore
git init
dvc init
dvc remote add -d s3 s3://<redacted>
git commit -am 'initial commit'
mkdir -p foo
echo "file1.jpg" > foo/file1.jpg
echo "file1.xml" > foo/file1.xml
echo '*.xml' > .dvcignore
echo '*.jpg' > .gitignore
# setup done, now add files
dvc add -R foo # oops, didn't want to add files individually
# naively undo the dvc add?
rm foo/*.dvc
rm -f foo/.gitignore
dvc add foo # try again with directory, not files
echo '*.jpg' >| .gitignore # overwrite dvc's .gitignore so it doesn't ignore foo/
git add foo
git add foo.dvc .dvcignore .gitignore
git commit -am 'another commit'
dvc push #edit: forgot to add this since I was reusing remote
# dvc status doesn't complain here actually
# dvc pull actually does try to delete file1.xml, but only once
# dvc pull
dvc pull -v # new edit: the -v flag does something strange!
Here, when I do
So, I almost certainly have been guilty in the past of accidentally
Now for me, I guess, the question is whether there's a clean way to fix this. I tried a fresh git+dvc clone, but that still had the issue. I'm hesitant to try
Thanks for the amazing response!
Edit: OK, here's a head scratcher! If at the end of the script above, we do
@dimatura So now, |
@pared I see. Yes, I can actually recover the xml from git (and actually did so in my main repo when I let dvc delete the files, just to see what would happen). But dvc will still try to delete it every time I do a dvc pull afterwards :/. That said, the following steps:
dvc pull -f # let dvc delete foo/file1.xml
git checkout -- file1.xml
dvc commit
seem to prevent dvc from asking about deleting file1.xml afterwards. Unfortunately, if I try the same steps in my main repo it tells me that the directory is under git control (which is sort of true, but only partially...). Seems like I have a mess on my hands. I guess there's always the nuclear option of starting from scratch, but if there's some other way I could make my repo sane again I'm open to suggestions :)
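For the "recover the xml from git" step, here is a minimal sketch of restoring every git-tracked file that has gone missing from the working tree, rather than checking files out one by one. It runs in a throwaway repository and uses a plain `rm` as a stand-in for the deletion done by dvc; the directory and file names are only illustrative, and it has not been tried against the main repo described in this thread.

```shell
#!/bin/sh
# Sketch only: a throwaway repo with one git-tracked sidecar file.
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.email "demo@example.com"   # local identity for the demo commit
git config user.name "demo"
mkdir -p foo
echo "file1.xml" > foo/file1.xml
git add foo/file1.xml
git commit -q -m 'track sidecar xml in git'

rm foo/file1.xml   # stand-in for the deletion performed by dvc

# `git ls-files --deleted` lists tracked files that are absent from the
# working tree; check each of them out again from the index.
git ls-files --deleted -z | xargs -0 git checkout --

cat foo/file1.xml
```

This only puts the files back; it does not by itself change what dvc has recorded for the directory.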
I think the only solution, for now, is to always do
As to your main repo: so you filled
To fix the already present situation, you would need to follow these steps:
NOTE: if you forget to edit |
Alright, I'll try that. Thanks for the great support! |
@dimatura Hope it will help. |
Bug Report
Hi, first of all, let me congratulate you all for the good work on dvc -- it's a great tool. Alas, I'm running into some issues since 1.x.
Please provide information about your setup
Output of dvc version:
The issue occurs with 1.11.1, 1.11.10, 1.0.0b4 (I think), and dvc installed from git master on July 20 (1.1.11+b77ce0).
The issue does not occur with 0.94.1, and other earlier versions I had installed.
Other setup info: Ubuntu 18.04, dvc installed with venv, python 3.6.9. *Edit: the remote is S3.
Additional Information (if any):
My dvc repo has a layout like this:
In other words, each subdirectory has its own subdir.dvc file. Inside, for each JPG file there is a "sidecar" xml file with annotation metadata. Importantly, I am using DVC only for JPG, and using regular git for the xml. (Before this was a dvc repo, it used to be a git-annex repo, which had explicit support for this with largefiles). I configure this by having *.jpg in .gitignore and having *.xml in .dvcignore. Until 1.x, this worked as expected, with each tool only minding their respective file types.
However, in the various versions of 1.x I tried (including yesterday's master, hoping that this was the same as Issue #4197), dvc status reports that the checksum of almost every subdirectory has changed, and whenever I do dvc pull it asks if I want to delete the xml files (it stops after denying the request). When going back to 0.94.1, everything works like it used to.
I did a quick check on dvc pull --verbose on each version. In 0.94.1, the output never mentions the xml files. On the 1.x versions, it does -- i.e. Path '/mnt/.../file1.xml' inode '624822, then fetched: [], which I guess means it's not really ignoring the xml. Since this does work pre 1.x, it seems like a bug, or at least unexpected behavior to me.
(Note: while the git/dvc mixed layout worked fine before 1.x, it seems like DVC does not like this setup, considering the way it automatically creates .gitignore files that ignore the whole content of the directory added to DVC -- I wish that behavior could be disabled.)
(Note 2: I also tried simply using a dvc file for each JPG, and while this works, it makes things very slow - dvc status goes from seconds to minutes. So it has its own issues, but I'm glad DVC can optimize for directories ;)
(Note 3: While I would like not having directories with mixed git/dvc content, other tools we use assume the sidecar layout. Adding the xml file into each DVC subdirectory .dvc would have its own issues, since the xmls change frequently. I even tried separate directories and simulating the sidecar layout with symlinks and union mounts, but both have issues on macOS, which we need to work with.)
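The ignore setup described in the report can be sketched as follows. This is only an illustration of the intended split (JPGs handled by DVC, xml sidecars handled by git): the `owner_of` helper is hypothetical, mirrors the two patterns from the report, and does not call dvc or git or reflect how either tool decides internally.

```shell
#!/bin/sh
# Hypothetical helper: which tool is expected to own a file under the
# layout from the report. It only mirrors the two ignore patterns.
owner_of() {
    case "$1" in
        *.jpg) echo dvc ;;      # matched by .gitignore, so git skips it
        *.xml) echo git ;;      # matched by .dvcignore, so dvc should skip it
        *)     echo unknown ;;
    esac
}

workdir=$(mktemp -d)
cd "$workdir" || exit 1

# The two ignore files as described in the report.
echo '*.jpg' > .gitignore
echo '*.xml' > .dvcignore

# Illustrative file names; the real repo layout is not reproduced here.
for f in subdir/file1.jpg subdir/file1.xml; do
    printf '%s -> %s\n' "$f" "$(owner_of "$f")"
done
```

Under this split, any dvc output that mentions an xml path suggests the .dvcignore pattern is not being applied to that path.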