Bug Report
Description
This is an issue that is related to #4249, but I think it is somewhat different.
The issue is that if you:
- track a directory with DVC where a subdirectory matches a dvcignore pattern,
- write a file in that directory, in a place matching the dvcignore pattern, and
- pull an update to that tracked directory,
then DVC will attempt to remove the file that matches the dvcignore pattern.
Reproduce
#!/bin/bash
__doc__="
Minimal example to attempt to reproduce issue where DVC tries to remove data
that should be ignored.
This should be fixed by https://github.com/iterative/dvc/issues/4249, but I'm
still having an issue with it. This MWE does not currently seem to reproduce it, so
maybe it is something else.
"
BASE_DPATH=$HOME/tmp/dvc_pull_issue
write_dummy_pred_and_evaluation(){
    __doc__="
    Helper to write data similar to our prediction / evaluation script
    "
    MODEL_FPATH=$1
    EXPT_DPATH=$(dirname "$MODEL_FPATH")
    MODEL_NAME=$(basename -s .pt "$MODEL_FPATH")
    PRED_DPATH=$EXPT_DPATH/pred_$MODEL_NAME/dataset
    mkdir -p "$PRED_DPATH/_assets/class1"
    mkdir -p "$PRED_DPATH/_assets/class2"
    mkdir -p "$PRED_DPATH/eval/metrics"
    # Intermediate results that should be ignored by DVC
    head /dev/random > "$PRED_DPATH"/_assets/class1/img1.jpg
    head /dev/random > "$PRED_DPATH"/_assets/class1/img2.jpg
    head /dev/random > "$PRED_DPATH"/_assets/class2/img1.jpg
    head /dev/random > "$PRED_DPATH"/_assets/class2/img2.jpg
    head /dev/random > "$PRED_DPATH"/pred.json
    # The summary should not be ignored by DVC
    head /dev/random > "$PRED_DPATH"/eval/metrics/summary.json
}
# Create a clean start directory
rm -rf "$BASE_DPATH"
mkdir -p "$BASE_DPATH"
cd "$BASE_DPATH"
# Make a simple repo
mkdir -p "$BASE_DPATH/demo_repo"
cd "$BASE_DPATH/demo_repo"
git init --quiet
dvc init --quiet
dvc config core.autostage true
dvc config cache.type "symlink,hardlink,copy"
dvc config cache.shared group
dvc config cache.protected true
git config --local receive.denyCurrentBranch "warn"
# This pattern will have local visualizations and raw predictions we do not want to check in
echo "models/*/*/*/_assets" >> .dvcignore
echo "models/*/*/*/pred.json" >> .dvcignore
git add .dvcignore
# Add some data to the repo
mkdir -p "models/expt1"
mkdir -p "models/expt2"
echo "content of model1" > "models/expt1/model1.pt"
echo "content of model2" > "models/expt1/model2.pt"
echo "content of model3" > "models/expt2/model3.pt"
echo "content of model4" > "models/expt2/model4.pt"
dvc add models/expt1 models/expt2
git commit -am "Add data v1"
# Make a clone of the simple repo
cd "$BASE_DPATH"
git clone demo_repo/ demo_repo_clone
cd "$BASE_DPATH/demo_repo_clone"
# Set the remote to the other repo
dvc remote add custom "$BASE_DPATH/demo_repo/.dvc/cache"
dvc pull -r custom --recursive .
# Back to the original repo, add in basic eval data
cd "$BASE_DPATH/demo_repo"
write_dummy_pred_and_evaluation models/expt1/model1.pt
write_dummy_pred_and_evaluation models/expt2/model4.pt
dvc add models/expt1 models/expt2
git commit -am "model evals for 1 and 4 from orig repo"
# In the clone repo
# Write some local output, and then pull
cd "$BASE_DPATH/demo_repo_clone"
# Make a file that a pull should not touch, because it is meant to be
# covered by a pattern in the .dvcignore file.
mkdir -p models/expt1/pred_model1/_assets/should-be-ignored
head /dev/random > models/expt1/pred_model1/_assets/should-be-ignored/ignore-me.tmp
git pull
dvc pull -r custom --recursive .
# Causes:
# 0% Checkout| |0/6 [00:00<?, ?file/s
# file/directory '/home/joncrall/tmp/dvc_pull_issue/demo_repo_clone/models/expt1/pred_model1/_assets/should-be-ignored/ignore-me.tmp' is going to be removed. Are you sure you want to proceed? [y/n]
This causes DVC to prompt me to delete a file I told it to ignore.
'models/expt1/pred_model1/_assets/should-be-ignored/ignore-me.tmp' is going to be removed. Are you sure you want to proceed? [y/n]
Expected
I would expect that, because models/*/*/*/_assets is in the .dvcignore and none of the files being pulled conflict with it, DVC would simply ignore the file and leave it in place.
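One way to sanity-check this expectation is to confirm which paths the pattern actually covers. The following is a minimal sketch, assuming gitignore-style semantics in which an anchored pattern's `*` never crosses a `/`. `matches_anchored_pattern` is a hypothetical helper, not a DVC command, and it only handles simple patterns like the ones in this report; running `dvc check-ignore -d <path>` inside the repo is the authoritative check.

```shell
# Hypothetical helper (not part of DVC): approximate gitignore-style matching
# for an anchored pattern such as "models/*/*/*/_assets".
# Succeeds when the pattern matches the path itself or one of its parent dirs.
# The body runs in a subshell so its variables do not leak into the caller.
matches_anchored_pattern() (
    pattern=$1
    path=$2
    while [ -n "$pattern" ]; do
        # The path ran out of components before the pattern did: no match.
        [ -n "$path" ] || exit 1
        pat_head=${pattern%%/*}
        path_head=${path%%/*}
        # Unquoted $pat_head acts as a glob against a single path component.
        case $path_head in
            $pat_head) ;;
            *) exit 1 ;;
        esac
        # Drop the leading component from both; empty once none are left.
        case $pattern in */*) pattern=${pattern#*/} ;; *) pattern= ;; esac
        case $path in */*) path=${path#*/} ;; *) path= ;; esac
    done
    exit 0
)

pattern='models/*/*/*/_assets'
for p in \
    models/expt1/pred_model1/dataset/_assets/class1/img1.jpg \
    models/expt1/pred_model1/_assets/should-be-ignored/ignore-me.tmp
do
    if matches_anchored_pattern "$pattern" "$p"; then
        echo "ignored:     $p"
    else
        echo "NOT ignored: $p"
    fi
done
```

Under this approximation, the files written by the helper function (which include the `dataset` level) are covered by the pattern, while the hand-written `ignore-me.tmp` sits one directory level higher and is not. If `dvc check-ignore` agrees, the reproduction script may need the extra path component before it exercises the ignored case.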
Environment information
Output of dvc doctor:
DVC version: 2.9.4.dev117+gd5b809d7
---------------------------------
Platform: Python 3.9.9 on Linux-5.13.0-28-generic-x86_64-with-glibc2.34
Supports:
azure (adlfs = 2021.10.0, knack = 0.9.0, azure-identity = 1.7.1),
gdrive (pydrive2 = 1.10.0),
hdfs (fsspec = 2022.1.0, pyarrow = 6.0.1),
webhdfs (fsspec = 2022.1.0),
http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
s3 (s3fs = 2022.1.0, boto3 = 1.20.24),
ssh (sshfs = 2021.11.2),
oss (ossfs = 2021.8.0),
webdav (webdav4 = 0.9.3),
webdavs (webdav4 = 0.9.3)
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: https
Workspace directory: ext4 on /dev/mapper/vgubuntu-root
Repo: dvc, git
Additional Info
When I was tracing the source code, I noticed that dvc/commands/checkout.py and dvc/commands/experiments/pull.py were not passing dvcignore=self.repo.dvcignore to the checkout and pull commands. I haven't finished testing whether adding that fixes the problem. My current thought is that, when computing the diff between the new and old state of the repo, files matched by .dvcignore are (incorrectly?) parsed and then flagged as part of the diff. I think if they were just ignored, this problem would go away.
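The idea of dropping ignored paths from the diff before anything is removed can be sketched in a few lines of shell. This is only an illustration of the hypothesis above, not DVC's implementation: `glob_to_regex` and `filter_removals` are hypothetical names, the path `models/expt1/old_model.pt` is made up for the example, and the glob conversion handles only `*` and `.` in anchored patterns.

```shell
# Hypothetical sketch (not DVC code): filter a list of removal candidates so
# that paths covered by an ignore pattern are never offered for deletion.

# Turn an anchored glob such as "models/*/*/*/_assets" into an extended regex
# in which "*" stays within one path component. Only "." and "*" are handled.
glob_to_regex() {
    printf '%s\n' "$1" |
        sed -e 's/[.]/\\./g' -e 's/[*]/[^\/]*/g' -e 's/^/^/' -e 's/$/(\/|$)/'
}

# Read removal candidates on stdin (one per line) and print only those that
# are NOT matched by the ignore pattern given as the first argument.
filter_removals() {
    # grep exits non-zero when every line is filtered out; that is not an error here.
    grep -Ev -- "$(glob_to_regex "$1")" || true
}

printf '%s\n' \
    models/expt1/old_model.pt \
    models/expt1/pred_model1/dataset/_assets/class1/img1.jpg |
    filter_removals 'models/*/*/*/_assets'
# prints only: models/expt1/old_model.pt
```

Filtering the candidate list, rather than checking each path at deletion time, keeps ignored files out of the confirmation prompt entirely, which is the behavior expected in this report.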
This writeup is a bit rushed, as I have to head out, but the MWE does reproduce the issue. I'm hoping this is not intended behavior, because it breaks my workflow.