Directory dependency wrongly reported as changed #2144

Closed
pedroasad opened this issue Jun 18, 2019 · 3 comments

@pedroasad

It seems that DVC 0.43.1 (also 0.40.1, surely) wrongly reports that directory dependencies have changed when working on different copies of a repository. Consider this scenario:

  • On computer A, I work on a pipeline that depends on a directory ABC
  • I commit the changes in A (revision abcd1234), and push them to the remote (git push)
  • I upload A's DVC cache to a Google Cloud Storage remote (dvc push -r <remote>)
  • On computer B, I check out the latest version of the repository (git checkout abcd1234) and synchronize the cache (dvc pull -r <remote>)

Now, running dvc status on B wrongly reports that the ABC directory has changed, even though the repository has just been checked out at the same revision as on A and the up-to-date cache has been pulled.

I have observed this while working on a project on my home and office computers, and also while using a CI service (Gitlab CI) to build and test my project. I am providing a minimal working example, dvc-issue.zip, consisting of a Git repository with a simple Dvcfile and a Dockerfile, which makes it possible to reproduce this behaviour without requiring actual Git or DVC remotes. Below, I am pasting the contents of this package's README, which contains detailed instructions for reproducing this bug.

README

This package demonstrates a bug in DVC, as of version 0.41.3: when a pipeline depends on a
directory, if the repository is cloned on a different machine than the developer's (for instance, in
a Docker container running in a CI service), DVC wrongly reports that the dependencies have changed,
even if the cache directory (.dvc/cache) is copied as-is to this working copy.

This behavior is counterproductive, because it signals changed dependencies and forces pipelines to
be rerun, when dependencies have not actually changed. Possibly, this scenario indicates a bug in
the algorithm that computes directory hashes, which might be taking some local filesystem attributes
into account (like file modification timestamps, or inode numbers, for instance), which it should not.

This repository contains a minimal working example, which consists of:

  • running a provided DVC pipeline (Dvcfile)
  • building a Docker image from a provided Dockerfile that
    • simulates cloning a repository (the attached tarball contains a .git directory)
    • copies the local .dvc/cache to the cloned repository
    • executes the dvc status command to show the erroneous behavior

Steps to reproduce

On Linux Mint 19.1 (lsb_release):

# 1. Install DVC (0.43.1) and Docker
pip install --user "dvc==0.43.1"  # drop "--user" if using a virtualenv
sudo apt install docker           # "docker --version" reports "Docker version 18.09.6, build 481bc77"

# 2. Decompress this repository package and enter it
unzip dvc-issue.zip
cd dvc-issue

# 3. Run the pipeline, then make sure any changes are recorded, although there shouldn't be any.
dvc repro
git commit -am "Update"  # Should actually complain about an empty commit.

# 4. Build the Docker image
docker build -t dvc-issue .

# 5. Run the Docker image
docker run dvc-issue

After step 5, the following message should be displayed

Dvcfile:
	changed deps:
		modified:           pkg
	changed outs:
		deleted:            data/output.txt

which confirms that DVC wrongly detects changed dependencies (the pkg directory, which remains
identical).

Another way of seeing this is by inspecting the change in the md5 key of the pkg directory
dependency (in Dvcfile) after running

docker build -t dvc-issue .
docker run -it --entrypoint /bin/bash dvc-issue

# Now, inside the container:
dvc repro
git diff

which shows that the pkg directory MD5 hash has changed, despite the actual content of the
directory being the same.

Expected behavior

The dvc status command which is run in the repository copy cloned in the Docker container should
report

Dvcfile:
	changed outs:
		deleted:            data/output.txt

since

  • the .dvc/cache directory was copied to the cloned repository,
  • the pkg directory contents were not changed by the git clone command (which is actually
    simulated by copying .git to the container and running git reset --hard, but results are the
    same if an actual repository is cloned from GitHub), and
  • the data/output.txt is the only file missing, but is present in the cache.

Furthermore, if the CMD instruction in the Dockerfile was replaced by

CMD dvc checkout && dvc status

the output of running the built image should then be

Pipeline is up to date. Nothing to reproduce.

@efiop
Contributor

efiop commented Jun 18, 2019

Hi @pedroasad !

In the Dvcfile you are specifying the whole directory as a dependency and then you are running that directory as a Python module, which probably creates *.pyc and __pycache__ files/dirs that are taken into account by dvc when computing the checksum for pkg. We could help with that by using the upcoming .dvcignore feature #1876 (in our TODO for the next week) and ignoring *.pyc and __pycache__. Also, thinking about it, we might consider #1471 once again. For now, you could try specifying individual files as dependencies, e.g. -d pkg/__init__.py -d pkg/main.py. Unfortunately, we don't support wildcards in the dvc deps right now #1462, which would also help with that workaround :(
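
To illustrate the effect described above, here is a minimal Python sketch of a directory checksum that hashes every file under a directory. It is not DVC's actual algorithm; the function dir_checksum, the IGNORED patterns, and the demo helper are hypothetical names used only for illustration. It shows that a stray __pycache__ directory changes the checksum of an otherwise identical pkg, and that filtering bytecode artifacts keeps it stable.

```python
import hashlib
import tempfile
from fnmatch import fnmatch
from pathlib import Path

# Hypothetical ignore patterns, chosen for this sketch only.
IGNORED = ("__pycache__", "*.pyc")


def dir_checksum(root, ignore=()):
    """Return an MD5 over relative paths and contents of all files under root.

    Illustrative only: this is NOT how DVC computes directory checksums.
    """
    root = Path(root)
    digest = hashlib.md5()
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        # Skip ignored files and anything located inside an ignored directory.
        if any(fnmatch(part, pat) for part in rel.parts for pat in ignore):
            continue
        if path.is_file():
            digest.update(rel.as_posix().encode("utf-8") + b"\0")
            digest.update(path.read_bytes() + b"\0")
    return digest.hexdigest()


def demo():
    """Hash a toy package before and after a bytecode cache appears."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg = Path(tmp) / "pkg"
        pkg.mkdir()
        (pkg / "__init__.py").write_text("")
        (pkg / "main.py").write_text("print('hello')\n")
        clean = dir_checksum(pkg)
        clean_filtered = dir_checksum(pkg, IGNORED)

        # Simulate what running the package leaves behind.
        cache = pkg / "__pycache__"
        cache.mkdir()
        (cache / "main.cpython-37.pyc").write_bytes(b"fake bytecode")
        dirty = dir_checksum(pkg)
        dirty_filtered = dir_checksum(pkg, IGNORED)
    return clean, dirty, clean_filtered, dirty_filtered


if __name__ == "__main__":
    clean, dirty, clean_filtered, dirty_filtered = demo()
    print("unfiltered checksum changed:", clean != dirty)
    print("filtered checksum changed:  ", clean_filtered != dirty_filtered)
```

The source files are byte-for-byte identical in both states; only the generated cache differs, which is consistent with the changed pkg checksum seen in the container.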

@pedroasad
Author

Bull's eye @efiop !

Indeed, the __pycache__ directory was causing DVC to see the pkg directory as changed. Thanks for pointing that out; it was simple, but it just didn't catch my attention. I rewrote the Dvcfile, specifying files individually, and now dvc status reports no changes in different working copies. I am including the patched example here: dvc-issue.zip.

Glad to know about #1876, it will surely make DVC a lot more usable. For now, specifying dependencies individually is inconvenient and error-prone (consider, for instance, adding a new module under a package that changes the code's behaviour, which would require manually specifying the new dependency), but it will do the trick.
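
One way to make the per-file workaround less error-prone might be to generate the -d arguments from the package's source files instead of typing them by hand. The sketch below is only an illustration under that assumption: dep_flags is a hypothetical helper, not part of DVC, and it simply collects every .py file under a directory, which leaves out *.pyc files and __pycache__ contents.

```python
from pathlib import Path


def dep_flags(package_dir):
    """Return ['-d', '<file>', ...] for every .py file under package_dir.

    Hypothetical helper for illustration; it only builds the argument list
    and does not invoke dvc itself.
    """
    package_dir = Path(package_dir)
    flags = []
    for source in sorted(package_dir.rglob("*.py")):
        flags += ["-d", source.as_posix()]
    return flags


if __name__ == "__main__":
    # With the layout from this issue, this should print something like
    # "-d pkg/__init__.py -d pkg/main.py", ready to paste into a dvc run call.
    print(" ".join(dep_flags("pkg")))
```

Regenerating the list whenever a module is added would still be a manual step, so this only reduces the risk of forgetting a file; it does not remove it.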

Now, considering it was just misuse, I guess this issue would be better off closed, right?

@efiop
Contributor

efiop commented Jun 19, 2019

@pedroasad Great to know the mystery is solved! 🙂 Indeed, we are really looking forward to dvcignore ourselves 😄 Yes, let's close this one. Hang on tight, dvcignore should be coming soon! Thanks for the feedback!

@efiop efiop closed this as completed Jun 19, 2019