Directory dependency wrongly reported as changed #2144
Comments
Hi @pedroasad! In the Dvcfile you are specifying the whole directory as a dependency and then you are running that directory as a Python module, which probably creates
Bull's eye @efiop! Indeed, the Glad to know about #1876, it will surely make DVC a lot more usable. For now, specifying dependencies individually is inconvenient and error-prone (consider, for instance, adding a new module under a package that changes the code's behaviour, which would require manually specifying the new dependency), but it will do the trick. Now, considering it was just misuse, I guess this issue would be better off closed, right?
@pedroasad Great to know the mystery is solved! 🙂 Indeed, we are really looking forward to dvcignore ourselves 😄 Yes, let's close this one. Hang on tight, dvcignore should be coming soon! Thanks for the feedback!
It seems that DVC 0.43.1 (and surely 0.40.1 as well) wrongly reports that directory dependencies have changed when working on different copies of a repository. Consider this scenario:

1. On machine `A`, I work on a pipeline that depends on a directory `ABC`.
2. I commit my changes on `A` (revision `abcd1234`) and push them to the remote (`git push`).
3. I push `A`'s DVC cache to a Google Cloud Storage remote (`dvc push -r <remote>`).
4. On machine `B`, I check out the latest version of the repository (`git checkout abcd1234`) and synchronize the cache (`dvc pull -r <remote>`).

Now, running `dvc status` on `B` wrongly reports that the `ABC` directory has changed, even though the repository has just been updated to the same revision as on `A`, where the up-to-date cache was obtained.

I have observed this while working on a project on my home and office computers, and also while using a CI service (GitLab CI) to build and test my project. I am providing a minimal working example, dvc-issue.zip, consisting of a Git repository with a simple Dvcfile and a Dockerfile, which makes it possible to reproduce this behaviour without requiring actual Git or DVC remotes. Below, I am pasting the contents of this package's README, which contains detailed instructions for reproducing this bug.
README
This package demonstrates a bug in DVC, as of version 0.41.3: when a pipeline depends on a directory, if the repository is cloned on a different machine than the developer's (for instance, in a Docker container running in a CI service), DVC wrongly reports that the dependencies have changed, even if the cache directory (`.dvc/cache`) is copied as-is to this working copy.

This behavior is counterproductive, because it signals changed dependencies and forces pipelines to be rerun when dependencies have not actually changed. Possibly, this scenario indicates a bug in the algorithm that computes directory hashes, which might be taking some local filesystem attributes into account (like file modification timestamps or inode numbers, for instance), which it should not.
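To make the expectation above concrete, here is a minimal sketch of a directory hash that depends only on relative paths and file contents. It is not DVC's actual implementation; the function names `file_md5` and `directory_md5` are hypothetical and chosen purely for illustration.

```python
import hashlib
import json
import os


def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file's bytes, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def directory_md5(root):
    """Hash a directory from relative paths and file contents only.

    Hypothetical helper, not DVC's algorithm: timestamps, inode numbers
    and permissions are deliberately left out of the hashed payload.
    """
    entries = []
    for current, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(current, name)
            # Normalise separators so the hash is the same on every OS.
            rel_path = os.path.relpath(full_path, root).replace(os.sep, "/")
            entries.append({"relpath": rel_path, "md5": file_md5(full_path)})
    # Sort so the result does not depend on filesystem traversal order.
    entries.sort(key=lambda entry: entry["relpath"])
    payload = json.dumps(entries, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()
```

With a hash built this way, touching a file or copying the directory to another machine leaves the digest unchanged, whereas any file that is added, removed or modified inside the directory changes it.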
This repository contains a minimal working example, which consists of:

- a simple pipeline (`Dvcfile`),
- a Git repository (`.git` directory),
- a Dockerfile that copies `.dvc/cache` to the cloned repository and runs the `dvc status` command to show the erroneous behavior.

Steps to reproduce
On Linux Mint 19.1 (`lsb_release`):

After step 5, the following message should be displayed

which confirms that DVC wrongly detects changed dependencies (the `pkg` directory, which remains identical).

Another way of seeing this is by inspecting the change in the `md5` key of the `pkg` directory dependency (in `Dvcfile`) after running

which shows that the `pkg` directory MD5 hash has changed, despite the actual content of the directory being the same.
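One way to check whether two copies of a directory really hold the same content is to compare per-file hashes. The sketch below uses only the Python standard library; `file_hashes` and `compare_trees` are hypothetical helper names, and the paths in the usage note are placeholders, not paths from this package.

```python
import hashlib
import os


def file_hashes(root):
    """Map each file's relative path to the MD5 of its contents."""
    hashes = {}
    for current, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(current, name)
            rel_path = os.path.relpath(full_path, root).replace(os.sep, "/")
            with open(full_path, "rb") as handle:
                hashes[rel_path] = hashlib.md5(handle.read()).hexdigest()
    return hashes


def compare_trees(left_root, right_root):
    """Report files present on one side only and files whose contents differ."""
    left, right = file_hashes(left_root), file_hashes(right_root)
    common = set(left) & set(right)
    return {
        "only_left": sorted(set(left) - set(right)),
        "only_right": sorted(set(right) - set(left)),
        "changed": sorted(path for path in common if left[path] != right[path]),
    }
```

Calling, for example, `compare_trees("copy_a/pkg", "copy_b/pkg")` returns three sorted lists. If all three are empty, the two directories contain the same files with the same bytes; a non-empty `only_left` or `only_right` points at files that exist in just one copy.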
Expected behavior
The `dvc status` command, which is run in the repository copy cloned in the Docker container, should report

since

- the `.dvc/cache` directory was copied to the cloned repository,
- the `pkg` directory contents were not changed by the `git clone` command (which is actually simulated by copying `.git` to the container and running `git reset --hard`, but the results are the same if an actual repository is cloned from GitHub), and
- `data/output.txt` is the only file missing, but it is present in the cache.

Furthermore, if the `CMD` instruction in the Dockerfile were replaced by `CMD dvc checkout && dvc status`, the output of running the built image should then be