Support external local deps/outputs #764

Closed
efiop opened this issue Jun 12, 2018 · 2 comments

efiop commented Jun 12, 2018

We currently support an external local cache, but external local deps/outs are not yet supported. Adding support for them should be pretty trivial, since a lot of the infrastructure already exists for other types.

See https://discuss.dvc.org/t/shared-cache-directory/

efiop added the enhancement (Enhances DVC) label Jun 12, 2018
efiop added this to the 0.9.8 milestone Jun 12, 2018
efiop self-assigned this Jun 12, 2018

@itcarroll

As part of this enhancement, I'd caution against the step of moving the original to the cache (which I think is the usual and good behavior for internal deps/outs). My motivation for using external deps is the common case of multiple code repositories depending on the same dataset (with each repository being configured to have its own external cache).

FYI, I had tried to simply symlink the external deps/outs into the repo ... which failed :)
Having set dvc config cache.dir /nfs/shared-data/classify after the dvc init step of the tutorial, I had:

$ ls -l
total 28
-rw-r--r-- 1 icarroll centrify_primary  390 Mar 16 20:27 conf.py
lrwxrwxrwx 1 icarroll centrify_primary   18 Jun 14 10:47 data -> /nfs/shared-data
-rw-r--r-- 1 icarroll centrify_primary  807 Mar 16 20:27 evaluate.py
-rw-r--r-- 1 icarroll centrify_primary 2098 Mar 19 04:32 featurization.py
-rw-r--r-- 1 icarroll centrify_primary   34 Mar 18 05:57 requirements.txt
-rw-r--r-- 1 icarroll centrify_primary 1835 Mar 16 20:27 split_train_test.py
-rw-r--r-- 1 icarroll centrify_primary  887 Mar 16 20:27 train_model.py
-rw-r--r-- 1 icarroll centrify_primary 1385 Mar 16 20:27 xml_to_tsv.py
$ dvc add -v data/Posts.xml.tgz 
updater is not old enough to check for updates
Data 'data/Posts.xml.tgz' exists. Removing before checkout
Removing 'data/Posts.xml.tgz'
Checking out '../../../../nfs/shared-data/classify/59/88519f8465218abb23ce0e0e8b1384' with cache 'data/Posts.xml.tgz'
Cache type 'reflink' is not supported
Checking out '../../../../nfs/shared-data/classify/59/88519f8465218abb23ce0e0e8b1384' with cache 'data/Posts.xml.tgz'
Traceback (most recent call last):
  File "/research-home/itcarroll/.local/lib/python2.7/site-packages/dvc/command/add.py", line 9, in run
    self.project.add(target)
  File "/research-home/itcarroll/.local/lib/python2.7/site-packages/dvc/project.py", line 122, in add
    stage.save()
  File "/research-home/itcarroll/.local/lib/python2.7/site-packages/dvc/stage.py", line 227, in save
    self.project.scm.ignore(out.path)
  File "/research-home/itcarroll/.local/lib/python2.7/site-packages/dvc/scm.py", line 95, in ignore
    entry, gitignore = self._get_gitignore(path)
  File "/research-home/itcarroll/.local/lib/python2.7/site-packages/dvc/scm.py", line 90, in _get_gitignore
    raise FileNotInRepoError(path)
FileNotInRepoError: /research-home/itcarroll/tmp/classify/data/Posts.xml.tgz

Failed to add {}: /research-home/itcarroll/tmp/classify/data/Posts.xml.tgz

Seeing the "Removing ..." step is what got me here. Thanks for adding this feature!

efiop commented Jun 14, 2018

Hi @itcarroll !

> As part of this enhancement, I'd caution against the step of moving the original to the cache (which I think is the usual and good behavior for internal deps/outs). My motivation for using external deps is the common case of multiple code repositories depending on the same dataset (with each repository being configured to have its own external cache).

We never move dependencies to the cache unless the dependency is also an output of some other stage. It has not been well tested yet, but starting with the upcoming 0.9.8 you will be able to use the same external cache dir for all of your projects and so avoid duplication.
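
To illustrate why a shared cache dir avoids duplication, here is a minimal sketch of a content-addressed cache layout. It assumes the cache entry is named after the MD5 hex digest of the file contents, split into a 2-character directory and a 30-character file name, which is consistent with the path `59/88519f8465218abb23ce0e0e8b1384` in the log above. The function name `cache_path` is hypothetical and this is not DVC's actual code.

```python
import hashlib
import os


def cache_path(cache_dir, data):
    """Map file contents to a content-addressed location under cache_dir.

    Assumption: the entry name is the MD5 hex digest of the contents,
    split as <first 2 chars>/<remaining 30 chars>.
    """
    digest = hashlib.md5(data).hexdigest()
    return os.path.join(cache_dir, digest[:2], digest[2:])


# Two projects that point at the same cache dir compute the same entry
# for identical contents, so the data only needs to be stored once.
shared = "/nfs/shared-data/classify"
project_a_entry = cache_path(shared, b"same dataset bytes")
project_b_entry = cache_path(shared, b"same dataset bytes")
print(project_a_entry == project_b_entry)
```

With per-project cache dirs the two entries would live under different roots, which is where the duplication comes from.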

> FYI, I had tried to simply symlink the external deps/outs into the repo ... which failed :)
> Having set dvc config cache.dir /nfs/shared-data/classify after the dvc init step of the tutorial, I had:

Yeah, we do not support symlinks as deps right now, but that is a great idea! I've added #774 to our todo list. Thank you for the feedback!
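
One plausible reading of the FileNotInRepoError in the traceback above is that a path under the symlinked data directory resolves to a location outside the repository. The thread does not confirm this, and the sketch below is not DVC's code: `resolves_outside` and `demo` are hypothetical names, and the directory names only mirror the listing in the report. It shows a generic way to check whether a path, once symlinks are followed, still lies under a repository root (POSIX paths assumed).

```python
import os
import tempfile


def resolves_outside(root, path):
    """Return True if path, after following symlinks, lies outside root."""
    real_root = os.path.realpath(root)
    real_path = os.path.realpath(path)
    return os.path.commonpath([real_root, real_path]) != real_root


def demo():
    """Recreate the layout from the report in a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        shared = os.path.join(tmp, "shared-data")
        repo = os.path.join(tmp, "classify")
        os.makedirs(shared)
        os.makedirs(repo)
        # data -> shared-data, as in the `ls -l` output
        os.symlink(shared, os.path.join(repo, "data"))
        regular = resolves_outside(repo, os.path.join(repo, "conf.py"))
        linked = resolves_outside(repo, os.path.join(repo, "data", "Posts.xml.tgz"))
        return regular, linked


print(demo())
```

A file that lives directly in the repository stays inside the root, while a file reached through the `data` symlink resolves under the shared directory instead.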

efiop added a commit to efiop/dvc that referenced this issue Jun 15, 2018
Fixes iterative#764

Signed-off-by: Ruslan Kuprieiev <[email protected]>