Make 'dvc add directory' add all files inside directory #367

Closed
efiop opened this issue Jan 3, 2018 · 24 comments
Labels
enhancement Enhances DVC

Comments

@efiop
Contributor

efiop commented Jan 3, 2018

No description provided.

@efiop efiop added the enhancement Enhances DVC label Jan 3, 2018
@efiop efiop self-assigned this Jan 3, 2018
@efiop
Contributor Author

efiop commented Jan 18, 2018

Leaving for the next release.

@efiop
Contributor Author

efiop commented Jan 19, 2018

Returning this to the current release, as #434 made it clear that people need this feature.

@efiop
Contributor Author

efiop commented Jan 20, 2018

Plus, we need to add the ability to add a directory as a whole (discussed this IRL with @dmpetrov). Relevant for the Parquet scenario.

@Nicolabo

Nicolabo commented May 18, 2018

Hi, great tool! I really appreciate it!
A few words about my project: I recently took over an ML project. The data is stored in Amazon Athena. During data preparation, I query the data, load it onto EC2, and do some data manipulation. Along the way I create hundreds of CSVs, which I store in several directories. Thus my question: Is it possible to treat directory as dependencies or output when executing dvc run?

For example, I was trying to accomplish it based on your tutorial. I created the necessary directories and rewrote the code in code/conf.py (with proper paths).

This is my modified code (based on the code in the tutorial):

dvc run -d data/tgz/Posts.xml.tgz -o data/xml/ tar zxf data/tgz/Posts.xml.tgz -C data/xml/
dvc run -d data/xml/ -d code/xml_to_tsv.py -d code/conf.py -o data/tsv/ python code/xml_to_tsv.py
dvc run -d data/tsv/ -d code/split_train_test.py -d code/conf.py -o data/tt/  python code/split_train_test.py 0.33 20180319
dvc run -d code/featurization.py -d code/conf.py -d data/tt -o data/matrix_t/ python code/featurization.py
dvc run -d data/model/model.p -d data/matrix_t/matrix-test.p -d code/evaluate.py -d code/conf.py -o data/eval.txt -f Dvcfile python code/evaluate.py

Then, when I want to create a new branch, modified code/featurization.py, and execute dvc repro, nothing happens. Like there was no changes in code.

Probably I am missing something.

I read about your tool just a few days ago and have been excited ever since! Please let me know what's wrong.

Thanks,

@efiop
Contributor Author

efiop commented May 18, 2018

Hi @Nicolabo !

Is it possible to treat directory as dependencies or output when executing dvc run?

Yes, just specify it as a dependency/output in dvc run, no different from specifying single files.

Then, when I want to create a new branch, modified code/featurization.py, and execute dvc repro, nothing happens. Like there was no changes in code.

Your commands seem to be correct, but unfortunately I can't put my finger on anything without seeing the full code. Btw, what version (dvc --version) are you using? If it is an older version (the current one is 0.9.7), could you try upgrading to see if the problem persists? Could you also show the output of dvc repro -v (notice -v for verbosity) after you modify your code/featurization.py?

Thanks,
Ruslan

@Nicolabo

Nicolabo commented May 18, 2018

Hi Ruslan,
Indeed, I had version 0.9.6. But after upgrading, the problem still exists.

This is my dvc repro output after I check out a new branch:

dvc repro -v
updater is not old enough to check for updates
Dependency 'data/eval.txt' didn't change
Dependency 'data/model/model.p' didn't change
Dependency 'data/matrix_t/matrix-test.p' didn't change
Dependency 'code/evaluate.py' didn't change
Dependency 'code/conf.py' didn't change
Dvc file 'Dvcfile' didn't change

It looks like it only checks Dvcfile.

I should also mention that I store Dvcfile in the project directory (.) and all .dvc files in a custom directory (./dvcfiles/) (by using -f in each dvc run command). Is that the right approach?

Thanks,

@efiop
Contributor Author

efiop commented May 18, 2018

Hm, the approach is correct, as long as you adjust your commands properly.

Could you share your git repo with me somehow? Then I can take a look and not bother you with a dozen questions ;)

@Nicolabo

Sure, you can find it here.

@efiop
Contributor Author

efiop commented May 18, 2018

Thank you!

The first thing I notice is that your dvcfiles under ./dvcfiles/ have improper paths. For example:

(python3-dvc) ➜  dvc_test git:(master) cat dvcfiles/tsv.dvc
cmd: python code/xml_to_tsv.py
deps:
- md5: d1cc6d9c7d0e872e93958c8dd04b411d.dir
  path: data/xml
- md5: 5b1d44d25ff148a90e6c40a1585fe68d
  path: code/xml_to_tsv.py
- md5: b68fa555607d0399f8de130b41c6c211
  path: code/conf.py
md5: 97a538b29de4c200578431e17e4427f2
outs:
- cache: true
  md5: 8c0e72f83ebdc09c69ed0a17af50c62a.dir
  path: data/tsv

Notice how the paths start with 'data/' (and 'code/'), which is not correct, because your command is run in the same directory where your dvcfile is placed. So tsv.dvc is run from the dvcfiles/ directory, which makes your paths wrong: they should actually be ../code/ and ../data/. Your main Dvcfile is correct though. Could you correct that and try again?

@Nicolabo

So does it mean that when I build a .dvc file, I should be in my dvcfiles directory? It seems that is the only way this code will work.

dvc run -d ../data/tgz/Posts.xml.tgz -o ../data/xml/ tar zxf ../data/tgz/Posts.xml.tgz -C ../data/xml/

@efiop
Contributor Author

efiop commented May 18, 2018

You don't have to be in that directory (please see the --cwd option for dvc run); you just need to make sure that your parameters and commands are written relative to that directory. The command you specified is correct, and if you want to run it from the root directory, just add --cwd dvcfiles to it to tell dvc that it should place the dvcfile in the dvcfiles directory. I understand that it is a bit confusing at first, but dvc requires -d/-o to be specified relative to the directory the dvc file is placed in because your command will be run from that same directory. That way the paths in your command and in your dvc run options stay the same.
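The relative-path rule can be illustrated with a small Python sketch. It only shows how root-relative paths translate into paths relative to the directory that holds the stage file; the helper name relative_to_stage_dir is hypothetical and is not part of dvc.

```python
import posixpath


def relative_to_stage_dir(path_from_root, stage_dir):
    """Rewrite a root-relative path so it is relative to the stage file's directory.

    Hypothetical helper for illustration only; dvc does not expose this function.
    """
    return posixpath.relpath(path_from_root, start=stage_dir)


if __name__ == "__main__":
    # Paths as they were written from the project root in this thread.
    for path in ["data/tgz/Posts.xml.tgz", "data/xml/", "code/conf.py"]:
        print(path, "->", relative_to_stage_dir(path, "dvcfiles"))
```

With the stage file in dvcfiles/, each path gains a leading ../, which matches the corrected command earlier in the thread.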

Please let me know if that worked for you.

@efiop
Contributor Author

efiop commented May 18, 2018

Ah, it is probably worth explaining why it gets a bit tricky when you want to store your dvcfiles separately in a sibling directory. We actually recommend that you store your dvcfiles either alongside your code or in the parent directory. The logic is the same as with Makefiles, really. That way your projects are neatly organized and have an easy-to-see hierarchy. If you really need some special tricky layout for your files, you might use hacks such as dvc run ... 'cd mydir && /path/to/mycode' (notice the quotes) or something like that, but that generally doesn't look good and is prone to bugs.

@Nicolabo

Nicolabo commented May 21, 2018

I created a new branch, removed all .dvc files, and started over based on your advice. Unfortunately, when I executed:

$ pwd
~/dvc_test

$ dvc run -d ../data/Posts.xml.tgz -o ../data/xml/ --cwd dvcfiles tar zxf ../data/Posts.xml.tgz  -C ../data/xml/
Using 'Posts.xml.dvc' as a stage file
Reproducing 'dvcfiles/Posts.xml.dvc':
	tar zxf ../data/Posts.xml.tgz -C ../data/xml/
Unexpected error: [Errno 2] No such file or directory: '/Users/miko/Documents/archive/dvc_test/dvcfiles': '/Users/miko/Documents/archive/dvc_test/dvcfiles'

What's wrong?

Also, I observed that when I execute dvc run using file dependencies, the .dvc file is created in the project root directory. However, when I use directory dependencies, it is created in a child directory. For example,

$ pwd
~/dvc_test
$ ls -a
.		..		.dvc		.git		.gitignore	code		data
$ dvc run -d data/Posts.xml.tgz -o data/xml/ tar zxf data/Posts.xml.tgz  -C data/xml/
Using 'data/xml.dvc' as a stage file
Reproducing 'data/xml.dvc':
	tar zxf data/Posts.xml.tgz -C data/xml/
$ ls -a
.		..		.dvc		.git		.gitignore	code		data

The file does not exist in the project root directory. When I look in the data folder:

$ ls data/
Posts.xml.tgz	matrix		model		tsv		tt		xml		xml.dvc

we have our xml.dvc file in the data folder. But that's wrong, because the paths in the .dvc file are not valid. Using file dependencies, the .dvc file is stored in the project root directory:

$ dvc run -d data/Posts.xml.tgz -o data/xml/Posts.xml tar zxf data/Posts.xml.tgz  -C data/xml/
Using 'Posts.xml.dvc' as a stage file
Reproducing 'Posts.xml.dvc':
	tar zxf data/Posts.xml.tgz -C data/xml/
$ ls -a
.		..		.dvc		.git		.gitignore	Posts.xml.dvc	code		data

Based on the documentation, --cwd should solve this problem. However, it doesn't; the .dvc file is still stored in the data folder.

$ dvc run --cwd . -d data/Posts.xml.tgz -o data/xml/ tar zxf data/Posts.xml.tgz  -C data/xml/

Using ../, it couldn't find the repository.

$ dvc run --cwd ../ -d data/Posts.xml.tgz -o data/xml/ tar zxf data/Posts.xml.tgz  -C data/xml/
Using 'data/xml.dvc' as a stage file
Failed to run command: Output file '../data/xml' error: outside of repository

@efiop
Contributor Author

efiop commented May 21, 2018

Hi @Nicolabo !

What's wrong?

That particular error is caused by the non-existent dvcfiles dir, as the error says. I tried running your command with all the dirs created and it finished successfully. That being said, there is definitely a bug with dvc file placement: it should indeed be placed in ./dvcfiles and not in ./data when --cwd dvcfiles is passed.

Thank you for the analysis, great catch!

I've created #720 and will merge a fix soon, which will be released in 0.9.8 (early June). For now, there is an easy workaround: just add -f mydvcfilename.dvc to your dvc run --cwd ... command.

@Nicolabo

Nicolabo commented May 21, 2018

Something is still wrong. I used this dvc run -c . -f file.dvc ... approach (storing all .dvc files in the project root directory):

dvc run -c . -f unarchive.dvc -d data/Posts.xml.tgz -o data/xml/ tar zxf data/Posts.xml.tgz  -C data/xml/
dvc run -c . -f tsv.dvc -d data/xml/ -d code/xml_to_tsv.py -d code/conf.py -o data/tsv/ python code/xml_to_tsv.py
dvc run -c . -f split.dvc -d data/tsv/ -d code/split_train_test.py -d code/conf.py -o data/tt/  python code/split_train_test.py 0.33 20180319
dvc run -c . -f feature.dvc -d code/featurization.py -d code/conf.py -d data/tt -o data/matrix/ python code/featurization.py
dvc run -c . -f model.dvc -d data/matrix/matrix-train.p -d code/train_model.py -d code/conf.py -o data/model/model.p python code/train_model.py 20180319
dvc run -c . -f Dvcfile -d data/model/model.p -d data/matrix/matrix-test.p -d code/evaluate.py -d code/conf.py -o data/eval.txt python code/evaluate.py

When I executed dvc repro -v, it turned out that it sees only a few of the .dvc files.

$ dvc repro -v
updater is not old enough to check for updates
Dependency 'data/model/model.p' didn't change
Dependency 'data/matrix/matrix-train.p' didn't change
Dependency 'code/train_model.py' didn't change
Dependency 'code/conf.py' didn't change
Dvc file 'model.dvc' didn't change
Dependency 'data/eval.txt' didn't change
Dependency 'data/model/model.p' didn't change
Dependency 'data/matrix/matrix-test.p' didn't change
Dependency 'code/evaluate.py' didn't change
Dependency 'code/conf.py' didn't change
Dvc file 'Dvcfile' didn't change

Interestingly, it sees only those .dvc files in which file dependencies are used (it omits those with dir dependencies).

@efiop
Contributor Author

efiop commented May 21, 2018

The reason for this behavior is that right now dvc can only track directories as a whole. It cannot tell that a dependency file you are pointing to is actually part of an output directory from another stage, and thus it cannot build a graph and reproduce your pipeline the way you want. E.g.:

dvc run -c . -f model.dvc -d data/matrix/matrix-train.p -d code/train_model.py -d code/conf.py -o data/model/model.p python code/train_model.py 20180319

In this stage you specify data/matrix/matrix-train.p as a dependency, but you've specified data/matrix as an output in the previous stage. To make this pipeline work, you need to use -d data/matrix in this stage or -o data/matrix/matrix-train.p in the previous one.
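To make the graph problem concrete, here is a minimal Python sketch. It is not dvc's implementation; it is a toy model that assumes stages are linked by comparing each dependency path with each output path. The STAGES dictionary, build_edges, exact, and is_inside are all hypothetical names, and the stage contents mirror the two commands discussed above.

```python
import posixpath

# Hypothetical stage descriptions mirroring two commands from this thread.
STAGES = {
    "feature.dvc": {
        "deps": ["code/featurization.py", "code/conf.py", "data/tt"],
        "outs": ["data/matrix"],
    },
    "model.dvc": {
        "deps": ["data/matrix/matrix-train.p", "code/train_model.py", "code/conf.py"],
        "outs": ["data/model/model.p"],
    },
}


def exact(dep, out):
    """Link stages only when the dependency path equals the output path."""
    return posixpath.normpath(dep) == posixpath.normpath(out)


def is_inside(dep, out):
    """Link stages when the dependency equals the output or lies beneath it."""
    dep = posixpath.normpath(dep)
    out = posixpath.normpath(out)
    return dep == out or dep.startswith(out + "/")


def build_edges(stages, match):
    """Return sorted (upstream, downstream) pairs where a dep matches an out."""
    edges = set()
    for down_name, down in stages.items():
        for up_name, up in stages.items():
            if up_name == down_name:
                continue
            if any(match(d, o) for d in down["deps"] for o in up["outs"]):
                edges.add((up_name, down_name))
    return sorted(edges)


if __name__ == "__main__":
    print("exact matching:      ", build_edges(STAGES, exact))
    print("containment matching:", build_edges(STAGES, is_inside))
```

Under exact matching the two stages stay disconnected, which corresponds to the missing stages in the dvc repro -v output. Containment matching links them, which is roughly the behavior requested in #722.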

Created #722 for that. Should be ready for 0.9.8.

Thank you a lot for the feedback, we really appreciate it!

@Nicolabo

Nicolabo commented May 21, 2018

Sure, but something is still wrong. I did what you proposed (use a dir dependency wherever the previous step defines a dir output).

$ ls
Dvcfile		code		data		feature.dvc	model.dvc	split.dvc	tsv.dvc		unarchive.dvc
$ ls data/
Posts.xml.tgz	eval.txt	matrix		model		split		tsv		xml
$ dvc repro -v
updater is not old enough to check for updates
Dependency 'data/xml' didn't change
Dependency 'data/Posts.xml.tgz' didn't change
Dvc file 'unarchive.dvc' didn't change
Dependency 'data/tsv' didn't change
Dependency 'data/xml' didn't change
Dependency 'code/xml_to_tsv.py' didn't change
Dependency 'code/conf.py' changed
Dvc file 'tsv.dvc' changed
Removing 'data/tsv'
Reproducing 'tsv.dvc':
	python code/xml_to_tsv.py
Traceback (most recent call last):
  File "code/xml_to_tsv.py", line 51, in <module>
    with open(OUTPUT, 'w') as fd_out:
FileNotFoundError: [Errno 2] No such file or directory: 'data/tsv/Posts.tsv'
Traceback (most recent call last):
  File "/anaconda3/lib/python3.6/site-packages/dvc/project.py", line 179, in _reproduce_stages
    result += self._reproduce_stage(stages, n, force)
  File "/anaconda3/lib/python3.6/site-packages/dvc/project.py", line 158, in _reproduce_stage
    stage = stages[node].reproduce(force=force)
  File "/anaconda3/lib/python3.6/site-packages/dvc/stage.py", line 131, in reproduce
    self.run()
  File "/anaconda3/lib/python3.6/site-packages/dvc/stage.py", line 229, in run
    raise StageCmdFailedError(self)
dvc.stage.StageCmdFailedError: Stage 'tsv.dvc' cmd python code/xml_to_tsv.py failed

So although I built all the necessary steps, it reports that Dvc file 'tsv.dvc' changed. What is more interesting, it then deletes my data/tsv/Posts.tsv file and throws an error that the file doesn't exist.

@efiop
Contributor Author

efiop commented May 21, 2018

Have you done the conversions in all the steps or just in that one? Dvc removes outputs before reproduction, so I suppose that is precisely what happened here. 'tsv.dvc' changed because conf.py changed. I would look into conf.py and the dvc run commands to make sure that the dirs don't overlap.
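The effect of removing outputs before reproduction can be reproduced in isolation. The following Python sketch is a toy model, not dvc code: reproduce, write_tsv, and demo are hypothetical names, and the only behavior taken from this thread is that declared outputs are deleted before the command runs.

```python
import os
import shutil
import tempfile


def reproduce(outs, command):
    """Toy model of a reproduction step: delete declared outputs, then run."""
    for out in outs:
        if os.path.isdir(out):
            shutil.rmtree(out)
        elif os.path.exists(out):
            os.remove(out)
    command()


def write_tsv(path):
    # Like the failing script, this assumes the parent directory exists.
    with open(path, "w") as fd_out:
        fd_out.write("id\ttext\n")


def demo(out_is_dir):
    """Report what happens when the output is declared as a dir or as a file."""
    root = tempfile.mkdtemp()
    try:
        tsv_dir = os.path.join(root, "data", "tsv")
        target = os.path.join(tsv_dir, "Posts.tsv")
        os.makedirs(tsv_dir)
        write_tsv(target)  # stands in for the result of an earlier run
        outs = [tsv_dir] if out_is_dir else [target]
        try:
            reproduce(outs, lambda: write_tsv(target))
        except FileNotFoundError:
            return "failed: parent directory was removed"
        return "ok"
    finally:
        shutil.rmtree(root, ignore_errors=True)


if __name__ == "__main__":
    print("dir output: ", demo(out_is_dir=True))
    print("file output:", demo(out_is_dir=False))
```

When the output is declared as a directory, the whole directory disappears and a script that does not recreate it fails with FileNotFoundError; when the output is a single file, only that file is removed and the write succeeds.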

I looked into your commands and the repo that you shared with me previously, and managed to make it run properly with this script:

#!/usr/bin/bash

set -x
set -e

rm -rf data
mkdir -p data/xml
mkdir -p data/tsv
mkdir -p data/tt
mkdir -p data/model
mkdir -p data/matrix_t

S3_DIR=https://s3-us-west-2.amazonaws.com/dvc-share/so
wget -P data $S3_DIR/10K/Posts.xml.tgz

dvc add data/Posts.xml.tgz

dvc run -f unarchive.dvc -d data/Posts.xml.tgz -o data/xml/ tar zxf data/Posts.xml.tgz  -C data/xml/
dvc run -f tsv.dvc -d data/xml/ -d code/xml_to_tsv.py -d code/conf.py -o data/tsv/ python code/xml_to_tsv.py
dvc run -f split.dvc -d data/tsv/ -d code/split_train_test.py -d code/conf.py -o data/tt/  python code/split_train_test.py 0.33 20180319
dvc run -f feature.dvc -d code/featurization.py -d code/conf.py -d data/tt/ -o data/matrix_t/matrix-train.p python code/featurization.py
dvc run -f model.dvc -d data/matrix_t/matrix-train.p -d code/train_model.py -d code/conf.py -o data/model/model.p python code/train_model.py 20180319
dvc run -f Dvcfile -d data/model/model.p -d data/matrix_t/matrix-test.p -d code/evaluate.py -d code/conf.py -o data/eval.txt python code/evaluate.py

I had to correct some dir names (e.g. matrix -> matrix_t and so on) and create some dirs to make it work with your code. Please let me know if this solves your issue.

Thanks,
Ruslan

@Nicolabo

Nicolabo commented May 22, 2018

Cool. Your code works, although I don't get it. Below, you define the data/matrix_t/matrix-train.p file output,

dvc run -f feature.dvc -d code/featurization.py -d code/conf.py -d data/tt/ -o data/matrix_t/matrix-train.p python code/featurization.py    

but in

dvc run -f Dvcfile -d data/model/model.p -d data/matrix_t/matrix-test.p -d code/evaluate.py -d code/conf.py -o data/eval.txt python code/evaluate.py 

the file dependency data/matrix_t/matrix-test.p was not previously added to pipeline. I assumed that it should not work because you didn't determine explicitly this file in pipeline.

When I did the same thing but replaced the matrix_t file dependency/output with a dir dependency/output, it doesn't work after I modify the code/featurization.py file.

Dvc file 'feature.dvc' changed
Removing 'data/matrix'
Reproducing 'feature.dvc':
	python code/featurization.py
The input data frame data/split/Posts-train.tsv size is (6699, 3)
The output matrix data/matrix/matrix-train.p size is (6699, 10002) and data type is float64
Traceback (most recent call last):
  File "code/featurization.py", line 59, in <module>
    save_matrix(df_train, train_words_tfidf_matrix, train_output)
  File "code/featurization.py", line 43, in save_matrix
    with open(output, 'wb') as fd:
FileNotFoundError: [Errno 2] No such file or directory: 'data/matrix/matrix-train.p'
Traceback (most recent call last):
  File "/anaconda3/lib/python3.6/site-packages/dvc/project.py", line 179, in _reproduce_stages
    result += self._reproduce_stage(stages, n, force)
  File "/anaconda3/lib/python3.6/site-packages/dvc/project.py", line 158, in _reproduce_stage
    stage = stages[node].reproduce(force=force)
  File "/anaconda3/lib/python3.6/site-packages/dvc/stage.py", line 131, in reproduce
    self.run()
  File "/anaconda3/lib/python3.6/site-packages/dvc/stage.py", line 229, in run
    raise StageCmdFailedError(self)
dvc.stage.StageCmdFailedError: Stage 'feature.dvc' cmd python code/featurization.py failed

@efiop
Contributor Author

efiop commented May 22, 2018

Hi @Nicolabo !

I assumed that it should not work because you didn't determine explicitly this file in pipeline.

That is correct, you actually need to add -o data/matrix_t/matrix-test.p to that first command. The reason it works as is, is that matrix-test.p is generated alongside matrix-train.p, which the second stage depends on anyway, thus triggering dvc repro for both.

FileNotFoundError: [Errno 2] No such file or directory: 'data/matrix/matrix-train.p'

Your featurization.py tells us that this file is missing, so I suggest checking that the paths are correct (i.e. in your conf.py and your commands). It is hard for me to put my finger on anything more specific, as it is hard to follow the rapid mutations of the code.

Thanks,
Ruslan

@Nicolabo

Nicolabo commented May 22, 2018

Just to make sure that I didn't make a mistake, I cleared my local environment and cloned the repo once again.

I updated master and additionally created two more branches.

  • master branch stores your pipeline from the previous post (with dep/out data/matrix_t/matrix-test.p). When I run dvc repro -v it shows all steps. It works fine.
  • bigram branch inherits from master and stores a small change in the code/featurization.py file (the same as in your tutorial on Medium).
bag_of_words = CountVectorizer(stop_words='english',
                               max_features=6000,
                               ngram_range=(1, 2))

After I run dvc repro -v the model rebuilds perfectly. It works fine.

  • dir_dependency inherits from master and stores three modified dvc files. Specifically, I changed how the dependency data/matrix_t/matrix-train.p is referenced: it is now a dir dependency (i.e. data/matrix_t/) in each dvc file.
    After I run dvc repro -v the model rebuilds perfectly. It works fine.

However, when I create a new branch that inherits from dir_dependency and modify the code/featurization.py file as before,

bag_of_words = CountVectorizer(stop_words='english',
                               max_features=6000,
                               ngram_range=(1, 2))

and run dvc repro -v, I get an error:

dvc repro -v
updater is not old enough to check for updates
Dependency 'data/Posts.xml.tgz' didn't change
Dvc file 'data/Posts.xml.tgz.dvc' didn't change
Dependency 'data/xml' didn't change
Dependency 'data/Posts.xml.tgz' didn't change
Dvc file 'unarchive.dvc' didn't change
Dependency 'data/tsv' didn't change
Dependency 'data/xml' didn't change
Dependency 'code/xml_to_tsv.py' didn't change
Dependency 'code/conf.py' didn't change
Dvc file 'tsv.dvc' didn't change
Dependency 'data/tt' didn't change
Dependency 'data/tsv' didn't change
Dependency 'code/split_train_test.py' didn't change
Dependency 'code/conf.py' didn't change
Dvc file 'split.dvc' didn't change
Dependency 'data/matrix_t' didn't change
Dependency 'code/featurization.py' changed
Dependency 'code/conf.py' didn't change
Dependency 'data/tt' didn't change
Dvc file 'feature.dvc' changed
Removing 'data/matrix_t'
Reproducing 'feature.dvc':
	python code/featurization.py
The input data frame data/tt/Posts-train.tsv size is (6699, 3)
The output matrix data/matrix_t/matrix-train.p size is (6699, 6002) and data type is float64
Traceback (most recent call last):
  File "code/featurization.py", line 60, in <module>
    save_matrix(df_train, train_words_tfidf_matrix, train_output)
  File "code/featurization.py", line 43, in save_matrix
    with open(output, 'wb') as fd:
FileNotFoundError: [Errno 2] No such file or directory: 'data/matrix_t/matrix-train.p'
Traceback (most recent call last):
  File "/anaconda3/lib/python3.6/site-packages/dvc/project.py", line 179, in _reproduce_stages
    result += self._reproduce_stage(stages, n, force)
  File "/anaconda3/lib/python3.6/site-packages/dvc/project.py", line 158, in _reproduce_stage
    stage = stages[node].reproduce(force=force)
  File "/anaconda3/lib/python3.6/site-packages/dvc/stage.py", line 131, in reproduce
    self.run()
  File "/anaconda3/lib/python3.6/site-packages/dvc/stage.py", line 229, in run
    raise StageCmdFailedError(self)
dvc.stage.StageCmdFailedError: Stage 'feature.dvc' cmd python code/featurization.py failed

Interestingly, take a look at the messages before the error.

...
Dvc file 'feature.dvc' changed
Removing 'data/matrix_t'
Reproducing 'feature.dvc':
...

It detects that something changed in feature.dvc and removes the data/matrix_t folder. I think this is the problem, because matrix-train.p can't be written: its directory no longer exists :).

When I look at bigram after the code change and run dvc repro -v, the model rebuilds correctly. Below is a subset of the messages:

Dvc file 'feature.dvc' changed
Removing 'data/matrix_t/matrix-train.p'
Reproducing 'feature.dvc':

So it removes the file rather than the dir. Then it finds the path when it rebuilds matrix-train.p.

Hope it's clear now.

@efiop
Contributor Author

efiop commented May 22, 2018

Hi @Nicolabo !

Ok, it is much clearer now, thank you! In your dir_dependency branch, your feature.dvc has an output data/matrix_t (https://github.com/Nicolabo/dvc_test/blob/dir_dependency/feature.dvc#L13), but in your master and bigram branches it has an output data/matrix_t/matrix-train.p. Dvc removes outputs before stage reproduction. That is why in the dir_dependency branch it removes the data/matrix_t directory, which you've specified as an output, but in the other branches it removes only the matrix-train.p file. The error you get from your code/featurization.py file simply says that it couldn't find a part of the path, because the data/matrix_t dir is removed before reproduction. You simply need to modify code/featurization.py to create the data/matrix_t directory if it doesn't exist, before actually creating data/matrix_t/matrix-train.p. Hope this helps.
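The suggested change can be sketched in a few lines of Python. The actual save_matrix in code/featurization.py is not shown in full in this thread, so the function below is a hypothetical stand-in named save_pickle; the relevant part is the os.makedirs call with exist_ok=True before the file is opened.

```python
import os
import pickle
import tempfile


def save_pickle(obj, output):
    """Pickle obj to output, creating the parent directory first if needed.

    Hypothetical stand-in for the script's own save function.
    """
    parent = os.path.dirname(output)
    if parent:
        # No error if the directory already exists (Python 3.2+).
        os.makedirs(parent, exist_ok=True)
    with open(output, "wb") as fd:
        pickle.dump(obj, fd)


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    # The directory data/matrix_t does not exist yet, as after its removal.
    target = os.path.join(root, "data", "matrix_t", "matrix-train.p")
    save_pickle({"rows": 6699}, target)
    print(os.path.isfile(target))
```

Because the directory is recreated on every run, the script works whether the stage declares the single file or the whole directory as its output.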

Thanks,
Ruslan

@Nicolabo

You're right. I should've thought about it. Anyway, thanks for your help :)

@efiop
Contributor Author

efiop commented May 22, 2018

Always happy to help :) Please let me know if that worked for you. And thank you for the feedback, we genuinely appreciate it!
