Make 'dvc add directory' add all files inside directory #367
Comments
Leaving for the next release. |
Returning back to current release, as #434 made it clear that people need this feature. |
Plus, we need to add the ability to add a directory as a whole (discussed this IRL with @dmpetrov). Relevant for the parquet scenario. |
Hi, great tool! Really appreciate it! For example, I was trying to accomplish it based on your tutorial. I created the necessary directories and rewrote the code in code/conf.py (with proper paths). This is my modified code (based on the code in the tutorial):
Then, when I create a new branch, modify code/featurization.py, and execute Probably, I am missing something. I only read about your tool a few days ago and have been excited ever since! Please let me know what's wrong. Thanks, |
Hi @Nicolabo !
Yes, just specify it as a dependency/output in dvc run, no different from specifying single files.
Your commands seem to be correct, but unfortunately I can't put my finger on anything without seeing the full code. Btw, what version (dvc --version) are you using? If it is an older version (the current one is 0.9.7), could you try upgrading to see if the problem persists? Could you show the output of Thanks, |
Hi Ruslan, this is my
It looks like it only checks Dvcfile. I should also mention that I store Dvcfile in the project directory ( Thanks, |
Hm, the approach is correct, as long as you adjust your commands properly. Could you share your git repo with me somehow? So I can take a look and not bother you with a dozen of questions? ;) |
sure, you can find here |
Thank you! First thing that I notice is that your dvcfiles under ./dvcfiles/ have improper paths. I.e.
Notice how the paths are 'data/', which is not correct, because a command is run in the same directory where its dvcfile is placed. So tsv.dvc is run from the dvcfiles/ directory, which makes your paths wrong: they should actually be ../code/ and ../data/. Your main Dvcfile is correct though. Could you correct that and try again? |
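To illustrate the path rule described above, here is a minimal sketch of how a relative path written in a dvcfile resolves against the directory the dvcfile lives in. It is not DVC's own code; the helper `resolve_stage_path` and the file name `file.tsv` are hypothetical and only serve the illustration.

```python
import posixpath


def resolve_stage_path(dvcfile_path, relative_path):
    """Resolve a path from a dvcfile against the dvcfile's own directory.

    Hypothetical helper: it only mimics the rule that a stage command
    runs in the directory where its dvcfile is placed.
    """
    stage_dir = posixpath.dirname(dvcfile_path)
    return posixpath.normpath(posixpath.join(stage_dir, relative_path))


# 'file.tsv' is a made-up file name; the directories follow the thread.
wrong = resolve_stage_path("dvcfiles/tsv.dvc", "data/file.tsv")
right = resolve_stage_path("dvcfiles/tsv.dvc", "../data/file.tsv")

print(wrong)  # dvcfiles/data/file.tsv
print(right)  # data/file.tsv
```

With 'data/' the path ends up inside dvcfiles/, while '../data/' points at the data directory in the project root.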
So does it mean that when I build
|
You don't have to be in that directory (please see Please let me know if that worked for you. |
Ah, it is probably worth explaining why it gets a bit tricky when you want to store your dvcfiles separately in a sibling directory. We actually recommend that you either store your dvcfiles alongside your code or in the parent directory. The logic is the same as with Makefiles, really. That way your projects are neatly organized and have an easy-to-see hierarchy. If you really need some special tricky order for your files, you might use hacks such as |
I created a new branch, removed all
What's wrong? Also, I observed that when I execute
The file does not exist in the root project directory. When I move to
we have our
Based on documentation,
Using
|
Hi @Nicolabo !
That particular error is caused by the missing dvcfiles dir, as the error says. I tried running your command with all the dirs created and it finished successfully. That being said, there is definitely a bug with dvc file placement: the file should indeed be placed in ./dvcfiles and not in ./data when --cwd dvcfiles is passed. Thank you for the analysis, great catch! I've created #720 and will merge a fix soon; it will be released in 0.9.8 (early June). For now, there is an easy workaround: just add |
Still something is wrong. I used this
when I executed
Interestingly, it sees only those |
The reason for such behavior is that right now dvc can only track directories as a whole and cannot tell that a dependency file that you are pointing out is actually a part of output directory from another stage, and thus it cannot build a graph and reproduce your pipeline the way you want. E.g.:
In this stage you specify data/matrix/matrix-train.p as a dependency, but you've specified data/matrix as an output in previous stage. To make this pipeline work you need to use Created #722 for that. Should be ready for 0.9.8. Thank you a lot for the feedback, we really appreciate it! |
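The limitation described above can be sketched roughly as follows. This is not DVC's actual implementation, just an illustration under the assumption that stages are linked by comparing dependency paths with output paths. The stage name `matrix.dvc` and both helper functions are hypothetical.

```python
import posixpath


def find_producer_exact(dep, outputs_by_stage):
    """Link a dependency to a stage only if the paths are identical."""
    for stage, outs in outputs_by_stage.items():
        if dep in outs:
            return stage
    return None


def find_producer_with_dirs(dep, outputs_by_stage):
    """Also link a dependency that lies inside an output directory."""
    dep = posixpath.normpath(dep)
    for stage, outs in outputs_by_stage.items():
        for out in outs:
            out = posixpath.normpath(out)
            if dep == out or dep.startswith(out + "/"):
                return stage
    return None


# Hypothetical stage that outputs the whole data/matrix directory.
outputs_by_stage = {"matrix.dvc": ["data/matrix"]}
dep = "data/matrix/matrix-train.p"

print(find_producer_exact(dep, outputs_by_stage))      # None
print(find_producer_with_dirs(dep, outputs_by_stage))  # matrix.dvc
```

With exact matching the file dependency is not connected to any stage, so no edge is added to the graph. Declaring the directory itself as the dependency makes the exact match succeed.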
Sure, but something is still wrong. I did what you proposed (specifying the directory as a dependency when the previous step defines that directory as an output).
So although I built all necessary steps, it informs that |
Have you done conversions in all the steps or just in that one? DVC removes outputs before reproduction, so I suppose that is precisely what happened here. 'tsv.dvc' is changed because conf.py was changed. I would look into conf.py and I tried to look into your commands and the repo that you shared with me previously, and managed to make it run properly with this script:
I had to correct some dir names (i.e. matrix -> matrix_t and so on) and create some dirs to make it work with your code. Please let me know if this solves your issue. Thanks, |
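As a rough illustration of the behaviour mentioned above (a stage counts as changed when one of its dependencies changed, and its outputs are removed before it is reproduced), here is a small self-contained sketch. It assumes a plain md5 comparison against a recorded checksum and is not DVC's real code; all function names are hypothetical.

```python
import hashlib
import os
import shutil
import tempfile


def md5_of(path):
    """Return the md5 hex digest of a file's contents."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


def stage_changed(deps, recorded):
    """A stage is 'changed' if any dependency differs from its record."""
    return any(md5_of(p) != recorded.get(p) for p in deps)


def remove_outputs(outs):
    """Delete output files or directories before re-running a stage."""
    for out in outs:
        if os.path.isdir(out):
            shutil.rmtree(out)
        elif os.path.exists(out):
            os.remove(out)


# Throwaway workspace with a dependency file and an output directory.
workdir = tempfile.mkdtemp()
conf = os.path.join(workdir, "conf.py")
out_dir = os.path.join(workdir, "out")
os.mkdir(out_dir)
with open(conf, "w") as f:
    f.write("X = 1\n")

recorded = {conf: md5_of(conf)}
unchanged = stage_changed([conf], recorded)

with open(conf, "w") as f:
    f.write("X = 2\n")  # editing conf.py marks the stage as changed
changed = stage_changed([conf], recorded)

if changed:
    remove_outputs([out_dir])  # outputs are gone until the stage re-runs

print(unchanged, changed, os.path.exists(out_dir))  # False True False
```

This is why a missing file or directory right after an edit to conf.py usually means the stage that produces it has not been successfully re-run yet.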
Cool. Your code works, although I don't fully understand it. Below, you define
but in
the file dependency When I did the same thing, but replace
|
Hi @Nicolabo !
That is correct, you actually need to add
Your featurization.py tells us that this file is missing, so I suggest checking that the paths are correct (i.e. in your conf.py and your commands). It is hard for me to put my finger on anything more specific, because the code keeps changing rapidly and is hard to follow. Thanks, |
Just to make sure that I am not making a mistake, I cleared my local environment and cloned the repo once again. I updated master and additionally created two more branches.
After I run
However, when I build a new branch, which inherits from
and run
Interestingly, take a look at message before the error.
It detects something changed in feature.dvc and removes When I look at
So it removes the file rather than dir. Then it finds the path when rebuild Hope it's clear now. |
Hi @Nicolabo ! Ok, it is much clearer now, thank you! In your dir_deps branch, your feature.dvc has an output Thanks, |
You're right. I should've thought about it. Anyway, thanks for your help :) |
Always happy to help :) Please let me know if that worked for you. And thank you for the feedback, we genuinely appreciate it! |