Wildcards in pipeline dependencies #5252

Closed
amin-nejad opened this issue Jan 12, 2021 · 3 comments
@amin-nejad
amin-nejad commented Jan 12, 2021

It would be a lot easier and more robust if pipeline dependencies could include wildcards. For instance, let's say I have a data step in my pipeline which makes use of some source files. I'd like to specify all relevant Python files in a given directory as dependencies:

dvc.yaml:

stages:
    data:
        cmd: python -m path.to.data_generation_script
        deps:
          - src/data/*.py
        ...

However, that doesn't work as wildcards aren't recognised. Instead, I can specify the directory itself (src/data) but this appears to include __pycache__ and .ipynb_checkpoints (and anything else that may be in that directory) which is obviously not desired behaviour and results in the dvc hash being different to someone who has freshly cloned the repo.

@skshetry
Collaborator

this appears to include __pycache__ and .ipynb_checkpoints (and anything else that may be in that directory) which is obviously not desired behaviour and results in the dvc hash being different to someone who has freshly cloned the repo.

You can specify what to ignore in a .dvcignore file (it's similar to .gitignore). That'll help.
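To make the effect concrete, here is a minimal Python sketch of why ignoring cache directories keeps a directory-level checksum stable. It is a toy stand-in and does not reproduce DVC's actual hashing. The function names (`toy_dir_hash`, `is_ignored`, `demo`) and the file name `make_dataset.py` are hypothetical. The two patterns mirror what one would presumably list, one per line, in .dvcignore for this case.

```python
import fnmatch
import hashlib
import tempfile
from pathlib import Path

# Patterns one might list in .dvcignore for this case (illustrative).
IGNORE_PATTERNS = ["__pycache__", ".ipynb_checkpoints"]


def is_ignored(rel_path: Path, patterns) -> bool:
    """True if any component of the relative path matches an ignore pattern."""
    return any(
        fnmatch.fnmatch(part, pat) for part in rel_path.parts for pat in patterns
    )


def toy_dir_hash(root: Path, patterns=()) -> str:
    """Hash relative paths and contents of all non-ignored files under root.

    A toy directory checksum for illustration, NOT DVC's real algorithm.
    """
    digest = hashlib.sha256()
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        if not path.is_file() or is_ignored(rel, patterns):
            continue
        digest.update(rel.as_posix().encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()


def demo() -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "src" / "data"
        src.mkdir(parents=True)
        (src / "make_dataset.py").write_text("print('hello')\n")

        # State of a fresh clone: only the source file exists.
        fresh_plain = toy_dir_hash(src)
        fresh_filtered = toy_dir_hash(src, IGNORE_PATTERNS)

        # Simulate running the code locally: Python leaves a bytecode cache.
        cache = src / "__pycache__"
        cache.mkdir()
        (cache / "make_dataset.cpython-39.pyc").write_bytes(b"\x00bytecode")

        used_plain = toy_dir_hash(src)
        used_filtered = toy_dir_hash(src, IGNORE_PATTERNS)

    return {
        "plain_changed": fresh_plain != used_plain,
        "filtered_changed": fresh_filtered != used_filtered,
    }


if __name__ == "__main__":
    print(demo())
```

In this sketch the unfiltered checksum changes once the cache directory appears, while the filtered one stays the same, which is the behaviour the ignore file is meant to provide for a directory dependency.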

Regarding the globbing, I am not sure as it might hurt reproducibility.

@amin-nejad
Author

Of course, I should have thought of that! That actually solves my problem as I can just add __pycache__ and .ipynb_checkpoints to .dvcignore and specify the whole directory as a dependency. Thanks!

I will just close the issue since I don't need it anymore but it could still be useful for someone who has multiple file types in a directory and only wants to specify a subset of them - not sure how niche that would be though. I think it would actually be better for reproducibility as it's easier to glob than specify individual files in that directory and maintain that list over time in the dvc.yaml. And it's still more restrictive than specifying the whole directory which is already allowed. I don't know.
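For anyone who does want only a subset of file types, one workaround is to expand the pattern outside DVC and paste the resulting explicit list into dvc.yaml. The sketch below is not a DVC feature. The helpers `expand_deps` and `render_deps_block` and the example file names are hypothetical, and the indentation simply follows the dvc.yaml snippet at the top of this issue.

```python
import tempfile
from pathlib import Path


def expand_deps(repo_root: Path, pattern: str) -> list:
    """Return sorted, repo-relative POSIX paths of files matching the pattern."""
    return sorted(
        p.relative_to(repo_root).as_posix()
        for p in repo_root.glob(pattern)
        if p.is_file()
    )


def render_deps_block(paths, indent: int = 8) -> str:
    """Format the paths as a 'deps:' list suitable for pasting into a stage."""
    pad = " " * indent
    lines = [f"{pad}deps:"] + [f"{pad}  - {p}" for p in paths]
    return "\n".join(lines)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        data_dir = root / "src" / "data"
        (data_dir / "__pycache__").mkdir(parents=True)
        # Two Python sources, one non-Python file, and one cached bytecode file.
        (data_dir / "load.py").write_text("")
        (data_dir / "clean.py").write_text("")
        (data_dir / "notes.md").write_text("")
        (data_dir / "__pycache__" / "load.cpython-39.pyc").write_bytes(b"")

        paths = expand_deps(root, "src/data/*.py")
    return paths, render_deps_block(paths)


if __name__ == "__main__":
    paths, block = demo()
    print(block)
```

The list still has to be regenerated whenever files are added or removed, which is the maintenance cost described above; the upside is that the dependencies recorded in dvc.yaml remain explicit.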

@skshetry
Collaborator

DVC works best if the outputs and the dependencies are immutable. Adding glob support might make things inconsistent, as the set of dependencies would then depend on the state of the workspace.

We might need to support this on #331, but I am not sure of other scenarios where it could be useful (for example, in yours, you just needed to specify the directory as a dependency).
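One way to read the workspace concern, sketched in Python under simple assumptions: the same pattern resolves to a different set of files as soon as an extra file shows up in one person's checkout. The helper `resolve` and the file names are hypothetical, and this says nothing about how DVC itself would implement globbing.

```python
import tempfile
from pathlib import Path


def resolve(workspace: Path, pattern: str) -> list:
    """Resolve a glob pattern against the current contents of the workspace."""
    return sorted(p.relative_to(workspace).as_posix() for p in workspace.glob(pattern))


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        ws = Path(tmp)
        data_dir = ws / "src" / "data"
        data_dir.mkdir(parents=True)
        (data_dir / "load.py").write_text("")

        before = resolve(ws, "src/data/*.py")

        # A scratch file appears in this workspace only.
        (data_dir / "scratch.py").write_text("")

        after = resolve(ws, "src/data/*.py")
    return before, after


if __name__ == "__main__":
    before, after = demo()
    print(before)
    print(after)
```

With an explicit file list the stage definition alone determines the dependencies; with a pattern, the stage definition and the workspace contents together determine them.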
