Description
It would be useful if I could parametrize my pipeline using environment variables, which could be read from a properties file specified using dvc config env my.properties. DVC would load those environment variables when running the command.
For example, I could have this properties file:
DVC_NICKNAME=David
And run:
dvc run -o hello.txt 'echo "Hello ${DVC_NICKNAME}!" > hello.txt'
dvc run -o cheers.txt 'echo "Cheers ${DVC_NICKNAME}!" > cheers.txt'
This would produce the files hello.txt and cheers.txt, containing "Hello David!" and "Cheers David!" respectively.
Users would just have to make sure to single-quote the command, so that the calling shell does not expand the variable before DVC sees it, or use interactive mode (#1415).
The DVC file would contain the variable reference:
cmd: echo "Hello ${DVC_NICKNAME}!" > hello.txt
The value would be added to the environment by DVC at DVC startup so it would be handled natively by the shell.
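A minimal sketch of how the properties file could be loaded into the environment at startup. The function names parse_properties and load_env are hypothetical and not part of DVC; the file format is assumed to be plain KEY=VALUE lines with optional # comments.

```python
import os


def parse_properties(lines):
    """Parse KEY=VALUE lines, skipping blank lines and # comments.

    Assumed format only; a real properties parser may support more syntax.
    """
    values = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # ignore lines that contain no "="
        values[key.strip()] = value.strip()
    return values


def load_env(path, environ=None):
    """Hypothetical startup hook: copy the file's values into the environment.

    Child processes started afterwards inherit these variables, so the shell
    can expand ${DVC_NICKNAME} on its own.
    """
    environ = os.environ if environ is None else environ
    with open(path, encoding="utf-8") as fobj:
        values = parse_properties(fobj)
    environ.update(values)
    return values
```

Calling load_env("my.properties") once before any stage command runs would be enough for the two example stages above, since the variable is then resolved by the shell itself.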
In order for dvc status to be able to detect that the variables used in a stage changed, we could calculate the internal md5 checksum on the contents with the variable values injected in place of the variable names, so that a changed value would be handled as if the contents of the DVC file had changed. This can be done using os.path.expandvars.
Unfortunately, this would only replace variable references used directly in the shell command; it would not cover cases where the environment variable is read inside a script. The only foolproof way would be to force the user to explicitly request the environment variables to be injected from the properties file, e.g. using dvc run -e DVC_NICKNAME -e DVC_OTHER. That would basically allow adding additional "env dependencies" to stages.
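A rough sketch of the checksum idea, combining both approaches: the command is expanded with os.path.expandvars, and any explicitly requested variables are hashed as well. The function stage_checksum and its env_deps parameter are hypothetical; this is not how DVC computes its checksums today.

```python
import hashlib
import os


def stage_checksum(cmd, env_deps=()):
    """md5 over the expanded command plus explicitly declared env values.

    env_deps stands in for the names passed via the proposed -e option.
    """
    digest = hashlib.md5()
    # Covers variables referenced directly in the command string.
    digest.update(os.path.expandvars(cmd).encode("utf-8"))
    # Covers variables that are only read inside a script.
    for name in sorted(env_deps):
        value = os.environ.get(name, "")
        digest.update(f"\0{name}={value}".encode("utf-8"))
    return digest.hexdigest()
```

With this, the checksum of 'echo "Hello ${DVC_NICKNAME}!" > hello.txt' changes whenever DVC_NICKNAME changes, while the checksum of a command such as python greet.py changes only if DVC_NICKNAME is listed in env_deps.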
It would also be nice to inject the variables into the paths of dependencies, so that those can be parametrized as well. This could also be done using os.path.expandvars. It would change the DAG dynamically, but AFAIK it should just work without breaking anything, right? As long as the environment is initialized at each DVC startup and expandvars is called when reading deps paths.
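For illustration, expanding dependency paths could look roughly like this. The helper expand_dep_paths and the variable DVC_DATASET are made up for the example; where exactly DVC would call such a helper is left open.

```python
import os


def expand_dep_paths(paths):
    """Hypothetical helper: expand env variable references in deps paths."""
    return [os.path.expandvars(path) for path in paths]


os.environ["DVC_DATASET"] = "v2"
deps = expand_dep_paths(["data/${DVC_DATASET}/train.csv", "src/train.py"])
```

One thing to keep in mind is that os.path.expandvars leaves references to unset variables unchanged, so a missing variable would silently yield a path that still contains the literal ${...} text unless DVC checked for that.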