Description
Right now, if DVC cannot get the global lock, it aborts.
I would find it useful to have an option for DVC to wait (a configurable amount of time, perhaps) until it can get the lock and then continue.
My use case is running multiple parallel dvc repro operations. Now that they can run in parallel, my workflows can be much more efficient, but they still each take the global lock while they're loading initial state, etc. If I submit several repro tasks to my cluster's batch scheduler, they may start at the same time, and thus conflict with each other for the global lock.
If they could each wait for the global lock, then the next one would be able to start as soon as the first has released the lock.
Right now my extremely ugly and unreliable workaround is to have the job script sleep for a random amount of time before starting dvc, and I usually still wind up with a couple of lock failures when starting 10 jobs.
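To make the requested behavior concrete, here is a rough sketch of a wait-and-retry wrapper that a job script could use in the meantime. It is not part of DVC. The function name run_with_lock_retry, its parameters, and the is_lock_failure callback are all hypothetical, and how to recognize a lock failure (exit code or error text) is an assumption the caller has to supply, since I am not sure what DVC reports in that case.

```python
import subprocess
import time


def run_with_lock_retry(cmd, is_lock_failure, timeout=300.0, interval=5.0,
                        run=subprocess.run, sleep=time.sleep,
                        clock=time.monotonic):
    """Run cmd, retrying while it fails because of the lock.

    cmd             -- argument list, e.g. ["dvc", "repro"]
    is_lock_failure -- hypothetical callback: takes the completed process
                       and returns True if the failure was a lock conflict
    timeout         -- total seconds to keep retrying
    interval        -- seconds to wait between attempts
    run/sleep/clock -- injectable for testing; defaults use the stdlib
    """
    deadline = clock() + timeout
    while True:
        result = run(cmd, capture_output=True, text=True)
        # Success, or a failure unrelated to the lock: hand it back as-is.
        if result.returncode == 0 or not is_lock_failure(result):
            return result
        # Lock conflict: give up if another wait would pass the deadline.
        if clock() + interval > deadline:
            return result
        sleep(interval)


# Example call (the substring check is an assumption about the error text):
#   result = run_with_lock_retry(
#       ["dvc", "repro"],
#       is_lock_failure=lambda r: "lock" in r.stderr.lower(),
#       timeout=600, interval=10,
#   )
```

This is still polling, so the next job only starts on its next retry rather than as soon as the lock is released, which is why a built-in wait option would be better.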