Description
Training many models with different parameters has become much easier with the parameterization feature; it is a great improvement that deduplicates `dvc.yaml`. However, one thing still blocks us from training multiple models asynchronously.
```yaml
stages:
  train:
    foreach: ${models}
    do:
      cmd: julia --project=train train/main.jl data/train/data.h5 models/${item.name} ${item.config}
      deps:
        - data/train/data.h5
        - train
      outs:
        - models/${item.name}
```
Naturally, we want to schedule multiple jobs asynchronously on different devices without errors, e.g.:
```sh
mkdir log
for i in 0 1 2 3; do
  CUDA_VISIBLE_DEVICES=$i dvc repro train@model_$i > log/model_$i.txt 2>&1 &
  sleep 2
done
```
but since `data/train/data.h5` will be locked, we can only run one job at a time. (We could work around this by creating multiple copies/symlinks, but that's not elegant...)
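For reference, the copies/symlinks workaround could look like the sketch below. The per-job directory names are my own invention, and each job's stage in `dvc.yaml` would also have to point at its own link, which is exactly the duplication the parameterization feature was meant to remove:

```shell
# Workaround sketch: give each job its own symlink to the dataset so that
# no two stages declare the same dependency path. Directory layout is made up.
mkdir -p log
for i in 0 1 2 3; do
  mkdir -p data/train/job_$i
  ln -sf "$(pwd)/data/train/data.h5" data/train/job_$i/data.h5
done
```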
I'm wondering if it's possible to introduce a looser version of the read lock that users can opt into from `dvc.yaml`, e.g.:
```diff
 stages:
   train:
     foreach: ${models}
     do:
       cmd: julia --project=train train/main.jl data/train/data.h5 models/${item.name}
       deps:
-        - data/train/data.h5
+        - data/train/data.h5:readonly
         - train
       outs:
         - models/${item.name}
```
When users add this property, they are explicitly saying "okay, I plan to use this dependency in a read-only way, and I'll take responsibility for whatever bugs may occur due to my improper usage". `dvc` could then choose not to add an entry for it in `rwlock`, which enables concurrency.
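To make the idea concrete, here is a hypothetical sketch (not DVC's actual implementation) of how a dependency entry carrying the proposed `:readonly` suffix could be split into a path and a flag that tells the lock machinery to skip the `rwlock` entry:

```python
# Hypothetical sketch of parsing the proposed ":readonly" suffix.
# The suffix name and helper are assumptions, not existing DVC API.
READONLY_SUFFIX = ":readonly"

def parse_dep(entry: str) -> tuple[str, bool]:
    """Return (path, readonly) for a dvc.yaml dependency entry."""
    if entry.endswith(READONLY_SUFFIX):
        return entry[: -len(READONLY_SUFFIX)], True
    return entry, False

# Deps flagged readonly would simply be excluded when rwlock entries are built.
path, readonly = parse_dep("data/train/data.h5:readonly")
```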
I'm not sure if DVC has plans to add native support for concurrent job scheduling. With #4976, it would be very promising if `dvc repro train@* -s` scheduled multiple jobs in parallel.
It would also be nice to support environment variable passing, but that is already doable by passing `params.yaml` values to the language's own utilities (e.g., `os.environ["CUDA_VISIBLE_DEVICES"] = config["gpu_device"]`).
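A minimal sketch of that approach, assuming `params.yaml` carries a hypothetical `gpu_device` key. In a real project the config would come from `yaml.safe_load(open("params.yaml"))`; a plain dict stands in here so the example is self-contained:

```python
import os

# Stand-in for the dict loaded from params.yaml; "gpu_device" is an assumed key.
config = {"gpu_device": "2"}

# Export the device before any CUDA-using library is initialized,
# so the training process only sees the assigned GPU.
os.environ["CUDA_VISIBLE_DEVICES"] = str(config["gpu_device"])
```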