Description
Training many models with different parameters has become much easier with the parameterization feature; it is a great improvement that deduplicates `dvc.yaml`. However, one thing still blocks us from training multiple models asynchronously.
```yaml
stages:
  train:
    foreach: ${models}
    do:
      cmd: julia --project=train train/main.jl data/train/data.h5 models/${item.name} ${item.config}
      deps:
        - data/train/data.h5
        - train
      outs:
        - models/${item.name}
```
Naturally, we want to schedule multiple jobs asynchronously on different devices without errors, e.g.:
```sh
mkdir log
for i in 0 1 2 3; do
  CUDA_VISIBLE_DEVICES=$i dvc repro train@model_$i > log/model_$i.txt 2>&1 &
  sleep 2
done
```
but since `data/train/data.h5` will be locked, we can only run one job at a time. (We could work around this by creating multiple copies/symlinks, but that's not elegant...)
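For reference, the copies/symlinks workaround could look like the sketch below. The per-job directory names are my own invention, and each job's stage in `dvc.yaml` would also have to point at its own link, which is exactly the duplication the parameterization feature was meant to remove:

```shell
# Workaround sketch: give each job its own symlink to the dataset so that
# no two stages declare the same dependency path. Directory layout is made up.
mkdir -p log
for i in 0 1 2 3; do
  mkdir -p data/train/job_$i
  ln -sf "$(pwd)/data/train/data.h5" data/train/job_$i/data.h5
done
```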
I'm wondering if it's possible to introduce a looser version of the read lock that users can opt into from `dvc.yaml`, e.g.:
```diff
 stages:
   train:
     foreach: ${models}
     do:
       cmd: julia --project=train train/main.jl data/train/data.h5 models/${item.name}
       deps:
-        - data/train/data.h5
+        - data/train/data.h5:readonly
         - train
       outs:
         - models/${item.name}
```
When users add this property, they are explicitly saying "okay, I plan to use this dependency in a read-only way, and I'll take responsibility for whatever bugs may occur due to my improper usage". `dvc` could then choose not to add an entry for it in `rwlock`, which enables concurrency.
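To make the idea concrete, here is a hypothetical sketch (not DVC's actual implementation) of how a dependency entry carrying the proposed `:readonly` suffix could be split into a path and a flag that tells the lock machinery to skip the `rwlock` entry:

```python
# Hypothetical sketch of parsing the proposed ":readonly" suffix.
# The suffix name and helper are assumptions, not existing DVC API.
READONLY_SUFFIX = ":readonly"

def parse_dep(entry: str) -> tuple[str, bool]:
    """Return (path, readonly) for a dvc.yaml dependency entry."""
    if entry.endswith(READONLY_SUFFIX):
        return entry[: -len(READONLY_SUFFIX)], True
    return entry, False

# Deps flagged readonly would simply be excluded when rwlock entries are built.
path, readonly = parse_dep("data/train/data.h5:readonly")
```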
I'm not sure if DVC has plans to add native support for concurrent job scheduling. With #4976, it would be very promising if `dvc repro train@* -s` scheduled multiple jobs in parallel.
It would also be nice to support environment variable passing, but that is already doable by passing `params.yaml` values to the language's own utilities (e.g., `os.environ["CUDA_VISIBLE_DEVICES"] = config["gpu_device"]`).
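A minimal sketch of that approach, assuming `params.yaml` carries a hypothetical `gpu_device` key. In a real project the config would come from `yaml.safe_load(open("params.yaml"))`; a plain dict stands in here so the example is self-contained:

```python
import os

# Stand-in for the dict loaded from params.yaml; "gpu_device" is an assumed key.
config = {"gpu_device": "2"}

# Export the device before any CUDA-using library is initialized,
# so the training process only sees the assigned GPU.
os.environ["CUDA_VISIBLE_DEVICES"] = str(config["gpu_device"])
```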