
CML <-> DVC cache smooth integration #4268


Closed
DavidGOrtega opened this issue Jul 23, 2020 · 2 comments
Labels
feature request · research

Comments


DavidGOrtega commented Jul 23, 2020

Hi guys,

I've been exploring the combination of both tools lately to solve ML problems. One of the most interesting requests is the ability to resume training from a checkpoint in case something fails in the middle of a multi-day training job. This is especially needed for spot instances, which the vendor can interrupt with only 30 seconds' notice to take action.
The ideal solution that came to us was using the DVC cache. However, the CML and DVC integration is not yet smooth.

The requirements are:

  • The DVC pipeline runs in the CI runners (hosted or self-hosted), never locally.
  • Every batch, the checkpoints should be stored in the DVC cache so that, if the CI workflow fails, restarting it resumes the training from that stage.
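Since spot vendors typically deliver the interruption as a signal, one way to meet the second requirement is to trap it and checkpoint before the instance dies. A minimal sketch (my own, not from the issue; the real handler would run `dvc push --run-cache` where the comment indicates):

```shell
#!/bin/bash
# Sketch: checkpoint on spot interruption. The trap fires on SIGTERM,
# which is roughly what a vendor sends ~30 s before reclaiming the node.

checkpoint() {
  echo 'checkpoint' >> model.data   # stand-in for the real save
  # dvc push --run-cache            # upload the checkpoint before dying
}
trap checkpoint TERM

rm -f model.data

# Simulated training loop: each "batch" appends to model.data.
for batch in 1 2 3; do
  echo "batch${batch}" >> model.data
done
```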

So let's set up the DVC pipeline with a very easy-to-follow example:

train.sh

#!/bin/bash

# First training "step": append to the model and checkpoint it.
echo 'step1'
echo 'step1' >> model.data
dvc push --run-cache

# Window during which the runner may be interrupted.
sleep 30

# Second training "step": append and checkpoint again.
echo 'step2'
echo 'step2' >> model.data
dvc push --run-cache
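To simulate a spot interruption locally (my own sketch, not from the issue), one can kill the script during the sleep and inspect what survived. The stand-in script `fake_train.sh` below is hypothetical and omits the `dvc push --run-cache` calls so it runs anywhere:

```shell
#!/bin/bash
# Self-contained simulation: a dvc-free stand-in for train.sh,
# killed during the sleep between the two steps.
cat > fake_train.sh <<'EOF'
#!/bin/bash
echo 'step1' >> model.data
sleep 30
echo 'step2' >> model.data
EOF
chmod +x fake_train.sh

rm -f model.data
# `timeout` sends SIGTERM after 2 s, i.e. mid-sleep, like a spot reclaim.
timeout 2 ./fake_train.sh || echo 'interrupted (as a spot instance would be)'
cat model.data   # only step1 survives
```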

Our GitHub workflow file will be:

.github/workflows/cml.yaml

name: cml

on: [push]

jobs:
  train:
    runs-on: [self-hosted,gpu]

    steps:
      - uses: actions/checkout@v2

      - name: cml_run
        shell: bash
        env:
          repo_token: ${{ secrets.GITHUB_TOKEN }} 
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          dvc pull --run-cache -f || echo 'failed dvc pull :('
          dvc repro

          echo "Training..."
          echo 'Hi from CML' >> report.md
          cat model.data >> report.md
          cml-send-comment report.md
          echo "Trained!"

We set up DVC with:

dvc init
dvc remote add s3 s3://your-s3-bucket
dvc remote default s3
dvc run --no-exec -n train \
    --outs-persist model.data \
    ./train.sh

git add --all
git commit -m 'dvc'
dvc push
git push
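For reference, the `dvc run --no-exec` call above should generate a `dvc.yaml` stage roughly like the following (a sketch; the exact layout varies by DVC version):

```yaml
stages:
  train:
    cmd: ./train.sh
    outs:
    - model.data:
        persist: true
```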

Let's review it:
Ideally, if our runner dies after step1 (while sleeping), model.data should contain step1. If we then restart the workflow, we should end up with model.data containing

step1
step1
step2

Well, that's the ideal case... As we can see, DVC is not actually caching anything.
In fact, the issue comes from having created the pipeline without running it.
Please note the
dvc pull --run-cache -f || echo 'failed dvc pull :('

I put || as a try/catch since dvc pull will always fail; remember that on the very first run the cache should indeed be empty, but after restarting our failed workflow we should recover our model.data with

step1

inside. DVC does not seem to handle well a deferred repro and a pull without checkpoint ids or with an empty cache. In fact, is the cache not working because of that?
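The `||` trick itself is plain shell: the right-hand side runs only when the left-hand command exits non-zero, and the compound's exit status then comes from the fallback, so a CI step running with abort-on-error semantics keeps going. A tiny self-contained illustration:

```shell
#!/bin/bash
set -e   # CI-like behaviour: any unhandled failure aborts the script

# `false` always fails; without `||` the script would stop here.
false || echo 'failed dvc pull :('   # fallback keeps exit status 0

echo 'still running'
```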

WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:
name: model.data, md5: e0b7ab6cd3e2df496849e69c355045a7
WARNING: Cache 'e0b7ab6cd3e2df496849e69c355045a7' not found. File 'model.data' won't be created.
1 file failed
ERROR: failed to pull data from the cloud - Checkout failed for following targets:
model.data
Did you forget to fetch?

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!

Any ideas @dmpetrov ?

@ghost ghost added the triage Needs to be triaged label Jul 23, 2020
@efiop efiop added the research label Jul 31, 2020
@ghost ghost removed the triage Needs to be triaged label Jul 31, 2020
@efiop efiop added the feature request Requesting a new feature label Jul 31, 2020
@casperdcl
Contributor

is this the same as #4649?

@efiop
Contributor

efiop commented Oct 8, 2021

closing as stale

@efiop efiop closed this as completed Oct 8, 2021