Ideally a third job could help. A workflow for GL and GH would be:
# GitLab
stages:
  - ml
  - check

train:
  stage: ml
  tags:
    - gpu
    - check
  cache:
    paths:
      - ./models
  script:
    - echo "setup a pipeline here"

check:
  stage: check
  when: on_failure
  needs:
    - train
  script:
    - echo "Restarting..."

# GitHub
name: cml
on: [push]
jobs:
  train:
    # needs: deploy
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v2
      - name: Cache multiple paths
        uses: actions/cache@v2
        with:
          path: |
            ./models
          key: models
      - name: cml_run
        shell: bash
        env:
          repo_token: ${{ secrets.GITHUB_TOKEN }}
        run: |
          echo "setup a pipeline here"
  check:
    if: failure()
    needs: train
    runs-on: [ubuntu-latest]
    steps:
      - name: cml_check
        run: |
          echo "Restarting...."
However, this approach has two issues:
- In GH, the loss of the runner can be recovered from, because it ends with a failed job. In GL, a job without a valid runner can run forever. I opened a ticket here
- The biggest drawback would be restarting the workflow in a loop. Giving the runner the ability to listen for the spot instance eviction would be a better guarantee that it acts properly.
This implies that we have to provide the cleanup scripts when deploying the spot instances. These scripts only need to run the runner cleanup and the restart of the workflow.
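A minimal sketch of such a cleanup script, to make the idea concrete. It assumes AWS EC2 spot instances, where the instance metadata path `spot/instance-action` returns 404 until an interruption is scheduled; other providers expose eviction notices differently. The function names (`fetch_eviction_notice`, `wait_for_eviction`, `on_eviction`) and the cleanup and restart commands are hypothetical placeholders, not part of any existing runner.

```python
import json
import subprocess
import time
import urllib.error
import urllib.request

# Assumption: AWS EC2 spot instance, IMDSv1-style request.
# IMDSv2 would additionally require a session token header.
EVICTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def fetch_eviction_notice(url=EVICTION_URL, timeout=2):
    """Return the parsed eviction notice, or None if none is scheduled."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.loads(resp.read().decode())
    except (urllib.error.URLError, ValueError, OSError):
        # 404, unreachable endpoint, or non-JSON body: no notice yet.
        return None


def wait_for_eviction(fetch=fetch_eviction_notice, interval=5,
                      max_polls=None, sleep=time.sleep):
    """Poll until an eviction notice appears; return it (or None on give-up)."""
    polls = 0
    while max_polls is None or polls < max_polls:
        notice = fetch()
        if notice is not None:
            return notice
        polls += 1
        sleep(interval)
    return None


def on_eviction(cleanup_cmd, restart_cmd, run=subprocess.run):
    """Run the runner cleanup first, then trigger the workflow restart."""
    run(cleanup_cmd, check=True)
    run(restart_cmd, check=True)
```

On the spot instance, the script would call `wait_for_eviction()` and, once it returns a notice, call `on_eviction(...)` with whatever commands unregister the runner and re-trigger the pipeline. Because the restart is tied to an actual eviction rather than to any job failure, this avoids the restart loop of the `check` job approach.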