Spot instances - Runner must be able to restart workflow #174

Closed
@DavidGOrtega


Ideally a third job could help. A workflow for GitLab (GL) and GitHub (GH), in that order, would be:

stages:
  - ml
  - check

train:
  stage: ml
  tags:
    - gpu
    - check

  cache:
    paths:
      - ./models

  script:
    - echo "setup a pipeline here"

check:
  stage: check
  when: on_failure
  needs:
    - train

  script:
    - echo "Restarting..."

name: cml

on: [push]

jobs:
  train:
    # needs: deploy
    runs-on: [self-hosted,gpu]

    steps:
      - uses: actions/checkout@v2

      - name: Cache multiple paths
        uses: actions/cache@v2
        with:
          path: |
            ./models
          key: models

      - name: cml_run
        shell: bash
        env:
          repo_token: ${{ secrets.GITHUB_TOKEN }} 
        run: |
          echo "setup a pipeline here"

  check:
    if: failure()
    needs: train
    runs-on: [ubuntu-latest]
    steps:
      - name: cml_check
        run: |
          echo "Restarting...."

However, this approach has two issues:

  • In GH, a job whose runner is lost ends up as a failed job and can be recovered, whereas in GL a job without a valid runner can run forever. I opened a ticket here
  • The biggest drawback would be restarting the workflow in a loop. Giving the runner the ability to listen for the spot instance eviction would be a better guarantee that it acts properly.

This implies that we have to provide cleanup scripts when deploying the spot instances. These scripts only need to run the runner cleanup and restart the workflow.
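
A minimal sketch of what such a script could look like, in Python. It is not an existing CML feature: `watch_for_eviction`, `aws_eviction_pending` and the `cleanup` / `restart` callables are hypothetical names used for illustration. The eviction check assumes AWS spot instances and unauthenticated access to the instance metadata service; instances that enforce IMDSv2 would need a session token first, and other clouds expose different endpoints.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional

# Assumption: AWS. This metadata path answers 404 until an interruption
# is scheduled for the instance, then returns a small JSON document.
AWS_SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def aws_eviction_pending(url: str = AWS_SPOT_ACTION_URL, timeout: float = 2.0) -> bool:
    """Return True once the metadata service reports a scheduled interruption."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False  # no interruption scheduled yet
        raise
    except urllib.error.URLError:
        return False  # metadata service unreachable, e.g. not running on EC2


def watch_for_eviction(
    eviction_pending: Callable[[], bool],
    cleanup: Callable[[], None],
    restart: Callable[[], None],
    poll_seconds: float = 5.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll for an eviction notice; on the first one, clean up and restart once.

    `cleanup` is expected to unregister the runner and `restart` to re-trigger
    the workflow (both hypothetical, supplied by the deployment).
    Returns True if an eviction was handled, False if `max_polls` ran out.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        if eviction_pending():
            cleanup()
            restart()
            return True  # act exactly once, so no restart loop starts here
        polls += 1
        sleep(poll_seconds)
    return False
```

On the instance this would be started alongside the runner, for example `watch_for_eviction(aws_eviction_pending, cleanup=unregister_runner, restart=rerun_workflow)`, where the last two are whatever the deployment provides. Because the restart is driven by the eviction notice rather than by any job failure, a genuinely failing training job would not be restarted again and again.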
