Closed
Description
GitHub Actions limits a workflow run to 72 hours, which is often too short for training a model.
Depending on how the vendor's runners handle the timeout, a good way to deal with it would be to restart the workflow so that the run can still finish and get the green light.
However, at least two possible scenarios come to mind:
- The runner is able to finish the job (training)
- The runner stops since the workflow fails
In both cases, the solution would be a mechanism that restarts the workflow and uses a cache to save the intermediate models/state, so that each restarted run resumes from the last saved state instead of starting over.
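
The training side of such a mechanism could look roughly like the sketch below. It is only an illustration under assumptions: `train`, `train_one_epoch`, `checkpoint_path`, and `time_budget_s` are hypothetical names, JSON stands in for real model serialization, and persisting the checkpoint file between workflow runs (through whatever cache or artifact storage the workflow uses) is assumed rather than shown.

```python
import json
import os
import time


def train(checkpoint_path, total_epochs, time_budget_s, train_one_epoch,
          clock=time.monotonic):
    """Train until finished or until the time budget is spent.

    Returns True when all epochs are done, False when the run stopped
    early and the workflow should be restarted to continue.
    All names here are hypothetical, not an existing API.
    """
    # Resume from the last saved state if a checkpoint exists.
    state = {"epoch": 0, "model": None}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)

    deadline = clock() + time_budget_s
    while state["epoch"] < total_epochs:
        # Stop before the platform timeout would kill the job.
        if clock() >= deadline:
            return False

        state["model"] = train_one_epoch(state["model"])
        state["epoch"] += 1

        # Write to a temporary file and rename it, so an interrupted
        # write never leaves a corrupt checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)

    return True
```

A job step could call `train(...)` with a budget safely below the timeout and use the return value to decide whether to trigger another run. Because the checkpoint is written after every epoch, the same file also covers the scenario where the runner is stopped from outside: the next run loses at most one epoch of work.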