Add STOPPING state #44
Merged
Heroku's docs say that when a dyno shuts down, every process is sent `SIGTERM`; any process that hasn't exited within 30 seconds is then sent `SIGKILL`.

`django-db-queue` currently handles the `SIGTERM` signal by setting an internal flag on the worker class, `alive`, to `False`. When the currently running job finishes, the worker stops looping and exits gracefully.

However, if a worker is processing a job that takes longer than 30 seconds, the process never gets a chance to exit gracefully before `SIGKILL` is received. The worker is forcefully killed and the job remains in the `PROCESSING` state forever.

Under normal circumstances this doesn't cause too much of an issue (other than being a bit weird). These `PROCESSING` jobs just sit there, never being cleaned up (because `delete_old_jobs` doesn't delete them) but also never being processed.
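To make the failure mode concrete, the current shutdown handling is roughly shaped like this (an illustrative sketch only, not the actual django-db-queue source; the class layout and method names are simplified):

```python
import signal


class Worker:
    """Illustrative sketch of the existing shutdown behaviour."""

    def __init__(self, queue_name):
        self.queue_name = queue_name
        self.alive = True
        signal.signal(signal.SIGTERM, self._handle_sigterm)

    def _handle_sigterm(self, signum, frame):
        # Just flip the flag; it is only checked between jobs, so a job that
        # runs longer than 30 seconds never reaches this check before SIGKILL.
        self.alive = False

    def run(self):
        while self.alive:
            self._process_one_job()

    def _process_one_job(self):
        # Placeholder for: claim the next job, mark it PROCESSING, run it,
        # then mark it COMPLETE or FAILED. If the worker is SIGKILLed here,
        # the job stays in PROCESSING forever.
        ...
```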
However, recently we've started doing some queries on the state of the jobs table before creating a new job - eg "is there already a job of this type in the queue? If so, don't bother creating another one". To do this, we've naively checked for jobs in the states `NEW` or `PROCESSING`. The problem is, if one of these zombie `PROCESSING` jobs exists, the code thinks that one is always already in the queue - so the queue grinds to a halt! We've worked around this with more complex queries (like "is there a job in state `NEW`, or a job in state `PROCESSING` that's newer than two hours old?") but they're a bit of a hack.
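For illustration, the naive check and the two-hour workaround look roughly like this (a sketch only - the `name`/`created` fields and the string state values are assumptions about the `Job` model, not necessarily its exact API):

```python
from datetime import timedelta

from django.db.models import Q
from django.utils import timezone
from django_dbq.models import Job


def job_already_queued(job_name):
    # Naive check: a zombie PROCESSING job makes this return True forever,
    # so no new job of this name is ever created.
    return Job.objects.filter(name=job_name, state__in=["NEW", "PROCESSING"]).exists()


def job_already_queued_workaround(job_name):
    # Hacky workaround: ignore PROCESSING jobs older than two hours.
    two_hours_ago = timezone.now() - timedelta(hours=2)
    return Job.objects.filter(
        Q(state="NEW") | Q(state="PROCESSING", created__gte=two_hours_ago),
        name=job_name,
    ).exists()
```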
This PR adds a new possible state to the `state` field on the `Job` model: `STOPPING`. The job is put into this state as soon as the `SIGTERM` signal is received. Assuming the job then finishes within the 30 second window, it goes into `COMPLETE` (or `FAILED`) as normal. However, if the job doesn't finish in time, it will stay in the `STOPPING` state forever.
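Conceptually the change is small: add `STOPPING` to the state choices, and have the `SIGTERM` handler record it on whatever job is currently running. A sketch of the idea (not the code in this PR; the `current_job` attribute and worker internals shown here are assumed):

```python
STOPPING = "STOPPING"  # new value added to the Job state choices,
                       # alongside NEW / PROCESSING / COMPLETE / FAILED


class Worker:
    # ...same setup as the earlier sketch, plus a self.current_job attribute
    # pointing at whichever Job is being processed right now...

    def _handle_sigterm(self, signum, frame):
        # Still stop looping once the current job finishes...
        self.alive = False
        # ...but also record on the job itself that shutdown was requested.
        job = getattr(self, "current_job", None)
        if job is not None and job.state == "PROCESSING":
            job.state = STOPPING
            job.save(update_fields=["state"])
```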
That means we can now distinguish between "this job is actually running" (ie state `PROCESSING`), "this job is actually running but has been asked to exit" (ie state `STOPPING` and less than 30 seconds old), and "this job has been asked to stop but never actually stopped and is therefore zombified" (state `STOPPING` and more than 30 seconds old).

I did some state machines to try to illustrate this.
Before: (state machine diagram - note that all nodes are double-circled because they are all possible "end states")

After: (state machine diagram)
Drawbacks of this approach:

- A job in state `STOPPING` may of course still be running, so if the calling code is only looking for jobs in state `PROCESSING`, it doesn't guarantee that there isn't another job running already. So if it's critical that only one job of a certain type is running at a time, we have to go back to the workaround of looking for `STOPPING` jobs as well - but we can be much more constrained and only look for `STOPPING` jobs newer than 30 seconds old (see the sketch after this list).
- If the worker process receives a `SIGKILL` without receiving a `SIGTERM` first, it will never have a chance to put the job into `STOPPING`, and so it could still get stuck in `PROCESSING`. This is unlikely, though.
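For example, the more constrained "is one already running?" check from the first drawback might look something like this (again a sketch with assumed field names and state values, not code from this PR):

```python
from datetime import timedelta

from django.db.models import Q
from django.utils import timezone
from django_dbq.models import Job


def job_may_be_running(job_name):
    # PROCESSING jobs are assumed to be running; STOPPING jobs only count
    # while they're still inside the 30 second SIGTERM-to-SIGKILL window.
    cutoff = timezone.now() - timedelta(seconds=30)
    return Job.objects.filter(
        Q(state="PROCESSING") | Q(state="STOPPING", created__gte=cutoff),
        name=job_name,
    ).exists()
```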
I think if we decide to merge this, it'll need to be version 3.0, as it's technically a change to the public API (ie people depending on the current behaviour may find their code broken).