Conversation

@j4mie j4mie commented Nov 29, 2021

Heroku's docs say:

When a dyno is killed or restarted, the processes within the dyno have 30 seconds in which to exit on receiving a SIGTERM. The process is sent SIGKILL to force an exit after this 30 seconds.

django-db-queue currently handles the SIGTERM signal by setting an internal alive flag on the worker class to False. When the currently running job finishes, the worker stops looping and exits gracefully.
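
Roughly, the current behaviour looks like this (a minimal sketch with illustrative names like `Worker` and `process_job`, not the actual django-db-queue worker code):

```python
import signal
import time


class Worker:
    # Illustrative only: not the real django-db-queue worker class.
    def __init__(self):
        self.alive = True
        # On SIGTERM, just flip the flag; the run loop checks it between jobs.
        signal.signal(signal.SIGTERM, self.shutdown)

    def shutdown(self, signum, frame):
        self.alive = False

    def process_job(self):
        time.sleep(1)  # stand-in for doing the actual work

    def run(self):
        while self.alive:
            # If this call takes longer than Heroku's 30-second grace period,
            # SIGKILL arrives before the loop ever re-checks `alive`.
            self.process_job()
```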

However, if a worker is processing a job that takes longer than 30 seconds, then the process never gets a chance to exit gracefully before SIGKILL is received. The worker will be forcefully killed and the job will remain in the PROCESSING state forever.

Under normal circumstances this doesn't cause too much of an issue (other than being a bit weird). These PROCESSING jobs just sit there, never being cleaned up (because delete_old_jobs doesn't delete them) but also never being processed.

However, recently we've started doing some queries on the state of the jobs table before creating a new job - eg "is there already a job of this type in the queue? If so, don't bother creating another one". To do this, we've naively checked for jobs in the states NEW or PROCESSING. The problem is, if one of these zombie PROCESSING jobs exists, the code always thinks a job of that type is already in the queue - so the queue grinds to a halt! We've worked around this with more complex queries (like "is there a job in state NEW or a job in state PROCESSING that's newer than two hours old") but they're a bit of a hack.
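
For illustration, the naive check and the workaround look roughly like this (a sketch only: the job name "generate_report" is hypothetical, and the `name` and `created` field names and plain state strings are assumptions about the Job model rather than its exact API):

```python
from datetime import timedelta

from django.db.models import Q
from django.utils import timezone
from django_dbq.models import Job

# Naive check: a zombie PROCESSING job makes this return True forever.
# ("generate_report" is a hypothetical job name.)
already_queued = Job.objects.filter(
    name="generate_report", state__in=["NEW", "PROCESSING"]
).exists()

# Workaround: only treat PROCESSING jobs newer than two hours as "in the queue".
two_hours_ago = timezone.now() - timedelta(hours=2)
already_queued = Job.objects.filter(name="generate_report").filter(
    Q(state="NEW") | Q(state="PROCESSING", created__gte=two_hours_ago)
).exists()
```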

This PR adds a new possible state to the state field on the Job model: STOPPING. The job is put into this state as soon as the SIGTERM signal is received. Assuming the job then finishes within the 30-second window, it goes into COMPLETE (or FAILED) as normal. However, if the job doesn't finish in time, the job will stay in this STOPPING state forever.
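
Conceptually, the change amounts to also flagging the in-flight job when SIGTERM arrives. Extending the earlier worker sketch (again illustrative names - `current_job` is an assumed attribute, not the actual diff):

```python
def shutdown(self, signum, frame):
    self.alive = False
    # New behaviour: record that the running job has been asked to stop, so a
    # later SIGKILL leaves it in STOPPING rather than stuck in PROCESSING.
    # (`current_job` is an illustrative attribute name.)
    if self.current_job is not None:
        self.current_job.state = "STOPPING"
        self.current_job.save(update_fields=["state"])
```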

That means we can now distinguish between "this job is actually running" (state PROCESSING), "this job is actually running but has been asked to exit" (state STOPPING and less than 30 seconds old), and "this job has been asked to stop but never actually stopped and is therefore zombified" (state STOPPING and more than 30 seconds old).

I drew some state machine diagrams to try to illustrate this.

Before:

(state machine diagram)

(note that all nodes are double-circled because they are all possible "end states")

After:

(state machine diagram)

Drawbacks of this approach:

  • A job in STOPPING may of course still be running, so if the calling code only looks for jobs in state PROCESSING, there's no guarantee that another job isn't already running. If it's critical that only one job of a certain type runs at a time, we still have to fall back to also looking for STOPPING jobs - but we can be much more constrained and only consider STOPPING jobs newer than 30 seconds old (see the sketch after this list).
  • If the worker receives a SIGKILL without receiving a SIGTERM first, it will never have a chance to put the job into STOPPING, and so it could still get stuck in PROCESSING. This is unlikely, though.
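
For the single-instance guard mentioned in the first drawback, the more constrained query could look something like this (same assumptions about the Job model and job name as in the earlier sketch):

```python
from datetime import timedelta

from django.db.models import Q
from django.utils import timezone
from django_dbq.models import Job

grace_period_ago = timezone.now() - timedelta(seconds=30)

# True if a job of this type is queued, running, or was asked to stop within
# the last 30 seconds; older STOPPING jobs are treated as zombies and ignored.
job_in_flight = Job.objects.filter(name="generate_report").filter(
    Q(state__in=["NEW", "PROCESSING"])
    | Q(state="STOPPING", created__gte=grace_period_ago)
).exists()
```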

I think if we decide to merge this, it'll need to be version 3.0 as it's technically a change to the public API (ie people depending on the current behaviour may find their code broken).

* origin/master:
  Quote all versions in ci.yml
  Correct support versions in README
  Remove Django 2.2
  Add Django 4.0 and Python 3.10 to test matrix, remove unsupported versions
  Remove .python-version
  Reformat to match new Black version
  Blacken INSTALLED_APPS docs
  Bump version
  Blacken docs and reword slightly
  Clarify workspace docs
  Fix black formatting
  formatted w/black; added note about Windows support to README
  Test if `signal` has attribute `SIGQUIT`
  Add instructions for migrating
@j4mie j4mie merged commit e2e5677 into master Jan 6, 2022
@j4mie j4mie deleted the stopping-state branch January 6, 2022 21:14