Conversation

@j4mie j4mie commented Nov 29, 2021

Heroku's docs say:

When a dyno is killed or restarted, the processes within the dyno have 30 seconds in which to exit on receiving a SIGTERM. The process is sent SIGKILL to force an exit after this 30 seconds.

django-db-queue currently handles the SIGTERM signal by setting an internal alive flag on the worker class to False. When the currently running job finishes, the worker stops looping and exits gracefully.
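
Roughly, the current behaviour looks like this (a minimal sketch with illustrative names like `Worker` and `process_job`, not the actual django-db-queue worker code):

```python
import signal
import time


class Worker:
    # Illustrative only: not the real django-db-queue worker class.
    def __init__(self):
        self.alive = True
        # On SIGTERM, just flip the flag; the run loop checks it between jobs.
        signal.signal(signal.SIGTERM, self.shutdown)

    def shutdown(self, signum, frame):
        self.alive = False

    def process_job(self):
        time.sleep(1)  # stand-in for doing the actual work

    def run(self):
        while self.alive:
            # If this call takes longer than Heroku's 30-second grace period,
            # SIGKILL arrives before the loop ever re-checks `alive`.
            self.process_job()
```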

However, if a worker is processing a job that takes longer than 30 seconds, then the process never gets a chance to exit gracefully before SIGKILL is received. The worker will be forcefully killed and the job will remain in the PROCESSING state forever.

Under normal circumstances this doesn't cause too much of an issue (other than being a bit weird). These PROCESSING jobs just sit there, never being cleaned up (because delete_old_jobs doesn't delete them) but also never being processed.

However, recently we've started doing some queries on the state of the jobs table before creating a new job - eg "is there already a job of this type in the queue? If so, don't bother creating another one". To do this, we've naively checked for jobs in the states NEW or PROCESSING. The problem is, if one of these zombie PROCESSING jobs exists, the code always thinks a job of that type is already in the queue - so the queue grinds to a halt! We've worked around this with more complex queries (like "is there a job in state NEW or a job in state PROCESSING that's newer than two hours old") but they're a bit of a hack.
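
For illustration, the naive check and the workaround look roughly like this (a sketch only: the job name "generate_report" is hypothetical, and the `name` and `created` field names and plain state strings are assumptions about the Job model rather than its exact API):

```python
from datetime import timedelta

from django.db.models import Q
from django.utils import timezone
from django_dbq.models import Job

# Naive check: a zombie PROCESSING job makes this return True forever.
# ("generate_report" is a hypothetical job name.)
already_queued = Job.objects.filter(
    name="generate_report", state__in=["NEW", "PROCESSING"]
).exists()

# Workaround: only treat PROCESSING jobs newer than two hours as "in the queue".
two_hours_ago = timezone.now() - timedelta(hours=2)
already_queued = Job.objects.filter(name="generate_report").filter(
    Q(state="NEW") | Q(state="PROCESSING", created__gte=two_hours_ago)
).exists()
```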

This PR adds a new possible state to the state field on the Job model: STOPPING. The job is put into this state as soon as the SIGTERM signal is received. Assuming the job then finishes within the 30-second window, it goes into COMPLETE (or FAILED) as normal. However, if the job doesn't finish in time, the job will stay in this STOPPING state forever.
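
Conceptually, the change amounts to also flagging the in-flight job when SIGTERM arrives. Extending the earlier worker sketch (again illustrative names - `current_job` is an assumed attribute, not the actual diff):

```python
def shutdown(self, signum, frame):
    self.alive = False
    # New behaviour: record that the running job has been asked to stop, so a
    # later SIGKILL leaves it in STOPPING rather than stuck in PROCESSING.
    # (`current_job` is an illustrative attribute name.)
    if self.current_job is not None:
        self.current_job.state = "STOPPING"
        self.current_job.save(update_fields=["state"])
```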

That means we can now distinguish between "this job is actually running" (state PROCESSING), "this job is actually running but has been asked to exit" (state STOPPING and less than 30 seconds old), and "this job has been asked to stop but never actually stopped and is therefore zombified" (state STOPPING and more than 30 seconds old).

I drew some state machine diagrams to try to illustrate this.

Before:

(state machine diagram)

(note that all nodes are double-circled because they are all possible "end states")

After:

(state machine diagram)

Drawbacks of this approach:

  • A job in STOPPING may of course still be running, so if the calling code only looks for jobs in state PROCESSING, there's no guarantee that another job isn't already running. If it's critical that only one job of a certain type runs at a time, we still have to fall back to also looking for STOPPING jobs - but we can be much more constrained and only consider STOPPING jobs newer than 30 seconds old (see the sketch after this list).
  • If the worker receives a SIGKILL without receiving a SIGTERM first, it will never have a chance to put the job into STOPPING, and so it could still get stuck in PROCESSING. This is unlikely, though.
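
For the single-instance guard mentioned in the first drawback, the more constrained query could look something like this (same assumptions about the Job model and job name as in the earlier sketch):

```python
from datetime import timedelta

from django.db.models import Q
from django.utils import timezone
from django_dbq.models import Job

grace_period_ago = timezone.now() - timedelta(seconds=30)

# True if a job of this type is queued, running, or was asked to stop within
# the last 30 seconds; older STOPPING jobs are treated as zombies and ignored.
job_in_flight = Job.objects.filter(name="generate_report").filter(
    Q(state__in=["NEW", "PROCESSING"])
    | Q(state="STOPPING", created__gte=grace_period_ago)
).exists()
```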

I think if we decide to merge this, it'll need to be version 3.0 as it's technically a change to the public API (ie people depending on the current behaviour may find their code broken).

* origin/master:
  Quote all versions in ci.yml
  Correct support versions in README
  Remove Django 2.2
  Add Django 4.0 and Python 3.10 to test matrix, remove unsupported versions
  Remove .python-version
  Reformat to match new Black version
  Blacken INSTALLED_APPS docs
  Bump version
  Blacken docs and reword slightly
  Clarify workspace docs
  Fix black formatting
  formatted w/black; added note about Windows support to README
  Test if `signal` has attribute `SIGQUIT`
  Add instructions for migrating
@j4mie j4mie merged commit e2e5677 into master Jan 6, 2022
@j4mie j4mie deleted the stopping-state branch January 6, 2022 21:14