Severe thread starvation issues #41586


Description

@janrous-rai

Summary

We have observed highly unpredictable thread starvation on our production servers, which are written in Julia. In particular, this thread starvation causes our monitoring (instrumentation) subsystem to stall for extended periods of time (tens of minutes, up to hours in extreme cases). This often leads to a total loss of visibility into the operation of our production systems, which makes it virtually impossible to diagnose the services and to run a reliable production system.

Solving this issue is a high priority for us at RelationalAI due to its impact on our production systems.

Background

Our database server (rai-server) is written in Julia. The production deployment currently runs on Julia 1.6.1 with multi-threading enabled and the number of threads set to the number of cores of the underlying virtual machines (we are using the Azure cloud).

Our monitoring subsystem relies heavily on background tasks - tasks that are spawned at server startup and periodically kick off certain operations such as:

  1. flushing changes to metrics to the remote monitoring services
  2. collecting specific metrics such as Julia garbage collection statistics, memory usage details (by parsing /proc files) and so on.

In addition to metrics that are updated via background tasks, many other metrics are updated as soon as the event of interest occurs (e.g. when the pager allocates memory for a new page, we increment a metric that tracks the number of pages currently in use); a small sketch of both update styles follows below.
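
For concreteness, here is a minimal sketch (not our actual code) of the two update styles; the names pages_in_use, allocate_page!, and report_metric are hypothetical, with report_metric standing in for our metrics exporter:

using Base.Threads: Atomic, atomic_add!, @spawn

const pages_in_use = Atomic{Int}(0)

# Event-driven update: the metric is bumped at the point where the event occurs.
allocate_page!() = atomic_add!(pages_in_use, 1)

# Background-task update: a spawned task periodically samples Julia GC statistics.
function collect_gc_metrics(period_s)
    @spawn while true
        sleep(period_s)
        gc = Base.gc_num()                                 # cumulative GC counters
        report_metric("gc_total_time_ns", gc.total_time)   # hypothetical exporter
    end
end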

Users can run database queries. When these queries are received via HTTP, they are parsed, analyzed, and executed. During execution, a potentially large number of tasks, some with very long execution times, can be generated. However, we currently do not have a good understanding of the fine-grained details (e.g. the number of tasks spawned, or the duration of individual tasks). Some workloads may trigger long-running CPU-heavy work within a single task.

Periodic tasks

The monitoring subsystem and other batched or maintenance tasks use the concept of a periodic task, which implements the following pseudo-code:

while !periodic_task_should_terminate
  sleep(period)
  do_stuff()
end

In reality, the actual code is somewhat more complex because the loop above reacts slowly to termination signals, so we use an analogue of the C++ wait_for function that waits for a specific period or for a notification signal (whichever comes first).
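
A minimal sketch of such a helper, assuming a simple Atomic{Bool} termination flag and Base.timedwait (the names wait_for_or_signal and do_stuff are ours; the real implementation differs):

using Base.Threads: Atomic

# Waits for up to period_s seconds, returning early if the termination flag flips.
# timedwait polls the predicate, so responsiveness is bounded by pollint.
function wait_for_or_signal(term::Atomic{Bool}, period_s::Real)
    return timedwait(() -> term[], period_s; pollint = 0.1)
end

function periodic_task(term::Atomic{Bool}, period_s::Real, do_stuff)
    while !term[]
        wait_for_or_signal(term, period_s)
        term[] || do_stuff()
    end
end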

Observed symptoms

On server startup, we kick off a periodic task that increments the server_heartbeats metric once per second. Under normal conditions, when we plot its rate of change, we should get a flat line close to 1. However, when the server starts handling user queries, the rate of change of this metric dips well below 1, indicating that the background task is not running as frequently as it should.
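
This heartbeat task is just the periodic-task pattern from above; a trimmed-down, hypothetical sketch (the flag and counter names are ours):

using Base.Threads: Atomic, atomic_add!, @spawn

const heartbeat_term = Atomic{Bool}(false)   # hypothetical termination flag
const server_heartbeats = Atomic{Int}(0)     # counter behind the exported metric

@spawn while !heartbeat_term[]
    sleep(1.0)                               # nominally one increment per second
    atomic_add!(server_heartbeats, 1)
end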

[figure: heartbeats]

Sometimes we experience a complete blackout during which rai-server does not export any metrics at all. In some situations, this complete blackout can last tens of minutes or, in the example below, roughly 90 minutes. This seems to indicate that the statsd export thread is not running as frequently as it should.

[figure: memory-blackout]

Suspected root cause

We suspect that the following factors contribute to this issue:

  1. Task thread affinity - a long-running task can only be scheduled on the same OS thread that it was assigned to initially. (Task migration was introduced in Julia 1.7, which we've only just started testing.)
  2. Lack of task isolation - both user query handling and critical components of the server (http handlers, the monitoring subsystem, ...) use the same Julia tasks and compete for the same limited pool of OS threads. There is no built-in way to isolate the two or to prioritize critical work over user-supplied workloads.
  3. No task preemption - while tasks can explicitly yield and give up their OS thread, if they do not do so they can occupy an OS thread indefinitely.
  4. Long-running tasks - if the code doesn't explicitly contain yield points or use code that implicitly yields, individual tasks can run for indefinite periods of time.
  5. Freeze-the-world GC pauses waiting on long-running tasks - if one task triggers GC and attempts to freeze the world, every thread will stop scheduling new work until all threads have paused. If one thread is running a long-running task, GC pauses can keep the other threads starved for a long time.

Specifically, we think that the combination of (1), (3), (4), and possibly (5) is the primary cause of the observed symptoms. The critical background tasks are scheduled on arbitrary threads. Whenever user-generated tasks are spawned, they can get scheduled on those same threads. If any of these tasks is long-lived (we have queries that may run for several hours), does not yield, and gets scheduled on the same thread as a critical task, it can and will block it.

Suggested solutions

In the end, this is all about sharing limited resources (threads) among tasks and enforcing some form of isolation. Ultimately, we need to be able to flag work that is critical (latency-sensitive) and ensure that other tasks cannot starve it. There are several possible approaches:

  1. Task priorities - tasks could be associated with a priority that controls the order in which they are scheduled. In such a scenario, critical tasks would be marked as high-priority, ensuring that they are processed before any lower-priority work.
  2. Dedicated resources - this is somewhat analogous to thread pools. The idea is to dedicate one (or more) threads to critical tasks only (we would need to be able to flag those), which would ensure that bulk (lower-priority) workloads cannot be scheduled on these threads. The downside of this solution is that some of the threads may end up being idle most of the time, but this may be a reasonable overhead. If the size of this exclusive thread pool can be controlled at startup time, we can easily tune the degree of this overhead.
  3. Ability to spawn tasks on a dedicated OS thread - the production-critical components of our server are relatively lightweight but latency-sensitive. We would likely achieve a good degree of isolation if we could launch these Julia tasks on a new OS thread that is dedicated to that task and that task only, and is cleaned up after the task finishes. This effectively means that the task bypasses Julia scheduling entirely and relies on the kernel scheduler to manage the underlying OS thread.
  4. Task migration - as long as there are some available threads, allowing critical work to migrate and schedule on any thread should alleviate the risk of long-running tasks occupying the wrong thread. This is still vulnerable to problems if there are many tasks scheduled (perhaps starving critical tasks) or if enough long-running tasks occupy all available threads.
  5. Explicit yield points - the code itself can be instrumented with yield points to ensure that long-running tasks are unlikely. This approach may still not work if the latency sensitivity of a critical task is finer than the (time) distance between yield points. It also requires a lot of manual labor and may fall apart if new code that is not properly "yield instrumented" is introduced. This feels like the least favorable solution in the list but may be usable as a quick-and-dirty interim hack to get things into a better shape (a minimal sketch follows after this list).
  6. Better visibility into scheduling - this is not a solution per se, but having tools/instrumentation that help us inspect task scheduling decisions, or log certain problematic conditions (e.g. long-running tasks), would help us identify and profile sub-optimal behavior in our own code. For example, if we could easily flag long-running tasks, we might be able to insert yield points and break them up into smaller bits.
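
As an illustration of option (5), a minimal, hypothetical sketch of manual yield instrumentation in a CPU-heavy loop (the function name and yield interval are arbitrary):

# Hypothetical sketch of explicit yield points: yield periodically so the
# scheduler can run pending tasks on this OS thread.
function cpu_heavy_work(n::Int)
    acc = 0
    for i in 1:n
        acc += i                        # stand-in for real work
        i % 100_000 == 0 && yield()     # cooperative yield point
    end
    return acc
end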

We think that solution (3) is the most robust and flexible option and would like feedback on its feasibility/complexity.
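
Not part of Base and not something we have settled on, but as a rough illustration of the "dedicated resources" idea in option (2), a package such as ThreadPools.jl can pin a sticky task to a specific thread id. The sketch below assumes, purely by convention, that thread 1 is reserved for critical work, and flush_metrics is a hypothetical function:

# Sketch only, assuming ThreadPools.jl; thread 1 is reserved for critical
# work by convention (nothing prevents other tasks from landing there).
using ThreadPools: @tspawnat

critical = @tspawnat 1 begin
    while true
        sleep(1.0)
        flush_metrics()   # hypothetical critical, latency-sensitive work
    end
end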

Example

The following example reproduces the thread starvation. The metronome function runs at a fixed interval and measures the interval accuracy (the drift between the expected and observed interval). The run_for function implements a long-running, non-yielding task that is spawned and blocks the metronome.

using Dates
using Base.Threads: Atomic, @spawn

# Busy-spins for num_ns nanoseconds without ever yielding, so it occupies its OS thread for the whole duration.
function run_for(id::Int64, num_ns::Int64)
    println("$id: Kicked off run_for($num_ns) on thread: $(Threads.threadid())")
    start_ns = time_ns()
    local i = 0
    while time_ns() < start_ns + num_ns
        i += 1
    end
    println("$id: Finished run_for on thread: $(Threads.threadid())")
end

# Sleeps for period_ns each iteration and reports how far the observed period drifts from the requested one; large drift means the task was starved.
function metronome(term::Atomic{Bool}, period_ns::Int64)
    println("Kicked off metronome $period_ns on thread: $(Threads.threadid())")
    local i = 0
    while !term[]
        i += 1
        prev_ns = time_ns()
        println("Presleep $(time_ns())")
        sleep(Dates.Nanosecond(period_ns))
        actual_ns = time_ns() - prev_ns
        println("Postsleep $(time_ns()), delta $(actual_ns)")
        drift_ns = actual_ns - period_ns
        drift_s = drift_ns / 1_000_000_000
        println("$(Threads.threadid()): Metronome drift $(drift_s) seconds")
    end
    println("Terminating metronome after $i iterations")
end

# Kick off one metronome and a bunch of busy tasks that will compete for threads
@sync begin
    mterm = Atomic{Bool}(false)
    @spawn metronome(mterm, 1_000_000 * 250)  # Runs every 250ms
    sleep(4)
    println("Spawning busy tasks")
 
    @sync begin
       for j in 1:10
            @spawn run_for(j, 1_000_000 * 4000) # Blocks for 4s
        end
    end
    mterm[] = true
end
