
runtime/trace: Add traceEvCPUProfileRate Event #60701

Open

Description

@felixge

Use Case 1: Extract CPU Profile

If an execution trace contains CPU sample events, it would be useful to extract a cpu/nanoseconds CPU profile similar to the one produced by runtime/pprof.StartCPUProfile.

This can be useful when building tools. E.g. a "trace to CPU profile" tool. Or perhaps a tool for explaining the _Grunning time of a goroutine using the CPU samples for that goroutine. A naive solution would give equal weight to each collected CPU sample and stretch it over the goroutine's total _Grunning time, but that could be misleading in the presence of scheduler latency, see below.
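
A minimal sketch of what such a "trace to CPU profile" tool could look like, using the github.com/google/pprof/profile package. The cpuSample type here is a hypothetical stand-in for a parsed traceEvCPUSample event, and the fixed period parameter is exactly the information this issue proposes to record:

```go
package traceprofile

import (
	"time"

	"github.com/google/pprof/profile"
)

// cpuSample is a hypothetical stand-in for a parsed traceEvCPUSample event.
type cpuSample struct {
	stack []uint64 // program counters, leaf first
}

// samplesToProfile converts trace CPU samples into a pprof profile. Without
// a traceEvCPUProfileRate event, period must be assumed (10ms by default).
func samplesToProfile(samples []cpuSample, period time.Duration) *profile.Profile {
	p := &profile.Profile{
		SampleType: []*profile.ValueType{
			{Type: "samples", Unit: "count"},
			{Type: "cpu", Unit: "nanoseconds"},
		},
		PeriodType: &profile.ValueType{Type: "cpu", Unit: "nanoseconds"},
		Period:     period.Nanoseconds(),
	}
	locs := make(map[uint64]*profile.Location)
	for _, s := range samples {
		var stack []*profile.Location
		for _, pc := range s.stack {
			loc, ok := locs[pc]
			if !ok {
				loc = &profile.Location{ID: uint64(len(locs) + 1), Address: pc}
				locs[pc] = loc
				p.Location = append(p.Location, loc)
			}
			stack = append(stack, loc)
		}
		p.Sample = append(p.Sample, &profile.Sample{
			Location: stack,
			Value:    []int64{1, period.Nanoseconds()},
		})
	}
	return p
}
```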

Use Case 2: Understand OS Scheduler Latency

A Go application might experience two types of scheduler latency: OS Scheduler Latency and Go Scheduler Latency. The latter can easily be analyzed using the execution tracer.

Detecting OS scheduler latency is a bit more tricky, but possible. Over a long enough time period, the cumulative time goroutines spend in the running state should converge to the cumulative number of traceEvCPUSample events multiplied by the sample period (10ms by default). If there are significantly fewer traceEvCPUSample events than expected, that's a strong indicator that the application is not getting enough scheduling time from the OS. That's a common problem for some setups, so it'd be nice to use tracing data to detect it.

(There are some dragons here when it comes to CPU samples received during syscalls/cgo ... but I think that deserves a separate discussion)
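
As a back-of-the-envelope sketch of this heuristic (the inputs are assumed to come from a trace post-processor, and the 50% threshold is arbitrary, for illustration only):

```go
import "time"

// osLatencySuspected reports whether the observed number of CPU samples is
// far below what the cumulative _Grunning time predicts. runningTime would be
// summed from the trace; period currently has to be assumed (10ms default).
func osLatencySuspected(runningTime time.Duration, observedSamples int, period time.Duration) bool {
	expected := int(runningTime / period)
	// Sampling is statistical, so only flag a large deficit.
	return observedSamples < expected/2
}
```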

Problem:

The traceEvCPUSample event does not include a value indicating how much CPU time it represents:

```go
traceEvCPUSample = 49 // CPU profiling sample [timestamp, real timestamp, real P id (-1 when absent), goroutine id, stack]
```

One could assume that it's always 10ms, but that won't work if the user calls runtime.SetCPUProfileRate. Unfortunately the execution trace does not record this value, and it's not possible to get the currently active value from userland either. Unlike SetMutexProfileFraction, SetCPUProfileRate does not return the previous value, and there is no GetCPUProfileRate function either.

Additionally, it's currently not possible to calculate the expected number of traceEvCPUSample events if the CPU profiler is not enabled for the entire duration of the trace.

Suggestion:

Add a new traceEvCPUProfileRate event that is recorded in the following cases:

  1. The tracer starts while the CPU profiler is already running.
  2. StartCPUProfile is called while the tracer is running.
  3. StopCPUProfile is called while the tracer is running (recording a rate of 0).

Alternatively we could also have a start/stop event for the CPU profiler.
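
Either way, a trace consumer could then attribute a concrete duration to each sample. A rough sketch of the consuming side, against a hypothetical parsed event stream (none of these types exist today):

```go
import "time"

// event is a hypothetical parsed trace event; only the fields needed for
// this sketch are shown.
type event struct {
	kind string // "cpuProfileRate" (proposed) or "cpuSample"
	rate int64  // samples per second; 0 means the profiler was stopped
}

// cpuTimeRepresented sums the CPU time represented by sample events,
// updating the rate as proposed traceEvCPUProfileRate events are seen.
func cpuTimeRepresented(events []event) time.Duration {
	var total time.Duration
	var rate int64 // unknown until the first rate event
	for _, ev := range events {
		switch ev.kind {
		case "cpuProfileRate":
			rate = ev.rate
		case "cpuSample":
			if rate > 0 {
				total += time.Second / time.Duration(rate)
			}
		}
	}
	return total
}
```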

cc @mknyszek @prattmic @nsrip-dd @rhysh

Activity

rhysh (Contributor) commented on Jun 9, 2023

OS scheduler latency is frustrating, and I agree that better tools for tracking it down would be nice.

First, on the problem of "how to detect OS scheduler latency". When I've seen it in execution traces (from quota exhaustion), it appears as if a goroutine is running on a P with no interruptions for 50+ milliseconds. (But, maybe it shows up in ways that I'm not currently able to see? Can you say more on how it appears in apps you support?)

Historically speaking, and IIUC, the sysmon goroutine has been written to interrupt a G after it's been on a P for a full 10-millisecond time quantum of its own. A G might use the remainder of another goroutine's time quantum "for free" and then get a full time slice of its own, so in my experience CPU-bound goroutines often run for 9.99 + 10 = 19.99 ms in a row. It's possible this behavior has changed recently. But if not, that may be a somewhat reliable way to detect latency from the OS scheduler (though not guaranteed to be stable, including for operators who might tweak the sysmon settings).
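
A sketch of this detection heuristic, assuming a post-processor has already reduced the trace to per-goroutine running intervals (the types and the threshold are illustrative, not an existing API):

```go
import "time"

// runningInterval is a hypothetical reduction of a trace: one uninterrupted
// _Grunning span of a goroutine.
type runningInterval struct {
	goroutine  uint64
	start, end time.Duration // offsets into the trace
}

// longRuns returns intervals longer than threshold, e.g. a value comfortably
// above the ~20ms that sysmon's preemption behavior predicts.
func longRuns(intervals []runningInterval, threshold time.Duration) []runningInterval {
	var out []runningInterval
	for _, iv := range intervals {
		if iv.end-iv.start > threshold {
			out = append(out, iv)
		}
	}
	return out
}
```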

Second, on another hurdle to using CPU profile events for this purpose. The traceEvCPUSample event doesn't include the M, and (on Linux, with timer_create) the 10 ms interval timer is associated with the M. The M is also what the OS chooses to schedule (or not). M-centric execution traces will help here. The current situation gets particularly confusing when an M is stopping, and when it is re-starting. It has no P and no (user) G, so the traceEvCPUSample event isn't attached to anything in particular.

But, for your use case of knowing whether there's OS scheduler latency, if the P+M hasn't taken an exclusive lock on a G (by putting it in the _Grunnable state), then maybe it doesn't matter much if some particular M isn't scheduled by the OS?

Third, on SetCPUProfileRate. That particular function almost doesn't work, and might go away (#40094). On the other hand, that same concept is about to get a new home (#42502). But finally, it looks like "rate" won't continue to be sufficient to describe the meaning of CPU profile samples (#53286). So if there is to be a new tracing event to describe CPU profile samples, I think that event should have a clear path to how it'll work with perf_events or similar extensions.

The fourth thing on my mind is that the time that a P is in _Prunning and the total value of CPU profile samples that its M collects can diverge (on small scales) for other reasons. In runqgrab, there's a call to usleep(3) when a P is trying to steal the runnext goroutine from other Ps (on the fourth and final iteration of the stealWork loop) https://github.com/golang/go/blob/go1.20.5/src/runtime/proc.go#L6163 . That doesn't generate an event in the current execution tracer; it's part of the blank time between a P's Gs.

I'm not sure how much coordination can be expected between the code that asks for a CPU profile and the code that asks for an execution trace. It sounds like you're preparing for there to be little to none. I wonder if there's a path forward that involves accessing more runtime internals via the runtime/metrics package. That wouldn't allow the execution-trace-collecting code to be notified of changes to the CPU profiler config, but it could allow it to ask about things like "is the sysmon interval still 10ms". And then identify OS scheduler latency by looking for goroutines that appear to run for more than 20ms. As I understand it, Go's 10ms scheduling quanta are small relative to the OS's scheduling quanta (which for the case of CPU quota might be 100ms or 250ms), and that could be an advantage in detection.
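
The runtime/metrics read pattern for such a query could look roughly like this; the metric name below is hypothetical and does not exist today, it's shown only to illustrate where such a knob could surface:

```go
import "runtime/metrics"

// readSysmonInterval illustrates the runtime/metrics access pattern. The
// metric name below is hypothetical; no such metric exists today.
func readSysmonInterval() (float64, bool) {
	s := []metrics.Sample{{Name: "/sched/sysmon-interval:seconds"}}
	metrics.Read(s)
	if s[0].Value.Kind() != metrics.KindFloat64 {
		return 0, false // metric not provided by this runtime
	}
	return s[0].Value.Float64(), true
}
```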

felixge (Contributor, Author) commented on Jun 11, 2023

@rhysh thanks for your thoughtful analysis. In addition to my comments below, I've also added another use case to the issue description that is hopefully simpler: Extract CPU Profile.

> OS scheduler latency is frustrating, and I agree that better tools for tracking it down would be nice.

> First, on the problem of "how to detect OS scheduler latency". When I've seen it in execution traces (from quota exhaustion), it appears as if a goroutine is running on a P with no interruptions for 50+ milliseconds. (But, maybe it shows up in ways that I'm not currently able to see? Can you say more on how it appears in apps you support?)

I just started looking into this problem, so I haven't had a chance to analyze any real-world data yet. That being said, I suspect it will look similar to what you're describing.

> Historically speaking, and IIUC, the sysmon goroutine has been written to interrupt a G after it's been on a P for a full 10-millisecond time quantum of its own. A G might use the remainder of another goroutine's time quantum "for free" and then get a full time slice of its own, so in my experience CPU-bound goroutines often run for 9.99 + 10 = 19.99 ms in a row. It's possible this behavior has changed recently. But if not, that may be a somewhat reliable way to detect latency from the OS scheduler (though not guaranteed to be stable, including for operators who might tweak the sysmon settings).

Last time I looked at this, I saw goroutines run up to 25ms before being preempted on darwin, but this 5ms mismatch is likely due to usleep accuracy on that platform. But on a platform with accurate timers I'd also expect goroutines running for > 20ms in a row to be a sign of OS scheduling latency.

That being said, using such a heuristic feels brittle in the future, see #60693.

> Second, on another hurdle to using CPU profile events for this purpose. The traceEvCPUSample event doesn't include the M, and (on Linux, with timer_create) the 10 ms interval timer is associated with the M. The M is also what the OS chooses to schedule (or not). M-centric execution traces will help here. The current situation gets particularly confusing when an M is stopping, and when it is re-starting. It has no P and no (user) G, so the traceEvCPUSample event isn't attached to anything in particular.

> But, for your use case of knowing whether there's OS scheduler latency, if the P+M hasn't taken an exclusive lock on a G (by putting it in the _Grunnable state), then maybe it doesn't matter much if some particular M isn't scheduled by the OS?

Yeah. I'm not trying to infer the OS scheduling latency experienced by an M. I'm trying to understand the impact on Gs that are in _Grunning.

> Third, on SetCPUProfileRate. That particular function almost doesn't work, and might go away (#40094). On the other hand, that same concept is about to get a new home (#42502).

Yup.

> But finally, it looks like "rate" won't continue to be sufficient to describe the meaning of CPU profile samples (#53286). So if there is to be a new tracing event to describe CPU profile samples, I think that event should have a clear path to how it'll work with perf_events or similar extensions.

Good point. It seems like all of these events would still have a rate, so perhaps the only thing missing from my proposed traceEvCPUProfileRate event is a field to indicate the selected CPUProfileEvent?

> The fourth thing on my mind is that the time that a P is in _Prunning and the total value of CPU profile samples that its M collects can diverge (on small scales) for other reasons. In runqgrab, there's a call to usleep(3) when a P is trying to steal the runnext goroutine from other Ps (on the fourth and final iteration of the stealWork loop) https://github.com/golang/go/blob/go1.20.5/src/runtime/proc.go#L6163 . That doesn't generate an event in the current execution tracer; it's part of the blank time between a P's Gs.

Interesting. But during this usleep the P would not have a G in _Grunning state, would it? I'm trying to sum up _Grunning time rather than _Prunning time right now.

> I'm not sure how much coordination can be expected between the code that asks for a CPU profile and the code that asks for an execution trace. It sounds like you're preparing for there to be little to none.

I expect the code that starts the trace to also be the code that controls the CPU profile rate. I'm just worried that I can't guarantee this right now.

> I wonder if there's a path forward that involves accessing more runtime internals via the runtime/metrics package. That wouldn't allow the execution-trace-collecting code to be notified of changes to the CPU profiler config, but it could allow it to ask about things like "is the sysmon interval still 10ms". And then identify OS scheduler latency by looking for goroutines that appear to run for more than 20ms. As I understand it, Go's 10ms scheduling quanta are small relative to the OS's scheduling quanta (which for the case of CPU quota might be 100ms or 250ms), and that could be an advantage in detection.

I'm definitely open to exploring alternative solutions here, but as mentioned above, I'm worried that the concept of a scheduling quantum might go away when there are idle Ps, see #60693.

prattmic (Member) commented on Jun 12, 2023

Regarding OS scheduler latency, this could be a good opportunity to combine the Go execution tracer with external data, notably perf sched, which records OS scheduler behavior and can be used to understand scheduling latency.

https://github.com/google/schedviz/blob/master/doc/walkthrough.md is an example of a visualization tool for this data.

With this data, one could imagine the trace viewer having "M descheduled" events on a per-M view, as well as adding a per-CPU view.

mknyszek (Contributor) commented on Jun 12, 2023

Given that there may be better ways to identify OS scheduler latency, I think we should consider this in the context of a more general effort to make execution traces a superset of CPU profiles.

In that context, it might not make sense to have an event for the rate and it might be better to attach this information to a broader "CPUProfileStart" event. (A "CPUProfileStop" event would also be a fairly natural place to drop in some of the other information, for example an encoded form of the memory mapping.)
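
Sketched in the style of the existing event definitions, such a pair might look like this (names, numbers, and payloads are purely illustrative; nothing like this exists in the runtime today):

```go
traceEvCPUProfileStart = 50 // CPU profiler started [timestamp, rate in Hz, profiled event kind]
traceEvCPUProfileStop  = 51 // CPU profiler stopped [timestamp, optional metadata such as memory mappings]
```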

ianlancetaylor (Contributor) commented on Jun 14, 2023

I don't think this needs to go through the proposal process, so removing the proposal label. I think this is up to the runtime team.

added the NeedsInvestigation label on Jun 14, 2023
added this to the Backlog milestone on Jun 14, 2023