Description
Use Case 1: Extract CPU Profile
If an execution trace contains CPU sample events, it would be useful to extract a cpu/nanoseconds
CPU profile similar to the one produced by runtime.StartCPUProfile.
This can be useful when building tools. E.g. a "trace to CPU profile" tool. Or perhaps a tool for explaining the _Grunning time of a goroutine using the CPU samples for that goroutine. A naive solution would give equal weight to each collected CPU sample and stretch it over the sum of the _Grunning of the goroutine, but that could be misleading in case of scheduler latency, see below.
Use Case 2: Understand OS Scheduler Latency
A Go application might experience two types of scheduler latency: OS Scheduler Latency and Go Scheduler Latency. The latter can easily be analyzed using the execution tracer.
Detecting OS scheduler latency is a bit more tricky, but possible. Over a long enough time period, the cumulative time goroutines spend in the running state should converge to the cumulative number of traceEvCPUSample events multiplied by their sampling period (default 10ms). If there are significantly fewer traceEvCPUSample events than expected, that's a strong indicator that the application is not getting enough scheduling time from the OS. That's a common problem for some setups, so it'd be nice to use tracing data to detect it.
(There are some dragons here when it comes to CPU samples received during syscalls/cgo ... but I think that deserves a separate discussion)
Problem:
The traceEvCPUSample event does not include a value indicating how much CPU time it represents:
Line 74 in 39effbc
One could assume that it's always 10ms, but that won't work if the user calls runtime.SetCPUProfileRate. Unfortunately the execution trace does not record this value, and it's not possible to get the currently active value from userland either. Unlike SetMutexProfileFraction, SetCPUProfileRate does not return the previous value, and there is no GetCPUProfileRate method either.
Additionally, it's currently not possible to calculate the expected number of traceEvCPUSample events if the CPU profiler is not enabled for the entire duration of the trace.
Suggestion:
Add a new traceEvCPUProfileRate event that is recorded in the following cases:
- The tracer starts while the CPU profiler is already running.
- StartCPUProfile is called.
- StopCPUProfile is called (record 0).
Alternatively we could also have a start/stop event for the CPU profiler.
profiler: log cpuProfileRate when starting a trace
rhysh commented on Jun 9, 2023
OS scheduler latency is frustrating, and I agree that better tools for tracking it down would be nice.
First, on the problem of "how to detect OS scheduler latency". When I've seen it in execution traces (from quota exhaustion), it appears as if a goroutine is running on a P with no interruptions for 50+ milliseconds. (But, maybe it shows up in ways that I'm not currently able to see? Can you say more on how it appears in apps you support?)
Historically speaking, and IIUC, the sysmon goroutine has been written to interrupt a G after it's been on a P for a full 10-millisecond time quantum of its own. A G might use the remainder of another goroutine's time quantum "for free" and then get a full time slice of its own, so in my experience CPU-bound goroutines often run for 9.99 + 10 = 19.99 ms in a row. It's possible this behavior has changed recently. But if not, that may be a somewhat reliable way to detect latency from the OS scheduler (though not guaranteed to be stable, including for operators who might tweak the sysmon settings).
Second, on another hurdle to using CPU profile events for this purpose. The traceEvCPUSample event doesn't include the M, and (on Linux, with timer_create) the 10 ms interval timer is associated with the M. The M is also what the OS chooses to schedule (or not). M-centric execution traces will help here. The current situation gets particularly confusing when an M is stopping, and when it is re-starting. It has no P and no (user) G, so the traceEvCPUSample event isn't attached to anything in particular.
But, for your use case of knowing whether there's OS scheduler latency, if the P+M hasn't taken an exclusive lock on a G (by putting it in the _Grunnable state), then maybe it doesn't matter much if some particular M isn't scheduled by the OS?
Third, on SetCPUProfileRate. That particular function almost doesn't work, and might go away (#40094). On the other hand, that same concept is about to get a new home (#42502). But finally, it looks like "rate" won't continue to be sufficient to describe the meaning of CPU profile samples (#53286). So if there is to be a new tracing event to describe CPU profile samples, I think that event should have a clear path to how it'll work with perf_events or similar extensions.
The fourth thing on my mind is that the time a P spends in _Prunning and the total value of CPU profile samples that its M collects can diverge (on small scales) for other reasons. In runqgrab, there's a call to usleep(3) when a P is trying to steal the runnext goroutine from other Ps (on the fourth and final iteration of the stealWork loop): https://github.com/golang/go/blob/go1.20.5/src/runtime/proc.go#L6163 . That doesn't generate an event in the current execution tracer; it's part of the blank time between a P's Gs.
I'm not sure how much coordination can be expected between the code that asks for a CPU profile and the code that asks for an execution trace. It sounds like you're preparing for there to be little to none. I wonder if there's a path forward that involves accessing more runtime internals via the runtime/metrics package. That wouldn't allow the execution-trace-collecting code to be notified of changes to the CPU profiler config, but it could allow it to ask about things like "is the sysmon interval still 10ms". And then identify OS scheduler latency by looking for goroutines that appear to run for more than 20ms. As I understand it, Go's 10ms scheduling quanta are small relative to the OS's scheduling quanta (which for the case of CPU quota might be 100ms or 250ms), and that could be an advantage in detection.
felixge commented on Jun 11, 2023
@rhysh thanks for your thoughtful analysis. In addition to my comments below, I've also added another use case to the issue description that is hopefully simpler: Extract CPU Profile.
I just started looking into this problem, so I haven't had a chance to analyze any real-world data for this problem yet. That being said, I suspect it will look similar to what you're describing.
Last time I looked at this, I saw goroutines run up to 25ms before being preempted on darwin, but this 5ms mismatch is likely due to usleep accuracy on that platform. But on a platform with accurate timers I'd also expect goroutines running for > 20ms in a row to be a sign of OS scheduling latency.
That being said, using such a heuristic feels brittle in the future, see #60693.
Yeah. I'm not trying to infer the OS scheduling latency experienced by an M. I'm trying to understand the impact on Gs that are in _Grunning.
Yup.
Good point. It seems like all of these events would still have a rate, so perhaps the only thing missing from my proposed traceEvCPUProfileRate event is a field to indicate the selected CPUProfileEvent?
Interesting. But during this usleep the P would not have a G that is in _Grunning state itself? I'm trying to sum up _Grunning time rather than _Prunning time right now.
I expect the code that starts the trace to also be the code that controls the CPU profile rate. I'm just worried that I can't guarantee this right now.
I'm definitely open to exploring alternative solutions here, but as mentioned above, I'm worried that the concept of a scheduling quanta might go away when there are idle Ps, see #60693.
prattmic commented on Jun 12, 2023
Regarding OS scheduler latency, this could be a good opportunity to combine the Go execution tracer with external data, notably perf sched, which records OS scheduler behavior and can be used to understand scheduling latency. https://github.com/google/schedviz/blob/master/doc/walkthrough.md is an example visualization tool of this data.
With this data, one could imagine the trace viewer having "M descheduled" events on a per-M view, as well as adding a per-CPU view.
mknyszek commented on Jun 12, 2023
Given that there may be better ways to identify OS scheduler latency, I think we should consider this in the context of a more general effort to make execution traces a superset of CPU profiles.
In that context, it might not make sense to have an event for the rate and it might be better to attach this information to a broader "CPUProfileStart" event. (A "CPUProfileStop" event would also be a fairly natural place to drop in some of the other information, for example an encoded form of the memory mapping.)
ianlancetaylor commented on Jun 14, 2023
I don't think this needs to go through the proposal process, so removing the proposal label. I think this is up to the runtime team.