goexectrace: add tooling around capture of execution traces #146873

Draft · wants to merge 6 commits into master

Conversation

@tbg tbg commented May 16, 2025

Introduction

This PR is a prototype showcasing how we might integrate CockroachDB more closely with Go execution traces [1]. It is related to this internal doc [2].

As a reminder, Go execution tracing can be enabled on demand and, while enabled, causes the Go runtime to emit events for its garbage collection activity, goroutine scheduling decisions, and state transitions for each goroutine. Notably, this includes stack traces at each transition point, as well as causal edges whenever one goroutine wakes up another (for example via a channel send).
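
For reference, taking an execution trace directly via the standard library looks roughly like the minimal sketch below (this is essentially what hitting /debug/pprof/trace amounts to; the file name and duration are arbitrary):

```go
package main

import (
	"log"
	"os"
	"runtime/trace"
	"time"
)

func main() {
	// Create the output file for the execution trace.
	f, err := os.Create("exec.trace")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Start tracing; the runtime streams trace events to f until Stop is called.
	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	// Record for a fixed window, similar to /debug/pprof/trace?seconds=10.
	time.Sleep(10 * time.Second)
	trace.Stop()
	// The result can be inspected with `go tool trace exec.trace`.
}
```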

Execution traces contain a wealth of information and they have played an important role in several support escalations, in particular when diagnosing tail latencies. Recent releases of Go have significantly reduced the overhead of Go execution traces, and so it is feasible to reach for them more frequently.

Two primary use cases have been identified:

Understanding Statement Tail Latency

Typically, in this approach a statement diagnostics request is made. Due to the high overhead of (CockroachDB) tracing, only a proportion (say 1%) of statements matching the fingerprint is sampled. While the statement diagnostics request is pending, Go execution traces are continuously collected across all nodes in the cluster. Once a statement bundle has been created, the contained Jaeger trace is inspected for trace spans with unexplained high latency. Such a span yields a node, a time window, and goroutine(s) of interest. The overlapping execution trace(s) from that node can then be investigated and the goroutine(s) looked up within them.

The main problems with this approach are:

  • (TOIL) Customers need to manually collect execution traces. This is typically achieved using the pprof-loop script, but due to the need to present an authentication cookie and to orchestrate this script on each cluster node, it is a time-consuming and error-prone process.
  • (SPACE) Traces may need to be collected for extended periods of time, and potentially on 100s of nodes. Yet, most traces are not going to be needed in the end. Busy high-CPU systems may easily produce multiple megabytes per second of data, and yet customers are left to manage this on their own, which makes disk space issues more likely.

Understanding Spikes and Overload

This category of investigation is similar to the first, but we are interested in capturing "whatever we can" to understand why system-internal operations were not making the expected progress. For example, #147025 was filed recently as a consequence of an incident in which the liveness range leaseholder failed to extend its lease; another example is prototype #146650, which attempts to make instances of CPU overload more understandable by capturing an execution trace.

A further complication in this context is that the interesting behavior may well be in the past by the time an execution trace is solicited. Fortunately, the Go team recently added flight recording [3] to the arsenal, which, roughly speaking, maintains a suffix (say, the last 10s) of a continuously running Go execution trace and allows writing the current buffer out to a file on demand.
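
For orientation, using the flight recorder directly looks roughly like the sketch below. It assumes the experimental golang.org/x/exp/trace package; the exact package location and method set may shift as the proposal evolves.

```go
package main

import (
	"log"
	"os"
	"time"

	exptrace "golang.org/x/exp/trace" // experimental flight recorder package
)

func main() {
	// Keep roughly the last 10s of execution trace data buffered in memory.
	fr := exptrace.NewFlightRecorder()
	fr.SetPeriod(10 * time.Second)
	if err := fr.Start(); err != nil {
		log.Fatal(err)
	}
	defer fr.Stop()

	// ... detect something interesting, then snapshot the buffer to disk:
	f, err := os.Create("flight.trace")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if _, err := fr.WriteTo(f); err != nil {
		log.Fatal(err)
	}
}
```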

The flight recording facilities are a good foundation, but they need a layer of wrapping to be practical in CockroachDB:

  • (COALESCE) Only one dump of the execution trace can be pending at any given point in time, but multiple actors in CockroachDB might request one, resulting in spurious errors.
  • (STORE) There is no concept of storing and maintaining the resulting set of execution traces.

What's in this PR

This PR sketches out primitives to leverage execution traces while addressing the enumerated problems. Concretely, we introduce the CoalescedFlightRecorder type, which is a per-node singleton "owning" access to Go execution traces (both in "direct" mode and in flight recorder mode).

First up, CoalescedFlightRecorder (CFR) allows taking an execution trace of a specified duration, similar to hitting the /debug/pprof/trace endpoint.

Second, it also integrates with a wrapped flight recorder. Roughly speaking, any actor in the system that may potentially want a flight recording (for example, a "CPU overload" watcher), can obtain a Handle from the CFR, which will cause the internal flight recorder to start recording. If no Handles are registered on the CFR for some time, flight recording is disabled.
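
A rough sketch of the handle/reference-counting idea follows. The type and method names mirror the PR's CoalescedFlightRecorder and Handle, but the bodies are illustrative only; for one, the PR keeps the recorder running for a grace period after the last Handle is released rather than stopping it immediately.

```go
package goexectracesketch

import (
	"sync"

	exptrace "golang.org/x/exp/trace" // assumed flight recorder dependency
)

// CoalescedFlightRecorder is a per-node singleton owning the flight recorder.
// Sketch only; the real type in this PR also handles storage and dump coalescing.
type CoalescedFlightRecorder struct {
	mu   sync.Mutex
	refs int
	fr   *exptrace.FlightRecorder
}

// Handle represents one actor's registered interest in flight recording.
type Handle struct {
	cfr *CoalescedFlightRecorder
}

// NewHandle registers interest; the first handle starts the flight recorder.
func (c *CoalescedFlightRecorder) NewHandle() (*Handle, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.refs == 0 {
		if err := c.fr.Start(); err != nil {
			return nil, err
		}
	}
	c.refs++
	return &Handle{cfr: c}, nil
}

// Close releases the handle; once the last handle is gone, recording can stop.
func (h *Handle) Close() {
	c := h.cfr
	c.mu.Lock()
	defer c.mu.Unlock()
	c.refs--
	if c.refs == 0 {
		_ = c.fr.Stop()
	}
}
```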

A Handle allows the holder to dump the flight recorder's execution trace. Recall that an execution trace is always available while a Handle is held and will contain information about the present and its immediate history (though it may only reach back to the creation of the handle, or to the internally configured maximum duration or size). Concurrent requests to dump the trace (whether on one Handle or different Handles) are coalesced onto each other, meaning that a Handle can be used without the potential for spurious errors (which would result from using a naked flight recorder). Importantly, synchronous dump calls can also coalesce onto synchronous execution trace requests, removing another source of spurious failures. Also of note, Handle.Dump is careful not to block the caller; instead, the caller immediately receives the location of the file as well as a Future it can consult to wait for the operation to finish.
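
The coalescing behavior might look roughly as follows. This is a standalone sketch with made-up names (coalescingDumper, Future.Wait), not the PR's actual code: concurrent callers attach to the dump already in flight instead of racing for the single trace consumer.

```go
package goexectracesketch

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
	"sync"
	"time"
)

// Future lets a caller wait for an asynchronous dump to complete.
type Future struct {
	done chan struct{}
	err  error
}

// Wait blocks until the dump has finished and returns its error, if any.
func (f *Future) Wait() error { <-f.done; return f.err }

// coalescingDumper serializes dump requests against anything that can write
// out an execution trace (e.g. a flight recorder's WriteTo method).
type coalescingDumper struct {
	mu          sync.Mutex
	writeTo     func(io.Writer) (int64, error)
	dir         string
	pending     *Future
	pendingPath string
}

// Dump returns immediately with the destination path and a Future. If a dump
// is already in flight, the caller is attached to it instead of starting a
// second, conflicting one.
func (d *coalescingDumper) Dump() (string, *Future) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.pending != nil {
		return d.pendingPath, d.pending
	}
	fut := &Future{done: make(chan struct{})}
	path := filepath.Join(d.dir, fmt.Sprintf("exectrace.%d.bin", time.Now().UnixNano()))
	d.pending, d.pendingPath = fut, path
	go func() {
		defer close(fut.done) // signal waiters last, after err is set
		f, err := os.Create(path)
		if err == nil {
			_, err = d.writeTo(f)
			if cerr := f.Close(); err == nil {
				err = cerr
			}
		}
		fut.err = err
		d.mu.Lock()
		d.pending, d.pendingPath = nil, ""
		d.mu.Unlock()
	}()
	return path, fut
}
```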

Third, the CFR cleanly abstracts its storage to the degree that we can integrate it with the storage mechanisms already used by the various "observability dumpers" (cpu profile, heap profile, etc). Files created by the CFR (as hooked up in this PR) go into an executiontrace_dump subdirectory and are subject to retention policies specified by cluster settings. They are also available through the existing APIs we expose to download observability artifacts (used for example by debug zip).
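
The PR hooks into the existing observability dump storage rather than rolling its own, but in spirit the retention policy amounts to something like this sketch (pruning by age and total size here for illustration; the function name and parameters are made up, and the real limits come from cluster settings):

```go
package goexectracesketch

import (
	"os"
	"path/filepath"
	"sort"
	"time"
)

// pruneDumps removes the oldest files in dir until the remaining files fit
// within maxAge and maxTotalBytes. Sketch only.
func pruneDumps(dir string, maxAge time.Duration, maxTotalBytes int64) error {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return err
	}
	type dumpFile struct {
		path string
		mod  time.Time
		size int64
	}
	var files []dumpFile
	var total int64
	for _, e := range entries {
		info, err := e.Info()
		if err != nil {
			continue
		}
		files = append(files, dumpFile{filepath.Join(dir, e.Name()), info.ModTime(), info.Size()})
		total += info.Size()
	}
	// Consider the oldest files for deletion first.
	sort.Slice(files, func(i, j int) bool { return files[i].mod.Before(files[j].mod) })
	for _, f := range files {
		if time.Since(f.mod) <= maxAge && total <= maxTotalBytes {
			break
		}
		if err := os.Remove(f.path); err == nil {
			total -= f.size
		}
	}
	return nil
}
```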

Fourth, the singleton CFR is wrapped in a newly introduced profiler.ExecutionTracer (mirroring profiler.{Heap,CPU}Profiler), which can take periodic execution traces (through the CFR). This enables continuously collecting execution traces via a cluster setting, and also applies the retention policies to them.
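
In spirit, the periodic collection loop is something like the following; runPeriodicTraces and TraceFor are made-up names standing in for the ExecutionTracer loop and the CFR's fixed-duration trace method, with the interval driven by server.execution_tracer.interval:

```go
package goexectracesketch

import (
	"context"
	"log"
	"time"
)

// fixedDurationTracer stands in for whatever takes a fixed-duration execution
// trace; in this PR that would be the CoalescedFlightRecorder (the method name
// here is hypothetical).
type fixedDurationTracer interface {
	TraceFor(ctx context.Context, d time.Duration) error
}

// runPeriodicTraces takes back-to-back execution traces of the given duration,
// pausing `interval` between them, until the context is canceled.
func runPeriodicTraces(ctx context.Context, tr fixedDurationTracer, interval, duration time.Duration) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-time.After(interval):
		}
		if err := tr.TraceFor(ctx, duration); err != nil {
			log.Printf("periodic execution trace failed: %v", err)
		}
	}
}
```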

Fifth, the CFR is made available to the remainder of the system via the StoreConfig (and could be plumbed wherever else it needs to go). With this in place, new uses of flight recorder execution traces can be added across the system with minimal friction.

How this solves the identified problems

  • (TOIL) SET CLUSTER SETTING server.execution_tracer.interval='1ms' now results in continuous execution tracing on all nodes (ignoring the inconsequential 1ms pause between consecutive traces).
  • (SPACE+STORE): execution traces are stored in a central location (per node), are retrievable via APIs, and have a size- and count-based retention policy applied to them (configurable via cluster settings). Further development could ensure that we more aggressively remove files that turn out not to be relevant.
  • (COALESCE): the CFR allows multiple callers to request a dump at the same time, and they will be pointed at a single file.

Next Steps

I'm not signing up to push this over the finish line. Instead, I want to show that with only a little more effort, we can make significant strides toward better integration with execution traces. I'm happy to help move this along in a consulting fashion, and lend a hand with reviews.

A strategy for landing and extending this could be:

  • first, land the part that enables continuous execution tracing via the cluster setting (i.e. roughly what's here, minus using the Handle anywhere)
  • add a mechanism that protects "interesting" traces (or at least upgrades their retention priority).
  • various teams can adopt Handle. Traces created via Handle are high-priority for retention.
  • integrate with statement bundles: enable continuous execution traces whenever a statement bundle is pending. When a statement bundle is created, revisit the involved nodes, and dump inflight recordings (via Handle). Retrieve them via the files API and embed them in the statement bundle.

In parallel to this, execution traces could become more intertwined with tracing and logging. For example, CRDB trace spans could create trace "regions" or "tasks", and events/messages logged to a CRDB trace span could also be logged into the execution trace.
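
The standard runtime/trace annotation APIs that such an integration would build on look like this (the function, operation, and category names below are made up for illustration):

```go
package goexectracesketch

import (
	"context"
	"runtime/trace"
)

// evaluateRequest shows how a CRDB trace span could be mirrored into the
// execution trace using tasks, regions, and log messages.
func evaluateRequest(ctx context.Context, op string) {
	// A task groups all work for one logical operation, across goroutines.
	ctx, task := trace.NewTask(ctx, op)
	defer task.End()

	// A region marks a time window on the current goroutine.
	defer trace.StartRegion(ctx, "evaluate").End()

	// Messages logged to a CRDB trace span could also be mirrored here.
	trace.Log(ctx, "event", "evaluation started")

	// ... actual evaluation ...
}
```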

Comments on the code in this PR:

  • The (name, Future) API is not going to be helpful if name is collected by a later call to the node (as one could imagine might happen in distributed tracing). The Future won't be accessible at that time, but the file may still be pending writes. This could perhaps be solved by writing the file to a temporary location and renaming it only on completion, plus a bit of wait time on collection. But that feels smelly.
  • CFR tries really hard not to have goroutines running when no flight recording is active. But this results in relatively complex concurrency handling and may not be worth the trouble, especially if in practice we expect the flight recorder to be enabled all of the time.
  • It's possible that synchronous traces just aren't necessary. If the flight recorder reliably captures 10s of the past, then instead of recording "forward" for 10s, we can sleep 10s and then dump the flight recorder. This may be good enough for ~all purposes and would serve as a simplification.
  • the /debug/pprof/trace endpoint has not been integrated with the CFR. As a result, while the endpoint is being hit, the CFR will not work, and vice versa (the runtime allows only one consumer of execution traces at a time).
  • the example Handle integration in Replica is completely untested (other than that it doesn't crash the server) and has obvious deficiencies. Doing this properly requires a little reworking of the Replica lifecycle (to have better triggers for handle acquisition and release), or simply making this conditional on always enabling inflight execution tracing [4].
  • no performance/overhead testing has been performed.
  • there's an earlier prototype for integrating the flight recorder: "tracing: add execution tracing" #138013. It doesn't integrate with the storage and doesn't try to solve all of the problems identified here, but it does do a few integrations with flight recording that may be useful to also look at.

Informs #97215.

Epic: none

Footnotes

  1. https://go.dev/blog/execution-traces-2024

  2. https://docs.google.com/document/d/1i2b7xdXVXR7ETm_Rh1x0qt-olriJVUrmSp32KMdp5YA/edit?tab=t.0

  3. https://github.com/golang/go/issues/63185

  4. Come to think of it, the tenant metrics ref could already have this figured out: when we initialize a replica, we acquire a tenant ref in setDescLockedRaftMuLocked. When we destroy a replica, we have to reliably release that ref again. That happens in postDestroyRaftMuLocked. It's not easy to convince yourself that any of the many paths that lead to replica destruction hit that method, but it seems that they do.


@tbg changed the title from "goexectrace: add CoalescedFlightRecorder" to "goexectrace: add tooling around capture of execution traces" on May 21, 2025
tbg added 4 commits May 21, 2025 17:47
This is a wrapper around trace.FlightRecorder that allows multiple callers to
request execution traces concurrently.

This is set up so that

```
set cluster setting server.execution_tracer.interval='1ms'
```

can replace pprof-loop by taking continuous execution traces. The retention
policies keep 24h worth of traces (subject to an additional disk usage
restriction of 4GiB).
tbg commented May 28, 2025

In scale testing, we are seeing blips around high fsync latency and jumpy goroutine count. Cause and effect unclear. Flight recorder dumping at this point in time could have been instructive. Ref

craig bot pushed a commit that referenced this pull request Jun 4, 2025
147756: kvserver: add useful execution trace regions r=tbg a=tbg

This PR adds regions for the range on which a BatchRequest executes, as well as regions indicating the method of each request while it's being evaluated.

Some samples from running this script:

```bash
#!/bin/bash
set -euo pipefail

c=local-tpcc
wh=10
(roachprod create -n 4 $c || true) &> /dev/null
roachprod stop $c
roachprod put $c cockroach
roachprod start $c:1-3
roachprod run $c:4 -- "if [ ! -f loaded ]; then ./cockroach workload init tpcc --warehouses $wh {pgurl:1-3} && touch loaded; fi"
roachprod run $c:4 -- ./cockroach workload run tpcc --warehouses $wh --concurrency 10 {pgurl:1-3}
```

and then

```
curl -o regiontrace.bin "$(roachprod adminui local-tpcc:1)debug/pprof/trace?seconds=10"
gotraceui regiontrace.bin
```

<img width="735" alt="image" src="https://github.com/user-attachments/assets/938b871b-c5c5-4960-9cd2-420ade908a2b" />

<img width="992" alt="image" src="https://github.com/user-attachments/assets/012ed3c6-75d6-4fde-b00e-f0bf731f172e" />

Related to #146873.

Epic: none
Release note: none


Co-authored-by: Tobias Grieger <[email protected]>