goexectrace: add tooling around capture of execution traces #146873

Draft · wants to merge 6 commits into master

Conversation

@tbg tbg commented May 16, 2025

Introduction

This PR is a prototype showcasing how we might integrate CockroachDB more closely with Go execution traces [1]. It is related to this internal doc [2].

As a reminder, Go execution tracing can be enabled on demand and, while enabled, causes the Go runtime to emit events for its garbage collection activity, goroutine scheduling decisions, and state transitions for each goroutine. Notably, this includes stack traces at each transition point, as well as causal edges whenever one goroutine wakes up another (for example via a channel send).
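
For reference, taking an execution trace directly via the standard library looks roughly like the minimal sketch below (this is essentially what hitting /debug/pprof/trace amounts to; the file name and duration are arbitrary):

```go
package main

import (
	"log"
	"os"
	"runtime/trace"
	"time"
)

func main() {
	// Create the output file for the execution trace.
	f, err := os.Create("exec.trace")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Start tracing; the runtime streams trace events to f until Stop is called.
	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	// Record for a fixed window, similar to /debug/pprof/trace?seconds=10.
	time.Sleep(10 * time.Second)
	trace.Stop()
	// The result can be inspected with `go tool trace exec.trace`.
}
```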

Execution traces contain a wealth of information and they have played an important role in several support escalations, in particular when diagnosing tail latencies. Recent releases of Go have significantly reduced the overhead of Go execution traces, and so it is feasible to reach for them more frequently.

Two primary use cases have been identified:

Understanding Statement Tail Latency

Typically, in this approach a statement diagnostics request is made. Due to the high overhead of (CockroachDB) tracing, only a proportion (say 1%) of statements matching the fingerprint is sampled. While the statement diagnostics request is pending, Go execution traces are continuously collected across all nodes in the cluster. Once a statement bundle has been created, the contained Jaeger trace is inspected for trace spans with unexplained high latency. Such a span yields a node, a time window, and goroutine(s) of interest. The overlapping execution trace(s) from that node can then be investigated and the goroutine(s) looked up within them.

The main problems with this approach are:

  • (TOIL) Customers need to manually collect execution traces. This is typically achieved using the pprof-loop script, but due to the need to present an authentication cookie and to orchestrate this script on each cluster node, it is a time-consuming and error-prone process.
  • (SPACE) Traces may need to be collected for extended periods of time, and potentially on 100s of nodes. Yet, most traces are not going to be needed in the end. Busy high-CPU systems may easily produce multiple megabytes per second of data, and yet customers are left to manage this on their own, which makes disk space issues more likely.

Understanding Spikes and Overload

This category of investigation is similar to the first, but we are interested in capturing "whatever we can" to understand why system-internal operations were not making the expected progress. For example, #147025 was filed recently as a consequence of an incident in which the liveness range leaseholder failed to extend its lease; another example is prototype #146650, which attempts to make instances of CPU overload more understandable by capturing an execution trace.

A further complication in this context is that the interesting behavior may well be in the past by the time an execution trace is solicited. Fortunately, the Go team recently added flight recording [3] to the arsenal, which, roughly speaking, maintains a suffix (say, the last 10s) of a continuously running Go execution trace and allows writing the current buffer out to a file on demand.
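
For orientation, using the flight recorder directly looks roughly like the sketch below. It assumes the experimental golang.org/x/exp/trace package; the exact package location and method set may shift as the proposal evolves.

```go
package main

import (
	"log"
	"os"
	"time"

	exptrace "golang.org/x/exp/trace" // experimental flight recorder package
)

func main() {
	// Keep roughly the last 10s of execution trace data buffered in memory.
	fr := exptrace.NewFlightRecorder()
	fr.SetPeriod(10 * time.Second)
	if err := fr.Start(); err != nil {
		log.Fatal(err)
	}
	defer fr.Stop()

	// ... detect something interesting, then snapshot the buffer to disk:
	f, err := os.Create("flight.trace")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if _, err := fr.WriteTo(f); err != nil {
		log.Fatal(err)
	}
}
```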

The flight recording facilities are a good foundation, but they need a layer of wrapping to be practical in CockroachDB:

  • (COALESCE) Only one dump of the execution trace can be pending at any given point in time, but multiple actors in CockroachDB might request one, resulting in spurious errors.
  • (STORE) There is no concept of storing and maintaining the resulting set of execution traces.

What's in this PR

This PR sketches out primitives to leverage execution traces while addressing the enumerated problems. Concretely, we introduce the CoalescedFlightRecorder type, which is a per-node singleton "owning" access to Go execution traces (both in "direct" mode and in flight recorder mode).

First up, CoalescedFlightRecorder (CFR) allows taking an execution trace of a specified duration, similar to hitting the /debug/pprof/trace endpoint.

Second, it also integrates with a wrapped flight recorder. Roughly speaking, any actor in the system that may potentially want a flight recording (for example, a "CPU overload" watcher), can obtain a Handle from the CFR, which will cause the internal flight recorder to start recording. If no Handles are registered on the CFR for some time, flight recording is disabled.
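
A rough sketch of the handle/reference-counting idea follows. The type and method names mirror the PR's CoalescedFlightRecorder and Handle, but the bodies are illustrative only; for one, the PR keeps the recorder running for a grace period after the last Handle is released rather than stopping it immediately.

```go
package goexectracesketch

import (
	"sync"

	exptrace "golang.org/x/exp/trace" // assumed flight recorder dependency
)

// CoalescedFlightRecorder is a per-node singleton owning the flight recorder.
// Sketch only; the real type in this PR also handles storage and dump coalescing.
type CoalescedFlightRecorder struct {
	mu   sync.Mutex
	refs int
	fr   *exptrace.FlightRecorder
}

// Handle represents one actor's registered interest in flight recording.
type Handle struct {
	cfr *CoalescedFlightRecorder
}

// NewHandle registers interest; the first handle starts the flight recorder.
func (c *CoalescedFlightRecorder) NewHandle() (*Handle, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.refs == 0 {
		if err := c.fr.Start(); err != nil {
			return nil, err
		}
	}
	c.refs++
	return &Handle{cfr: c}, nil
}

// Close releases the handle; once the last handle is gone, recording can stop.
func (h *Handle) Close() {
	c := h.cfr
	c.mu.Lock()
	defer c.mu.Unlock()
	c.refs--
	if c.refs == 0 {
		_ = c.fr.Stop()
	}
}
```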

A Handle allows the holder to dump the flight recorder's execution trace. Recall that an execution trace is always available while a Handle is held and will contain information about the present and its immediate history (though it may only reach back to the creation of the handle, or to the internally configured maximum duration or size). Concurrent requests to dump the trace (whether on one Handle or different Handles) are coalesced onto each other, meaning that a Handle can be used without the potential for spurious errors (which would result from using a naked flight recorder). Importantly, synchronous dump calls can also coalesce onto synchronous execution trace requests, removing another source of spurious failures. Also of note, Handle.Dump is careful not to block the caller; instead, the caller immediately receives the location of the file as well as a Future it can consult to wait for the operation to finish.
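
The coalescing behavior might look roughly as follows. This is a standalone sketch with made-up names (coalescingDumper, Future.Wait), not the PR's actual code: concurrent callers attach to the dump already in flight instead of racing for the single trace consumer.

```go
package goexectracesketch

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
	"sync"
	"time"
)

// Future lets a caller wait for an asynchronous dump to complete.
type Future struct {
	done chan struct{}
	err  error
}

// Wait blocks until the dump has finished and returns its error, if any.
func (f *Future) Wait() error { <-f.done; return f.err }

// coalescingDumper serializes dump requests against anything that can write
// out an execution trace (e.g. a flight recorder's WriteTo method).
type coalescingDumper struct {
	mu          sync.Mutex
	writeTo     func(io.Writer) (int64, error)
	dir         string
	pending     *Future
	pendingPath string
}

// Dump returns immediately with the destination path and a Future. If a dump
// is already in flight, the caller is attached to it instead of starting a
// second, conflicting one.
func (d *coalescingDumper) Dump() (string, *Future) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.pending != nil {
		return d.pendingPath, d.pending
	}
	fut := &Future{done: make(chan struct{})}
	path := filepath.Join(d.dir, fmt.Sprintf("exectrace.%d.bin", time.Now().UnixNano()))
	d.pending, d.pendingPath = fut, path
	go func() {
		defer close(fut.done) // signal waiters last, after err is set
		f, err := os.Create(path)
		if err == nil {
			_, err = d.writeTo(f)
			if cerr := f.Close(); err == nil {
				err = cerr
			}
		}
		fut.err = err
		d.mu.Lock()
		d.pending, d.pendingPath = nil, ""
		d.mu.Unlock()
	}()
	return path, fut
}
```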

Third, the CFR cleanly abstracts its storage to the degree that we can integrate it with the storage mechanisms already used by the various "observability dumpers" (cpu profile, heap profile, etc). Files created by the CFR (as hooked up in this PR) go into an executiontrace_dump subdirectory and are subject to retention policies specified by cluster settings. They are also available through the existing APIs we expose to download observability artifacts (used for example by debug zip).
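
The PR hooks into the existing observability dump storage rather than rolling its own, but in spirit the retention policy amounts to something like this sketch (pruning by age and total size here for illustration; the function name and parameters are made up, and the real limits come from cluster settings):

```go
package goexectracesketch

import (
	"os"
	"path/filepath"
	"sort"
	"time"
)

// pruneDumps removes the oldest files in dir until the remaining files fit
// within maxAge and maxTotalBytes. Sketch only.
func pruneDumps(dir string, maxAge time.Duration, maxTotalBytes int64) error {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return err
	}
	type dumpFile struct {
		path string
		mod  time.Time
		size int64
	}
	var files []dumpFile
	var total int64
	for _, e := range entries {
		info, err := e.Info()
		if err != nil {
			continue
		}
		files = append(files, dumpFile{filepath.Join(dir, e.Name()), info.ModTime(), info.Size()})
		total += info.Size()
	}
	// Consider the oldest files for deletion first.
	sort.Slice(files, func(i, j int) bool { return files[i].mod.Before(files[j].mod) })
	for _, f := range files {
		if time.Since(f.mod) <= maxAge && total <= maxTotalBytes {
			break
		}
		if err := os.Remove(f.path); err == nil {
			total -= f.size
		}
	}
	return nil
}
```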

Fourth, the singleton CFR is wrapped in a newly introduced profiler.ExecutionTracer (mirroring profiler.{Heap,CPU}Profiler), which can take periodic execution traces (through the CFR). This enables continuously collecting execution traces via a cluster setting, and also applies the retention policies to them.
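
In spirit, the periodic collection loop is something like the following; runPeriodicTraces and TraceFor are made-up names standing in for the ExecutionTracer loop and the CFR's fixed-duration trace method, with the interval driven by server.execution_tracer.interval:

```go
package goexectracesketch

import (
	"context"
	"log"
	"time"
)

// fixedDurationTracer stands in for whatever takes a fixed-duration execution
// trace; in this PR that would be the CoalescedFlightRecorder (the method name
// here is hypothetical).
type fixedDurationTracer interface {
	TraceFor(ctx context.Context, d time.Duration) error
}

// runPeriodicTraces takes back-to-back execution traces of the given duration,
// pausing `interval` between them, until the context is canceled.
func runPeriodicTraces(ctx context.Context, tr fixedDurationTracer, interval, duration time.Duration) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-time.After(interval):
		}
		if err := tr.TraceFor(ctx, duration); err != nil {
			log.Printf("periodic execution trace failed: %v", err)
		}
	}
}
```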

Fifth, the CFR is made available to the remainder of the system via the StoreConfig (and could be plumbed wherever else it needs to go). With this in place, new uses of flight recorder execution traces can be added across the system with minimal friction.

How this solves the identified problems

  • (TOIL) SET CLUSTER SETTING server.execution_tracer.interval='1ms' now results in continuous execution tracing on all nodes (ignoring the inconsequential 1ms pause between consecutive traces).
  • (SPACE+STORE): execution traces are stored in a central location (per node), are retrievable via APIs, and have a size- and count-based retention policy applied to them (configurable via cluster settings). Further development could ensure that we more aggressively remove files that turn out not to be relevant.
  • (COALESCE): the CFR allows multiple callers to request a dump at the same time, and they will be pointed at a single file.

Next Steps

I'm not signing up to push this over the finish line. Instead, I want to show that with only a little more effort, we can make significant strides toward better integration with execution traces. I'm happy to help move this along in a consulting fashion, and lend a hand with reviews.

A strategy for landing and extending this could be:

  • first, land the part that enables continuous execution tracing via the cluster setting (i.e. roughly what's here, minus using the Handle anywhere)
  • add a mechanism that protects "interesting" traces (or at least upgrades their retention priority).
  • various teams can adopt Handle. Traces created via Handle are high-priority for retention.
  • integrate with statement bundles: enable continuous execution traces whenever a statement bundle is pending. When a statement bundle is created, revisit the involved nodes, and dump inflight recordings (via Handle). Retrieve them via the files API and embed them in the statement bundle.

In parallel to this, execution traces could become more intertwined with tracing and logging. For example, CRDB trace spans could create trace "regions" or "tasks", and events/messages logged to a CRDB trace span could also be logged into the execution trace.
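
The standard runtime/trace annotation APIs that such an integration would build on look like this (the function, operation, and category names below are made up for illustration):

```go
package goexectracesketch

import (
	"context"
	"runtime/trace"
)

// evaluateRequest shows how a CRDB trace span could be mirrored into the
// execution trace using tasks, regions, and log messages.
func evaluateRequest(ctx context.Context, op string) {
	// A task groups all work for one logical operation, across goroutines.
	ctx, task := trace.NewTask(ctx, op)
	defer task.End()

	// A region marks a time window on the current goroutine.
	defer trace.StartRegion(ctx, "evaluate").End()

	// Messages logged to a CRDB trace span could also be mirrored here.
	trace.Log(ctx, "event", "evaluation started")

	// ... actual evaluation ...
}
```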

Comments on the code in this PR:

  • The (name, Future) API is not going to be helpful if name is collected by a later call to the node (as one could imagine might happen in distributed tracing). The Future won't be accessible at that time, but the file may still be pending writes. This could perhaps be solved by writing the file to a temporary location and renaming it only on completion, plus a bit of wait time on collection. But that feels smelly.
  • CFR tries really hard not to have goroutines running when no flight recording is active. But this results in relatively complex concurrency handling and may not be worth the trouble, especially if in practice we expect the flight recorder to be enabled all of the time.
  • It's possible that synchronous traces just aren't necessary. If the flight recorder reliably captures 10s of the past, then instead of recording "forward" for 10s, we can sleep 10s and then dump the flight recorder. This may be good enough for ~all purposes and would serve as a simplification.
  • the /debug/pprof/trace endpoint has not been integrated with the CFR. As a result, while the endpoint is being hit, the CFR will not work, and vice versa (the runtime allows only one consumer of execution traces at a time).
  • the example Handle integration in Replica is completely untested (other than that it doesn't crash the server) and has obvious deficiencies. Doing this properly requires a little reworking of the Replica lifecycle (to have better triggers for handle acquisition and release), or simply making this conditional on always enabling inflight execution tracing [4].
  • no performance/overhead testing has been performed.
  • there's an earlier prototype for integrating the flight recorder: "tracing: add execution tracing" #138013. It doesn't integrate with the storage and doesn't try to solve all of the problems identified here, but it does do a few integrations with flight recording that may be useful to also look at.

Informs #97215.

Epic: none

Footnotes

  1. https://go.dev/blog/execution-traces-2024

  2. https://docs.google.com/document/d/1i2b7xdXVXR7ETm_Rh1x0qt-olriJVUrmSp32KMdp5YA/edit?tab=t.0

  3. https://github.com/golang/go/issues/63185

  4. Come to think of it, the tenant metrics ref could already have this figured out: when we initialize a replica, we acquire a tenant ref in setDescLockedRaftMuLocked. When we destroy a replica, we have to reliably release that ref again. That happens in postDestroyRaftMuLocked. It's not easy to convince yourself that any of the many paths that lead to replica destruction hit that method, but it seems that they do.


@tbg changed the title from "goexectrace: add CoalescedFlightRecorder" to "goexectrace: add tooling around capture of execution traces" on May 21, 2025
tbg added 4 commits May 21, 2025 17:47
This is a wrapper around trace.FlightRecorder that allows multiple callers to
request execution traces concurrently.

This is set up so that

```
set cluster setting server.execution_tracer.interval='1ms'
```

can replace pprof-loop by taking continuous execution traces. The retention
policies keep 24h worth of traces (subject to an additional disk usage
restriction of 4GiB).
tbg commented May 28, 2025

In scale testing, we are seeing blips around high fsync latency and jumpy goroutine count. Cause and effect unclear. Flight recorder dumping at this point in time could have been instructive. Ref

craig bot pushed a commit that referenced this pull request Jun 4, 2025
147756: kvserver: add useful execution trace regions r=tbg a=tbg

This PR adds regions for the range on which a BatchRequest executes, as well as regions indicating the method of each request while it's being evaluated.

Some samples from running this script:

```bash
#!/bin/bash
set -euo pipefail

c=local-tpcc
wh=10
(roachprod create -n 4 $c || true) &> /dev/null
roachprod stop $c
roachprod put $c cockroach
roachprod start $c:1-3
roachprod run $c:4 -- "if [ ! -f loaded ]; then ./cockroach workload init tpcc --warehouses $wh {pgurl:1-3} && touch loaded; fi"
roachprod run $c:4 -- ./cockroach workload run tpcc --warehouses $wh --concurrency 10 {pgurl:1-3}
```

and then

```
curl -o regiontrace.bin "$(roachprod adminui local-tpcc:1)debug/pprof/trace?seconds=10"
gotraceui regiontrace.bin
```

<img width="735" alt="image" src="https://github.com/user-attachments/assets/938b871b-c5c5-4960-9cd2-420ade908a2b" />

<img width="992" alt="image" src="https://github.com/user-attachments/assets/012ed3c6-75d6-4fde-b00e-f0bf731f172e" />

Related to #146873.

Epic: none
Release note: none


Co-authored-by: Tobias Grieger <[email protected]>