goexectrace: add tooling around capture of execution traces #146873
Draft: tbg wants to merge 6 commits into cockroachdb:master from tbg:flightrec (+887 −28)
Conversation
This is a wrapper around trace.FlightRecorder that allows multiple callers to request execution traces concurrently.

This is set up so that `set cluster setting server.execution_tracer.interval='1ms'` can replace pprof-loop by taking continuous execution traces. The retention policies keep 24h worth of traces (subject to an additional disk usage restriction of 4GiB).

In scale testing, we are seeing blips around high fsync latency and a jumpy goroutine count; cause and effect are unclear. Dumping the flight recorder at such a point in time could have been instructive.
craig bot pushed a commit that referenced this pull request on Jun 4, 2025:
147756: kvserver: add useful execution trace regions r=tbg a=tbg

This PR adds regions for the range on which a BatchRequest executes, as well as regions indicating the method of each request while it's being evaluated. Some samples from running this script:

```bash
#!/bin/bash
set -euo pipefail

c=local-tpcc
wh=10

(roachprod create -n 4 $c || true) &> /dev/null
roachprod stop $c
roachprod put $c cockroach
roachprod start $c:1-3
roachprod run $c:4 -- "if [ ! -f loaded ]; then ./cockroach workload init tpcc --warehouses $wh {pgurl:1-3} && touch loaded; fi"
roachprod run $c:4 -- ./cockroach workload run tpcc --warehouses $wh --concurrency 10 {pgurl:1-3}
```

and then

```
curl -o regiontrace.bin "$(roachprod adminui local-tpcc:1)debug/pprof/trace?seconds=10"
gotraceui regiontrace.bin
```

![image](https://github.com/user-attachments/assets/938b871b-c5c5-4960-9cd2-420ade908a2b)
![image](https://github.com/user-attachments/assets/012ed3c6-75d6-4fde-b00e-f0bf731f172e)

Related to #146873.

Epic: none
Release note: none

Co-authored-by: Tobias Grieger <[email protected]>
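The regions added by that commit build on the standard `runtime/trace` API. A minimal sketch of the pattern follows; the function name and hook point here are hypothetical stand-ins, not the actual kvserver code:

```go
package main

import (
	"context"
	"runtime/trace"
)

// evaluateWithRegion wraps the evaluation of a single request in a trace
// region named after the request's method, so that the work shows up as a
// labeled span in viewers such as gotraceui. The signature is invented for
// illustration.
func evaluateWithRegion(ctx context.Context, method string, eval func(context.Context) error) error {
	var err error
	trace.WithRegion(ctx, method, func() {
		err = eval(ctx)
	})
	return err
}

func main() {
	_ = evaluateWithRegion(context.Background(), "Get", func(ctx context.Context) error {
		return nil // request evaluation would happen here
	})
}
```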
Introduction
This PR is a prototype showcasing how we might meaningfully integrate CockroachDB more closely with Go execution traces[^1]. It is related to this internal doc[^2].
As a reminder, Go execution tracing can be enabled on demand and, while enabled, causes the Go runtime to emit events for its garbage collection activity, its goroutine scheduling decisions, and the state transitions of each goroutine. Notably, this includes stack traces at each transition point, as well as causal event edges whenever one goroutine wakes up another (for example via a channel send).
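For reference, the simplest way to capture such a trace programmatically is via the standard `runtime/trace` package; a minimal, self-contained example:

```go
package main

import (
	"log"
	"os"
	"runtime/trace"
	"time"
)

func main() {
	f, err := os.Create("trace.bin")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// While tracing is enabled, the runtime records GC activity, scheduling
	// decisions, and per-goroutine state transitions into f.
	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	time.Sleep(5 * time.Second) // stand-in for the workload of interest
	trace.Stop()

	// Inspect with: go tool trace trace.bin
}
```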
Execution traces contain a wealth of information, and they have played an important role in several support escalations, in particular when diagnosing tail latencies. Recent releases of Go have significantly reduced the overhead of execution tracing, making it feasible to reach for it more frequently.
Two primary use cases have been identified:
Understanding Statement Tail Latency
Typically, in this approach a statement diagnostics request is made. Due to the high overhead of (CockroachDB) tracing, only a proportion (say 1%) of statements matching the fingerprint is sampled. While the statement diagnostics request is pending, Go execution traces are continuously collected across all nodes in the cluster. Once a statement bundle has been created, the contained Jaeger trace is investigated for trace spans displaying unexplained high latency. Such a trace span identifies a node, a time window, and goroutine(s) of interest. The overlapping execution trace(s) from that node can then be investigated, and the goroutine(s) looked up within them.
The main problems with this current approach include the collection mechanism: Go execution traces are gathered via the `pprof-loop` script, but due to the need to present an authentication cookie and to orchestrate this script on each cluster node, it is a time-consuming and error-prone process.

Understanding Spikes and Overload
This category of investigation is similar to the first, but we are interested in capturing "whatever we can" to understand why system-internal operations were not making the expected progress. For example, #147025 was filed recently as a consequence of an incident in which the liveness range leaseholder failed to extend its lease; another example is prototype #146650, which attempts to make instances of CPU overload more understandable by capturing an execution trace.
A further complication in making execution tracing useful in this context is that the interesting behavior may well be in the past by the time an execution trace is solicited. Fortunately, the Go team recently added flight recording[^3] to the arsenal, which, roughly speaking, maintains a suffix (say, the last 10s) of a continuously running Go execution trace and allows writing the current buffer out to a file on demand.
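For context, raw flight recorder usage looks roughly like the sketch below. This assumes the experimental `golang.org/x/exp/trace` API from golang/go#63185; details may differ across Go versions:

```go
package main

import (
	"log"
	"os"
	"time"

	exptrace "golang.org/x/exp/trace"
)

func main() {
	fr := exptrace.NewFlightRecorder()
	fr.SetPeriod(10 * time.Second) // retain roughly the last 10s of trace data
	if err := fr.Start(); err != nil {
		log.Fatal(err)
	}
	defer fr.Stop()

	// ... the interesting behavior happens here ...

	// On demand, snapshot the buffered suffix of the trace to a file.
	f, err := os.Create("flightrec.bin")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if _, err := fr.WriteTo(f); err != nil {
		log.Fatal(err)
	}
}
```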
The flight recording facilities are a good foundation, but they need a layer of wrapping to be useful in CockroachDB.
What's in this PR
This PR sketches out primitives to leverage execution traces while addressing the enumerated problems. Concretely, we introduce the `CoalescedFlightRecorder` type, which is a per-node singleton "owning" access to Go execution traces (both in "direct" mode and in flight recorder mode).

First up, `CoalescedFlightRecorder` (`CFR`) allows taking an execution trace of a specified duration, similar to hitting the `/debug/pprof/trace` endpoint.

Second, it also integrates with a wrapped flight recorder. Roughly speaking, any actor in the system that may potentially want a flight recording (for example, a "CPU overload" watcher) can obtain a `Handle` from the `CFR`, which will cause the internal flight recorder to start recording. If no `Handle`s are registered on the `CFR` for some time, flight recording is disabled.

A `Handle` allows the holder to dump the flight recorder's execution trace. Recall that an execution trace is always available while a `Handle` is held and will contain information about the present and its immediate history (though it may only reach back in time to the creation of the handle, or for the internally configured maximum duration or size). Concurrent requests to dump the trace (whether on one `Handle` or different `Handle`s) are coalesced on top of each other, meaning that a `Handle` may be used without the potential for spurious errors (which would result from using a naked flight recorder). Importantly, synchronous dump calls can also coalesce onto synchronous execution trace requests, removing another source of spurious failures. Of note is also that `Handle.Dump` is careful not to block the caller; instead, the caller immediately receives the location of the file as well as a `Future` it can consult to wait for the operation to finish.

Third, the `CFR` cleanly abstracts its storage to the degree that we can integrate it with the storage mechanisms already used by the various "observability dumpers" (CPU profile, heap profile, etc.). Files created by the `CFR` (as hooked up in this PR) go into an `executiontrace_dump` subdirectory and are subject to retention policies specified by cluster settings. They are also available through the existing APIs we expose for downloading observability artifacts (used for example by `debug zip`).

Fourth, the singleton `CFR` is wrapped in a newly introduced `profiler.ExecutionTracer` (mirroring `profiler.{Heap,CPU}Profiler`), which has the ability to take periodic execution traces (through the `CFR`). This enables continuously collecting traces via a cluster setting, and also applies the retention policies.

Fifth, the `CFR` is made available to the remainder of the system via the `StoreConfig` (and could be plumbed wherever else it needs to go). With this in place, new uses of flight recorder execution traces can be added across the system with minimal friction.
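To make the intended usage concrete, here is a hypothetical sketch of the `Handle` workflow. The type names (`CoalescedFlightRecorder`, `Handle`, `Future`) come from this PR, but the exact signatures below, as well as the `log.Warningf` call, are assumptions for illustration:

```go
// watchOverload is a hypothetical consumer of the CoalescedFlightRecorder.
func watchOverload(ctx context.Context, cfr *goexectrace.CoalescedFlightRecorder, overload <-chan struct{}) {
	h := cfr.NewHandle() // holding a Handle keeps the flight recorder running
	defer h.Close()      // once no Handles remain, recording winds down

	for {
		select {
		case <-ctx.Done():
			return
		case <-overload:
			// Dump returns immediately with the destination path and a
			// Future; concurrent dumps coalesce rather than erroring.
			path, fut := h.Dump()
			go func() {
				if err := fut.Wait(); err != nil {
					log.Warningf(ctx, "flight recorder dump to %s failed: %v", path, err)
				}
			}()
		}
	}
}
```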
How this solves the identified problems

`SET CLUSTER SETTING server.execution_tracer.interval='1ms'` now results in continuous execution tracing on all nodes (ignoring the inconsequential 1ms pause between consecutive traces).

Next Steps
I'm not signing up to push this over the finish line. Instead, I want to show that with only a little more effort, we can make significant strides toward better integration with execution traces. I'm happy to help move this along in a consulting fashion, and lend a hand with reviews.
A strategy for landing and extending this could be:

- … `Handle`. Profiles created via `Handle` are high-priority for retention.

In parallel to this, execution traces could become more intertwined with tracing and logging. For example, CRDB trace spans could create trace "regions" or "tasks", and events/messages logged to a CRDB trace span could also be logged into the execution trace (see the sketch below).
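Mirroring CRDB trace-span events into the execution trace could use the standard `runtime/trace` task and log primitives. A sketch, where the span hook points (`startSpanTask`, `logSpanEvent`) are hypothetical:

```go
package main

import (
	"context"
	"runtime/trace"
)

// startSpanTask could be called when a CRDB trace span opens: it creates a
// corresponding execution-trace task whose lifetime brackets the span.
func startSpanTask(ctx context.Context, opName string) (context.Context, func()) {
	ctx, task := trace.NewTask(ctx, opName)
	return ctx, task.End
}

// logSpanEvent mirrors a message recorded on a CRDB trace span into the
// execution trace, where it attaches to the current task.
func logSpanEvent(ctx context.Context, msg string) {
	if trace.IsEnabled() {
		trace.Log(ctx, "crdb.span", msg)
	}
}

func main() {
	ctx, end := startSpanTask(context.Background(), "example-op")
	defer end()
	logSpanEvent(ctx, "evaluating request")
}
```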
Comments on the code in this PR:

- … `name` is collected by a later call to the node (as one could imagine might happen in distributed tracing). The future won't be accessible at this time, but the file may still be pending writes. Maybe solved by writing the file to a temporary location and renaming it on completion only (see the sketch after this list), plus a bit of wait time on collection. But that feels smelly.
- The `Handle` integration in `Replica` is completely untested (other than that it doesn't crash the server) and has obvious deficiencies. Doing this properly requires a little reworking of the `Replica` lifecycle (to have better triggers for handle acquisition and release), or simply making this conditional on always enabling in-flight execution tracing.[^4]
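The temp-file-plus-rename idea mentioned above would look something like this generic sketch (not code from this PR):

```go
package main

import (
	"io"
	"os"
)

// writeAtomically writes a trace file to a temporary path and renames it
// into place only once the write has completed, so that a collector never
// observes a partially written file.
func writeAtomically(path string, write func(io.Writer) error) error {
	tmp := path + ".tmp"
	f, err := os.Create(tmp)
	if err != nil {
		return err
	}
	if err := write(f); err != nil {
		f.Close()
		os.Remove(tmp)
		return err
	}
	if err := f.Close(); err != nil {
		os.Remove(tmp)
		return err
	}
	// On POSIX filesystems, rename within a directory is atomic.
	return os.Rename(tmp, path)
}
```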
Informs #97215.

Epic: none
Footnotes
[^1]: https://go.dev/blog/execution-traces-2024
[^2]: https://docs.google.com/document/d/1i2b7xdXVXR7ETm_Rh1x0qt-olriJVUrmSp32KMdp5YA/edit?tab=t.0
[^3]: https://github.com/golang/go/issues/63185
[^4]: Come to think of it, the tenant metrics ref could already have this figured out: when we initialize a replica, we acquire a tenant ref in setDescLockedRaftMuLocked. When we destroy a replica, we have to reliably release that ref again. That happens in postDestroyRaftMuLocked. It's not easy to convince yourself that each of the many paths that lead to replica destruction hits that method, but it seems that they do.