runtime/trace: TestTraceCPUProfile failures #55317
Found new dashboard test flakes for:
2022-08-18 13:16 linux-arm-aws go@38edd9bd runtime/trace.TestTraceCPUProfile (log)
2022-09-02 19:09 windows-arm64-10 go@55ca6a20 runtime/trace.TestTraceCPUProfile (log)
As a first pass: in the failure at go@38edd9b, the goroutine created on src/runtime/trace/trace.go:128 (lines 128 to 136 at 38edd9b) is the one making the relevant call. The failure at go@55ca6a2 points at line 162 at 55ca6a2.
Found new dashboard test flakes for:
2023-10-14 22:30 linux-arm-aws go@bc9dc8d4 runtime/trace.TestTraceCPUProfile (log)
Found new dashboard test flakes for:
2023-10-30 17:08 linux-arm-aws go@ecda959b runtime/trace.TestTraceCPUProfile (log)
Found new dashboard test flakes for:
2023-11-10 15:50 windows-arm64-11 go@9d914015 TestTraceCPUProfile (log)
2023-11-10 15:51 windows-arm64-11 go@3b303fa9 TestTraceCPUProfile (log)
2023-11-10 16:18 windows-arm64-11 go@abf84221 TestTraceCPUProfile (log)
2023-11-13 22:12 freebsd-arm64-dmgk go@773039ed TestTraceCPUProfile (log)
2023-11-14 16:45 linux-386-longtest go@e14b96cb internal/trace/v2.TestTraceCPUProfile (log)
2023-11-15 20:08 freebsd-arm64-dmgk go@2380862c TestTraceCPUProfile (log)
2023-11-16 19:26 openbsd-386-72 go@d573a8a8 runtime/trace.TestTraceCPUProfile (log)
This latest set of failures seems related to the v2 tracer; those on arm64 are in the code around line 908 at e14b96c. CC @mknyszek for those.

The failure at d573a8a is at go/src/internal/trace/parser.go line 269, events = append(events, ev), where ev is a type rawEvent struct {int; byte; []uint64; []string}, which on GOARCH=386 would be 8 words of 4 bytes each, 32 bytes in total. The failed allocation is runtime.growslice(0xac500000, 0x2c301, 0x2c300, 0x1, 0x81e4340), growing that slice beyond 180992 entries. That's 5.8 MB (growing to 0x375 pages of 8 kiB each, 7.3 MB), which doesn't seem large by itself, but maybe 180k+ entries in the test trace is unusually large? I don't have a baseline for that. There are nice log lines in mheap.grow when the call to mheap.sysAlloc fails, but the crash here is in sysMap (which throws directly in sysMapOS), so the log doesn't include context on the total heap size.
Compare:
Found new dashboard test flakes for:
2023-11-20 15:10 windows-amd64-longtest go@0709f1bb internal/trace/v2.TestTraceCPUProfile (log)
Sorry for not replying sooner. The linux-386-longtest failure was fixed by https://go.dev/cl/542335. The new failure is fun -- that may be an actual failure at runtime from the new tracer. I'll try to reproduce. As a side note, the runtime/trace tests barely use the new tracer at all.
Oh! No, that new failure is not real, actually. I put in a low-quality, optimistic "deadlock detector", but it turns out there was no deadlock in this case at all. The test just didn't expect the additional output. I'll put the "deadlock detector" behind a constant. It's for debugging anyway.
Change https://go.dev/cl/544235 mentions this issue: |
The v2 execution tracer has a rudimentary deadlock detector, but it's based on an arbitrary threshold that can actually get hit even if there's no deadlock. This ends up breaking tests sometimes, and it would be bad if this just appeared in production logs. Put this 'deadlock detector' behind a flag.

For #55317.

Change-Id: I286f0c05b3ac9600f4f2f9696065cac8bbd25f00
Reviewed-on: https://go-review.googlesource.com/c/go/+/544235
Reviewed-by: Michael Pratt <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
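In sketch form, "behind a flag" means the heuristic can never fire in normal builds. The names below (debugDeadlock, reportIfStuck, the threshold value) are illustrative, not the runtime's actual identifiers:

```go
package main

import "fmt"

// debugDeadlock gates the heuristic entirely; false in normal builds,
// so the compiler can drop the check altogether.
const debugDeadlock = false

// reportIfStuck returns a warning only when the debug constant is on
// and an arbitrary wait threshold is exceeded. With the constant off,
// a slow but healthy program can never trip it, so no spurious output
// reaches tests or production logs.
func reportIfStuck(waitedMillis int) string {
	const thresholdMillis = 1000 // arbitrary; can be exceeded with no real deadlock
	if debugDeadlock && waitedMillis > thresholdMillis {
		return fmt.Sprintf("possible deadlock: waited %dms", waitedMillis)
	}
	return ""
}

func main() {
	// Even a wait well past the threshold produces nothing with the flag off.
	fmt.Printf("%q\n", reportIfStuck(5000)) // ""
}
```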
Found new dashboard test flakes for:
2023-11-21 22:23 freebsd-arm64-dmgk go@ff05cdbd TestTraceCPUProfile (log)
Found new dashboard test flakes for:
2023-11-21 17:24 freebsd-arm64-dmgk go@0cb45bac TestTraceCPUProfile (log)
Found new dashboard test flakes for:
2023-11-23 17:33 freebsd-arm64-dmgk go@0c7e5d3b internal/trace/v2.TestTraceCPUProfile (log)
Found new dashboard test flakes for:
2023-12-01 19:58 netbsd-386-9_3 go@40f6fbf1 internal/trace/v2.TestTraceCPUProfile (log)
Found new dashboard test flakes for:
2023-12-01 21:24 linux-ppc64-sid-buildlet go@ecb9d9b9 internal/trace/v2.TestTraceCPUProfile (log)
Found new dashboard test flakes for:
2023-12-01 19:30 netbsd-arm64-bsiegert go@70c7fb75 internal/trace/v2.TestTraceCPUProfile (log)
Yeah, you're right. I frequently forget that other users might run …

I think I can just make the test more resilient to lost samples. It seems like in all these failures at least some of the samples show up in the trace. I figure we can just make the test pass if at least one of these samples shows up in the trace. This only verifies that the functionality works at a baseline, not how well it works, but given that the functionality is best-effort anyway, I think that's about as good as we can make it for now.
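The relaxed check described above might look like this sketch (a hypothetical helper, not the actual test code): the test passes as long as at least one profiled stack also appears among the trace's sample events.

```go
package main

import (
	"errors"
	"fmt"
)

// atLeastOneSample reports success if any stack seen by the CPU profiler
// also shows up in the trace. Delivery of samples into the trace is
// best-effort, so requiring every sample would make the test flaky.
func atLeastOneSample(profileStacks, traceStacks map[string]int) error {
	for stack := range profileStacks {
		if traceStacks[stack] > 0 {
			return nil // baseline functionality confirmed
		}
	}
	return errors.New("no CPU profile sample appeared in the trace")
}

func main() {
	profile := map[string]int{"main.cpuHog": 10, "runtime.mallocgc": 3}
	trace := map[string]int{"main.cpuHog": 4} // some samples were lost
	fmt.Println(atLeastOneSample(profile, trace)) // nil: at least one matched
}
```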
Taking a step back, I think this actually represents a regression in behavior vs. the old tracer, but it was really hard to tell because of how best-effort this functionality is. Specifically, the old tracer was subtly ensuring that there was a CPU profile buffer flush sometime close to when a trace would end, because the trace reader would get woken up a whole bunch. In the new tracer the CPU profile buffer is translated into the execution trace via a separate goroutine. This has in general been positive for understanding the best-effort-ness of this behavior, but what we lost was the property that a flush was likely to happen close in time to when the trace ended.

My proposal to restore this behavior is just to wake the trace CPU profile reader goroutine before we're about to change generations. It does not ensure that a flush happens, but it makes one far more likely. It's still best-effort, but a bit more predictable, and it will provide more understandable results.

Overall, I'm not really that happy with the fact that this is still so best-effort. Early on in the trace design I tried to write this trace CPU profile reader goroutine to actually block on the CPU profile buffer getting filled, making it much more precise. But because of the problems on Darwin with CPU profiles we encountered earlier this year, I had to work around that. In the future it would be good to consider blocking in … again.

I'll send a CL with the better-effort wakeup, which should hopefully resolve most, if not all, of the failures seen in this issue.
Change https://go.dev/cl/555495 mentions this issue: |
@prattmic convinced me to try to do the right thing here, and eliminate all parts of the tracer making this CPU profile -> trace process best-effort. https://go.dev/cl/555495 implements this. I was originally skeptical given that we're pretty late in the release cycle, but I figure at the very least we can have this on hand and decide whether it's worth submitting now. If it's not, I can land it for next cycle and send the wakeup patch (which is, like, 5 lines of code), which will certainly not make the situation any worse.
Found new dashboard test flakes for:
2024-01-12 16:19 freebsd-arm64-dmgk go@e58e8139 internal/trace/v2.TestTraceCPUProfile (log)
Change https://go.dev/cl/557838 mentions this issue: |
…U profile in a trace

Currently the new execution tracer's handling of CPU profile samples is very best-effort. The same CPU profile buffer is used across generations, leading to a high probability that CPU samples will bleed across generations. Also, the CPU profile buffer (not the trace buffer the samples get written into) isn't guaranteed to be flushed when we close out a generation, nor when tracing stops. This has led to test failures, but can more generally just lead to lost samples.

In general, lost samples are considered OK. The CPU profile buffer is only read from every 100 ms, so if it fills up too much before then, old samples will get overwritten. The tests already account for this, and in that sense the CPU profile samples are already best-effort. But with actual CPU profiles, this is really the only condition under which samples are dropped.

This CL aims to align CPU profiles better with traces by eliminating all best-effort parts of the implementation aside from the possibility of dropped samples from a full buffer.

To achieve this, this CL adds a second CPU profile buffer and has the SIGPROF handler pick which CPU profile buffer to use based on the generation, much like every other part of the tracer. The SIGPROF handler then reads the trace generation, but not before ensuring it can't change: it grabs its own thread's trace seqlock. It's possible that a SIGPROF signal lands while this seqlock is already held by the thread. Luckily this is detectable, and the SIGPROF handler can simply elide the locking if this happens (the tracer will already wait until all threads exit their seqlock critical section).

Now that there are two CPU profile buffers written to, the read side needs to change. Instead of calling traceAcquire/traceRelease for every single CPU sample event, the trace CPU profile reader goroutine holds this conceptual lock over the entirety of flushing a buffer. This means it can pick the CPU profile buffer for the current generation to flush.

With all this machinery in place, we're now at a point where all CPU profile samples get divided into either the previous generation or the current generation. This is good, since it means that we're able to emit profile samples into the correct generation, avoiding surprises in the final trace. All that's missing is to flush the CPU profile buffer from the previous generation, once the runtime has moved on from that generation. That is, when the generation counter updates, there may yet be CPU profile samples sitting in the last generation's buffer. So, traceCPUFlush now first flushes the CPU profile buffer, followed by any trace buffers containing CPU profile samples. The end result of all this is that no sample gets left behind unless it gets overwritten in the CPU profile buffer in the first place. CPU profile samples in the trace will now also get attributed to the right generation, since the SIGPROF handler now participates in the tracer's synchronization across trace generations.

Fixes #55317.

Change-Id: I47719fad164c544eef0bb12f99c8f3c15358e344
Reviewed-on: https://go-review.googlesource.com/c/go/+/555495
Auto-Submit: Michael Knyszek <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Michael Pratt <[email protected]>
(cherry picked from commit f5e475e)
Reviewed-on: https://go-review.googlesource.com/c/go/+/557838
Reviewed-by: Cherry Mui <[email protected]>
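The generation-based buffer selection the message describes can be sketched like this (hypothetical types and methods; the real code in CL 555495 is considerably more involved, with seqlocks and per-thread state):

```go
package main

import "fmt"

// cpuProfBuffers holds one sample buffer per generation parity, so
// samples from adjacent generations never mix in a single buffer.
type cpuProfBuffers struct {
	bufs [2][]uint64
}

// writerBuf is what a SIGPROF-side writer would use: it picks the buffer
// for the generation it observed. (In the real design this happens under
// the thread's trace seqlock, so the generation can't change mid-write.)
func (b *cpuProfBuffers) writerBuf(gen uint64) *[]uint64 {
	return &b.bufs[gen%2]
}

// flushPrevious drains the buffer of the generation just ended, which is
// the step that ensures no sample is stranded when the generation counter
// advances.
func (b *cpuProfBuffers) flushPrevious(newGen uint64) []uint64 {
	prev := &b.bufs[(newGen-1)%2]
	out := *prev
	*prev = nil
	return out
}

func main() {
	var b cpuProfBuffers
	buf := b.writerBuf(1) // a sample recorded during generation 1
	*buf = append(*buf, 0xbeef)
	// The generation advances to 2; generation 1's samples still get flushed.
	fmt.Println(len(b.flushPrevious(2))) // 1
}
```

Because the writer indexes by the generation it actually observed, a late sample from generation N lands in N's buffer even if generation N+1 has begun, and flushPrevious picks it up at the boundary.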
Found new dashboard test flakes for:
2024-02-02 16:26 openbsd-386-72 go@2fdad8af runtime/trace.TestTraceCPUProfile (log)
Found new dashboard test flakes for:
2024-04-22 15:29 netbsd-arm64-bsiegert go@734fd7a9 internal/trace/v2.TestTraceCPUProfile (log)
This hasn't happened in a while, and even then, only on secondary ports. (I think flaky failures there are now captured in newer watchflakes bugs anyway.) Closing.
Issue created automatically to collect these failures.
Example (log):
— watchflakes