runtime: excessive scavengeOne work slows mutator progress #57069
That's really interesting. It seems like just after a forced GC call, a lot of memory is suddenly available to scavenge, and during the rapid allocation that follows, the runtime thinks all that memory is going to cause it to exceed the memory limit. That's why all the allocating Ps end up calling into the scavenger. Back of the envelope, I would expect the worst case for scavenging 2 MiB at once to be a ~5 ms delay (which is kind of bad). The lock contention is a little weird, though, because the scavenger shouldn't ever be holding the lock across the madvise call itself.

Does the application hang completely, or does it recover after some time? I'd love to get a page trace of one of these events.

On that topic, the fix to #55328 might help in other ways here. I think the fix there results in a much better scavenging heuristic too, which would cause less scavenged memory to be allocated (causing these calls) in the first place. However, it does not tackle the root of the problem (or problems). One problem is definitely the scalability of scavengeOne. Another problem is getting into this state where the scavenger is being rapidly called into by every P because of the memory limit. This is harder to avoid because the memory limit still needs to be maintained. One idea I have here, though, is to have the sweeper actually help return some memory to the OS that it doesn't think will be used, to help prevent this situation. The trouble is identifying what to return to the OS.
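For checking how much headroom is left under the limit at a moment like this, here is a small sketch using only public APIs (runtime/metrics and runtime/debug); the limit accounting shown is approximate, and nothing here comes from the app in this report:

```go
package main

import (
	"fmt"
	"runtime/debug"
	"runtime/metrics"
)

func main() {
	samples := []metrics.Sample{
		{Name: "/memory/classes/total:bytes"},         // all memory mapped by the runtime
		{Name: "/memory/classes/heap/released:bytes"}, // heap memory already returned to the OS
		{Name: "/memory/classes/heap/free:bytes"},     // free heap memory not yet returned to the OS
	}
	metrics.Read(samples)

	limit := debug.SetMemoryLimit(-1) // a negative input only reads the current limit
	total := samples[0].Value.Uint64()
	released := samples[1].Value.Uint64()
	free := samples[2].Value.Uint64()

	// The memory limit covers memory that is mapped and not yet released,
	// so total-released is a rough stand-in for what gets compared against the limit.
	inUse := total - released
	fmt.Printf("limit=%d MiB, counted against limit≈%d MiB, scavengable free heap=%d MiB\n",
		limit>>20, inUse>>20, free>>20)
}
```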
cc @golang/runtime @mknyszek @aclements
Can you say a bit more please? Does this mean that forcing a GC causes memory to become available to the scavenger, in a way that automatic GCs do not?
I don't have data on the size of the allocated heap before this GC, or on how much memory the runtime had requested from the OS.
Focusing on where the time goes: I've also seen CPU profiles that show lots of time in runtime.madvise itself.
I'll see if the team is able to run with a modified toolchain that reduces that window.
It recovers, yes. Certainly within two minutes (at its next scheduled telemetry dump time, which showed unremarkable behavior).
I think this is going to be hard, especially before Go 1.20. I haven't seen this behavior outside of the team's production environment, and the team isn't set up well to run like this (tip, plus a GOEXPERIMENT, plus a large/growing file on disk) in production.
This sounds like you're saying that our use of runtime.GC makes this failure mode more likely? Thank you!
Kind of. Here's what I was thinking (but I'm less convinced this specifically is an issue now, see my other replies below): forced GCs ensure that the sweep phase is complete, in addition to the mark phase, before continuing. Sweeping frees pages, which makes them available for scavenging. The runtime kicks the background scavenger awake at the end of each sweep phase, but there's potentially a multi-millisecond delay, since sysmon is responsible for waking it. If the background scavenger is asleep during a sweep phase, this can in theory also happen with automatic GCs. However, in your particular case, where there's this sort of "calm before the storm" as a GC gets forced, I suspect the scavenger is more likely to be asleep, leaving all the maintenance work to allocating goroutines.

Writing this out makes me think that maybe, if the memory allocator has to start scavenging, it should immediately kick the background scavenger awake. Though, we should first confirm that the background scavenger is indeed asleep on the job in this case. It could be that even if it were constantly working, it wouldn't be doing enough to avoid this.
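As an illustration of the "sweep leaves scavengable memory behind" point, here is a standalone sketch (not the app in question) that reads the free-but-unreleased heap metric right after a forced GC, which only returns once sweeping is done:

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/metrics"
)

func freeHeapBytes() uint64 {
	s := []metrics.Sample{{Name: "/memory/classes/heap/free:bytes"}}
	metrics.Read(s)
	return s[0].Value.Uint64()
}

func main() {
	// Allocate roughly 1 GiB that is about to become garbage.
	garbage := make([][]byte, 0, 1024)
	for i := 0; i < 1024; i++ {
		garbage = append(garbage, make([]byte, 1<<20))
	}
	garbage = nil // drop the references so the collector can free the pages

	before := freeHeapBytes()
	runtime.GC() // returns only after the sweep phase, so freed pages are scavengable now
	after := freeHeapBytes()

	fmt.Printf("free (unscavenged) heap: before=%d MiB, after=%d MiB\n",
		before>>20, after>>20)
}
```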
That's really interesting. Allocations that large definitely do have the potential to increase fragmentation a good bit, making it more likely that the allocator has to find new address space. That in turn will mean more frequent scavenging to stay below the memory limit.
Thanks!
Yeah, I figured. A tool of last resort I suppose. I could also hack together a version that enables the page tracer with the execution trace. It should also be possible to make the page trace tooling a little more resilient to partial traces, so no crazy changes to the format are necessary. Let me know if this is of interest to you.
Yeah, that call path is what I'd expect in this situation. I think your description is painting a fairly clear picture: it sounds like fragmentation and the sudden need to allocate larger things is forcing the runtime to scramble and find memory to release to stay under the memory limit. The background scavenger in this scenario is instructed to scavenge until it's 5% under the memory limit, which is supposed to help avoid these scenarios, but that clearly isn't enough here.

I think the runtime might be doing something wrong here: maybe it should be reserving a little bit more space off of the heap limit as a hedge against future fragmentation. I suspect at your heap size this wouldn't really cost you very much in terms of GC frequency, but it would make this transition to periodic work much smoother. More specifically, the fixed headroom found at https://cs.opensource.google/go/go/+/master:src/runtime/mgcpacer.go;l=1036 should probably be proportional to the heap limit (something like 3-5%, so the limit multiplied by 0.95-0.97). I was originally reluctant to do this, but I think your issue is solid evidence for it. I also think it might help with another memory limit issue, which is that it works less well for smaller heaps (around 64 MiB or less).

It's only a 2-3 line change (well, probably a little bit more including a pacer test) and it might resolve this particular issue. Of course, a single number here won't be perfect for every scenario, but it's a start; a more adaptive solution would be good to consider for the future. Also, this follows a more general pattern of the pacer pacing for the "edge" (#56966) and not hedging enough for noise or changes in the steady state.
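As a rough illustration of the difference, with the 20 GiB limit from this report (the 3% figure comes from the 3-5% range above; the fixed headroom value below is just a placeholder, not the actual constant in mgcpacer.go):

```go
package main

import "fmt"

func main() {
	const (
		memoryLimit   = int64(20) << 30 // 20 GiB, as in this report
		fixedHeadroom = int64(1) << 20  // placeholder fixed headroom, independent of limit size
	)

	fixedGoal := memoryLimit - fixedHeadroom   // headroom stays tiny no matter how big the limit is
	proportionalGoal := memoryLimit * 97 / 100 // ~3% headroom, i.e. goal = 0.97 * limit

	fmt.Printf("fixed headroom:        goal=%d MiB (headroom %d MiB)\n",
		fixedGoal>>20, (memoryLimit-fixedGoal)>>20)
	fmt.Printf("proportional headroom: goal=%d MiB (headroom %d MiB)\n",
		proportionalGoal>>20, (memoryLimit-proportionalGoal)>>20)
}
```

At a 20 GiB limit, the proportional version leaves on the order of 600 MiB of slack for fragmentation and allocation bursts, rather than a fixed sliver.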
Here's an attempt at a fix if you'd like to try it out: https://go.dev/cl/460375
Change https://go.dev/cl/460375 mentions this issue:
@rhysh Do you think you'll have a chance to try this patch out for Go 1.21? I'm curious if it makes a meaningful difference in performance and/or if it helps make this situation less likely.
I built a small reproducer: main.go
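The attached main.go isn't reproduced here, so the following is only a guess at the shape of such a program, not the actual attachment: a large live heap near a memory limit, GOGC=off, a forced GC, then a burst of allocation.

```go
package main

import (
	"runtime"
	"runtime/debug"
)

var live [][]byte // long-lived heap, kept reachable for the whole run

func main() {
	debug.SetGCPercent(-1)        // GOGC=off
	debug.SetMemoryLimit(1 << 30) // small memory limit so the scavenger has to work hard

	// Build up a live heap that sits close to the limit.
	for i := 0; i < 700; i++ {
		live = append(live, make([]byte, 1<<20))
	}

	for iter := 0; iter < 10; iter++ {
		runtime.GC() // forced GC: sweep completes, freed pages become scavengable

		// Burst of allocation right after the forced GC returns.
		var burst [][]byte
		for i := 0; i < 200; i++ {
			burst = append(burst, make([]byte, 1<<20))
		}
		_ = burst // becomes garbage before the next iteration
	}
}
```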
I applied https://go.dev/cl/495916 to get some better introspection. Then I did:
Result before https://go.dev/cl/460375: before.txt
Result after https://go.dev/cl/460375: after.txt
I think it's pretty clear from the output logs that:
Here's the log output in summarized form:
Change https://go.dev/cl/495916 mentions this issue:
Also, clean up atomics on released-per-cycle while we're here. For #57069.
I believe my change mitigates the issue sufficiently. Please reopen if you see this again in production, @rhysh!
Some background: I have an app that serves interactive HTTP traffic (typical response time is tens of milliseconds), and which also does periodic work to refresh the state necessary to serve that interactive traffic. Its live heap is usually around 8 GiB. The work to refresh the state spans several GC cycles and can involve allocating hundreds of MiBs or even a couple of GiBs over a few hundred milliseconds, which gives the pacer a hard time and can lead to the pacer choosing a high assist factor. To work around that, we're 1/ calling runtime.GC at a few key points in the state-refresh process, and 2/ running with GOMEMLIMIT=20GiB and GOGC=off. The app runs on a machine with 16 hardware threads (GOMAXPROCS=16).

The problem: There appears to be contention on the mheap lock in runtime.(*pageAlloc).scavengeOne. Execution traces show this can tie up all 16 Ps for hundreds of milliseconds.

CC @golang/runtime and @mknyszek
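For reference, that configuration can also be set programmatically; a minimal sketch of the setup described above (values as stated in this report, not the app's actual code):

```go
package main

import (
	"runtime"
	"runtime/debug"
)

func main() {
	debug.SetGCPercent(-1)         // equivalent to GOGC=off
	debug.SetMemoryLimit(20 << 30) // equivalent to GOMEMLIMIT=20GiB
	refreshState()
}

func refreshState() {
	// ... allocation-heavy refresh work ...
	runtime.GC() // forced GC at a key point in the refresh, as described above
	// ... more allocation-heavy work ...
}
```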
What version of Go are you using (go version)?

go1.19.3 linux/amd64

Does this issue reproduce with the latest release?

Yes: go1.19.3 is the latest stable release. I have not tried with tip.

What operating system and processor architecture are you using (go env)?

linux/amd64

What did you do?

GOMAXPROCS=16, GOMEMLIMIT=20GiB, GOGC=off.

Manual call to runtime.GC, followed (after that call returns, indicating that the sweep work is done) by allocating memory especially quickly. All while serving interactive (tens of milliseconds typical response time) HTTP requests.
What did you expect to see?
Consistent responsiveness for the interactive HTTP requests this process serves, regardless of which phase the GC is executing.
When the app has its own runnable goroutines, I expect the runtime to stay close to within its goal of 25% of P time spent on GC-related work.
What did you see instead?
The app's Ps are kept busy in a nearly-opaque form of work. Rather than running a G for around 100 µs and then moving on to the next G, each P has a single G that it claims to be running for hundreds of milliseconds without interruption. (Note: if those goroutines had "normal" work, I'd still expect them to be preempted by sysmon every 20 ms or so.) At times, every P in the app is kept busy this way.

When the Ps are busy this way, their CPUSample events in the execution trace (each of which represents 10 milliseconds of on-CPU time) are spread tens of milliseconds apart, and sometimes more than 100 ms. That suggests that these Ps are not spending their time executing any code, even within the kernel, and that instead they're asleep waiting for a lock that is not visible to the execution trace (such as a runtime-internal lock).

The CPUSample events during those periods show the Ps are trying to allocate new spans to use for new memory allocations. Most samples are within runtime.(*pageAlloc).scavengeOne and show a call to runtime.madvise. A few show time in that function's calls to runtime.lock and runtime.unlock.

Because those periods include so little on-CPU time outside of runtime.(*pageAlloc).scavengeOne, and also show so little on-CPU time in total, and show some on-CPU time interacting with the mheap lock, I take this to mean that the Ps are spending their time with runtime.(*pageAlloc).scavengeOne on their stack, but sleeping, likely waiting for the mheap lock.

Here's one of the call stacks from a CPUSample event that arrived while a P was busy for 100+ milliseconds (showing from mallocgc):

Here's how the execution trace looks. The sweep work ends around 275 ms. That also means the app's manual runtime.GC call returns, which allows the allocation-intensive refresh procedure to continue. Procs 1, 12, and 15 do some allocation-heavy work and then get stuck in scavengeOne from the 400 ms to 800 ms marks. (Some other Procs, like 3 and 11, do heavy allocations around the 400 ms mark but do not get stuck.) Around 800 ms, all 16 Procs get stuck and the queue of runnable goroutines grows. And around 1000 ms, a timer should have triggered to call runtime/trace.Stop, but the trace continues beyond the 1300 ms mark; the app is no longer responsive.

And here's the call_tree view from under mallocgc. This is the runtime/pprof output of the same CPU profile that is shown in the execution trace; these samples cover the same time period as the entire execution trace, not filtered to any smaller time range.
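For completeness, CPUSample events only appear in the execution trace when the CPU profiler runs while tracing is active; a minimal collection sketch (file names here are arbitrary):

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
	"runtime/trace"
	"time"
)

func main() {
	traceFile, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer traceFile.Close()

	profFile, err := os.Create("cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer profFile.Close()

	// Start both: CPU samples taken while the tracer is running show up
	// in the trace as CPUSample events, in addition to the pprof output.
	if err := trace.Start(traceFile); err != nil {
		log.Fatal(err)
	}
	if err := pprof.StartCPUProfile(profFile); err != nil {
		log.Fatal(err)
	}

	time.Sleep(time.Second) // the workload under investigation goes here

	pprof.StopCPUProfile()
	trace.Stop()
}
```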