Description
Proposal Details
Fine-grained performance metrics (such as L1-iTLB, IPC, branch-load, branch-load-miss, etc.) and optimization experiments (#63670) have repeatedly shown that block layout can noticeably impact code execution efficiency.
In contrast to the current case-by-case tuning approach, I propose adopting the traditional and time-tested Pettis-Hansen (PH) basic block layout algorithm. Its core concept is to place hot blocks as close together as possible, allowing basic blocks to fall through whenever feasible, thereby reducing jump instructions. This principle is considered the golden rule in the field of block layout, and state-of-the-art algorithms like ext-TSP and almost all other variants are based on this idea, incorporating advanced heuristic techniques.
The PH algorithm relies on a weighted chain graph, where the weights represent edge frequencies. In the absence of PGO information, we can only resort to the branch prediction results from the likelyadjust pass. In the future, we can incorporate PGO data as weights to make the algorithm even more effective.
Experiment Results
The x-axis represents testcase id, the y-axis indicates the performance change in percentage points, and negative values denote performance improvement.
cmd/compile: greedy basic block layout
gopherbot commented on Mar 20, 2024
Change https://go.dev/cl/572975 mentions this issue:
cmd/compile: greedy basic block layout
mdempsky commented on Mar 20, 2024
What does the graph represent? What tests were run? What is a "testcase ID"? What's the difference between the orange and blue lines? How do you interpret the graph to conclude the CL is net positive? Thanks.
mknyszek commented on Mar 20, 2024
https://go.dev/cl/c/go/+/571535 is another CL related to basic block reordering.
y1yang0 commented on Mar 21, 2024
@mknyszek Thanks for your reminder, I'll take a closer look at it.
@mdempsky Hi,
A Chain Graph is composed of chains and edges. It aggregates a series of blocks into chains, which are then interconnected by edges. The purpose of this is to minimize jmp instructions and maximize fallthrough occurrences as much as possible.
A vivid and real-life example can be referenced in the image below.
The IDs map to packages: testcase 1 is pkg archive/tar, testcase 2 is pkg archive/zip, etc. Due to the length of the package names, I've used IDs to represent them on the x-axis. Orange means the round 1 test and blue indicates round 2. Each round runs all Go tests with count=10 on a given CPU set.
Changes within 2% are considered reasonable and tolerable fluctuation, which allows us to disregard the majority of cases. However, the CL notably improves several cases, specifically those with changes exceeding 6%. Moreover, this is a simple and time-tested algorithm; variations of it are used worldwide for block layout implementations, and I believe Go is also an excellent candidate for its application. Furthermore, the current graph does not contain PGO information; each edge is either 100% taken or 0% taken. Yet it still manages to achieve these results. We can optimistically and reasonably expect that a graph augmented with PGO information will yield very favorable outcomes.
mdempsky commented on Mar 21, 2024
Sorry, I was asking what the graph in your "experimental results" indicated. Your argument seems to be that the graph demonstrates an experimental improvement on the benchmarks, but it looks like noise to me.
Typically we look at the results from running benchstat.
Okay. What are "round1" and "round2"?
Can you please instead use "old" and "new"? Or "baseline" and "experiment"? These are unambiguous terms.
Thanks.
y1yang0 commented on May 15, 2024
Hi @mdempsky @thanm, sorry for the delay.
Yes, the above benchmark results are all std library package benchmarks.
The golang/benchmarks results are as follows:
I think the current performance results of both std and golang/benchmarks are sufficient to drive this patch forward. It has a slight overall advantage compared to the current algorithm and offers a significant benefit for functions with many blocks. Furthermore, to optimize code layout using PGO information in the future, it is a reasonable choice to replace case-by-case tuning algorithms with the time-tested PH algorithm. I look forward to your comments and reviews.
y1yang0 commented on May 21, 2024
Gentle PING: @mdempsky @thanm
y1yang0 commented on Jun 5, 2024
PING: Anyone?
randall77 commented on Jun 5, 2024
When we last left the CL (572975, right?) it had trybot failures. Those need to be fixed at some point. (Generally it is probably better to ping on the CL than the issue.)
More generally, it is unclear to me at least that this is actually a performance improvement, as opposed to just noise.
I think it would be good to take 2 benchmarks, one which improves by a bunch (perhaps Encoding4KBVerySparse) and one which gets worse (perhaps WalkAllBreadthFirstGnp_1000_tenth) and investigate whether those performance changes are real, and if so, why. What does the different block ordering do that makes it better/worse?
I think we're still waiting for an answer for how this compares to CL 571535. That CL uses pgo information, but otherwise is it the same algorithm? A different one? If different, how do we choose which one to use?
y1yang0 commented on Jun 11, 2024
@randall77
Thank you for the reply.
I hope for a positive start to the review process, and I am happy to fix the errors that the trybot points out.
In the first chart of performance results I provided, I ran all std benchmarks (archive, bufio, bytes, crypto, ...) twice, and the results (blue line and orange line) show a high degree of similarity. This means that the same packages consistently receive similar proportions of performance improvements, which is no mere coincidence. I believe it sufficiently demonstrates that the performance results are not noise.
Nearly all basic block layout algorithms (such as ext-TSP, cache) are variations of the PH algorithm and can be derived by altering PH's weight computation formula. Another rationale for proposing the current CL is its modest size and clarity, which could simplify the review process. This mitigates one of the concerns I have perceived: why many large patches from external contributors do not get merged or move forward. If we clear the first hurdle, that is, if PH gets integrated, then we can proceed to incorporate PGO information into the PH algorithm.
randall77 commented on Jun 11, 2024
Unfortunately just running twice is not enough to avoid noise. Noise comes from many sources, some of which are repeatable given the same binary. See, for example, https://go-review.googlesource.com/c/go/+/562157
I really need to see a few examples of where this makes a difference. A particular change this CL makes in this function makes this benchmark faster. And here's the difference in assembly, note the layout changes make this jump instruction go away.
That sort of thing.
y1yang0 commented on Jun 18, 2024
@randall77 Sure~ One example is benchmarkMapAssignInt32 in src/runtime/map_test.go, a reproducible improvement:
bench source
benchstat diff
perf diff
ssa diff
The victim of the branch-load-miss is shown in the attached perf output, and the corresponding SSA is shown in the attached SSA dump.
The base handles b37's 'likely' and 'unlikely' successors without distinction, while 'opt' shows a preference for the 'likely' one. This is precisely the issue that 'ph' addresses. See more traces in comment#2.
ssa.zip
randall77 commented on Jun 18, 2024
Hmm, I don't see that improvement, but my work machine is x86:
I will try again on my arm (M2 ultra) when I get home.
randall77 commented on Jun 18, 2024
Looking at the generated arm64 assembly for runtime.mapassign_fast32, with your CL the code grows from 808 bytes to 824 bytes. That seems like a general trend; for example, the binary generated by go test -c runtime is 3.2% bigger.
My suspicion is the following. For the code
The tip compiler does
Whereas with your CL it does
The latter is larger (by 1 unconditional jump) but is probably better (forward branches are predicted not taken by default?) when c is seldom true.
3+% is a lot of binary size increase. On microbenchmarks binary size doesn't really matter, but on real applications that extra size can lead to more general slowdowns. It's not clear if that would end up countering the observed speedup or not.
Maybe it would be worth doing under PGO when we know a function is hot.
y1yang0 commented on Jun 19, 2024
Yes
Yes
Based on test performance, the cost of the binary size increase is that a few cases see significant performance improvements (10%+), while most cases do not show obvious changes. PGO may bring noticeable improvements to the majority of cases, which is exactly what I want to do next. Perhaps in the future, we could consider a configurable block layout algorithm, like GOEXPERIMENT=ph/basic. Additionally, the fact that a large number of compilers apply this algorithm also strengthens our confidence to some extent.
randall77 commented on Jun 19, 2024
On my M2 ultra, the performance of your CL is quite good.
y1yang0 commented on Jun 20, 2024
So, considering all the information given, is it enough to start our review process? Or is there any additional information required from me? Thanks.
randall77 commented on Jun 20, 2024
I will take a look at your CL today. I think we still have to figure out the right way to decide when to turn it on.
y1yang0 commented on Jun 29, 2024
I've found a heuristic spot where we could consider sorting adjacent chains together (9fd9a54). For the given chains:

before: (block order omitted)
Even if [b1 b19] and [b2 b18 b5] are very close to each other, the final generated block order will still arrange them far apart. If we consider arranging adjacent chains together, we can generate a better block order:
after: (block order omitted)
--bench
7.1 update:
I also found that the "before" precedence relation was not completely correct: if "a" comes before "b" and "b" comes before "c", we expect the precedence to be transitive, i.e. "a" comes before "c". This has been fixed in 36f564c; the block order is now as follows:
alexanius commented on Aug 14, 2024
Hello. I was experimenting with my pgobb patch and integrated your patch with it. My original patch just adds counters to basic blocks and edits the likely-branch information. The integration with your patch was the following:
go build -a -pgo=file.prof -pgobb -gcflags=all="-pgogreed"
For WeightTaken we use the counter of the basic block. Compile line:
go build -a -pgo=file.prof -pgobb -gcflags=all="-pgogreed -pgobbgreed"
Later I will share the results of my performance evaluations, but maybe you will find this patch useful for your purposes.
P.S. Please see my evaluations here.