Please answer these questions before submitting your issue. Thanks!
- What version of Go are you using (`go version`)?
go1.6.2, go1.7beta2
- What operating system and processor architecture are you using (`go env`)?
linux/arm64, linux/arm
- What did you do?
run the standard go1 benchmarks
- What did you expect to see?
little to no performance regression
- What did you see instead?
see:
https://gist.github.com/ajstarks/fa05919377de73a552870ce9e45e4844 for linux/arm64 (pine64)
https://gist.github.com/ajstarks/c937a1a80e76b4692cb52c1d57a4494d for linux/arm (CHIP)
regressions occur on GobEncode, GobDecode, JSONEncode, and JSONDecode
Title changed from "go1.7beta2 performance regressions on ARM" to "all: go1.7beta2 performance regressions on ARM"

josharian commented on Jun 19, 2016
I'm away from my arm(64) machines for a couple of weeks. Help bisecting would be lovely.
ajstarks commented on Jun 19, 2016
benchviz for Pine64
pine64.pdf
Title changed from "all: go1.7beta2 performance regressions on ARM" to "cmd/compile: go1.7beta2 performance regressions on ARM"

ianlancetaylor commented on Jun 20, 2016
CC @randall77
cherrymui commented on Jun 20, 2016
@quentinmit and I are looking at this. We found that it also slows down on linux/386, and on linux/amd64 with `-ssa=0`, with GobDecode, JSONDecode, and Template being the most significant. Gzip is getting faster, though. So, probably it is not the compiler?

josharian commented on Jun 20, 2016
Could be the runtime, or it could be front-end compiler changes. I think there were near zero 386 or arm-specific fixes this cycle, so it might be worth comparing the generated code to rule out the compiler (perhaps after some profiling to pinpoint the change).
josharian commented on Jun 20, 2016
All those packages use reflection. Could it be related to the type reworking?
Title changed from "cmd/compile: go1.7beta2 performance regressions on ARM" to "cmd/compile: go1.7beta2 performance regressions"

quentinmit commented on Jun 21, 2016
Reflection is definitely part of the puzzle. Here's the state of our investigation so far:
RLH commented on Jun 21, 2016
The slow version contains a call to this routine that the faster version does not contain. It accounts for around 3% of the total time, and is the only routine that stood out in the top 20.

inst retired | cpu_clk unhalted
reflect.(*name).name  3.3%  2.9%  type.go  0x8166a40  go1-template-bad
ianlancetaylor commented on Jun 21, 2016
CC @crawshaw
ianlancetaylor commented on Jun 21, 2016
@RLH Is it easy for you to find the significant callers of `reflect.(*name).name`? It's possible that the significant callers are from code that is looking for a specific name, which we could probably speed up somewhat by doing things like checking the length before building the string we return.
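Ian's length-check idea can be sketched as follows. The `name` type here is a hypothetical, simplified layout (one flag byte, a one-byte length, then the data); the real runtime encoding differs, and `nameString`/`nameEquals` are illustrative names, not reflect's API:

```go
package main

import "fmt"

// name is a hypothetical, simplified stand-in for reflect's encoded name
// layout: one flag byte, one length byte, then the name data.
type name struct {
	bytes []byte
}

// nameString materializes the full name as a string, which allocates —
// this is what every caller pays today.
func (n name) nameString() string {
	l := int(n.bytes[1])
	return string(n.bytes[2 : 2+l])
}

// nameEquals checks a candidate without building a string: compare the
// stored length first, and only then the bytes.
func (n name) nameEquals(want string) bool {
	if int(n.bytes[1]) != len(want) {
		return false // lengths differ; skip the byte comparison entirely
	}
	for i := 0; i < len(want); i++ {
		if n.bytes[2+i] != want[i] {
			return false
		}
	}
	return true
}

func main() {
	n := name{bytes: []byte{0, 3, 'F', 'o', 'o'}}
	fmt.Println(n.nameString(), n.nameEquals("Foo"), n.nameEquals("Barbaz"))
}
```

A caller scanning for one specific method name would then reject most candidates on the length comparison alone, without allocating.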
dr2chase commented on Jun 21, 2016
I have an existence proof that code alignment can cost you 10% on Revcomp.
34 remaining items
quentinmit commented on Jun 29, 2016
@dsnet We are running the Gzip benchmark as found in test/bench/go1/gzip_test.go
dsnet commented on Jun 29, 2016
tl;dr: leave gzip as-is for Go 1.7; we can address the ARM performance issues in Go 1.8.

Alright, I think I've figured out what's causing the slow-down for ARM. While the go1.7 optimizations as a whole have helped arm and arm64 (as seen in my post above), there are some cases where performance regresses (as is the case for test/bench/go1/gzip_test.go).

One of the major changes made to compress/flate was that we started using 4-byte hashing instead of 3-byte hashing. The new hash function is stronger and leads to fewer collisions and thus fewer searches, which is good for performance. On amd64, we take advantage of the fact that most CPUs can perform 4-byte unaligned loads pretty efficiently (#14267). However, this optimization is not possible on arm, so the arm targets have to pay an extra cost to use the stronger hash function.
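For illustration, the 4-byte hash in question looks roughly like this (a paraphrase of the Go 1.7 compress/flate scheme; `hashBits` and `hashmul` reflect the values used there at the time, while `hash3Step` is only a loose stand-in for the flavor of the older rolling 3-byte update, not the exact Go 1.6 code):

```go
package main

import "fmt"

const (
	hashBits = 17         // hash table index width, as in compress/flate
	hashmul  = 0x1e35a7bd // multiplicative mixing constant
)

// hash4 assembles a 32-bit word from four byte loads, multiplies, and
// keeps the top hashBits bits. On amd64 the four loads plus shifts can be
// fused into a single unaligned 32-bit load (#14267); arm32 must execute
// them one at a time, which is where the extra instructions come from.
func hash4(b []byte) uint32 {
	w := uint32(b[3]) | uint32(b[2])<<8 | uint32(b[1])<<16 | uint32(b[0])<<24
	return (w * hashmul) >> (32 - hashBits)
}

// hash3Step sketches the older rolling style: a cheap shift-add-mask
// update per byte, weaker mixing, hence more chain searches.
func hash3Step(h uint32, c byte) uint32 {
	return (h<<5 + uint32(c)) & (1<<hashBits - 1)
}

func main() {
	b := []byte("gopher")
	fmt.Println(hash4(b) < 1<<hashBits) // the final shift guarantees this
}
```

Note that `hash4` only ever reads `b[0]` through `b[3]`, so it hashes the 4-byte window regardless of slice length.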
On Go 1.6, the simpler hash function compiled on arm32 to approximately 17 instructions. On Go 1.7, the stronger hash function compiled on arm32 to approximately 77 instructions, while on amd64 it compiles to approximately 31 instructions.
Any changes to fix this would probably be too big this late in the release cycle. We can think about better approaches in Go 1.8. I don't know enough about ARM to know whether it will ever have unaligned reads/writes, so it may be worth investigating using the previous hash function instead of the new one on non-x86 architectures.
josharian commented on Jun 29, 2016
If you use the dev.ssa branch and hack in the config to make ARM use SSA, how do things look? I.e., will SSA for ARM save us for free?
cherrymui commented on Jun 30, 2016
@dsnet, which function did you disassemble? Is it `hash4` in `compress/flate`?

@josharian, SSA can get the performance back on ARM:

This is tip of `dev.ssa` plus my pending CLs, comparing SSA off vs. on. Not sure about the hash function specifically (as I don't know which function it is...)

dsnet commented on Jun 30, 2016
Correct, it is hash4. It's awesome to see the performance boosted by SSA :)
cherrymui commented on Jun 30, 2016
For `hash4`, somehow I see a different disassembly than what you saw.

On ARM32, with SSA off:
With SSA on:
So it is about 50% shorter. Currently it does not do a combined unaligned load; it seems CSE helps the most.

It might be possible to do a combined load. Newer ARM supports unaligned loads, but currently the compiler (both backends) generates the same instructions, and only the assembler rewrites some instructions based on GOARM.
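The two code shapes under discussion can be illustrated like this. Whether the `encoding/binary` form actually lowers to a single unaligned load depends on the target and compiler version; this sketch only shows that the two forms are semantically equivalent:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// loadBytes assembles a 32-bit little-endian word from four single-byte
// loads — the shape a "combined load" optimization would need to recognize
// and fuse into one unaligned load on targets that allow it.
func loadBytes(b []byte) uint32 {
	return uint32(b[0]) | uint32(b[1])<<8 | uint32(b[2])<<16 | uint32(b[3])<<24
}

// loadWord expresses the same read via encoding/binary; on some targets
// and compiler versions this compiles down to a single MOV-style load.
func loadWord(b []byte) uint32 {
	return binary.LittleEndian.Uint32(b)
}

func main() {
	b := []byte{1, 2, 3, 4}
	fmt.Println(loadBytes(b) == loadWord(b))
}
```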
gopherbot commented on Jun 30, 2016
CL https://golang.org/cl/24640 mentions this issue.
runtime/internal/sys: implement Ctz and Bswap in assembly for 386
gopherbot commented on Jun 30, 2016
CL https://golang.org/cl/24642 mentions this issue.
gopherbot commented on Jun 30, 2016
CL https://golang.org/cl/24643 mentions this issue.
ianlancetaylor commented on Jul 6, 2016
This issue has headed in several different directions.
I don't think we currently plan to do any further work to speed up the reflect package, which is the core of the original report. It would be nice to know what the performance losses are on tip.
I'm going to close this issue. If there is further work to be done for 1.7 let's move it to more focused issues.
gopherbot commented on Sep 30, 2016
CL https://golang.org/cl/24521 mentions this issue.