cmd/compile: go1.7beta2 performance regressions #16117

Closed

Description

@ajstarks

Please answer these questions before submitting your issue. Thanks!

  1. What version of Go are you using (go version)?

go 1.6.2, go1.7beta2

  2. What operating system and processor architecture are you using (go env)?

linux/arm64, linux/arm

  3. What did you do?

run the standard go1 benchmarks

  4. What did you expect to see?

little to no performance regression

  5. What did you see instead?

see:
https://gist.github.com/ajstarks/fa05919377de73a552870ce9e45e4844 for linux/arm64 (pine64)
https://gist.github.com/ajstarks/c937a1a80e76b4692cb52c1d57a4494d for linux/arm (CHIP)

regressions occur on GobEncode, GobDecode, JSONEncode, and JSONDecode

Activity

changed the title from "go1.7beta2 performance regressions on ARM" to "all: go1.7beta2 performance regressions on ARM" on Jun 19, 2016
added this to the Go1.7 milestone on Jun 19, 2016
josharian (Contributor) commented on Jun 19, 2016

I'm away from my arm(64) machines for a couple of weeks. Help bisecting would be lovely.

ajstarks (Author) commented on Jun 19, 2016

benchviz for Pine64
pine64.pdf

changed the title from "all: go1.7beta2 performance regressions on ARM" to "cmd/compile: go1.7beta2 performance regressions on ARM" on Jun 20, 2016
cherrymui (Member) commented on Jun 20, 2016

@quentinmit and I are looking at this. We found that it also slows down on linux/386, and linux/amd64 with -ssa=0, with GobDecode, JSONDecode and Template being the most significant. Gzip is getting faster, though. So, probably it is not the compiler?

josharian (Contributor) commented on Jun 20, 2016

Could be the runtime, or it could be front-end compiler changes. I think there were nearly zero 386- or arm-specific fixes this cycle, so it might be worth comparing the generated code to rule out the compiler (perhaps after some profiling to pinpoint the change).

josharian (Contributor) commented on Jun 20, 2016

All those packages use reflection. Could it be related to the type reworking?

changed the title from "cmd/compile: go1.7beta2 performance regressions on ARM" to "cmd/compile: go1.7beta2 performance regressions" on Jun 21, 2016
quentinmit (Contributor) commented on Jun 21, 2016

Reflection is definitely part of the puzzle. Here's the state of our investigation so far:

  • Template: ~22% performance hit on arm32, arm64, and 386. Largest hit due to http://golang.org/cl/20968. @RLH is trying to use VTune to pin down the slow code.
  • Revcomp: ~30% performance hit on 386 only. Entirely due to http://golang.org/cl/20901. @dr2chase is looking at this.
  • GobDecode: ~10% performance hit on arm32, arm64, and 386. Very noisy, no obvious culprit.
  • JSONDecode: ~30% performance hit on arm32 and arm64, ~15% on 386. Less noisy, still no obvious culprit.
  • Gzip: ~20% performance hit on arm32, arm64. Still need to bisect (not as easy as bisecting on 386).
RLH (Contributor) commented on Jun 21, 2016

The slow version contains a call to a routine that the faster version does not contain. It accounts for around 3% of the total time and is the only routine that stood out in the top 20:

    routine               inst retired   cpu_clk unhalted
    reflect.(*name).name  3.3%           2.9%              (type.go, 0x8166a40, go1-template-bad)

ianlancetaylor (Contributor) commented on Jun 21, 2016

@RLH Is it easy for you to find the significant callers of reflect.(*name).name?

It's possible that the significant callers are from code that is looking for a specific name, which we could probably speed up somewhat by doing things like checking the length before building the string we return.
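
A rough sketch of that idea (hypothetical types and method names; reflect's real name encoding lives in reflect/type.go and its layout may differ from what is assumed here):

    // Illustrative stand-in for reflect's encoded name data: a flag byte,
    // a two-byte length, then the string bytes. The layout is assumed for
    // this sketch, not copied from the real implementation.
    type name struct{ bytes []byte }

    func (n name) nameLen() int { return int(n.bytes[1])<<8 | int(n.bytes[2]) }

    // equalsName answers "is this the name s?" without building the string
    // that (*name).name() returns: reject on length first, then compare bytes.
    func (n name) equalsName(s string) bool {
        if n.nameLen() != len(s) {
            return false // cheap length check, no allocation
        }
        data := n.bytes[3 : 3+len(s)]
        for i := 0; i < len(s); i++ {
            if data[i] != s[i] {
                return false
            }
        }
        return true
    }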

dr2chase (Contributor) commented on Jun 21, 2016

I have an existence proof that code alignment can cost you 10% on Revcomp.

[34 hidden items]

quentinmit (Contributor) commented on Jun 29, 2016

@dsnet We are running the Gzip benchmark as found in test/bench/go1/gzip_test.go

dsnet (Member) commented on Jun 29, 2016

tl;dr, leave gzip as is for go1.7, we can address arm performance issues in go1.8

Alright, I think I've figured out what's causing the slow-down for ARM. While the go1.7 optimizations as a whole have helped arm and arm64 (as seen in my post above), there are some cases where performance regresses (as is the case for test/bench/go1/gzip_test.go).

One of the major changes made to compress/flate was that we started using 4-byte hashing instead of 3-byte hashing. The new hash function is stronger and leads to fewer collisions and thus fewer searches, which is good for performance. On amd64, we take advantage of the fact that most CPUs can perform 4-byte unaligned loads pretty efficiently (#14267). However, this optimization is not possible on arm, so the arm targets have to pay an extra cost to use the stronger hash function.
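
For reference, the 4-byte hash in question looks roughly like the sketch below; this is a reconstruction, with the multiplier (0x1e35a7bd = 506832829) and the shift by 15 taken from the listings that follow rather than quoted from the source:

    // Sketch of hash4: combine four bytes, multiply by a constant, keep the
    // top bits. On arm this turns into the long sequence shown below because
    // each byte is loaded and bounds-checked separately.
    func hash4(b []byte) uint32 {
        return ((uint32(b[3]) | uint32(b[2])<<8 | uint32(b[1])<<16 | uint32(b[0])<<24) * 0x1e35a7bd) >> 15
    }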

On Go1.6, the simpler hash function compiled on arm32 to approximately 17 instrs:

102638: MOVW 0x44(R5), R1
10263c: MOVW $2, R2
102640: ADD R2, R1, R2
102644: ADD $72, R5, R1
102648: MOVW 0x4(R1), R3
10264c: CMP R3, R2
102650: B.CC 0x10265c
102654: BL runtime.panicindex(SB)
102658: UNDEF
10265c: MOVW (R1), R1
102660: MOVB (R1)(R2), R0
102664: MOVW 0x74(R5), R1
102668: LSL $6, R1, R1
10266c: ADD R1, R0, R0
102670: MOVW 0x46c(R15), R1
102674: AND R1, R0, R0
102678: MOVW R0, 0x74(R5)

On Go1.7, the stronger hash function compiled on arm32 to approximately 77 instrs:

e7d74: MOVW 0x5e8(R15), R11
e7d78: MOVW (R5)(R11), R2
e7d7c: MOVW 0x5e0(R15), R11
e7d80: MOVW (R5)(R11), R0
e7d84: MOVW $4, R1
e7d88: ADD R1, R0, R3
e7d8c: MOVW R2, R1
e7d90: MOVW R3, R2
e7d94: MOVW 0x5d4(R15), R11
e7d98: MOVW (R5)(R11), R3
e7d9c: CMP R3, R2
e7da0: B.HI 0xe8a90
e7da4: CMP R2, R1
e7da8: B.HI 0xe8a90
e7dac: SUB R1, R2, R2
e7db0: SUB R1, R3, R3
e7db4: MOVW 0x5b8(R15), R11
e7db8: MOVW (R5)(R11), R4
e7dbc: CMP $0, R3
e7dc0: B.EQ 0xe7dc8
e7dc4: ADD R1, R4, R4
e7dc8: MOVW R2, R8
e7dcc: MOVW R3, R7
e7dd0: MOVW R4, 0x50(R13)
e7dd4: MOVW R2, 0x54(R13)
e7dd8: MOVW R3, 0x58(R13)
e7ddc: MOVW $0, R0
e7de0: ADD $80, R13, R0
e7de4: MOVW 0x4(R0), R1
e7de8: MOVW $3, R2
e7dec: CMP R2, R1
e7df0: B.HI 0xe7dfc
e7df4: BL runtime.panicindex(SB)
e7df8: UNDEF
e7dfc: MOVW (R0), R0
e7e00: MOVB 0x3(R0), R0
e7e04: ADD $80, R13, R1
e7e08: MOVW 0x4(R1), R2
e7e0c: MOVW $2, R3
e7e10: CMP R3, R2
e7e14: B.HI 0xe7e20
e7e18: BL runtime.panicindex(SB)
e7e1c: UNDEF
e7e20: MOVW (R1), R1
e7e24: MOVB 0x2(R1), R1
e7e28: LSL $8, R1, R1
e7e2c: ORR R1, R0, R0
e7e30: ADD $80, R13, R1
e7e34: MOVW 0x4(R1), R2
e7e38: MOVW $1, R3
e7e3c: CMP R3, R2
e7e40: B.HI 0xe7e4c
e7e44: BL runtime.panicindex(SB)
e7e48: UNDEF
e7e4c: MOVW (R1), R1
e7e50: MOVB 0x1(R1), R1
e7e54: LSL $16, R1, R1
e7e58: ORR R1, R0, R0
e7e5c: ADD $80, R13, R1
e7e60: MOVW 0x4(R1), R2
e7e64: CMP $0, R2
e7e68: B.HI 0xe7e74
e7e6c: BL runtime.panicindex(SB)
e7e70: UNDEF
e7e74: MOVW (R1), R1
e7e78: MOVB (R1), R1
e7e7c: LSL $24, R1, R1
e7e80: ORR R1, R0, R0
e7e84: MOVW 0x4ec(R15), R1
e7e88: MUL R0, R1, R0
e7e8c: LSR $15, R0, R2
e7e90: CMP $0, R5
e7e94: MOVW.EQ R5, (R5)
e7e98: MOVW 0x4dc(R15), R11
e7e9c: ?
e8a90: BL runtime.panicslice(SB)
e8a94: UNDEF

On Go1.7, the stronger hash function compiles on amd64 to approximately 31 instrs:

4bb0d9: MOVQ 0xa0080(AX), CX
4bb0e0: MOVQ 0xa0070(AX), BX
4bb0e7: LEAQ 0x4(DX), SI
4bb0eb: CMPQ SI, DX
4bb0ee: JA 0x4bbc9f
4bb0f4: CMPQ CX, SI
4bb0f7: JA 0x4bbc9f
4bb0fd: SUBQ DX, CX
4bb100: TESTQ CX, CX
4bb103: JE 0x4bbc98
4bb109: MOVZX 0x3(BX)(DX*1), CX
4bb10e: MOVZX 0x2(BX)(DX*1), SI
4bb113: MOVZX 0x1(BX)(DX*1), DI
4bb118: MOVZX 0(BX)(DX*1), DX
4bb11c: SHLL $0x8, SI
4bb11f: ORL SI, CX
4bb121: SHLL $0x10, DI
4bb124: ORL DI, CX
4bb126: SHLL $0x18, DX
4bb129: ORL DX, CX
4bb12b: IMULL $0x1e35a7bd, CX, CX
4bb131: SHRL $0xf, CX
4bb134: MOVL CX, 0xa00c8(AX)
4bb261: XORL SI, SI
4bb5a8: XORL DX, DX
4bb644: XORL BX, BX
4bbab2: XORL CX, CX
4bbc98: XORL DX, DX
4bbc9a: JMP 0x4bb109
4bbc9f: CALL runtime.panicslice(SB)
4bbca4: UD2

Any changes to fix this would probably be too big to make this late in the release cycle. We can think about better approaches in Go 1.8. I don't know enough about ARM to know if it will ever have unaligned reads/writes, so it may be worth investigating using the previous hash function instead of the new one for non-x86 architectures.
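
One possible shape for that investigation (a sketch only, with hypothetical file names and a placeholder fallback body; nothing like this has been proposed as an actual CL) would be to select the hash per architecture with build tags:

    // hash_amd64.go (hypothetical file)
    // +build amd64

    package flate

    // 4-byte hash; fine where unaligned 4-byte loads are cheap.
    func hash4(b []byte) uint32 {
        return ((uint32(b[3]) | uint32(b[2])<<8 | uint32(b[1])<<16 | uint32(b[0])<<24) * 0x1e35a7bd) >> 15
    }

    // hash_other.go (hypothetical file)
    // +build !amd64

    package flate

    // Cheaper fallback for targets without fast unaligned loads. Placeholder
    // body: the real Go 1.6 code used a rolling hash updated one byte at a
    // time rather than a standalone 3-byte function.
    func hash4(b []byte) uint32 {
        return ((uint32(b[2]) | uint32(b[1])<<8 | uint32(b[0])<<16) * 0x1e35a7bd) >> 15
    }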

josharian (Contributor) commented on Jun 29, 2016

If you use the dev.ssa branch and hack in the config to make arm use SSA, how do things look? I.e., will SSA for arm save us for free?

cherrymui (Member) commented on Jun 30, 2016

@dsnet, which function did you disassemble? Is it hash4 in compress/flate?

@josharian, SSA can get the performance back on ARM:

Gzip-4                      3.68s ± 0%      2.59s ± 1%  -29.76%

Gzip-4                   5.27MB/s ± 0%   7.50MB/s ± 1%  +42.36%

This is tip of dev.ssa plus my pending CLs, comparing with SSA off vs on. Not sure about the hash function specifically (as I don't know which function it is...)

dsnet (Member) commented on Jun 30, 2016

Correct, it is hash4. It is awesome to see the performance boosted with SSA :)

cherrymui (Member) commented on Jun 30, 2016

For hash4, somehow I see a different disassembly than what you saw:
On ARM32, with SSA off:

    0x0000 00000    TEXT    "".hash4(SB), $0-16
    0x0000 00000    MOVW    8(g), R1
    0x0004 00004    CMP R1, R13
    0x0008 00008    BLS 204
    0x000c 00012    MOVW.W  R14, -4(R13)
    0x0010 00016    FUNCDATA    $0, gclocals·8355ad952265fec823c17fcf739bd009(SB)
    0x0010 00016    FUNCDATA    $1, gclocals·69c1753bd5f81501d95132d08af04464(SB)
    0x0010 00016    MOVW    $0, R0
    0x0014 00020    MOVW    $"".b(FP), R0
    0x0018 00024    MOVW    4(R0), R1
    0x001c 00028    MOVW    $3, R2
    0x0020 00032    CMP R2, R1
    0x0024 00036    BHI $1, 48
    0x0028 00040    PCDATA  $0, $1
    0x0028 00040    CALL    runtime.panicindex(SB)
    0x002c 00044    UNDEF
    0x0030 00048    MOVW    (R0), R0
    0x0034 00052    MOVBU   3(R0), R0
    0x0038 00056    MOVW    $"".b(FP), R1
    0x003c 00060    MOVW    4(R1), R2
    0x0040 00064    MOVW    $2, R3
    0x0044 00068    CMP R3, R2
    0x0048 00072    BHI $1, 84
    0x004c 00076    PCDATA  $0, $1
    0x004c 00076    CALL    runtime.panicindex(SB)
    0x0050 00080    UNDEF
    0x0054 00084    MOVW    (R1), R1
    0x0058 00088    MOVBU   2(R1), R1
    0x005c 00092    MOVW    R1<<8, R1
    0x0060 00096    ORR R1, R0, R0
    0x0064 00100    MOVW    $"".b(FP), R1
    0x0068 00104    MOVW    4(R1), R2
    0x006c 00108    MOVW    $1, R3
    0x0070 00112    CMP R3, R2
    0x0074 00116    BHI $1, 128
    0x0078 00120    PCDATA  $0, $1
    0x0078 00120    CALL    runtime.panicindex(SB)
    0x007c 00124    UNDEF
    0x0080 00128    MOVW    (R1), R1
    0x0084 00132    MOVBU   1(R1), R1
    0x0088 00136    MOVW    R1<<16, R1
    0x008c 00140    ORR R1, R0
    0x0090 00144    MOVW    $"".b(FP), R1
    0x0094 00148    MOVW    4(R1), R2
    0x0098 00152    CMP $0, R2
    0x009c 00156    BHI $1, 168
    0x00a0 00160    PCDATA  $0, $1
    0x00a0 00160    CALL    runtime.panicindex(SB)
    0x00a4 00164    UNDEF
    0x00a8 00168    MOVW    (R1), R1
    0x00ac 00172    MOVBU   (R1), R1
    0x00b0 00176    MOVW    R1<<24, R1
    0x00b4 00180    ORR R1, R0
    0x00b8 00184    MOVW    $506832829, R1
    0x00bc 00188    MULU    R1, R0
    0x00c0 00192    MOVW    R0>>15, R0
    0x00c4 00196    MOVW    R0, "".~r1+12(FP)
    0x00c8 00200    MOVW.P  4(R13), R15
    0x00cc 00204    NOP
    0x00cc 00204    MOVW    R14, R3
    0x00d0 00208    CALL    runtime.morestack_noctxt(SB)
    0x00d4 00212    JMP 0
    0x00d8 00216    JMP 0(PC)
    0x00dc 00220    WORD    $506832829

With SSA on:

    0x0000 00000    TEXT    "".hash4(SB), $4-16
    0x0000 00000    MOVW    8(g), R1
    0x0004 00004    CMP R1, R13
    0x0008 00008    BLS 92
    0x000c 00012    MOVW.W  R14, -8(R13)
    0x0010 00016    FUNCDATA    $0, gclocals·8355ad952265fec823c17fcf739bd009(SB)
    0x0010 00016    FUNCDATA    $1, gclocals·69c1753bd5f81501d95132d08af04464(SB)
    0x0010 00016    MOVW    "".b+4(FP), R0
    0x0014 00020    CMP $3, R0
    0x0018 00024    BLS 84
    0x001c 00028    MOVW    "".b(FP), R0
    0x0020 00032    MOVBU   3(R0), R1
    0x0024 00036    MOVBU   2(R0), R2
    0x0028 00040    MOVBU   1(R0), R3
    0x002c 00044    MOVBU   (R0), R0
    0x0030 00048    ORR R2<<8, R1, R1
    0x0034 00052    MOVW    R3, R2
    0x0038 00056    ORR R2<<16, R1, R1
    0x003c 00060    ORR R0<<24, R1, R0
    0x0040 00064    MOVW    $506832829, R1
    0x0044 00068    MUL R0, R1, R0
    0x0048 00072    SRL $15, R0, R0
    0x004c 00076    MOVW    R0, "".~r1+12(FP)
    0x0050 00080    MOVW.P  8(R13), R15
    0x0054 00084    PCDATA  $0, $1
    0x0054 00084    CALL    runtime.panicindex(SB)
    0x0058 00088    UNDEF
    0x005c 00092    NOP
    0x005c 00092    MOVW    R14, R3
    0x0060 00096    CALL    runtime.morestack_noctxt(SB)
    0x0064 00100    JMP 0
    0x0068 00104    JMP 0(PC)
    0x006c 00108    WORD    $506832829

So it is about 50% shorter. Currently it does not do a combined unaligned load; most of the improvement seems to come from CSE.

It might be possible to do a combined load. Newer ARM chips support unaligned loads, but currently the compiler (both backends) generates the same instructions regardless; only the assembler rewrites some instructions based on GOARM.
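
For illustration, this is the kind of byte-combining pattern such a rule would need to recognize (hypothetical helper, not code from the compiler or from compress/flate):

    // Four byte loads plus shifts/ORs assembling a little-endian uint32.
    // A backend rule could, in principle, fuse this into a single 32-bit
    // load on ARM variants that permit unaligned accesses.
    func load32(b []byte) uint32 {
        _ = b[3] // hint so one bounds check can cover the four accesses below
        return uint32(b[0]) | uint32(b[1])<<8 | uint32(b[2])<<16 | uint32(b[3])<<24
    }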

gopherbot commented on Jun 30, 2016

CL https://golang.org/cl/24640 mentions this issue.

gopherbot commented on Jun 30, 2016

CL https://golang.org/cl/24642 mentions this issue.

gopherbot commented on Jun 30, 2016

CL https://golang.org/cl/24643 mentions this issue.

ianlancetaylor (Contributor) commented on Jul 6, 2016

This issue has headed in several different directions.

I don't think we currently plan to do any further work to speed up the reflect package, which is the core of the original report. It would be nice to know what the performance losses are on tip.

I'm going to close this issue. If there is further work to be done for 1.7 let's move it to more focused issues.

gopherbot commented on Sep 30, 2016

CL https://golang.org/cl/24521 mentions this issue.

locked and limited conversation to collaborators on Sep 30, 2017