
Performance bottleneck in function maskBytes #31


Closed
ghjp opened this issue Jun 15, 2014 · 14 comments

Comments


ghjp commented Jun 15, 2014

I have investigated the performance of the websocket communication. With the kernel "perf" utility I saw that the function maskBytes consumes a lot of CPU cycles, so I optimized it. I now get twice the data throughput (900 MB/s instead of 440 MB/s). At the moment it is optimized for 64-bit machines, and I'm not sure the optimization works on big-endian machines. The trick is to apply the mask in 64-bit chunks instead of byte by byte. Here it is:

import "unsafe"

func maskBytes(key [4]byte, pos int, b []byte) int {
    // Construct a 64-bit mask from the 4-byte key.
    k32 := uint32(key[0]) | uint32(key[1])<<8 | uint32(key[2])<<16 | uint32(key[3])<<24
    k64 := uint64(k32)
    k64 = k64<<32 | k64
    // Rotate the mask right by pos bytes so that mask byte 0 lines up
    // with key[pos&3] (little-endian machines only).
    s := uint(pos&3) << 3
    k64 = k64>>s | k64<<(64-s)
    // Apply the 64-bit mask to the data in 8-byte chunks.
    for len(b) >= 8 {
        *(*uint64)(unsafe.Pointer(&b[0])) ^= k64
        b = b[8:]
        pos += 8
    }
    // Mask the remaining bytes one at a time.
    for i := range b {
        b[i] ^= key[pos&3]
        pos++
    }
    return pos & 3
}

aktau commented Jun 23, 2014

If it really is such a bottleneck, it might even be worth doing this in 16-byte chunks. The intrinsics in http://stackoverflow.com/questions/17742741/websocket-data-unmasking-multi-byte-xor can be translated into ASM relatively easily (one intrinsic = one instruction).

One could conditionally compile this on x64 (which always has SSE2) and fall back to your implementation on other architectures. For really good performance one would need a prologue and epilogue to align the buffer to a 16-byte boundary (mem-address & 0xF; examples of prologues/epilogues can be found all over the net).

Good find!

EDIT: Since you did perf testing, do you know how gorilla/websocket stacks up to go.net on that front?

EDIT2: An example of integrating assembly and Go can be found here: https://bitbucket.org/zombiezen/math3/src


garyburd commented Oct 19, 2016

Yes, I am still interested.

It's possible to get a big performance gain without writing assembly. The basic idea is to mask a byte at a time to a word boundary, use the unsafe package to mask a word at a time, and finish by masking the individual bytes after the last whole word. See xor.go for a similar approach used in the standard library. Because the mask can be rotated, it's always possible to use aligned writes.
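The "mask a byte at a time to a word boundary" step above boils down to computing how many leading bytes to handle before the data pointer becomes word-aligned. A minimal sketch (the name headLen is mine; a word size of 8 is assumed for the example values):

```go
package main

import "fmt"

// headLen reports how many leading bytes must be masked one at a time
// before an address becomes a multiple of wordSize.
func headLen(addr, wordSize uintptr) int {
	if r := addr % wordSize; r != 0 {
		return int(wordSize - r)
	}
	return 0
}

func main() {
	fmt.Println(headLen(0x1000, 8)) // already aligned -> 0
	fmt.Println(headLen(0x1003, 8)) // 5 bytes to the next boundary
}
```

After masking that many bytes with the plain loop, every subsequent word-sized write is aligned, and rotating the key by the bytes already consumed keeps the mask in phase.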


wI2L commented Oct 20, 2016

I benchmarked the current implementation against @ghjp's, and the latter is about 8x faster on my machine (2.6 GHz Intel Core i7):

BenchmarkMaskBytes-8         2000000           659 ns/op    1551.58 MB/s
BenchmarkGhjpMaskBytes-8    10000000           116 ns/op    8769.75 MB/s

@garyburd

Doh! I should have refreshed my memory on what's in this issue before responding. How does it perform when the slice is not aligned on a word boundary?


wI2L commented Oct 20, 2016

@garyburd I am not sure how to simulate this in Go.
Here is the current bench func:

func BenchmarkMaskBytes(b *testing.B) {
    var key [4]byte
    data := make([]byte, 1024)
    pos := 0
    for i := 0; i < b.N; i++ {
        pos = maskBytes(key, pos, data)
    }
    b.SetBytes(int64(len(data)))
}


garyburd commented Oct 20, 2016

make allocates a word-aligned backing array. Index into the backing array to get an unaligned slice.

func BenchmarkMaskBytes(b *testing.B) {
    var key [4]byte
    data := make([]byte, 1025)
    data = data[1:] // <--- the new slice's data pointer is not word-aligned
    pos := 0
    for i := 0; i < b.N; i++ {
        pos = maskBytes(key, pos, data)
    }
    b.SetBytes(int64(len(data)))
}

Edit: see this test.


wI2L commented Oct 20, 2016

Performance decreases with an unaligned slice:

BenchmarkMaskBytes-8                     2000000           668 ns/op    1530.78 MB/s
BenchmarkGhjpMaskBytes-8                20000000           116 ns/op    8819.79 MB/s
BenchmarkGhjpMaskBytesUnaligned-8        2000000           665 ns/op    1538.29 MB/s


wI2L commented Oct 20, 2016

I tried to write the code with the help of the Go xor examples, but I don't know how to mask a whole word using the key, or how to detect the first word boundary (mem & 0x07?), so what I've done does not pass the TestFraming test.

@garyburd

Here's a start on the function.

import "unsafe"

const (
    wordSize         = int(unsafe.Sizeof(uintptr(0)))
    simpleMaskCutoff = wordSize * 2
)

func maskBytesSimple(key [4]byte, pos int, b []byte) int {
    for i := range b {
        b[i] ^= key[pos&3]
        pos++
    }
    return pos & 3
}

func maskBytes(key [4]byte, pos int, b []byte) int {
    // TODO: Run benchmarks to determine if simple mask function is faster on
    // small buffers. Adjust simpleMaskCutoff for best performance.
    if len(b) < simpleMaskCutoff {
        return maskBytesSimple(key, pos, b)
    }

    // Mask to aligned position in b.

    if n := int(uintptr(unsafe.Pointer(&b[0])) % uintptr(wordSize)); n != 0 {
        n = wordSize - n
        pos = maskBytesSimple(key, pos, b[:n])
        b = b[n:]
    }

    // Mask aligned words.

    w := len(b) / wordSize
    if w > 0 { // TODO: Remove this test if simpleMaskCutoff >= 2 * wordSize
        var k [wordSize]byte
        for i := range k {
            k[i] = key[(pos+i)&3]
        }

        bw := *(*[]uintptr)(unsafe.Pointer(&b))
        kw := *(*uintptr)(unsafe.Pointer(&k))
        for i := 0; i < w; i++ {
            // TODO: Use unsafe trickery to avoid bounds check on bw.
            bw[i] ^= kw
        }
    }

    // Mask remaining bytes.

    return maskBytesSimple(key, pos, b[w*wordSize:])
}


wI2L commented Oct 20, 2016

Tbh, I overvalued my skills for this task; it seems you are basically doing this yourself, but I am happy to help with the benchmarks, and at least I am learning from you.
The first benchmarks from your implementation look very promising.
Regarding the unsafe trickery to avoid the bounds check, I tried this:

bww := *(*uintptr)(unsafe.Pointer(&bw[i]))
bww ^= kw

but the TestFraming test fails:

--- FAIL: TestFraming (0.09s)
    conn_test.go:107: z:false, s:false, r:half, n:65534 c:true: bad byte at offset 1016
    conn_test.go:107: z:false, s:false, r:half, n:65534 c:false: bad byte at offset 1016
    conn_test.go:107: z:false, s:false, r:half, n:65535 c:true: bad byte at offset 1016
    conn_test.go:107: z:false, s:false, r:half, n:65535 c:false: bad byte at offset 1016
    conn_test.go:107: z:false, s:false, r:half, n:65536 c:true: bad byte at offset 1016
    conn_test.go:107: z:false, s:false, r:half, n:65536 c:false: bad byte at offset 1017
    conn_test.go:107: z:false, s:false, r:half, n:65537 c:true: bad byte at offset 1016
    conn_test.go:107: z:false, s:false, r:half, n:65537 c:false: bad byte at offset 1016
    conn_test.go:107: z:false, s:false, r:one, n:124 c:true: bad byte at offset 0
    conn_test.go:107: z:false, s:false, r:one, n:124 c:false: bad byte at offset 0
    conn_test.go:107: z:false, s:false, r:one, n:125 c:true: bad byte at offset 0
    conn_test.go:107: z:false, s:false, r:one, n:125 c:false: bad byte at offset 0
    conn_test.go:107: z:false, s:false, r:one, n:126 c:true: bad byte at offset 0
    conn_test.go:107: z:false, s:false, r:one, n:126 c:false: bad byte at offset 0
    conn_test.go:107: z:false, s:false, r:one, n:127 c:true: bad byte at offset 0
    conn_test.go:107: z:false, s:false, r:one, n:127 c:false: bad byte at offset 0
    conn_test.go:107: z:false, s:false, r:one, n:128 c:true: bad byte at offset 0
    conn_test.go:107: z:false, s:false, r:one, n:128 c:false: bad byte at offset 0
    conn_test.go:107: z:false, s:false, r:one, n:129 c:true: bad byte at offset 0
    conn_test.go:107: z:false, s:false, r:one, n:129 c:false: bad byte at offset 0
    conn_test.go:107: z:false, s:false, r:one, n:65534 c:true: bad byte at offset 0
    conn_test.go:107: z:false, s:false, r:one, n:65534 c:false: bad byte at offset 0
    conn_test.go:107: z:false, s:false, r:one, n:65535 c:true: bad byte at offset 0
    conn_test.go:107: z:false, s:false, r:one, n:65535 c:false: bad byte at offset 0
    conn_test.go:107: z:false, s:false, r:one, n:65536 c:true: bad byte at offset 0
    conn_test.go:107: z:false, s:false, r:one, n:65536 c:false: bad byte at offset 0
    conn_test.go:107: z:false, s:false, r:one, n:65537 c:true: bad byte at offset 0
    conn_test.go:107: z:false, s:false, r:one, n:65537 c:false: bad byte at offset 0
    conn_test.go:107: z:false, s:false, r:asis, n:65534 c:true: bad byte at offset 1016
    conn_test.go:107: z:false, s:false, r:asis, n:65534 c:false: bad byte at offset 1016
    conn_test.go:107: z:false, s:false, r:asis, n:65535 c:true: bad byte at offset 1016
    conn_test.go:107: z:false, s:false, r:asis, n:65535 c:false: bad byte at offset 1016
    conn_test.go:107: z:false, s:false, r:asis, n:65536 c:true: bad byte at offset 1016
    conn_test.go:107: z:false, s:false, r:asis, n:65536 c:false: bad byte at offset 1016
    conn_test.go:107: z:false, s:false, r:asis, n:65537 c:true: bad byte at offset 1016
    conn_test.go:107: z:false, s:false, r:asis, n:65537 c:false: bad byte at offset 1016
    conn_test.go:96: z:true, s:false, r:one, n:124 c:true: ReadFull() returned rbuf, flate: corrupt input before offset 4
    conn_test.go:96: z:true, s:false, r:one, n:124 c:false: ReadFull() returned rbuf, flate: corrupt input before offset 13
    conn_test.go:96: z:true, s:false, r:one, n:125 c:true: ReadFull() returned rbuf, flate: corrupt input before offset 5
    conn_test.go:96: z:true, s:false, r:one, n:125 c:false: ReadFull() returned rbuf, flate: corrupt input before offset 5
    conn_test.go:96: z:true, s:false, r:one, n:126 c:true: ReadFull() returned rbuf, flate: corrupt input before offset 1
    conn_test.go:96: z:true, s:false, r:one, n:126 c:false: ReadFull() returned rbuf, flate: corrupt input before offset 1
    conn_test.go:96: z:true, s:false, r:one, n:127 c:true: ReadFull() returned rbuf, flate: corrupt input before offset 5
    conn_test.go:96: z:true, s:false, r:one, n:127 c:false: ReadFull() returned rbuf, flate: corrupt input before offset 3
    conn_test.go:96: z:true, s:false, r:one, n:128 c:true: ReadFull() returned rbuf, flate: corrupt input before offset 3
    conn_test.go:96: z:true, s:false, r:one, n:128 c:false: ReadFull() returned rbuf, flate: corrupt input before offset 9
    conn_test.go:96: z:true, s:false, r:one, n:129 c:true: ReadFull() returned rbuf, flate: corrupt input before offset 5
    conn_test.go:96: z:true, s:false, r:one, n:129 c:false: ReadFull() returned rbuf, flate: corrupt input before offset 3
    conn_test.go:96: z:true, s:false, r:one, n:65534 c:true: ReadFull() returned rbuf, flate: corrupt input before offset 1
    conn_test.go:96: z:true, s:false, r:one, n:65534 c:false: ReadFull() returned rbuf, flate: corrupt input before offset 9
    conn_test.go:96: z:true, s:false, r:one, n:65535 c:true: ReadFull() returned rbuf, flate: corrupt input before offset 5
    conn_test.go:96: z:true, s:false, r:one, n:65535 c:false: ReadFull() returned rbuf, flate: corrupt input before offset 7
    conn_test.go:96: z:true, s:false, r:one, n:65536 c:true: ReadFull() returned rbuf, flate: corrupt input before offset 15
    conn_test.go:96: z:true, s:false, r:one, n:65536 c:false: ReadFull() returned rbuf, flate: corrupt input before offset 5
    conn_test.go:96: z:true, s:false, r:one, n:65537 c:true: ReadFull() returned rbuf, flate: corrupt input before offset 6
    conn_test.go:96: z:true, s:false, r:one, n:65537 c:false: ReadFull() returned rbuf, flate: corrupt input before offset 4
FAIL

@garyburd

See #169.


wI2L commented Oct 20, 2016

@garyburd Nice! Perf improvements are really good for 32-byte chunks and up.
I see you left the TODO about the bounds check. Do you think that is what causes TestFraming to fail?


garyburd commented Oct 20, 2016

I don't know what caused the test to fail.

I didn't use *(*uintptr)(unsafe.Pointer(&bw[i])) because I think the expression does a bounds check. I need to look at the generated code, but I have not had a chance to do so.

@garyburd

Fixed by 77f1107

Here's the code that I used to avoid the bounds check:

n := (len(b) / wordSize) * wordSize
for i := 0; i < n; i += wordSize {
    *(*uintptr)(unsafe.Pointer(uintptr(unsafe.Pointer(&b[0])) + uintptr(i))) ^= kw
}

gorilla locked and limited conversation to collaborators Feb 14, 2018