
Performance bottleneck in function maskBytes #31


Closed
ghjp opened this issue Jun 15, 2014 · 14 comments

Comments


ghjp commented Jun 15, 2014

I have investigated the performance of the websocket communication. With the kernel "perf" utility I saw that the function maskBytes consumes a lot of CPU cycles, so I optimized it. I now get twice the data throughput (900 MB/s instead of 440 MB/s). At the moment it is optimized for 64-bit machines, and I'm not sure the optimization works on big-endian machines. The trick is to apply the mask in 64-bit chunks instead of byte by byte. Here it is:

import "unsafe"

func maskBytes(key [4]byte, pos int, b []byte) int {
    // Construct a 64-bit mask from the 4-byte key.
    k32 := uint32(key[0]) | uint32(key[1])<<8 | uint32(key[2])<<16 | uint32(key[3])<<24
    k64 := uint64(k32)
    k64 = k64<<32 | k64
    // Rotate the mask right by pos bytes so that mask byte 0 lines up
    // with key[pos&3] (little-endian machines only).
    s := uint(pos&3) << 3
    k64 = k64>>s | k64<<(64-s)
    // Apply the 64-bit mask to the data in 8-byte chunks.
    for len(b) >= 8 {
        *(*uint64)(unsafe.Pointer(&b[0])) ^= k64
        b = b[8:]
        pos += 8
    }
    // Mask the remaining bytes one at a time.
    for i := range b {
        b[i] ^= key[pos&3]
        pos++
    }
    return pos & 3
}

aktau commented Jun 23, 2014

If it really is such a bottleneck, it might even be worth doing this in 16-byte chunks. The intrinsics in http://stackoverflow.com/questions/17742741/websocket-data-unmasking-multi-byte-xor can be translated into ASM relatively easily (one intrinsic = one instruction).

One could conditionally compile this on x64 (which always has SSE2) and fall back to your implementation on other architectures. For really good performance one would need a prologue and epilogue to align the buffer to a 16-byte boundary (mem-address & 0xF; examples of prologues/epilogues can be found all over the net).

Good find!

EDIT: Since you did perf testing, do you know how gorilla/websocket stacks up to go.net on that front?

EDIT2: An example of integrating assembly and Go can be found here: https://bitbucket.org/zombiezen/math3/src


garyburd commented Oct 19, 2016

Yes, I am still interested.

It's possible to get a big performance gain without writing assembly. The basic idea is to mask a byte at a time to a word boundary, use the unsafe package to mask a word at a time, and finish by masking the individual bytes after the last whole word. See xor.go for a similar approach used in the standard library. Because the mask can be rotated, it's always possible to use aligned writes.
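The "mask a byte at a time to a word boundary" step above boils down to computing how many leading bytes to handle before the data pointer becomes word-aligned. A minimal sketch (the name headLen is mine; a word size of 8 is assumed for the example values):

```go
package main

import "fmt"

// headLen reports how many leading bytes must be masked one at a time
// before an address becomes a multiple of wordSize.
func headLen(addr, wordSize uintptr) int {
	if r := addr % wordSize; r != 0 {
		return int(wordSize - r)
	}
	return 0
}

func main() {
	fmt.Println(headLen(0x1000, 8)) // already aligned -> 0
	fmt.Println(headLen(0x1003, 8)) // 5 bytes to the next boundary
}
```

After masking that many bytes with the plain loop, every subsequent word-sized write is aligned, and rotating the key by the bytes already consumed keeps the mask in phase.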


wI2L commented Oct 20, 2016

I benchmarked the current implementation against @ghjp's, and the latter is about 8x faster on my machine (2.6 GHz Intel Core i7):

BenchmarkMaskBytes-8         2000000           659 ns/op    1551.58 MB/s
BenchmarkGhjpMaskBytes-8    10000000           116 ns/op    8769.75 MB/s

@garyburd

Doh! I should have refreshed my memory on what's in this issue before responding. How does it perform when the slice is not aligned on a word boundary?


wI2L commented Oct 20, 2016

@garyburd I am not sure how to simulate this in Go.
Here is the current bench func:

func BenchmarkMaskBytes(b *testing.B) {
    var key [4]byte
    data := make([]byte, 1024)
    pos := 0
    for i := 0; i < b.N; i++ {
        pos = maskBytes(key, pos, data)
    }
    b.SetBytes(int64(len(data)))
}


garyburd commented Oct 20, 2016

make allocates a word-aligned backing array. Index into the backing array to get an unaligned slice.

func BenchmarkMaskBytes(b *testing.B) {
    var key [4]byte
    data := make([]byte, 1025)
    data = data[1:] // <--- the new slice's data pointer is not word-aligned
    pos := 0
    for i := 0; i < b.N; i++ {
        pos = maskBytes(key, pos, data)
    }
    b.SetBytes(int64(len(data)))
}

Edit: see this test.


wI2L commented Oct 20, 2016

Performance decreases with an unaligned slice:

BenchmarkMaskBytes-8                     2000000           668 ns/op    1530.78 MB/s
BenchmarkGhjpMaskBytes-8                20000000           116 ns/op    8819.79 MB/s
BenchmarkGhjpMaskBytesUnaligned-8        2000000           665 ns/op    1538.29 MB/s


wI2L commented Oct 20, 2016

I tried to write the code with the help of the Go xor examples, but I don't know how to mask a whole word using the key, or how to detect the first word boundary (mem & 0x07?), so what I've done does not pass the TestFraming test.

@garyburd

Here's a start on the function.

import "unsafe"

const (
    wordSize         = int(unsafe.Sizeof(uintptr(0)))
    simpleMaskCutoff = wordSize * 2
)

func maskBytesSimple(key [4]byte, pos int, b []byte) int {
    for i := range b {
        b[i] ^= key[pos&3]
        pos++
    }
    return pos & 3
}

func maskBytes(key [4]byte, pos int, b []byte) int {
    // TODO: Run benchmarks to determine if simple mask function is faster on
    // small buffers. Adjust simpleMaskCutoff for best performance.
    if len(b) < simpleMaskCutoff {
        return maskBytesSimple(key, pos, b)
    }

    // Mask to aligned position in b.

    if n := int(uintptr(unsafe.Pointer(&b[0])) % uintptr(wordSize)); n != 0 {
        n = wordSize - n
        pos = maskBytesSimple(key, pos, b[:n])
        b = b[n:]
    }

    // Mask aligned words.

    w := len(b) / wordSize
    if w > 0 { // TODO: Remove this test if simpleMaskCutoff >= 2 * wordSize
        var k [wordSize]byte
        for i := range k {
            k[i] = key[(pos+i)&3]
        }

        bw := *(*[]uintptr)(unsafe.Pointer(&b))
        kw := *(*uintptr)(unsafe.Pointer(&k))
        for i := 0; i < w; i++ {
            // TODO: Use unsafe trickery to avoid bounds check on bw.
            bw[i] ^= kw
        }
    }

    // Mask remaining bytes.

    return maskBytesSimple(key, pos, b[w*wordSize:])
}


wI2L commented Oct 20, 2016

Tbh, I overvalued my skills for this task; it seems you are basically doing this yourself, but I am happy to help with the benchmarks, and at least I am learning from you.
The first benchmarks from your implementation look very promising.
Regarding the unsafe trickery to avoid the bounds check, I tried this:

bww := *(*uintptr)(unsafe.Pointer(&bw[i]))
bww ^= kw

but the TestFraming test fails:

--- FAIL: TestFraming (0.09s)
    conn_test.go:107: z:false, s:false, r:half, n:65534 c:true: bad byte at offset 1016
    conn_test.go:107: z:false, s:false, r:half, n:65534 c:false: bad byte at offset 1016
    conn_test.go:107: z:false, s:false, r:half, n:65535 c:true: bad byte at offset 1016
    conn_test.go:107: z:false, s:false, r:half, n:65535 c:false: bad byte at offset 1016
    conn_test.go:107: z:false, s:false, r:half, n:65536 c:true: bad byte at offset 1016
    conn_test.go:107: z:false, s:false, r:half, n:65536 c:false: bad byte at offset 1017
    conn_test.go:107: z:false, s:false, r:half, n:65537 c:true: bad byte at offset 1016
    conn_test.go:107: z:false, s:false, r:half, n:65537 c:false: bad byte at offset 1016
    conn_test.go:107: z:false, s:false, r:one, n:124 c:true: bad byte at offset 0
    conn_test.go:107: z:false, s:false, r:one, n:124 c:false: bad byte at offset 0
    conn_test.go:107: z:false, s:false, r:one, n:125 c:true: bad byte at offset 0
    conn_test.go:107: z:false, s:false, r:one, n:125 c:false: bad byte at offset 0
    conn_test.go:107: z:false, s:false, r:one, n:126 c:true: bad byte at offset 0
    conn_test.go:107: z:false, s:false, r:one, n:126 c:false: bad byte at offset 0
    conn_test.go:107: z:false, s:false, r:one, n:127 c:true: bad byte at offset 0
    conn_test.go:107: z:false, s:false, r:one, n:127 c:false: bad byte at offset 0
    conn_test.go:107: z:false, s:false, r:one, n:128 c:true: bad byte at offset 0
    conn_test.go:107: z:false, s:false, r:one, n:128 c:false: bad byte at offset 0
    conn_test.go:107: z:false, s:false, r:one, n:129 c:true: bad byte at offset 0
    conn_test.go:107: z:false, s:false, r:one, n:129 c:false: bad byte at offset 0
    conn_test.go:107: z:false, s:false, r:one, n:65534 c:true: bad byte at offset 0
    conn_test.go:107: z:false, s:false, r:one, n:65534 c:false: bad byte at offset 0
    conn_test.go:107: z:false, s:false, r:one, n:65535 c:true: bad byte at offset 0
    conn_test.go:107: z:false, s:false, r:one, n:65535 c:false: bad byte at offset 0
    conn_test.go:107: z:false, s:false, r:one, n:65536 c:true: bad byte at offset 0
    conn_test.go:107: z:false, s:false, r:one, n:65536 c:false: bad byte at offset 0
    conn_test.go:107: z:false, s:false, r:one, n:65537 c:true: bad byte at offset 0
    conn_test.go:107: z:false, s:false, r:one, n:65537 c:false: bad byte at offset 0
    conn_test.go:107: z:false, s:false, r:asis, n:65534 c:true: bad byte at offset 1016
    conn_test.go:107: z:false, s:false, r:asis, n:65534 c:false: bad byte at offset 1016
    conn_test.go:107: z:false, s:false, r:asis, n:65535 c:true: bad byte at offset 1016
    conn_test.go:107: z:false, s:false, r:asis, n:65535 c:false: bad byte at offset 1016
    conn_test.go:107: z:false, s:false, r:asis, n:65536 c:true: bad byte at offset 1016
    conn_test.go:107: z:false, s:false, r:asis, n:65536 c:false: bad byte at offset 1016
    conn_test.go:107: z:false, s:false, r:asis, n:65537 c:true: bad byte at offset 1016
    conn_test.go:107: z:false, s:false, r:asis, n:65537 c:false: bad byte at offset 1016
    conn_test.go:96: z:true, s:false, r:one, n:124 c:true: ReadFull() returned rbuf, flate: corrupt input before offset 4
    conn_test.go:96: z:true, s:false, r:one, n:124 c:false: ReadFull() returned rbuf, flate: corrupt input before offset 13
    conn_test.go:96: z:true, s:false, r:one, n:125 c:true: ReadFull() returned rbuf, flate: corrupt input before offset 5
    conn_test.go:96: z:true, s:false, r:one, n:125 c:false: ReadFull() returned rbuf, flate: corrupt input before offset 5
    conn_test.go:96: z:true, s:false, r:one, n:126 c:true: ReadFull() returned rbuf, flate: corrupt input before offset 1
    conn_test.go:96: z:true, s:false, r:one, n:126 c:false: ReadFull() returned rbuf, flate: corrupt input before offset 1
    conn_test.go:96: z:true, s:false, r:one, n:127 c:true: ReadFull() returned rbuf, flate: corrupt input before offset 5
    conn_test.go:96: z:true, s:false, r:one, n:127 c:false: ReadFull() returned rbuf, flate: corrupt input before offset 3
    conn_test.go:96: z:true, s:false, r:one, n:128 c:true: ReadFull() returned rbuf, flate: corrupt input before offset 3
    conn_test.go:96: z:true, s:false, r:one, n:128 c:false: ReadFull() returned rbuf, flate: corrupt input before offset 9
    conn_test.go:96: z:true, s:false, r:one, n:129 c:true: ReadFull() returned rbuf, flate: corrupt input before offset 5
    conn_test.go:96: z:true, s:false, r:one, n:129 c:false: ReadFull() returned rbuf, flate: corrupt input before offset 3
    conn_test.go:96: z:true, s:false, r:one, n:65534 c:true: ReadFull() returned rbuf, flate: corrupt input before offset 1
    conn_test.go:96: z:true, s:false, r:one, n:65534 c:false: ReadFull() returned rbuf, flate: corrupt input before offset 9
    conn_test.go:96: z:true, s:false, r:one, n:65535 c:true: ReadFull() returned rbuf, flate: corrupt input before offset 5
    conn_test.go:96: z:true, s:false, r:one, n:65535 c:false: ReadFull() returned rbuf, flate: corrupt input before offset 7
    conn_test.go:96: z:true, s:false, r:one, n:65536 c:true: ReadFull() returned rbuf, flate: corrupt input before offset 15
    conn_test.go:96: z:true, s:false, r:one, n:65536 c:false: ReadFull() returned rbuf, flate: corrupt input before offset 5
    conn_test.go:96: z:true, s:false, r:one, n:65537 c:true: ReadFull() returned rbuf, flate: corrupt input before offset 6
    conn_test.go:96: z:true, s:false, r:one, n:65537 c:false: ReadFull() returned rbuf, flate: corrupt input before offset 4
FAIL

@garyburd

See #169.


wI2L commented Oct 20, 2016

@garyburd Nice! Perf improvements are really good for 32-byte chunks and up.
I see you left the TODO about the bounds check. Do you think that is what causes TestFraming to fail?


garyburd commented Oct 20, 2016

I don't know what caused the test to fail.

I didn't use *(*uintptr)(unsafe.Pointer(&bw[i])) because I think the expression does a bounds check. I need to look at the generated code, but I have not had a chance to do so.

@garyburd

Fixed by 77f1107

Here's the code that I used to avoid the bounds check:

n := (len(b) / wordSize) * wordSize
for i := 0; i < n; i += wordSize {
    *(*uintptr)(unsafe.Pointer(uintptr(unsafe.Pointer(&b[0])) + uintptr(i))) ^= kw
}

gorilla locked and limited conversation to collaborators Feb 14, 2018