Performance bottleneck in function maskBytes #31
Comments
If it really is such a bottleneck then it might even be worth doing this on 16-byte chunks at a time. The intrinsics in the following: http://stackoverflow.com/questions/17742741/websocket-data-unmasking-multi-byte-xor can be transformed relatively easily into ASM (one intrinsic = one instruction). It could be conditionally compiled on x64 (which always has SSE2), falling back to your implementation on other architectures. For really good performance one would need a preamble and epilogue to align the buffer to a 16-byte boundary. Good find! EDIT: Since you did perf testing, do you know how gorilla/websocket stacks up to go.net on that front? EDIT2: An example of integrating assembly and Go can be found here: https://bitbucket.org/zombiezen/math3/src
Yes, I am still interested. It's possible to get a big performance gain without writing assembly. The basic idea is to mask a byte at a time to a word boundary, use the unsafe package to mask a word at a time, and finish by masking the individual bytes after the last whole word. See xor.go for a similar approach used in the standard library. Because the mask can be rotated, it's always possible to use aligned writes.
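The approach described above can be sketched roughly as follows. This is a minimal illustration, not the code that was eventually merged: the name `maskBytesWord`, the `(key, pos, b)` signature, and the small-buffer cutoff are assumptions made for the example.

```go
package main

import "unsafe"

// wordSize is the size of a machine word in bytes (4 or 8).
const wordSize = int(unsafe.Sizeof(uintptr(0)))

// maskBytesWord (hypothetical name) XORs b in place with the 4-byte key,
// starting at key offset pos, and returns the key offset for the next call.
func maskBytesWord(key [4]byte, pos int, b []byte) int {
	// For small buffers the setup cost is not worth it.
	if len(b) < 2*wordSize {
		for i := range b {
			b[i] ^= key[pos&3]
			pos++
		}
		return pos & 3
	}

	// Mask one byte at a time up to the first word boundary.
	if n := int(uintptr(unsafe.Pointer(&b[0])) % uintptr(wordSize)); n != 0 {
		n = wordSize - n
		for i := range b[:n] {
			b[i] ^= key[pos&3]
			pos++
		}
		b = b[n:]
	}

	// Build a word-sized key, rotated so that it starts at pos.
	// Filling it through a byte view keeps the result independent
	// of the host byte order.
	var kw uintptr
	kb := (*[wordSize]byte)(unsafe.Pointer(&kw))
	for i := range kb {
		kb[i] = key[(pos+i)&3]
	}

	// Mask whole, aligned words. wordSize is a multiple of 4,
	// so pos does not change across this loop.
	n := (len(b) / wordSize) * wordSize
	for i := 0; i < n; i += wordSize {
		*(*uintptr)(unsafe.Pointer(&b[i])) ^= kw
	}

	// Mask the bytes after the last whole word.
	b = b[n:]
	for i := range b {
		b[i] ^= key[pos&3]
		pos++
	}
	return pos & 3
}
```

Returning the next key offset lets a caller mask a message in several pieces without restarting the key.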
I benchmarked the current implementation against @ghjp's, and it is 8x faster on my machine (2.6 GHz Intel Core i7).
Doh! I should have refreshed my memory on what's in this issue before responding. How does it perform when the slice is not aligned on a word boundary?
@garyburd I am not sure how to simulate this in Go.
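One possible way to simulate this, sketched under assumptions: allocate a slightly larger buffer, find where the next 8-byte boundary falls, and slice from a chosen offset past it. The helper names `sliceAtOffset` and `alignmentOf` are hypothetical and do not come from the thread.

```go
package main

import "unsafe"

// alignmentOf reports how many bytes the start of b lies past
// the previous 8-byte boundary (0 means 8-byte aligned).
// b must be non-empty.
func alignmentOf(b []byte) int {
	return int(uintptr(unsafe.Pointer(&b[0])) % 8)
}

// sliceAtOffset returns a slice of length n (n > 0) whose first byte
// sits exactly offset bytes (0..7) past an 8-byte boundary.
func sliceAtOffset(n, offset int) []byte {
	// Over-allocate so there is room to shift the start by up to 14 bytes.
	buf := make([]byte, n+16)
	// Distance from buf[0] to the next 8-byte boundary (0 if already aligned).
	toBoundary := (8 - alignmentOf(buf)) % 8
	start := toBoundary + offset
	return buf[start : start+n]
}
```

A benchmark could then run the masking function on `sliceAtOffset(size, 0)` and on `sliceAtOffset(size, 1)` and compare the two results.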
Edit: see this test.
Performance decrease with an unaligned slice.
I tried to write the code with the help of the Go xor examples, but I don't know how to mask a whole word using the key or how to detect the first word boundary.
Here's a start on the function.
Tbh, I overestimated my skills for this task, so it seems to me you are basically doing this yourself, but I am happy to help a bit with the benchmarks, and at least I am learning from you.
See #169.
@garyburd Nice! The perf improvements are really good for chunks of 32 bytes and larger.
I don't know what caused the test to fail. I didn't use |
Fixed by 77f1107. Here's the code that I used to avoid the bounds check:
I have investigated the performance of the websocket communication. With the kernel "perf" utility I saw that the function maskBytes consumes a lot of CPU cycles, so I have optimized it. I now get about twice the data throughput (900 MB/s instead of 440 MB/s). At the moment it is optimized for 64-bit machines. I'm also not sure whether the optimization works on big-endian machines. The trick is to apply the mask on 64-bit chunks instead of bytes. Here it is:
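The code originally posted with this comment is not preserved here. As a rough illustration of the 64-bit chunk idea only, the sketch below uses `encoding/binary` with a fixed byte order so that the result does not depend on the host's endianness. That choice, the name `maskBytes64`, and the signature are assumptions for the example, not the author's implementation.

```go
package main

import "encoding/binary"

// maskBytes64 (hypothetical name) XORs b in place with the 4-byte key,
// starting at key offset pos, and returns the key offset for the next call.
func maskBytes64(key [4]byte, pos int, b []byte) int {
	// Repeat the key, rotated to start at pos, across 8 bytes.
	var kb [8]byte
	for i := range kb {
		kb[i] = key[(pos+i)&3]
	}
	k64 := binary.LittleEndian.Uint64(kb[:])

	// XOR 8 bytes at a time. Decoding and re-encoding with the same
	// byte order keeps byte j of each chunk paired with kb[j].
	i := 0
	for ; i+8 <= len(b); i += 8 {
		v := binary.LittleEndian.Uint64(b[i:])
		binary.LittleEndian.PutUint64(b[i:], v^k64)
	}

	// Handle the remaining 0..7 bytes one at a time.
	for ; i < len(b); i++ {
		b[i] ^= key[(pos+i)&3]
	}
	return (pos + len(b)) & 3
}
```

Because the chunk size is a multiple of the key length, the same 64-bit key value can be reused for every chunk.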