-
Notifications
You must be signed in to change notification settings - Fork 5.2k
Description
TLDR; basically I think there are a bunch of small tweaks to the code which could make this faster, along with the 'bigger' tweaks (risks) of using unsafe code, and really baking little-endian assumptions into the implementation, which I am thinking of doing a pull request for - high level QN: does it sound useful?
I saw a thread today talking about System.Compression.Crc32Helper using relatively a lot of CPU. Idly I wondered how this could be possible, so I copied the code from this repo into a project, came up with a dumb synthetic benchmark where I did millions of CRC32 computation on an array of length 12 for a few seconds, and looked at what the VS 2017 profiler (for .net framework, AnyCPU, release mode) told me the line costs where.
What I found was that the hottest line was the inner statement of the latter loop:
for (int i = 0; i < endBytes; i++)
{
crc32 = s_crcTable_0[(crc32 ^ buffer[offset++]) & 0x000000FF] ^ (crc32 >> 8);
}
Which for me was compiled (JITted?) as native code:
for (int i = 0; i < endBytes; i++)
00E108A8 cmp dword ptr [ebp-14h],0
00E108AC jle 00E108EF
00E108AE mov eax,dword ptr [ebp-28h]
00E108B1 mov esi,dword ptr [eax+4]
{
crc32 = s_crcTable_0[(crc32 ^ buffer[offset++]) & 0x000000FF] ^ (crc32 >> 8);
00E108B4 mov dword ptr [ebp-20h],edi
00E108B7 inc edi
00E108B8 mov eax,dword ptr ds:[03CE353Ch]
00E108BD mov edx,dword ptr [ebp-20h]
00E108C0 mov ecx,dword ptr [ebp-28h]
00E108C3 cmp edx,esi
00E108C5 jae 00E108FE
00E108C7 movzx edx,byte ptr [ecx+edx+8]
00E108CC xor edx,ebx
00E108CE and edx,0FFh
00E108D4 cmp edx,dword ptr [eax+4]
00E108D7 jae 00E108FE
00E108D9 mov eax,dword ptr [eax+edx*4+8]
00E108DD shr ebx,8
00E108E0 xor eax,ebx
00E108E2 mov ebx,eax
for (int i = 0; i < endBytes; i++)
00E108E4 inc dword ptr [ebp-1Ch]
00E108E7 mov eax,dword ptr [ebp-1Ch]
00E108EA cmp eax,dword ptr [ebp-14h]
00E108ED jl 00E108B4
}
The thing that jumped out at me first while looking at this was the 'jae' checks, which I guessed might be array bound checks! Using 'unsafe' should make this faster right? (maybe)
So I tried it out. I didn't notice an obviously significant change from that, but while trying to understand the disassembly further, I realized that the way 'slice by 8' was being implemented by JIT or whoever did the assembler here was not very optimal. It was literally reading 1 byte at a time from the array and zero extending, then xoring all the zero extended words, just to read (and XOR) all the asserted-as-little-endian bytes into the uint32 variable 'crc32?' I think no C or assembly programmer would do it that way...
crc32 ^= unchecked((uint)(buffptr[offset] | buffptr[offset + 1] << 8 | buffptr[offset + 2] << 16 | buffptr[offset + 3] << 24));
01AE0748 movzx eax,byte ptr [ebx+esi]
01AE074C movzx edx,byte ptr [ebx+esi+1]
01AE0751 shl edx,8
01AE0754 or eax,edx
01AE0756 movzx edx,byte ptr [ebx+esi+2]
01AE075B shl edx,10h
01AE075E or eax,edx
01AE0760 movzx edx,byte ptr [ebx+esi+3]
01AE0765 shl edx,18h
01AE0768 or eax,edx
01AE076A xor edi,eax
So I switched that line out with crc32 ^= ((uint*)buffptr)[offset / 4]'
and it came out with something that looked much more sensible to me, assuming I introduced no bug
00DA0748 mov eax,edi
00DA074A test eax,eax
00DA074C jns 00DA0751
00DA074E add eax,3
00DA0751 sar eax,2
00DA0754 xor esi,dword ptr [ebx+eax*4]
this seems to shave several percents off how many CPU samples I was spending on those particular lines, which were hot lines dotnet/corefx#2 and dotnet/corefx#3 for my scenario. And happily, these are the lines which are expected to be in the 'hot path' when processing nice big buffer inputs...
There was still some sorta 'crazy looking' inverse operations happening with multiplying and dividing (using right shift) by 4 or 8 to get array offsets which should be able to be canceled out by better manual assembly code... and the fact that both 'offset' and 'i' are getting incremented each time around the loop by 8 and 1 each, is a bit odd, you should only really need one loop counter....