Remove netfx-specific code from projects that no longer build in netfx configurations #27114

@TimLovellSmith

Description

TL;DR: I think there are a bunch of small tweaks that could make this code faster, along with 'bigger' (riskier) tweaks like using unsafe code and really baking little-endian assumptions into the implementation, which I am thinking of doing a pull request for. High-level question: does that sound useful?

I saw a thread today talking about System.Compression.Crc32Helper using a relatively large amount of CPU. Idly wondering how this could be possible, I copied the code from this repo into a project, came up with a dumb synthetic benchmark that did millions of CRC32 computations on an array of length 12 for a few seconds, and looked at what the VS 2017 profiler (.NET Framework, AnyCPU, release mode) told me the line costs were.
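
Roughly, the benchmark was along these lines (a minimal sketch with names and iteration count of my own choosing; Crc32Helper.UpdateCrc32 stands for the helper code copied out of this repo, and its (uint, byte[], int, int) shape is an assumption here):

    using System;
    using System.Diagnostics;

    internal static class Crc32Benchmark
    {
        private static void Main()
        {
            // 12-byte input, matching the synthetic scenario described above.
            byte[] buffer = new byte[12];
            new Random(42).NextBytes(buffer);

            const int iterations = 10_000_000; // arbitrary; just enough to run for a few seconds
            uint crc = 0;

            var sw = Stopwatch.StartNew();
            for (int i = 0; i < iterations; i++)
            {
                // The helper copied from the repo; assumed signature, see note above.
                crc = Crc32Helper.UpdateCrc32(crc, buffer, 0, buffer.Length);
            }
            sw.Stop();

            Console.WriteLine($"crc=0x{crc:X8}, {iterations} iterations in {sw.ElapsedMilliseconds} ms");
        }
    }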

What I found was that the hottest line was the inner statement of the latter, byte-at-a-time loop:

            for (int i = 0; i < endBytes; i++)
            {
                crc32 = s_crcTable_0[(crc32 ^ buffer[offset++]) & 0x000000FF] ^ (crc32 >> 8);
            }

which for me was JIT-compiled into the following native code:

           for (int i = 0; i < endBytes; i++)
00E108A8  cmp         dword ptr [ebp-14h],0  
00E108AC  jle         00E108EF  
00E108AE  mov         eax,dword ptr [ebp-28h]  
00E108B1  mov         esi,dword ptr [eax+4]  
            {
                crc32 = s_crcTable_0[(crc32 ^ buffer[offset++]) & 0x000000FF] ^ (crc32 >> 8);
00E108B4  mov         dword ptr [ebp-20h],edi  
00E108B7  inc         edi  
00E108B8  mov         eax,dword ptr ds:[03CE353Ch]  
00E108BD  mov         edx,dword ptr [ebp-20h]  
00E108C0  mov         ecx,dword ptr [ebp-28h]  
00E108C3  cmp         edx,esi  
00E108C5  jae         00E108FE  
00E108C7  movzx       edx,byte ptr [ecx+edx+8]  
00E108CC  xor         edx,ebx  
00E108CE  and         edx,0FFh  
00E108D4  cmp         edx,dword ptr [eax+4]  
00E108D7  jae         00E108FE  
00E108D9  mov         eax,dword ptr [eax+edx*4+8]  
00E108DD  shr         ebx,8  
00E108E0  xor         eax,ebx  
00E108E2  mov         ebx,eax  
            for (int i = 0; i < endBytes; i++)
00E108E4  inc         dword ptr [ebp-1Ch]  
00E108E7  mov         eax,dword ptr [ebp-1Ch]  
00E108EA  cmp         eax,dword ptr [ebp-14h]  
00E108ED  jl          00E108B4  
            }

The thing that jumped out at me first while looking at this was the 'jae' checks, which I guessed might be array bounds checks! Using 'unsafe' should make this faster, right? (Maybe.)
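
Concretely, the kind of rewrite I mean is the usual fixed/unsafe pattern sketched below (the idea, not the exact diff; the parameter names are mine). Pinning the buffer and the table and indexing through raw pointers gives the JIT nothing to bounds-check, and should also avoid reloading the static table field every iteration (the mov eax,dword ptr ds:[...] in the listing above):

    // Sketch: the same tail loop, but through pinned pointers so no array bounds checks
    // (the 'jae' branches above) are emitted. Requires compiling with /unsafe.
    private static unsafe uint UpdateTailUnsafe(uint crc32, byte[] buffer, int offset, int endBytes, uint[] table)
    {
        fixed (byte* buffptr = buffer)
        fixed (uint* tableptr = table)
        {
            for (int i = 0; i < endBytes; i++)
            {
                crc32 = tableptr[(crc32 ^ buffptr[offset++]) & 0xFF] ^ (crc32 >> 8);
            }
        }
        return crc32;
    }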

So I tried that out. I didn't notice an obviously significant change from it alone, but while trying to understand the disassembly further, I realized that the way 'slice by 8' was being compiled here was not very optimal. The generated code was literally reading one byte at a time from the array, zero-extending it, shifting, and OR-ing the pieces together, just to get four asserted-as-little-endian bytes XORed into the uint32 variable 'crc32'. I don't think any C or assembly programmer would write it that way...

                        crc32 ^= unchecked((uint)(buffptr[offset] | buffptr[offset + 1] << 8 | buffptr[offset + 2] << 16 | buffptr[offset + 3] << 24));
01AE0748  movzx       eax,byte ptr [ebx+esi]  
01AE074C  movzx       edx,byte ptr [ebx+esi+1]  
01AE0751  shl         edx,8  
01AE0754  or          eax,edx  
01AE0756  movzx       edx,byte ptr [ebx+esi+2]  
01AE075B  shl         edx,10h  
01AE075E  or          eax,edx  
01AE0760  movzx       edx,byte ptr [ebx+esi+3]  
01AE0765  shl         edx,18h  
01AE0768  or          eax,edx  
01AE076A  xor         edi,eax  

So I switched that line out for crc32 ^= ((uint*)buffptr)[offset / 4];

and it came out as something that looked much more sensible to me (assuming I introduced no bug):

00DA0748  mov         eax,edi  
00DA074A  test        eax,eax  
00DA074C  jns         00DA0751  
00DA074E  add         eax,3  
00DA0751  sar         eax,2  
00DA0754  xor         esi,dword ptr [ebx+eax*4]  

This seems to shave several percent off the CPU samples spent on those particular lines, which were the #2 and #3 hottest lines in my scenario. And happily, these are the lines expected to be on the hot path when processing nice big buffer inputs...

There were still some rather 'crazy looking' inverse operations happening, multiplying and then dividing (via arithmetic shift right) by 4 or 8 to compute array offsets, which better manual assembly should be able to cancel out... and the fact that both 'offset' and 'i' get incremented each time around the loop (by 8 and 1 respectively) is a bit odd; you should really only need one loop counter...
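
Putting the direct 32-bit reads and the single-counter idea together, a slicing-by-8 main loop might look like the sketch below. To be clear, this is a textbook slicing-by-8 reconstruction and not the repo's code: it assumes a little-endian machine, assumes reading 4 bytes at a time at buffptr + offset is fine (i.e. any alignment handling has already happened), leaves the ~crc pre/post inversion and any unaligned head/tail bytes to the caller, and uses a jagged s_tables array instead of the repo's s_crcTable_0..7 fields. Reading through an advancing uint* also sidesteps the test/add/sar sequence that the signed offset / 4 indexing produced above.

    // Textbook slicing-by-8 (reflected CRC-32, polynomial 0xEDB88320), little-endian only.
    private static readonly uint[][] s_tables = BuildSlicingTables();

    private static uint[][] BuildSlicingTables()
    {
        var tables = new uint[8][];
        tables[0] = new uint[256];
        for (uint n = 0; n < 256; n++)
        {
            uint c = n;
            for (int k = 0; k < 8; k++)
                c = (c & 1) != 0 ? 0xEDB88320u ^ (c >> 1) : c >> 1;
            tables[0][n] = c;
        }
        // tables[k][n] = CRC of byte value n followed by k zero bytes.
        for (int t = 1; t < 8; t++)
        {
            tables[t] = new uint[256];
            for (int n = 0; n < 256; n++)
                tables[t][n] = (tables[t - 1][n] >> 8) ^ tables[0][tables[t - 1][n] & 0xFF];
        }
        return tables;
    }

    // blockCount = number of whole 8-byte blocks to process.
    private static unsafe uint UpdateSliceBy8(uint crc32, byte* buffptr, int offset, int blockCount)
    {
        uint[][] t = s_tables;
        uint* p = (uint*)(buffptr + offset);   // one pointer, advanced 8 bytes per iteration
        for (int i = 0; i < blockCount; i++)   // one loop counter
        {
            uint lo = crc32 ^ p[0];            // first 4 input bytes folded into the running CRC
            uint hi = p[1];                    // next 4 input bytes
            p += 2;
            crc32 = t[7][lo & 0xFF] ^ t[6][(lo >> 8) & 0xFF] ^ t[5][(lo >> 16) & 0xFF] ^ t[4][lo >> 24]
                  ^ t[3][hi & 0xFF] ^ t[2][(hi >> 8) & 0xFF] ^ t[1][(hi >> 16) & 0xFF] ^ t[0][hi >> 24];
        }
        return crc32;
    }

(A real change would probably also want to pin the tables, as in the unsafe sketch earlier, so the eight table lookups per block don't carry bounds checks either.)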

Labels: area-System.IO.Compression, enhancement (Product code improvement that does NOT require public API changes/additions)