Description
Style changes needed to solve part of #823
Implementing "double-compute" is expected to make the hardware intrinsics code more efficient.
Details (mostly from @tannergooding)
- In `src\Microsoft.ML.CpuMath\SseIntrinsics.cs` and `src\Microsoft.ML.CpuMath\AvxIntrinsics.cs`, change the last loop of the existing 3-loop code pattern into the following:
  - Saving the stored result (`dstVector`) from the last iteration of the vectorized code
  - Moving `pDstCurrent` back such that `pDstCurrent + elementsPerIteration == pEnd`
  - Doing a single iteration for the remaining elements
  - Mix the saved result from the last iteration of the vectorized code with the result from the remaining elements
  - Write the result
This generally results in more performant code, depending on the exact algorithm and number of remaining elements
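
As a rough illustration of the steps above, here is a minimal sketch for an in-place scale operation. The class `OverlapTailSketch` and its `Scale` method are hypothetical names, not the existing CpuMath code, and the sketch assumes .NET Core 3.0 or later with unsafe code enabled. As a simplification, it reloads the overlapping window from memory (which already holds the stored lanes) instead of keeping `dstVector` in a register.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class OverlapTailSketch
{
    // Hypothetical helper: multiplies every element of dst by scale, in place.
    public static unsafe void Scale(float scale, Span<float> dst)
    {
        const int elementsPerIteration = 4; // one 128-bit vector of floats

        if (!Sse.IsSupported || dst.Length < elementsPerIteration)
        {
            // Software fallback for unsupported hardware or very short inputs.
            for (int i = 0; i < dst.Length; i++) dst[i] *= scale;
            return;
        }

        fixed (float* pDst = dst)
        {
            float* pDstCurrent = pDst;
            float* pEnd = pDst + dst.Length;
            Vector128<float> scaleVector = Vector128.Create(scale);

            // Main vectorized loop over full 128-bit blocks.
            while (pDstCurrent + elementsPerIteration <= pEnd)
            {
                Sse.Store(pDstCurrent, Sse.Multiply(Sse.LoadVector128(pDstCurrent), scaleVector));
                pDstCurrent += elementsPerIteration;
            }

            int remainder = (int)(pEnd - pDstCurrent);
            if (remainder == 0) return;

            // Move back so that pDstCurrent + elementsPerIteration == pEnd.
            pDstCurrent = pEnd - elementsPerIteration;

            // Lanes with index >= (4 - remainder) are the ones not yet processed.
            Vector128<float> newLanes = Sse.CompareGreaterThanOrEqual(
                Vector128.Create(0f, 1f, 2f, 3f),
                Vector128.Create((float)(elementsPerIteration - remainder)));

            // The leading lanes of this window were already scaled and stored.
            Vector128<float> stored = Sse.LoadVector128(pDstCurrent);
            Vector128<float> scaled = Sse.Multiply(stored, scaleVector);

            // Mix: keep the stored lanes, take the freshly scaled lanes for the rest.
            Vector128<float> mixed = Sse.Or(Sse.And(newLanes, scaled), Sse.AndNot(newLanes, stored));
            Sse.Store(pDstCurrent, mixed);
        }
    }
}
```

The mask is needed because the operation is in place: without it, the overlapping lanes would be scaled twice.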
- On handling unpadded parts in AVX intrinsics:
For some algorithms (like `Sum`), it is possible to “double-compute” a few elements at the beginning and end to get better overall performance. See the following pseudo-code:
```
if addr not aligned
    tmp = unaligned load from addr
    tmp &= mask which zeroes elements after the first aligned address
    result = tmp
    move addr forward to the first aligned address

while addr is aligned and remaining bits >= 128
    result += aligned load
    addr += 128-bits

if any remaining
    addr = endAddr - 128
    tmp = unaligned load from addr
    tmp &= mask which zeroes elements already processed
    result += tmp

Sum the elements in result (using "horizontal add" or "shuffle and add")
```
So, your overall algorithm will probably look like:
```csharp
if (Avx.IsSupported && (Length >= AvxLimit))
{
    // Process 256-bits, we have a limit since 256-bit
    // AVX instructions can cause a downclock in the CPU
    // Algorithm would be similar to the SSE pseudo-code
}
else if (Sse.IsSupported && (Length >= SseLimit))
{
    // Pseudo-code algorithm given above
    // 128-bit instructions operate at full frequency
    // and don't downclock the CPU, we can only use
    // them for more than 128-bits so we don't AV
}
else
{
    // Software Implementation
}
```
If you can’t “double-compute” for some reason, then you generally do the “software” processing for the beginning (to become aligned) and end (to catch stray elements).
- `AvxLimit` is generally a number that takes into account the “downclocking” that can occur for heavy 256-bit instruction usage
- `SseLimit` is generally 128-bits for algorithms where you can “double-compute” and some profiled number for other algorithms
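
To make the pseudo-code more concrete, here is a minimal sketch of the SSE branch for `Sum`. The class `DoubleComputeSumSketch`, the helper `SumScalar`, and the `SseLimit` value of 4 floats (128 bits) are assumptions for illustration, not the existing CpuMath code. The AVX branch is omitted, and the sketch assumes .NET Core 3.0 or later with unsafe code enabled.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class DoubleComputeSumSketch
{
    private const int SseLimit = 4; // assumed: 128 bits of float data

    // Software implementation, also used when the vector path does not apply.
    private static float SumScalar(ReadOnlySpan<float> src)
    {
        float total = 0;
        foreach (float value in src) total += value;
        return total;
    }

    public static unsafe float Sum(ReadOnlySpan<float> src)
    {
        if (!Sse.IsSupported || src.Length < SseLimit) return SumScalar(src);

        Vector128<float> laneIndex = Vector128.Create(0f, 1f, 2f, 3f);
        Vector128<float> result = Vector128<float>.Zero;

        fixed (float* pSrc = src)
        {
            float* pCurrent = pSrc;
            float* pEnd = pSrc + src.Length;

            int misalignedBytes = (int)((ulong)pCurrent % 16);
            // Assumption: data is at least float-aligned; otherwise use software.
            if (misalignedBytes % sizeof(float) != 0) return SumScalar(src);

            if (misalignedBytes != 0)
            {
                // Unaligned load, then zero the lanes at or after the first aligned address.
                int head = (16 - misalignedBytes) / sizeof(float);
                Vector128<float> keep = Sse.CompareLessThan(laneIndex, Vector128.Create((float)head));
                result = Sse.And(Sse.LoadVector128(pCurrent), keep);
                pCurrent += head;
            }

            // Aligned 128-bit loads while a full block remains.
            while (pCurrent + 4 <= pEnd)
            {
                result = Sse.Add(result, Sse.LoadAlignedVector128(pCurrent));
                pCurrent += 4;
            }

            int remainder = (int)(pEnd - pCurrent);
            if (remainder != 0)
            {
                // Load the last 128 bits and zero the lanes already processed.
                Vector128<float> keep = Sse.CompareGreaterThanOrEqual(
                    laneIndex, Vector128.Create((float)(4 - remainder)));
                result = Sse.Add(result, Sse.And(Sse.LoadVector128(pEnd - 4), keep));
            }
        }

        // "Shuffle and add": fold the upper half onto the lower half, then lane 1 onto lane 0.
        Vector128<float> pairs = Sse.Add(result, Sse.MoveHighToLow(result, result));
        Vector128<float> lane1 = Sse.Shuffle(pairs, pairs, 0x55);
        return Sse.AddScalar(pairs, lane1).ToScalar();
    }
}
```

Because the lanes are accumulated separately, the floating-point summation order differs from a sequential loop, so results may differ slightly for inputs that are not exactly representable.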
cc: @tannergooding since he suggested this approach.