diff --git a/impeller/README.md b/impeller/README.md
index fae3621d719ea..637a2a87c8054 100644
--- a/impeller/README.md
+++ b/impeller/README.md
@@ -185,3 +185,4 @@ To your `AndroidManifest.xml` file, add under the `` tag:
 * [Learning to Read GPU Frame Captures](docs/read_frame_captures.md)
 * [How to Enable Metal Validation for Command Line Apps.](docs/metal_validation.md)
 * [How Impeller Works Around The Lack of Uniform Buffers in Open GL ES 2.0.](docs/ubo_gles2.md)
+* [Guidance for writing efficient shaders](docs/shader_optimization.md)
diff --git a/impeller/docs/shader_optimization.md b/impeller/docs/shader_optimization.md
new file mode 100644
index 0000000000000..397d4787f3f66
--- /dev/null
+++ b/impeller/docs/shader_optimization.md
@@ -0,0 +1,289 @@
+# Writing efficient shaders
+
+When it comes to optimizing shaders for a wide range of devices, there is no
+perfect strategy. The reality of different drivers written by different vendors
+targeting different hardware is that they will vary in behavior. Any attempt at
+optimizing against a specific driver will likely result in a performance loss
+for some other drivers that end users will run Flutter apps against.
+
+That being said, newer graphics devices have architectures that allow for both
+simpler shader compilation and better handling of traditionally slow shader
+code. In fact, ostensibly "unoptimized" shader code filled with branches may
+significantly outperform the equivalent branchless optimized shader code when
+targeting newer GPU architectures. (See the "Don't flatten simple varying
+branches" recommendation for an explanation of this with respect to different
+architectures).
+
+Flutter actively supports mobile devices that are more than a decade old, which
+requires us to write shaders that perform well across multiple generations of
+GPU architectures featuring radically different behavior.
+Most optimization
+choices are direct tradeoffs between these GPU architectures, and so having an
+accurate mental model for how these common architectures maximize parallelism is
+essential for making good decisions while authoring shaders.
+
+For these reasons, it's also important to profile shaders against some of the
+older devices that Flutter can target (such as the iPhone 6s) when making
+changes intended to improve shader performance.
+
+Also, even though the branching behavior is largely architecture dependent and
+should remain the same when using different graphics APIs, it's still also a
+good idea to test changes against the different backends supported by Impeller
+(Metal and GLES). Early stage shader compilation (as well as the high level
+shader code generated by ImpellerC) may vary quite a bit between APIs.
+
+## GPU architecture primer
+
+GPUs are designed to have functional units running single instructions over many
+elements (the "data path") each clock cycle. This is the fundamental aspect of
+GPUs that makes them work well for massively parallel compute work; they're
+essentially specialized SIMD engines.
+
+GPU parallelism generally comes in two broad architectural flavors:
+**Instruction-level parallelism** and **Thread-level parallelism** -- these
+architecture designs handle shader branching very differently and are covered
+in the sections below. In general, older GPU architectures (on some products
+released before ~2015) leverage instruction-level parallelism, while most if not
+all newer GPUs leverage thread-level parallelism.
+
+Some of the earliest GPU architectures had no runtime control flow primitives at
+all (i.e. jump instructions), and compilers for these architectures needed to
+handle branches ahead of time by unrolling loops, compiling a different program
+for every possible branch combination, and then executing all of them.
+However,
+virtually all GPU architectures in use today have instruction-level support for
+dynamic branching, and it's quite unlikely that we'll come across a mobile
+device capable of running Flutter that doesn't. For example, the old devices we
+test against in CI (iPhone 6s and Moto G4) run GPUs that support dynamic
+runtime branching. For these reasons, the optimization advice in this document
+isn't aimed at branchless architectures.
+
+### Instruction-level parallelism
+
+Some older GPUs (including the PowerVR GT7600 GPU on the iPhone 6s SoC) rely on
+SIMD vector or array instructions to maximize the number of computations
+performed per clock cycle on each functional unit. This means that the shader
+compiler must figure out which parts of the program are safe to parallelize
+ahead of time and emit appropriate instructions. This presents a problem for
+certain kinds of branches: If the compiler doesn't know that the same decision
+will always be taken by all of the data lanes at runtime (meaning the branch is
+_varying_), it can't safely emit SIMD instructions while compiling the branch.
+The result is that instructions within non-uniform branches run at roughly
+`1/[data width]` of the throughput of non-branched instructions because they
+can't be parallelized.
+
+VLIW ("Very Long Instruction Word") is another common instruction-level
+parallelism design that suffers from the same compile time reasoning
+disadvantage that SIMD does.
+
+### Thread-level parallelism
+
+Newer GPUs (but also some older hardware such as the Adreno 306 GPU found on the
+Moto G4's Snapdragon SoC) use scalar functional units (no SIMD/VLIW/MIMD) and
+parallelize instructions at runtime by running the same instruction over many
+threads in groups often referred to as "warps" (Nvidia terminology) or
+"wavefronts" (AMD terminology), usually consisting of 32 or 64 threads per
+warp/wavefront.
+This design is also commonly referred to as SIMT ("Single
+Instruction Multiple Thread").
+
+To handle branching, SIMT programs use special instructions to write a thread
+mask that determines which threads are activated/deactivated in the warp; only
+the warp's activated threads will actually execute instructions. Given this
+setup, the program can first deactivate threads that failed the branch
+condition, run the positive path, invert the mask, run the negative path, and
+finally restore the mask to its original state prior to the branch. The compiler
+may also insert mask checks to skip over branches when all of the threads have
+been deactivated.
+
+Therefore, the best case scenario for a SIMT branch is that it only incurs the
+cost of the conditional. The worst case scenario is that some of the warp's
+threads fail the conditional and the rest succeed, requiring the program to
+execute both paths of the branch back-to-back in the warp. Note that this
+compares very favorably to the SIMD scenario with non-uniform/varying branches,
+as SIMT is able to retain significant parallelism in all cases, whereas SIMD
+cannot.
+
+## Recommendations
+
+### Don't flatten uniform or constant branches
+
+Uniforms are pipeline variables accessible within a shader which are guaranteed
+to not vary during a GPU program's invocation.
+
+Example of a uniform branch in action:
+
+```glsl
+uniform struct FrameInfo {
+  mat4 mvp;
+  bool invert_y;
+} frame_info;
+
+in vec2 position;
+
+void main() {
+  gl_Position = frame_info.mvp * vec4(position, 0, 1);
+
+  if (frame_info.invert_y) {
+    gl_Position *= vec4(1, -1, 1, 1);
+  }
+}
+```
+
+While it's true that driver stacks have the opportunity to generate multiple
+pipeline variants ahead of time to handle these branches, this advanced
+functionality isn't actually necessary to achieve good runtime performance
+of uniform branches on widely used mobile architectures:
+* On SIMT architectures, branching on a uniform means that every thread in every
+  warp will resolve to the same path, so only one path in the branch will ever
+  execute.
+* On VLIW/SIMD architectures, the compiler can be certain that all of the
+  elements in the data path for every functional unit will resolve to the same
+  path, and so it can safely emit fully parallelized instructions for the
+  contents of the branch!
+
+### Don't flatten simple varying branches
+
+Widely used mobile GPU architectures generally don't benefit from flattening
+simple varying branches. While it's true that compilers for VLIW/SIMD-based
+architectures can't emit efficient instructions for these branches, the
+detrimental effects of this are minimal with small branches. For modern SIMT
+architectures, flattened branches can actually perform measurably worse than
+straightforward branch solutions. Also, some shader compilers can collapse
+small branches automatically.
+
+Instead of this:
+
+```glsl
+vec3 ColorBurn(vec3 dst, vec3 src) {
+  vec3 color = 1 - min(vec3(1), (1 - dst) / src);
+  color = mix(color, vec3(1), 1 - abs(sign(dst - 1)));
+  color = mix(color, vec3(0), 1 - abs(sign(src - 0)));
+  return color;
+}
+```
+
+...just do this:
+
+```glsl
+vec3 ColorBurn(vec3 dst, vec3 src) {
+  vec3 color = 1 - min(vec3(1), (1 - dst) / src);
+  if (1 - dst.r < kEhCloseEnough) {
+    color.r = 1;
+  }
+  if (1 - dst.g < kEhCloseEnough) {
+    color.g = 1;
+  }
+  if (1 - dst.b < kEhCloseEnough) {
+    color.b = 1;
+  }
+  if (src.r < kEhCloseEnough) {
+    color.r = 0;
+  }
+  if (src.g < kEhCloseEnough) {
+    color.g = 0;
+  }
+  if (src.b < kEhCloseEnough) {
+    color.b = 0;
+  }
+  return color;
+}
+```
+
+It's easier to understand, doesn't prevent compiler optimizations, runs
+measurably faster on SIMT devices, and works out to be at most marginally slower
+on older VLIW devices.
+
+### Avoid complex varying branches
+
+Consider the following fragment shader:
+
+```glsl
+in vec4 color;
+out vec4 frag_color;
+
+void main() {
+  vec4 result;
+
+  if (color.a == 0) {
+    result = vec4(0);
+  } else {
+    result = DoExtremelyExpensiveThing(color);
+  }
+
+  frag_color = result;
+}
+```
+
+Note that `color` is _varying_. Specifically, it's an interpolated output from a
+vertex shader -- so the value may change from fragment to fragment (as opposed
+to a _uniform_ or _constant_, which will remain the same for the whole draw
+call).
+
+On SIMT architectures, this branch incurs very little overhead because
+`DoExtremelyExpensiveThing` will be skipped over if `color.a == 0` across all
+the threads in a given warp.
+However, architectures that use instruction-level parallelism (VLIW or SIMD)
+can't handle this branch efficiently because the compiler can't safely emit
+parallelized instructions on either side of the branch.
+
+To achieve maximum parallelism across all of these architectures, one possible
+solution is to unbranch the more complex path:
+
+```glsl
+in vec4 color;
+out vec4 frag_color;
+
+void main() {
+  frag_color = DoExtremelyExpensiveThing(color);
+
+  if (color.a == 0) {
+    frag_color = vec4(0);
+  }
+}
+```
+
+However, this may be a big tradeoff depending on how this shader is used -- this
+solution will perform worse on SIMT devices in cases where `color.a == 0` across
+all threads in a given warp, since `DoExtremelyExpensiveThing` will no longer be
+skipped with this solution! So if the cheap branch path covers a large solid
+portion of a draw call's coverage area, alternative designs may be favorable.
+
+### Beware of return branching
+
+Consider the following GLSL function:
+```glsl
+vec4 FrobnicateColor(vec4 color) {
+  if (color.a == 0) {
+    return vec4(0);
+  }
+
+  return DoExtremelyExpensiveThing(color);
+}
+```
+
+At first glance, this may appear cheap due to its simple contents, but this
+branch has two exclusive paths in practice, and the generated shader assembly
+will reflect the same behavior as this code:
+
+```glsl
+vec4 FrobnicateColor(vec4 color) {
+  vec4 result;
+
+  if (color.a == 0) {
+    result = vec4(0);
+  } else {
+    result = DoExtremelyExpensiveThing(color);
+  }
+
+  return result;
+}
+```
+
+The same concerns and advice apply to this branch as the scenario under "Avoid
+complex varying branches".
+
+### Use lower precision whenever possible
+
+Most desktop GPUs don't support 16 bit (mediump) or 8 bit (lowp) floating point
+operations. But many mobile GPUs (such as the Qualcomm Adreno series) do, and
+according to the
+[Adreno documentation](https://developer.qualcomm.com/sites/default/files/docs/adreno-gpu/developer-guide/gpu/best_practices_shaders.html#use-medium-precision-where-possible),
+using lower precision floating point operations is more efficient on these
+devices.
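+
+As a minimal sketch of what this can look like, assuming a GLSL ES 3.00 target,
+the fragment shader below sets `mediump` as the default float precision and
+opts back into `highp` only for texture coordinates, where the extra range is
+more likely to matter. The names `texture_sampler`, `alpha`, and
+`v_texture_coords` are hypothetical and are not taken from Impeller:
+
+```glsl
+#version 300 es
+
+// Default precision for every float-based type that isn't explicitly
+// qualified below.
+precision mediump float;
+
+uniform sampler2D texture_sampler;  // Hypothetical name.
+uniform float alpha;                // Hypothetical name; mediump by default.
+
+// Texture coordinates are kept at highp, since reduced precision here can
+// show up as sampling artifacts on large textures.
+in highp vec2 v_texture_coords;     // Hypothetical name.
+
+out vec4 frag_color;                // mediump by default.
+
+void main() {
+  frag_color = texture(texture_sampler, v_texture_coords) * alpha;
+}
+```
+
+Whether the reduced precision is acceptable depends on the value ranges
+involved, so it's worth checking for visual artifacts on a device that actually
+implements `mediump` at 16 bits.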