From 0b5f7910078880354850ddb9bffb12735bc8278c Mon Sep 17 00:00:00 2001 From: Brandon DeRosier Date: Wed, 13 Jul 2022 04:35:03 -0700 Subject: [PATCH 1/3] Add shader optimization doc --- impeller/docs/shader_optimization.md | 280 +++++++++++++++++++++++++++ 1 file changed, 280 insertions(+) create mode 100644 impeller/docs/shader_optimization.md diff --git a/impeller/docs/shader_optimization.md b/impeller/docs/shader_optimization.md new file mode 100644 index 0000000000000..53cc23c17eed2 --- /dev/null +++ b/impeller/docs/shader_optimization.md @@ -0,0 +1,280 @@ +# Writing efficient shaders + +When it comes to optimizing shaders for a wide range of devices, there is no +perfect strategy. The reality of different drivers written by different vendors +targeting different hardware is that they will vary in behavior. Any attempt at +optimizing against a specific driver will likely result in a performance loss +for some other drivers that end users will run Flutter apps against. + +That being said, newer graphics devices have architectures that allow for both +simpler shader compilation and better handling of traditionally slow shader +code. In fact, straight forward "unoptimized" shader code filled with branches +may significantly outperform the equivalent branchless optimized shader code +when targeting newer GPU architectures. + +Flutter actively supports devices that are more than a decade old, which +requires us to write shaders that perform well across multiple generations of +GPU architectures featuring radically different behavior. Most optimization +choices are direct tradeoffs between GPU architectures, and having an accurate +mental model for how these common architectures maximize parallelism is +essential for making good tradeoff decisions while writing shaders. + +For these reasons, it's also important to profile shaders against some of the +older devices that Flutter can target (such as the iPhone 4s) when making +changes intended to improve shader performance. + +## GPU architecture primer + +GPUs are designed to have functional units running single instructions over many +elements (the "data path") each clock cycle. This is the fundamental aspect of +GPUs that makes them work well for massively parallel compute work; they're +essentially specialized SIMD engines. + +GPU parallelism generally comes in two broad architectural flavors: +**Instruction-level parallelism** and **Thread-level parallelism** -- these +architecture designs handle shader branching very differently and are covered +in great detail in sections below. In general, older GPU architectures (on some +products released before ~2015) leverage instruction-level parallelism, while +most if not all newer GPUs leverage thread-level parallelism. + +Some of the earliest GPU architectures had no runtime control flow primitives at +all (i.e. jump instructions), and compilers for these architectures needed to +handle branches ahead of time by unrolling loops, compiling a different program +for every possible branch combination, and then executing all of them. However, +virtually all GPU architectures in use today have instruction-level support for +dynamic branching, and it's quite unlikely that we'll come across a mobile +device capable of running Flutter that doesn't. For example, the oldest devices +we test against in CI (iPhone 4s and Moto G4) run GPUs that support dynamic +runtime branching. For these reasons, the optimization advice in this document +isn't aimed at branchless architectures. 
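+
+As a rough illustration of what handling a branch "ahead of time" can look
+like, a varying branch can be flattened into branch-free arithmetic by
+computing both results and selecting between them. This is a minimal sketch
+for illustration only, not taken from any particular compiler or from
+Impeller's own shaders:
+
+```glsl
+// Branched form: control flow picks one of two results.
+float Classify(float x) {
+  if (x >= 0.5) {
+    return x * 2.0;
+  }
+  return x;
+}
+
+// Flattened form: step() evaluates to 0.0 or 1.0, and mix() selects between
+// the two results, so both sides are always computed but no branch is taken.
+float ClassifyFlattened(float x) {
+  return mix(x, x * 2.0, step(0.5, x));
+}
+```
+
+The recommendations below discuss when this kind of flattening helps and when
+it hurts on each architecture.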
+
+### Instruction-level parallelism
+
+Some older GPUs (including the PowerVR SGX543MP2 GPU on the iPhone 4s SOC) rely
+on SIMD vector or array instructions to maximize the number of computations
+performed per clock cycle on each functional unit. This means that the shader
+compiler must figure out which parts of the program are safe to parallelize
+ahead of time and emit appropriate instructions. This presents a problem for
+certain kinds of branches: If the compiler doesn't know that the same decision
+will always be taken by all of the data lanes at runtime (meaning the branch is
+_varying_), it can't safely emit SIMD instructions while compiling the branch.
+The result is that instructions within non-uniform branches incur a
+`1/[data width]` performance penalty when compared to non-branched instructions
+because they can't be parallelized.
+
+VLIW ("Very Long Instruction Width") is another common instruction-level
+parallelism design that suffers from the same compile time reasoning
+disadvantage that SIMD does.
+
+### Thread-level parallelism
+
+Newer GPUs (but also some older hardware such as the Adreno 306 GPU found on the
+Moto G4's Snapdragon SOC) use scalar functional units (no SIMD/VLIW/MIMD) and
+parallelize instructions at runtime by running the same instruction over many
+threads in groups often referred to as "warps" (Nvidia terminology) or
+"wavefronts" (AMD terminology), usually consisting of 32 or 64 threads per
+warp/wavefront. This design is also commonly referred to as SIMT ("Single
+Instruction Multiple Thread").
+
+To handle branching, SIMT programs use special instructions to write a thread
+mask that determines which threads are activated/deactivated in the warp; only
+the warp's activated threads will actually execute instructions. Given this
+setup, the program can first deactivate threads that failed the branch
+condition, run the positive path, invert the mask, run the negative path, and
+finally restore the mask to its original state prior to the branch. The compiler
+may also insert mask checks to skip over branches when all of the threads have
+been deactivated.
+
+Therefore, the best case scenario for a SIMT branch is that it only incurs the
+cost of the conditional. The worst case scenario is that some of the warp's
+threads fail the conditional and the rest succeed, requiring the program to
+execute both paths of the branch back-to-back in the warp. Note that this is
+far more favorable than the SIMD scenario with non-uniform/varying branches, as
+SIMT is able to retain significant parallelism in all cases, whereas SIMD cannot.
+
+## Recommendations
+
+### Don't flatten uniform or constant branches
+
+Uniforms are pipeline variables accessible within a shader which are guaranteed
+not to vary during a GPU program's invocation.
+
+Example of a uniform branch in action:
+
+```glsl
+uniform struct FrameInfo {
+  mat4 mvp;
+  bool invert_y;
+} frame_info;
+
+in vec2 position;
+
+void main() {
+  gl_Position = mvp * vec4(position, 0, 1)
+  if (invert_y) {
+    gl_Position *= vec2(1, -1);
+  }
+}
+```
+
+While it's true that driver stacks have the opportunity to generate multiple
+pipeline variants ahead of time to handle these branches, this advanced
+functionality isn't actually necessary to achieve good runtime performance of
+uniform branches on widely used mobile architectures:
+* On SIMT architectures, branching on a uniform means that every thread in every
+  warp will resolve to the same path, so only one path in the branch will ever
+  execute.
+* On VLIW/SIMD architectures, the compiler can be certain that all of the + elements in the data path for every functional unit will resolve to the same + path, and so it can safely emit fully parallelized instructions for the + contents of the branch! + +### Don't flatten simple varying branches + +Widely used mobile GPU architectures generally don't benefit from flattening +simple varying branches. While it's true that compilers for VLIW/SIMD-based +architectures can't emit efficient instructions for these branches, the +detrimental effects of this are minimal with small branches. For modern SIMT +architectures, flattened branches can actually perform measurably worse than +straight forward branch solutions. Also, some shader compilers can collapse +small branches automatically. + +Instead of this: + +```glsl +vec3 ColorBurn(vec3 dst, vec3 src) { + vec3 color = 1 - min(vec3(1), (1 - dst) / src); + color = mix(color, vec3(1), 1 - abs(sign(dst - 1))); + color = mix(color, vec3(0), 1 - abs(sign(src - 0))); + return color; +} +``` + +...just do this: + +```glsl +vec3 ColorBurn(vec3 dst, vec3 src) { + vec3 color = 1 - min(vec3(1), (1 - dst) / src); + if (1 - dst.r < kEhCloseEnough) { + color.r = 1; + } + if (1 - dst.g < kEhCloseEnough) { + color.g = 1; + } + if (1 - dst.b < kEhCloseEnough) { + color.b = 1; + } + if (src.r < kEhCloseEnough) { + color.r = 0; + } + if (src.g < kEhCloseEnough) { + color.g = 0; + } + if (src.b < kEhCloseEnough) { + color.b = 0; + } + return color; +} +``` + +It's easier to understand, doesn't prevent compiler optimizations, runs +measurably faster on SIMT devices, and works out to be at most marginally slower +on older VLIW devices. + +### Avoid complex varying branches + +Consider the following fragment shader: + +```glsl +in vec4 color; +out vec4 frag_color; + +void main() { + vec4 result; + + if (color.a == 0) { + result = vec4(0); + } else { + result = DoExtremelyExpensiveThing(color); + } + + frag_color = result; +} +``` + +Note that `color` is _varying_. Specifically, it's an interpolated output from a +vertex shader -- so the value may change from fragment to fragment (as opposed +to a _uniform_ or _constant_, which will remain the same for the whole draw +call). + +On SIMT architectures, this branch incurs very little overhead because, and +`DoExtremelyExpensiveThing` will be skipped over if `color.a == 0` across all +the threads in a given warp. +However, architectures that use instruction-level parallelism (VLIW or SIMD) +can't handle this branch efficiently because the compiler can't safely emit +parallelized instructions on either side of the branch. + +To achieve maximum parallelism across all of these architectures, one possible +solution is to unbranch the more complex path: + +```glsl +in vec4 color; +out vec4 frag_color; + +void main() { + frag_color = DoExtremelyExpensiveThing(color); + + if (color.a == 0) { + frag_color = vec4(0); + } +} +``` + +However, this may be a big tradeoff depending on how this shader is used -- this +solution will perform worse on SIMT devices in cases where `color.a == 0` across +all threads in a given warp, since `DoExtremelyExpensiveThing` will no longer be +skipped with this solution! So if the cheap branch path covers a large solid +portion of a draw call's coverage area, alternative designs may be favorable. 
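+
+As a sketch of one such alternative (hypothetical -- the `TRANSPARENT_FILL`
+define below is illustrative and not an existing Impeller or ImpellerC
+feature), the decision can be hoisted out of the shader entirely by compiling
+two specialized variants and picking between them per draw call on the CPU
+whenever the cheap case is known to apply to the whole draw:
+
+```glsl
+in vec4 color;
+out vec4 frag_color;
+
+#ifdef TRANSPARENT_FILL
+// Cheap variant: the host knows the entire draw resolves to transparent black.
+void main() {
+  frag_color = vec4(0);
+}
+#else
+// General variant: no branch at all, so every fragment pays the full cost.
+void main() {
+  frag_color = DoExtremelyExpensiveThing(color);
+}
+#endif
+```
+
+This avoids the varying branch on every architecture, at the cost of an extra
+pipeline variant to compile and manage.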
+
+### Beware of return branching
+
+Consider the following GLSL function:
+
+```glsl
+vec4 FrobnicateColor(vec4 color) {
+  if (color.a == 0) {
+    return vec4(0);
+  }
+
+  return DoExtremelyExpensiveThing(color);
+}
+```
+
+At first glance, this may appear cheap due to its simple contents, but this
+branch has two exclusive paths in practice, and the generated shader assembly
+will reflect the same behavior as this code:
+
+```glsl
+vec4 FrobnicateColor(vec4 color) {
+  vec4 result;
+
+  if (color.a == 0) {
+    result = vec4(0);
+  } else {
+    result = DoExtremelyExpensiveThing(color);
+  }
+
+  return result;
+}
+```
+
+The same concerns and advice apply to this branch as the scenario under "Avoid
+complex varying branches".
+
+### Use lower precision whenever possible
+
+Most desktop GPUs don't support 16-bit (mediump) or 8-bit (lowp) floating point
+operations. But many mobile GPUs (such as the Qualcomm Adreno series) do, and
+according to the
+[Adreno documentation](https://developer.qualcomm.com/sites/default/files/docs/adreno-gpu/developer-guide/gpu/best_practices_shaders.html#use-medium-precision-where-possible),
+using lower precision floating point operations is more efficient on these
+devices.

From fcbabd7a559701a657d966877df65dc5c31a93c4 Mon Sep 17 00:00:00 2001
From: Brandon DeRosier
Date: Wed, 13 Jul 2022 12:40:40 -0700
Subject: [PATCH 2/3] Update impeller/docs/shader_optimization.md

Co-authored-by: Zachary Anderson
---
 impeller/docs/shader_optimization.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/impeller/docs/shader_optimization.md b/impeller/docs/shader_optimization.md
index 53cc23c17eed2..2c697f014e112 100644
--- a/impeller/docs/shader_optimization.md
+++ b/impeller/docs/shader_optimization.md
@@ -8,7 +8,7 @@ for some other drivers that end users will run Flutter apps against.
 
 That being said, newer graphics devices have architectures that allow for both
 simpler shader compilation and better handling of traditionally slow shader
-code. In fact, straight forward "unoptimized" shader code filled with branches
+code. In fact, straightforward "unoptimized" shader code filled with branches
 may significantly outperform the equivalent branchless optimized shader code
 when targeting newer GPU architectures.
 

From 405b44322d08a8c012b3092049ed3394a4479e0f Mon Sep 17 00:00:00 2001
From: Brandon DeRosier
Date: Wed, 13 Jul 2022 19:34:27 -0700
Subject: [PATCH 3/3] Address comments

---
 impeller/README.md                   |  1 +
 impeller/docs/shader_optimization.md | 49 ++++++++++++++++------------
 2 files changed, 30 insertions(+), 20 deletions(-)

diff --git a/impeller/README.md b/impeller/README.md
index fae3621d719ea..637a2a87c8054 100644
--- a/impeller/README.md
+++ b/impeller/README.md
@@ -185,3 +185,4 @@ To your `AndroidManifest.xml` file, add under the `` tag:
 * [Learning to Read GPU Frame Captures](docs/read_frame_captures.md)
 * [How to Enable Metal Validation for Command Line Apps.](docs/metal_validation.md)
 * [How Impeller Works Around The Lack of Uniform Buffers in Open GL ES 2.0.](docs/ubo_gles2.md)
+* [Guidance for writing efficient shaders](docs/shader_optimization.md)
diff --git a/impeller/docs/shader_optimization.md b/impeller/docs/shader_optimization.md
index 2c697f014e112..397d4787f3f66 100644
--- a/impeller/docs/shader_optimization.md
+++ b/impeller/docs/shader_optimization.md
@@ -8,21 +8,29 @@ for some other drivers that end users will run Flutter apps against.
That being said, newer graphics devices have architectures that allow for both simpler shader compilation and better handling of traditionally slow shader -code. In fact, straightforward "unoptimized" shader code filled with branches -may significantly outperform the equivalent branchless optimized shader code -when targeting newer GPU architectures. +code. In fact, ostensibly "unoptimized" shader code filled with branches may +significantly outperform the equivalent branchless optimized shader code when +targeting newer GPU architectures. (See the "Don't flatten simple varying +branches" recommendation for an explanation of this with respect to different +architectures). -Flutter actively supports devices that are more than a decade old, which +Flutter actively supports mobile devices that are more than a decade old, which requires us to write shaders that perform well across multiple generations of GPU architectures featuring radically different behavior. Most optimization -choices are direct tradeoffs between GPU architectures, and having an accurate -mental model for how these common architectures maximize parallelism is -essential for making good tradeoff decisions while writing shaders. +choices are direct tradeoffs between these GPU architectures, and so having an +accurate mental model for how these common architectures maximize parallelism is +essential for making good decisions while authoring shaders. For these reasons, it's also important to profile shaders against some of the -older devices that Flutter can target (such as the iPhone 4s) when making +older devices that Flutter can target (such as the iPhone 6s) when making changes intended to improve shader performance. +Also, even though the branching behavior is largely architecture dependent and +should remain the same when using different graphics APIs, it's still also a +good idea to test changes against the different backends supported by Impeller +(Metal and GLES). Early stage shader compilation (as well as the high level +shader code generated by ImpellerC) may vary quite a bit between APIs. + ## GPU architecture primer GPUs are designed to have functional units running single instructions over many @@ -33,9 +41,9 @@ essentially specialized SIMD engines. GPU parallelism generally comes in two broad architectural flavors: **Instruction-level parallelism** and **Thread-level parallelism** -- these architecture designs handle shader branching very differently and are covered -in great detail in sections below. In general, older GPU architectures (on some -products released before ~2015) leverage instruction-level parallelism, while -most if not all newer GPUs leverage thread-level parallelism. +in the sections below. In general, older GPU architectures (on some products +released before ~2015) leverage instruction-level parallelism, while most if not +all newer GPUs leverage thread-level parallelism. Some of the earliest GPU architectures had no runtime control flow primitives at all (i.e. jump instructions), and compilers for these architectures needed to @@ -43,15 +51,15 @@ handle branches ahead of time by unrolling loops, compiling a different program for every possible branch combination, and then executing all of them. However, virtually all GPU architectures in use today have instruction-level support for dynamic branching, and it's quite unlikely that we'll come across a mobile -device capable of running Flutter that doesn't. 
For example, the oldest devices
-we test against in CI (iPhone 4s and Moto G4) run GPUs that support dynamic
+device capable of running Flutter that doesn't. For example, the old devices we
+test against in CI (iPhone 6s and Moto G4) run GPUs that support dynamic
 runtime branching. For these reasons, the optimization advice in this document
 isn't aimed at branchless architectures.
 
 ### Instruction-level parallelism
 
-Some older GPUs (including the PowerVR SGX543MP2 GPU on the iPhone 4s SOC) rely
-on SIMD vector or array instructions to maximize the number of computations
+Some older GPUs (including the PowerVR GT7600 GPU on the iPhone 6s SoC) rely on
+SIMD vector or array instructions to maximize the number of computations
 performed per clock cycle on each functional unit. This means that the shader
 compiler must figure out which parts of the program are safe to parallelize
 ahead of time and emit appropriate instructions. This presents a problem for
@@ -69,7 +77,7 @@ disadvantage that SIMD does.
 ### Thread-level parallelism
 
 Newer GPUs (but also some older hardware such as the Adreno 306 GPU found on the
-Moto G4's Snapdragon SOC) use scalar functional units (no SIMD/VLIW/MIMD) and
+Moto G4's Snapdragon SoC) use scalar functional units (no SIMD/VLIW/MIMD) and
 parallelize instructions at runtime by running the same instruction over many
 threads in groups often referred to as "warps" (Nvidia terminology) or
 "wavefronts" (AMD terminology), usually consisting of 32 or 64 threads per
@@ -110,9 +118,10 @@ uniform struct FrameInfo {
 in vec2 position;
 
 void main() {
-  gl_Position = mvp * vec4(position, 0, 1)
-  if (invert_y) {
-    gl_Position *= vec2(1, -1);
+  gl_Position = frame_info.mvp * vec4(position, 0, 1);
+
+  if (frame_info.invert_y) {
+    gl_Position *= vec4(1, -1, 1, 1);
   }
 }
 ```
@@ -207,7 +216,7 @@ vertex shader -- so the value may change from fragment to fragment (as opposed
 to a _uniform_ or _constant_, which will remain the same for the whole draw
 call).
 
-On SIMT architectures, this branch incurs very little overhead because, and
+On SIMT architectures, this branch incurs very little overhead because
 `DoExtremelyExpensiveThing` will be skipped over if `color.a == 0` across all
 the threads in a given warp.
 However, architectures that use instruction-level parallelism (VLIW or SIMD)