## Design of the Glow IR

### Introduction

This document describes the motivation behind the Glow intermediate
representation and some implementation details.

Glow is a retargetable compiler that supports a number of different backends.
This means that the first few layers of the compiler are target-independent, but
as you get closer to the different backends, things start to diverge. The first
two levels of IR are shared between all targets. Different backends may have
additional layers of IR.

### High-level Graph

The high-level IR is a graph-based representation that's similar to the graph
that you may find inside Caffe. When we load the model from a file we construct
this graph by directly translating each operator into one node. It's a simple
graph that allows basic transformations such as swapping the order of nodes and
removing nodes. The graph is strongly typed, which means that inputs and outputs
have a known tensor type (dimension and element type), and that the types must
match. The compiler has a debug method for dumping a graphical representation of
the graph into a dotty file. The method is called 'dumpDAG'. The textual
representation of the graph is less informative and looks like this:

```
pool
name : "pool"
input : float<8 x 28 x 28 x 16>
output : float<8 x 9 x 9 x 16>
kernel : 3
stride : 3
pad : 0
kind : max

convolution
name : "conv"
input : float<8 x 9 x 9 x 16>
output : float<8 x 9 x 9 x 16>
filter : float<16 x 5 x 5 x 16>
bias : float<16>
kernel : 5
stride : 1
pad : 2
depth : 16

relu
name : "conv"
input : float<8 x 9 x 9 x 16>
```
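
To make the "strongly typed" property concrete, here is a minimal, hypothetical
C++ sketch of what such a node could look like (the class and function names are
illustrative only and are not Glow's actual API): every node records the tensor
types of its input and output, and a graph transformation can check that
connected nodes still agree on those types.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Element kinds that a tensor type can have.
enum class ElemKind { Float, Index };

// A tensor type: an element kind plus a shape, e.g. float<8 x 28 x 28 x 16>.
struct TensorType {
  ElemKind elemKind;
  std::vector<std::size_t> dims;
  bool operator==(const TensorType &other) const {
    return elemKind == other.elemKind && dims == other.dims;
  }
};

// Every node carries the types of its input and output, so simple
// transformations (swapping or removing nodes) can be checked for legality.
struct Node {
  std::string name;
  TensorType inputTy;
  TensorType outputTy;
};

// Connecting a producer to a consumer is only legal when the producer's
// output type matches the consumer's input type.
void connect(const Node &producer, const Node &consumer) {
  assert(producer.outputTy == consumer.inputTy && "graph type mismatch");
}
```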

After optimizing the graph with target-independent optimizations, the code is
lowered into the mid-level IR in a phase called "IRGen" (short for IR
generation). This is a one-to-many translation where each operator is translated
into one or more instructions.
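
For intuition on the one-to-many nature of this step, here is a tiny,
hypothetical sketch (the helper name and the textual output are illustrative
only, not Glow's actual IRGen code): lowering a single relu node produces an
allocation, the relu instruction itself, and a matching deallocation, mirroring
the mid-level example shown later in this document.

```cpp
#include <iostream>
#include <string>
#include <vector>

// One graph operator expands into several mid-level instructions.
std::vector<std::string> irgenRelu(const std::string &inputBuffer,
                                   const std::string &type) {
  return {
      "%buf  = alloc " + type,                      // allocate an output buffer
      "%relu = relu @out %buf, @in " + inputBuffer, // the operation itself
      "%deal = dealloc @out %buf",  // in practice placed after the last use
  };
}

int main() {
  for (const std::string &line : irgenRelu("%conv", "float<8 x 9 x 9 x 16>"))
    std::cout << line << "\n";
}
```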

### Mid-level IR

The mid-level IR enables target-independent optimizations that are not possible
with the high-level graph format. For example, the ability to share memory
buffers during the forward pass can't be expressed in the graph form, because
buffers are not explicit there.

The mid-level IR is structured as a sequence of instructions that perform
operations such as copying memory and performing convolutions. The IR is not an
SSA-based representation, because it does not support control flow. The IR is
strongly typed, and each instruction operand kind has known parameter types. The
IR is designed to be used as an in-memory form, but it can be dumped to a
human-readable, assembly-like format.

The IR has two sections: 'declare' and 'program'. In the first section of the IR
we declare a number of memory regions that live throughout the lifetime of the
program. This is similar to global variables in C++. The second part of the IR
is a list of instructions. Each variable is annotated with the kind of
initialization that the program should do.
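
A minimal, hypothetical C++ sketch of this two-part, in-memory form (the class
names are illustrative and are not Glow's actual classes): the weights declared
in the first section live for the whole program, and the second section is just
an ordered list of instructions over those buffers.

```cpp
#include <string>
#include <vector>

// A strongly typed value: a named tensor with a fixed element kind and shape,
// e.g. "%input" of type float<8 x 28 x 28 x 1>.
struct Value {
  std::string name;
  std::string type;
};

// An instruction names an operation and the buffers it operates on. The same
// buffer may be written more than once, which is why this IR is not SSA.
struct Instruction {
  std::string kind;              // e.g. "convolution", "relu", "copy"
  std::vector<Value *> operands; // declared weights or locally allocated buffers
};

// The whole IR: the 'declare' section (weights that live for the lifetime of
// the program) followed by the 'program' section (a flat list of instructions
// with no control flow, and therefore no basic blocks).
struct IRFunction {
  std::vector<Value> weights;
  std::vector<Instruction> instructions;
};
```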

There are two kinds of memory regions: global memory regions and locally
allocated regions. The locally allocated memory regions are similar to 'alloca'
in C and C++, and to the 'alloca' instruction in LLVM. Memory regions are
strongly typed, which means that the type of tensor that the region represents
is known.

Instructions operate on either global variables or locally allocated buffers.
Each operand is annotated with one of the qualifiers '@in', '@out' or '@inout'.
'@in' means that the buffer is read from. '@out' means that the buffer is
written into. And '@inout' means that the instruction may read from and write
into the buffer. These operand qualifiers help the optimizer decide when it is
legal to share buffers. Instructions may have other attributes that specify the
legality of some optimizations. For example, some operands require that the data
from the forward pass be kept around for the backward pass, so if the program is
not optimized for inference-only mode then certain memory optimizations can't
happen.
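
As a rough illustration of how these qualifiers feed into buffer-sharing
decisions, here is a small, hypothetical sketch (not Glow's actual optimizer):
a buffer may be reused once no later instruction reads it, and the '@in' and
'@inout' qualifiers are what tell the optimizer which operands count as reads.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Operand qualifiers: @in reads the buffer, @out writes it, @inout does both.
enum class Qualifier { In, Out, InOut };

struct Operand {
  std::string buffer; // e.g. "%allo0"
  Qualifier qual;
};

struct Instruction {
  std::string kind;   // e.g. "relu", "pool max", "dealloc"
  std::vector<Operand> operands;
};

// Returns true if 'buffer' is read by any instruction after position 'idx'.
// Only @in and @inout operands count as reads.
bool isReadAfter(const std::vector<Instruction> &program, std::size_t idx,
                 const std::string &buffer) {
  for (std::size_t i = idx + 1; i < program.size(); ++i)
    for (const Operand &op : program[i].operands)
      if (op.buffer == buffer &&
          (op.qual == Qualifier::In || op.qual == Qualifier::InOut))
        return true;
  return false;
}

// An element-wise instruction such as relu may write its result directly into
// its input buffer when no later instruction reads that input.
bool canOperateInPlace(const std::vector<Instruction> &program, std::size_t idx,
                       const std::string &inputBuffer) {
  return !isReadAfter(program, idx, inputBuffer);
}
```

In the dump below, for example, if '%allo' were not read again after the
'%relu' instruction, the relu could write its result into '%allo' directly and
the separate '%allo0' buffer would not be needed.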

This is an example of an unoptimized IR:

```
declare {
  %input = weight float<8 x 28 x 28 x 1>, broadcast, 0.0
  %filter = weight float<16 x 5 x 5 x 1>, xavier, 25.0
  %filter0 = weight float<16>, broadcast, 0.100
  %weights = weight float<10 x 144>, xavier, 144.0
  %bias = weight float<10>, broadcast, 0.100
  %selected = weight index<8 x 1>
  ...
  %result = weight float<8 x 10>
}

program {
  %allo = alloc float<8 x 28 x 28 x 16>
  %conv = convolution [5 1 2 16] @out %allo, @in %input, @in %filter3, @in %bias0
  %allo0 = alloc float<8 x 28 x 28 x 16>
  %relu = relu @out %allo0, @in %allo
  %allo1 = alloc index<8 x 9 x 9 x 16 x 2>
  %allo2 = alloc float<8 x 9 x 9 x 16>
  %pool = pool max [3 3 0] @out %allo2, @in %allo0, @inout %allo1
  ...
  %deal6 = dealloc @out %allo6
  %deal7 = dealloc @out %allo7
  %deal8 = dealloc @out %allo8
  %deal9 = dealloc @out %allo9
}
```