Arm backend: Document Ethos-U memory modes and add Ethos-U porting guide (#14144)
Memory modes: The Shared_Sram, Sram_Only and Dedicated_Sram memory modes
are specified in the compile spec and are tightly coupled with how the
Ethos-U scratch buffer and NN should be placed in the embedded
application. Different memory modes profoundly impact the performance
and memory footprint of the application and it is important to use the
NPU in the most suitable memory mode for optimal performance.
Porting guide: A document explaining the key steps to port a new
hardware target with an Ethos-U NPU to the Ethos-U backend in ExecuTorch
docs/source/backends-arm-ethos-u.md (101 additions & 2 deletions)
@@ -72,17 +72,116 @@ with open("mv2_arm_ethos_u55.pte", "wb") as file:
72
72
edge_program_manager.write_to_file(file)
73
73
```
74
74
75
+
### Ethos-U memory modes
76
+
The Ethos-U NPU provides two distinct memory interfaces:
77
+
- One interface for **low-latency, high-bandwidth memory**
78
+
Typically on-chip memory such as **SRAM**.
79
+
- One interface for **higher-latency, lower-bandwidth memory**
80
+
Typically external (off-chip) memory such as **Flash** or **DRAM**.
81
+
82
+
On all Ethos-U NPUs (Ethos-U55, Ethos-U65, Ethos-U85), the low-latency interface is usually connected to the SRAM of the SoC.
83
+
The external memory type depends on the SoC:
84
+
- On a low-power microcontroller, the external memory is usually Flash.
85
+
- On systems with a Cortex-A processor and a rich operating system, the external memory is typically DRAM.
86
+
87
+
When running an inference, the Ethos-U compiler and Ethos-U driver make use of three logical memory regions:
88
+
- Ethos-U scratch buffer - a contiguous block of memory used by the NPU to store the intermediate tensors produced and consumed during inference.
89
+
- Neural Network - a contiguous block of memory holding constant data such as weights, biases, and quantization parameters required to run an inference.
90
+
- Ethos-U fast scratch buffer - a contiguous block of memory, assumed to reside in on-chip memory in order to hide the higher latency/lower bandwidth of external memory. Only applicable for Ethos-U65 and Ethos-U85 on systems
91
+
with a Cortex-A processor, where the external memory is assumed to be DRAM.
92
+
93
+
The placement of the scratch buffer and the Neural Network determines the memory mode to be used in the Ethos-U
94
+
compile specification. We support three different placements of the scratch buffer and the ML model.
95
+
96
+
#### 1. Sram-Only Memory Mode
97
+
- Ethos-U scratch buffer resides in the SRAM.
98
+
- Neural Network resides in the SRAM.
99
+
- Ethos-U fast scratch buffer is not used.
100
+
- Characteristics:
101
+
- Provides the best performance since all the memory traffic passes via the low-latency/high-bandwidth memory.
102
+
- The performance uplift is especially noticeable for workloads that would otherwise be memory-bound on the external interface.
103
+
- Available on Ethos-U55, Ethos-U65 and Ethos-U85.
104
+
- Limitations:
105
+
- Embedded SoCs often have limited SRAM and NNs are becoming larger. This memory mode may be unsuitable for a system running a big model relative to the amount of SRAM available on the SoC.
106
+
Below, you can see a visual representation of the placement of the two logical memory regions for the Sram Only configuration.
107
+
108
+

109
+
110
+
#### 2. Shared-Sram Memory Mode
111
+
- Ethos-U scratch buffer resides in the SRAM.
112
+
- Neural Network resides in the External memory.
113
+
- Ethos-U fast scratch buffer is not used.
114
+
- Characteristics:
115
+
- Intermediate tensors are stored in the SRAM, leveraging its low latency and high bandwidth.
116
+
- The Ethos-U compiler can prefetch weights from the external memory to the SRAM ahead of time so that when the NPU needs the data, it will already be available in the on-chip memory.
117
+
- In this mode, the external interface is read-only and the on-chip memory interface is read/write.
118
+
- Shared-Sram offers a good balance between performance and low SRAM usage.
119
+
- Available on Ethos-U55, Ethos-U65 and Ethos-U85.
120
+
- Limitations:
121
+
- You need to have enough space in the SRAM to hold the peak intermediate tensor.
122
+
Below, you can see a visual representation of the placement of the two logical memory regions for the Shared_Sram configuration.
123
+
124
+

125
+
126
+
#### 3. Dedicated-Sram Memory Mode
127
+
- Ethos-U scratch buffer resides in the External memory.
128
+
- Neural Network resides in the External memory.
129
+
- Ethos-U fast scratch buffer resides in the on-chip memory.
130
+
- Characteristics:
131
+
- Used when the peak intermediate tensor is too big to fit into the on-chip memory.
132
+
- Enables silicon acceleration of large models.
133
+
- The NPU stores the results from the intermediate computations in the external memory.
134
+
- The dedicated SRAM acts as a software managed cache, improving performance by pre-fetching frequently accessed tensors to the on-chip memory.
135
+
- Available on Ethos-U65 and Ethos-U85.
136
+
- Limitations:
137
+
- The SRAM space must be dedicated exclusively to the Ethos-U (the host processor should not access it).
138
+
- Not available on Ethos-U55.
139
+
Below, you can see a visual representation of the placement of the logical memory regions for the Dedicated_Sram configuration.
140
+
141
+

142
+
143
+
Here is a table comparing the three memory modes:
144
+
145
+
| Memory Mode | Ethos-U Scratch Buffer Placement | Neural Network Placement | When to Use | Trade-off |
| --- | --- | --- | --- | --- |
| **SRAM-Only** | On-chip SRAM | On-chip SRAM | When the ML model, the Ethos-U scratch buffer and the wider software stack fit within the SRAM of the SoC | Limited by SRAM size; often not feasible for larger NNs |
| **Shared-SRAM** | On-chip SRAM | External memory (Flash/DRAM) | Most common mode on Cortex-M and Ethos-U systems; balances good performance and SRAM usage | Requires enough SRAM to hold the largest intermediate tensor |
| **Dedicated-SRAM** | External memory | External memory (Flash/DRAM) | Most common mode for Cortex-A and Ethos-U systems. For very large models where the peak intermediates cannot fit in SRAM | Need high-bandwidth external memory to deliver good performance |
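
The comparison above can be read as a simple decision procedure. The following is a minimal sketch of that reasoning, not part of ExecuTorch or the Ethos-U compiler: `choose_memory_mode` and its parameters are hypothetical names, the sizes are values you would have to measure yourself, and `sram_budget_bytes` is assumed to be the SRAM left for the NPU after the rest of the software stack is accounted for.

```python
from enum import Enum


class MemoryMode(str, Enum):
    """Memory mode names as they are spelled in the compile spec."""

    SRAM_ONLY = "Sram_Only"
    SHARED_SRAM = "Shared_Sram"
    DEDICATED_SRAM = "Dedicated_Sram"


# NPUs on which Dedicated_Sram is available, per the text above.
DEDICATED_SRAM_NPUS = {"ethos-u65", "ethos-u85"}


def choose_memory_mode(npu, model_bytes, scratch_bytes, sram_budget_bytes):
    """Hypothetical helper: pick the fastest mode whose SRAM needs fit.

    This is a simplified heuristic. It ignores the fast scratch buffer size
    and any alignment or fragmentation overhead.
    """
    # Best case: both the Neural Network and the scratch buffer fit in SRAM.
    if model_bytes + scratch_bytes <= sram_budget_bytes:
        return MemoryMode.SRAM_ONLY
    # Next best: only the scratch buffer (peak intermediates) fits in SRAM.
    if scratch_bytes <= sram_budget_bytes:
        return MemoryMode.SHARED_SRAM
    # Otherwise everything moves to external memory, if the NPU allows it.
    if npu.lower() in DEDICATED_SRAM_NPUS:
        return MemoryMode.DEDICATED_SRAM
    raise ValueError(
        f"{npu}: scratch buffer does not fit in SRAM and "
        "Dedicated_Sram is not available on this NPU"
    )
```

For example, `choose_memory_mode("ethos-u55", 1000, 50, 200)` selects `Shared_Sram` because only the scratch buffer fits in the budget, while the same model with a 500-byte scratch buffer raises an error on Ethos-U55.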
150
+
151
+
152
+
The memory modes are defined within the [vela.ini file](https://gitlab.arm.com/artificial-intelligence/ethos-u/ethos-u-vela/-/blob/main/ethosu/config_files/Arm/vela.ini?ref_type=heads). When you install
153
+
ExecuTorch for the Ethos-U backend, you automatically install the compiler containing the vela.ini file so you can directly create a compile specification with these memory modes.
154
+
155
+
#### Interpreting the output from the Ethos-U compiler regarding the memory footprint
156
+
As part of the `to_edge_transform_and_lower` step, you will see memory footprint information such as:
157
+
158
+
```
159
+
Total SRAM used 2467.27 KiB
160
+
Total Off-chip Flash used 12.20 KiB
161
+
```
162
+
The `Total SRAM used` indicates the peak SRAM utilization needed by the NPU in order to perform an inference. In the snippet above, the Ethos-U compiler requires 2467.27 KiB of SRAM in order to schedule the inference.
163
+
Therefore, from an application standpoint, you need to ensure you have at least 2467.27 KiB of SRAM on the SoC to run this model. The Ethos-U compiler provides a scheduling algorithm that can
164
+
lower the peak SRAM usage within reasonable limits. To use it, add the `--optimise Size` or `--arena-cache-size` CLI options to the compile spec. You can read more about the options of the
165
+
Ethos-U compiler in the documentation [here](https://gitlab.arm.com/artificial-intelligence/ethos-u/ethos-u-vela/-/blob/main/OPTIONS.md#optimise). If the peak SRAM usage remains too high in
166
+
Shared_Sram memory mode, you need to use the Dedicated_Sram mode in order to store the Neural Network and the Ethos-U scratch buffer in the external memory.
167
+
The main advantage of the Dedicated_Sram memory mode is that you can run large models and still benefit from the low-latency/high-bandwidth of the SRAM, used as a cache.
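
If you capture the compiler output in a build script, the SRAM check described above can be automated. The sketch below is illustrative only: `parse_footprint` and `fits_in_sram` are hypothetical helpers, and the regular expression assumes the summary lines look like the snippet shown earlier (`Total <region> used <value> KiB`), which may differ between compiler versions.

```python
import re

# Matches lines such as "Total SRAM used   2467.27 KiB".
# Assumption: the report format matches the snippet shown in this section.
_FOOTPRINT_RE = re.compile(
    r"^\s*Total\s+(?P<region>.+?)\s+used\s+(?P<kib>\d+(?:\.\d+)?)\s+KiB\s*$"
)


def parse_footprint(report):
    """Return a dict mapping region name (e.g. "SRAM") to usage in KiB."""
    usage = {}
    for line in report.splitlines():
        match = _FOOTPRINT_RE.match(line)
        if match:
            usage[match.group("region")] = float(match.group("kib"))
    return usage


def fits_in_sram(report, sram_budget_kib):
    """Check the reported peak SRAM usage against the SRAM you can spare."""
    usage = parse_footprint(report)
    if "SRAM" not in usage:
        raise ValueError("no 'Total SRAM used' line found in the report")
    return usage["SRAM"] <= sram_budget_kib


report = """
Total SRAM used                               2467.27 KiB
Total Off-chip Flash used                       12.20 KiB
"""
print(parse_footprint(report))
print(fits_in_sram(report, sram_budget_kib=2048))
```

With a 2048 KiB budget the check fails for this report, which is the situation in which you would try the size-oriented compiler options or move to the Dedicated_Sram memory mode.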
168
+
169
+
It is important to highlight that when you specify a memory mode in the compile spec, you are expected to place the scratch buffer and NN in the matching memory location at runtime.
170
+
In other words, when you specify, for example, the Shared_Sram memory mode, the runtime application logic should place the Ethos-U scratch buffer in the on-chip memory and the NN in the external memory for optimal performance.
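
One way to keep the compile-time memory mode and the runtime placement in sync is to check them against each other in a build or test script. The following is a rough sketch under stated assumptions: the region keys (`scratch`, `network`, `fast_scratch`), the `"sram"`/`"external"` labels and the `placement_mismatches` helper are all hypothetical, and the expected placements are taken from the memory mode descriptions in this section.

```python
# Expected placement of each logical memory region per memory mode,
# transcribed from the descriptions above. None means "not used".
EXPECTED_PLACEMENT = {
    "Sram_Only": {"scratch": "sram", "network": "sram", "fast_scratch": None},
    "Shared_Sram": {"scratch": "sram", "network": "external", "fast_scratch": None},
    "Dedicated_Sram": {
        "scratch": "external",
        "network": "external",
        "fast_scratch": "sram",
    },
}


def placement_mismatches(memory_mode, actual):
    """Hypothetical check: list regions whose runtime placement is wrong.

    `actual` describes where your application (for example, its linker
    script) puts each region, using the same keys and labels as above.
    """
    expected = EXPECTED_PLACEMENT[memory_mode]
    return [
        f"{region}: expected {want}, got {actual.get(region)}"
        for region, want in expected.items()
        if actual.get(region) != want
    ]


# Example: the model was compiled for Shared_Sram, but the application
# keeps the Neural Network in SRAM instead of external memory.
runtime_layout = {"scratch": "sram", "network": "sram", "fast_scratch": None}
print(placement_mismatches("Shared_Sram", runtime_layout))
```

An empty list means the runtime layout matches the memory mode given in the compile spec.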
171
+
172
+
You can see how the memory mode is coupled with the runtime application in the [Ethos-U porting guide](../../examples/arm/ethos-u-porting-guide.md).
173
+
75
174
### Partitioner API
76
175
77
176
`EthosUPartitioner` tries to partition as much of the model as possible. It will never delegate unsupported operators, but a user can pass additional checks to the constructor to avoid partitioning additional operators. To do this, subclass `OperatorSupportBase` and implement the function `is_node_supported`. A few such checks exist in `executorch.exir.backend.operator_support`:
78
177
79
178
- `DontPartition`: Don't partition operators based on operator type.
80
179
- `DontPartitionModule`: Don't partition operators based on which python module the operator comes from.
81
-
-`DontPartitionName`: Don't partition opertors based on the operator name.
180
+
- `DontPartitionName`: Don't partition operators based on the operator name.
82
181
83
182
### Quantization
84
183
85
-
A fully integer model is required for using the Arm Ethos-U backend. As discussed above, you can quantize floating point models with the the `EthosUQuantizer`. Quantizers are backend specific, which means the `EthosUQuantizer` is configured to quantize models correctly for the target.
184
+
A fully integer model is required for using the Arm Ethos-U backend. As discussed above, you can quantize floating point models with the `EthosUQuantizer`. Quantizers are backend specific, which means the `EthosUQuantizer` is configured to quantize models correctly for the target.