
Conversation

@psiddh (Contributor) commented Sep 12, 2025

Integrate CMSIS-NN with per-channel quantization support.

Test Plan:
Run the e2e test on the FVP simulator:
./examples/arm/run_mcu_models_fvp.sh --target=cortex-m55 --models=qlinear




pytorch-bot bot commented Sep 12, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/14252

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 5309a25 with merge base 0e9d871:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla bot added the CLA Signed label Sep 12, 2025

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@digantdesai digantdesai changed the title Summary: Add Statefull FC Cortex-m linearOps Summary: Add Stateful FC Cortex-m linearOps Sep 12, 2025
Comment on lines +257 to +268
input_zero_point: int,
input_multiplier: int,
input_shift: int,
Contributor:

Do we want to just take a Tensor in (even for a single element)? The rationale is to support per-token-like quantization later, the way we support per-tensor today.

bias_multiplier: torch.Tensor,
bias_shift: torch.Tensor,
scratch_buffer: torch.Tensor,
output_zero_point: int,
Contributor:

And this one too..
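
For illustration, a minimal Python sketch of what the suggestion above enables (the function and parameter names are hypothetical, not the actual op schema): a Tensor parameter lets numel() distinguish per-tensor from per-token quantization without changing the signature.

```python
import torch

# Hypothetical sketch: with quant params passed as Tensors, numel() == 1
# keeps today's per-tensor behavior, while numel() == num_tokens would
# support per-token quantization under the same signature.
def apply_input_offset(x: torch.Tensor, input_zero_point: torch.Tensor) -> torch.Tensor:
    if input_zero_point.numel() == 1:
        return x - int(input_zero_point)     # per-tensor: one offset for all rows
    return x - input_zero_point.view(-1, 1)  # per-token: one offset per row

x = torch.randint(-128, 128, (4, 8), dtype=torch.int32)
out_per_tensor = apply_input_offset(x, torch.tensor([5]))
out_per_token = apply_input_offset(x, torch.arange(4))
```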

    ) * weight_scales.unsqueeze(1)
    if bias is not None:
        if bias_multiplier.numel() == 1:
            bias_scale = bias_multiplier.item() * (2.0 ** (-bias_shift.item()))
Contributor:

Why do we need the .item() specialization for numel() == 1?

@psiddh (author):

.item() is needed for single-element tensors to extract a Python scalar for correct math, isn't it?

Contributor:

float() works when numel() == 1, so you can get a float when you consume it, or pass it down as a tensor.
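
A minimal sketch of that suggestion (values illustrative): float() accepts any tensor with numel() == 1, so the .item() branch is unnecessary.

```python
import torch

bias_multiplier = torch.tensor([1073741824], dtype=torch.int64)
bias_shift = torch.tensor([1], dtype=torch.int64)

# float() extracts the scalar from any numel() == 1 tensor, so this covers
# the single-element case without an .item() branch; the fixed-point scale
# is multiplier * 2^(-shift) either way.
bias_scale = float(bias_multiplier) * (2.0 ** -float(bias_shift))
```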

@digantdesai (Contributor) left a comment:

Great starting point. Thanks @psiddh for all the back and forth.
Left some comments.

}

// start of cmsis buffer
ctx.buf = scratch_ptr;
Contributor:

kernel_sum_state->get_scratch_ptr()

@psiddh (author):

I've encapsulated everything in a simple helper class now.

ctx.buf = scratch_ptr;
ctx.size = scratch_buffer.size(0) - sizeof(kernel_sum_state);

for (int32_t b = 0; b < batch_size; b++) {
Contributor:

Did we verify this is the right way to call this API?

@psiddh (author):

Verify as in: I get a success result from the CMSIS API call.
Also, reading the implementation code, we must loop over the batch dimension and call the function once per input vector. I believe fully connected (FC) functions like arm_fully_connected_s8 and arm_fully_connected_per_channel_s8 process a single input vector at a time.
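
A conceptual Python sketch of that batching strategy (reference math only, not the CMSIS-NN C API): a [batch, in_features] input is processed one row per call, as the loop over b does in the kernel.

```python
import torch

# Conceptual sketch: CMSIS-NN FC kernels consume one input vector per call,
# so the out-variant loops over the batch dimension, one "kernel call" per row.
def fc_row_by_row(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    rows = [torch.mv(w, x[b]) for b in range(x.shape[0])]
    return torch.stack(rows)

x = torch.randn(4, 8)
w = torch.randn(3, 8)
assert torch.allclose(fc_row_by_row(x, w), x @ w.t(), atol=1e-5)
```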

@digantdesai (Contributor):

@AdrianLundell - Just FYI. We have more work to do here, but I want to give you a heads-up.

 public:
  CMSISScratchBufferContext(
      Tensor& scratch_buffer,
      const cmsis_nn_dims& filter_dims)
Contributor:

Nit: just take a weight tensor ref as an arg?

BINARY_DIR CMSIS_NN_BINARY_DIR
)
set(CMSIS_NN_INCLUDE_DIR "${CMSIS_NN_SOURCE_DIR}/Include")
set(CMSIS_NN_LIB "${CMSIS_NN_BINARY_DIR}/libcmsis-nn.a")
Collaborator:

This is an absolute path into the binary tree, which causes portability issues when installing. Could you take another look at this and try again to use only the cmsis-nn target when building?

@@ -114,6 +117,27 @@ inline void validate_quantization_params(
"Single quant Output");
}

inline bool validate_per_channel_quant_params(
Collaborator:

It would be nice to document where these constraints come from.

//   ^                        ^                              ^
//   scratch_ptr (start)      scratch_ptr + cmsis_scratch    scratch_ptr + total_size
//
// - CMSIS-NN workspace: used by CMSIS-NN kernels for temporary data
Collaborator:

Nice comment, though I feel like it is missing a description of the kernel_sum_state struct?
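
For illustration, a small Python sketch of the split implied by the layout above (the function name is hypothetical; that the tail region holds the kernel_sum_state is an assumption based on this review thread):

```python
# Illustrative only: one flat scratch allocation with the CMSIS-NN workspace
# at the front; the remainder is assumed (per the review discussion) to hold
# the kernel_sum_state.
def scratch_regions(total_size: int, cmsis_scratch: int) -> dict:
    assert 0 <= cmsis_scratch <= total_size
    return {
        "cmsis_workspace": (0, cmsis_scratch),            # temporary kernel data
        "kernel_sum_state": (cmsis_scratch, total_size),  # assumed tail region
    }
```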

return out;
}

// Functional variant (stub, not used at runtime)
Collaborator:

I feel like we should work toward removing the need for these stub functions.

@@ -297,6 +303,20 @@ def forward(self, x: torch.Tensor, y: torch.Tensor):
can_delegate = True


class QuantLinearTest(torch.nn.Module):
Collaborator:

We are trying to move away from adding testing logic to the aot_arm_compiler; please see #13902 for my suggestions on the cortex_m testing strategy.

fc_params.output_offset = output_zp;
fc_params.activation.min = std::numeric_limits<int8_t>::min();
fc_params.activation.max = std::numeric_limits<int8_t>::max();
cmsis_nn_dims input_dims = {1, 1, 1, in_feat};
Collaborator:

This is channels-last, whereas PyTorch defaults to channels-first; how are you handling that?

@@ -223,3 +223,220 @@ def quantized_add_out_impl(
    out.copy_(result_quantized)

    return out

Collaborator:

Should we start splitting these definitions into separate files?

        self._cleanup_nodes(graph)
        return fusion_count

    def _find_original_input_placeholder(self, dq_node: Node) -> Node:
Collaborator:

not used?

Labels: CLA Signed
Projects: Status: To triage