
Conversation

@psiddh (Contributor) commented Sep 12, 2025

Integrate CMSIS-NN with per-channel quantization support.

Test Plan:
Run the e2e test on the FVP simulator:
./examples/arm/run_mcu_models_fvp.sh --target=cortex-m55 --models=qlinear




pytorch-bot bot commented Sep 12, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/14252

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 5309a25 with merge base 0e9d871:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla bot added the CLA Signed label Sep 12, 2025

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@digantdesai digantdesai changed the title Summary: Add Statefull FC Cortex-m linearOps Summary: Add Stateful FC Cortex-m linearOps Sep 12, 2025
Comment on lines +257 to +268
input_zero_point: int,
input_multiplier: int,
input_shift: int,
Contributor:

Do we want to just take a Tensor in (even for a single element)? The rationale is to support per-token-like quantization later, the way we support per-tensor today.

bias_multiplier: torch.Tensor,
bias_shift: torch.Tensor,
scratch_buffer: torch.Tensor,
output_zero_point: int,
Contributor:

And this one too..
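
For illustration, a minimal Python sketch of what the suggestion above enables (the function and parameter names are hypothetical, not the actual op schema): a Tensor parameter lets numel() distinguish per-tensor from per-token quantization without changing the signature.

```python
import torch

# Hypothetical sketch: with quant params passed as Tensors, numel() == 1
# keeps today's per-tensor behavior, while numel() == num_tokens would
# support per-token quantization under the same signature.
def apply_input_offset(x: torch.Tensor, input_zero_point: torch.Tensor) -> torch.Tensor:
    if input_zero_point.numel() == 1:
        return x - int(input_zero_point)     # per-tensor: one offset for all rows
    return x - input_zero_point.view(-1, 1)  # per-token: one offset per row

x = torch.randint(-128, 128, (4, 8), dtype=torch.int32)
out_per_tensor = apply_input_offset(x, torch.tensor([5]))
out_per_token = apply_input_offset(x, torch.arange(4))
```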

    ) * weight_scales.unsqueeze(1)
    if bias is not None:
        if bias_multiplier.numel() == 1:
            bias_scale = bias_multiplier.item() * (2.0 ** (-bias_shift.item()))
Contributor:

Why do we need the .item() specialization for numel() == 1?

@psiddh (author):

.item() is needed for single-element tensors to extract a Python scalar for correct math, isn't it?

Contributor:

float() works when numel() == 1, so you can get a float when you consume it, or pass it down as a tensor.
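
A minimal sketch of that suggestion (values illustrative): float() accepts any tensor with numel() == 1, so the .item() branch is unnecessary.

```python
import torch

bias_multiplier = torch.tensor([1073741824], dtype=torch.int64)
bias_shift = torch.tensor([1], dtype=torch.int64)

# float() extracts the scalar from any numel() == 1 tensor, so this covers
# the single-element case without an .item() branch; the fixed-point scale
# is multiplier * 2^(-shift) either way.
bias_scale = float(bias_multiplier) * (2.0 ** -float(bias_shift))
```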

@digantdesai (Contributor) left a comment:

Great starting point. Thanks @psiddh for all the back and forth.
Left some comments.

}

// start of cmsis buffer
ctx.buf = scratch_ptr;
Contributor:

kernel_sum_state->get_scratch_ptr()

@psiddh (author):

I've encapsulated everything in a simple helper class now.

ctx.buf = scratch_ptr;
ctx.size = scratch_buffer.size(0) - sizeof(kernel_sum_state);

for (int32_t b = 0; b < batch_size; b++) {
Contributor:

Did we verify this is the right way to call this API?

@psiddh (author):

Verify as in: I get a success result from the CMSIS API call.
Also, reading the implementation code, we must loop over the batch dimension and call the function once per input vector. I believe fully connected (FC) functions like arm_fully_connected_s8 and arm_fully_connected_per_channel_s8 process a single input vector at a time.
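
A conceptual Python sketch of that batching strategy (reference math only, not the CMSIS-NN C API): a [batch, in_features] input is processed one row per call, as the loop over b does in the kernel.

```python
import torch

# Conceptual sketch: CMSIS-NN FC kernels consume one input vector per call,
# so the out-variant loops over the batch dimension, one "kernel call" per row.
def fc_row_by_row(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    rows = [torch.mv(w, x[b]) for b in range(x.shape[0])]
    return torch.stack(rows)

x = torch.randn(4, 8)
w = torch.randn(3, 8)
assert torch.allclose(fc_row_by_row(x, w), x @ w.t(), atol=1e-5)
```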

@digantdesai (Contributor):

@AdrianLundell - Just FYI. We have more work to do here, but I want to give you a heads-up.

 public:
  CMSISScratchBufferContext(
      Tensor& scratch_buffer,
      const cmsis_nn_dims& filter_dims)
Contributor:

Nit: just take a weight tensor ref as an arg?

BINARY_DIR CMSIS_NN_BINARY_DIR
)
set(CMSIS_NN_INCLUDE_DIR "${CMSIS_NN_SOURCE_DIR}/Include")
set(CMSIS_NN_LIB "${CMSIS_NN_BINARY_DIR}/libcmsis-nn.a")
Collaborator:

This is an absolute path into the binary tree, which causes portability issues when installing. Could you take another look at this and try again to use only the cmsis-nn target when building?

@@ -114,6 +117,27 @@ inline void validate_quantization_params(
"Single quant Output");
}

inline bool validate_per_channel_quant_params(
Collaborator:

It would be nice to document where these constraints come from.

//   ^                        ^                              ^
//   scratch_ptr (start)      scratch_ptr + cmsis_scratch    scratch_ptr + total_size
//
// - CMSIS-NN workspace: used by CMSIS-NN kernels for temporary data
Collaborator:

Nice comment, though I feel like it is missing a description of the kernel_sum_state struct?
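
For illustration, a small Python sketch of the split implied by the layout above (the function name is hypothetical; that the tail region holds the kernel_sum_state is an assumption based on this review thread):

```python
# Illustrative only: one flat scratch allocation with the CMSIS-NN workspace
# at the front; the remainder is assumed (per the review discussion) to hold
# the kernel_sum_state.
def scratch_regions(total_size: int, cmsis_scratch: int) -> dict:
    assert 0 <= cmsis_scratch <= total_size
    return {
        "cmsis_workspace": (0, cmsis_scratch),            # temporary kernel data
        "kernel_sum_state": (cmsis_scratch, total_size),  # assumed tail region
    }
```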

return out;
}

// Functional variant (stub, not used at runtime)
Collaborator:

I feel like we should work toward removing the need for these stub functions.

@@ -297,6 +303,20 @@ def forward(self, x: torch.Tensor, y: torch.Tensor):
can_delegate = True


class QuantLinearTest(torch.nn.Module):
Collaborator:

We are trying to move away from adding testing logic to the aot_arm_compiler; please see #13902 for my suggestions on the cortex_m testing strategy.

fc_params.output_offset = output_zp;
fc_params.activation.min = std::numeric_limits<int8_t>::min();
fc_params.activation.max = std::numeric_limits<int8_t>::max();
cmsis_nn_dims input_dims = {1, 1, 1, in_feat};
Collaborator:

This is channels-last, whereas PyTorch defaults to channels-first; how are you handling that?

@@ -223,3 +223,220 @@ def quantized_add_out_impl(
    out.copy_(result_quantized)

    return out

Collaborator:

Should we start splitting these definitions into separate files?

        self._cleanup_nodes(graph)
        return fusion_count

    def _find_original_input_placeholder(self, dq_node: Node) -> Node:
Collaborator:

not used?

Labels: CLA Signed
Projects: Status: To triage