Why are models like Llama implemented again in examples/models/?
Labels: module: examples (Issues related to demos under examples/), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Description
Hi everyone,
I noticed that in ExecuTorch, the models are also implemented in examples/models/, and those implementations are used for exporting the model to an ExportedProgram. My question is: how is this model implementation different from the HF implementation, and from the optimum-executorch flow for exporting models?
cc @mergennachin @iseeyuan @lucylq @helunwencser @tarun292 @kimishpatel @jackzhxng
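For reference, the export flow I am referring to looks roughly like the sketch below, with a toy module standing in for Llama; the exact APIs may differ slightly across ExecuTorch versions.

```python
import torch
from executorch.exir import to_edge

# Toy stand-in for a model definition under examples/models/.
class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 16)

    def forward(self, x):
        return torch.relu(self.linear(x))

model = TinyModel().eval()
example_inputs = (torch.randn(1, 16),)

# 1. Capture the eager model as an ExportedProgram with torch.export.
exported_program = torch.export.export(model, example_inputs)

# 2. Lower to the Edge dialect and serialize to an ExecuTorch program (.pte).
edge_program = to_edge(exported_program)
et_program = edge_program.to_executorch()

with open("tiny_model.pte", "wb") as f:
    f.write(et_program.buffer)
```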
SS-JIA commented on May 27, 2025
My understanding is that the examples/models/ directory showcases how models can be developed and exported to ExecuTorch exclusively using the PyTorch + ExecuTorch ecosystem. They are useful as technical references for, e.g., how to lower models to specific backends, how to apply specific quantization schemes, etc. Furthermore, for Llama models we also provide an example runner binary to showcase how an exported Llama model can be executed with the ET runtime to evaluate a prompt and generate output.

@guangy10 may be able to add more context regarding the differences with the optimum-executorch export flow. My understanding is that this flow is intended to be a more seamless user experience for those who are mainly interested in deploying an LLM via ExecuTorch, without needing to dive deeper into the technical details of what's going on under the hood.
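To make the "lowering to specific backends" point concrete, a rough sketch of delegating an exported module to the XNNPACK backend is below; the partitioner import path follows the ExecuTorch docs and may change between releases.

```python
import torch
from executorch.exir import to_edge
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(1, 16),)

# Capture and lower to the Edge dialect.
exported_program = torch.export.export(model, example_inputs)
edge_program = to_edge(exported_program)

# Delegate supported subgraphs to the XNNPACK backend, then serialize to .pte.
edge_program = edge_program.to_backend(XnnpackPartitioner())
et_program = edge_program.to_executorch()

with open("model_xnnpack.pte", "wb") as f:
    f.write(et_program.buffer)
```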
anzr299 commented on May 27, 2025
Thank you for the great explanation, Sicheng!
Although I get your point, my question is about the model implementation itself:
executorch/examples/models/llama/model.py (Line 38 in 4cf5c06)
I see that the code re-written here closely resembles the Llama model code in transformers, except for a few changes in RoPE, whose motivation is again unclear.
jackzhxng commented on May 27, 2025
@anzr299 this is a version of Llama that we maintain ourselves and are actively optimizing for performance. It is also the backbone for some of the other models we support, such as Qwen3 and Phi-4-mini. You are also able to use the HF version directly, although it would not leverage all the optimizations that we are doing.
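For anyone who does want to use the HF version directly, the optimum-executorch path looks roughly like the sketch below. This is based on the optimum-executorch README; the exact method names and the recipe argument are assumptions that may differ by version.

```python
from optimum.executorch import ExecuTorchModelForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B"  # illustrative checkpoint

# Export the native HF model definition to ExecuTorch; "recipe" selects the
# lowering path (kwarg name assumed from the optimum-executorch README).
model = ExecuTorchModelForCausalLM.from_pretrained(model_id, recipe="xnnpack")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Run generation on the exported program (method name assumed from the README).
generated_text = model.text_generation(
    tokenizer=tokenizer,
    prompt="Simply put, the theory of relativity states that",
    max_seq_len=128,
)
print(generated_text)
```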
guangy10 commented on May 27, 2025
@anzr299 LLMs in examples/models/ are lowered using the model definition you pointed to. The main reason is perf: with an in-house model definition, there is more flexibility to apply source-to-source transformations/optimizations (which cannot easily be generalized to arbitrary PyTorch models in OSS).

Regarding the models in optimum-et, that flow comes through the partnership with Hugging Face to support lowering native Hugging Face models (we do NOT own the model definitions) to ET with reasonable out-of-the-box perf. It focuses not only on decoder-only LLMs, but also on the breadth of models in OSS and on the composability of various optimizations (e.g. torchao, export, and customization of HF models), to gradually boost the out-of-the-box perf to ~80% of the SOTA you might get for any PyTorch model, without ET having to own any model definition. cc: @kimishpatel
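To illustrate what a source-to-source transformation can look like when the model definition is owned in-tree, here is a purely hypothetical sketch; FusedSiLU and apply_source_transform are illustrative names, not ExecuTorch APIs.

```python
import torch

# Hypothetical example: swap every nn.SiLU for a custom implementation before
# calling torch.export. The point is that edits like this are straightforward
# when the model definition lives in-tree, but hard to apply generically to
# arbitrary OSS model code.
class FusedSiLU(torch.nn.Module):
    def forward(self, x):
        return x * torch.sigmoid(x)  # stand-in for a fused/optimized kernel

def apply_source_transform(model: torch.nn.Module) -> torch.nn.Module:
    for name, child in model.named_children():
        if isinstance(child, torch.nn.SiLU):
            setattr(model, name, FusedSiLU())
        else:
            apply_source_transform(child)
    return model
```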
anzr299 commented on May 28, 2025
Are the optimizations specifically targeted at the edge/on-device level?
I experimented with the exported FX models obtained by exporting HF Llama and the ExecuTorch example Llama, then compiling each with torch.compile(), and found that they have roughly the same performance.
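A comparison along these lines can be set up roughly as in the sketch below; the model id and timing harness are illustrative, not the exact experiment. Note it measures torch.compile performance on a host machine, not on-device ExecuTorch performance, which is where the ET-specific optimizations (KV cache handling, quantization, backend delegation) would apply.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()
compiled = torch.compile(model)

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    compiled(**inputs)  # warm-up triggers compilation
    start = time.perf_counter()
    for _ in range(10):
        compiled(**inputs)
    print(f"avg forward latency: {(time.perf_counter() - start) / 10:.4f}s")
```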