
Why are models like Llama implemented again in examples/models/ ? #11130

Description

@anzr299

Hi everyone,

I noticed that in ExecuTorch, the models are also implemented in examples/models/, which is used to export the models to an ExportedProgram. My question is: how is this model implementation different from the HF implementation, and how does it relate to the optimum-executorch flow for exporting models?

cc @mergennachin @iseeyuan @lucylq @helunwencser @tarun292 @kimishpatel @jackzhxng

Activity

Changed the title from "Why do we implement models like Llama again in examples/models/ and how is it different from the HF implementation for the same?" to "Why do we implement models like Llama again in examples/models/?" on May 26, 2025

Changed the title from "Why do we implement models like Llama again in examples/models/?" to "Why are models like Llama implemented again in examples/models/?" on May 27, 2025

Added labels "module: examples" (Issues related to demos under examples/) and "triaged" (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on May 27, 2025
SS-JIA (Contributor) commented on May 27, 2025

My understanding is that the examples/models/ directory showcases how models can be developed and exported to ExecuTorch exclusively using the PyTorch + ExecuTorch ecosystem. They are useful as technical references, e.g. for how to lower models to specific backends, how to apply specific quantization schemes, etc. Furthermore, for Llama models we also provide an example runner binary to showcase how an exported Llama model can be executed with the ET runtime to evaluate a prompt and generate output.
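For reference, here is a minimal sketch of the kind of flow those references walk through, assuming recent ExecuTorch APIs; the exact module paths and method names may differ between releases:

```python
import torch
from executorch.exir import to_edge
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

# Stand-in for a model definition from examples/models/.
class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 16)

    def forward(self, x):
        return torch.relu(self.linear(x))

model = TinyModel().eval()
example_inputs = (torch.randn(1, 16),)

# 1. Capture the eager model into an ExportedProgram.
exported = torch.export.export(model, example_inputs)

# 2. Convert to the Edge dialect and delegate supported subgraphs to a backend
#    (XNNPACK here; other backends ship their own partitioners).
edge = to_edge(exported)
edge = edge.to_backend(XnnpackPartitioner())

# 3. Serialize to a .pte file that the ExecuTorch runtime (and the example
#    runner binaries) can load.
et_program = edge.to_executorch()
with open("tiny_model.pte", "wb") as f:
    f.write(et_program.buffer)
```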

@guangy10 may be able to add more context regarding the differences with the optimum-executorch export flow. My understanding is that this flow is intended to be a more seamless user experience for those who are mainly interested in deploying an LLM via ExecuTorch, without needing to dive deeper into the technical details of what's going on under the hood.
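For comparison, a rough sketch of that optimum-executorch experience; the class name, the `recipe` argument, and the `text_generation` helper are taken from my reading of the optimum-executorch README and may differ between releases:

```python
from transformers import AutoTokenizer
from optimum.executorch import ExecuTorchModelForCausalLM

model_id = "meta-llama/Llama-3.2-1B"  # example model id, for illustration only
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Exports the native HF model and lowers it to ExecuTorch in one call,
# without a hand-maintained model definition.
model = ExecuTorchModelForCausalLM.from_pretrained(model_id, recipe="xnnpack")

# Runs the exported program on the ExecuTorch runtime under the hood.
print(model.text_generation(tokenizer=tokenizer, prompt="Hello, my name is", max_seq_len=64))
```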

Moved this from "To triage" to "In progress" in ExecuTorch Core on May 27, 2025
anzr299 (Author) commented on May 27, 2025

> My understanding is that the examples/models/ directory showcases how models can be developed and exported to ExecuTorch exclusively using the PyTorch + ExecuTorch ecosystem. They are useful as technical references, e.g. for how to lower models to specific backends, how to apply specific quantization schemes, etc. Furthermore, for Llama models we also provide an example runner binary to showcase how an exported Llama model can be executed with the ET runtime to evaluate a prompt and generate output.
>
> @guangy10 may be able to add more context regarding the differences with the optimum-executorch export flow. My understanding is that this flow is intended to be a more seamless user experience for those who are mainly interested in deploying an LLM via ExecuTorch, without needing to dive deeper into the technical details of what's going on under the hood.

Thank you for the great explanation, Sicheng!
Although I get your point, my question is about the model implementation itself:

class Llama2Model(EagerModelBase):

I see that the code rewritten here closely resembles the Llama model code in transformers, except for a few changes in RoPE, whose motivation is also unclear.
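For context, a hypothetical sketch of what an `EagerModelBase` wrapper looks like, based on my reading of the examples tree; the import path and method names may not match the current code exactly:

```python
import torch
from executorch.examples.models.model_base import EagerModelBase  # path may vary by version

class MyTinyModel(EagerModelBase):  # hypothetical wrapper, for illustration only
    def __init__(self):
        # In the real examples this is the in-tree Llama definition, which is
        # where ET-specific rewrites (e.g. the RoPE changes, KV-cache handling) live.
        self.model = torch.nn.Sequential(
            torch.nn.Embedding(32, 8),
            torch.nn.Linear(8, 32),
        )

    def get_eager_model(self) -> torch.nn.Module:
        # Hands the eager module to the export/lowering flow.
        return self.model

    def get_example_inputs(self):
        # Example inputs used when capturing the model with torch.export.
        return (torch.tensor([[1, 2, 3]], dtype=torch.long),)
```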

jackzhxng (Contributor) commented on May 27, 2025

@anzr299 this is a version of Llama that we maintain ourselves and are actively optimizing for performance. It is also the backbone for some of the other models we support, such as Qwen3 and Phi-4-mini. You are also able to use the HF version directly, although it would not leverage all the optimizations that we are doing.

guangy10 (Contributor) commented on May 27, 2025

@anzr299 LLMs in examples/models/ are lowered using the model definition you pointed to. The main reason is perf: with an in-house model definition, there is more flexibility to apply source-to-source transformations/optimizations (which cannot easily be generalized to arbitrary PyTorch models in OSS).

Regarding the models in optimum-et, that work is done through the partnership with Hugging Face to support lowering native Hugging Face models (we do NOT own the model definitions) to ET with reasonable out-of-the-box perf. It does not only focus on decoder-only LLMs, but also on the breadth of various models in OSS and on the composability of various optimizations, e.g. torchao, export, and customization of HF models, to gradually boost the out-of-the-box perf to ~80% of the SOTA you may get for any PyTorch model, without ET having to own any model definition. cc: @kimishpatel
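As a generic illustration (not ExecuTorch's actual code) of what such a source-to-source transformation can look like when the model definition is owned in-tree, one can swap a module for an export-friendly variant before calling torch.export:

```python
import torch

class NaiveAttention(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.qkv = torch.nn.Linear(dim, 3 * dim)
        self.out = torch.nn.Linear(dim, dim)

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return self.out(attn @ v)

class SDPAAttention(NaiveAttention):
    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Fused kernel that backends/quantizers can recognize more easily.
        return self.out(torch.nn.functional.scaled_dot_product_attention(q, k, v))

def swap_attention(model: torch.nn.Module) -> torch.nn.Module:
    # Walk the module tree and replace NaiveAttention with the optimized variant,
    # reusing the original weights.
    for name, child in model.named_children():
        if isinstance(child, NaiveAttention):
            new = SDPAAttention(child.qkv.in_features)
            new.load_state_dict(child.state_dict())
            setattr(model, name, new)
        else:
            swap_attention(child)
    return model
```

This kind of rewrite is straightforward when the model definition is controlled in-tree, but hard to apply generically to arbitrary third-party model code, which is the trade-off described above.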

anzr299 (Author) commented on May 28, 2025

> @anzr299 this is a version of Llama that we maintain ourselves and are actively optimizing for performance. It is also the backbone for some of the other models we support, such as Qwen3 and Phi-4-mini. You are also able to use the HF version directly, although it would not leverage all the optimizations that we are doing.

Is the optimization specifically targeted at the edge level?
I have experimented with the exported FX models obtained by exporting the HF Llama and the ExecuTorch example Llama and compiling them with torch.compile().
I found that they have roughly the same performance when compiled with torch.compile().
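A rough sketch of that comparison, for concreteness; the model constructors are placeholders for the two Llama definitions:

```python
import time
import torch

def benchmark(model, inputs, iters=50):
    compiled = torch.compile(model)
    with torch.no_grad():
        compiled(*inputs)                      # warm-up (triggers compilation)
        start = time.perf_counter()
        for _ in range(iters):
            compiled(*inputs)
    return (time.perf_counter() - start) / iters

# hf_llama = ...   # e.g. transformers AutoModelForCausalLM (placeholder)
# et_llama = ...   # e.g. the examples/models/llama definition (placeholder)
# tokens = (torch.randint(0, 32000, (1, 128)),)
# print("HF:", benchmark(hf_llama, tokens), "ET:", benchmark(et_llama, tokens))
```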
