Add et version of TorchTune MHA for swapping with custom op #5912


Closed
jackzhxng wants to merge 9 commits

Conversation

Contributor

@jackzhxng jackzhxng commented Oct 5, 2024

DRAFT

Summary

Add a version of the TorchTune MHA which factors out the transposes, repeat_interleaves, KV-cache updates, and SDPA torch ops so that they can be replaced by the custom sdpa_with_kv_cache op.
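For context, this follows the usual PyTorch source-transformation pattern: once the target ops live in their own small nn.Module, the exporter can find and replace them by type. A minimal sketch with illustrative names (not the actual ExecuTorch or TorchTune classes):

```
import torch.nn as nn
import torch.nn.functional as F

class SDPA(nn.Module):
    # Stand-in for the factored-out SDPA sub-module (illustrative name only).
    def forward(self, q, k, v, mask=None):
        return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

def replace_sdpa(module: nn.Module, make_replacement) -> nn.Module:
    # Recursively swap every SDPA child for a module backed by the custom op.
    for name, child in module.named_children():
        if isinstance(child, SDPA):
            setattr(module, name, make_replacement(child))
        else:
            replace_sdpa(child, make_replacement)
    return module
```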

Command to export:

```
python -m examples.models.llama2.export_llama --model llama3_2_vision --checkpoint examples/models/llama3_2_vision/consolidated.pth --params examples/models/llama3_2_vision/params/demo_config.json -kv -X -d bf16 --metadata '{"append_eos_to_prompt": 0, "get_bos_id":128000, "get_eos_ids":[128009, 128001], "get_n_bos": 0, "get_n_eos": 0}' --output_name="llama3_2_vision.pte" --use_kv_cache --quantize_kv_cache --use_sdpa_with_kv_cache
```

Test plan

Tested eager and ExecuTorch execution with the test plan described in #6610.

PR chain:

- [Add kwarg example inputs to eager model base](#5765)
- [Llama2 model cleanup](#5859)
- [Accept model type parameter in export_llama](#5910)
- [Export TorchTune llama3_2_vision in ET](#5911)
- [Runner changes for TorchTune Llama3.2 vision text decoder](#6610)
- **YOU ARE HERE ~>** [Add et version of TorchTune MHA for swapping with custom op](#5912)


pytorch-bot bot commented Oct 5, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/5912

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit 3145bde with merge base 8f9fb7e:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed label Oct 5, 2024
@jackzhxng jackzhxng force-pushed the jz/tt-llama-3 branch 2 times, most recently from 1cbed8a to 80aa6d1 Compare October 8, 2024 20:02
facebook-github-bot pushed a commit that referenced this pull request Oct 9, 2024
Summary:
For situations where the forward has non-positional (keyword) arguments, such as https://github.com/pytorch/torchtune/blob/3c450ef5f1fbe8237f899e942fd5222491a47ca7/torchtune/modules/transformer.py#L519

PR chain:
- **YOU ARE HERE ~>** [Add kwarg example inputs to eager model base](#5765)
- [Llama2 model cleanup](#5859)
- [Accept model type parameter in export_llama](#5910)
- [Export TorchTune llama3_2_vision in ET](#5911)
- [Add et version of TorchTune MHA for swapping with custom op](#5912)

Pull Request resolved: #5765

Test Plan:
Exported Stories110M model.
```
wget "https://huggingface.co/karpathy/tinyllamas/resolve/main/stories110M.pt"
echo '{"dim": 768, "multiple_of": 32, "n_heads": 12, "n_layers": 12, "norm_eps": 1e-05, "vocab_size": 32000}' > params.json
python -m examples.models.llama2.export_llama -c stories110M.pt -p params.json -X -kv
```

Reviewed By: tarun292

Differential Revision: D64027696

Pulled By: dvorjackz

fbshipit-source-id: 15ecfb458c6194159140d4c601e5443a2e524fdc
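The kwarg example inputs this commit adds fit the standard torch.export interface, which takes keyword arguments as a separate example dict. A toy sketch of the pattern, with a hypothetical module standing in for the TorchTune forward linked above:

```
import torch

class Toy(torch.nn.Module):
    # Hypothetical forward with a keyword-only argument, like the
    # TorchTune transformer forward linked in the summary.
    def forward(self, tokens, *, input_pos=None):
        return tokens if input_pos is None else tokens + input_pos

ep = torch.export.export(
    Toy(),
    args=(torch.zeros(1, 4, dtype=torch.long),),
    kwargs={"input_pos": torch.tensor([0])},  # the kwarg example inputs
)
```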
facebook-github-bot pushed a commit that referenced this pull request Oct 15, 2024
Summary:
- Removes redundant steps in the Llama2 export
- Factors out checkpointing to be shared with future Llama models (namely 3.2 multimodal)
- Comments and orders code more clearly

PR chain:
- [Add kwarg example inputs to eager model base](#5765)
- **YOU ARE HERE ~>** [Llama2 model cleanup](#5859)
- [Accept model type parameter in export_llama](#5910)
- [Export TorchTune llama3_2_vision in ET](#5911)
- [Add et version of TorchTune MHA for swapping with custom op](#5912)

Pull Request resolved: #5859

Test Plan:
Ensure export + eval is similar before and after for Stories 110M:
```
python -m examples.models.llama2.eval_llama -c <checkpoint.pth> -p <params.json> -t <tokenizer.model/bin> -d fp32 --max_seq_len 2048 --limit 1000
```

Before:
```
wikitext: {'word_perplexity,none': 14464.645927166595, 'word_perplexity_stderr,none': 'N/A', 'byte_perplexity,none': 5.99788806086652, 'byte_perplexity_stderr,none': 'N/A', 'bits_per_byte,none': 2.5844545973083983, 'bits_per_byte_stderr,none': 'N/A', 'alias': 'wikitext'}
```

After:
```
wikitext: {'word_perplexity,none': 14464.299192404438, 'word_perplexity_stderr,none': 'N/A', 'byte_perplexity,none': 5.997861173678705, 'byte_perplexity_stderr,none': 'N/A', 'bits_per_byte,none': 2.584448130015399, 'bits_per_byte_stderr,none': 'N/A', 'alias': 'wikitext'}
```

Reviewed By: malfet, dbort

Differential Revision: D64145852

Pulled By: dvorjackz

fbshipit-source-id: daeee834955e154e7c8262ce776bd3039991027d
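As a sanity check on the metrics above: bits_per_byte is log2 of byte_perplexity, and both the before and after rows are internally consistent:

```
import math

# bits_per_byte should equal log2(byte_perplexity):
print(math.log2(5.99788806086652))   # ~2.58445 (before, matches 2.5844545...)
print(math.log2(5.997861173678705))  # ~2.58445 (after,  matches 2.5844481...)
```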
@jackzhxng jackzhxng changed the base branch from jz/tt-llama-2 to jz/native-runner-tt November 1, 2024 18:52
facebook-github-bot pushed a commit that referenced this pull request Nov 13, 2024
Summary:
Specify model to export in the CLI.


Test Plan:
Exported the stories 110M model.
```
python -m examples.models.llama.export_llama -c stories110M/stories110M.pt -p stories110M/params.json -X -kv
```

PR chain:
- [Add kwarg example inputs to eager model base](#5765)
- [Llama2 model cleanup](#5859)
- **YOU ARE HERE ~>** [Accept model type parameter in export_llama](#5910)
- [Export TorchTune llama3_2_vision in ET](#5911)
- [Runner changes for TorchTune Llama3.2 vision text decoder](#6610)
- [Add et version of TorchTune MHA for swapping with custom op](#5912)

Reviewed By: helunwencser

Differential Revision: D65612837

Pulled By: dvorjackz
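The --model flag implies a lookup from model name to example-model package. A minimal sketch of that dispatch, using a hypothetical registry (the real export_llama wiring may be structured differently):

```
import argparse

# Hypothetical name -> package registry; illustrative values only.
MODEL_REGISTRY = {
    "llama2": "examples.models.llama",
    "llama3_2_vision": "examples.models.llama3_2_vision",
}

parser = argparse.ArgumentParser()
parser.add_argument("--model", choices=sorted(MODEL_REGISTRY), default="llama2")
args = parser.parse_args(["--model", "llama3_2_vision"])
print(MODEL_REGISTRY[args.model])  # package providing the model builder
```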
@jackzhxng jackzhxng closed this Nov 13, 2024