Stacked cache for MLPerf #154
Merged
Conversation
5fbcd47 to 134247c
FanhaiLu1 approved these changes on Jul 20, 2024
qihqi pushed a commit that referenced this pull request on Jul 23, 2024:
* Almost working except the mask; need to rebase to main to pick up the ring buffer support, then fix the mask. Int8 updates also included but not tested.
* Fix test_model_impl for llama, but test_llama_e2e is still failing.
* Add lazy_cache_update and restructure the cache flags.
* Disable all the prints. Fix create engine.
* Fix typos and minor errors.
* Fix create engine.
* Add new_cache_stacked and fix the cache update.
* Fix the cache update when new_cache_stacked is False.
* Fix the cache manager and make unit tests pass except for one.
* Update the exportable model to return the cache.
* Remove the fori loop in cache finalize. Move cache.finalize() to the end of the existing cache attention.
* Try to use shard_map for the cache update.
* Fix updating a single cache line in cache.finalize().
* Add int8 support.
* Int8 left-aligned lazy cache update working; performance is still not good enough.
* Fix the stacked cache introduced in the previous couple of commits.
* Put the original ragged attention back.
* Add the original ragged attention kernel.
* Fix the bf16/int8 cache stack.
* Fix int8 stacked cache insertion in the engine and in finalization.
* Fix int8 with the lazy cache update.
* Update the int8 test.
* Fix the int8 ragged attention output sharding.
* Fix the group query attention broadcasting issue.
* Fix the shard_map input issue: variables not listed as inputs are frozen into the jit function.
* Fix the flash attention mask shape; fix the quantized version of the single-cache-line update.
* Add the kv cache test.
* Replace the quantized cache "pos" with "input_pos" to align with the bf16 cache. Fix the kv cache quantization test.
* Fix the prefill cache insertion issue for the stacked cache; change the reduce dims for quantization from (1, 3) to (-3, -1) to make it more robust.
* Add the lazy cache update with the generate cache stacked and the new cache unstacked, for performance validation.
* Fix the shard_map sharding for the stacked generate cache and the unstacked new cache.
* Use the JAX API for slicing instead of PyTorch index slicing (see the sketch after this list).
* Add stacked cache support in the ragged attention reference kernel.
* Add stacked cache support for the modified ragged kernel.
* Llama2 70b int8 optimization done; output not correct yet.
* Remove temporary test output files.
* Fix the llama 70b output accuracy issue resulting from GQA.
* Fix the attention output slicing issue when not using flash attention. Refactor to use only one flash attention kernel. Change the modified ring buffer ragged attention kernel to handle quantization, layers, etc.
* Fix the pallas kernel OOB issue.
* Fix tests; fix lint issues.
* Fix the interactive script.
* Fix lint errors.
* Fix errors.
* Fix the comments.
* Fix based on comments; fix all the unit tests.
* Fix the remaining pylint errors.
* Default the ring buffer back to true so that all the test_run_server and run_interactive runs in CPU mode work. When we default the ring buffer to false, we should add additional flags to the run_interactive CI to set test mode to true so that the pallas kernel can run.
* Fix all the lint errors.
* Remove the deps/JetStream changes.
* Fix merge errors; fix lint errors.
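Several items in the commit list above revolve around the same few techniques: writing a single new cache line into a stacked KV cache with the JAX update API rather than PyTorch index slicing, passing the cache to the jitted update as an explicit argument (values merely closed over get frozen into the compiled function as constants), and reducing over negative axes (-3, -1) when computing int8 scales so the same code works with or without a leading layer dimension. The sketch below is illustrative only and is not taken from the PR; the function names update_cache_line and quantize_int8 and the (layers, batch, heads, seq, head_dim) cache layout are assumptions.

```python
import jax
import jax.numpy as jnp


@jax.jit
def update_cache_line(cache, new_kv, input_pos):
  """Insert one token's K/V into a stacked cache.

  Hypothetical layout: cache is (layers, batch, heads, seq, head_dim) and
  new_kv is (layers, batch, heads, 1, head_dim).

  The cache is an explicit argument: if it were captured from the enclosing
  Python scope instead, jax.jit would freeze it into the compiled function
  as a constant, which is the jit/shard_map input issue noted above.
  """
  # Write new_kv at position `input_pos` along the sequence axis using the
  # JAX update API instead of in-place, PyTorch-style index assignment.
  return jax.lax.dynamic_update_slice_in_dim(cache, new_kv, input_pos, axis=3)


def quantize_int8(x):
  """Symmetric int8 quantization, reducing over axes (-3, -1).

  For a (batch, heads, seq, head_dim) cache these are the heads and head_dim
  axes; using negative axes keeps the reduction correct when a leading layer
  axis is stacked on, whereas positional axes (1, 3) would not.
  """
  scale = jnp.max(jnp.abs(x), axis=(-3, -1), keepdims=True) / 127.0
  scale = jnp.maximum(scale, 1e-8)  # avoid division by zero for all-zero slices
  q = jnp.clip(jnp.round(x / scale), -128, 127).astype(jnp.int8)
  return q, scale
```

Per the commit list, the PR additionally wraps the real cache update in shard_map and adds int8, ring-buffer, and ragged-attention variants; the sketch shows only the core indexing and scaling pattern.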
Same as #151; checked in to this branch for MLPerf.