llama : only copy used KV cache in get / set state #1272
Conversation
llama.cpp (Outdated)
    {
        // copy k: k layout is n_layer > n_ctx (tokens) > n_embd
        const uint8_t * k_data = (uint8_t *) kv_self.k->data;
        const size_t elt_size = ggml_element_size(kv_self.k);

        for (int il = 0; il < n_layer; il++) {
            const size_t offset = il * n_ctx * n_embd * elt_size;
            const size_t size   = kv_ntok * n_embd * elt_size;
            memcpy(out, k_data + offset, size); out += size;
        }
    }

    {
        // copy v: v layout is n_layer > n_embd > n_ctx (tokens)
        const uint8_t * v_data = (uint8_t *) kv_self.v->data;
        const size_t elt_size = ggml_element_size(kv_self.v);
        const int n_layer_embd = n_layer * n_embd;

        for (int ile = 0; ile < n_layer_embd; ile++) {
            const size_t offset = ile * n_ctx * elt_size;
            const size_t size   = kv_ntok * elt_size;
            memcpy(out, v_data + offset, size); out += size;
        }
    }
Instead of writing the tensor copy code manually, why not use ggml?
I proposed something like this before:
{
    const size_t elt_size = ggml_element_size(kv_self.k);

    char buffer[4096]; // should be enough
    ggml_context * cpy_ctx = ggml_init({ sizeof(buffer), buffer, true });
    ggml_cgraph gf{};
    gf.n_threads = 1;

    ggml_tensor * kout3d = ggml_new_tensor_3d(cpy_ctx, kv_self.k->type, n_embd, kv_ntok, n_layer);
    kout3d->data = out;
    out += ggml_nbytes(kout3d);

    ggml_tensor * vout3d = ggml_new_tensor_3d(cpy_ctx, kv_self.v->type, kv_ntok, n_embd, n_layer);
    vout3d->data = out;
    out += ggml_nbytes(vout3d);

    ggml_tensor * k3d = ggml_view_3d(cpy_ctx, kv_self.k, n_embd, kv_ntok, n_layer,
        elt_size*n_embd, elt_size*n_embd*n_ctx, 0);
    ggml_tensor * v3d = ggml_view_3d(cpy_ctx, kv_self.v, kv_ntok, n_embd, n_layer,
        elt_size*n_ctx, elt_size*n_ctx*n_embd, 0);

    ggml_build_forward_expand(&gf, ggml_cpy(cpy_ctx, k3d, kout3d));
    ggml_build_forward_expand(&gf, ggml_cpy(cpy_ctx, v3d, vout3d));
    ggml_graph_compute(cpy_ctx, &gf);
}
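[Editor's note] For the set-state direction, the same ggml-based approach would presumably work in reverse: wrap the incoming buffer as 3-D tensors and ggml_cpy them into strided views of kv_self.k and kv_self.v. A rough sketch, not from the original discussion, assuming the same old-style ggml_cgraph / n_threads API; the names kin3d, vin3d, and the input pointer in are illustrative:

{
    const size_t elt_size = ggml_element_size(kv_self.k);

    char buffer[4096];
    ggml_context * cpy_ctx = ggml_init({ sizeof(buffer), buffer, true });
    ggml_cgraph gf{};
    gf.n_threads = 1;

    // wrap the serialized k/v data from the input buffer as 3-D tensors
    ggml_tensor * kin3d = ggml_new_tensor_3d(cpy_ctx, kv_self.k->type, n_embd, kv_ntok, n_layer);
    kin3d->data = (void *) in;
    in += ggml_nbytes(kin3d);

    ggml_tensor * vin3d = ggml_new_tensor_3d(cpy_ctx, kv_self.v->type, kv_ntok, n_embd, n_layer);
    vin3d->data = (void *) in;
    in += ggml_nbytes(vin3d);

    // strided views over the destination KV cache, matching its in-memory layout
    ggml_tensor * k3d = ggml_view_3d(cpy_ctx, kv_self.k, n_embd, kv_ntok, n_layer,
        elt_size*n_embd, elt_size*n_embd*n_ctx, 0);
    ggml_tensor * v3d = ggml_view_3d(cpy_ctx, kv_self.v, kv_ntok, n_embd, n_layer,
        elt_size*n_ctx, elt_size*n_ctx*n_embd, 0);

    ggml_build_forward_expand(&gf, ggml_cpy(cpy_ctx, kin3d, k3d));
    ggml_build_forward_expand(&gf, ggml_cpy(cpy_ctx, vin3d, v3d));
    ggml_graph_compute(cpy_ctx, &gf);
}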
This is the way!
Consider implementing @SlyEcho's idea - I also think it is better
Thanks! Will do.
Per comments in #1169 and #1247, reduces the size of serialized model state and session files by only storing the used portions of the KV cache. This corresponds to the tokens evaluated so far in the session, so the size scales with # of tokens, up to max state size at full context.
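[Editor's note] As a rough illustration of the scaling (the dimensions below are assumptions for a 7B-class model, not measurements from this PR): with n_layer = 32, n_embd = 4096, n_ctx = 2048 and f16 KV entries, the full cache occupies 2 × 32 × 2048 × 4096 × 2 bytes = 1 GiB, while after kv_ntok = 512 evaluated tokens only the first quarter of each row is copied, about 256 MiB.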
Changes
llama_copy_state_data and llama_set_state_data now use the number of evaluated tokens (as in llama_get_kv_cache_token_count) to (de)compact the KV cache data. As a result, while llama_get_state_size is unchanged, its semantics differ: it is now the maximum buffer size required by the get / set state API.
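[Editor's note] A minimal sketch of how a caller might use the API under these semantics; llama_get_state_size, llama_copy_state_data and llama_set_state_data are the real API entry points, while the helper functions and variable names below are hypothetical:

// sketch: save and restore session state with the compacted KV cache
#include "llama.h"
#include <cstdint>
#include <vector>

static std::vector<uint8_t> save_state(struct llama_context * ctx) {
    // upper bound: the size needed at full context
    std::vector<uint8_t> state(llama_get_state_size(ctx));
    // actual bytes written scale with the number of evaluated tokens
    const size_t written = llama_copy_state_data(ctx, state.data());
    state.resize(written);
    return state;
}

static void restore_state(struct llama_context * ctx, std::vector<uint8_t> & state) {
    // ctx must have been created with the same model and parameters
    llama_set_state_data(ctx, state.data());
}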
Testing
Ran chat-13B on 30B and 65B, cold and warm, and ensured the same output.
examples/save-load-state
Results
Original sizes:
New sizes: