llama : only copy used KV cache in get / set state #1272
Conversation
llama.cpp (Outdated)
    {
        // copy k: k layout is n_layer > n_ctx (tokens) > n_embd
        const uint8_t * k_data = (uint8_t *) kv_self.k->data;
        const size_t elt_size = ggml_element_size(kv_self.k);

        for (int il = 0; il < n_layer; il++) {
            const size_t offset = il * n_ctx * n_embd * elt_size;
            const size_t size   = kv_ntok * n_embd * elt_size;
            memcpy(out, k_data + offset, size); out += size;
        }
    }

    {
        // copy v: v layout is n_layer > n_embd > n_ctx (tokens)
        const uint8_t * v_data = (uint8_t *) kv_self.v->data;
        const size_t elt_size = ggml_element_size(kv_self.v);
        const int n_layer_embd = n_layer * n_embd;

        for (int ile = 0; ile < n_layer_embd; ile++) {
            const size_t offset = ile * n_ctx * elt_size;
            const size_t size   = kv_ntok * elt_size;
            memcpy(out, v_data + offset, size); out += size;
        }
    }
Instead of writing the tensor copy code manually, why not use ggml?
I proposed something like this before:
{
    const size_t elt_size = ggml_element_size(kv_self.k);

    char buffer[4096]; // should be enough
    ggml_context * cpy_ctx = ggml_init({ sizeof(buffer), buffer, true });
    ggml_cgraph gf{};
    gf.n_threads = 1;

    ggml_tensor * kout3d = ggml_new_tensor_3d(cpy_ctx, kv_self.k->type, n_embd, kv_ntok, n_layer);
    kout3d->data = out;
    out += ggml_nbytes(kout3d);

    ggml_tensor * vout3d = ggml_new_tensor_3d(cpy_ctx, kv_self.v->type, kv_ntok, n_embd, n_layer);
    vout3d->data = out;
    out += ggml_nbytes(vout3d);

    ggml_tensor * k3d = ggml_view_3d(cpy_ctx, kv_self.k, n_embd, kv_ntok, n_layer,
        elt_size*n_embd, elt_size*n_embd*n_ctx, 0);
    ggml_tensor * v3d = ggml_view_3d(cpy_ctx, kv_self.v, kv_ntok, n_embd, n_layer,
        elt_size*n_ctx, elt_size*n_ctx*n_embd, 0);

    ggml_build_forward_expand(&gf, ggml_cpy(cpy_ctx, k3d, kout3d));
    ggml_build_forward_expand(&gf, ggml_cpy(cpy_ctx, v3d, vout3d));
    ggml_graph_compute(cpy_ctx, &gf);
}
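[Editor's note] For the set-state direction, the same ggml-based approach would presumably work in reverse: wrap the incoming buffer as 3-D tensors and ggml_cpy them into strided views of kv_self.k and kv_self.v. A rough sketch, not from the original discussion, assuming the same old-style ggml_cgraph / n_threads API; the names kin3d, vin3d, and the input pointer in are illustrative:

{
    const size_t elt_size = ggml_element_size(kv_self.k);

    char buffer[4096];
    ggml_context * cpy_ctx = ggml_init({ sizeof(buffer), buffer, true });
    ggml_cgraph gf{};
    gf.n_threads = 1;

    // wrap the serialized k/v data from the input buffer as 3-D tensors
    ggml_tensor * kin3d = ggml_new_tensor_3d(cpy_ctx, kv_self.k->type, n_embd, kv_ntok, n_layer);
    kin3d->data = (void *) in;
    in += ggml_nbytes(kin3d);

    ggml_tensor * vin3d = ggml_new_tensor_3d(cpy_ctx, kv_self.v->type, kv_ntok, n_embd, n_layer);
    vin3d->data = (void *) in;
    in += ggml_nbytes(vin3d);

    // strided views over the destination KV cache, matching its in-memory layout
    ggml_tensor * k3d = ggml_view_3d(cpy_ctx, kv_self.k, n_embd, kv_ntok, n_layer,
        elt_size*n_embd, elt_size*n_embd*n_ctx, 0);
    ggml_tensor * v3d = ggml_view_3d(cpy_ctx, kv_self.v, kv_ntok, n_embd, n_layer,
        elt_size*n_ctx, elt_size*n_ctx*n_embd, 0);

    ggml_build_forward_expand(&gf, ggml_cpy(cpy_ctx, kin3d, k3d));
    ggml_build_forward_expand(&gf, ggml_cpy(cpy_ctx, vin3d, v3d));
    ggml_graph_compute(cpy_ctx, &gf);
}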
This is the way!
Consider implementing @SlyEcho's idea - I also think it is better
Thanks! Will do.
Per comments in #1169 and #1247, reduces the size of serialized model state and session files by only storing the used portions of the KV cache. This corresponds to the tokens evaluated so far in the session, so the size scales with # of tokens, up to max state size at full context.
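[Editor's note] As a rough illustration of the scaling (the dimensions below are assumptions for a 7B-class model, not measurements from this PR): with n_layer = 32, n_embd = 4096, n_ctx = 2048 and f16 KV entries, the full cache occupies 2 × 32 × 2048 × 4096 × 2 bytes = 1 GiB, while after kv_ntok = 512 evaluated tokens only the first quarter of each row is copied, about 256 MiB.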
Changes
llama_copy_state_data and llama_set_state_data now use the number of evaluated tokens (as in llama_get_kv_cache_token_count) to (de)compact the KV cache data. As a result, while llama_get_state_size is unchanged, its semantics differ: it is now the maximum buffer size required by the get / set state API.
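[Editor's note] A minimal sketch of how a caller might use the API under these semantics; llama_get_state_size, llama_copy_state_data and llama_set_state_data are the real API entry points, while the helper functions and variable names below are hypothetical:

// sketch: save and restore session state with the compacted KV cache
#include "llama.h"
#include <cstdint>
#include <vector>

static std::vector<uint8_t> save_state(struct llama_context * ctx) {
    // upper bound: the size needed at full context
    std::vector<uint8_t> state(llama_get_state_size(ctx));
    // actual bytes written scale with the number of evaluated tokens
    const size_t written = llama_copy_state_data(ctx, state.data());
    state.resize(written);
    return state;
}

static void restore_state(struct llama_context * ctx, std::vector<uint8_t> & state) {
    // ctx must have been created with the same model and parameters
    llama_set_state_data(ctx, state.data());
}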
Testing
Ran chat-13B on 30B and 65B, cold and warm, and ensured the same output.
examples/save-load-state
Results
Original sizes:
New sizes: