llama : only copy used KV cache in get / set state #1272

Merged · 3 commits · May 3, 2023

Conversation

ejones (Collaborator) commented on May 2, 2023

Per comments in #1169 and #1247, this reduces the size of serialized model state and session files by only storing the used portion of the KV cache. That portion corresponds to the tokens evaluated so far in the session, so the size scales with the number of tokens, up to the maximum state size at full context.

Changes

  • llama_copy_state_data and llama_set_state_data now use the number of evaluated tokens (as returned by llama_get_kv_cache_token_count) to compact and re-expand the KV cache data. llama_get_state_size itself is unchanged, but its semantics differ: it now returns the maximum buffer size required by the get / set state API (see the usage sketch after this list)
  • the session file version has been bumped, so existing session files are invalidated
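
For illustration, here is a minimal sketch of how a caller might drive the API after this change. save_state / load_state are hypothetical helpers (not part of this PR) and error handling is kept to a minimum; the key point is that only the bytes actually written by llama_copy_state_data need to be persisted, even though the buffer is sized with llama_get_state_size:

#include "llama.h"

#include <cstdio>
#include <vector>

// Save the context state; llama_copy_state_data may write fewer bytes
// than llama_get_state_size, and only those bytes need to be stored.
static bool save_state(struct llama_context * ctx, const char * path) {
    std::vector<uint8_t> buf(llama_get_state_size(ctx)); // maximum possible size
    const size_t n_written = llama_copy_state_data(ctx, buf.data());

    std::FILE * fp = std::fopen(path, "wb");
    if (!fp) {
        return false;
    }
    const bool ok = std::fwrite(buf.data(), 1, n_written, fp) == n_written;
    std::fclose(fp);
    return ok;
}

// Restore the context state from a file produced by save_state.
static bool load_state(struct llama_context * ctx, const char * path) {
    std::FILE * fp = std::fopen(path, "rb");
    if (!fp) {
        return false;
    }
    std::fseek(fp, 0, SEEK_END);
    const long n_bytes = std::ftell(fp);
    std::fseek(fp, 0, SEEK_SET);

    std::vector<uint8_t> buf(n_bytes);
    const bool read_ok = std::fread(buf.data(), 1, buf.size(), fp) == buf.size();
    std::fclose(fp);
    if (!read_ok) {
        return false;
    }
    llama_set_state_data(ctx, buf.data());
    return true;
}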

Testing

  • Tested with a small prompt and the chat-13B script against the 30B and 65B models; ran each cold and warm and confirmed identical output:
./main -m ~/llama-models/30B/ggml-model-q4_0.bin --seed 1 -p 'The meaning of life is 4' --session meaning-life-is.30.v1.bin -n 10
./examples/chat-13B.sh -m ~/llama-models/30B/ggml-model-q4_0.bin --session chat-session-30.v1.bin --seed 1
  • Tested examples/save-load-state

Results

Original sizes:

% du -hs *.bin
3.1G	chat-session-30.bin
5.0G	chat-session-65.bin

New sizes:

782M	chat-session-30.v1.bin
1.2G	chat-session-65.v1.bin
 12M	meaning-life-is.30.v1.bin
 20M	meaning-life-is.65.v1.bin
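
As a rough sanity check (my own back-of-the-envelope numbers, not from the PR): with f16 K and V, the full-context KV cache is roughly 2 * n_layer * n_embd * n_ctx * 2 bytes. Assuming standard LLaMA dimensions and n_ctx = 2048:

30B (n_layer = 60, n_embd = 6656): 2 * 60 * 6656 * 2048 * 2 B ≈ 3.0 GiB
65B (n_layer = 80, n_embd = 8192): 2 * 80 * 8192 * 2048 * 2 B ≈ 5.0 GiB

That lines up with the original full-size session files above; after this change only the kv_ntok evaluated token positions are written, hence the much smaller files for short sessions.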

ejones requested a review from ggerganov on May 2, 2023 at 04:06
Review comment on llama.cpp, lines 2483 to 2506 (outdated):
    {
        // copy k: k layout is n_layer > n_ctx (tokens) > n_embd
        const uint8_t * k_data = (uint8_t *) kv_self.k->data;
        const size_t elt_size = ggml_element_size(kv_self.k);

        for (int il = 0; il < n_layer; il++) {
            const size_t offset = il * n_ctx * n_embd * elt_size;
            const size_t size   = kv_ntok * n_embd * elt_size;
            memcpy(out, k_data + offset, size); out += size;
        }
    }

    {
        // copy v: v layout is n_layer > n_embd > n_ctx (tokens)
        const uint8_t * v_data = (uint8_t *) kv_self.v->data;
        const size_t elt_size = ggml_element_size(kv_self.v);
        const int n_layer_embd = n_layer * n_embd;

        for (int ile = 0; ile < n_layer_embd; ile++) {
            const size_t offset = ile * n_ctx * elt_size;
            const size_t size   = kv_ntok * elt_size;
            memcpy(out, v_data + offset, size); out += size;
        }
    }
@SlyEcho (Collaborator) commented:

Instead of writing the tensor copy code manually, why not use ggml?

I proposed something like this before:

{
    // scratch ggml context used only to build the copy graph
    const size_t elt_size = ggml_element_size(kv_self.k);
    char buffer[4096]; // should be enough
    ggml_context * cpy_ctx = ggml_init({ sizeof(buffer), buffer, /*no_alloc =*/ true });
    ggml_cgraph gf{};
    gf.n_threads = 1;

    // destination tensors whose data points directly into the output buffer
    ggml_tensor * kout3d = ggml_new_tensor_3d(cpy_ctx, kv_self.k->type, n_embd, kv_ntok, n_layer);
    kout3d->data = out;
    out += ggml_nbytes(kout3d);

    ggml_tensor * vout3d = ggml_new_tensor_3d(cpy_ctx, kv_self.v->type, kv_ntok, n_embd, n_layer);
    vout3d->data = out;
    out += ggml_nbytes(vout3d);

    // 3D views over only the used (kv_ntok) portion of the KV cache
    ggml_tensor * k3d = ggml_view_3d(cpy_ctx, kv_self.k, n_embd, kv_ntok, n_layer, elt_size*n_embd, elt_size*n_embd*n_ctx, 0);
    ggml_tensor * v3d = ggml_view_3d(cpy_ctx, kv_self.v, kv_ntok, n_embd, n_layer, elt_size*n_ctx, elt_size*n_ctx*n_embd, 0);

    // let ggml handle the strided copies into the compact output layout
    ggml_build_forward_expand(&gf, ggml_cpy(cpy_ctx, k3d, kout3d));
    ggml_build_forward_expand(&gf, ggml_cpy(cpy_ctx, v3d, vout3d));
    ggml_graph_compute(cpy_ctx, &gf);
}
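
Presumably the same pattern can be mirrored in llama_set_state_data by swapping source and destination: point freshly created 3D tensors at the input buffer and copy them into views of kv_self. A sketch along those lines (inp is the read cursor into the incoming state buffer; cpy_ctx, gf and elt_size set up as above):

ggml_tensor * kin3d = ggml_new_tensor_3d(cpy_ctx, kv_self.k->type, n_embd, kv_ntok, n_layer);
kin3d->data = (void *) inp;
inp += ggml_nbytes(kin3d);

ggml_tensor * vin3d = ggml_new_tensor_3d(cpy_ctx, kv_self.v->type, kv_ntok, n_embd, n_layer);
vin3d->data = (void *) inp;
inp += ggml_nbytes(vin3d);

ggml_tensor * k3d = ggml_view_3d(cpy_ctx, kv_self.k, n_embd, kv_ntok, n_layer, elt_size*n_embd, elt_size*n_embd*n_ctx, 0);
ggml_tensor * v3d = ggml_view_3d(cpy_ctx, kv_self.v, kv_ntok, n_embd, n_layer, elt_size*n_ctx, elt_size*n_ctx*n_embd, 0);

// copy the compact buffers back into the used slots of the KV cache
ggml_build_forward_expand(&gf, ggml_cpy(cpy_ctx, kin3d, k3d));
ggml_build_forward_expand(&gf, ggml_cpy(cpy_ctx, vin3d, v3d));
ggml_graph_compute(cpy_ctx, &gf);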

ggerganov (Member) left a comment:

This is the way!

Consider implementing @SlyEcho's idea - I also think it is better

ejones (Collaborator, Author) commented on May 2, 2023

Thanks! Will do.
