[Question] How to use kv cache? #14

Closed
SagsMug opened this issue Apr 3, 2023 · 5 comments

@SagsMug
Contributor

SagsMug commented Apr 3, 2023

Hello!

I have been trying to test the new kv cache loading and ran into an issue: it seems to segfault when running llama_eval.
To save the current cache I do:

import llama_cpp
import pickle
from ctypes import cast
# Some work...
kv_tokens = llama_cpp.llama_get_kv_cache_token_count(ctx)
kv_len = llama_cpp.llama_get_kv_cache_size(ctx)
kv_cache = llama_cpp.llama_get_kv_cache(ctx) 
kv_cache = cast(kv_cache, llama_cpp.POINTER(llama_cpp.c_uint8 * kv_len))
kv_cache = bytearray(kv_cache.contents)  # copy the raw cache bytes out of the ctypes buffer
with open("test.bin", "wb") as f:
    pickle.dump([kv_cache,kv_tokens], f)

Loading:

with open("test.bin", "rb") as f:
    kv_cache, kv_tokens = pickle.load(f)
    llama_cpp.llama_set_kv_cache(
        ctx,
        (llama_cpp.c_uint8 * len(kv_cache)).from_buffer(kv_cache),
        len(kv_cache),
        kv_tokens,
    )

But running llama_cpp.llama_eval afterwards results in a segfault.

llama-cpp-python version: 0.1.16

How do I fix this?
Thanks

@abetlen
Owner

abetlen commented Apr 3, 2023

Unfortunately it looks like the kv_state API is still not enough to restore the model state; see the linked issue.

This is something I'm following, so once it's possible I'll drop an example.

@abetlen
Owner

abetlen commented Apr 8, 2023

@SagsMug well, this is embarrassing: my C++ example from that thread apparently was correct, haha. Here is the general structure.

import time
import llama_cpp

llama = llama_cpp.Llama(model_path="../../models/ggml-alpaca.bin")
prompt = llama.tokenize(b"The quick brown fox")
# Reset the model
llama.reset()
# Feed the prompt
t1 = time.time()
llama.eval(prompt)
t2 = time.time()
### Save model state
assert llama.ctx is not None

# Save kv cache
kv_cache_token_count = llama_cpp.llama_get_kv_cache_token_count(llama.ctx)
kv_cache_size = llama_cpp.llama_get_kv_cache_size(llama.ctx)
kv_cache = llama_cpp.llama_get_kv_cache(llama.ctx)
kv_cache_new = (llama_cpp.c_uint8 * int(kv_cache_size))()
llama_cpp.ctypes.memmove(kv_cache_new, kv_cache, int(kv_cache_size))

# Save last_n_tokens_data and tokens_consumed
last_n_tokens_data = llama.last_n_tokens_data.copy()
tokens_consumed = llama.tokens_consumed
###

# Sample 4 tokens
for i in range(4):
    next_token = llama.sample(top_k=40, top_p=1.0, temp=0.0, repeat_penalty=1.0)
    print(llama.detokenize([next_token]).decode("utf-8"), end="", flush=True)
    llama.eval([next_token])

print()
print("---reset---")

### Restore model state
# Restore kv cache
# llama = llama_cpp.Llama(model_path="../models/ggml-model.bin")
assert llama.ctx is not None
llama_cpp.llama_set_kv_cache(llama.ctx, kv_cache_new, kv_cache_size, kv_cache_token_count)

# Restore last_n_tokens_data and tokens_consumed
llama.last_n_tokens_data = last_n_tokens_data
llama.tokens_consumed = tokens_consumed
t3 = time.time()
llama.eval(prompt)
t4 = time.time()
###

# Sample 4 tokens again, this time from the restored state
for i in range(4):
    next_token = llama.sample(top_k=40, top_p=1.0, temp=0.0, repeat_penalty=1.0)
    print(llama.detokenize([next_token]).decode("utf-8"), end="", flush=True)
    llama.eval([next_token])
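
For the original goal of persisting the cache to disk between runs, the ctypes buffer captured above can be copied out to plain bytes and rebuilt later. A minimal sketch, independent of llama.cpp (the size and token count below are stand-ins for kv_cache_size and kv_cache_token_count from the example above):

import ctypes
import pickle

size = 16                        # stand-in for int(kv_cache_size)
buf = (ctypes.c_uint8 * size)()  # stand-in for kv_cache_new

# Serialize: copy the raw bytes out of the ctypes array and pickle them
# together with the token count.
with open("kv_state.bin", "wb") as f:
    pickle.dump({"kv_cache": bytes(buf), "kv_tokens": 4}, f)

# Deserialize: rebuild a ctypes array of the same size from the saved bytes;
# this is the buffer that would be handed back to llama_set_kv_cache.
with open("kv_state.bin", "rb") as f:
    state = pickle.load(f)
restored = (ctypes.c_uint8 * len(state["kv_cache"])).from_buffer_copy(state["kv_cache"])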

@liyanbo2

@abetlen I've used some of the functions like "llama_cpp.llama_get_kv_cache_size" and "llama.last_n_tokens_data.copy()", but it turns out these functions no longer exist. How did this happen?
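
Those low-level KV-cache getters were later removed from llama.cpp in favour of functions that snapshot the whole context state. A hedged sketch of that replacement path, assuming the low-level bindings exposed by the llama_cpp module; the exact names vary by version (older releases use llama_get_state_size / llama_copy_state_data / llama_set_state_data, newer ones llama_state_get_size / llama_state_get_data / llama_state_set_data), and the model path is a placeholder:

import llama_cpp

llama = llama_cpp.Llama(model_path="models/model.gguf")  # placeholder path
llama.eval(llama.tokenize(b"The quick brown fox"))       # fill the KV cache

# Copy the entire context state (KV cache, RNG, logits, ...) into a ctypes buffer.
state_size = llama_cpp.llama_get_state_size(llama.ctx)
state_buf = (llama_cpp.ctypes.c_uint8 * int(state_size))()
llama_cpp.llama_copy_state_data(llama.ctx, state_buf)

# ... generate or otherwise disturb the context here ...

# Restore the snapshot and continue evaluating from the saved position.
llama_cpp.llama_set_state_data(llama.ctx, state_buf)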

@KaiLv69

KaiLv69 commented Sep 2, 2024

llama_get_kv_cache_size

Same question. How can I manipulate the KV cache in the latest version?

@asdkazmi

asdkazmi commented Dec 30, 2024

I'm trying to do the same: how can I get and reuse the KV cache in llama-cpp-python, or at least how can I use a KV cache generated by llama.cpp in its standard format? @abetlen
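
In current llama-cpp-python versions the high-level wrapper exposes this directly; a minimal sketch, assuming the Llama.save_state() and Llama.load_state() methods are available (the model path and prompt are placeholders):

import llama_cpp

llama = llama_cpp.Llama(model_path="models/model.gguf")  # placeholder path

# Evaluate a prompt so there is something in the KV cache to snapshot.
llama.eval(llama.tokenize(b"The quick brown fox"))

state = llama.save_state()   # snapshot of the KV cache and sampling state

# ... generate, branch, or reset the context here ...

llama.load_state(state)      # restore the snapshot and continue from it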
