[Question] How to use kv cache? #14
Comments
Unfortunately it looks like the …
This is something I'm following, so once it's possible I'll drop an example.
@SagsMug well this is embarrassing, so my C++ example from that thread apparently was correct haha. Here is the general structure:

import time
import llama_cpp

llama = llama_cpp.Llama(model_path="../../models/ggml-alpaca.bin")

prompt = llama.tokenize(b"The quick brown fox")

# Reset the model
llama.reset()

# Feed the prompt
t1 = time.time()
llama.eval(prompt)
t2 = time.time()

### Save model state
assert llama.ctx is not None

# Save kv cache
kv_cache_token_count = llama_cpp.llama_get_kv_cache_token_count(llama.ctx)
kv_cache_size = llama_cpp.llama_get_kv_cache_size(llama.ctx)
kv_cache = llama_cpp.llama_get_kv_cache(llama.ctx)
kv_cache_new = (llama_cpp.c_uint8 * int(kv_cache_size))()
llama_cpp.ctypes.memmove(kv_cache_new, kv_cache, int(kv_cache_size))

# Save last_n_tokens_data and tokens_consumed
last_n_tokens_data = llama.last_n_tokens_data.copy()
tokens_consumed = llama.tokens_consumed
###

# Sample 4 tokens
for i in range(4):
    next_token = llama.sample(top_k=40, top_p=1.0, temp=0.0, repeat_penalty=1.0)
    print(llama.detokenize([next_token]).decode("utf-8"), end="", flush=True)
    llama.eval([next_token])

print()
print("---reset---")

### Restore model state
# Restore kv cache
# llama = llama_cpp.Llama(model_path="../models/ggml-model.bin")
assert llama.ctx is not None
llama_cpp.llama_set_kv_cache(llama.ctx, kv_cache_new, kv_cache_size, kv_cache_token_count)

# Restore last_n_tokens_data and tokens_consumed
llama.last_n_tokens_data = last_n_tokens_data
llama.tokens_consumed = tokens_consumed

t3 = time.time()
llama.eval(prompt)
t4 = time.time()
###

for i in range(4):
    next_token = llama.sample(top_k=40, top_p=1.0, temp=0.0, repeat_penalty=1.0)
    print(llama.detokenize([next_token]).decode("utf-8"), end="", flush=True)
    llama.eval([next_token])
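A note on the copy step above: as far as I can tell, llama_get_kv_cache returns a pointer into the context's own buffer, which is why the bytes are memmove'd into a separately allocated ctypes array before generation continues; without that copy, the "saved" cache would be overwritten by the evals that follow.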
@abetlen I've used some of the functions like "llama_cpp.llama_get_kv_cache_size" and "llama.last_n_tokens_data.copy()", but it turns out that these functions do not exist. How does this happen?
Same question. How can I manipulate the KV cache in the latest version?
I'm trying to do the same. How can I get and reuse the KV cache in llama-cpp-python, or at least how can I use a KV cache generated by llama-cpp in its standard format? @abetlen
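For readers on more recent versions: the kv-cache-only functions from the example above are the ones reported missing, and a minimal sketch of the newer high-level approach, assuming the save_state()/load_state() methods available in later llama-cpp-python releases (not confirmed in this thread, and the API may differ by version), looks like this:

# Minimal sketch, assuming Llama.save_state()/load_state() from later releases.
import llama_cpp

llama = llama_cpp.Llama(model_path="../../models/ggml-alpaca.bin")

prompt = llama.tokenize(b"The quick brown fox")
llama.eval(prompt)

# Snapshot the full model state (kv cache, logits, input tokens, ...)
state = llama.save_state()

# Generate a few tokens, which advances the kv cache
for _ in range(4):
    next_token = llama.sample(top_k=40, top_p=1.0, temp=0.0, repeat_penalty=1.0)
    llama.eval([next_token])

# Roll back to the snapshot taken right after the prompt
llama.load_state(state)

The snapshot replaces the manual bookkeeping of kv_cache, last_n_tokens_data, and tokens_consumed from the earlier example, since the state object carries all of it together.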
Hello!
I have been trying to test the new kv cache loading and ran into an issue: it seems to segfault when running llama_eval.
To save the current cache I do:
Loading:
But running llama_cpp.llama_eval afterwards results in a segfault.
llama-cpp-python version: 0.1.16
How do I fix this?
Thanks
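One thing that can produce this kind of segfault is restoring from a buffer whose size does not match what the context expects. Below is a minimal sketch of the full-state round trip using the llama_get_state_size / llama_copy_state_data / llama_set_state_data bindings; these names are an assumption based on later llama.cpp releases and are not the original poster's code, so treat it as illustrative only:

# Minimal sketch, assuming the low-level full-state bindings from later versions
# (llama_get_state_size / llama_copy_state_data / llama_set_state_data).
import llama_cpp

llama = llama_cpp.Llama(model_path="../../models/ggml-alpaca.bin")
assert llama.ctx is not None

prompt = llama.tokenize(b"The quick brown fox")
llama.eval(prompt)

# Allocate a buffer of exactly the size the context reports, then copy into it
state_size = llama_cpp.llama_get_state_size(llama.ctx)
state_buf = (llama_cpp.c_uint8 * int(state_size))()
llama_cpp.llama_copy_state_data(llama.ctx, state_buf)

# ... generate tokens here, advancing the kv cache ...

# Restore from the same buffer; an undersized or stale buffer here is a
# common way to end up reading past the end and crashing on the next eval
llama_cpp.llama_set_state_data(llama.ctx, state_buf)
next_token = llama.sample(top_k=40, top_p=1.0, temp=0.0, repeat_penalty=1.0)
llama.eval([next_token])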