
server: Cache is not reused between completions by default. #3738


Closed
shibe2 opened this issue Oct 23, 2023 · 4 comments

shibe2 (Contributor) commented Oct 23, 2023

Expected Behavior

The cache is reused, and only the part of the prompt starting from the first mismatched token needs to be processed.
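
For example (an illustrative exchange, not taken from the log below): suppose the first request is

POST /completion {"prompt":"USER: Hello\nASSISTANT:"}

and, after the server has generated a reply, the follow-up request sends the whole conversation so far plus a new user turn:

POST /completion {"prompt":"USER: Hello\nASSISTANT: Hi, how can I help?\nUSER: What is llama.cpp?\nASSISTANT:"}

Since the first prompt and the generated reply are already in the KV cache, only the tokens after that shared prefix (roughly, the new user turn) should need to be evaluated, not the entire prompt.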

Current Behavior

I think that after #3677 the server stopped reusing the cache and now processes the whole prompt on each completion request.

Environment and Context

Linux, CLBlast build.

Steps to Reproduce

command: server -c 4096 -m xwin-lm-70b-v0.1.Q6_K.gguf

llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 1280.00 MB
llama_new_context_with_model: compute buffer total size = 574.13 MB
Available slots:
-> Slot 0 - max context: 4096

all slots are idle and system prompt is empty, clear the KV cache

request: POST /completion {"prompt":"A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n\nUSER: Hello, can you help me?\nASSISTANT:"}

slot 0 is processing [task id: 0]
slot 0 : kv cache rm - [0, end)

print_timings: prompt eval time = 27268.61 ms / 47 tokens ( 580.18 ms per token, 1.72 tokens per second)
print_timings: eval time = 59156.03 ms / 42 runs ( 1408.48 ms per token, 0.71 tokens per second)
print_timings: total time = 86424.64 ms
slot 0 released (90 tokens in cache)

response:

{"content":" Hello! I'd be happy to help you with any questions or topics you have in mind. Please feel free to ask, and I'll do my best to provide you with useful information and assistance.","generation_settings":{"frequency_penalty":0.0,"grammar":"","ignore_eos":false,"logit_bias":[],"mirostat":0,"mirostat_eta":0.10000000149011612,"mirostat_tau":5.0,"model":"xwin-lm-70b-v0.1.Q6_K.gguf","n_ctx":4096,"n_keep":0,"n_predict":-1,"n_probs":0,"penalize_nl":true,"presence_penalty":0.0,"repeat_last_n":64,"repeat_penalty":1.100000023841858,"seed":4294967295,"stop":[],"stream":false,"temp":0.800000011920929,"tfs_z":1.0,"top_k":40,"top_p":0.949999988079071,"typical_p":1.0},"model":"xwin-lm-70b-v0.1.Q6_K.gguf","prompt":"A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n\nUSER: Hello, can you help me?\nASSISTANT:","slot_id":0,"stop":true,"stopped_eos":true,"stopped_limit":false,"stopped_word":false,"stopping_word":"","timings":{"predicted_ms":59156.026,"predicted_n":42,"predicted_per_second":0.7099868405629547,"predicted_per_token_ms":1408.4768095238094,"prompt_ms":27268.61,"prompt_n":47,"prompt_per_second":1.7235935385045293,"prompt_per_token_ms":580.1831914893617},"tokens_cached":89,"tokens_evaluated":47,"tokens_predicted":42,"truncated":false}

At this point the original prompt as well as the generated text should be in the cache.

Making the exact same request as before. The prompt should match the first half of the cache.

slot 0 is processing [task id: 1]
slot 0 : kv cache rm - [0, end)

print_timings: prompt eval time = 18216.41 ms / 47 tokens ( 387.58 ms per token, 2.58 tokens per second)
print_timings: eval time = 59435.15 ms / 42 runs ( 1415.12 ms per token, 0.71 tokens per second)
print_timings: total time = 77651.56 ms
slot 0 released (90 tokens in cache)

response:

{"content":" Hello! I'd be happy to help you with any questions or topics you have in mind. Please feel free to ask, and I'll do my best to provide you with useful information and guidance.","generation_settings":{"frequency_penalty":0.0,"grammar":"","ignore_eos":false,"logit_bias":[],"mirostat":0,"mirostat_eta":0.10000000149011612,"mirostat_tau":5.0,"model":"xwin-lm-70b-v0.1.Q6_K.gguf","n_ctx":4096,"n_keep":0,"n_predict":-1,"n_probs":0,"penalize_nl":true,"presence_penalty":0.0,"repeat_last_n":64,"repeat_penalty":1.100000023841858,"seed":4294967295,"stop":[],"stream":false,"temp":0.800000011920929,"tfs_z":1.0,"top_k":40,"top_p":0.949999988079071,"typical_p":1.0},"model":"xwin-lm-70b-v0.1.Q6_K.gguf","prompt":"A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n\nUSER: Hello, can you help me?\nASSISTANT:","slot_id":0,"stop":true,"stopped_eos":true,"stopped_limit":false,"stopped_word":false,"stopping_word":"","timings":{"predicted_ms":59435.148,"predicted_n":42,"predicted_per_second":0.7066525686114217,"predicted_per_token_ms":1415.1225714285715,"prompt_ms":18216.411,"prompt_n":47,"prompt_per_second":2.580091105761722,"prompt_per_token_ms":387.58321276595746},"tokens_cached":89,"tokens_evaluated":47,"tokens_predicted":42,"truncated":false}

It erases the whole cache and processes all 47 request tokens again.

ggerganov (Member) commented:

You have to pass cache_prompt: true now:

https://github.com/ggerganov/llama.cpp/blob/96981f37b1e3f450d9e63e571514217bf60f0a7f/examples/server/public/index.html#L232
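
For example, a minimal request with the new option might look like this (assuming the server's default host and port of 127.0.0.1:8080; the prompt is shortened here for brevity):

curl http://127.0.0.1:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "A chat between a curious user and an artificial intelligence assistant. ...\nUSER: Hello, can you help me?\nASSISTANT:", "cache_prompt": true}'

If the cache is being reused, repeating such a request should report far fewer prompt tokens in print_timings / tokens_evaluated than the first one.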

shibe2 (Contributor, Author) commented Oct 23, 2023

Oh, that worked. I looked in the documentation but missed the new option because slot_id, cache_prompt, and system_prompt are listed under Result rather than under Options.

shibe2 changed the title from "server: Cache is not reused between completions." to "server: Cache is not reused between completions by default." on Oct 23, 2023
whoreson (Contributor) commented Jan 8, 2024

I can't get this to work. Even when I hardcode the default to true in server.cpp, the second run (with the default "web chat" interface) is an order of magnitude slower in both prompt processing and prediction. It's practically unusable. What gives?

Actually, it works with 81bc921, but doesn't work with HEAD.

github-actions bot added the stale label Mar 19, 2024
github-actions bot commented Apr 4, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed Apr 4, 2024