Prerequisites
- I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new bug or useful enhancement to share.
Motivation
Third-party applications are overwhelmingly slow at subsequent prompt evaluation. Where a follow-up prompt in the examples/server web interface is evaluated in seconds, longer chats in these applications can take several minutes just to begin generating additional text.
I believe there are two separate issues:
- users of the OpenAI-compatible endpoint in examples/server are not taking advantage of the prompt cache (see the sketch after this list)
- users of the llama-cpp-python high level API (including the server it ships with) are not taking advantage of the prompt cache
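To make the first point concrete, here is a minimal sketch of the request pattern these OpenAI-compatible clients typically follow (the local URL, port, and exact payload fields are assumptions on my part, not a claim about any particular app): every turn re-sends the entire chat history, and nothing in this flow tells the server it can reuse the work it already did for the shared prefix.

```python
# Sketch of the usage pattern most OpenAI-compatible clients follow.
# Assumes examples/server is running locally on its default port; the URL
# and payload fields follow the OpenAI chat-completions schema.
import requests

API_URL = "http://localhost:8080/v1/chat/completions"  # assumption: local examples/server
messages = [{"role": "system", "content": "You are a helpful assistant."}]

for user_turn in ["Hello!", "Tell me more.", "And then what?"]:
    messages.append({"role": "user", "content": user_turn})
    # Every turn re-sends the ENTIRE chat history. Unless the server
    # recognizes and reuses the cached prefix, it re-evaluates all of it,
    # so prompt-evaluation time grows with the length of the conversation.
    r = requests.post(API_URL, json={
        "model": "default",  # placeholder; assumed to be ignored by a single-model server
        "messages": messages,
        "max_tokens": 128,
    })
    reply = r.json()["choices"][0]["message"]["content"]
    messages.append({"role": "assistant", "content": reply})
```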
Description
N.B. it is possible that this is only a documentation issue.
Request: provide a well-lit path for consumers of the llama.cpp API and the OpenAI-compatible examples/server endpoint to avoid reprocessing the full chat history on each subsequent prompt evaluation.
I suspect there is a usability or discoverability issue with the llama.cpp APIs that leads to inefficient use of llama.cpp. I've tested many llama.cpp-based apps on Linux and Android (many listed in the README), and all of them struggle with this problem:
- llama-cpp-python[server]
- oobabooga/text-generation-webui
- KoboldCpp
- Mobile-Artificial-Intelligence/maid (using examples/server API)
- ztjhz/BetterChatGPT (using examples/server API)
In the case of text-generation-webui and KoboldCpp, I tested both the built-in (llama-cpp-python-based) inference and their use as API clients for the examples/server endpoint. Both suffer from this problem.
examples/main and examples/server are the only two pieces of software I've tested that handle this well, which makes these two simple examples the most performant way to interact with LLMs.
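For illustration, here is a hedged sketch of how an API client might mirror what the examples/server web interface does: keep the conversation as one growing prompt string against the native /completion endpoint, so consecutive requests share a long common prefix. Whether the running server actually reuses the cached prefix, and whether it requires a cache_prompt field to do so, depends on the build; both are assumptions here rather than documented behavior.

```python
# Sketch of driving examples/server's native /completion endpoint the way
# its own web interface does: one growing prompt string, so consecutive
# requests share a long common prefix that the server can (hopefully) reuse.
import requests

API_URL = "http://localhost:8080/completion"  # assumption: local examples/server
transcript = "### System: You are a helpful assistant.\n"

for user_turn in ["Hello!", "Tell me more.", "And then what?"]:
    transcript += f"### User: {user_turn}\n### Assistant:"
    payload = {
        "prompt": transcript,
        "n_predict": 128,
        "cache_prompt": True,  # assumption: not every build accepts this field
    }
    r = requests.post(API_URL, json=payload)
    reply = r.json()["content"]
    transcript += reply + "\n"
```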
The high-level llama-cpp-python API seems to perpetuate this mistake, which has follow-on effects for downstream consumers such as oobabooga's text-generation-webui: abetlen/llama-cpp-python#181 (don't be fooled by the closed status; the issue persists).
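For what it's worth, the high-level API does appear to ship an opt-in cache (LlamaCache via Llama.set_cache()), but it is off by default and easy to miss, which may be exactly the discoverability problem described above. The following is a sketch under that assumption; the model path and prompt format are placeholders.

```python
# Sketch of opting in to llama-cpp-python's prompt/state cache. It is off
# by default, so every call otherwise re-evaluates the full prompt.
from llama_cpp import Llama, LlamaCache

llm = Llama(model_path="./models/model.gguf", n_ctx=4096)  # placeholder path
llm.set_cache(LlamaCache())  # opt-in: without this, no prompt state is reused

history = "### System: You are a helpful assistant.\n"
for user_turn in ["Hello!", "Tell me more."]:
    history += f"### User: {user_turn}\n### Assistant:"
    out = llm(history, max_tokens=128)  # cached prefix should be reused across calls
    history += out["choices"][0]["text"] + "\n"
```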