Prerequisites
- I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new bug or useful enhancement to share.
Motivation
Third-party applications are overwhelmingly slow at subsequent prompt evaluation. Where a follow-up prompt in the examples/server web interface is evaluated in seconds, longer chats in these applications can take several minutes just to begin generating additional text.
I believe there are two separate issues:
- users of the OpenAI-compatible endpoint in examples/server are not taking advantage of the prompt cache (see the sketch after this list)
- users of the llama-cpp-python high level API (including the server it ships with) are not taking advantage of the prompt cache
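To make the first point concrete, here is a minimal sketch of the request pattern these OpenAI-compatible clients typically follow (the local URL, port, and exact payload fields are assumptions on my part, not a claim about any particular app): every turn re-sends the entire chat history, and nothing in this flow tells the server it can reuse the work it already did for the shared prefix.

```python
# Sketch of the usage pattern most OpenAI-compatible clients follow.
# Assumes examples/server is running locally on its default port; the URL
# and payload fields follow the OpenAI chat-completions schema.
import requests

API_URL = "http://localhost:8080/v1/chat/completions"  # assumption: local examples/server
messages = [{"role": "system", "content": "You are a helpful assistant."}]

for user_turn in ["Hello!", "Tell me more.", "And then what?"]:
    messages.append({"role": "user", "content": user_turn})
    # Every turn re-sends the ENTIRE chat history. Unless the server
    # recognizes and reuses the cached prefix, it re-evaluates all of it,
    # so prompt-evaluation time grows with the length of the conversation.
    r = requests.post(API_URL, json={
        "model": "default",  # placeholder; assumed to be ignored by a single-model server
        "messages": messages,
        "max_tokens": 128,
    })
    reply = r.json()["choices"][0]["message"]["content"]
    messages.append({"role": "assistant", "content": reply})
```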
Description
N.B. it is possible that this is only a documentation issue.
Request: provide a well-lit path for consumers of the llama.cpp API and the OpenAI-compatible examples/server endpoint to avoid reprocessing the full chat history on each subsequent prompt evaluation.
I suspect there is a usability or discoverability issue with the llama.cpp APIs that leads to inefficient use of llama.cpp. I've tested many llama.cpp-based apps on Linux and Android (many listed in the README), and all of them struggle with this problem:
- llama-cpp-python[server]
- oobabooga/text-generation-webui
- KoboldCpp
- Mobile-Artificial-Intelligence/maid (using examples/server API)
- ztjhz/BetterChatGPT (using examples/server API)
In the case of text-generation-webui and KoboldCpp, I tested both the built-in (llama-cpp-python-based) inference and their use as API clients for the examples/server endpoint. Both suffer from this problem.
examples/main and examples/server are the only two pieces of software I've tested that handle this well, which makes these two simple examples the most performant way to interact with LLMs.
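For illustration, here is a hedged sketch of how an API client might mirror what the examples/server web interface does: keep the conversation as one growing prompt string against the native /completion endpoint, so consecutive requests share a long common prefix. Whether the running server actually reuses the cached prefix, and whether it requires a cache_prompt field to do so, depends on the build; both are assumptions here rather than documented behavior.

```python
# Sketch of driving examples/server's native /completion endpoint the way
# its own web interface does: one growing prompt string, so consecutive
# requests share a long common prefix that the server can (hopefully) reuse.
import requests

API_URL = "http://localhost:8080/completion"  # assumption: local examples/server
transcript = "### System: You are a helpful assistant.\n"

for user_turn in ["Hello!", "Tell me more.", "And then what?"]:
    transcript += f"### User: {user_turn}\n### Assistant:"
    payload = {
        "prompt": transcript,
        "n_predict": 128,
        "cache_prompt": True,  # assumption: not every build accepts this field
    }
    r = requests.post(API_URL, json=payload)
    reply = r.json()["content"]
    transcript += reply + "\n"
```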
The high-level llama-cpp-python API seems to perpetuate this mistake, which has follow-on effects for downstream consumers such as oobabooga's text-generation-webui: abetlen/llama-cpp-python#181 (don't be fooled by the closed status; the issue persists).
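For what it's worth, the high-level API does appear to ship an opt-in cache (LlamaCache via Llama.set_cache()), but it is off by default and easy to miss, which may be exactly the discoverability problem described above. The following is a sketch under that assumption; the model path and prompt format are placeholders.

```python
# Sketch of opting in to llama-cpp-python's prompt/state cache. It is off
# by default, so every call otherwise re-evaluates the full prompt.
from llama_cpp import Llama, LlamaCache

llm = Llama(model_path="./models/model.gguf", n_ctx=4096)  # placeholder path
llm.set_cache(LlamaCache())  # opt-in: without this, no prompt state is reused

history = "### System: You are a helpful assistant.\n"
for user_turn in ["Hello!", "Tell me more."]:
    history += f"### User: {user_turn}\n### Assistant:"
    out = llm(history, max_tokens=128)  # cached prefix should be reused across calls
    history += out["choices"][0]["text"] + "\n"
```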